Zero-shot cross-modal retrieval (ZS-CMR) performs cross-modal retrieval in a setting where the test categories are disjoint from the training categories. It borrows its intuition from zero-shot learning, which aims to transfer knowledge inferred during training on seen classes to testing on unseen classes, mimicking the real-world scenario in which new object categories continuously enter multimedia corpora. Unlike existing ZS-CMR approaches, which use generative adversarial networks (GANs) to generate additional data, we propose Inter-Modality Fusion based Attention (IMFA) and a framework, ZS INN FUSE (Zero-Shot cross-modal retrieval using INNer product with image-text FUSEd). IMFA exploits the rich semantics of textual data as guidance to infer additional knowledge during training. This is achieved by generating attention weights through the fusion of the image and text modalities, focusing the model on the important regions of an image. We carefully construct zero-shot splits of the large-scale MSCOCO and Flickr30k datasets for our experiments. The results show that our method improves over the ZS-CMR baseline and a self-attention mechanism, demonstrating the effectiveness of inter-modality fusion in the zero-shot scenario.
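The core idea of fusing the two modalities to produce attention weights over image regions can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes pre-extracted region features and a single text embedding in a shared space, fuses them element-wise, and scores each region by an inner product before a softmax; the function name and shapes are hypothetical.

```python
import numpy as np

def imfa_attention(img_regions: np.ndarray, text_feat: np.ndarray):
    """Illustrative inter-modality fusion attention (hypothetical sketch).

    img_regions: (R, d) array of R image-region features.
    text_feat:   (d,) text embedding in the same feature space.
    Returns (weights over regions, text-attended image feature).
    """
    # Fuse modalities element-wise, then reduce each fused region
    # to a scalar score (equivalent to an inner product per region).
    fused = img_regions * text_feat          # (R, d)
    scores = fused.sum(axis=1)               # (R,)

    # Softmax over regions yields the attention weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()

    # Weighted sum of region features: the attended image representation.
    attended = weights @ img_regions         # (d,)
    return weights, attended
```

In retrieval, such a text-guided attended feature could replace a plain pooled image feature before computing the image-text similarity.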