GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks
Onkar Susladkar
Gayatri Deshmukh
Dhruv Makwana
Sparsh Mittal
R Sai Chandra Teja
Rekha Singhal
[Paper]
[GitHub]

Abstract

In “vision and language” problems, multimodal inputs are processed simultaneously to build a joint image-text embedding for combined visual and textual understanding. In this paper, we argue that multimodal learning must account for the differences in feature space and distribution between modalities. We address this problem through a deep-learning and generative-modeling approach. We introduce a novel network, GAFNet (Global Attention Fourier Net), which learns through large-scale pre-training over three image-text datasets (COCO, SBU, and CC-3M) to achieve high performance on downstream vision-and-language tasks. We propose a GAF (Global Attention Fourier) module, which integrates multiple modalities into a single latent space. The GAF module is agnostic to the type of modality, and it allows combining shared representations at each stage. How one models the relationships between modalities directly shapes the network's design. In contrast to previous research, our work treats visual grounding as a pretrainable and transferable quality rather than something that must be trained from scratch. We show that GAFNet is a versatile network that can be used for a wide range of downstream tasks. Experimental results demonstrate that our technique achieves state-of-the-art performance on multimodal classification on the CrisisMMD dataset and image generation on the COCO dataset. For image-text retrieval, our technique achieves competitive performance.
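The abstract describes the GAF module as fusing image and text tokens into one latent space via global Fourier mixing and self-attention, but does not give its exact formulation here. Below is a minimal illustrative sketch of that idea, not the paper's implementation: the function name `gaf_fuse`, the FNet-style real-part FFT mixing, and the unprojected single-head attention are all assumptions made for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gaf_fuse(img_tokens, txt_tokens):
    """Hypothetical sketch of a Fourier-attention fusion step.

    Concatenates modality tokens into one sequence, mixes them
    globally with a 2D FFT (keeping the real part), then applies
    plain self-attention over the mixed sequence. Learned
    projections are omitted for brevity.
    """
    # (N_img + N_txt, d) shared sequence over both modalities
    x = np.concatenate([img_tokens, txt_tokens], axis=0)
    # Global Fourier mixing across token and channel dimensions
    x = np.fft.fft2(x).real
    # Single-head scaled dot-product self-attention
    d = x.shape[-1]
    attn = softmax(x @ x.T / np.sqrt(d))
    return attn @ x  # fused shared representation, (N_img + N_txt, d)
```

Because the fusion operates on a concatenated token sequence rather than on modality-specific branches, the same code path serves any pairing of modalities, which matches the abstract's claim that the module is modality-agnostic.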


Code


[GitHub]


Paper and Supplementary Material

Onkar Susladkar, Gayatri Deshmukh, Dhruv Makwana, Sparsh Mittal, R Sai Chandra Teja, Rekha Singhal
GAFNet: A Global Fourier Self Attention Based Novel Network for multi-modal downstream tasks
In WACV, 2022.
(hosted on ResearchGate)


[Bibtex]


Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.