Multimodal Representation Learning for Semantic Image Retrieval

A systematic investigation of design choices in training contrastive vision-language models