University of Twente Student Theses


MaskCLIP: masking improves transfer for visual models trained with natural language supervision

Weers, F.R.T. (2022) MaskCLIP: masking improves transfer for visual models trained with natural language supervision.

Abstract: The machine learning community has long aimed to train models that can be applied to a variety of tasks, without the need to design and train a model for each task individually. Foundation models are trained on large datasets with a pre-training task independent of any specific target task, and they have been shown to perform very well on unseen tasks. Contrastive learning has proven to be a good pre-training task for a foundation model: the model learns to encode images and their matching texts close together in a shared embedding space. Because of the design of the contrastive loss, the model can be applied to an unseen task in a zero-shot setting by comparing pairs of similar or dissimilar image/text samples. In this case, we compute the similarity between encoded candidate text labels and an encoded image; the most similar text-image pair tells us what is visible in the image. Large-scale vision-and-language models trained with contrastive learning have shown strong whole-image classification performance. However, they lack attention to detail and the capability to match text to a region of the image (visual grounding). Masked patch prediction is another pre-training task, in which a high ratio of the input is masked and the model must predict the original raw values of the masked areas. This task has been shown to yield very good visual grounding. We propose MaskCLIP, an architecture that incorporates a per-sample masking task into an existing technique for learning visual models with natural language supervision via a contrastive loss. We demonstrate that combining the two tasks in a multi-task setting, where per-patch similarity scores are used to improve the masking strategy, substantially improves the quality of the learned image representations for downstream structured prediction and visual Q&A tasks.
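The zero-shot classification scheme described above can be sketched in a few lines. This is a toy illustration, not the thesis's implementation: the hardcoded vectors stand in for the outputs of trained image and text encoders, which a real CLIP-style model would produce.

```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit sphere so dot products equal cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy embeddings standing in for the outputs of trained image/text towers.
image_embedding = normalize(np.array([0.9, 0.1, 0.2]))

labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embeddings = normalize(np.array([
    [1.0, 0.0, 0.1],   # close to the image embedding above
    [0.0, 1.0, 0.0],
    [0.1, 0.2, 1.0],
]))

# Zero-shot classification: score each candidate label against the image;
# the most similar text-image pair is the prediction.
scores = text_embeddings @ image_embedding
prediction = labels[int(np.argmax(scores))]
print(prediction)  # -> "a photo of a dog"
```

The candidate labels can be swapped for any target task's class names, which is what makes the contrastively trained model applicable to unseen tasks without further training.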
Our results indicate that combining a contrastive and a masked loss improves the effectiveness of transfer learning to downstream tasks beyond whole-image classification, in a data-efficient manner, when training on large and relatively noisy paired image-and-text datasets crawled from the web.
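One plausible reading of "using per-patch similarity scores to improve the masking strategy" is sketched below: given a similarity score between each image patch and the paired caption, mask a high fraction of patches, biased toward the most text-relevant ones. The function and its parameters are illustrative assumptions, not the thesis's exact algorithm.

```python
import numpy as np

def choose_masked_patches(patch_scores, mask_ratio=0.75):
    # Illustrative sketch: rank patches by similarity to the paired text and
    # mask the highest-scoring ones, so the model must reconstruct the regions
    # the caption actually describes.
    num_masked = int(round(mask_ratio * len(patch_scores)))
    order = np.argsort(patch_scores)[::-1]  # descending by similarity
    return np.sort(order[:num_masked])      # indices of patches to mask

scores = np.array([0.1, 0.8, 0.3, 0.9, 0.2, 0.7, 0.4, 0.05])
masked = choose_masked_patches(scores, mask_ratio=0.5)
print(masked)  # -> [1 3 5 6]
```

In practice such a strategy would be combined with the usual high masking ratio of masked patch prediction (e.g. around 75%), with the similarity scores supplied by the contrastive branch of the multi-task model.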
Item Type: Essay (Master)
Apple Inc., Cupertino, CA, United States
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item:

