Video Text Matching: A Deep Learning Model for Video to Descriptive Text Matching
Anazo, M.A.C. (2024)
The goal is to develop a functional multimodal model that retrieves short videos matching a text description provided by a user and, conversely, returns a textual description for a user-provided video. To this end, textual descriptions and video clips are processed and mapped into a feature space shared by both modalities, which is trained with contrastive learning so that the two data types can be matched directly. The model is trained and tested on the Microsoft Research Video Description Corpus (MSVD).
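The idea of matching in a shared feature space can be illustrated with a minimal sketch. It assumes that each video and each description has already been encoded as a fixed-length vector, and it uses a symmetric, temperature-scaled cross-entropy over cosine similarities, which is one common form of contrastive objective; the abstract does not state which loss, encoders, or temperature the thesis uses. All function names and the temperature value of 0.07 are hypothetical and chosen for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_matrix(video_embs, text_embs):
    """Cosine similarity between every video (rows) and every text (columns)."""
    videos = [l2_normalize(v) for v in video_embs]
    texts = [l2_normalize(t) for t in text_embs]
    return [[sum(a * b for a, b in zip(v, t)) for t in texts] for v in videos]


def cross_entropy_diagonal(sim, temperature):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the maximum for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(sim)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Average of the video-to-text and text-to-video losses (assumed objective).

    Pair i (video i, text i) is treated as the positive; every other
    combination in the batch is treated as a negative.
    """
    sim = similarity_matrix(video_embs, text_embs)
    sim_transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (
        cross_entropy_diagonal(sim, temperature)
        + cross_entropy_diagonal(sim_transposed, temperature)
    )


def retrieve(query_emb, candidate_embs):
    """Return the index of the candidate most similar to the query."""
    sims = similarity_matrix([query_emb], candidate_embs)[0]
    return max(range(len(sims)), key=lambda j: sims[j])
```

Because both modalities live in the same space, the same `retrieve` function serves both directions: a text embedding can be used to query a list of video embeddings, and a video embedding can be used to query a list of description embeddings.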
Anazo_BA_EEMCS.pdf