University of Twente Student Theses

Audio-visual Correlation from Cross-modal Attention in Self-supervised Transformers on Videos of Musical Performances

Bosma, H.R. (2022) Audio-visual Correlation from Cross-modal Attention in Self-supervised Transformers on Videos of Musical Performances.

Abstract:	Attention maps from transformer-based models using the self-attention mechanism are highly interpretable. Works in the vision-and-language domain train general models on large data sets using self-supervised methods, leveraging such attention values between modalities for several learning tasks. Within the audio-visual domain, we see similar approaches, but these are often specialized towards one type of data set, like human speech, and require labeled data sets. This work introduces a more general and versatile audio-visual training framework, based on the approaches of the vision-and-language works. This framework can be applied to many different audio-visual learning scenarios. We apply it to the task of audio-source localization. Our implementation uses an audio-visual model based on two separate convolution-based embedding stages, one for audio and one for video, followed by a single transformer-based encoder stage. This model is trained with the self-supervised proxy task of multi-modal alignment. Our new MUSIC-200k data set of 192 007 videos of musical performances (https://github.com/HesselBosma/MUSIC200k) was used for training and validation. Visual inspection of the source-localization results shows that the framework is valid for this particular learning task. These results suggest broader applicability of the framework, for example to other learning tasks. The framework has some promising benefits, such as more general applicability, zero-shot learning capability, and requiring only unlabeled training data. However, the audio-source localization performance seems to be limited. Opportunities to increase performance on audio-visual learning tasks have been identified, but these measures were not empirically tested in this work. Furthermore, due to time constraints, only a limited visual evaluation was performed instead of a more informative numerical evaluation. Thus, no direct, accurate comparison can be made with the performance of other methods.
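
The abstract describes reading audio-source locations off cross-modal attention values. The following is a minimal sketch of that idea, not the thesis implementation. It assumes that audio tokens act as queries and visual patch tokens as keys, that the learned query/key projections are identity, and that attention is normalized over visual patches only. The function name `audio_to_visual_attention` and the parameters `grid_h` and `grid_w` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def audio_to_visual_attention(audio_tokens, visual_tokens, grid_h, grid_w):
    """Return a grid_h x grid_w heat map of attention from audio to visual patches.

    Assumptions (not taken from the thesis): identity query/key projections,
    scaled dot-product scores, and a softmax restricted to the visual patches.
    The heat map is the average attention over all audio tokens.
    """
    n_patches = grid_h * grid_w
    if len(visual_tokens) != n_patches:
        raise ValueError("visual_tokens must contain grid_h * grid_w patches")
    dim = len(visual_tokens[0])
    heat = [0.0] * n_patches
    for query in audio_tokens:
        # Scaled dot product between one audio token and every visual patch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visual_tokens
        ]
        weights = softmax(scores)
        for i, w in enumerate(weights):
            heat[i] += w / len(audio_tokens)
    # Reshape the flat patch list (row-major order) into a 2-D grid.
    return [heat[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


# Toy usage: one audio token and a 2 x 2 grid of visual patches, where only
# the bottom-right patch is aligned with the audio token.
audio = [[1.0, 0.0]]
visual = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0], [4.0, 0.0]]
heat_map = audio_to_visual_attention(audio, visual, grid_h=2, grid_w=2)
```

In this toy input the largest value of `heat_map` falls on the bottom-right cell, and the cells sum to 1 because each audio token's attention is a probability distribution over the patches. In a trained model the tokens would come from the embedding stages and the scores from learned projections inside the encoder.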
Item Type:	Essay (Master)
Faculty:	EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:	54 computer science
Programme:	Computer Science MSc (60300)
Link to this item:	https://purl.utwente.nl/essays/93211
