University of Twente Student Theses


Surgical Video Triplet Recognition with Multimodal Learning

Wu, Y. (2025) Surgical Video Triplet Recognition with Multimodal Learning.

This is the latest version of this item.

Abstract: Surgical action triplet recognition is crucial for understanding surgical workflows. This work presents a novel multimodal approach that leverages the complementary strengths of RGB features and segmentation information to improve triplet recognition accuracy. Our key innovation lies in the integration of the Segment Anything Model (SAM) with a CAM-guided prompting mechanism, coupled with a gated cross-attention architecture for effective modality fusion. The system not only achieves improved triplet recognition performance but also demonstrates capability in weakly supervised instrument and anatomy segmentation. Through extensive experimentation on the CholecT45 dataset, we show that our fusion approach with selective information flow outperforms traditional concatenation-based methods. We also provide insights into the limitations of certain modalities, such as optical flow in low-frame-rate scenarios, and the challenges of using generic vision-language models in medical contexts. Our approach offers practical benefits for surgical workflow analysis while reducing the annotation burden through its dual-use nature.
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: https://purl.utwente.nl/essays/105278

 
