University of Twente Student Theses
Fine-Tuning Pre-trained End-To-End Automatic Speech Recognition Models to Incorporate the Transcription of Laughter
Spink, S.J. (2024) Fine-Tuning Pre-trained End-To-End Automatic Speech Recognition Models to Incorporate the Transcription of Laughter.
Abstract: | This thesis aims to enhance Automatic Speech Recognition (ASR) by incorporating laughter detection, thereby broadening its applicability to more realistic real-world scenarios. Among the various ASR models available, pre-trained End-To-End models are particularly promising; these models can be fine-tuned on relatively little data. Two models were selected for fine-tuning and comparison: Whisper, a popular and high-performance model, and HuBERT, which emphasises phoneme sounds. Two datasets that include spontaneous speech and laughter annotations, the AMI corpus and Switchboard, were pre-processed and normalised. The models were then fine-tuned on these datasets and evaluated using the Word Error Rate (WER) for ASR performance and the F1-score, recall and precision for laughter detection performance. The results indicated that the Whisper model performed best on the Switchboard dataset, achieving the highest F1-score (0.901) and the corresponding lowest WER (0.161). On the AMI dataset, the results were more ambiguous. Neither model performed well enough for application on noisy datasets like AMI, as both had an F1-score lower than 0.6. Still, HuBERT achieved the highest F1-score for laughter detection at 0.531. Whisper demonstrated a lower WER\_L (i.e. the word error rate with "laughter" annotations counted as words) of 0.304, a WER of 0.311 and a significantly higher precision of 0.949 (versus 0.785 for HuBERT), which is often critical for practical applications. Therefore, overall, Whisper is identified as the best-performing model for integrating laughter into ASR, particularly in applications focused on identifying laughter events with few false positives. |
Item Type: | Essay (Master) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Interaction Technology MSc (60030) |
Link to this item: | https://purl.utwente.nl/essays/104322 |
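The evaluation metrics named in the abstract (WER, WER\_L, and precision, recall and F1-score for laughter) can be illustrated with a minimal Python sketch. It is not the thesis's evaluation code. The `[laughter]` token, the function names, and the per-utterance count matching used to score laughter events are all assumptions made for illustration; the thesis may tokenise and align laughter differently.

```python
from typing import List, Tuple

LAUGH = "[laughter]"  # assumed token; the thesis may use a different marker


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(ref: str, hyp: str, keep_laughter: bool = False) -> float:
    """WER when keep_laughter is False; WER_L (laughter counted as a word) when True."""
    r, h = ref.split(), hyp.split()
    if not keep_laughter:
        r = [w for w in r if w != LAUGH]
        h = [w for w in h if w != LAUGH]
    if not r:
        raise ValueError("reference has no words to score")
    return edit_distance(r, h) / len(r)


def laughter_prf(pairs: List[Tuple[str, str]]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for laughter tokens.

    Assumption: within each utterance, a predicted laughter token counts as
    a true positive if the reference still has an unmatched laughter token.
    Position in the utterance is ignored.
    """
    tp = fp = fn = 0
    for ref, hyp in pairs:
        n_ref = ref.split().count(LAUGH)
        n_hyp = hyp.split().count(LAUGH)
        matched = min(n_ref, n_hyp)
        tp += matched
        fp += n_hyp - matched
        fn += n_ref - matched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    ref = "yeah [laughter] that is funny"
    hyp = "yeah that was funny [laughter]"
    print(wer(ref, hyp), wer(ref, hyp, keep_laughter=True))
    print(laughter_prf([(ref, hyp)]))
```

High precision with lower recall, as reported for Whisper on AMI, corresponds here to few false positives (`fp`) but more missed laughter events (`fn`).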