University of Twente Student Theses
Towards a Standard Fine-Grained Part-of-Speech Tagging for Northern Kurdish
Morad, Peshmerge (2023) Towards a Standard Fine-Grained Part-of-Speech Tagging for Northern Kurdish.
Abstract: | In the growing domain of Natural Language Processing (NLP), low-resource languages like Northern Kurdish remain largely unexplored because they lack the resources needed to take part in this growth. In particular, the tasks of Part-of-Speech (POS) tagging and tokenization for Northern Kurdish are still insufficiently addressed by the research community. In this study, we aim to bridge this gap by evaluating a range of statistical and neural network-based POS models specifically tailored for Northern Kurdish. Leveraging limited but valuable datasets, including the revisited Universal Dependency Kurmanji treebank and a novel, manually annotated and tokenized gold-standard dataset of 136 sentences (2,937 tokens), we set out to establish both a baseline and a state-of-the-art POS tagger for Northern Kurdish. Challenges such as data scarcity, the absence of dedicated research, the correct representation of linguistic features, and the choice of tokenization method are addressed through a multifaceted approach involving data refinement, manual annotation, and experimentation with multiple tokenization methods. We propose a POS tagging pipeline for Northern Kurdish in which various training and test datasets, POS tagging models, and tokenization methods can be integrated, used, and evaluated. We evaluate our proposed POS tagging models on the novel gold-standard dataset; our transformer-based model outperforms the traditional statistical models, achieving an accuracy of 0.87 and a macro-averaged F1 score of 0.77. Among the traditional statistical models, however, the CRF model achieves competitive results: 0.84 accuracy and 0.74 macro-averaged F1. This study offers insights into the linguistic peculiarities of Northern Kurdish that affect the performance of tokenization and POS tagging methods, and it lays out a road map for future work, including dataset expansion and adaptability tests for other Kurdish dialects. |
Item Type: | Essay (Master) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Computer Science MSc (60300) |
Link to this item: | https://purl.utwente.nl/essays/97597 |