University of Twente Student Theses
Exploring the Efficacy of LLMs for Expanding the CVE Dataset to Assist in Machine Learning
Spil, M. (2025) Exploring the Efficacy of LLMs for Expanding the CVE Dataset to Assist in Machine Learning.
Abstract: | The Common Vulnerabilities and Exposures (CVE) system is a critical resource for cybersecurity research and tooling, yet its mapping to the Common Weakness Enumeration (CWE) framework is highly imbalanced. This uneven distribution of CVEs across CWE categories poses a significant challenge for machine learning applications, particularly those tasked with automatically classifying vulnerabilities by weakness type. In this paper, we explore the use of Large Language Models (LLMs), enhanced with retrieval-augmented generation (RAG), to synthetically generate CVE-like entries for underrepresented CWE classes. We present a methodology for augmenting the training dataset with realistic, grounded synthetic examples and evaluate the impact on classification performance using a BERT-based neural network. Experimental results show that LLM-generated data improves macro-averaged recall and F1-score compared to both baseline and oversampling approaches. Our findings highlight the potential of generative models to mitigate class imbalance in vulnerability datasets and support more equitable and accurate machine learning systems in cybersecurity. |
Item Type: | Essay (Bachelor) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Computer Science BSc (56964) |
Link to this item: | https://purl.utwente.nl/essays/107223 |