University of Twente Student Theses


Exploring the Efficacy of LLMs for Expanding the CVE Dataset to Assist in Machine Learning

Spil, M. (2025) Exploring the Efficacy of LLMs for Expanding the CVE Dataset to Assist in Machine Learning.

Abstract: The Common Vulnerabilities and Exposures (CVE) system is a critical resource for cybersecurity research and tooling, yet its mapping to the Common Weakness Enumeration (CWE) framework is highly imbalanced. This uneven distribution of CVEs across CWE categories poses a significant challenge for machine learning applications, particularly those tasked with automatically classifying vulnerabilities by weakness type. In this paper, we explore the use of Large Language Models (LLMs), enhanced with retrieval-augmented generation (RAG), to synthetically generate CVE-like entries for underrepresented CWE classes. We present a methodology for augmenting the training dataset with realistic, grounded synthetic examples and evaluate the impact on classification performance using a BERT-based neural network. Experimental results show that LLM-generated data improves macro-averaged recall and F1-score compared to both baseline and oversampling approaches. Our findings highlight the potential of generative models to mitigate class imbalance in vulnerability datasets and support more equitable and accurate machine learning systems in cybersecurity.
Item Type: Essay (Bachelor)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science BSc (56964)
Link to this item: https://purl.utwente.nl/essays/107223