University of Twente Student Theses
Text-based classification of websites using self-hosted Large Language Models : An accuracy and efficiency analysis
Sava, D. (2024) Text-based classification of websites using self-hosted Large Language Models : An accuracy and efficiency analysis.
Abstract: | Website categorization is essential for applications such as content filtering, targeted advertising, and web analytics. However, traditional approaches struggle to keep up with the internet’s rapid growth and changing nature. This research explores the potential of open-source large language models (LLMs) as a more efficient and accurate solution for website categorization. By leveraging the knowledge LLMs acquire through training on large amounts of web data, the aim is to develop an approach that reduces the reliance on manually labelled datasets and adapts to the dynamic internet landscape. The study uses various open-source LLMs, including models from the Llama, Dolphin, Mixtral, Mistral, Gemma, Phi, and Aya families, with different sizes and quantization levels. The performance of these models is evaluated using a labelled benchmark dataset from Cloudflare Radar, which includes both AI-based categorization and human validation. A model's prediction counts as correct if it assigns a website to at least one of the top three categories provided by the benchmark. The findings show the potential of open-source LLMs for website categorization, with some models achieving accuracy rates exceeding 70%. This research provides a promising approach for leveraging open-source LLMs in website categorization tasks, contributing to natural language processing and web classification. |
Item Type: | Essay (Bachelor) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Business & IT BSc (56066) |
Link to this item: | https://purl.utwente.nl/essays/101155 |
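The accuracy criterion described in the abstract (a prediction is correct when it matches at least one of the benchmark's top three categories) can be illustrated with a minimal sketch. This is not the thesis's own code: the function name `top3_accuracy`, the dictionary layout, the case-insensitive comparison, and the example domains and category labels are all assumptions made here for illustration.

```python
def top3_accuracy(predictions, benchmark):
    """Share of websites with at least one predicted category among
    the first three benchmark categories (assumed to be ranked)."""
    if not predictions:
        raise ValueError("no predictions to score")

    hits = 0
    for site, predicted in predictions.items():
        # Only the first three benchmark categories count as valid targets.
        top3 = {c.strip().lower() for c in benchmark.get(site, [])[:3]}
        guessed = {c.strip().lower() for c in predicted}
        if top3 & guessed:  # any overlap is a correct classification
            hits += 1
    return hits / len(predictions)


# Hypothetical data: placeholder domains and made-up category labels.
benchmark = {
    "a.example": ["News", "Politics", "Business"],
    "b.example": ["Shopping", "Fashion", "Lifestyle", "Technology"],
}
predictions = {
    "a.example": ["politics"],    # second benchmark category: counted correct
    "b.example": ["Technology"],  # only fourth in the benchmark: counted wrong
}

print(top3_accuracy(predictions, benchmark))  # 0.5
```

Normalizing case and whitespace before comparing is a design choice of this sketch; free-text model output would likely need a stricter mapping onto the benchmark's category names.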