University of Twente Student Theses

Text-based classification of websites using self-hosted Large Language Models : An accuracy and efficiency analysis

Sava, D. (2024) Text-based classification of websites using self-hosted Large Language Models : An accuracy and efficiency analysis.

Abstract: Website categorization is essential for applications like content filtering, targeted advertising, and web analytics. However, traditional approaches face challenges due to the internet’s rapid growth and changing nature. This research explores the potential of using open-source large language models (LLMs) as a more efficient and accurate solution for website categorization. By leveraging the vast knowledge acquired by LLMs through training on large amounts of web data, the aim is to develop an approach that reduces the reliance on manually labelled datasets and adapts to the dynamic internet landscape. The study uses various open-source LLMs, including models from the Llama, Dolphin, Mixtral, Mistral, Gemma, Phi, and Aya families, with different sizes and quantization levels. The performance of these models is evaluated using a labelled benchmark dataset from Cloudflare Radar, which includes both AI-based categorization and human validation. The accuracy of the LLMs is assessed based on their ability to assign websites to at least one of the top three categories provided by the benchmark. The findings show the potential of open-source LLMs for website categorization, with some models achieving accuracy rates exceeding 70%. This research provides a promising approach for leveraging open-source LLMs in website categorization tasks, contributing to natural language processing and web classification.
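The accuracy criterion described in the abstract (a prediction counts as correct if it matches at least one of the benchmark's top three categories) can be illustrated with a minimal sketch. This is not the thesis's own evaluation code: the function name `top3_accuracy`, the example domains and category labels, and the case-insensitive matching are all assumptions made for illustration.

```python
from typing import Dict, List, Set


def top3_accuracy(
    predictions: Dict[str, Set[str]],
    benchmark: Dict[str, List[str]],
) -> float:
    """Fraction of benchmark sites for which at least one predicted
    category appears among the benchmark's first three categories.

    Assumptions (not taken from the thesis): benchmark categories are
    listed in rank order, and labels are compared case-insensitively.
    """
    # Only score sites that have both a benchmark entry and a prediction.
    scored = [site for site in benchmark if site in predictions]
    if not scored:
        return 0.0

    hits = 0
    for site in scored:
        predicted = {label.lower() for label in predictions[site]}
        top_three = {label.lower() for label in benchmark[site][:3]}
        if predicted & top_three:
            hits += 1
    return hits / len(scored)


# Hypothetical data for illustration only.
example_benchmark = {
    "news.example": ["News & Media", "Politics", "Business"],
    "shop.example": ["Ecommerce", "Shopping", "Fashion"],
    "game.example": ["Gaming", "Entertainment", "Technology", "Forums"],
}
example_predictions = {
    "news.example": {"politics"},  # matches a top-three category
    "shop.example": {"Travel"},    # no match
    "game.example": {"Forums"},    # matches only the fourth category
}

if __name__ == "__main__":
    score = top3_accuracy(example_predictions, example_benchmark)
    print(f"top-3 accuracy: {score:.2f}")
```

In this sketch a site counts as a hit as soon as any predicted label falls within the first three benchmark labels, so a match on a lower-ranked benchmark category is scored as a miss.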
Item Type: Essay (Bachelor)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Business & IT BSc (56066)
Link to this item: https://purl.utwente.nl/essays/101155