University of Twente Student Theses

Web Scraping as a Data Source for Machine Learning Models and the Importance of Preprocessing Web Scraped Data

Fraňo, Maxim (2024) Web Scraping as a Data Source for Machine Learning Models and the Importance of Preprocessing Web Scraped Data.

Abstract:	Machine learning (ML) depends on large amounts of data. Data can be gathered in many ways, ranging from writing it down manually to using complex algorithms. This research focuses on web scraping as a method of data extraction and on its effect on ML models. The approach is split into a theoretical part and a practical part. First, the connection between web scraping and ML is examined through a literature analysis. Then, an experiment is carried out with a tool written in Python, in which non-preprocessed web-scraped data is used in a dataset for training an ML model. The literature review shows that web-scraped data can vary greatly, can be gathered with a wide range of tools, and is used in a large number of different ML models, while the experiment shows the importance of preprocessing web-scraped data to achieve high performance in these models.
Item Type:	Essay (Bachelor)
Faculty:	EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:	54 computer science
Programme:	Business & IT BSc (56066)
Awards:	Best Paper Award
Link to this item:	https://purl.utwente.nl/essays/100867