University of Twente Student Theses
Web Scraping as a Data Source for Machine Learning Models and the Importance of Preprocessing Web Scraped Data
Fraňo, Maxim (2024) Web Scraping as a Data Source for Machine Learning Models and the Importance of Preprocessing Web Scraped Data.
Abstract: | Machine learning (ML) depends on large amounts of data. There are many methods of gathering data, ranging from recording it manually to using complex algorithms. This research focuses on web scraping as a method of data extraction and on its effect on ML models. The approach is split into two parts, theoretical and practical. First, the connection between web scraping and ML is examined through a literature analysis. Then an experiment, carried out with a tool written in Python, looks at the use of non-preprocessed web-scraped data in a dataset for training an ML model. The results of the literature review show that web-scraped data can be highly varied, gathered through a wide range of tools, and used in a large number of different ML models, while the experiment shows the importance of preprocessing web-scraped data to achieve high performance in these models. |
Item Type: | Essay (Bachelor) |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Business & IT BSc (56066) |
Awards: | Best Paper Award |
Link to this item: | https://purl.utwente.nl/essays/100867 |