Web Scraping as a Data Source for Machine Learning Models and the Importance of Preprocessing Web Scraped Data

Fraňo, Maxim (2024)

Machine learning (ML) depends on large amounts of data. There are many methods of gathering data, ranging from recording it manually to using complex algorithms. This research focuses specifically on web scraping as a method of data extraction and on its effect on ML models. The approach is split into two parts, theoretical and practical. First, the connection between web scraping and ML is examined through a literature analysis. Then, an experiment that uses non-preprocessed web-scraped data in a dataset for training an ML model is carried out with a tool written in Python. The results of the literature review show that web-scraped data can vary greatly, can be gathered with a wide range of tools, and is used in a large number of different ML models, while the experiment shows the importance of preprocessing web-scraped data to achieve high performance in these models.