University of Twente Student Theses


Classifying Companies Based on Textual Webpage Data: A Comparative Analysis

Weening, J. (2024) Classifying Companies Based on Textual Webpage Data: A Comparative Analysis.

Abstract: The growth of the World Wide Web continually drives companies to find innovative ways to promote themselves and market their products. Large corporations invest many resources to achieve top spots in search results, making it infeasible for small business owners to compete. Q-info.com, a web platform created by E-Active, offers new solutions for these companies. The platform makes it easy and affordable for them to attract customers, sell products and manage their finances. However, Q-info.com faces the same problem: getting businesses to find its platform. With hundreds of industry-specific sites, it employs a new strategy to attract small businesses. This research answers the question: What macro-averaged precision, recall and F1-score performance is achievable with NB, SVM and BERT classifiers when determining the industry of a company using the textual data from its website? With this information, E-Active can set up a system that classifies a company by its website and consequently invites it to the matching industry-specific site on Q-info.com. This research achieved macro-averaged performances of 81% in precision and 78% in both recall and F1-score. These best results were obtained with an SVM classifier predicting industries on a cleaned dataset with 178 distinct classes. The study compared the different models, tuning them to optimize precision. Additionally, a voting ensemble was implemented to study the combined predictive power of the three classifiers. Data cleaning was done by removing records that were incorrectly predicted by each model under 10-fold cross-validation; this resulted in a maximum performance increase of 25 percentage points.
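As an illustration of the metrics reported in the abstract, the sketch below computes macro-averaged precision, recall and F1-score in plain Python. It is not taken from the thesis: the function name `macro_scores`, the example industry labels, and the choice to define macro F1 as the unweighted mean of the per-class F1 values are all assumptions made for this example.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1) for two label lists.

    Assumption: macro F1 is the unweighted mean of per-class F1 scores.
    Classes with no predictions or no true samples score 0 for that metric.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical industry labels, purely for illustration.
y_true = ["bakery", "bakery", "florist", "florist", "garage"]
y_pred = ["bakery", "florist", "florist", "florist", "garage"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Because every class contributes equally to a macro average regardless of how many records it has, small industries weigh as heavily as large ones, which matters for a dataset with 178 classes.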
Item Type: Essay (Master)
Clients: e-Active, Zwolle, The Netherlands
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: https://purl.utwente.nl/essays/98408

 
