University of Twente Student Theses
Classifying Companies Based on Textual Webpage Data: A Comparative Analysis
Weening, J. (2024) Classifying Companies Based on Textual Webpage Data: A Comparative Analysis.
Abstract: | The growth of the World Wide Web continually pushes companies to find innovative ways to promote themselves and market their products. Large corporations invest heavily to reach the top positions in search results, making it infeasible for small business owners to compete. Q-info.com, a web platform created by E-Active, offers these small businesses an alternative: an easy and affordable way to attract customers, sell products and manage their finances. However, Q-info.com faces the same problem of getting businesses to find its platform. With hundreds of industry-specific sites, it employs a new strategy to attract small businesses. This research answers the question: What macro-averaged precision, recall and F1-score are achievable with Naive Bayes (NB), Support Vector Machine (SVM) and BERT classifiers when determining the industry of a company from the textual data on its website? With this information, E-Active can set up a system that classifies a company by its website and then invites it to the matching industry-specific site on Q-info.com. The study achieved macro-averaged scores of 81% for precision and 78% for both recall and F1-score. These best results were obtained with an SVM classifier predicting industries on a cleaned dataset with 178 distinct classes. The study compared the models, tuning each to optimize precision. Additionally, a voting ensemble was implemented to study the combined predictive power of the three classifiers. Data cleaning was done by removing records that were incorrectly predicted by each model under 10-fold cross-validation; this resulted in a maximum performance increase of 25 percentage points. |
Item Type: | Essay (Master) |
Clients: | e-Active, Zwolle, The Netherlands |
Faculty: | EEMCS: Electrical Engineering, Mathematics and Computer Science |
Subject: | 54 computer science |
Programme: | Computer Science MSc (60300) |
Link to this item: | https://purl.utwente.nl/essays/98408 |
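The abstract reports macro-averaged precision, recall and F1-score. As a minimal sketch of what macro-averaging means, the Python below computes each metric per class and then takes the unweighted mean over classes, so a rare industry counts as much as a common one. The function name `macro_scores`, the toy labels, the choice to score an undefined per-class value as 0, and the choice to average per-class F1 values are all assumptions made for illustration; the thesis may define or compute these metrics differently.

```python
from collections import Counter


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, F1).

    Hypothetical helper: each metric is computed per class and then
    averaged with equal weight per class. A per-class value that is
    undefined (division by zero) is counted as 0.0 by assumption.
    """
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (invented): one common industry and two rare ones.
y_true = ["bakery", "bakery", "bakery", "bakery", "plumber", "florist"]
y_pred = ["bakery", "bakery", "bakery", "bakery", "florist", "florist"]

precision, recall, f1 = macro_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

In this toy case five of six records are classified correctly, yet the macro-averaged scores are much lower because the single missed "plumber" record pulls that class's scores to zero. With many classes of uneven size, such as the 178 industries mentioned in the abstract, macro-averaged figures can therefore differ noticeably from plain accuracy.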