University of Twente Student Theses

Automatic Generic Web Information Extraction at Scale

Aljabary, Mahmoud (2021) Automatic Generic Web Information Extraction at Scale.

Abstract:	The internet is growing rapidly, and so is the need to extract valuable information from the web. Web data is messy and disconnected, which poses a challenge for information extraction research. Current extraction methods are limited to a specific website schema, require manual work, and are hard to scale. In this thesis, we propose a novel component-based design method to solve these challenges in a generic and automatic way. The global design consists of four components: (1) a relevancy filter (binary classifier) that removes irrelevant websites; (2) a feature extraction component that extracts useful features, including XPath, from the relevant websites; (3) an XPath-based clustering component that groups similar web page elements into clusters based on Levenshtein distance; and (4) a knowledge-based entity recognition component that links clusters with their corresponding entities. The last component remains for future work. Our experiments indicate that this generic approach has considerable potential for structuring and extracting web data at scale without pre-defined website schemas. Future experiments are needed to link entities with their attributes and to explore untapped candidate features.
Item Type:	Essay (Master)
Faculty:	EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:	54 computer science
Programme:	Computer Science MSc (60300)
Link to this item:	https://purl.utwente.nl/essays/86153
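
The XPath-based clustering step named in the abstract can be illustrated with a minimal sketch. The abstract states only that clusters are formed from Levenshtein distance between XPaths; the greedy single-pass grouping, the `max_distance` threshold, the function names, and the sample XPaths below are all assumptions made for illustration, not the thesis's actual implementation.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cluster_xpaths(xpaths: List[str], max_distance: int = 3) -> List[List[str]]:
    """Hypothetical greedy clustering: each XPath joins the first cluster
    whose first member is within max_distance edits, else starts a new one."""
    clusters: List[List[str]] = []
    for xp in xpaths:
        for cluster in clusters:
            if levenshtein(xp, cluster[0]) <= max_distance:
                cluster.append(xp)
                break
        else:
            clusters.append([xp])
    return clusters


# Illustrative XPaths (invented for this sketch, not taken from the thesis).
sample = [
    "/html/body/div[1]/ul/li[1]/a",
    "/html/body/div[1]/ul/li[2]/a",
    "/html/body/div[1]/ul/li[3]/a",
    "/html/body/div[2]/p[1]",
    "/html/body/div[2]/p[2]",
]
clusters = cluster_xpaths(sample)
for group in clusters:
    print(len(group), group[0])
```

With this threshold, sibling elements whose XPaths differ only in a positional index fall into the same cluster, while structurally different paths start a new one. A real pipeline would likely need a length-normalized distance or a different clustering algorithm; that choice is left open here.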