University of Twente Student Theses


Automatic Generic Web Information Extraction at Scale

Aljabary, Mahmoud (2021) Automatic Generic Web Information Extraction at Scale.

Abstract: The internet is growing rapidly, and so is the need to extract valuable information from the web. Web data is messy and disconnected, which poses a challenge for information extraction research. Current extraction methods are limited to a specific website schema, require manual work, and are hard to scale. In this thesis, we propose a novel component-based design method to solve these challenges in a generic and automatic way. The global design consists of (1) a relevancy filter (binary classifier) to filter out irrelevant websites; (2) a feature extraction component to extract useful features, including XPath, from the relevant websites; (3) an XPath-based clustering component to group similar web page elements into clusters based on Levenshtein distance; and (4) a knowledge-based entity recognition component to link clusters with their corresponding entities. The last component remains for future work. Our experiments show that this generic approach has substantial potential to structure and extract web data at scale without the need for pre-defined website schemas. Future experiments are needed to link entities with their attributes and to explore untapped candidate features.
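As an illustration of the XPath-based clustering idea mentioned in the abstract, the following is a minimal sketch, not the thesis implementation. It assumes a plain character-level Levenshtein distance and a simple greedy grouping against the first member of each cluster; the function names, the `max_distance` threshold, and the sample XPaths are all hypothetical and chosen only for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def cluster_xpaths(xpaths, max_distance=2):
    """Greedy grouping (an assumption, not the thesis algorithm):
    each XPath joins the first cluster whose first member is within
    max_distance edits; otherwise it starts a new cluster."""
    clusters = []
    for xp in xpaths:
        for cluster in clusters:
            if levenshtein(xp, cluster[0]) <= max_distance:
                cluster.append(xp)
                break
        else:
            clusters.append([xp])
    return clusters


# Hypothetical XPaths: three list links and two paragraphs.
xpaths = [
    "/html/body/div[1]/ul/li[1]/a",
    "/html/body/div[1]/ul/li[2]/a",
    "/html/body/div[1]/ul/li[3]/a",
    "/html/body/div[2]/p[1]",
    "/html/body/div[2]/p[2]",
]
clusters = cluster_xpaths(xpaths)
for cluster in clusters:
    print(cluster)
```

In this sketch, XPaths that differ only in a positional index (for example `li[1]` and `li[2]`) are one edit apart and so fall into the same cluster, which matches the intuition that repeated page elements share nearly identical paths.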
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)