University of Twente Student Theses


APEx: Zero-shot Cross-Knowledge Graph Named Entity Extraction leveraging Wikification

Wiefferink, C.A. (2022) APEx: Zero-shot Cross-Knowledge Graph Named Entity Extraction leveraging Wikification.

Abstract:A key step in bridging the gap between natural language (NL) text and knowledge graphs (KGs) such as Wikipedia is named entity extraction (NEE). In a KG, the nodes represent concepts or named entities, and the edges between the nodes represent semantic relations. NEE is the automated extraction of mentions of named entities appearing in a text and the linking of those mentions to their corresponding entities in a KG. NEE with a target KG based on Wikipedia is also called wikification. The resulting links contribute to the vision of the semantic web and can help readers better understand the resource. Furthermore, they aid in creating a semantic representation of the document and can play a key role in a number of natural language processing (NLP) and information retrieval (IR) tasks. Wikification models are powerful because Wikipedia provides vast amounts of training data for an NEE model, but they lack domain-specific entities and are computationally expensive. Domain-specific KGs, on the other hand, lack the more general concepts and often have no training data available. Leveraging wikification to extract entities from a specific domain that spans more than one target KG (e.g. a subset of Wikidata combined with a domain-specific KG) is a problem that has not been attempted before. As a case study, we have used the job market as the application domain. To extract entities with two originally unaligned target KGs, we propose a method that consists of three main steps: KG Alignment, KG Pruning, and named entity Extraction (APEx). First, we have aligned our domain-specific KG with Wikidata to deal with the overlapping entities. Second, we have employed seed enrichment and a (strongly) local graph clustering (LGC) method, using the set of overlapping entities as seed, to prune entities from Wikidata that are not relevant to the job market domain.
Then, we constructed multiple target KGs consisting of (a pruned version of) Wikidata, ESCO, and combinations thereof. Finally, a form of zero-shot learning (ZSL) has been used to leverage a wikification model named Bootleg to perform NEE with the newly constructed target KGs. We have evaluated the NEE performance in two ways: quantitatively and qualitatively. We show that APEx outperforms the exact string matching (ESM) with popularity voting baseline, and achieves competitive results compared with Bootleg's original model, in nearly all combinations of target KGs and evaluation strictness levels. The main loss in performance can be attributed to mentions of entities in the text not being recognized, rather than to errors in disambiguating the entities. However, the NEE performance is better for IT-related than for non-IT-related documents. Finally, we show that APEx can significantly reduce the computational cost in terms of initialization time and memory usage. To summarize, we show the potential of using a wikification model for applications other than merely extracting entities to Wikidata, without the drawback of the computational cost of wikification and without the need for additional training.
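The abstract names an exact string matching (ESM) with popularity voting baseline but does not describe how it works. The sketch below is one plausible reading, not the thesis implementation: mentions are found by exact, case-insensitive, longest-match lookup of entity labels, and when several entities share a label, the one with the highest popularity score wins. The toy KG, its entity identifiers, its popularity scores, and the names `build_label_index` and `extract_entities` are all hypothetical and chosen for illustration only.

```python
import re
from collections import defaultdict

# Hypothetical toy KG: entity id -> (labels, popularity score).
# Ids and scores are invented for illustration; they are not Wikidata or ESCO data.
TOY_KG = {
    "toy:python_language": (["Python"], 900),
    "toy:python_snake": (["Python"], 40),
    "toy:machine_learning": (["machine learning"], 500),
    "toy:software_developer": (["developer", "software developer"], 300),
}


def build_label_index(entities):
    """Map each lower-cased label to the ids of all entities that carry it."""
    index = defaultdict(list)
    for entity_id, (labels, _popularity) in entities.items():
        for label in labels:
            index[label.lower()].append(entity_id)
    return index


def extract_entities(text, entities, max_ngram=3):
    """Return (mention, entity_id) pairs via exact matching and popularity voting."""
    index = build_label_index(entities)
    tokens = re.findall(r"\w+", text.lower())
    results = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram starting at token i first.
        for n in range(min(max_ngram, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            candidates = index.get(mention)
            if candidates:
                # Popularity voting: choose the most popular candidate entity.
                best = max(candidates, key=lambda e: entities[e][1])
                results.append((mention, best))
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return results


links = extract_entities(
    "We need a Python developer with machine learning skills.", TOY_KG
)
```

In this toy example the ambiguous label "Python" resolves to the more popular programming-language entity regardless of context, which illustrates why such a baseline is weaker at disambiguation than a context-aware model.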
Item Type:Essay (Master)
Little Rocket, Enschede, Netherlands
Faculty:EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:54 computer science
Programme:Computer Science MSc (60300)