University of Twente Student Theses


Identify and extract entities from bibliography references in a free text

Chenet, M. (2017) Identify and extract entities from bibliography references in a free text.

Abstract:Elsevier is the world’s largest scientific publishing company. It owns a database named Scopus that stores a multitude of scientific papers, books and manuscripts. Besides scientific documents, author profiles are also stored. An author profile contains information such as the documents the author has published and the number of citations those works receive from other scientific publications. Sometimes scientific articles are missing from a profile, or the number of citations is lower than the author expected. There are several possible reasons for this. One is that the document may be out of policy and therefore not yet referenced in the database. In such cases the author can contact Scopus customer service by email, including a reference to the missing scientific document. These references are written within the text of the email in various styles and are often incomplete: sometimes the year of publication is missing, only the first author is mentioned, or the title is incomplete. Elsevier is developing a technology that aims to help Customer Service operators understand more quickly the problem behind a missing scientific document record. To understand such problems and automate their handling, for example through automatic recognition of out-of-policy papers, the machine has to be able to determine which documents are referenced within the text of the emails. To extract the right entities and retrieve the correct document mentioned in free text, such as emails or web forms, this research project was divided into three main steps. The first step is to identify the parts of a text that refer to a bibliographic referent, the reference text. The second step is to identify the components of a reference text, or reference entities, e.g. author name(s), title, year of publication, journal name, etc.
The third step is to match the referent in the text with the real document according to the components identified. This led to the following question: can each of these steps be done fully automatically using machine learning approaches in order to retrieve the correct scientific document indicated by the reference in the email? This report presents the answer to this question. It focuses on reference entity recognition in email data, which is treated as free text. It was found that by narrowing down the problem, that is, by splitting the email into sentences, classifying which sentences contain references, and then performing entity extraction only on those sentences, it is possible to obtain good performance in reference entity extraction. With good accuracy in entity recognition, it is possible to recognize and retrieve the correct scientific document mentioned within unstructured (free) text.
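The three steps described in the abstract can be illustrated with a small sketch. This is not the method used in the thesis, which relies on machine learning; here hand-written regular expressions and fuzzy string matching stand in for the learned models, purely to show how the stages connect. The sample email, the catalog entries, the patterns and all function names are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Hypothetical patterns standing in for learned models (assumptions, not from the thesis).
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")                      # four-digit year
AUTHOR_RE = re.compile(r"\b([A-Z][a-z]+),\s*((?:[A-Z]\.\s?)+)")  # e.g. "Smith, J."
TITLE_RE = re.compile(r'"([^"]+)"')                              # title in double quotes

# Hypothetical email and document catalog, for illustration only.
EMAIL = """Dear Scopus team,
my profile is missing one of my papers.
It is Smith, J. (2015) "Entity extraction in free text", Journal of Examples.
Could you please add it?"""

CATALOG = [
    {"id": "doc-1", "title": "Entity extraction in free text", "year": "2015"},
    {"id": "doc-2", "title": "Graph databases in practice", "year": "2012"},
]


def is_reference_sentence(sentence):
    """Step 1: flag text that looks like a bibliographic reference."""
    has_year = YEAR_RE.search(sentence) is not None
    has_author = AUTHOR_RE.search(sentence) is not None
    has_title = TITLE_RE.search(sentence) is not None
    return has_year and (has_author or has_title)


def extract_entities(sentence):
    """Step 2: pull out author, year and title; missing parts stay None."""
    author = AUTHOR_RE.search(sentence)
    year = YEAR_RE.search(sentence)
    title = TITLE_RE.search(sentence)
    return {
        "author": f"{author.group(1)}, {author.group(2).strip()}" if author else None,
        "year": year.group(0) if year else None,
        "title": title.group(1) if title else None,
    }


def best_match(entities, catalog, threshold=0.6):
    """Step 3: pick the catalog entry whose title is most similar."""
    best, best_score = None, 0.0
    for doc in catalog:
        score = SequenceMatcher(
            None, (entities["title"] or "").lower(), doc["title"].lower()
        ).ratio()
        # Penalise a candidate whose year contradicts the extracted year.
        if entities["year"] and entities["year"] != doc["year"]:
            score *= 0.5
        if score > best_score:
            best, best_score = doc, score
    return best if best_score >= threshold else None


def resolve_references(text, catalog):
    """Run all three steps, treating each line of the text as one sentence."""
    results = []
    for line in text.splitlines():
        line = line.strip()
        if line and is_reference_sentence(line):
            entities = extract_entities(line)
            results.append((entities, best_match(entities, catalog)))
    return results


if __name__ == "__main__":
    for entities, doc in resolve_references(EMAIL, CATALOG):
        print(entities, "->", doc["id"] if doc else "no match")
```

Filtering first means the extraction step only ever sees text that probably contains a reference, which is the narrowing-down idea the abstract reports as helpful. Because matching is fuzzy rather than exact, an incomplete title can still reach the right record, provided the similarity stays above the chosen threshold.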
Item Type:Essay (Master)
Faculty:EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:54 computer science
Programme:Interaction Technology MSc (60030)