University of Twente Student Theses


Extracting information from funding opportunities using state-of-the-art NLP method BERT with minimized data annotation

Rietvelt, D.C.J.C (2022) Extracting information from funding opportunities using state-of-the-art NLP method BERT with minimized data annotation.

Abstract: Elsevier is an information analytics business that helps researchers advance science. One of the services Elsevier offers is helping researchers find suitable funding opportunities. Information on the eligibility of potential applicants is crucial for recommending suitable funding opportunities. Currently, eligibility criteria are extracted manually within Elsevier. This thesis is a first step towards automating eligibility criteria extraction from funding data. It investigates BERT as a method for extracting eligibility information from funding data. The focus is on BERT specifically because it shows state-of-the-art results in various natural language understanding tasks, including information extraction. Furthermore, BERT is a pre-trained neural network, which requires less labeled training data to perform well than other neural networks. In addition, BERT had not yet been applied to funding data. Together, these points made it interesting to research BERT for information extraction from funding data. Literature shows that a three-step approach with a segmentation step and two classification steps, where BERT is deployed as a classifier, is common in information extraction. In this work, this approach is tested on funding data from Elsevier. As creating training and evaluation datasets is labor-intensive, an approach with minimized data annotation is applied by using a small dataset and balancing the data prior to annotation. The three-step approach, using BERT as the classifier, is compared to a BiLSTM implementation and a conventional machine learning implementation. The results show that BERT outperforms the two other implementations. It was suspected that the BERT embeddings are crucial to outperforming the other implementations. An additional experiment explored this and found that BERT embeddings indeed play an essential role.
Although BERT outperforms the two other implementations, its performance is too low to be viable in practice. It is suspected that the suboptimal performance can be attributed to the data (quality, size, and distribution) used in this case study. Future work is needed to confirm that the data limited the performance in this study. If so, a three-step approach with BERT for the classification tasks is a viable way to extract information from funding opportunities with minimized data annotation, as long as the data is of high quality and the resources to run a deep neural network are available.
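
The three-step shape described in the abstract (one segmentation step followed by two classification steps) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the sentence-level segmentation, the function names, the keyword cues, and the category labels are all hypothetical, and simple keyword rules stand in for the BERT classifiers so that the sketch runs without any model or external library.

```python
import re
from typing import Callable, List, Tuple


def segment(text: str) -> List[str]:
    """Step 1 (assumed): split a funding description into sentence-like segments."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def is_eligibility(segment_text: str) -> bool:
    """Step 2 (placeholder): flag segments that mention eligibility.

    A hypothetical keyword rule; in the thesis a trained classifier
    such as BERT fills this role.
    """
    cues = ("eligible", "applicants must", "open to")
    lowered = segment_text.lower()
    return any(cue in lowered for cue in cues)


def eligibility_type(segment_text: str) -> str:
    """Step 3 (placeholder): assign a hypothetical eligibility category."""
    lowered = segment_text.lower()
    if "phd" in lowered or "doctoral" in lowered:
        return "career_stage"
    if "citizen" in lowered or "resident" in lowered:
        return "nationality"
    return "other"


def extract(
    text: str,
    detect: Callable[[str], bool] = is_eligibility,
    categorize: Callable[[str], str] = eligibility_type,
) -> List[Tuple[str, str]]:
    """Run the three steps in order and return (segment, category) pairs."""
    return [(s, categorize(s)) for s in segment(text) if detect(s)]


example = (
    "The fund supports climate research. "
    "Applicants must hold a PhD. "
    "Only EU citizens are eligible. "
    "The deadline is in May."
)
results = extract(example)
for seg, label in results:
    print(f"{label}: {seg}")
```

Because `extract` takes the two classifiers as arguments, the keyword rules could be replaced by any trained model (BERT, BiLSTM, or a conventional classifier) without changing the pipeline itself, which mirrors the kind of side-by-side comparison the abstract describes.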
Item Type: Essay (Master)
Elsevier, Amsterdam, The Netherlands
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 17 linguistics and theory of literature, 50 technical science in general, 54 computer science
Programme: Interaction Technology MSc (60030)

