Extracting document structure of a text with visual and textual cues

HE, Y. (2017) Extracting document structure of a text with visual and textual cues.

Abstract: Scientific papers are an important channel in the academic world and act as bridges that connect researchers. At ELSEVIER, the current journal publishing workflow is semi-automated and still requires a lot of manual work. With the aim of automating the publishing process, this research investigates the possibility of applying machine learning approaches to automate the structuring step of the current publishing pipeline. This step focuses on identifying document structure information in manuscripts, including Title, Author, Affiliation, Section Heading, Caption and Reference. In this work, we first propose an intermediate document representation, which we call "Structure Document Format (SDF)". Based on this representation, we devise an approach that uses available manuscripts to build a labelled data set for our machine learning experiments. In the experiments, we explore how machine learning models (Naive Bayes, Multilayer Perceptron, Decision Tree, Random Forest, Random Subspace, LogitBoost, Support Vector Machine) perform on both binary and multi-class classification problems. We also compare the performance of the different types of features fed into the models, including visual, shallow-textual and syntactic-textual features. Our models achieve generally reasonable performance in the experiments, with the Random Forest based model outperforming the other classifiers. We also observe that visual and textual features complement each other and achieve better performance when combined. Finally, we briefly discuss the challenges encountered during this work and give some hints on how performance can be further improved.
Item Type: Essay (Master)
Clients: ELSEVIER, Amsterdam, Netherlands
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Human Media Interaction MSc (60030)
Link to this item: http://purl.utwente.nl/essays/72979