University of Twente Student Theses

Information Extraction From Sustainability Reports Using Document AI

Vreeman, H.S. (2025) Information Extraction From Sustainability Reports Using Document AI.

This is the latest version of this item.

Abstract: Document AI aimed at extracting information from multi-page documents has advanced rapidly in recent years. In this study, we assess how usable current Document AI is for automatic information extraction from sustainability reports. This addresses a gap left by another frequently used information extraction method, LLMs, which cannot capture visual data from the document. First, we determine the requirements for an information extraction tool used for sustainability benchmarking, together with four sustainability reporting experts. Second, we evaluate the performance of publicly available methods on sustainability reporting data. Third, we aim to adapt the best model to a sustainability reporting setting. We show how quantized low-rank adaptation fine-tuning and hypothetical document embeddings can improve Document AI models in a sustainability reporting setting: they increase the retrieval performance of a state-of-the-art page-retrieval model while significantly reducing the required memory. In addition, we show that an automatic fine-tuning pipeline can effectively increase the performance of this retrieval model while reducing the time needed to apply fine-tuning. Despite these performance gains, our solution still performs suboptimally on lengthy documents that contain complex business terminology.
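To illustrate the idea behind hypothetical document embeddings mentioned in the abstract, the following is a minimal sketch and not the thesis's implementation. Instead of embedding the query directly, the retriever embeds a hypothetical answer passage and ranks pages against that. Everything here is an assumption made for illustration: the bag-of-words `embed` function stands in for a real encoder, `generate_hypothetical_document` stands in for an LLM call and returns a canned passage, and the three pages are made-up toy text.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a real encoder (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def generate_hypothetical_document(query):
    """Hypothetical placeholder for an LLM that drafts a plausible answer passage."""
    canned = {
        "how much carbon did the company emit":
            "total greenhouse gas emissions were 500 tonnes co2 equivalent "
            "in scope 1 and scope 2",
    }
    # Fall back to the raw query when no hypothetical passage is available.
    return canned.get(query, query)


def retrieve(query, pages, use_hyde=True):
    """Return the index of the best-matching page."""
    probe = generate_hypothetical_document(query) if use_hyde else query
    probe_vec = embed(probe)
    scores = [cosine(probe_vec, embed(page)) for page in pages]
    # max() returns the first index on ties.
    return max(range(len(pages)), key=scores.__getitem__)


# Made-up toy pages, for illustration only.
pages = [
    "our board of directors met four times this year",
    "scope 1 greenhouse gas emissions were 1200 tonnes co2 equivalent",
    "employee training hours increased across all regions",
]
query = "how much carbon did the company emit"

best_with_hyde = retrieve(query, pages, use_hyde=True)
best_without_hyde = retrieve(query, pages, use_hyde=False)
```

In this toy setup the raw query shares no terms with any page, so direct matching cannot tell the pages apart, while the hypothetical passage uses the vocabulary of the report and points to the emissions page. A real system would replace both stand-ins with a learned encoder and a generative model.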
Item Type: Essay (Master)
Clients: EY, Amsterdam, Netherlands
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 43 environmental science, 54 computer science
Programme: Business Information Technology MSc (60025)
Link to this item: https://purl.utwente.nl/essays/106974