Information Extraction From Sustainability Reports Using Document AI
Vreeman, H.S. (2025)
Document AI for extracting information from multi-page documents has advanced rapidly in recent years. In this study, we assess how usable current Document AI is for automatic information extraction from sustainability reports. This addresses a limitation of another frequently used information extraction method, LLMs, which cannot capture visual data from the document. First, we determine the requirements for an information extraction tool for sustainability benchmarking together with four sustainability reporting experts. Second, we evaluate the performance of publicly available methods on sustainability reporting data. Third, we aim to adapt the best model to a sustainability reporting setting. We show that quantized low-rank adaptation (QLoRA) fine-tuning and hypothetical document embeddings (HyDE) can improve Document AI models in a sustainability reporting setting: they increase the retrieval performance of a state-of-the-art page-retrieval model while significantly reducing the required memory. In addition, we show that an automatic fine-tuning pipeline can effectively increase the performance of this retrieval model while reducing the time needed to apply fine-tuning. Despite these performance increases, our solution still performs suboptimally on lengthy documents that contain complex business terminology.
HSVreeman_MA_EEMCS.pdf