University of Twente Student Theses


Automatic evaluation of companies' alignment with EU Taxonomy using Large Language Models

Nguyen, Q.H. (2024) Automatic evaluation of companies' alignment with EU Taxonomy using Large Language Models.


Abstract:This project presents an end-to-end system that uses Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) to automatically evaluate a company's EU Taxonomy performance based on its sustainability reports, answering two questions: (1) Which prompting strategy is most suitable among Zero-shot Chain-of-Thought (CoT), Few-shot CoT, and no-CoT prompting, and (2) Which retriever system is most suitable for retrieving EU Taxonomy-related information from companies' reports. For the first question, we developed a qualitative human evaluation score to compare the answers' informativeness and correctness, and investigated whether automated metrics such as BERTScore or BLEU correlate with these human-evaluation scores. For the second question, we compared different keyword extraction techniques (for keyword retrievers) and query splitting and expansion techniques (for vector retrievers), and investigated the role of reranking in retriever systems. For question (1), results show that Zero-shot CoT prompting performs slightly better than traditional (no-CoT) prompting, followed by Few-shot CoT prompting, possibly due to the significantly longer prompts that Few-shot CoT requires. We also found that CoT prompting showed a higher correlation between automated and human-evaluation metrics than no-CoT prompting, making it easier to flag errors automatically.
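The correlation study described above can be illustrated with a small self-contained sketch: compute a Spearman rank correlation between hypothetical human-evaluation scores and automated metric scores (all values below are invented for illustration, not taken from the thesis).

```python
import math

def rank(xs):
    """Map each value to its 1-based rank, averaging ranks over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        # extend j over a run of tied values
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Invented example scores: five answers rated by a human (1-5 scale)
# and by an automated metric such as BERTScore (0-1 scale).
human = [1, 2, 2, 4, 5]
metric = [0.70, 0.74, 0.73, 0.85, 0.91]
rho = spearman(human, metric)  # close to 1.0 when rankings agree
```

A high rho between the two score lists is what would justify using the automated metric to flag likely errors without a human in the loop.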
For the second question, we found that: (i) keyword extraction techniques do not measurably improve the BM25 keyword retriever's performance; (ii) splitting long queries into more self-contained sub-queries, whether with separators or with an LLM, yields a considerable performance boost for the vector retriever; (iii) LLM-generated hypothetical answers also show a significant improvement over the naive query-splitting method; and (iv) Cross-Encoder reranking often filters out good results annotated by humans, and the choice of reranking question also plays a significant role in the Cross-Encoder reranking model's performance. Finally, although our system and evaluation methods are not flawless, we have demonstrated that LLMs and RAG can assist humans in extracting EU Taxonomy-related information from a company's report and measuring that company's EU Taxonomy performance.
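The BM25 keyword retriever used as a baseline above can be sketched with a minimal stdlib implementation of Okapi BM25 scoring (the sample documents and query below are invented, and real systems would use a proper tokenizer rather than whitespace splitting):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each document against lowercase query terms with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    N = len(tokenized)
    avgdl = sum(len(d) for d in tokenized) / N
    # document frequency: in how many documents does each term occur?
    df = Counter()
    for d in tokenized:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            norm = tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[t] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented report snippets; only the last is EU Taxonomy-relevant.
docs = [
    "The company discloses capital expenditure plans for 2023.",
    "Our marketing team expanded into new regions this year.",
    "Revenue aligned with the EU Taxonomy climate objectives is reported below.",
]
query = ["taxonomy", "aligned", "revenue"]
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)  # index of top document
```

In a full pipeline of the kind the thesis describes, the top-scoring chunks would then be passed (optionally after reranking) to the LLM as retrieved context.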
Item Type:Essay (Master)
Clients:
ING Groep N.V., Amsterdam, Netherlands
Faculty:EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject:54 computer science
Programme:Computer Science MSc (60300)
Link to this item:https://purl.utwente.nl/essays/102987

 
