University of Twente Student Theses

LLMs for OCR Post-Correction

Veninga, M.E.B. (2024) LLMs for OCR Post-Correction.

Full text: PDF (288 kB)
Abstract: In this thesis, I examine the use of Large Language Models (LLMs) for the task of Optical Character Recognition (OCR) post-correction. Pretrained LLMs exhibit an understanding of language that can be exploited to correct mistakes in OCR output, but good performance requires fine-tuning the models. I show that fine-tuned versions of the ByT5 LLM correct mistakes in OCR text better than a state-of-the-art method. Preprocessing techniques are shown to affect the models' ability to correct OCR errors: ByT5 models achieve the highest Character Error Rate (CER) reduction when the text is lowercased and strange characters are removed. Context length also has a strong impact on the effectiveness of the models; the best context length was found to be 50 characters, with longer and shorter contexts yielding worse CER reduction and worse F1 scores. I also show that few-shot learning cannot teach a generative LLM to correct OCR text without fine-tuning the model. Future research could investigate whether fine-tuning larger language models further improves effectiveness on the post-OCR error correction task.
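The abstract describes a character-level pipeline: OCR text is lowercased and stripped of strange characters, split into fixed-length contexts (50 characters worked best), and scored by the reduction in Character Error Rate (CER). The sketch below only illustrates those steps and is not the thesis code; the function names, the regular expression defining "strange characters", and the chunking strategy are assumptions.

```python
# Minimal illustration (not the thesis implementation) of preprocessing,
# 50-character context splitting, and CER-reduction scoring.
import re


def preprocess(text: str) -> str:
    """Lowercase and drop characters outside a basic printable set (assumed definition)."""
    text = text.lower()
    return re.sub(r"[^a-z0-9\s.,;:'\"!?-]", "", text)


def split_into_contexts(text: str, length: int = 50) -> list[str]:
    """Cut the text into fixed-size character windows (50 was the best length in the thesis)."""
    return [text[i:i + length] for i in range(0, len(text), length)]


def levenshtein(a: str, b: str) -> int:
    """Standard character-level edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character Error Rate: edit distance normalised by reference length."""
    return levenshtein(hypothesis, reference) / max(len(reference), 1)


def cer_reduction(ocr_text: str, corrected: str, ground_truth: str) -> float:
    """Relative CER improvement of the corrected text over the raw OCR output."""
    before, after = cer(ocr_text, ground_truth), cer(corrected, ground_truth)
    return (before - after) / before if before else 0.0
```

Here CER is edit distance normalised by the reference length, so cer_reduction reports how much of the original OCR error the correction model removes; a fine-tuned model's output would be passed in as the corrected text.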
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: https://purl.utwente.nl/essays/102117