University of Twente Student Theses
LLMs for OCR Post-Correction
Veninga, M.E.B. (2024) LLMs for OCR Post-Correction.
PDF (288 kB)
Abstract: In this thesis, I examine the use of Large Language Models (LLMs) for Optical Character Recognition (OCR) post-correction. Pretrained LLMs exhibit an understanding of language that can be exploited to correct mistakes in OCR output, but fine-tuning is needed for good performance. I show that fine-tuned versions of the ByT5 LLM correct mistakes in OCR text better than a state-of-the-art method. Preprocessing techniques are shown to affect the models' ability to correct OCR errors: ByT5 models achieve the highest Character Error Rate (CER) reduction when the text is lowercased and strange characters are removed. Context length also has a strong impact on effectiveness; the best context length was found to be 50 characters, with longer and shorter contexts yielding worse CER reduction and worse F1 scores. I also show that few-shot learning cannot teach a generative LLM to correct OCR text without fine-tuning. Future research could investigate whether fine-tuning larger language models further improves post-OCR error correction.
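The abstract's central metric, Character Error Rate (CER) and its reduction, can be sketched as follows. This is not code from the thesis, just a minimal illustration of the standard definition: CER is the Levenshtein edit distance between a hypothesis and the reference, normalised by reference length, and CER reduction is the relative improvement of the corrected text over the raw OCR output.

```python
# Hypothetical sketch of the CER and CER-reduction metrics; not the thesis code.

def edit_distance(a: str, b: str) -> int:
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(hypothesis: str, reference: str) -> float:
    # CER = edit distance to the reference, normalised by reference length.
    return edit_distance(hypothesis, reference) / len(reference)

def cer_reduction(ocr: str, corrected: str, reference: str) -> float:
    # Relative CER improvement of the post-corrected text over raw OCR output.
    before, after = cer(ocr, reference), cer(corrected, reference)
    return (before - after) / before

# Toy example with simulated OCR noise (made up for illustration).
reference = "the quick brown fox"
ocr       = "th3 qu1ck brovvn fox"
corrected = "the quick brown fox"
print(round(cer(ocr, reference), 3))                      # → 0.211
print(round(cer_reduction(ocr, corrected, reference), 3)) # → 1.0
```

A CER reduction of 1.0 means the post-correction step removed all character errors; negative values would mean the model made the text worse than the raw OCR output.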
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: https://purl.utwente.nl/essays/102117