Development of a Model to Extract Diabetes Type and Year of Diagnosis from Medical Texts

Author(s): Kenzy Dario Sanjaya (2026)

Abstract:

Extracting key clinical information from unstructured electronic health record (EHR) text can support research and clinical workflows, but extraction performance is often limited by inconsistent phrasing, missing context, and conflicting dates. This thesis investigates the automatic extraction of diabetes type and year of diagnosis from Dutch-language EHR narratives originating at ZGT and compares two approaches: a supervised Named Entity Recognition (NER) pipeline based on MedRoBERTa.nl, and prompt-based extraction using locally deployable large language models (LLMs).

Document(s):

Final_Project_Newcover.pdf