University of Twente Student Theses
Benchmarking the Programming Capabilities of Large Language Models
Nizamudeen, M.F. (2025) Benchmarking the Programming Capabilities of Large Language Models.
PDF (2MB)
Abstract: This thesis benchmarks the programming abilities of the most widely used large language models (LLMs), particularly the flagship offerings from OpenAI, Google, and Anthropic. The experiments were separated into two sections: small problems and large problems. A total of 75 LeetCode problems were used to assess small-problem performance. Models were evaluated on accuracy and code quality, using measures such as maintainability index, source lines of code, and cyclomatic complexity as indicators of the quality of the code produced. The results showed that OpenAI's o1-mini had the best accuracy, while Claude 3.7 Sonnet produced the highest-quality code overall. GPT-4o-mini performed significantly worse than the other models. For the large problems, five distinct tasks were chosen across various programming languages and project types. The models tested in these experiments were OpenAI o4-mini, Claude 3.7 Sonnet, and Gemini 2.5 Pro. Each solution was assessed on functional correctness, maintainability, and ease of prompting the model. All of these experiments were carried out in an agentic manner using the Cursor IDE. Here, Claude 3.7 Sonnet achieved the best overall scores on all three metrics, OpenAI o4-mini was the second-best model, and Gemini 2.5 Pro showed the worst average performance of the models tested. While these results are encouraging, there are ample opportunities for future research: testing models from other providers, sampling a larger variety of problems from other sources, and comparing LLM-generated code to human-written code. Other aspects of software development, such as test generation and debugging, could also be explored. While LLMs are far from perfect, this thesis shows that, with the right prompting and human guidance, they can already serve as a helpful tool for both small and large programming tasks.
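The abstract cites source lines of code and cyclomatic complexity among its code-quality measures; the thesis's own measurement scripts are not reproduced here. As a minimal illustration of what these two metrics count, the sketch below uses only Python's standard-library `ast` module (the function names and the sample snippet are invented for this example; dedicated tools such as `radon` implement the full metric suite, including the maintainability index):

```python
import ast

# Decision-point node types counted by the classic McCabe metric.
# (A simplification: full implementations also weight each boolean
# operator inside a BoolOp individually.)
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                 ast.BoolOp, ast.IfExp)

def cyclomatic_complexity(source: str) -> int:
    """McCabe complexity: 1 + the number of decision points in the code."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, _BRANCH_NODES) for node in ast.walk(tree))

def sloc(source: str) -> int:
    """Source lines of code: non-blank lines that are not pure comments."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))

# A small hypothetical snippet to measure.
snippet = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in (2, 3, 5):
        if n % d == 0:
            return f"multiple of {d}"
    return "other"
'''

print(sloc(snippet), cyclomatic_complexity(snippet))  # prints: 7 4
```

Lower cyclomatic complexity and fewer source lines generally correlate with more maintainable code, which is why benchmarks of this kind report them alongside raw accuracy.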
Item Type: Essay (Master)
Faculty: EEMCS: Electrical Engineering, Mathematics and Computer Science
Subject: 54 computer science
Programme: Computer Science MSc (60300)
Link to this item: https://purl.utwente.nl/essays/107146