Investigating the Impact of Synthetic Data Balancing Techniques on Fairness in Credit Risk Machine Learning Models

Author(s): Tunc, Johan (2025)

Abstract:
In credit risk modeling, machine learning (ML) algorithms often face the challenge of class imbalance, where default cases are significantly underrepresented compared to non-default cases. A default occurs when a borrower fails to repay a loan. To address this imbalance, synthetic data balancing techniques such as the Synthetic Minority Oversampling Technique (SMOTE) and Adaptive Synthetic Sampling (ADASYN) are commonly applied before ML models are trained for credit risk assessment. However, the impact of these methods on both predictive performance and fairness is underexplored. This thesis investigates how synthetic data balancing techniques influence predictive performance and fairness on credit risk datasets. Using open-source data, classifiers such as logistic regression (LR) and XGBoost are evaluated with standard metrics including area under the ROC curve (AUC-ROC), precision, recall, and F1 score. Fairness is assessed using Equalized Odds, Demographic Parity, and Disparate Impact Ratio (DIR). Results show that while both balancing techniques modestly improve the predictive performance of logistic regression, their effect on XGBoost is minimal. Importantly, both methods reduce fairness disparities between genders, supporting more equitable model outcomes and aligning with regulatory requirements such as the EU AI Act and GDPR.
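To make the three fairness criteria named above concrete, the following is a minimal Python sketch of how they are commonly computed from binary predictions and a binary sensitive attribute. It is not the thesis's implementation: the function names, the toy data, and the choice to summarize Equalized Odds as the larger of the true-positive-rate and false-positive-rate gaps are assumptions made for illustration, and the thesis may define the metrics differently.

```python
from typing import Dict, List, Sequence


def _rate(values: List[int]) -> float:
    """Mean of a list of 0/1 values; NaN if the list is empty."""
    return sum(values) / len(values) if values else float("nan")


def selection_rate(y_pred: Sequence[int], group: Sequence[str], g: str) -> float:
    """Share of group g that receives a positive prediction."""
    return _rate([p for p, s in zip(y_pred, group) if s == g])


def conditional_rate(y_true: Sequence[int], y_pred: Sequence[int],
                     group: Sequence[str], g: str, label: int) -> float:
    """Positive-prediction rate in group g among rows whose true label is `label`.

    label=1 gives the true positive rate, label=0 the false positive rate.
    """
    return _rate([p for t, p, s in zip(y_true, y_pred, group)
                  if s == g and t == label])


def fairness_report(y_true: Sequence[int], y_pred: Sequence[int],
                    group: Sequence[str], a: str, b: str) -> Dict[str, float]:
    """Compare groups a and b on three common fairness summaries (assumed definitions)."""
    sr_a, sr_b = selection_rate(y_pred, group, a), selection_rate(y_pred, group, b)
    tpr_gap = abs(conditional_rate(y_true, y_pred, group, a, 1)
                  - conditional_rate(y_true, y_pred, group, b, 1))
    fpr_gap = abs(conditional_rate(y_true, y_pred, group, a, 0)
                  - conditional_rate(y_true, y_pred, group, b, 0))
    hi, lo = max(sr_a, sr_b), min(sr_a, sr_b)
    return {
        # Demographic Parity: gap between the groups' selection rates (0 is ideal).
        "demographic_parity_diff": abs(sr_a - sr_b),
        # Disparate Impact Ratio: lower selection rate over higher one (1 is ideal).
        "disparate_impact_ratio": lo / hi if hi > 0 else float("nan"),
        # Equalized Odds: larger of the TPR gap and the FPR gap (0 is ideal).
        "equalized_odds_diff": max(tpr_gap, fpr_gap),
    }


# Hypothetical toy data: four applicants per gender, labels and predictions in {0, 1}.
group = ["F", "F", "F", "F", "M", "M", "M", "M"]
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]

report = fairness_report(y_true, y_pred, group, "F", "M")
print(report)
```

In a comparison like the one the abstract describes, such a report would be computed once for a model trained on the original data and once for a model trained on SMOTE- or ADASYN-balanced data, always on the same untouched test set, so that any change in the gaps can be attributed to the balancing step.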