Applications of ML Algorithms to Assess Credit Risk: Experiments With Missing Data

Herbert, G.C. (2023)

Machine learning algorithms are increasingly used by banks and financial institutions for credit risk prediction. One of the biggest problems when working with machine learning algorithms is the quality of the data they are given. This thesis therefore examines sparse, low-quality datasets and aims to find the best way to replace the missing data in them. The literature review establishes the relevance of the topic, as few, if any, studies address this particular question. To find the best replacement method, four approaches are compared: replacing missing values with the mean, the median, or the mode of the variable concerned, and replacing them with zero. The missing data in the German credit dataset are replaced using each method, and a Random Forest machine learning algorithm is trained on each resulting dataset. The methods are judged by comparing the feature importances of the resulting models and several accuracy metrics. In the experiment, replacing by zero shares first place with replacing by the mean of the available data in the accuracy comparison and takes sole first place in the feature importance test. Replacing by zero is therefore the preferred option for handling missing data in sparse, low-quality datasets when making consumer credit risk predictions with a machine learning algorithm.
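
The four replacement methods named in the abstract can be illustrated with a minimal sketch. This is not the thesis code: the function name `impute`, the toy column `durations`, and the use of `None` to mark missing values are all assumptions made for illustration, and the Random Forest training and evaluation step is left out.

```python
from statistics import mean, median, mode


def impute(column, strategy):
    """Return a copy of `column` with None entries replaced.

    `strategy` is one of "mean", "median", "mode", or "zero".
    The fill value is computed from the observed (non-missing) entries only.
    """
    observed = [v for v in column if v is not None]
    if strategy == "zero":
        fill = 0
    elif strategy == "mean":
        fill = mean(observed)
    elif strategy == "median":
        fill = median(observed)
    elif strategy == "mode":
        fill = mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in column]


# Hypothetical toy column with two missing entries (not taken from the
# German credit dataset).
durations = [12, None, 24, 12, None, 48]

# One imputed copy of the column per replacement method.
filled = {s: impute(durations, s) for s in ("mean", "median", "mode", "zero")}
```

In a comparison like the one described above, each imputed copy of the data would then be passed to the same model and scored with the same metrics, so that any difference in results can be attributed to the replacement method.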