University of Twente Student Theses


Geographical risk in the Dutch car insurance : a data-driven approach to measure regional effects on the claim frequency

Bont, Diederik de (2022) Geographical risk in the Dutch car insurance : a data-driven approach to measure regional effects on the claim frequency.

Abstract: It is important that car insurance companies can set an accurate premium. To do this, policyholders need to be classified into different risk levels based on their risk profile. For this, the frequency-severity method is most often used, in which the claim frequency and claim severity are modeled separately. To model the claim frequency, generalized linear models (GLMs) are the current industry standard, as they are easily interpretable and explainable. However, one of the major issues with using GLMs is the implementation of numerical risk factors, as these often cannot be inserted into the model directly. We therefore aim to improve the risk classification of the claim frequency model of the car insurance product by incorporating the geographical area (longitude and latitude of a policyholder’s postal code) into the model as a bivariate numerical risk factor, while maintaining the interpretability and explainability of the frequency model. To achieve this goal, we use a dataset containing records from over 1.7 million policyholders from one specific car insurance policy. To deal with missing values, we pre-process this dataset using the MissForest data imputation algorithm. Subsequently, we take two different approaches to incorporating the postal codes in the frequency model. First, we incorporate the spatial variable into a generalized additive model (GAM) using smooth functions. Second, we incorporate the spatial variable as a categorical risk factor into a GLM by clustering the postal codes using the Jenks natural breaks algorithm. This algorithm was chosen after an in-depth comparative analysis of several clustering methods. Using the AIC, the BIC and the Goodness of Variance Fit, 28 clusters were chosen for this specific dataset. We refer to the two models developed here as the GAM and the improved GLM, respectively.
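To make the clustering step concrete, the following is a minimal sketch of one-dimensional natural-breaks classification together with the Goodness of Variance Fit (GVF). It assumes that each postal code has already been reduced to a single numeric score (the abstract does not state which quantity is clustered), and it uses an exact dynamic-programming formulation rather than whatever implementation the thesis used. The function names `jenks_natural_breaks` and `goodness_of_variance_fit` are illustrative only.

```python
from typing import List, Sequence


def jenks_natural_breaks(values: Sequence[float], n_classes: int) -> List[List[float]]:
    """Split sorted values into n_classes contiguous groups that minimise the
    total within-group sum of squared deviations (the natural-breaks criterion)."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= n_classes <= n:
        raise ValueError("n_classes must be between 1 and the number of values")

    # Prefix sums give the squared-error cost of any slice xs[i:j] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def sse(i: int, j: int) -> float:
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # cost[k][j]: best total cost of splitting xs[:j] into k groups.
    cost = [[inf] * (n + 1) for _ in range(n_classes + 1)]
    cut = [[0] * (n + 1) for _ in range(n_classes + 1)]
    cost[0][0] = 0.0
    for k in range(1, n_classes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                candidate = cost[k - 1][i] + sse(i, j)
                if candidate < cost[k][j]:
                    cost[k][j] = candidate
                    cut[k][j] = i

    # Walk the stored cut points backwards to recover the groups.
    groups: List[List[float]] = []
    j = n
    for k in range(n_classes, 0, -1):
        i = cut[k][j]
        groups.append(xs[i:j])
        j = i
    groups.reverse()
    return groups


def goodness_of_variance_fit(groups: List[List[float]]) -> float:
    """GVF = 1 - (squared deviations from group means) / (squared deviations
    from the overall mean); values closer to 1 mean tighter groups."""
    everything = [x for g in groups for x in g]
    overall = sum(everything) / len(everything)
    sdam = sum((x - overall) ** 2 for x in everything)
    if sdam == 0.0:
        return 1.0  # all values identical: any grouping fits perfectly
    sdcm = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return 1.0 - sdcm / sdam
```

Calling `goodness_of_variance_fit(jenks_natural_breaks(scores, k))` for increasing `k` gives a curve that never decreases, which is one reason the GVF alone cannot fix the number of clusters and is combined with penalised criteria such as the AIC and BIC. The cubic-time loop is fine for illustration but would need a faster implementation for a full set of postal codes.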
Next, we compare these models to the current industry-standard GLM and assess their ability to accurately predict claim frequency using K-fold cross-validation. When comparing the two models that include the new spatial variable (the improved GLM and the GAM) to the industry-standard GLM, we observed that both produced more accurate predictions of the claim frequency: in the out-of-sample test, the mean squared error (MSE) and mean absolute error (MAE) decreased by 1.65% and 1.60%, respectively, for both models. These may seem relatively small improvements in predictive performance. However, on a portfolio of 1.7 million policies, they can have a substantial impact on the premiums. Between the two proposed approaches, no notable difference in predictive accuracy could be observed. However, even though the GAM allows for the flexible modeling of numerical bivariates without imposing any nonlinear assumptions, the improved GLM has a practical advantage over the GAM, since it is easier to interpret and explain to stakeholders as it is formulated within the well-known framework of generalized linear models. The approach developed here has two shortcomings. First, the AIC, the BIC and the Goodness of Variance Fit did not point to a clear-cut optimal number of clusters. Second, feature selection was not performed, even though analysis of the parameters of the improved GLM showed that, after adding the spatial variable, not all of the other parameters remain statistically significant. Additionally, the proposed approach significantly increased the computing time for the model. Despite this, our approach has led to many new insights into the origin of the effect of the spatial variable on claim frequency, thanks to the visualizations produced in this research.
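The comparison described above can be sketched as a K-fold loop that scores two frequency models on held-out policies. This is a simplified illustration, not the thesis setup: it assumes unit exposure per record and replaces the GLMs with group means, which is what a Poisson GLM with a log link reduces to when it contains only an intercept (baseline) or only one categorical factor (region cluster). All names, including `cross_validate` and `relative_decrease`, are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, Dict, List, Tuple

# (region cluster id, observed claim count); unit exposure is assumed.
Record = Tuple[int, int]
Predictor = Callable[[int], float]


def k_fold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the record indices once and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[fold::k] for fold in range(k)]


def fit_portfolio_mean(train: List[Record]) -> Predictor:
    """Baseline: one claim frequency for the whole portfolio."""
    overall = mean(claims for _, claims in train)
    return lambda cluster: overall


def fit_cluster_means(train: List[Record]) -> Predictor:
    """Spatial model: one claim frequency per region cluster."""
    overall = mean(claims for _, claims in train)
    by_cluster: Dict[int, List[int]] = {}
    for cluster, claims in train:
        by_cluster.setdefault(cluster, []).append(claims)
    means = {cluster: mean(v) for cluster, v in by_cluster.items()}
    # Fall back to the portfolio mean for clusters unseen in training.
    return lambda cluster: means.get(cluster, overall)


def cross_validate(records: List[Record],
                   fit: Callable[[List[Record]], Predictor],
                   k: int = 5, seed: int = 0) -> Tuple[float, float]:
    """Return out-of-sample (MSE, MAE) pooled over all k held-out folds."""
    squared, absolute = [], []
    for held_out in k_fold_indices(len(records), k, seed):
        held = set(held_out)
        train = [r for i, r in enumerate(records) if i not in held]
        predict = fit(train)
        for i in held_out:
            cluster, claims = records[i]
            error = predict(cluster) - claims
            squared.append(error * error)
            absolute.append(abs(error))
    return mean(squared), mean(absolute)


def relative_decrease(baseline: float, candidate: float) -> float:
    """Percentage by which the candidate error is below the baseline error."""
    return 100.0 * (baseline - candidate) / baseline
```

Passing the same `seed` to both `cross_validate` calls keeps the folds identical, so the percentage returned by `relative_decrease` reflects the models rather than the split. In a real comparison the two `fit_*` functions would be replaced by the fitted industry-standard GLM, the improved GLM and the GAM.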
Additionally, with the clustering method developed here, the spatial effect is distributed more gradually throughout the Netherlands, which leads to fairer premiums. In conclusion, we show that both of the models developed here allow a numerical spatial variable to be implemented in the frequency model, and that both result in an improved risk classification. We recommend applying the improved GLM including the spatial variable and using the produced dashboards to review shortcomings of the current industry-standard GLM. This approach can be extended to other numerical variables or to interactions between covariates, and also to claim severity models, in order to eventually determine the effect on premiums. New experiments could be conducted for different scenarios to analyze specific effects. Looking into the future, we recommend looking beyond the regression models presented here and focusing on more flexible machine learning methods, since these methods might improve the predictive accuracy of the pricing models. Knowing that these machine learning methods also have their own limitations, we propose to focus future research on a hybrid system in which actuaries obtain part of the data from machine learning models as input for the statistical models. This seems to be the most efficient way to combine the quality benefits of a machine learning model with the industry-standard regression models.
Item Type:Essay (Master)
Faculty:BMS: Behavioural, Management and Social Sciences
Programme:Industrial Engineering and Management MSc (60029)