Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)

Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)

Ewen Gallic, Aix-Marseille Université (🐦 @3wen)

1 Duration Models

1.1 Composition of the portfolio

Analyse the composition of the portfolio: covariates, frequency by covariates.

First, load the training and testing sets, located in the following folder: data/canada_panel/ (CanadaPanelTrain.csv and CanadaPanelTest.csv, respectively).
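A minimal loading sketch, assuming the files are comma-separated with a header row (the exercise does not state the layout). The helper read_panel() and the object names canada_train and canada_test are hypothetical; only the folder and file names come from the text above.

```r
# Folder and file names come from the exercise text; the comma separator
# and the header row are assumptions about the file layout.
data_dir <- "data/canada_panel"

# Hypothetical helper: returns NULL (with a warning) if the file is missing.
read_panel <- function(file_name, dir = data_dir) {
  path <- file.path(dir, file_name)
  if (!file.exists(path)) {
    warning("File not found: ", path)
    return(NULL)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

canada_train <- suppressWarnings(read_panel("CanadaPanelTrain.csv"))
canada_test  <- suppressWarnings(read_panel("CanadaPanelTest.csv"))
```

If the files use another separator, read.csv() accepts a sep argument.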

Define a function that computes summary statistics for a vector of numerics:

  • Average
  • Standard deviation
  • Min and Max
  • Median
  • Other percentiles (e.g., 10th, 25th, 75th, and 90th)

Try to account for possible NA values. Name the function my_summary.
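One possible shape for my_summary(), given as a sketch rather than the expected solution. The n_missing entry and the probs argument are additions made for illustration; missing values are dropped with na.rm = TRUE.

```r
# Summary statistics for a numeric vector, ignoring NA values.
# `probs` (extra percentiles) and `n_missing` are illustrative additions.
my_summary <- function(x, probs = c(0.10, 0.25, 0.75, 0.90)) {
  stopifnot(is.numeric(x))
  q <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  names(q) <- paste0("p", probs * 100)
  c(
    n_missing = sum(is.na(x)),
    mean      = mean(x, na.rm = TRUE),
    sd        = sd(x, na.rm = TRUE),
    min       = min(x, na.rm = TRUE),
    max       = max(x, na.rm = TRUE),
    median    = median(x, na.rm = TRUE),
    q
  )
}

# Toy vector (made-up values) with one missing observation
my_summary(c(1, 2, NA, 4, 10))
```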

Use the function on a single variable from the training sample:

Apply this function to multiple variables of your choice to explore the dataset and put the results in a table. You may either use a for loop, lapply() or map() (pick the option you are most comfortable with). Do not forget to add the variable name so that each row of the table can be identified.
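The lapply() route could look like the sketch below. To keep the example runnable on its own, it uses a compact stand-in (my_summary_short(), hypothetical) in place of my_summary(), and a toy data frame with made-up values; only the column names come from this document.

```r
# Compact stand-in for my_summary() so that the example runs on its own.
my_summary_short <- function(x) {
  c(mean = mean(x, na.rm = TRUE),
    sd = sd(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE))
}

# Toy data (made-up values); column names are taken from the exercise text.
toy <- data.frame(
  RA_EXPOSURE_TIME   = c(0.5, 1, 1, NA, 0.25),
  RA_DISTANCE_DRIVEN = c(5200, 12000, 8000, 15000, NA)
)

vars <- c("RA_EXPOSURE_TIME", "RA_DISTANCE_DRIVEN")

# One-row data frame per variable, with the variable name in the first column
res_list <- lapply(vars, function(v) {
  data.frame(variable = v, t(my_summary_short(toy[[v]])))
})

# Stack the rows into a single table
summary_table <- do.call(rbind, res_list)
summary_table
```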

Plot the distribution of claims on a barplot. You may display the distribution of gender within each claim level (by showing the proportion of each gender within the bars). The result may look like the figure displayed below:
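A possible starting point with {ggplot2} (assumed to be installed), using simulated toy data rather than the Canadian panel. The column name nb_claims is hypothetical, since the name of the claim count variable is not given here. Stacked bars show the share of each gender within each claim level.

```r
library(ggplot2)

# Toy data (simulated); `nb_claims` is a hypothetical column name.
set.seed(1)
toy <- data.frame(
  nb_claims = rpois(200, lambda = 0.3),
  gender    = sample(c("Male", "Other"), 200, replace = TRUE)
)

# One bar per claim level, split by gender inside each bar
p_claims <- ggplot(toy, aes(x = factor(nb_claims), fill = gender)) +
  geom_bar() +
  labs(x = "Number of claims", y = "Number of insureds", fill = "Gender")

p_claims
```

Using geom_bar(position = "fill") instead would rescale every bar to 1, which shows the proportions directly but hides the counts.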

Plot a boxplot of exposure time depending on marital status and vehicle use. Use faceting to separate the data according to vehicle use. The resulting plot should look like the one below:
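A sketch of the faceted boxplot with {ggplot2} (assumed to be installed), on simulated toy data. The column names come from this document; the values and the factor labels are made up for illustration.

```r
library(ggplot2)

# Toy data (simulated values); column names are taken from the exercise text.
set.seed(2)
toy <- data.frame(
  RA_EXPOSURE_TIME = runif(150, min = 0.1, max = 1),
  RA_MARITALSTATUS = factor(sample(c("Single", "Other"), 150, replace = TRUE)),
  RA_VEH_USE = factor(sample(c("Other", "Commute", "Pleasure"), 150,
                             replace = TRUE))
)

# One panel per vehicle use, one box per marital status
p_expo <- ggplot(toy, aes(x = RA_MARITALSTATUS, y = RA_EXPOSURE_TIME)) +
  geom_boxplot() +
  facet_wrap(~ RA_VEH_USE) +
  labs(x = "Marital status", y = "Exposure time")

p_expo
```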

1.2 GLM Poisson

Using a Poisson GLM and several covariates (of your choice), compute the impact of some telematics information on the frequency part of the premium. To do so, start by loading the training and testing sets located in the following folder: data/canada_panel/ (CanadaPanelTrain.csv and CanadaPanelTest.csv, respectively).

First, some variables need to be recoded in both datasets.

The gender variable currently offers three levels: female (2), male (1) and unknown (3). Create a new variable (gender) which takes the following values:

  • 1 if male
  • 2 otherwise.

Define the second category (2, i.e., Other) as the reference.

Change the marital status variable (RA_MARITALSTATUS) to a factor variable which takes two values:

  • Single if the insured is single (i.e., when RA_MARITALSTATUS equals 0)
  • Other otherwise.

Vehicle use (RA_VEH_USE) is currently labelled as follows:

  • 0: other
  • 1: commute
  • 2: pleasure.

Turn that variable into a factor and add the corresponding labels. Define Other as the reference.
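The three recodings can be grouped in one helper so that the same steps are applied to both datasets. This is a sketch on toy data: the raw gender column name (gender_raw) and the helper name recode_panel() are hypothetical, and the level order chosen for marital status is an assumption, since no reference is stated for it.

```r
# Toy data with raw codes (made-up rows); `gender_raw` is a hypothetical name.
toy <- data.frame(
  gender_raw       = c(1, 2, 3, 1, 2),
  RA_MARITALSTATUS = c(0, 1, 0, 2, 1),
  RA_VEH_USE       = c(0, 1, 2, 1, 0)
)

recode_panel <- function(df) {
  # 1 if male, 2 otherwise; listing 2 first makes it the reference level
  df$gender <- factor(ifelse(df$gender_raw == 1, 1, 2), levels = c(2, 1))
  # Single when the code is 0, Other otherwise (level order is an assumption)
  df$RA_MARITALSTATUS <- factor(
    ifelse(df$RA_MARITALSTATUS == 0, "Single", "Other"),
    levels = c("Other", "Single")
  )
  # The first level of a factor is the reference in R, hence Other first
  df$RA_VEH_USE <- factor(df$RA_VEH_USE, levels = c(0, 1, 2),
                          labels = c("Other", "Commute", "Pleasure"))
  df
}

toy_recoded <- recode_panel(toy)
str(toy_recoded)
```

An existing factor can also be re-referenced with relevel(x, ref = "Other").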

Using the glm() function, fit a Poisson regression model on claims, without any offset. Use the following covariates: gender, RA_MARITALSTATUS, and RA_VEH_USE. Store the estimation results in an object named mod_1.
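A sketch of the call on simulated toy data. The response column name nb_claims is hypothetical (the name of the claim count variable is not given here), and the simulated counts do not depend on the covariates, so the estimates themselves carry no meaning.

```r
# Toy data (simulated); `nb_claims` is a hypothetical name for the response.
set.seed(123)
n <- 500
toy <- data.frame(
  gender = factor(sample(c(2, 1), n, replace = TRUE), levels = c(2, 1)),
  RA_MARITALSTATUS = factor(sample(c("Other", "Single"), n, replace = TRUE),
                            levels = c("Other", "Single")),
  RA_VEH_USE = factor(sample(c("Other", "Commute", "Pleasure"), n,
                             replace = TRUE),
                      levels = c("Other", "Commute", "Pleasure"))
)
toy$nb_claims <- rpois(n, lambda = 0.2)

# Poisson regression with the canonical log link and no offset
mod_1 <- glm(
  nb_claims ~ gender + RA_MARITALSTATUS + RA_VEH_USE,
  family = poisson(link = "log"),
  data = toy
)

summary(mod_1)$coefficients
```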

Fit another model using exposure time (RA_EXPOSURE_TIME) as an offset. Store the estimation results in an object named mod_2.

Fit another model using distance driven (RA_DISTANCE_DRIVEN) as an offset. Store the estimation results in an object named mod_3.

Fit another model using the number of trips (RA_NBTRIP) as an offset. Store the estimation results in an object named mod_4.

Fit another model using hours driven (RA_HOURS_DRIVEN) as an offset. Store the estimation results in an object named mod_5.
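With a log link, an exposure measure usually enters the linear predictor as offset(log(exposure)), so that the expected count is proportional to exposure. The sketch below assumes that convention (the exercise does not say whether the offset should be logged) and uses simulated toy data with a single covariate to stay short. The helper fit_with_offset() and the column nb_claims are hypothetical.

```r
# Toy data (simulated); `nb_claims` is a hypothetical name for the response.
set.seed(2024)
n <- 500
toy <- data.frame(
  gender = factor(sample(c(2, 1), n, replace = TRUE), levels = c(2, 1)),
  RA_EXPOSURE_TIME   = runif(n, min = 0.1, max = 1),
  RA_DISTANCE_DRIVEN = runif(n, min = 1000, max = 20000)
)
toy$nb_claims <- rpois(n, lambda = 0.2 * toy$RA_EXPOSURE_TIME)

# Hypothetical helper: same model, with the offset variable passed by name.
# Assumption: the offset is log(exposure), matching the log link.
fit_with_offset <- function(offset_var, data) {
  rhs <- paste0("gender + offset(log(", offset_var, "))")
  glm(as.formula(paste("nb_claims ~", rhs)),
      family = poisson(link = "log"), data = data)
}

mod_2 <- fit_with_offset("RA_EXPOSURE_TIME", toy)
mod_3 <- fit_with_offset("RA_DISTANCE_DRIVEN", toy)

# Quick base R comparison of the fits
sapply(list(mod_2 = mod_2, mod_3 = mod_3), AIC)
```

The same helper would cover RA_NBTRIP and RA_HOURS_DRIVEN once those columns are present in the data.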

With the help of the function mtable() from {memisc}, create a table showing the results of the 5 estimations. Include in the table the number of observations as well as the Akaike Information Criterion (AIC) values.

1.3 Duration Models and Modified Count Distributions

Let us assume an exponential “time” between claims. Use the neg_logl_Expo_Duration() function defined below to fit several Gamma Duration models using different measures of exposure (exposure time, distance driven, number of trips, hours driven).

Using the model.matrix() function, create the model matrix containing the dummy variables for the insured gender (gender), marital status (RA_MARITALSTATUS) and vehicle use (RA_VEH_USE).

Then extract the variable of interest, i.e., the number of claims from the dataset and store it in a vector named nb_sin:
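On a toy data frame (made-up rows), these two steps might look as follows. The object name X and the claim count column nb_claims are hypothetical; nb_sin is the name requested above.

```r
# Toy data (made-up rows); `nb_claims` is a hypothetical column name.
toy <- data.frame(
  nb_claims = c(0, 1, 0, 2),
  gender = factor(c(2, 1, 1, 2), levels = c(2, 1)),
  RA_MARITALSTATUS = factor(c("Single", "Other", "Other", "Single"),
                            levels = c("Other", "Single")),
  RA_VEH_USE = factor(c("Other", "Commute", "Pleasure", "Commute"),
                      levels = c("Other", "Commute", "Pleasure"))
)

# Intercept plus one dummy per non-reference level
X <- model.matrix(~ gender + RA_MARITALSTATUS + RA_VEH_USE, data = toy)
colnames(X)

# Response vector
nb_sin <- toy$nb_claims
```

The first level of each factor is absorbed in the intercept, which is why the reference levels set during recoding matter here.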

Complete the body of the following fit_gamma() function so that:

  1. it extracts from the dataset the exposure measure (you can store the values in an object named var_expo)
  2. it estimates the coefficients of the model thanks to the function optim() which is fed with some initial values (do not forget to set the parameter hessian to TRUE)
  3. it then extracts the coefficients from the estimation results
  4. it computes the standard errors of the estimates
  5. it computes the confidence intervals
  6. it creates a tibble with the estimates, their 95% confidence interval, and their standard error
  7. it computes the AIC.

The function should return a list with three elements:

  • fit: the model fit
  • coefs: the tibble with the estimates and their 95% confidence interval
  • aic: the AIC
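The optimisation and inference steps listed above can be sketched independently of the Gamma Duration likelihood. The example below is not neg_logl_Expo_Duration(): it uses a stand-in negative log-likelihood (neg_logl_standin(), hypothetical) for Poisson counts, which is the count distribution implied by exponential waiting times, and therefore has no shape parameter. It also returns a data.frame instead of a tibble to avoid extra dependencies. Data are simulated.

```r
# Stand-in negative log-likelihood (hypothetical, NOT neg_logl_Expo_Duration()):
# Poisson counts with mean exposure * exp(X %*% beta).
neg_logl_standin <- function(par, X, y, expo) {
  mu <- expo * exp(as.vector(X %*% par))
  -sum(dpois(y, lambda = mu, log = TRUE))
}

fit_sketch <- function(X, y, expo, level = 0.95) {
  init <- rep(0, ncol(X))                      # initial values
  fit <- optim(par = init, fn = neg_logl_standin,
               X = X, y = y, expo = expo,
               method = "BFGS", hessian = TRUE)
  est <- fit$par                               # estimated coefficients
  # The Hessian of the negative log-likelihood at the optimum is the
  # observed information; its inverse approximates the covariance matrix.
  se <- sqrt(diag(solve(fit$hessian)))
  z <- qnorm(1 - (1 - level) / 2)              # about 1.96 for 95%
  coefs <- data.frame(
    term = colnames(X), estimate = est, std_error = se,
    ci_lower = est - z * se, ci_upper = est + z * se
  )
  aic <- 2 * fit$value + 2 * length(est)       # fit$value = -log-likelihood
  list(fit = fit, coefs = coefs, aic = aic)
}

# Simulated toy data
set.seed(42)
n <- 1000
X <- cbind("(Intercept)" = 1, x1 = rbinom(n, 1, 0.5))
expo <- runif(n, min = 0.2, max = 1)
y <- rpois(n, lambda = expo * exp(-1 + 0.5 * X[, "x1"]))

res <- fit_sketch(X, y, expo)
res$coefs
res$aic
```

For the actual exercise, the stand-in would be replaced by neg_logl_Expo_Duration(), with the arguments and initial values that function expects.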

Use that function to estimate a Gamma Duration model for the number of claims, using exposure time in calendar year (RA_EXPOSURE_TIME) as the exposure measure:

Use that function to estimate a Gamma Duration model for the number of claims, using the distance driven (RA_DISTANCE_DRIVEN) as the exposure measure:

Use that function to estimate a Gamma Duration model for the number of claims, using the number of trips (RA_NBTRIP) as the exposure measure:

Use that function to estimate a Gamma Duration model for the number of claims, using the hours driven (RA_HOURS_DRIVEN) as the exposure measure:

Now we should compare the performances of the different models with out-of-sample predictions.

Define the function sq_error() which computes the squared error given some observations and predictions.
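One possible definition, assuming the score is the sum of squared differences over all observations (the exercise does not say whether a sum, a mean, or a vector of individual errors is expected):

```r
# Sum of squared differences between observed and predicted values.
# Assumption: the score is aggregated with a sum.
sq_error <- function(obs, pred) {
  stopifnot(length(obs) == length(pred))
  sum((obs - pred)^2)
}

# Toy values: 0.25 + 0.25 + 1 = 1.5
sq_error(obs = c(0, 1, 2), pred = c(0.5, 0.5, 1))
```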

Define a function named predict_gamma() which performs out-of-sample predictions for a Gamma Duration model. The function may be constructed as follows:

  1. define the exposure variable
  2. extract the coefficients (beta and alpha) from the fit and compute \(\lambda\)
  3. use the estimates to predict the values from the testing set
  4. compute the probability for each insured
  5. compute the scores (logarithmic score and squared error).

This function may return a list of three elements:

  1. pred: the predicted values
  2. proba: the predicted probabilities
  3. scores: a table with the out-of-sample scores
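The structure of such a function can be illustrated with a Poisson stand-in (the count model implied by exponential waiting times), which has no \(\alpha\) parameter. This sketch is not predict_gamma(): the function name predict_standin(), its arguments, and the toy values are hypothetical. Two further assumptions are made: the "probability" is taken to be the predicted probability of the observed count, and the logarithmic score is the sum of the log probabilities (higher is better).

```r
# Hypothetical stand-in for predict_gamma(), based on a Poisson model.
predict_standin <- function(beta, X_test, y_test, expo_test) {
  lambda <- exp(as.vector(X_test %*% beta))   # rate per unit of exposure
  pred <- expo_test * lambda                  # expected number of claims
  proba <- dpois(y_test, lambda = pred)       # P(N = observed count)
  scores <- data.frame(
    log_score = sum(log(proba)),              # assumption: sum of log probs
    sq_error  = sum((y_test - pred)^2)
  )
  list(pred = pred, proba = proba, scores = scores)
}

# Toy inputs (made-up values)
beta <- c(-1, 0.5)
X_test <- cbind("(Intercept)" = 1, x1 = c(0, 1))
y_test <- c(0, 1)
expo_test <- c(1, 0.5)

out <- predict_standin(beta, X_test, y_test, expo_test)
out$scores
```

For the Gamma Duration model, the probability step would rely on the fitted \(\alpha\) and \(\lambda\) rather than on dpois().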

Now you can use the function predict_gamma() on the four Gamma Duration models previously estimated.

Compute the out-of-sample predictions for the model which uses exposure time (RA_EXPOSURE_TIME) as the exposure measure (fit_gamma_exposure).

Compute the out-of-sample predictions for the model which uses distance driven (RA_DISTANCE_DRIVEN) as the exposure measure (fit_gamma_distance).

Compute the out-of-sample predictions for the model which uses the number of trips (RA_NBTRIP) as the exposure measure (fit_gamma_nbtrips).

Compute the out-of-sample predictions for the model which uses hours driven (RA_HOURS_DRIVEN) as the exposure measure (fit_gamma_hours).

Lastly, bind the rows of the tables containing the scores of the four models to compare the performances of your models with out-of-sample data.