Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)
Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)
Ewen Gallic, Aix-Marseille Université (🐦 @3wen)
Analyse the composition of the portfolio: covariates, frequency by covariates.
First, load the training and testing sets, located in the following folder: data/canada_panel/ (CanadaPanelTrain.csv and CanadaPanelTest.csv, respectively).
Define a function that computes summary statistics for a numeric vector. Try to account for possible NA values. Name the function my_summary.
#' my_summary
#' Returns a tibble with summary statistics for a numerical vector
#' @param x vector of numerics
my_summary <- function(x) {
}
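One possible shape for such a function is sketched below. The choice of statistics is an assumption (the exercise does not list them here), the name my_summary_sketch is hypothetical, and the code relies on the {tibble} package.

```r
library(tibble)

# Hypothetical solution sketch: the set of statistics returned is an assumption.
my_summary_sketch <- function(x) {
  n_na <- sum(is.na(x))   # count missing values before dropping them
  x <- x[!is.na(x)]
  tibble(
    n = length(x), n_na = n_na,
    mean = mean(x), sd = sd(x),
    min = min(x), median = median(x), max = max(x)
  )
}

my_summary_sketch(c(1, 2, NA, 4))
```

Dropping the NA values once at the top avoids repeating na.rm = TRUE in every call.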
Use the function on a single variable from the training sample:
Apply this function to multiple variables of your choice to explore the dataset and put the results in a table. You may either use a for loop, lapply() or map() (pick the option you are more comfortable with). Do not forget to add the variable name to be able to identify the variables.
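The lapply() route could look like the sketch below. It runs on a small simulated data frame rather than the real training set, and mini_summary() is a hypothetical, compact stand-in for my_summary().

```r
# Toy data standing in for the training set (simulated values).
set.seed(1)
toy <- data.frame(
  RA_EXPOSURE_TIME = runif(20),
  RA_DISTANCE_DRIVEN = c(NA, rexp(19, rate = 1 / 5000))
)

# Hypothetical stand-in for my_summary(); returns a one-row data frame.
mini_summary <- function(x) {
  data.frame(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE),
             n_na = sum(is.na(x)))
}

vars <- c("RA_EXPOSURE_TIME", "RA_DISTANCE_DRIVEN")
# One summary row per variable, with the variable name added as first column
res <- lapply(vars, function(v) cbind(variable = v, mini_summary(toy[[v]])))
summary_table <- do.call(rbind, res)
summary_table
```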
Plot the distribution of claims on a barplot. You may display the distribution of gender within each claim level (by showing the proportion of each gender within the bars). The result may look like the figure displayed below:
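One way to build such a plot with {ggplot2} is sketched here on simulated data. The gender labels and the axis titles are assumptions; RA_ACCIDENT_IND is the claim count used later in this document.

```r
library(ggplot2)

# Simulated stand-in for the training set
set.seed(1)
toy <- data.frame(
  RA_ACCIDENT_IND = rpois(200, lambda = 0.3),
  gender = factor(sample(c("Male", "Other"), 200, replace = TRUE))
)

# Stacked bars: bar height is the number of contracts per claim level,
# and the fill shows the share of each gender within the bar.
p_claims <- ggplot(toy, aes(x = factor(RA_ACCIDENT_IND), fill = gender)) +
  geom_bar() +
  labs(x = "Number of claims", y = "Number of contracts", fill = "Gender")
p_claims
```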
Plot a boxplot of exposure time depending on marital status and vehicle use. Use faceting to separate the data according to vehicle use. The resulting plot should look like the one below:
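A faceted boxplot of this kind might be drawn as follows. This is a sketch on simulated data, with factor labels chosen to match the recoding described later in this document.

```r
library(ggplot2)

# Simulated stand-in for the training set
set.seed(2)
toy <- data.frame(
  RA_EXPOSURE_TIME = runif(300),
  RA_MARITALSTATUS = factor(sample(c("Single", "Other"), 300, replace = TRUE)),
  RA_VEH_USE = factor(sample(c("Other", "Commute", "Pleasure"), 300, replace = TRUE))
)

# One panel per vehicle use, one box per marital status
p_expo <- ggplot(toy, aes(x = RA_MARITALSTATUS, y = RA_EXPOSURE_TIME)) +
  geom_boxplot() +
  facet_wrap(~ RA_VEH_USE) +
  labs(x = "Marital status", y = "Exposure time")
p_expo
```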
Using a Poisson GLM and several covariates (of your choice), compute the impact of some telematics information on the frequency part of the premium. To do so, start by loading the training and testing sets located in the following folder: data/canada_panel/ (CanadaPanelTrain.csv and CanadaPanelTest.csv, respectively).
First, some variables need to be recoded in both datasets.
The gender variable currently has three levels: female (2), male (1) and unknown (3). Create a new variable (gender) which takes the following values:
1 if male
2 otherwise.
Define Other as the reference.
Change the marital status variable (RA_MARITALSTATUS) to a factor variable which takes two values:
Single if the insured is single (i.e., when RA_MARITALSTATUS equals 0)
Other otherwise.
Vehicle use (RA_VEH_USE) is currently labelled as follows:
0: other
1: commute
2: pleasure.
Turn that variable into a factor and add the corresponding labels. Define Other as the reference.
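The recoding steps above can be sketched with base R's factor(). The raw gender column name (gender_raw) is hypothetical, as are the "Male"/"Other" labels for the new gender variable. The text does not state a reference level for marital status, so the level order used for it here is an assumption.

```r
# Toy data: gender_raw is a hypothetical name for the raw gender column.
toy <- data.frame(
  gender_raw = c(1, 2, 3, 1),
  RA_MARITALSTATUS = c(0, 1, 2, 0),
  RA_VEH_USE = c(0, 1, 2, 1)
)

# 1 if male, 2 otherwise; the first level listed is the reference in glm()
toy$gender <- factor(ifelse(toy$gender_raw == 1, 1, 2),
                     levels = c(2, 1), labels = c("Other", "Male"))

# Single when the raw value is 0, Other otherwise
toy$RA_MARITALSTATUS <- factor(
  ifelse(toy$RA_MARITALSTATUS == 0, "Single", "Other"),
  levels = c("Single", "Other")
)

# Label the three vehicle-use codes; "Other" comes first, hence the reference
toy$RA_VEH_USE <- factor(toy$RA_VEH_USE, levels = 0:2,
                         labels = c("Other", "Commute", "Pleasure"))
str(toy)
```

The same code would need to be run on both the training and the testing sets so that the factor levels agree.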
Using the glm() function, fit a Poisson regression model on claims, without any offset. Use the following covariates: gender, RA_MARITALSTATUS, and RA_VEH_USE. Store the estimation results in an object named mod_1.
Fit another model using exposure time (RA_EXPOSURE_TIME) as an offset. Store the estimation results in an object named mod_2.
Fit another model using distance driven (RA_DISTANCE_DRIVEN) as an offset. Store the estimation results in an object named mod_3.
Fit another model using the number of trips (RA_NBTRIP) as an offset. Store the estimation results in an object named mod_4.
Fit another model using hours driven (RA_HOURS_DRIVEN) as an offset. Store the estimation results in an object named mod_5.
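As an illustration of how an offset enters a Poisson GLM, here is a sketch on simulated data. With the log link, the usual convention is to pass the logarithm of the exposure through offset(). Whether the exercise intends the log is an assumption, and the object names below are hypothetical.

```r
# Simulated portfolio: claim counts proportional to exposure time
set.seed(123)
n <- 500
toy <- data.frame(
  gender = factor(sample(c("Other", "Male"), n, replace = TRUE),
                  levels = c("Other", "Male")),
  RA_EXPOSURE_TIME = runif(n, 0.1, 1)
)
toy$RA_ACCIDENT_IND <- rpois(
  n, toy$RA_EXPOSURE_TIME * exp(-2 + 0.3 * (toy$gender == "Male"))
)

# Without offset: every contract is treated as having the same exposure
fit_no_offset <- glm(RA_ACCIDENT_IND ~ gender,
                     family = poisson(link = "log"), data = toy)

# With offset: log(exposure) is added to the linear predictor with a
# coefficient fixed at 1 (assumption: log scale, as is usual with a log link)
fit_offset <- glm(RA_ACCIDENT_IND ~ gender + offset(log(RA_EXPOSURE_TIME)),
                  family = poisson(link = "log"), data = toy)

AIC(fit_no_offset, fit_offset)
```

Swapping RA_EXPOSURE_TIME for another exposure measure only changes the term inside offset().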
With the help of the function mtable() from {memisc}, create a table showing the results of the 5 estimations. Include in the table the number of observations as well as Akaike Information Criterion (AIC) values.
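A minimal sketch of such a table, for two models fitted on simulated data, is given below. It assumes the {memisc} package is installed and that "N" and "AIC" are among the summary statistics it reports for glm objects. The model and column names are hypothetical.

```r
library(memisc)

# Simulated data and two small Poisson models
set.seed(1)
toy <- data.frame(x = rnorm(200), expo = runif(200, 0.2, 1))
toy$y <- rpois(200, toy$expo * exp(-1 + 0.2 * toy$x))

m_a <- glm(y ~ x, family = poisson, data = toy)
m_b <- glm(y ~ x + offset(log(expo)), family = poisson, data = toy)

# One column per model; keep only the number of observations and the AIC
tab <- mtable("mod_a" = m_a, "mod_b" = m_b, summary.stats = c("N", "AIC"))
tab
```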
Let us assume an exponential “time” between claims. Use the neg_logl_Gamma_Duration() function defined below to fit different Gamma Duration models using different measures of exposure (exposure time, distance driven, number of trips, hours driven).
#' neg_logl_Gamma_Duration
#' @param parms vector of coefficients values
#' @param X matrix of predictors (including the constant as the first column)
#' @param y variable of interest
#' @param expo exposure values
neg_logl_Gamma_Duration <- function(parms, X, y, expo) {
}
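For orientation, a negative log-likelihood for a gamma count model could be written as in the sketch below. It is not necessarily the parameterization intended here. It assumes gamma-distributed waiting times with shape alpha and rate alpha * lambda, with lambda = exp(X beta), and it assumes that the last element of parms is log(alpha). The function name neg_logl_gcd_sketch is hypothetical.

```r
# Hypothetical sketch of a gamma count negative log-likelihood.
# Assumption: parms = c(beta, log(alpha)), with ncol(X) regression coefficients.
neg_logl_gcd_sketch <- function(parms, X, y, expo) {
  k <- ncol(X)
  lambda <- as.vector(exp(X %*% parms[1:k]))  # claim rate per unit of exposure
  alpha <- exp(parms[k + 1])                  # shape parameter, kept positive
  z <- alpha * lambda * expo
  # P(N = n) = G(alpha * n, z) - G(alpha * (n + 1), z), with G(0, z) = 1
  upper <- ifelse(y == 0, 1, pgamma(z, shape = alpha * pmax(y, 1)))
  lower <- pgamma(z, shape = alpha * (y + 1))
  -sum(log(pmax(upper - lower, .Machine$double.xmin)))
}

# With alpha = 1 the waiting times are exponential and the count is Poisson
set.seed(42)
X <- cbind(1, rnorm(50))
expo <- runif(50, 0.5, 1)
y <- rpois(50, expo * exp(-1 + 0.2 * X[, 2]))
nll_gcd <- neg_logl_gcd_sketch(c(-1, 0.2, 0), X, y, expo)
nll_pois <- -sum(dpois(y, expo * as.vector(exp(X %*% c(-1, 0.2))), log = TRUE))
c(gamma_count = nll_gcd, poisson = nll_pois)
```

The comparison with dpois() is a quick sanity check: the two values should agree when log(alpha) is set to 0.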
Using the model.matrix() function, create the model matrix containing the dummy variables for the insured gender (gender), marital status (RA_MARITALSTATUS) and vehicle use (RA_VEH_USE).
Then extract the variable of interest, i.e., the number of claims from the dataset and store it in a vector named nb_sin:
Complete the body of the following fit_gamma() function so that:
[...] var_expo)
[...] optim() which is fed with some initial values (do not forget to set the parameter hessian to TRUE)
The function should return a list with three elements:
fit: the model fit
coefs: the tibble with the estimates and their 95% confidence interval
aic: the AIC
#' fit_gamma
#' @param expo_name name of the variable to use as exposure
#' @param init starting values for the optimization algorithm
fit_gamma <- function(expo_name, init) {
  # Exposure
  # Coefficients
  # Standard errors
  # Confidence interval
  # Score
  list()
}
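The estimation pattern that this skeleton hints at (minimize with optim(), invert the Hessian for standard errors, build confidence intervals, compute the AIC) is sketched below. To keep it short and checkable, it is illustrated on a Poisson negative log-likelihood rather than the Gamma Duration one. The names fit_sketch and neg_logl_pois are hypothetical, and a data frame is returned where the exercise asks for a tibble.

```r
# Illustration likelihood (Poisson, not Gamma Duration), exposure as multiplier
neg_logl_pois <- function(parms, X, y, expo) {
  mu <- as.vector(expo * exp(X %*% parms))
  -sum(dpois(y, mu, log = TRUE))
}

# Hypothetical helper: generic maximum likelihood fit via optim()
fit_sketch <- function(neg_logl, init, X, y, expo) {
  fit <- optim(par = init, fn = neg_logl, X = X, y = y, expo = expo,
               method = "BFGS", hessian = TRUE)
  # Standard errors from the inverse Hessian of the negative log-likelihood
  se <- sqrt(diag(solve(fit$hessian)))
  # 95% Wald confidence intervals
  coefs <- data.frame(
    estimate = fit$par,
    lower = fit$par - qnorm(0.975) * se,
    upper = fit$par + qnorm(0.975) * se
  )
  # AIC = 2 * (negative log-likelihood) + 2 * (number of parameters)
  aic <- 2 * fit$value + 2 * length(fit$par)
  list(fit = fit, coefs = coefs, aic = aic)
}

set.seed(7)
X <- cbind(1, rnorm(300))
expo <- runif(300, 0.5, 1)
y <- rpois(300, expo * exp(-1 + 0.3 * X[, 2]))
res <- fit_sketch(neg_logl_pois, init = c(0, 0), X = X, y = y, expo = expo)
res$coefs
res$aic
```

Because the likelihood is passed in as an argument, the same helper could be reused with a Gamma Duration negative log-likelihood and a longer vector of starting values.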
Use that function to estimate a Gamma Duration model for the number of claims, using exposure time in calendar year (RA_EXPOSURE_TIME) as the exposure measure:
Use that function to estimate a Gamma Duration model for the number of claims, using the distance driven (RA_DISTANCE_DRIVEN) as the exposure measure:
Use that function to estimate a Gamma Duration model for the number of claims, using the number of trips (RA_NBTRIP) as the exposure measure:
Use that function to estimate a Gamma Duration model for the number of claims, using the hours driven (RA_HOURS_DRIVEN) as the exposure measure:
Next, compare the performance of the different models using out-of-sample predictions.
Define the function sq_error() which computes the sum of squared errors given some observations and predictions.
#' sq_error
#' Computes the sum of squared errors
#' @param obs vector of observed values
#' @param pred vector of predicted values
sq_error <- function(obs, pred) {
  # Sum over all observations of the squared prediction error
  sum((pred - obs)^2)
}
Define a function named predict_gamma which performs out-of-sample predictions for a Gamma Duration model. The function may be constructed as follows:
[...]
This function may return a list of three elements:
pred: the predicted values
proba: the predicted probabilities
scores: a table with the out-of-sample scores
#' predict_gamma
#' Performs out-of-sample predictions for a GCD model
#' @param fit fit of the model
#' @param expo_name name of the exposure variable
#' @param model_name name of the model
predict_gamma <- function(fit, expo_name, model_name) {
  # Exposure variable
  # Estimated coefficients
  # Predicted values
  # Probability
  # Scores
  list()
}
Now you can use the function predict_gamma() on the four Gamma Duration models previously estimated.
predictors <-
model.matrix(~ gender + RA_MARITALSTATUS + RA_VEH_USE, data = canada_test)
nb_sin <- canada_test$RA_ACCIDENT_IND
Compute the out-of-sample predictions for the model which uses exposure time (RA_EXPOSURE_TIME) as the exposure measure (fit_gamma_exposure).
Compute the out-of-sample predictions for the model which uses distance driven (RA_DISTANCE_DRIVEN) as the exposure measure (fit_gamma_distance).
Compute the out-of-sample predictions for the model which uses the number of trips (RA_NBTRIP) as the exposure measure (fit_gamma_nbtrips).
Compute the out-of-sample predictions for the model which uses hours driven (RA_HOURS_DRIVEN) as the exposure measure (fit_gamma_hours).
Lastly, bind the rows of the tables containing the scores of the four models to compare the performances of your models with out-of-sample data.