Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)

Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)

Ewen Gallic, Aix-Marseille Université (🐦 @3wen)

1 Data

As mentioned in the course, we will use simulated datasets, based on real data from a private Canadian insurance company (The Co-operators General Insurance Company), for this hands-on session on telematics. Each observation corresponds to an insurance policy for a vehicle and a driver (a vehicle may therefore be insured for multiple drivers). The available variables are as follows:

  • Falsevin: id of the vehicle
  • RA_GENDER: gender (1 male, 2 female, 3 unknown)
  • RA_MARITALSTATUS: marital status (1 married, 0 other)
  • RA_VEH_USE: vehicle use (1 commute, 2 pleasure, 0 other)
  • RA_EXPOSURE_TIME: exposure time (in years)
  • RA_DISTANCE_DRIVEN: total distance traveled by the driver (in kilometres)
  • RA_NBTRIP: number of trips
  • RA_HOURS_DRIVEN: number of hours driven
  • RA_ACCIDENT_IND: number of claims

The dataset is in CSV format (comma-separated values). It has been separated into a training sample (CanadaPanelTrain.csv, with 29,108 observations) and a test sample (CanadaPanelTest.csv, with 19,570 observations).
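A sketch of the loading step, consistent with the output below (the file paths and the name canada_test are assumptions; canada_train matches the model calls displayed later):

```r
library(tidyverse)

canada_train <- read_csv("CanadaPanelTrain.csv")
canada_test  <- read_csv("CanadaPanelTest.csv")

nrow(canada_train)
nrow(canada_test)
canada_train
canada_test
```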

## [1] 29108
## [1] 19570
## # A tibble: 29,108 x 9
##    Falsevin RA_GENDER RA_MARITALSTATUS RA_VEH_USE RA_EXPOSURE_TIME
##       <dbl>     <dbl>            <dbl>      <dbl>            <dbl>
##  1        1         2                1          2            0.863
##  2        1         2                1          2            0.485
##  3        6         2                1          1            0.329
##  4        6         2                1          1            0.964
##  5        6         2                1          1            1.00 
##  6        7         2                0          1            0.573
##  7        7         2                0          1            0.912
##  8        8         1                0          2            0.496
##  9        8         1                0          2            0.501
## 10        8         1                0          2            0.501
## # … with 29,098 more rows, and 4 more variables: RA_DISTANCE_DRIVEN <dbl>,
## #   RA_NBTRIP <dbl>, RA_HOURS_DRIVEN <dbl>, RA_ACCIDENT_IND <dbl>
## # A tibble: 19,570 x 9
##    Falsevin RA_GENDER RA_MARITALSTATUS RA_VEH_USE RA_EXPOSURE_TIME
##       <dbl>     <dbl>            <dbl>      <dbl>            <dbl>
##  1        2         2                0          1            0.496
##  2        2         2                0          1            0.778
##  3        3         1                0          2            1.00 
##  4        3         1                0          2            0.414
##  5        4         2                1          1            0.479
##  6        4         2                1          1            0.499
##  7        4         2                1          1            0.414
##  8        5         1                1          1            0.666
##  9        5         1                1          1            0.918
## 10        9         1                1          2            0.392
## # … with 19,560 more rows, and 4 more variables: RA_DISTANCE_DRIVEN <dbl>,
## #   RA_NBTRIP <dbl>, RA_HOURS_DRIVEN <dbl>, RA_ACCIDENT_IND <dbl>

1.1 Basic Statistics

1.1.1 Numerical Variables

While R offers a convenient function to display summary statistics of a dataset, it lacks some outputs of interest, especially variance. Let us create our own summary function:
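A minimal sketch of such a function, built on {dplyr} (the name my_summary() and the exact set of statistics are our choices, matching the table shown further below):

```r
# Given a numeric vector, return a one-row tibble of summary statistics,
# including the variance, which summary() does not report:
my_summary <- function(x) {
  tibble(
    Average  = mean(x, na.rm = TRUE),
    Variance = var(x, na.rm = TRUE),
    Minimum  = min(x, na.rm = TRUE),
    Maximum  = max(x, na.rm = TRUE),
    `25th percentile` = quantile(x, probs = 0.25, na.rm = TRUE),
    `50th percentile` = quantile(x, probs = 0.50, na.rm = TRUE),
    `75th percentile` = quantile(x, probs = 0.75, na.rm = TRUE)
  )
}
```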

We can then apply this function to some desired columns:
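For instance, with purrr::map_dfr() (the choice of columns matches the table below):

```r
stats_table <-
  canada_train %>%
  select(RA_EXPOSURE_TIME, RA_NBTRIP, RA_ACCIDENT_IND) %>%
  map_dfr(my_summary, .id = "Variable")
```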

This table can be displayed in a markdown format. To obtain a prettier output, we can attach a label to each variable instead of keeping the raw variable name. To that end, we can create a tibble where the labels are defined.
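A possible sketch:

```r
labels_variables <- tribble(
  ~Variable,          ~label,
  "RA_EXPOSURE_TIME", "Exposure Time",
  "RA_NBTRIP",        "# Trips",
  "RA_ACCIDENT_IND",  "# Claims"
)

stats_table <-
  stats_table %>%
  left_join(labels_variables, by = "Variable") %>%
  mutate(Variable = label) %>%
  select(-label)
stats_table
```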

## # A tibble: 3 x 8
##   Variable Average Variance Minimum Maximum `25th percentil…
##   <chr>      <dbl>    <dbl>   <dbl>   <dbl>            <dbl>
## 1 Exposur… 6.52e-1  6.11e-2   0.277    1.08            0.485
## 2 # Trips  1.07e+3  3.90e+5  15     3283             607    
## 3 # Claims 3.47e-2  3.61e-2   0        3               0    
## # … with 2 more variables: `50th percentile` <dbl>, `75th
## #   percentile` <dbl>

The output of the table can be in Markdown:
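For example, with knitr::kable():

```r
library(knitr)
kable(stats_table, format = "markdown", digits = 3)
# kable(stats_table, format = "latex", digits = 3) gives the LaTeX version
```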

| Variable | Average | Variance | Minimum | Maximum | 25th percentile | 50th percentile | 75th percentile |
|:---|---:|---:|---:|---:|---:|---:|---:|
| Exposure Time | 0.652 | 0.061 | 0.277 | 1.079 | 0.485 | 0.521 | 0.94 |
| # Trips | 1 074.081 | 390 413.843 | 15.000 | 3 283.000 | 607.000 | 937.000 | 1 415.25 |
| # Claims | 0.035 | 0.036 | 0.000 | 3.000 | 0.000 | 0.000 | 0.00 |

A \(\LaTeX\) format can also be obtained:

## 
## \begin{tabular}{l|r|r|r|r|r|r|r}
## \hline
## Variable & Average & Variance & Minimum & Maximum & 25th percentile & 50th percentile & 75th percentile\\
## \hline
## Exposure Time & 0.652 & 0.061 & 0.277 & 1.079 & 0.485 & 0.521 & 0.94\\
## \hline
## \# Trips & 1 074.081 & 390 413.843 & 15.000 & 3 283.000 & 607.000 & 937.000 & 1 415.25\\
## \hline
## \# Claims & 0.035 & 0.036 & 0.000 & 3.000 & 0.000 & 0.000 & 0.00\\
## \hline
## \end{tabular}

Some scatter plots can be graphed:
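For example (the choice of variables is ours):

```r
ggplot(canada_train, aes(x = RA_DISTANCE_DRIVEN, y = RA_NBTRIP)) +
  geom_point(alpha = 0.2) +
  labs(x = "Distance driven (km)", y = "Number of trips")
```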

To explore the data, faceting can be very useful:
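For instance, faceting the previous scatter plot by gender (gender is still coded 1/2/3 at this stage):

```r
ggplot(canada_train, aes(x = RA_DISTANCE_DRIVEN, y = RA_NBTRIP)) +
  geom_point(alpha = 0.2) +
  facet_wrap(~RA_GENDER) +
  labs(x = "Distance driven (km)", y = "Number of trips")
```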

Densities can easily be plotted. For example, we can use the function geom_density_ridges() from {ggridges}:
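A sketch (the variables plotted are our choice):

```r
library(ggridges)

ggplot(canada_train,
       aes(x = RA_DISTANCE_DRIVEN, y = factor(RA_GENDER))) +
  geom_density_ridges() +
  labs(x = "Distance driven (km)", y = "Gender")
```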

Boxplots can also be easily graphed:
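For example:

```r
ggplot(canada_train,
       aes(x = factor(RA_GENDER), y = RA_DISTANCE_DRIVEN)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Distance driven (km)")
```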

1.1.2 Categorical Variables

We may want to recode our categorical variables so that the levels display more intelligible values. To that end, we can use the mutate() function coupled with the case_when() function, both from {dplyr}. As our data is separated into two samples, we may create a wrapper function that will be applied to both samples.
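A sketch of the wrapper (the name recode_factors() is ours; the level labels match the output shown below):

```r
recode_factors <- function(df) {
  df %>%
    mutate(
      RA_GENDER = factor(case_when(
        RA_GENDER == 1 ~ "Male",
        RA_GENDER == 2 ~ "Female",
        TRUE           ~ "Unknown"
      )),
      RA_MARITALSTATUS = factor(if_else(RA_MARITALSTATUS == 1,
                                        "Married", "Other")),
      RA_VEH_USE = factor(case_when(
        RA_VEH_USE == 1 ~ "Commute",
        RA_VEH_USE == 2 ~ "Pleasure",
        TRUE            ~ "Other"
      ))
    )
}
```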

Now we can use that function on our two datasets:
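```r
canada_train <- recode_factors(canada_train)
canada_test  <- recode_factors(canada_test)
canada_train
canada_test
```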

## # A tibble: 29,108 x 9
##    Falsevin RA_GENDER RA_MARITALSTATUS RA_VEH_USE RA_EXPOSURE_TIME
##       <dbl> <fct>     <fct>            <fct>                 <dbl>
##  1        1 Female    Married          Pleasure              0.863
##  2        1 Female    Married          Pleasure              0.485
##  3        6 Female    Married          Commute               0.329
##  4        6 Female    Married          Commute               0.964
##  5        6 Female    Married          Commute               1.00 
##  6        7 Female    Other            Commute               0.573
##  7        7 Female    Other            Commute               0.912
##  8        8 Male      Other            Pleasure              0.496
##  9        8 Male      Other            Pleasure              0.501
## 10        8 Male      Other            Pleasure              0.501
## # … with 29,098 more rows, and 4 more variables: RA_DISTANCE_DRIVEN <dbl>,
## #   RA_NBTRIP <dbl>, RA_HOURS_DRIVEN <dbl>, RA_ACCIDENT_IND <dbl>
## # A tibble: 19,570 x 9
##    Falsevin RA_GENDER RA_MARITALSTATUS RA_VEH_USE RA_EXPOSURE_TIME
##       <dbl> <fct>     <fct>            <fct>                 <dbl>
##  1        2 Female    Other            Commute               0.496
##  2        2 Female    Other            Commute               0.778
##  3        3 Male      Other            Pleasure              1.00 
##  4        3 Male      Other            Pleasure              0.414
##  5        4 Female    Married          Commute               0.479
##  6        4 Female    Married          Commute               0.499
##  7        4 Female    Married          Commute               0.414
##  8        5 Male      Married          Commute               0.666
##  9        5 Male      Married          Commute               0.918
## 10        9 Male      Married          Pleasure              0.392
## # … with 19,560 more rows, and 4 more variables: RA_DISTANCE_DRIVEN <dbl>,
## #   RA_NBTRIP <dbl>, RA_HOURS_DRIVEN <dbl>, RA_ACCIDENT_IND <dbl>

To perform quick and dirty summary statistics on categorical variables that will be displayed in the console, we can group observations by variable and count the frequency of each level as follows:
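A sketch consistent with the output below:

```r
canada_train %>%
  select(RA_GENDER, RA_MARITALSTATUS, RA_VEH_USE) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(),
               names_to = "variable_name", values_to = "value") %>%
  group_by(variable_name, value) %>%
  summarise(freq = n()) %>%
  mutate(pct = round(freq / sum(freq), 1))
```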

## # A tibble: 8 x 4
## # Groups:   variable_name [3]
##   variable_name    value     freq   pct
##   <chr>            <chr>    <int> <dbl>
## 1 RA_GENDER        Female   14832   0.5
## 2 RA_GENDER        Male     13993   0.5
## 3 RA_GENDER        Unknown    283   0  
## 4 RA_MARITALSTATUS Married   9189   0.3
## 5 RA_MARITALSTATUS Other    19919   0.7
## 6 RA_VEH_USE       Commute  15948   0.5
## 7 RA_VEH_USE       Other      501   0  
## 8 RA_VEH_USE       Pleasure 12659   0.4

For a specific variable, it is also possible to use table() and prop.table() to quickly compute frequency and proportions. For example, for the number of claims in the training set:
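```r
table(canada_train$RA_ACCIDENT_IND)
round(prop.table(table(canada_train$RA_ACCIDENT_IND)), 3)
```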

## 
##     0     1     2     3 
## 28133   940    34     1
## 
##     0     1     2     3 
## 0.967 0.032 0.001 0.000

We can also display the frequency of the number of claims in a barplot. First we can count the frequency using the tally() function from {dplyr}.
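A sketch (the name freq_claims is ours):

```r
freq_claims <-
  canada_train %>%
  group_by(RA_ACCIDENT_IND) %>%
  tally()
```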

Then, we can plot the result using ggplot(). The function reorder() can be used in the aesthetics arguments to order the values of bars:
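```r
ggplot(freq_claims, aes(x = reorder(RA_ACCIDENT_IND, -n), y = n)) +
  geom_col() +
  labs(x = "Number of claims", y = "Frequency")
```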

Let us suppose that we want to display the distribution of the number of claims while also showing the distribution of vehicle use. We may also want to reorder the labels in the legend, and to place the latter below the graph:
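A sketch combining both ideas:

```r
ggplot(canada_train,
       aes(x = factor(RA_ACCIDENT_IND),
           fill = factor(RA_VEH_USE,
                         levels = c("Commute", "Pleasure", "Other")))) +
  geom_bar() +
  labs(x = "Number of claims", y = "Frequency", fill = "Vehicle use") +
  theme(legend.position = "bottom")
```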

Now, if we want prettier outputs, we can use the {qwraps2} package to display summary statistics either in \(\LaTeX\) or in Markdown tables.
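A sketch of a qwraps2 summary specification (the list structure is standard qwraps2 usage; the rounding and formatting of the table below may differ slightly from this sketch):

```r
library(qwraps2)
options(qwraps2_markup = "markdown")

our_summary <-
  list(
    "Gender" = list(
      "Female"  = ~ n_perc0(RA_GENDER == "Female"),
      "Male"    = ~ n_perc0(RA_GENDER == "Male"),
      "Unknown" = ~ n_perc0(RA_GENDER == "Unknown")
    ),
    "Marital Status" = list(
      "Married" = ~ n_perc0(RA_MARITALSTATUS == "Married"),
      "Other"   = ~ n_perc0(RA_MARITALSTATUS == "Other")
    ),
    "Vehicle use" = list(
      "Commute"  = ~ n_perc0(RA_VEH_USE == "Commute"),
      "Other"    = ~ n_perc0(RA_VEH_USE == "Other"),
      "Pleasure" = ~ n_perc0(RA_VEH_USE == "Pleasure")
    )
  )

summary_table(canada_train, our_summary)
```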

| | (N = 29,108) |
|:---|:---|
| **Gender** | |
| Female | 14,832 (51) |
| Male | 13,993 (48) |
| Unknown | 283 (1) |
| **Marital Status** | |
| Married | 9,189 (32) |
| Other | 19,919 (68) |
| **Vehicle use** | |
| Commute | 15,948 (55) |
| Other | 501 (2) |
| Pleasure | 12,659 (43) |

It is also possible to obtain the summary statistics depending on groups:
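For instance (depending on the {qwraps2} version, either the by argument or a grouped data frame can be used):

```r
summary_table(canada_train, our_summary, by = "RA_MARITALSTATUS")
```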

| | Marital Status: Married (N = 9,189) | Marital Status: Other (N = 19,919) |
|:---|:---|:---|
| **Gender** | | |
| Female | 4,109 (45) | 10,723 (54) |
| Male | 5,042 (55) | 8,951 (45) |
| Unknown | 38 (0) | 245 (1) |
| **Marital Status** | | |
| Married | 9,189 (100) | 0 (0) |
| Other | 0 (0) | 19,919 (100) |
| **Vehicle use** | | |
| Commute | 5,984 (65) | 9,964 (50) |
| Other | 70 (1) | 431 (2) |
| Pleasure | 3,135 (34) | 9,524 (48) |

More options can be set, as explained in the vignette of {qwraps2}.

1.2 Risk Exposure

We are particularly interested in two metrics:

  • the distance driven (in kilometres)
  • the exposure time, which can be measured using either:

    • the duration of the contract (usually one year)
    • the mileage traveled annually by the policyholders

The empirical distributions of the distance driven, the exposure time, the number of trips, and the number of hours driven can each be plotted using a barplot, as sketched below:
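A compact sketch plotting the four distributions at once (the faceted layout is our choice; the original plotted them one by one):

```r
canada_train %>%
  select(RA_DISTANCE_DRIVEN, RA_EXPOSURE_TIME,
         RA_NBTRIP, RA_HOURS_DRIVEN) %>%
  pivot_longer(everything()) %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 50) +
  facet_wrap(~name, scales = "free")
```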

Let us visualize the relationship between claim frequency and exposure. To that end, we can draw a scatter plot showing claim frequency as a function of the number of kilometres driven. Insureds can be grouped according to the distance travelled, considering intervals of 500 kilometres, for example. For each group, the average number of claims can be calculated.
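A sketch consistent with the table below:

```r
claims_km <-
  canada_train %>%
  mutate(km_group = floor(RA_DISTANCE_DRIVEN / 500) * 500) %>%
  group_by(km_group) %>%
  summarise(freq = mean(RA_ACCIDENT_IND), n_drivers = n())
claims_km
```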

## # A tibble: 119 x 3
##    km_group    freq n_drivers
##       <dbl>   <dbl>     <int>
##  1        0 0              97
##  2      500 0.00330       303
##  3     1000 0.0215        512
##  4     1500 0.0159        630
##  5     2000 0.0194        878
##  6     2500 0.0231        909
##  7     3000 0.0188       1008
##  8     3500 0.0241       1077
##  9     4000 0.0173       1159
## 10     4500 0.0239       1171
## # … with 109 more rows

Then, using the resulting table, the scatter plot can be displayed. The color of the dots can be set accordingly with the number of insureds for each group.
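```r
ggplot(claims_km, aes(x = km_group, y = freq, colour = n_drivers)) +
  geom_point() +
  labs(x = "Distance driven (km)", y = "Average number of claims",
       colour = "Number of insureds")
```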

We can also plot the number of claims as a function of exposure duration. To that end, we can compute the average number of claims within groups of insureds, where groups are created by considering exposure time intervals of 0.01 years.
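A sketch, mirroring the distance-based computation:

```r
canada_train %>%
  mutate(time_group = floor(RA_EXPOSURE_TIME / 0.01) * 0.01) %>%
  group_by(time_group) %>%
  summarise(freq = mean(RA_ACCIDENT_IND), n_drivers = n()) %>%
  ggplot(aes(x = time_group, y = freq, colour = n_drivers)) +
  geom_point() +
  labs(x = "Exposure time (years)", y = "Average number of claims",
       colour = "Number of insureds")
```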

The same kind of plot can be made for the number of trips and for the number of hours driven.

2 Duration Models

As indicated in the course, we have at our disposal several risk exposure measures that allow us to revisit several classic definitions of dependence:

  • occurrence dependence: occurs when the occurrence of an event increases the probability of occurrence of another event
  • duration dependence: occurs when the outcome of an experiment depends on the time that has elapsed since the last success.

We will focus on two definitions of the exposure:

  • the calendar time (RA_EXPOSURE_TIME)
  • the distance driven (RA_DISTANCE_DRIVEN)

We could also use:

  • the number of trips (RA_NBTRIP)
  • the number of hours driven (RA_HOURS_DRIVEN)

2.1 Fitting the Number of Claims

2.1.1 Poisson GLM

We will model the number of claims first, by fitting a Poisson regression model. This is achieved using the glm() function from {stats}.

We can first consider a model without offset:
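The object name mod_claims_poisson_1 matches the model list in the wrap-up below:

```r
mod_claims_poisson_1 <- glm(
  RA_ACCIDENT_IND ~ RA_GENDER + RA_MARITALSTATUS + RA_VEH_USE,
  family = "poisson", data = canada_train
)
summary(mod_claims_poisson_1)
```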

## 
## Call:
## glm(formula = RA_ACCIDENT_IND ~ RA_GENDER + RA_MARITALSTATUS + 
##     RA_VEH_USE, family = "poisson", data = canada_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3065  -0.2763  -0.2693  -0.2380   4.6937  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -3.11006    0.06499 -47.852  < 2e-16 ***
## RA_GENDERMale          0.05152    0.06335   0.813  0.41605    
## RA_GENDERUnknown      -0.54859    0.44959  -1.220  0.22238    
## RA_MARITALSTATUSOther -0.20725    0.06584  -3.148  0.00165 ** 
## RA_VEH_USEOther       -2.92925    0.99864  -2.933  0.00335 ** 
## RA_VEH_USEPleasure    -0.29798    0.06601  -4.514 6.35e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6894.9  on 29107  degrees of freedom
## Residual deviance: 6828.8  on 29102  degrees of freedom
## AIC: 8812.7
## 
## Number of Fisher Scoring iterations: 7

We can fit the model using exposure time as an offset:
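```r
mod_claims_poisson_2 <- glm(
  RA_ACCIDENT_IND ~ RA_GENDER + RA_MARITALSTATUS + RA_VEH_USE +
    offset(log(RA_EXPOSURE_TIME)),
  family = "poisson", data = canada_train
)
summary(mod_claims_poisson_2)
```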

## 
## Call:
## glm(formula = RA_ACCIDENT_IND ~ RA_GENDER + RA_MARITALSTATUS + 
##     RA_VEH_USE + offset(log(RA_EXPOSURE_TIME)), family = "poisson", 
##     data = canada_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3820  -0.3001  -0.2372  -0.2098   4.4425  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.72206    0.06492 -41.930  < 2e-16 ***
## RA_GENDERMale          0.02799    0.06333   0.442 0.658488    
## RA_GENDERUnknown      -0.13356    0.44963  -0.297 0.766435    
## RA_MARITALSTATUSOther -0.19141    0.06577  -2.910 0.003614 ** 
## RA_VEH_USEOther       -2.71438    0.99810  -2.720 0.006537 ** 
## RA_VEH_USEPleasure    -0.21749    0.06597  -3.297 0.000977 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6731.1  on 29107  degrees of freedom
## Residual deviance: 6686.6  on 29102  degrees of freedom
## AIC: 8670.5
## 
## Number of Fisher Scoring iterations: 7

We can also fit the model using distance driven as an offset:
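The corresponding coefficients are reported in the comparison table below:

```r
mod_claims_poisson_3 <- glm(
  RA_ACCIDENT_IND ~ RA_GENDER + RA_MARITALSTATUS + RA_VEH_USE +
    offset(log(RA_DISTANCE_DRIVEN)),
  family = "poisson", data = canada_train
)
```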


Using the function mtable() from {memisc} allows us to nicely display the coefficients, as well as some summary statistics of the different models, in a single table.
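A sketch (the summary.stats selection matches the rows shown below):

```r
library(memisc)

mtable("Without Offset" = mod_claims_poisson_1,
       "Exp. Time"      = mod_claims_poisson_2,
       "Distance"       = mod_claims_poisson_3,
       summary.stats = c("AIC", "N"))
```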

| | Without Offset | Exp. Time | Distance |
|:---|---:|---:|---:|
| (Intercept) | −3.110*** (0.065) | −2.722*** (0.065) | −12.524*** (0.064) |
| RA_GENDER: Male/Female | 0.052 (0.063) | 0.028 (0.063) | 0.065 (0.063) |
| RA_GENDER: Unknown/Female | −0.549 (0.450) | −0.134 (0.450) | −0.142 (0.450) |
| RA_MARITALSTATUS: Other/Married | −0.207** (0.066) | −0.191** (0.066) | −0.196** (0.066) |
| RA_VEH_USE: Other/Commute | −2.929** (0.999) | −2.714** (0.998) | −2.451* (1.001) |
| RA_VEH_USE: Pleasure/Commute | −0.298*** (0.066) | −0.217*** (0.066) | 0.130* (0.066) |
| AIC | 8812.707 | 8670.459 | 8841.987 |
| N | 29108 | 29108 | 29108 |

Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05

Assuming that the models are correctly specified, they can then be used to predict the number of claims (\(E[N]\)) for new individuals. Let us consider the example from the lesson: a single male driver, who drove exactly 12,000 kilometres over 8 months and who uses his car to commute.

Let us create that insured:
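A sketch (the name new_insured is ours; the values follow the example above):

```r
new_insured <- tibble(
  RA_GENDER          = "Male",
  RA_MARITALSTATUS   = "Other",   # single, hence not "Married"
  RA_VEH_USE         = "Commute",
  RA_EXPOSURE_TIME   = 8 / 12,    # 8 months, in years
  RA_DISTANCE_DRIVEN = 12000
)
```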

The levels of categorical variables need to be set accordingly with those of the dataset that was used to fit the models:
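```r
new_insured <-
  new_insured %>%
  mutate(
    RA_GENDER        = factor(RA_GENDER,
                              levels = levels(canada_train$RA_GENDER)),
    RA_MARITALSTATUS = factor(RA_MARITALSTATUS,
                              levels = levels(canada_train$RA_MARITALSTATUS)),
    RA_VEH_USE       = factor(RA_VEH_USE,
                              levels = levels(canada_train$RA_VEH_USE))
  )
```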

Then using the function predict.glm(), the predicted number of claims can be computed. For the model without offset:
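By default, predict.glm() returns the linear predictor, which needs to be exponentiated:

```r
exp(predict.glm(mod_claims_poisson_1, newdata = new_insured))
```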

##          1 
## 0.03816697

It is also possible to set the argument type to "response" to obtain the predicted number of claims without having to use the exponential function:
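```r
predict.glm(mod_claims_poisson_1, newdata = new_insured, type = "response")
```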

##          1 
## 0.03816697

For the model that uses exposure time as an offset:

##          1 
## 0.03721907

And for the model that uses distances driven as the exposure base, we finally have:

##          1 
## 0.03830516

2.1.1.1 Manual estimation

Instead of using the glm() function, we can directly maximize the log-likelihood (or, equivalently, minimize the negative log-likelihood) of the Poisson model with exposure time as an offset.

Let us first define the exposure time, the predictor matrix (where the constant is placed in the first column), and the variable of interest, i.e., the number of claims:
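A sketch (the object names expo, X, and y are our choices):

```r
expo <- canada_train$RA_EXPOSURE_TIME
X    <- model.matrix(~ RA_GENDER + RA_MARITALSTATUS + RA_VEH_USE,
                     data = canada_train)
y    <- canada_train$RA_ACCIDENT_IND
```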

The negative log-likelihood function can then be defined:
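A sketch; the constant term log(y!) is kept so that the value is comparable with the AIC reported by glm():

```r
neg_logl_Poisson <- function(beta, X, y, expo) {
  lambda <- exp(X %*% beta)
  # - sum_i [ y_i log(lambda_i * t_i) - lambda_i * t_i - log(y_i!) ]
  -(t(y) %*% (X %*% beta + log(expo)) -
      sum(lambda * expo) - sum(lgamma(y + 1)))
}
```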

We will use the optim() function to find the parameters that minimize the neg_logl_Poisson() function. The former requires starting values from which the algorithm tries to converge towards a minimum (which we hope is global).

Let us provide the following starting values for the coefficients:
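The original starting values are not shown; as an illustration, we can start from the coefficients of the Poisson model without offset (our choice):

```r
start_values <- coef(mod_claims_poisson_1)
```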

With such values for the coefficients, the negative log-likelihood for the Poisson model with an offset can be estimated as follows:
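```r
neg_logl_Poisson(start_values, X, y, expo)
```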

##          [,1]
## [1,] 4329.468

The maximum likelihood estimation gives the following coefficients:
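A sketch (the object name fit_poisson_ml is ours; hessian = TRUE is requested for the standard errors below):

```r
fit_poisson_ml <- optim(
  par = start_values, fn = neg_logl_Poisson,
  X = X, y = y, expo = expo,
  method = "BFGS", hessian = TRUE
)
fit_poisson_ml$par
```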

##           (Intercept)         RA_GENDERMale      RA_GENDERUnknown 
##           -2.72280982            0.02798973           -0.11686434 
## RA_MARITALSTATUSOther       RA_VEH_USEOther    RA_VEH_USEPleasure 
##           -0.19191782           -2.65778697           -0.21627071

We can check that the estimates are the same as those obtained previously using the glm() function:
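```r
coef(mod_claims_poisson_2)
```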

##           (Intercept)         RA_GENDERMale      RA_GENDERUnknown 
##           -2.72205681            0.02799404           -0.13355746 
## RA_MARITALSTATUSOther       RA_VEH_USEOther    RA_VEH_USEPleasure 
##           -0.19140622           -2.71437901           -0.21748955

To compute the confidence intervals of the coefficients, we first need to compute their standard errors. First, let us calculate the Fisher information matrix and then extract its diagonal terms.
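A sketch using the Hessian returned by optim() as the estimate of the Fisher information:

```r
fisher_info <- fit_poisson_ml$hessian
se_coefs    <- sqrt(diag(solve(fisher_info)))
```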

The confidence intervals can then easily be computed:
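```r
tibble(
  coef_name = names(fit_poisson_ml$par),
  value     = fit_poisson_ml$par,
  upper     = value + qnorm(0.975) * se_coefs,
  lower     = value - qnorm(0.975) * se_coefs,
  se        = se_coefs
)
```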

## # A tibble: 6 x 5
##   coef_name               value   upper   lower     se
##   <chr>                   <dbl>   <dbl>   <dbl>  <dbl>
## 1 (Intercept)           -2.72   -2.60   -2.85   0.0649
## 2 RA_GENDERMale          0.0280  0.152  -0.0962 0.0634
## 3 RA_GENDERUnknown      -0.117   0.757  -0.991  0.446 
## 4 RA_MARITALSTATUSOther -0.192  -0.0630 -0.321  0.0658
## 5 RA_VEH_USEOther       -2.66   -0.750  -4.57   0.973 
## 6 RA_VEH_USEPleasure    -0.216  -0.0870 -0.346  0.0660

The AIC can be computed as follows:
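```r
2 * length(fit_poisson_ml$par) + 2 * as.numeric(fit_poisson_ml$value)
```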

## [1] 8670.464

2.1.2 Exponential Duration

As mentioned in the lesson, assuming an exponential distribution for the “time” between claims corresponds to a classic Poisson distribution with an offset variable.

Let us still use calendar time to measure the exposure, and use the same predictors and the same variable of interest as above.

The negative log-likelihood function writes:
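A sketch; it has exactly the same form as neg_logl_Poisson(), which is the point of the equivalence stated above:

```r
neg_logl_exp_duration <- function(beta, X, y, expo) {
  lambda <- exp(X %*% beta)
  -(t(y) %*% (X %*% beta + log(expo)) -
      sum(lambda * expo) - sum(lgamma(y + 1)))
}
```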

Using the same starting values as before, the negative log-likelihood equals:

##          [,1]
## [1,] 4329.468

Once again, we can use the optim() function to obtain the ML estimates:
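The object name fit_exponential_duration matches the wrap-up below:

```r
fit_exponential_duration <- optim(
  par = start_values, fn = neg_logl_exp_duration,
  X = X, y = y, expo = expo,
  method = "BFGS", hessian = TRUE
)
fit_exponential_duration$par
```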

##           (Intercept)         RA_GENDERMale      RA_GENDERUnknown 
##           -2.72280982            0.02798973           -0.11686434 
## RA_MARITALSTATUSOther       RA_VEH_USEOther    RA_VEH_USEPleasure 
##           -0.19191782           -2.65778697           -0.21627071

The standard errors and the confidence intervals of the coefficients can again easily be computed:

## # A tibble: 6 x 5
##   coef_name               value   upper   lower     se
##   <chr>                   <dbl>   <dbl>   <dbl>  <dbl>
## 1 (Intercept)           -2.72   -2.60   -2.85   0.0649
## 2 RA_GENDERMale          0.0280  0.152  -0.0962 0.0634
## 3 RA_GENDERUnknown      -0.117   0.757  -0.991  0.446 
## 4 RA_MARITALSTATUSOther -0.192  -0.0630 -0.321  0.0658
## 5 RA_VEH_USEOther       -2.66   -0.750  -4.57   0.973 
## 6 RA_VEH_USEPleasure    -0.216  -0.0870 -0.346  0.0660

And the AIC equals:

## [1] 8670.464

2.1.3 Gamma Duration

Instead of setting the value of \(\alpha\) to 1 as in the exponential model, we can estimate it. This results in a more general gamma distribution for the “time” between claims. Let us estimate the coefficients using the Gamma Count Distribution (GCD) model for our two different measures of exposure: exposure time (calendar year) and distance driven (in kilometres).

2.1.3.1 Using Exposure Time

Let us use exposure time in calendar years as the measure of exposure.
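A sketch of the negative log-likelihood of the gamma count model (the function name is ours; the parameterization assumes inter-arrival times distributed Gamma(\(\alpha\), \(\alpha\lambda\)), so that \(P(N = n) = G(\alpha n, \alpha\lambda t) - G(\alpha(n+1), \alpha\lambda t)\), where \(G(s, x)\) is the regularized lower incomplete gamma function, with \(G(0, x) = 1\)):

```r
neg_logl_gamma <- function(par, X, y, expo) {
  beta   <- par[1:ncol(X)]
  alpha  <- par[length(par)]
  lambda <- as.vector(exp(X %*% beta))
  m      <- alpha * lambda * expo
  # regularized lower incomplete gamma, with the convention G(0, x) = 1:
  G <- function(s, x) ifelse(s == 0, 1, pgamma(x, shape = s))
  -sum(log(G(alpha * y, m) - G(alpha * (y + 1), m)))
}
```

When \(\alpha = 1\), this reduces exactly to the Poisson negative log-likelihood with offset.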

Using the same starting values as before for the coefficients, together with a starting value of 1 for \(\alpha\) (for which the gamma model reduces to the Poisson one), the negative log-likelihood equals:

##          [,1]
## [1,] 4329.468

The estimated coefficients:
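A sketch (fit_gamma_exp_time matches the wrap-up below):

```r
start_values_gamma <- c(start_values, alpha = 1)

fit_gamma_exp_time <- optim(
  par = start_values_gamma, fn = neg_logl_gamma,
  X = X, y = y, expo = expo,
  method = "BFGS", hessian = TRUE
)
fit_gamma_exp_time$par
```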

##           (Intercept)         RA_GENDERMale      RA_GENDERUnknown 
##           -3.04696138            0.02279892           -0.02211765 
## RA_MARITALSTATUSOther       RA_VEH_USEOther    RA_VEH_USEPleasure 
##           -0.21436492           -2.76231289           -0.24986620 
##                 alpha 
##            0.91396719

The standard errors and the confidence intervals can again easily be computed:

## # A tibble: 7 x 5
##   coef_name               value   upper  lower     se
##   <chr>                   <dbl>   <dbl>  <dbl>  <dbl>
## 1 (Intercept)           -3.05   -2.36   -3.74  0.353 
## 2 RA_GENDERMale          0.0228  0.158  -0.112 0.0690
## 3 RA_GENDERUnknown      -0.0221  0.870  -0.914 0.455 
## 4 RA_MARITALSTATUSOther -0.214  -0.0689 -0.360 0.0742
## 5 RA_VEH_USEOther       -2.76   -0.771  -4.75  1.02  
## 6 RA_VEH_USEPleasure    -0.250  -0.0983 -0.401 0.0773
## 7 alpha                  0.914   1.08    0.752 0.0824

And the AIC:

## [1] 8671.137

We can use the estimated coefficients to make predictions on new observations.

We need to extract the estimated coefficients:
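```r
coefs_gamma <- fit_gamma_exp_time$par
beta_gamma  <- coefs_gamma[-length(coefs_gamma)]
alpha_gamma <- coefs_gamma["alpha"]
```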

Let us assume that our insured is a single man who uses his car to commute.

Let us assume that he drove exactly 12,000 kilometres for 8 months. His exposure time is therefore:
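```r
expo_new <- 8 / 12
```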

The prediction can be computed using the pgamma() function:
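A sketch, using \(E[N] = \sum_{n \geq 1} G(\alpha n, \alpha\lambda t)\); the series is truncated at 20 terms (our choice), the terms vanishing quickly:

```r
# linear predictor for the new insured (new_insured was defined above,
# with factor levels matching the training data):
X_new <- model.matrix(~ RA_GENDER + RA_MARITALSTATUS + RA_VEH_USE,
                      data = new_insured)
lambda_new <- as.vector(exp(X_new %*% beta_gamma))

sum(pgamma(alpha_gamma * lambda_new * expo_new,
           shape = alpha_gamma * (1:20)))
```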

## [1] 0.03729604

2.1.3.2 Using Distance Driven

Now, let us use the distance driven (RA_DISTANCE_DRIVEN) as the measure of exposure.

Let us estimate the coefficients:
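A sketch (fit_gamma_dist matches the wrap-up below; the exposure is now the distance driven):

```r
expo_km <- canada_train$RA_DISTANCE_DRIVEN

fit_gamma_dist <- optim(
  par = start_values_gamma, fn = neg_logl_gamma,
  X = X, y = y, expo = expo_km,
  method = "BFGS", hessian = TRUE
)
```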

And their associated confidence intervals:

## # A tibble: 7 x 5
##   coef_name                value   upper   lower      se
##   <chr>                    <dbl>   <dbl>   <dbl>   <dbl>
## 1 (Intercept)           -15.8    -15.5   -16.0     0.123
## 2 RA_GENDERMale           0.0983   0.342  -0.145   0.124
## 3 RA_GENDERUnknown       -1.43     0.695  -3.56    1.08 
## 4 RA_MARITALSTATUSOther  -0.495   -0.243  -0.747   0.128
## 5 RA_VEH_USEOther       -27.5    NaN     NaN     NaN    
## 6 RA_VEH_USEPleasure     -0.131    0.124  -0.386   0.130
## 7 alpha                   0.497  NaN     NaN     NaN

And the AIC:

## [1] 8725.809

Now let us turn to the predictions. Let us assume again that we are facing a single man who drove his car to commute, covering exactly 12,000 kilometres over 8 months.

We extract the estimated coefficients and set the exposure of our insured to the 12,000 kilometres he drove; once again, the pgamma() function gives the predicted number of claims for that insured:
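```r
beta_gamma_km  <- fit_gamma_dist$par[-length(fit_gamma_dist$par)]
alpha_gamma_km <- fit_gamma_dist$par["alpha"]
expo_new_km    <- 12000

lambda_new_km <- as.vector(exp(X_new %*% beta_gamma_km))
sum(pgamma(alpha_gamma_km * lambda_new_km * expo_new_km,
           shape = alpha_gamma_km * (1:20)))
```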

## [1] 0.03960649

2.1.4 Modified Counts Distribution

The modified Poisson Distribution presented during the lesson seems to offer a good approximation to the Weibull Count Distribution (skipped here), the latter offering a much better estimate of the number of claims than the Poisson distribution. In addition, the modified Poisson Distribution is simpler to use than the Weibull Count Distribution.

Using exposure time as a covariate:
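The object names follow the wrap-up below; the two other models are obtained analogously by swapping in (or adding) log(RA_DISTANCE_DRIVEN), as the Calls in their summaries show:

```r
fit_m_Poisson_duration <- glm(
  RA_ACCIDENT_IND ~ factor(RA_GENDER) + factor(RA_MARITALSTATUS) +
    factor(RA_VEH_USE) + log(RA_EXPOSURE_TIME),
  family = "poisson", data = canada_train
)
summary(fit_m_Poisson_duration)
```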

## 
## Call:
## glm(formula = RA_ACCIDENT_IND ~ factor(RA_GENDER) + factor(RA_MARITALSTATUS) + 
##     factor(RA_VEH_USE) + log(RA_EXPOSURE_TIME), family = "poisson", 
##     data = canada_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3822  -0.3002  -0.2372  -0.2098   4.4421  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -2.72144    0.07041 -38.652  < 2e-16 ***
## factor(RA_GENDER)Male          0.02795    0.06336   0.441  0.65908    
## factor(RA_GENDER)Unknown      -0.13265    0.45143  -0.294  0.76888    
## factor(RA_MARITALSTATUS)Other -0.19137    0.06579  -2.909  0.00363 ** 
## factor(RA_VEH_USE)Other       -2.71392    0.99860  -2.718  0.00657 ** 
## factor(RA_VEH_USE)Pleasure    -0.21733    0.06634  -3.276  0.00105 ** 
## log(RA_EXPOSURE_TIME)          1.00195    0.08649  11.585  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6894.9  on 29107  degrees of freedom
## Residual deviance: 6686.6  on 29101  degrees of freedom
## AIC: 8672.5
## 
## Number of Fisher Scoring iterations: 7

Now using distance driven as a covariate:

## 
## Call:
## glm(formula = RA_ACCIDENT_IND ~ factor(RA_GENDER) + factor(RA_MARITALSTATUS) + 
##     factor(RA_VEH_USE) + log(RA_DISTANCE_DRIVEN), family = "poisson", 
##     data = canada_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4648  -0.2915  -0.2560  -0.2192   4.5273  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -7.32907    0.44000 -16.657  < 2e-16 ***
## factor(RA_GENDER)Male          0.05389    0.06333   0.851  0.39482    
## factor(RA_GENDER)Unknown      -0.37119    0.44998  -0.825  0.40943    
## factor(RA_MARITALSTATUS)Other -0.20513    0.06586  -3.114  0.00184 ** 
## factor(RA_VEH_USE)Other       -2.70689    0.99867  -2.710  0.00672 ** 
## factor(RA_VEH_USE)Pleasure    -0.08674    0.06901  -1.257  0.20876    
## log(RA_DISTANCE_DRIVEN)        0.45396    0.04630   9.805  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6894.9  on 29107  degrees of freedom
## Residual deviance: 6727.2  on 29101  degrees of freedom
## AIC: 8713
## 
## Number of Fisher Scoring iterations: 7

The modified Poisson distribution can handle both the distance and the exposure time as covariates.

## 
## Call:
## glm(formula = RA_ACCIDENT_IND ~ factor(RA_GENDER) + factor(RA_MARITALSTATUS) + 
##     factor(RA_VEH_USE) + log(RA_EXPOSURE_TIME) + log(RA_DISTANCE_DRIVEN), 
##     family = "poisson", data = canada_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4238  -0.3040  -0.2424  -0.2058   4.4163  
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   -4.86988    0.53076  -9.175  < 2e-16 ***
## factor(RA_GENDER)Male          0.03360    0.06337   0.530  0.59599    
## factor(RA_GENDER)Unknown      -0.13463    0.45141  -0.298  0.76553    
## factor(RA_MARITALSTATUS)Other -0.19419    0.06582  -2.950  0.00317 ** 
## factor(RA_VEH_USE)Other       -2.65134    0.99867  -2.655  0.00793 ** 
## factor(RA_VEH_USE)Pleasure    -0.12994    0.06944  -1.871  0.06130 .  
## log(RA_EXPOSURE_TIME)          0.77233    0.10267   7.522 5.38e-14 ***
## log(RA_DISTANCE_DRIVEN)        0.22155    0.05407   4.098 4.17e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6894.9  on 29107  degrees of freedom
## Residual deviance: 6669.3  on 29100  degrees of freedom
## AIC: 8657.1
## 
## Number of Fisher Scoring iterations: 7

As for the Poisson regression, we can display the results of the estimations in a table, using memisc::mtable():
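A sketch, assuming the three models were stored under the names listed in the wrap-up above:

```r
mtable("Exp. Time"            = fit_m_Poisson_duration,
       "Distance"             = fit_m_Poisson_distance,
       "Exp. Time & Distance" = fit_m_Poisson_dist_dur,
       summary.stats = c("AIC", "N"))
```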

| | Exp. Time | Distance | Exp. Time & Distance |
|:---|---:|---:|---:|
| (Intercept) | −2.721*** (0.070) | −7.329*** (0.440) | −4.870*** (0.531) |
| factor(RA_GENDER): Male/Female | 0.028 (0.063) | 0.054 (0.063) | 0.034 (0.063) |
| factor(RA_GENDER): Unknown/Female | −0.133 (0.451) | −0.371 (0.450) | −0.135 (0.451) |
| factor(RA_MARITALSTATUS): Other/Married | −0.191** (0.066) | −0.205** (0.066) | −0.194** (0.066) |
| factor(RA_VEH_USE): Other/Commute | −2.714** (0.999) | −2.707** (0.999) | −2.651** (0.999) |
| factor(RA_VEH_USE): Pleasure/Commute | −0.217** (0.066) | −0.087 (0.069) | −0.130 (0.069) |
| log(RA_EXPOSURE_TIME) | 1.002*** (0.086) | | 0.772*** (0.103) |
| log(RA_DISTANCE_DRIVEN) | | 0.454*** (0.046) | 0.222*** (0.054) |
| AIC | 8672.459 | 8713.031 | 8657.139 |
| N | 29108 | 29108 | 29108 |

Significance: *** = p < 0.001; ** = p < 0.01; * = p < 0.05

2.1.5 Wrap-up

In this section, we have estimated the number of claims using different count models. The “time” between two claims was modeled assuming different distributions:

  • Poisson:

    • without offset (mod_claims_poisson_1)
    • using exposure time as an offset (mod_claims_poisson_2)
    • using distance driven as an offset (mod_claims_poisson_3)
  • The Exponential distribution

    • using exposure time as the exposure measure (fit_exponential_duration)
    • which is equivalent to a Poisson model with exposure time as an offset (mod_claims_poisson_2)
  • The Gamma distribution

    • using exposure time as the exposure measure (fit_gamma_exp_time)
    • using distance driven as the exposure measure (fit_gamma_dist)
  • The modified Poisson distribution

    • using exposure time as a covariate (fit_m_Poisson_duration)
    • using distance driven as a covariate (fit_m_Poisson_distance)
    • using both exposure time and distance as covariates (fit_m_Poisson_dist_dur)

2.2 Out-of-Sample

To decide which model best fits the data, we have so far displayed log-likelihood values or criteria such as the Akaike Information Criterion (AIC). All the models were fit on a sub-sample of the (simulated) data. We will now compare the performance of the models on out-of-sample data.

As mentioned in the lesson, it is difficult to see if count models accurately estimate the mean parameter for a given insured. Instead, we will rely on scores that measure the quality of probabilistic forecasts. The lower the score, the better the quality of the forecast.

We will only focus on the gamma and on the modified Poisson distributions here.

2.2.1 Gamma Duration

2.2.1.1 Exposure Time

Recall that the estimation results of the model assuming a Gamma distribution for the time between two claims and using exposure time as the exposure measure are stored in the object fit_gamma_exp_time.

First, we need to prepare the different variables: the exposure measure, the true value of the number of claims and the matrix of predictors in the testing set.
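A sketch (object names are ours):

```r
expo_test <- canada_test$RA_EXPOSURE_TIME
y_test    <- canada_test$RA_ACCIDENT_IND
X_test    <- model.matrix(~ RA_GENDER + RA_MARITALSTATUS + RA_VEH_USE,
                          data = canada_test)
```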

Then, let us extract the estimated coefficients and set \(\lambda\):
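```r
beta_hat   <- fit_gamma_exp_time$par[-length(fit_gamma_exp_time$par)]
alpha_hat  <- fit_gamma_exp_time$par["alpha"]
lambda_hat <- as.vector(exp(X_test %*% beta_hat))
```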

Finally, we can predict the values for the testing set, using a loop:
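A sketch, truncating the series for \(E[N]\) at 20 terms (our choice):

```r
pred_gcd <- numeric(length(y_test))
for (i in seq_along(pred_gcd)) {
  # E[N_i] = sum_{n >= 1} G(alpha * n, alpha * lambda_i * t_i)
  pred_gcd[i] <- sum(pgamma(alpha_hat * lambda_hat[i] * expo_test[i],
                            shape = alpha_hat * (1:20)))
}
```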

The probability can be calculated using the pgamma() function:
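The probability of observing each insured's actual number of claims, under the same parameterization as the likelihood above:

```r
G <- function(s, x) ifelse(s == 0, 1, pgamma(x, shape = s))
m_test    <- alpha_hat * lambda_hat * expo_test
proba_gcd <- G(alpha_hat * y_test, m_test) -
  G(alpha_hat * (y_test + 1), m_test)
```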

Let us define a function that computes the squared error:
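```r
squared_error <- function(pred, obs) sum((pred - obs)^2)
```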

And lastly, we can compute the two scores of interest and store the result in a tibble:
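The logarithmic score is taken as the negative sum of the log-probabilities:

```r
tibble(
  Model           = "GCD (Exp. Time)",
  Logarithmic     = -sum(log(proba_gcd)),
  `Squared Error` = squared_error(pred_gcd, y_test)
)
```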

## # A tibble: 1 x 3
##   Model           Logarithmic `Squared Error`
##   <chr>                 <dbl>           <dbl>
## 1 GCD (Exp. Time)       2731.            635.

2.2.1.2 Distance Driven

We can easily modify the previous code to compute the scores of the GCD where the distance driven is used as the exposure measure:

First, we need to prepare the different variables: the exposure measure, the true value of the number of claims and the matrix of predictors in the testing set.

The estimated coefficients, the predicted values, and the associated probabilities are then obtained exactly as before:
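A sketch, reusing G() and the truncated series defined above:

```r
beta_hat_km   <- fit_gamma_dist$par[-length(fit_gamma_dist$par)]
alpha_hat_km  <- fit_gamma_dist$par["alpha"]
expo_test_km  <- canada_test$RA_DISTANCE_DRIVEN
lambda_hat_km <- as.vector(exp(X_test %*% beta_hat_km))

pred_gcd_km <- sapply(seq_along(y_test), function(i) {
  sum(pgamma(alpha_hat_km * lambda_hat_km[i] * expo_test_km[i],
             shape = alpha_hat_km * (1:20)))
})

m_test_km    <- alpha_hat_km * lambda_hat_km * expo_test_km
proba_gcd_km <- G(alpha_hat_km * y_test, m_test_km) -
  G(alpha_hat_km * (y_test + 1), m_test_km)
```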

And lastly, we can compute the two scores of interest and store the result in a tibble:

## # A tibble: 1 x 3
##   Model          Logarithmic `Squared Error`
##   <chr>                <dbl>           <dbl>
## 1 GCD (Distance)       2736.            635.

2.2.2 Modified Poisson

Using the predict.glm() function, we can easily predict the number of claims according to our different models:
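```r
pred_mp_time <- predict.glm(fit_m_Poisson_duration,
                            newdata = canada_test, type = "response")
pred_mp_dist <- predict.glm(fit_m_Poisson_distance,
                            newdata = canada_test, type = "response")
pred_mp_both <- predict.glm(fit_m_Poisson_dist_dur,
                            newdata = canada_test, type = "response")
```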

Let us define the function logarithm_score_Poisson() that will compute the logarithmic score.
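A sketch, using the Poisson probability mass function:

```r
logarithm_score_Poisson <- function(pred, obs) {
  -sum(dpois(x = obs, lambda = pred, log = TRUE))
}
```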

Let us compute the squared error and the logarithmic score:

## # A tibble: 3 x 3
##   Model                                   Logarithmic `Squared Error`
##   <chr>                                         <dbl>           <dbl>
## 1 Modified Poisson (Exp. Time)                  2732.            635.
## 2 Modified Poisson (Distance)                   2726.            635.
## 3 Modified Poisson (Exp. Time + Distance)       2719.            635.

2.2.3 Wrap-up

To compare the out-of-sample predictive performances of the models, we now have different scores for each of the models:

| Model | Logarithmic | Squared Error |
|:---|---:|---:|
| Modified Poisson (Exp. Time) | 2731.718 | 635.4926 |
| Modified Poisson (Distance) | 2726.487 | 634.9512 |
| Modified Poisson (Exp. Time + Distance) | 2719.291 | 634.5665 |
| GCD (Exp. Time) | 2731.414 | 635.4657 |
| GCD (Distance) | 2735.714 | 635.0224 |

3 Flexible Models: GAM Models

In this section, we will model claims using semi-parametric models: generalized additive models (GAM).

3.1 Independent Cubic Splines

We first consider that the number of claims is following a Poisson distribution of mean \(\lambda_i\), with:

\[\log (\lambda_i) = \beta_0 + f_1(\text{km}_i) + f_2(\text{d}_i),\]

where \(\beta_0\) is the intercept, and \(\text{km}_i\) and \(\text{d}_i\) are the distance traveled and the duration of the insurance contract of insured \(i\), respectively. The functions \(f_1\) and \(f_2\) are assumed here to be cubic splines, i.e., univariate smoothing functions with the following linear form:

\[f(x) = \sum_{k=1}^{q} b_k(x) \beta_k,\]

where the \(\beta_k\), \(k=1, \ldots, q\), are parameters to be estimated by the GAM, and the \(b_k(x)\), \(k=1, \ldots, q\), are functions from a cubic spline basis of dimension \(q\).

We need to choose the number of knots for both \(f_1\) and \(f_2\). We need to be careful here, since increasing the number of knots allows more flexibility but, at the same time, increases the risk of over-fitting.

To estimate a GAM in R, the function gam() from {mgcv} can be used. We set the scale argument to a negative value to state that it is unknown. The function s() from the same package allows us to define smooths in the GAM formula. The number of knots is set through the argument k. As we want cubic splines, we set the argument bs to "cr".

Let us estimate a GAM with six knots for \(f_1\) and six knots for \(f_2\) on our training sample of insured:
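The object name model_1 matches the model list in section 3.4:

```r
library(mgcv)

k <- 6
model_1 <- gam(
  RA_ACCIDENT_IND ~ s(RA_DISTANCE_DRIVEN, bs = "cr", k = k) +
    s(RA_EXPOSURE_TIME, bs = "cr", k = k),
  family = poisson, scale = -1, data = canada_train
)
summary(model_1)
```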

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## RA_ACCIDENT_IND ~ s(RA_DISTANCE_DRIVEN, bs = "cr", k = k) + s(RA_EXPOSURE_TIME, 
##     bs = "cr", k = k)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.46441    0.03556  -97.42   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                         edf Ref.df     F  p-value    
## s(RA_DISTANCE_DRIVEN) 4.852  4.983  4.59 0.000348 ***
## s(RA_EXPOSURE_TIME)   4.749  4.967 13.80 2.22e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.00662   Deviance explained = 2.96%
## GCV = 0.23003  Scale est. = 1.0415    n = 29108

The parametric parameters of the model can be extracted from the object resulting from the estimation (here, the only parametric parameter is the intercept \(\beta_0\)).
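```r
summary(model_1)$p.table
```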

##              Estimate Std. Error   t value Pr(>|t|)
## (Intercept) -3.464411 0.03556106 -97.42148        0

The smooth terms can also be extracted. The resulting table shows the effective degrees of freedom (EDF) as well as the \(F\) value which are the result of a statistical significance test for nonparametric terms.
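```r
summary(model_1)$s.table
```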

##                            edf   Ref.df         F      p-value
## s(RA_DISTANCE_DRIVEN) 4.851891 4.983142  4.590386 3.479437e-04
## s(RA_EXPOSURE_TIME)   4.748824 4.966588 13.803902 2.216688e-13

We can also extract the GCV (generalized cross-validation) score of the model which can later be compared with the GCV from other models:
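```r
model_1$gcv.ubre
```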

##    GCV.Cp 
## 0.2300328

We can use the function plot.gam() (we can also use its alias plot()) to display the predicted values for each function, as well as the associated 95% confidence intervals.
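```r
plot(model_1, pages = 1, scheme = 1)
```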

In the previous code, setting the argument pages to 1 allows the graphs to be displayed on a single page. With 1-d smooths, when the argument scheme is set to 1, the error bounds are displayed using a shaded region (setting it to 0 results in error bounds depicted with dashed curves).

We can create a simple function to vary k:
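A sketch (the function name estim_gam_k is ours):

```r
estim_gam_k <- function(k) {
  gam(
    RA_ACCIDENT_IND ~ s(RA_DISTANCE_DRIVEN, bs = "cr", k = k) +
      s(RA_EXPOSURE_TIME, bs = "cr", k = k),
    family = poisson, scale = -1, data = canada_train
  )
}
```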

Let us model the number of claims with GAM models with different values for k:
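Using the values of k shown in the animation caption below:

```r
models_k <- map(c(2, 5, 10, 20), estim_gam_k)
```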

We can create an animated representation using a loop. At each iteration, we plot the graphs showing the predicted values for each function. The loop is wrapped in the saveGIF() function from [{animation}](https://cran.r-project.org/web/packages/animation/index.html), which saves the frames obtained at each iteration in a gif file.
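A sketch (the output file name is our choice):

```r
library(animation)

saveGIF({
  for (m in models_k) plot(m, pages = 1, scheme = 1)
}, movie.name = "gam_smooths.gif")
```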


Smoothing functions for \(k=2,5,10,20\) knots.

For the sake of the exercise, let us use \(k=1\) for both functions.

The smoothing functions for the chosen GAM are the following:

A 3d graph displaying the surface for every possible pair \((\text{km},\text{d})\), derived from the predictions produced by the estimation can be plotted using the function vis.gam() from {mgcv}:
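```r
vis.gam(model_1, view = c("RA_DISTANCE_DRIVEN", "RA_EXPOSURE_TIME"),
        type = "response", theta = 30, phi = 30)
```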

3.2 Dependent Splines

We will now add an interaction term between the distance travelled and the exposure time. This time, instead of using two separate cubic splines to include the distance travelled and the exposure time in the model, we will use a tensor product smoothing basis.

We still assume that the number of claims follows a Poisson distribution. The model writes:

\[\log(\mu_i) = \beta_0 + f(\text{km}_i, \text{d}_i),\] where \(\beta_0\) corresponds to the independent term of the model and \(f(\text{km}_i, \text{d}_i)\) is a smoothing function which depends on the distance driven \(\text{km}_i\) and the duration \(\text{d}_i\) of insured \(i\). The function \(f\) is built on a tensor product smoothing basis which relates the expectation to the linear predictor.

In R, the tensor product smoothing basis can be defined using the function te() from {mgcv}.
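The object name model_2 matches the model list in section 3.4:

```r
model_2 <- gam(
  RA_ACCIDENT_IND ~ te(RA_DISTANCE_DRIVEN, RA_EXPOSURE_TIME,
                       k = c(6, 3), bs = c("cr", "cr")),
  family = poisson, scale = -1, data = canada_train
)
summary(model_2)
```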

## 
## Family: poisson 
## Link function: log 
## 
## Formula:
## RA_ACCIDENT_IND ~ te(RA_DISTANCE_DRIVEN, RA_EXPOSURE_TIME, k = c(6, 
##     3), bs = c("cr", "cr"))
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.47154    0.03575  -97.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                                           edf Ref.df     F p-value    
## te(RA_DISTANCE_DRIVEN,RA_EXPOSURE_TIME) 15.77  16.07 12.22  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.00683   Deviance explained = 3.16%
## GCV = 0.22965  Scale est. = 1.0386    n = 29108

As in the summary of the model with independent splines, we can extract the parametric parameter estimates (only the constant here), the smooth terms, and the GCV score.
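```r
summary(model_2)$p.table
summary(model_2)$s.table
model_2$gcv.ubre
```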

##              Estimate Std. Error   t value Pr(>|t|)
## (Intercept) -3.471538 0.03574674 -97.11482        0
##                                              edf   Ref.df        F
## te(RA_DISTANCE_DRIVEN,RA_EXPOSURE_TIME) 15.76826 16.07399 12.21948
##                                              p-value
## te(RA_DISTANCE_DRIVEN,RA_EXPOSURE_TIME) 6.882064e-33
##   GCV.Cp 
## 0.229651

The p-value from the non-parametric part of the model is low, suggesting that the interaction term between the distance driven and the exposure time is important to explain the number of claims of the policyholder. The GCV score of this model (0.229651) is not much different from that of the first model (0.2300328), suggesting that model 2 brings little improvement over model 1.

It is not possible to analyze the impact of the distance travelled or the exposure time on the prediction independently as in the first model, since the function \(f(\text{km}_i, \text{d}_i)\) in the second model is not expressed in terms of \(\text{km}\) or \(\text{d}\) separately. We can however plot the surface for the pairs \((\text{km}, \text{d})\) as in the first estimation:
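```r
vis.gam(model_2, view = c("RA_DISTANCE_DRIVEN", "RA_EXPOSURE_TIME"),
        type = "response", theta = 30, phi = 30)
```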

3.3 Pricing Application

In this section, we will compare the conventional methodology used in practice by insurance companies to model the frequency of claims reported by policyholders with the models presented in the previous subsection. Let us consider that the frequency of automobile insurance claims is modeled by insurers with a GLM assuming that the number of reported claims follows a Poisson distribution.

We wish to establish a pricing system based on the results obtained by the GAM. To that end, the distance travelled will be categorized into multiple classes (including a reference category). This will introduce new regressors in the models.

3.3.1 A Model with Only Distances

For simplicity, let us exclude the exposure time from the model (from the surface plots obtained previously with the GAM models, we see that its effect is rather flat over \([0,1]\)).

On the same plots, if we look at the effects of the distance driven, we note a sharp increase in claim frequency over the interval \([0, 10,000]\), and a lower slope for distances greater than 10,000 km. We will use the following segmentation for the distance travelled:

| Variable | Description |
|:---|:---|
| \(x_1\) | Takes value 1 if \(km < 3,000\) |
| \(x_2\) | Takes value 1 if \(3,000 \leq km \leq 7,000\) |
| \(x_3\) | Takes value 1 if \(7,000 \leq km \leq 12,000\) |
| \(x_4\) | Takes value 1 if \(12,000 \leq km \leq 25,000\) |
| \(x_5\) | Takes value 1 if \(km > 40,000\) |

The reference category corresponds to a distance travelled between \(25,000\) km and \(40,000\) km.

Let us define a categorical variable in R which reflects this segmentation:
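A sketch consistent with the segmentation above and with the dummy columns shown later (the exact boundary conventions are our assumption; the reference class is coded 6):

```r
canada_train <-
  canada_train %>%
  mutate(
    cat_km = case_when(
      RA_DISTANCE_DRIVEN < 3000  ~ 1,
      RA_DISTANCE_DRIVEN < 7000  ~ 2,
      RA_DISTANCE_DRIVEN < 12000 ~ 3,
      RA_DISTANCE_DRIVEN < 25000 ~ 4,
      RA_DISTANCE_DRIVEN > 40000 ~ 5,
      TRUE                       ~ 6   # reference: 25,000 to 40,000 km
    )
  )
```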

Using the function dummy_cols() from {fastDummies}, we can create dummy variables from the new column cat_km.
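```r
library(fastDummies)
canada_train <- dummy_cols(canada_train, select_columns = "cat_km")
```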

Let us assume that the number of reported claims follows a Poisson distribution. A multiplicative link is used to specify the relationship between the linear predictor and the expectation of the response variable. Specifically, the model is represented by the following equation:

\[\log(\lambda_i) = \beta_0 + \beta_1 x_{1,i}+ \beta_2 x_{2,i}+ \beta_3 x_{3,i}+ \beta_4 x_{4,i}+ \beta_5 x_{5,i},\]

where \(\beta_0\) is the constant term in the model.

We can estimate the model with the glm() function:
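The object name glm_pois matches the model list in section 3.4:

```r
glm_pois <- glm(
  RA_ACCIDENT_IND ~ cat_km_1 + cat_km_2 + cat_km_3 + cat_km_4 + cat_km_5,
  family = poisson(link = "log"), data = canada_train
)
summary(glm_pois)
```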

## 
## Call:
## glm(formula = RA_ACCIDENT_IND ~ cat_km_1 + cat_km_2 + cat_km_3 + 
##     cat_km_4 + cat_km_5, family = poisson(link = "log"), data = canada_train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3362  -0.3155  -0.2598  -0.2260   4.5816  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.87332    0.12039 -23.868  < 2e-16 ***
## cat_km_1    -1.14306    0.17652  -6.476 9.44e-11 ***
## cat_km_2    -0.79436    0.13768  -5.769 7.96e-09 ***
## cat_km_3    -0.51533    0.13489  -3.820 0.000133 ***
## cat_km_4    -0.12673    0.13083  -0.969 0.332709    
## cat_km_5    -0.07112    0.39667  -0.179 0.857709    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 6894.9  on 29107  degrees of freedom
## Residual deviance: 6779.6  on 29102  degrees of freedom
## AIC: 8763.5
## 
## Number of Fisher Scoring iterations: 6

Let us explore graphically, using a surface plot, how the predictions generated by this model vary depending on the distance driven and the exposure time.

First, we need to create two vectors: one (km) where the distance driven varies and another (dur) where the duration varies (although it is not modeled in the current model):
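A sketch; 30 values per dimension gives the 900 combinations shown below:

```r
km  <- seq(min(canada_train$RA_DISTANCE_DRIVEN),
           max(canada_train$RA_DISTANCE_DRIVEN), length.out = 30)
dur <- seq(min(canada_train$RA_EXPOSURE_TIME),
           max(canada_train$RA_EXPOSURE_TIME), length.out = 30)
```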

Thanks to the expand.grid() function, we can create a data frame from all combinations of duration and distance. We will predict the number of claims for all these combinations, which will then provide the material to plot the response surface.
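```r
grid_pred <-
  expand.grid(RA_DISTANCE_DRIVEN = km, RA_EXPOSURE_TIME = dur) %>%
  as_tibble()
grid_pred
```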

## # A tibble: 900 x 2
##    RA_DISTANCE_DRIVEN RA_EXPOSURE_TIME
##                 <dbl>            <dbl>
##  1                1.1            0.277
##  2             2631.             0.277
##  3             5261.             0.277
##  4             7891.             0.277
##  5            10521.             0.277
##  6            13151.             0.277
##  7            15781.             0.277
##  8            18411.             0.277
##  9            21041.             0.277
## 10            23671.             0.277
## # … with 890 more rows

We need to create the dummy variables providing the segments from the distance variable:
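Reusing the segmentation defined earlier:

```r
grid_pred <-
  grid_pred %>%
  mutate(
    cat_km = case_when(
      RA_DISTANCE_DRIVEN < 3000  ~ 1,
      RA_DISTANCE_DRIVEN < 7000  ~ 2,
      RA_DISTANCE_DRIVEN < 12000 ~ 3,
      RA_DISTANCE_DRIVEN < 25000 ~ 4,
      RA_DISTANCE_DRIVEN > 40000 ~ 5,
      TRUE                       ~ 6
    )
  ) %>%
  dummy_cols(select_columns = "cat_km")
grid_pred
```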

## # A tibble: 900 x 9
##    RA_DISTANCE_DRI… RA_EXPOSURE_TIME cat_km cat_km_1 cat_km_2 cat_km_3
##               <dbl>            <dbl>  <dbl>    <int>    <int>    <int>
##  1              1.1            0.277      1        1        0        0
##  2           2631.             0.277      1        1        0        0
##  3           5261.             0.277      2        0        1        0
##  4           7891.             0.277      3        0        0        1
##  5          10521.             0.277      3        0        0        1
##  6          13151.             0.277      4        0        0        0
##  7          15781.             0.277      4        0        0        0
##  8          18411.             0.277      4        0        0        0
##  9          21041.             0.277      4        0        0        0
## 10          23671.             0.277      4        0        0        0
## # … with 890 more rows, and 3 more variables: cat_km_4 <int>,
## #   cat_km_6 <int>, cat_km_5 <int>

Now we can predict claim frequency for each combination of duration and distance:
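```r
grid_pred$pred <- predict(glm_pois, newdata = grid_pred, type = "response")
```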

And plot the response surface:
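A sketch using base R's persp() (expand.grid() varies the first variable fastest, hence the column-major matrix fill):

```r
z <- matrix(grid_pred$pred, nrow = length(km), ncol = length(dur))
persp(km, dur, z, theta = 30, phi = 30,
      xlab = "Distance driven (km)", ylab = "Exposure time",
      zlab = "Claim frequency")
```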

Once again, using the {animation} package, we can create a gif showing the response surface from different angles:
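A sketch (the rotation step and file name are our choices):

```r
saveGIF({
  for (angle in seq(0, 350, by = 10)) {
    persp(km, dur, z, theta = angle, phi = 30,
          xlab = "Distance driven (km)", ylab = "Exposure time",
          zlab = "Claim frequency")
  }
}, movie.name = "glm_surface.gif")
```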

Response surface for the Poisson model.


As mentioned in the lesson, this parametric model only roughly approximates the surface illustrated in the previous figures.

3.4 Model Comparison

In this section, to predict claim frequency, we have estimated three models:

  • a GAM with independent cubic splines (model_1)
  • a GAM with tensor product (model_2)
  • a GLM assuming a Poisson distribution (glm_pois).

To compare these models, we can measure their predictive performance against out-of-sample data. As in the section on duration models, we will rely on two scores: the logarithmic score and the squared error.

3.4.1 Out-of-Sample Predictions

First, we need to prepare the testing set to obtain the segmentation variables.
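Reusing the same segmentation as on the training set:

```r
canada_test <-
  canada_test %>%
  mutate(
    cat_km = case_when(
      RA_DISTANCE_DRIVEN < 3000  ~ 1,
      RA_DISTANCE_DRIVEN < 7000  ~ 2,
      RA_DISTANCE_DRIVEN < 12000 ~ 3,
      RA_DISTANCE_DRIVEN < 25000 ~ 4,
      RA_DISTANCE_DRIVEN > 40000 ~ 5,
      TRUE                       ~ 6
    )
  ) %>%
  dummy_cols(select_columns = "cat_km")
```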

Now we can use the predict() function to predict claim frequency for our three models:
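The column names match the output below:

```r
predictions <- tibble(
  obs          = canada_test$RA_ACCIDENT_IND,
  pred_gam_ind = predict(model_1, newdata = canada_test, type = "response"),
  pred_gam_dep = predict(model_2, newdata = canada_test, type = "response"),
  pred_glm     = predict(glm_pois, newdata = canada_test, type = "response")
)
predictions
```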

## # A tibble: 19,570 x 4
##      obs pred_gam_ind pred_gam_dep pred_glm
##    <dbl>        <dbl>        <dbl>    <dbl>
##  1     0       0.0199       0.0193   0.0255
##  2     0       0.0579       0.0580   0.0498
##  3     0       0.0377       0.0331   0.0255
##  4     0       0.0122       0.0135   0.0180
##  5     0       0.0192       0.0191   0.0255
##  6     0       0.0185       0.0183   0.0255
##  7     0       0.0154       0.0167   0.0180
##  8     0       0.0424       0.0556   0.0498
##  9     0       0.0550       0.0573   0.0565
## 10     0       0.0159       0.0172   0.0180
## # … with 19,560 more rows

3.4.2 Boxplots

Boxplots can also be used to study out-of-sample performance for count data. We will construct them so that they show the predicted residuals for each model, depending on the number of claims. We expect the residuals to be as close to zero as possible.

Let us compute residuals for the three models, as well as the probability of observing \(n_i\) for each insured \(i\):
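A sketch consistent with the output below (res_1 is the residual, res_2 the probability of observing \(n_i\) under the Poisson assumption; res_3 is read here as the squared residual, an assumption on our part):

```r
predictions <-
  predictions %>%
  mutate(
    res_1_gam_ind = pred_gam_ind - obs,
    res_2_gam_ind = dpois(obs, lambda = pred_gam_ind),
    res_3_gam_ind = (pred_gam_ind - obs)^2,
    res_1_gam_dep = pred_gam_dep - obs,
    res_2_gam_dep = dpois(obs, lambda = pred_gam_dep),
    res_3_gam_dep = (pred_gam_dep - obs)^2,
    res_1_glm     = pred_glm - obs,
    res_2_glm     = dpois(obs, lambda = pred_glm),
    res_3_glm     = (pred_glm - obs)^2
  )
predictions
```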

## # A tibble: 19,570 x 13
##      obs pred_gam_ind pred_gam_dep pred_glm res_1_gam_ind res_2_gam_ind
##    <dbl>        <dbl>        <dbl>    <dbl>         <dbl>         <dbl>
##  1     0       0.0199       0.0193   0.0255        0.0199         0.980
##  2     0       0.0579       0.0580   0.0498        0.0579         0.944
##  3     0       0.0377       0.0331   0.0255        0.0377         0.963
##  4     0       0.0122       0.0135   0.0180        0.0122         0.988
##  5     0       0.0192       0.0191   0.0255        0.0192         0.981
##  6     0       0.0185       0.0183   0.0255        0.0185         0.982
##  7     0       0.0154       0.0167   0.0180        0.0154         0.985
##  8     0       0.0424       0.0556   0.0498        0.0424         0.959
##  9     0       0.0550       0.0573   0.0565        0.0550         0.946
## 10     0       0.0159       0.0172   0.0180        0.0159         0.984
## # … with 19,560 more rows, and 7 more variables: res_3_gam_ind <dbl>,
## #   res_1_gam_dep <dbl>, res_2_gam_dep <dbl>, res_3_gam_dep <dbl>,
## #   res_1_glm <dbl>, res_2_glm <dbl>, res_3_glm <dbl>

To be able to graph the boxplots, the data needs to be further pre-processed (to be in a correct format for use with ggplot):
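A sketch reshaping the residuals to the long format (the model labels are ours):

```r
residuals_df <-
  predictions %>%
  mutate(id = row_number()) %>%
  select(obs, id, res_1_gam_ind, res_1_gam_dep, res_1_glm) %>%
  pivot_longer(starts_with("res_1"),
               names_to = "model", values_to = "residuals") %>%
  mutate(
    model_label = case_when(
      model == "res_1_gam_ind" ~ "GAM (indep. splines)",
      model == "res_1_gam_dep" ~ "GAM (tensor product)",
      model == "res_1_glm"     ~ "GLM"
    ),
    nb_claim = factor(
      ifelse(obs <= 1, paste(obs, "claim"), paste(obs, "claims"))
    )
  )
residuals_df
```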

## # A tibble: 58,710 x 6
##      obs    id model         residuals model_label nb_claim
##    <dbl> <int> <chr>             <dbl> <chr>       <fct>   
##  1     0     1 res_1_gam_ind    0.0199 GLM         0 claim 
##  2     0     2 res_1_gam_ind    0.0579 GLM         0 claim 
##  3     0     3 res_1_gam_ind    0.0377 GLM         0 claim 
##  4     0     4 res_1_gam_ind    0.0122 GLM         0 claim 
##  5     0     5 res_1_gam_ind    0.0192 GLM         0 claim 
##  6     0     6 res_1_gam_ind    0.0185 GLM         0 claim 
##  7     0     7 res_1_gam_ind    0.0154 GLM         0 claim 
##  8     0     8 res_1_gam_ind    0.0424 GLM         0 claim 
##  9     0     9 res_1_gam_ind    0.0550 GLM         0 claim 
## 10     0    10 res_1_gam_ind    0.0159 GLM         0 claim 
## # … with 58,700 more rows

Then the boxplots can be plotted. We create a faceting depending on the number of claims.
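```r
ggplot(residuals_df, aes(x = model_label, y = residuals)) +
  geom_boxplot() +
  facet_wrap(~nb_claim, scales = "free_y") +
  labs(x = NULL, y = "Residuals")
```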

Based on the logarithmic score, we computed the probability of observing \(n_i\) claims for each insured \(i\), depending on the model. In this case, for a good model, we expect the boxplots to show that many insureds have a probability close to one.

As for the residuals, the data containing the probabilities needs to be processed before it can be used by ggplot.

## # A tibble: 58,710 x 6
##      obs    id model         proba model_label nb_claim
##    <dbl> <int> <chr>         <dbl> <chr>       <fct>   
##  1     0     1 res_2_gam_ind 0.980 GLM         0 claim 
##  2     0     2 res_2_gam_ind 0.944 GLM         0 claim 
##  3     0     3 res_2_gam_ind 0.963 GLM         0 claim 
##  4     0     4 res_2_gam_ind 0.988 GLM         0 claim 
##  5     0     5 res_2_gam_ind 0.981 GLM         0 claim 
##  6     0     6 res_2_gam_ind 0.982 GLM         0 claim 
##  7     0     7 res_2_gam_ind 0.985 GLM         0 claim 
##  8     0     8 res_2_gam_ind 0.959 GLM         0 claim 
##  9     0     9 res_2_gam_ind 0.946 GLM         0 claim 
## 10     0    10 res_2_gam_ind 0.984 GLM         0 claim 
## # … with 58,700 more rows

Then the boxplots can be plotted. Again, we create a faceting depending on the number of claims.

From all the results, we see that the GAMs perform better than the simple GLM that was constructed to approximate the distance-driven structure suggested by both GAMs. The difference is, however, surprisingly small, and the GAMs tend to overfit the data. It should also be remembered that the exposure time was excluded from the GLM: only the distance has been used. Such a choice implies that the price paid by an insured who drove 20,000 km should be the same regardless of how long it took to cover this distance.