Jean-Philippe Boucher, Université du Québec À Montréal (🐦 @J_P_Boucher)
Arthur Charpentier, Université du Québec À Montréal (🐦 @freakonometrics)
Ewen Gallic, Aix-Marseille Université (🐦 @3wen)
Analyse the composition of the portfolio: covariates, frequency by covariates.
First, load the training and testing sets, located in the folder data/canada_panel/ (CanadaPanelTrain.csv and CanadaPanelTest.csv, respectively).
library(tidyverse)
canada_train <- readr::read_csv("./data/canada_panel/CanadaPanelTrain.csv")
canada_test <- readr::read_csv("./data/canada_panel/CanadaPanelTest.csv")
nrow(canada_train)
canada_train
Define a function that computes summary statistics for a numeric vector. Try to account for possible NA values. Name the function my_summary.
#' my_summary
#' Returns a tibble with summary statistics for a numerical vector
#' @param x vector of numerics
my_summary <- function(x) {
  tibble(
    Average = mean(x, na.rm = TRUE),
    Min = min(x, na.rm = TRUE),
    Max = max(x, na.rm = TRUE),
    # quantile() stops with an error on NA input unless na.rm = TRUE
    `10th percentile` = quantile(x, probs = .1, na.rm = TRUE, names = FALSE),
    `25th percentile` = quantile(x, probs = .25, na.rm = TRUE, names = FALSE),
    Median = median(x, na.rm = TRUE),
    `75th percentile` = quantile(x, probs = .75, na.rm = TRUE, names = FALSE),
    `90th percentile` = quantile(x, probs = .9, na.rm = TRUE, names = FALSE)
  )
}
Use the function on a single variable from the training sample:
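With the real data, the call would presumably be my_summary(canada_train$RA_DISTANCE_DRIVEN), RA_DISTANCE_DRIVEN being one of the variables used later in this exercise. The sketch below is a self-contained stand-in: the toy values and the trimmed helper my_summary_demo are made up for illustration, and only show how a missing value is handled.

```r
library(tibble)

# Toy stand-in for the training set; the values are invented for illustration
toy_train <- tibble(RA_DISTANCE_DRIVEN = c(1200, 560, NA, 8900, 4300))

# Trimmed, hypothetical copy of the summary helper so the block runs on its own
my_summary_demo <- function(x) {
  tibble(
    Average = mean(x, na.rm = TRUE),
    Median = median(x, na.rm = TRUE),
    `90th percentile` = quantile(x, probs = .9, na.rm = TRUE, names = FALSE)
  )
}

# Apply the helper to a single column; the NA is dropped before computing
single_summary <- my_summary_demo(toy_train$RA_DISTANCE_DRIVEN)
single_summary
```

Without na.rm = TRUE, mean() and median() would return NA for this column, and quantile() would stop with an error.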
Apply this function to multiple variables of your choice to explore the dataset and put the results in a table. You may either use a for loop, lapply() or map() (pick the option you are more comfortable with). Do not forget to add the variable name to be able to identify the variables.
some_variables <- c("RA_DISTANCE_DRIVEN", "RA_NBTRIP", "RA_HOURS_DRIVEN")
# Using a for loop
summary_table_canada <- NULL
for (var in some_variables) {
  summary_table_canada <-
    summary_table_canada %>%
    bind_rows(
      my_summary(canada_train[[var]])
    )
}
summary_table_canada$variable <- some_variables
summary_table_canada
# Using lapply
summary_table_canada <-
  lapply(some_variables, function(var) my_summary(canada_train[[var]])) %>%
  bind_rows()
summary_table_canada$variable <- some_variables
summary_table_canada
# Using map
summary_table_canada <-
  canada_train %>%
  dplyr::select(dplyr::all_of(some_variables)) %>%
  purrr::map(my_summary) %>%
  bind_rows(.id = "variable")
summary_table_canada
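In the for-loop and lapply variants, the variable names are attached by position after the table is built, which could mislabel rows if the results were ever filtered or reordered before the assignment. One possible alternative, sketched here on made-up data, is to name the list elements before binding so that bind_rows(.id = ...) carries the names along. The toy tibble, the column names a and b, and the one-column summary are assumptions for illustration.

```r
library(dplyr)

# Made-up data standing in for the training set
toy <- tibble(a = c(1, 2, NA), b = c(10, 20, 30))
vars <- c("a", "b")

# Name each list element after its variable, then let bind_rows() build the id column
summary_by_name <-
  lapply(setNames(vars, vars), function(v) {
    tibble(Average = mean(toy[[v]], na.rm = TRUE))
  }) %>%
  bind_rows(.id = "variable")

summary_by_name
```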
Plot the distribution of claims as a bar plot. You may also display the distribution of gender within each claim level (by showing the proportion of each gender within the bars). The result may look like the figure displayed below:
p_distribution_claims_gender <-
  canada_train %>%
  # Count policies for each combination of claim level and gender
  dplyr::group_by(RA_ACCIDENT_IND, RA_GENDER) %>%
  dplyr::tally() %>%
  mutate(RA_GENDER = factor(RA_GENDER, levels = c("2", "1", "3"),
                            labels = c("Female", "Male", "Unknown"))) %>%
  ggplot(aes(x = reorder(RA_ACCIDENT_IND, -n), y = n, fill = RA_GENDER)) +
  geom_col() +
  scale_fill_manual(name = "Gender",
                    values = c("Female" = "#66c2a5",
                               "Male" = "#fc8d62",
                               "Unknown" = "#8da0cb")) +
  labs(x = "Number of Claims", y = "Frequency",
       title = "Distribution of Number of Claims and Gender") +
  # Legend below the graph
  theme(legend.position = "bottom")
# Name the arguments explicitly: ggsave() expects the file name first, then the plot
ggsave(filename = "figs/p_distribution_claims_gender.png",
       plot = p_distribution_claims_gender, width = 10, height = 6)
p_distribution_claims_gender
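To read the gender share within each bar as a number rather than from the colours, the counts can be turned into within-level proportions. This is a minimal sketch on invented counts; the column names claims and gender are placeholders, not the dataset's actual names.

```r
library(dplyr)

# Invented counts per claim level and gender (placeholder column names)
toy_counts <- tibble(
  claims = c(0, 0, 1, 1),
  gender = c("Female", "Male", "Female", "Male"),
  n = c(60, 40, 3, 7)
)

# Share of each gender within a claim level: n divided by the level total
toy_props <-
  toy_counts %>%
  group_by(claims) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

toy_props
```

In ggplot2, geom_col(position = "fill") draws the same proportions directly, with every bar rescaled to a height of 1.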
Plot a boxplot of exposure time depending on marital status and vehicle use. Use faceting to separate the data according to vehicle use. The resulting plot should look like the one below:
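A possible starting point, sketched on simulated data because the relevant column names are not shown in this section: exposure, marital_status, and vehicle_use are hypothetical names, as are the category labels, and should be replaced with the dataset's actual variables.

```r
library(ggplot2)

# Simulated data; all column names and category labels are placeholders
set.seed(1)
toy_portfolio <- data.frame(
  exposure = runif(200, min = 0, max = 1),
  marital_status = sample(c("Married", "Single", "Other"), 200, replace = TRUE),
  vehicle_use = sample(c("Commute", "Pleasure"), 200, replace = TRUE)
)

# One box per marital status, one panel per vehicle use
p_exposure <-
  ggplot(toy_portfolio, aes(x = marital_status, y = exposure)) +
  geom_boxplot() +
  facet_wrap(~ vehicle_use) +
  labs(x = "Marital status", y = "Exposure time")

p_exposure
```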