Tonight I am taking part in the Machine Learning Aix-Marseille Meetup, for the second session of its fourth edition. I am speaking after Leonardo Noleto, senior data scientist at Bleckwen FinTech, who is developing a solution to fight financial fraud with machine learning. I will present the project that Enora Belz, Romain Gaté, Vincent Malardé, Jimmy Merlet, Arthur Charpentier and I worked on last summer for the 2018 Football World Cup (see a previous post). The idea was to use machine learning techniques to predict the outcome of football matches (win, draw or defeat).

The slides are available here (in French): http://www.egallic.fr/Recherche/Worldcup_2018/2018_meetup_ML/egallic_meetup.html

As part of a Python programming course for graduate students (Magistère Ingénieur Économiste) at Aix-Marseille School of Economics, I have prepared course notes. These are available in different formats (in French):

These documents were produced using the R package {bookdown}; all the files needed for compilation are available in a GitHub repository.

The documents may be updated from time to time.

On this occasion, before lunch, I will present our working paper, co-authored with Arthur Charpentier, on the use of collaborative genealogy data in historical demography. The presentation is a lightning talk: 14 slides that advance automatically every 24 seconds.

The slides are online (in French), and the R code is available on GitHub.

During Euro 2008 and the 2010 World Cup, the Oberhausen oracle (more commonly known as “Paul the octopus”) made the headlines. His correct predictions of the German team's results at Euro 2008 and his pick of the winner of the 2010 World Cup (Spain) are still etched in people's memories. With some colleagues (Enora Belz, Romain Gaté, Vincent Malardé and Jimmy Merlet), we tried to continue the work of the late Paul the octopus and predict the outcome of the upcoming matches of the 2018 World Cup. To do this, we rely on the results of past World Cup and continental cup matches.^{1}

*Note*: the display is optimized for reading on a computer; some graphics are not accessible on a mobile phone.

Forecasts are based on actual data from international competitive football matches (excluding friendly matches) since August 1993. The set of variables used is described in our working paper. We use the results of previous matches, the rank of team 1 in the FIFA World Ranking, the difference between that rank and the rank of team 2, the offensive/defensive form of each team (the number of goals scored/conceded in the last three matches, on average), the type of match (a world competition such as the World Cup, or a continental competition such as the European Championship), the phase of the competition (preliminary or final), the month, the year and the continent.

At your own risk: forecasting is not the same as knowing. Even if the results of past matches have some predictive power, the result of a match is obviously determined by the talent of the players, and it also involves a share of chance.

When we apply our models to new matches, which were not used for estimation, they predict the correct result in about 60% of cases. They are therefore wrong in the remaining 40% of cases. In comparison, picking one of the three outcomes (1 / draw / 2) at random is correct only a third of the time, or 33%.

Predicting the results of a football match with so few variables in our models is a difficult exercise. However, even adding many variables, as online betting operators can do, the predictive quality of the models would be far from perfect. At least that is what we can read in the academic literature on this subject.

Simply put, our forecasts are based on probabilities. The real result of the 2018 World Cup will probably be different from what we are proposing here. The idea is that, if this exercise were repeated a very large number of times, our predictions would do better than pure chance at picking the winner.

Here they come! We offer several types of forecasts:

- group match forecasts, which give for each match the probabilities of each outcome;
- the probability of winning the World Cup for each team;
- the probability of being eliminated in each round, depending on how far the team has already progressed in the competition;
- probable paths.

For the group matches, we already know which teams will meet. All we have to do is ask our models for the result of each match. There is just one small catch: to make a forecast, our models rely on past results, notably through the offensive and defensive form variables and the results of the last three games. For the offensive and defensive form variables, we set the values to the last ones observed and keep them fixed throughout the competition. The outcomes of the last three games, on the other hand, are updated after each match.

Without further ado, here are the results. The graph below shows, for a given match, the probabilities of a victory for team 1 (on the left), a draw (in the middle) or a victory for team 2 (on the right). By default, the graph shows the results for the opening match of the competition between Russia and Saudi Arabia; to change matches, simply use the menu at the top left of the graph to select another one. Our favourite model (the drop-down menu on the right shows the results of the other models) gives Russia as the winner of this match with a probability of 53.38%. The probability of a draw is lower (27.03%) and that of a Saudi Arabia win is lower still (19.59%).

After each team has played its three games, the group rankings are calculated. Points are awarded to each team after each match: 3 points for a win, 1 for a draw, 0 for a loss. Once all the group matches have been forecast, each group is ranked by the number of points obtained over the three matches each team has played. In the event of a tie, FIFA regulations state that the goal difference over all group matches is decisive. If the teams are still tied, the greater number of goals scored is used to separate them. If there is still a tie, other criteria based on the number of goals are used. As a last resort, FIFA draws lots. As the models in this study do not predict the number of goals, it is impossible to use the criteria normally applicable, with the exception of the random draw. Therefore, in case of a tie in a group ranking, lots are drawn to decide between the teams.
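The ranking rule described above can be sketched in a few lines of R. This is only an illustration under stated assumptions: the points table is made up, and `rank_group()` is a hypothetical helper, not the code used for the study. Ties on points are broken by a random key, which plays the role of the drawing of lots.

```r
# Hypothetical points table for one simulated group (illustrative values,
# consistent with 3 points for a win, 1 for a draw, 0 for a loss).
group <- data.frame(
  team   = c("Russia", "Saudi Arabia", "Egypt", "Uruguay"),
  points = c(6, 1, 4, 6)
)

# rank_group() is a hypothetical helper: it sorts teams by decreasing points
# and breaks ties with a random key, mimicking the drawing of lots.
rank_group <- function(group) {
  tie_breaker <- runif(nrow(group))
  ranked <- group[order(-group$points, tie_breaker), ]
  ranked$rank <- seq_len(nrow(ranked))
  rownames(ranked) <- NULL
  ranked
}

set.seed(2018)
ranking <- rank_group(group)
print(ranking)
```

Here Russia and Uruguay are tied on 6 points, so which of the two finishes first depends on the random draw, while Egypt and Saudi Arabia always finish third and fourth.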

For the subsequent phases of the competition, all that is needed is to follow the schedule set by FIFA, which pairs group winners with runners-up in the Round of 16: the winner of Group A against the runner-up of Group B, the winner of Group C against the runner-up of Group D, etc. The winners continue to the quarter-finals, then the semi-finals and eventually the final.

The table below shows the probability of victory for each team. Our favourite model gives Brazil as the team with the highest probability (19.1%) of winning the 2018 World Cup. Next come Germany (14.5%) and Spain (10.6%).

Team | Probability of winning (%) |
---|---|
Brazil | 19.124 |
Germany | 14.522 |
Spain | 10.644 |
France | 9.708 |
Portugal | 8.248 |
Switzerland | 6.936 |
Belgium | 6.708 |
England | 5.386 |
Poland | 3.702 |
Peru | 3.072 |
Denmark | 2.472 |
Argentina | 2.252 |
Croatia | 1.718 |
Uruguay | 1.632 |
Mexico | 1.396 |
Colombia | 0.632 |
Tunisia | 0.402 |
Sweden | 0.230 |
Egypt | 0.208 |
Iceland | 0.160 |
Costa Rica | 0.136 |
Russia | 0.102 |
IR Iran | 0.100 |
Senegal | 0.076 |
Morocco | 0.074 |
Nigeria | 0.064 |
Japan | 0.058 |
Australia | 0.056 |
Saudi Arabia | 0.056 |
Serbia | 0.050 |
Korea Republic | 0.040 |
Panama | 0.036 |

**Table 1.** *Estimated probability of winning the 2018 World Cup.*

Let’s focus on one team at a time. What is its risk of going out in the group phase? In the round of 16? In the quarter-finals? In the final? To answer this question, we look again at the results of our simulations. For each team, we count the number of simulations in which it is eliminated in each phase. Then we divide that number by the total number of simulations. This gives the proportion of simulations in which each team goes out in the group phase, the round of 16, the quarter-finals, etc.

By default, the graph below shows the case of Argentina. Among our 50,000 simulations, 20.8% saw Argentina finish 3rd or 4th in their group and thus stop after their first three matches; in 37.65% Argentina went out in the Round of 16, in 23.64% in the quarter-finals, in 12% in the semi-finals and in 3.65% in the final. As in the previous table, 2.25% of the simulations give Argentina as the winner of the World Cup.
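As a rough sketch of this counting step, assume the furthest stage reached by one team has been recorded for each simulation. The ten outcomes below are invented for illustration (the study uses 50,000 simulations), and the object names are hypothetical.

```r
# Ordered stages at which a team's run can end
stages <- c("Group stage", "Round of 16", "Quarter-finals",
            "Semi-finals", "Final", "Winner")

# Hypothetical record: where one team's run ended in each of 10 simulations
simulated_exit <- factor(
  c("Round of 16", "Group stage", "Quarter-finals", "Round of 16",
    "Semi-finals", "Round of 16", "Group stage", "Quarter-finals",
    "Round of 16", "Winner"),
  levels = stages
)

# Count the simulations ending at each stage, then divide by their total number
counts <- table(simulated_exit)
shares <- setNames(as.vector(counts) / length(simulated_exit), names(counts))
print(shares)
```

Using a factor with explicit levels keeps stages that never occur (here the lost final) in the result with a share of zero.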

To see what is happening for another team, as before, simply scroll down the menu at the top left of this graph.

What happens now if we want to look at the distribution of the different outcomes in the competition **conditional** on a given team having already passed a stage? To answer this question, choose a phase already passed in the drop-down menu at the top right of the graph. Let’s take the example of Argentina again, and see what happens if it gets past the round of 16 (select the value `Round of 16` in the right menu). The results are as follows: among the simulations in which Argentina qualified for the quarter-finals, they lost that quarter-final in 57% of cases. In 29% of cases, they reached the semi-finals but were defeated there. Argentina won the cup in 5% of the simulations in which they got past the round of 16.
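These conditional figures can be recovered from the unconditional shares reported above for Argentina. The sketch below only illustrates the arithmetic: it keeps the outcomes compatible with having passed the round of 16 and rescales them so that they sum to 100%. The variable names are ours, not those of the study.

```r
# Shares reported above for Argentina, in % of the 50,000 simulations
argentina <- c(
  "Group stage"    = 20.80,
  "Round of 16"    = 37.65,
  "Quarter-finals" = 23.64,
  "Semi-finals"    = 12.00,
  "Final"          = 3.65,
  "Winner"         = 2.25
)

# Keep only the outcomes in which Argentina got past the round of 16 ...
passed_r16 <- argentina[c("Quarter-finals", "Semi-finals", "Final", "Winner")]

# ... and rescale them by the probability of getting that far
conditional <- 100 * passed_r16 / sum(passed_r16)
print(round(conditional))
```

Rounded to the nearest percent, this gives 57% for an exit in the quarter-finals, 29% in the semi-finals and 5% for winning the cup, in line with the values read on the graph.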

Having the odds of winning the World Cup or losing in the quarter-finals or finals is all well and good, but it doesn’t tell us what the likely paths of each team in the competition are.

Be careful: understanding the graphs that follow can be a little tricky. It is easy to take shortcuts, and the resulting interpretation is then completely wrong.

To find out which opponents a team may face, we use the simulations to follow the possible paths of each team. Figure 3 shows, in tree form, all the paths obtained during the 50,000 simulations for each of the 5 top teams. The tree of a team is composed of a root (the name of the team) and leaves (the phases of play and potential opponents) linked together by branches. The size of a leaf is proportional to the number of simulations in which the event described by the leaf was observed. This number is indicated on the second line of the label that appears when hovering over a leaf.

Thus, for the tree of France (displayed by default; use the menu above the graph to display the tree of another country), the root indicates that the tree refers to 50,000 simulations. The following leaves show the ranking obtained in the simulations at the end of the group phase: 27,526 cases in which France finished first in its group, 12,755 in which it finished second, and 9,735 cases in which it did not get past the group phase (7,109 third and 2,626 last). Clicking on a leaf whose legend indicates the ranking at the end of the group matches (*First*, *Second*, *Third* or *Fourth*) displays the rest of the competition. For example, clicking on the *First* leaf for France reveals four potential opponents for the Round of 16: Argentina, Croatia, Iceland and Nigeria. Croatia’s leaf is the largest, which means that if France finishes first in its group and thus qualifies for the Round of 16, its most likely opponent is Croatia. Clicking from leaf to leaf reveals the different possible routes for France (it is possible to zoom with the mouse wheel or the touchpad).

We propose another way of representing the possible paths of each team, this time for all the competitors (and no longer only the 5 teams with the highest probability of winning the cup). This other representation, called a “Sunburst”, is perhaps a little less understandable at first glance. Here is how it works. The reasoning is identical to that used to read the previous graph. After selecting a team (by default, France is displayed), the different phases of the competition for that team are displayed in the form of rings. Each ring is split in proportion to the number of simulations in which the corresponding outcome (displayed when the mouse hovers over the ring) is observed. Clicking on a ring portion hides the remaining portions, to make viewing and navigation easier. To display the hidden rings again, simply click on the central circle of the graph. At any time, the path taken to the current view can be read by following the arrows at the top of the graph.

We would like to point out that the reasoning adopted to read the two previous graphs does not necessarily lead to the most probable outcome: the process is gradual, and many possible outcomes are no longer taken into account once a choice has been made. Let us take an example to clarify this point. Consider a three-stage competition: group matches, a semi-final and a final. Let us assume for simplicity that 100 simulations have been performed and that the results obtained are as indicated on the probability tree below. If we follow the reasoning adopted previously to describe a team’s path during the competition, we must proceed as follows: the team finishes first in its group and thus reaches the semi-final. Knowing this, it will win its match in 20 simulations and will lose in 15. We will then consider that it reaches the final, and that it will win in 15 simulations. Thus, this most likely *path* proclaims this team the winner of the tournament. However, this is not the most likely *outcome*. Indeed, if we look closely at the tree, this team fails to win the competition in 83 cases out of 100. Its probability of losing is much higher than its probability of winning. In summary, the most likely path does not necessarily lead to the most likely outcome of the competition.

We are junior researchers in economics, members of the Centre de Recherche en Économie et Management. We are also part of an association, named PROJECT (PROmotion des Jeunes ÉConomistes en Thèse, literally, Promotion of Young Economists in Thesis).

In alphabetical order:

Statistical learning techniques are currently not widely used in academic economics. Some researchers are trying to convince their peers that economics could benefit from the advances made in other disciplines using statistical tools related to *big data*. To increase our knowledge of these techniques, we decided to use this World Cup year to test different methods with real data. The results obtained led us to believe that it could be interesting to share them.

The slides that will be displayed during the presentation are available below (in French).

I presented the new version of the real business cycle model we have been working on with Gauthier Vermandel. This model aims at investigating the short-run effects of weather shocks on business cycles, as well as the potential long-run effects of climate change on macroeconomic volatility and welfare. The working paper is available on RePEc: Weather Shocks, Climate Change and Business Cycles.

The slides from my presentation are available below.

In the digital age, collaborative data can be collected massively at low cost. Genealogy sites are blooming on the Internet, offering their users the chance to build their family tree online. The collection and digitization work done by these users can potentially be reused in historical demography to complete our knowledge of our ancestors’ past. In our study, based on records of 2,457,450 French or French-born individuals who lived in the nineteenth century, we show that it is possible to reproduce certain results from the literature, although some biases remain. We propose to explore the temporal characteristics contained in the family trees to study longevity. We also investigate the spatial characteristics of the data to analyze internal migration within France.

In this paper, we explore a dataset of 2.45 million individuals, corresponding to people born between 1800 and 1804 in France and their descendants over 3 generations. The raw data was huge: more than 700 million lines. Each line represents an event (birth, marriage or death) for an individual in the tree of a geneanet.org user. However, as each user creates their own tree (it should be noted that we do not have access to the trees of users who did not want to make them public), some individuals are duplicated in the database. A lot of work went into matching and cleaning the trees, which left 2.45 million people at the end of the day.

In the paper, we investigated two aspects: a first using temporal characteristics, i.e., the mortality of individuals; and a second exploring spatial characteristics, i.e., the migratory movements from generation to generation.

A small snapshot of what has been done is shown in the figure below, for which we have drawn estimates of survival function (left) and force of mortality (right). We compared our estimates with those of Vallin and Meslé (2001).

With regard to migration, for example, we examined the distances between the birthplaces of ancestors born between 1800 and 1804 and those of their descendants. The figure below shows the distribution of these distances, with a logarithmic scale on the x-axis.

The rest of the paper is available online on HAL: https://hal.archives-ouvertes.fr/hal-01724269/document. We also provide a companion online methodology annex, published on GitHub.

Vallin, J. et Meslé, F. (2001). Tables de mortalité françaises pour les XIXe et XXe siècles et projections pour le XXIe siècle. Éditions de l’Institut national d’études démographiques.

In my current work, I need to identify, for a given commune, which other communes lie nearby, within a given radius of 20 km. To get this information, I relied on the commune data from Open Street Map. The idea is simple:

- retrieve the boundaries of the communes;
- extend them;
- check which communes intersect the extended boundaries.

To begin, I retrieve the data from data.gouv.fr, the French open platform for public data. In particular, I download the most recent *shapefile* of the communes (January 2018 at the time of writing).

The rest is done in `R`. I start by loading some *packages*:

```
library(tidyverse)
library(lubridate)
library(stringr)
library(stringi)
library(data.table)
library(pbapply)
library(dplyrExtras)
library(stringdist)
library(sp)
library(rgeos)
library(maptools)
library(rgdal)
```

Next, I load the data:

```
communes <- readOGR(dsn="communes-20180101-shp/", layer="communes-20180101")
communes <- spTransform(communes, CRS( "+init=epsg:2154" ))
```

I am going to extract the information for each commune from the `communes` object. Here, one technical detail deserves attention: the size of each object holding the information for a single commune. If a commune is extracted as is, the resulting object takes up a lot of memory, because it carries a great deal of useless information. The `data` *slot* of the `communes` object contains the INSEE commune codes, the names and the Wikipedia links. These are stored as factors: `R` therefore treats each value as an integer and refers to a dictionary listing the corresponding levels. When one element is extracted from a vector of factors in `R`, the result is a sub-element of that vector… together with the entire dictionary! When the dictionary is very large, efficiency suffers. Here, since each row of the `data` *slot* contains a unique INSEE code, a unique name and a unique Wikipedia link, storing this information as factors is suboptimal; plain character strings are enough and prove **much** more efficient later on, when extracting communes.

```
codes_insee <- unique(communes$insee) %>% as.character()
communes@data <- communes@data %>%
  mutate(insee = as.character(insee),
         nom = as.character(nom),
         wikipedia = as.character(wikipedia))
```
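To see why factors are costly here, the following self-contained sketch compares the memory footprint of a single element extracted from a factor with that of the same element extracted from a character vector. The 30,000 codes are a made-up stand-in for the INSEE codes, not the real data.

```r
# Hypothetical stand-in for the INSEE codes: 30,000 unique strings
codes <- sprintf("%05d", 1:30000)
codes_factor <- factor(codes)

# Extract a single element from each representation
one_factor <- codes_factor[1]
one_string <- codes[1]

# The single factor element still carries all 30,000 levels ...
length(levels(one_factor))

# ... so it is far heavier than the single character string
print(object.size(one_factor), units = "Kb")
print(object.size(one_string), units = "Kb")
```

Calling `droplevels()` on the extracted element would also discard the unused levels, but converting the columns to character once, as done above, avoids paying that cost at every extraction.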

I am going to enlarge the polygons delimiting each commune by a given distance. To do so, I use the `gBuffer()` function from the `rgeos` *package*. I choose to extend the commune boundaries by 20 km.

```
distance <- 20000 # in metres
```

I create a function that enlarges the boundaries of a commune by a given distance. It returns a list of 4 objects:

- the coordinates of a rectangle bounding the commune;
- those of a rectangle bounding the extended commune;
- the spatial object containing the coordinates of the commune;
- the one containing the coordinates of the extended commune.

```
#' communes_buffer
#' Get the extended area of a commune
#' for a given distance
#' @code_insee: (string) INSEE code of the commune
#' @distance: (num) distance used to extend the boundaries, in metres
communes_buffer <- function(code_insee, distance){
  # Select the commune (coordinates are in EPSG:2154 at this point)
  tmp <- communes[communes$insee == code_insee,]
  # Extend its boundaries by `distance`
  tmp_buffer <- gBuffer(tmp, width = distance, byid = TRUE)
  # Bounding boxes of the commune and of the extended commune
  bbox_commune <- bbox(tmp)
  bbox_commune_buffer <- bbox(tmp_buffer)
  # Convert both polygons to longitude/latitude (WGS84)
  tmp_buffer <- spTransform(tmp_buffer, CRS("+proj=longlat +datum=WGS84"))
  tmp <- spTransform(tmp, CRS("+proj=longlat +datum=WGS84"))
  list(bbox_commune = bbox_commune, bbox_commune_buffer = bbox_commune_buffer,
       tmp = tmp, tmp_buffer = tmp_buffer)
}# End of communes_buffer()
```

Here is an example of the result for one commune, Rennes, with the boundaries extended by 1 km.

```
res_rennes <- communes_buffer(code_insee = "35238", distance = 1000)
```

The coordinates of the box bounding the commune, and those of the box bounding the extended commune:

```
> res_rennes$bbox_commune
        min       max
x  346353.8  356295.1
y 6785457.4 6793920.0
> res_rennes$bbox_commune_buffer
        min     max
x  345356.3  357295
y 6784457.4 6794920
```

The boundaries of the commune and the extended area:

```
plot(res_rennes$tmp_buffer, border = "red")
plot(res_rennes$tmp, add=TRUE)
```

To let the computer handle smaller objects, I split the 36,000 INSEE codes into 20 chunks, apply the `communes_buffer()` function to each INSEE code in each chunk, and save the result of each chunk.

```
# Split a vector x into n chunks of roughly equal size
chunk2 <- function(x,n) split(x, cut(seq_along(x), n, labels = FALSE))
a_parcourir <- chunk2(1:length(codes_insee), 20)
if(!(dir.exists(str_c("communes/")))) dir.create(str_c("communes/"), recursive = TRUE)
for(i in 1:length(a_parcourir)){
  communes_cercles_tmp <-
    pblapply(codes_insee[a_parcourir[[i]]], communes_buffer, distance = distance)
  save(communes_cercles_tmp, file = str_c("communes/communes_cercles_tmp_", i, ".rda"))
  rm(communes_cercles_tmp)
}
```
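To make the behaviour of `chunk2()` concrete, here is a toy call on ten indices split into three chunks (the real code splits about 36,000 indices into 20 chunks). The function is repeated so that the snippet runs on its own.

```r
# Same helper as above: split a vector x into n chunks of roughly equal size
chunk2 <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE))

# Toy example: ten indices split into three chunks
chunks <- chunk2(1:10, 3)
print(chunks)

# Every index ends up in exactly one chunk, and the original order is kept
print(lengths(chunks))
```

Because `cut()` partitions the positions into contiguous intervals, concatenating the chunks gives back the original vector, which is what allows the INSEE codes to be reattached as names once the intermediate results are reloaded.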

All that remains is to load the 20 intermediate results, to obtain the extended boundaries of each commune:

```
communes_cercles <-
  lapply(1:length(a_parcourir), function(i){
    load(str_c("communes/communes_cercles_tmp_", i, ".rda"))
    lapply(communes_cercles_tmp, function(x) x$tmp_buffer)
  })
communes_cercles <- unlist(communes_cercles)
names(communes_cercles) <- codes_insee
```

Then those of the non-extended communes:

```
communes_sans_cercle <-
  lapply(1:length(a_parcourir), function(i){
    load(str_c("communes/communes_cercles_tmp_", i, ".rda"))
    lapply(communes_cercles_tmp, function(x) x$tmp)
  })
communes_sans_cercle <- unlist(communes_sans_cercle)
names(communes_sans_cercle) <- codes_insee
```

And finally the rectangles bounding the communes and the extended communes:

```
communes_cercles_bbox <-
  lapply(1:length(a_parcourir), function(i){
    load(str_c("communes/communes_cercles_tmp_", i, ".rda"))
    lapply(communes_cercles_tmp, function(x) x$bbox_commune_buffer)
  })
communes_cercles_bbox <- unlist(communes_cercles_bbox, recursive=FALSE)
names(communes_cercles_bbox) <- codes_insee
communes_bbox <-
  lapply(1:length(a_parcourir), function(i){
    load(str_c("communes/communes_cercles_tmp_", i, ".rda"))
    lapply(communes_cercles_tmp, function(x) x$bbox_commune)
  })
communes_bbox <- unlist(communes_bbox, recursive=FALSE)
names(communes_bbox) <- codes_insee
```

I then convert the spatial objects containing the commune boundaries into data frames.

```
options(warn=-1)
communes_cercles_df <-
  pblapply(communes_cercles, function(x){
    suppressMessages(broom::tidy(x, region = "insee"))
  }) %>%
  bind_rows()
options(warn=1)
```

I do the same for the communes:

```
communes <- spTransform(communes, CRS("+proj=longlat +datum=WGS84"))
communes_df <- broom::tidy(communes, region = "insee")
# as_tibble() replaces tbl_df(), which is deprecated in recent versions of dplyr
communes_df <- as_tibble(communes_df)
```

I can now use the extended boundaries to identify, for each commune, the others within a 20 km radius. I create a function that works in two steps for a given commune. First, I use the *bounding boxes* of the communes to quickly screen the communes that may be close to the reference commune. This step speeds up the second one, which uses the `gIntersects()` function from the `rgeos` *package*. This function, which is not especially fast, indicates whether two polygons intersect. It therefore lets me identify the communes that intersect the commune whose boundaries have been extended by 20 km.

```
#' trouver_intersection_commune
#' For commune i of communes_cercles, returns
#' the INSEE code of that commune and the INSEE codes of the
#' communes within a 20 km radius of it
#' @i (int): index of the commune
trouver_intersection_commune <- function(i){
  comm_courante <- communes_cercles[[i]]
  comm_restantes <- communes_sans_cercle[-i]
  # First, a quick screening using the bounding boxes
  bbox_courante <- communes_cercles_bbox[[i]]
  bbox_restantes <- communes_bbox[-i]
  box_se_touchent <- function(x){
    # Do the two bounding boxes overlap?
    touche <-
      bbox_courante["x", "min"] <= x["x", "max"] & bbox_courante["x", "max"] >= x["x", "min"] &
      bbox_courante["y", "min"] <= x["y", "max"] & bbox_courante["y", "max"] >= x["y", "min"]
    touche
  }# End of box_se_touchent()
  touchent <- sapply(bbox_restantes, box_se_touchent)
  # Then the exact (slower) intersection test, on the remaining candidates only
  inter <- sapply(comm_restantes[touchent], function(x){
    gIntersects(x, comm_courante)
  })
  insee_intersection <- names(comm_restantes)[which(touchent)[which(inter)]]
  list(insee = names(communes_cercles[i]), limitrophes_20 = insee_intersection)
}
```
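The bounding-box screening performed by `box_se_touchent()` can be tried in isolation, without any spatial package. In the sketch below, `make_bbox()` and `boxes_overlap()` are hypothetical helpers written for illustration; the reference box reuses the extended bounding box printed above for Rennes, while the two other boxes are invented.

```r
# Hypothetical helper: build a bounding box with the same layout as sp::bbox()
# (rows "x" and "y", columns "min" and "max")
make_bbox <- function(xmin, xmax, ymin, ymax) {
  matrix(c(xmin, ymin, xmax, ymax), nrow = 2,
         dimnames = list(c("x", "y"), c("min", "max")))
}

# Two axis-aligned boxes overlap when their ranges overlap on both axes
boxes_overlap <- function(a, b) {
  a["x", "min"] <= b["x", "max"] && a["x", "max"] >= b["x", "min"] &&
    a["y", "min"] <= b["y", "max"] && a["y", "max"] >= b["y", "min"]
}

# Extended bounding box of Rennes (values printed earlier in this post)
rennes_buffer <- make_bbox(345356.3, 357295, 6784457.4, 6794920)
# Two made-up boxes: one just next to Rennes, one far away
near_box <- make_bbox(356000, 360000, 6790000, 6795000)
far_box  <- make_bbox(600000, 610000, 6300000, 6310000)

boxes_overlap(rennes_buffer, near_box)
boxes_overlap(rennes_buffer, far_box)
```

Overlapping boxes do not guarantee that the polygons themselves intersect, which is why the exact test with `gIntersects()` is still needed afterwards; the screening only discards the communes that cannot possibly intersect.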

I apply this function to all the communes. To speed things up, I run it in parallel.

```
library(parallel)
ncl <- detectCores()-1
(cl <- makeCluster(ncl))
invisible(clusterEvalQ(cl, library(tidyverse, warn.conflicts=FALSE, quietly=TRUE)))
invisible(clusterEvalQ(cl, library(geosphere, warn.conflicts=FALSE, quietly=TRUE)))
invisible(clusterEvalQ(cl, library(rgeos, warn.conflicts=FALSE, quietly=TRUE)))
clusterExport(cl, c("communes_cercles", "communes_sans_cercle"), envir=environment())
clusterExport(cl, c("communes_cercles_bbox", "communes_bbox"), envir=environment())
communes_proches_20km <- pblapply(1:length(communes_cercles), trouver_intersection_commune, cl = cl)
names(communes_proches_20km) <- names(communes_cercles)
stopCluster(cl)
```

Here is a preview of the result, again taking Rennes as an example, with a 20 km radius.

```
ind_rennes <- which(names(communes_cercles) == "35238")
proche_rennes_20 <- trouver_intersection_commune(i = ind_rennes)
map_rennes <-
  ggplot(data = communes_df %>%
           filter(id %in% unlist(proche_rennes_20)) %>%
           mutate(limitrophe = ifelse(id %in% proche_rennes_20$limitrophes_20, yes = "limitrophe", no = "non"),
                  limitrophe = ifelse(id == proche_rennes_20$insee, yes = "focus", no = limitrophe))) +
  geom_polygon(data = map_data("france"), aes(x = long, y = lat, group = group), fill = NA, col = "white") +
  geom_polygon(aes(x = long, y = lat, group = group, fill = limitrophe)) +
  # Extended boundaries of Rennes, drawn as a dashed red line
  geom_polygon(data = communes_cercles[[ind_rennes]] %>%
                 broom::tidy(region = "insee") %>%
                 as_tibble(),
               aes(x = long, y = lat, group = group), fill = NA, col = "red", linetype = "dashed") +
  # guide = "none" hides the legend (guide = FALSE is deprecated in recent ggplot2)
  scale_fill_manual("", values = c("limitrophe" = "dodgerblue", "non" = "white", "focus" = "red"), guide = "none") +
  coord_quickmap(xlim = c(-5,0),
                 ylim = c(47.5,48.5))
```

Note: with a shorter distance, the code can be used to find the neighbouring communes...

I am going to Paris today to attend a meeting this morning with the people from the Actinfo Chair, which I am now part of for the duration of my post-doc with Arthur Charpentier.

I will present the research on genealogy using collaborative data that we have been working on this summer. This will be the occasion to talk about what Arthur and I plan to do with these data in the near future.

Olivier Wintenberger will also share with us his recent research.

This week I will attend the GEOMED2017 conference in Porto, Portugal. Researchers from different backgrounds will be gathering here to attend talks about spatial statistics, spatial epidemiology and public health. This will be the perfect occasion for me to learn more about those subjects during the three days of the conference. I will also attend a workshop given by Duncan Lee from Glasgow University on modelling spatial data in R with the package CARBayes.

I will also have the pleasure to present my recent work. During the summer, with Arthur Charpentier, I worked on collaborative genealogy data. I was hosted in the GERAD office, in Montreal. In fact, with Olivier Cabrignac, we obtained a really nice dataset from a website called Geneanet. This dataset provides information on people born between 1800 and 1875 (for now; hopefully we can get more soon!). There are several million lines! These data are obtained thanks to the users of Geneanet who construct their family tree as a hobby.

As a first step, we had a look at people’s migration between generations. It should be interesting to link the migrations with diseases. We had to work a lot on data cleaning and formatting to be able to use them. But I will write more in the following months about these data, as we intend to work on mortality. There seems to be a lot of really interesting analysis to run with these data.

You can find below the slides for the short talk I’ll give in the session called “Data Science applied to Health: Strategies and tools for big data, machine learning and data mining”.


## How does it work?

First of all, for the most curious, we propose a much more detailed version of the approach we have adopted to make these forecasts, in a working paper (only in French at the moment, though) available at this address: http://egallic.fr/Recherche/Worldcup_2018/worldcup.html.

## Nine models to forecast results

To keep it simple, eight supervised learning methods are used to predict the results of upcoming matches. These methods have names that are perhaps familiar to you: k-nearest neighbours, the naive Bayes classifier, classification trees, random forests, stochastic gradient boosting, boosted logistic regression, support vector machines and artificial neural networks. We also have a ninth model that we have named “combination”. The latter uses the eight previous models to improve forecasts. As it offers slightly better performance than the others, it is the one we prefer.

## Simulations launched to predict World Cup results

To predict the possible World Cup results, we simulate the competition a large number of times, advancing match by match. The reason is as follows. When we make a forecast for a match between a team 1 and a team 2, our models give us a probability for each possible outcome. Take, for example, a match in which team 1 wins with a probability of 50%, the teams draw with a probability of 17% and team 2 wins with a probability of 33%. Even if the model tells us that the most probable outcome is the victory of team 1, this does not mean that in reality team 1 will necessarily win. It is just more likely to win according to our estimates.

In our simulations, to take into account the possible (but rarer) scenarios in which team 2 wins, we randomly decide the outcome of the match, giving more chances to the event in which team 1 wins. In other words, in this example, it is like rolling a six-sided die and observing the result. Since the victory of team 1 has a probability of 50%, we assign it 50% of the faces (3 faces out of 6). If the top side of the die shows a 1, 2 or 3, for example, we conclude that team 1 wins the game. There is a 17% probability of a draw, i.e. only 1 side out of 6. If the top side of the die shows a 4, for example, we conclude that the game ends in a draw. Finally, there is a 33% probability that team 2 wins. This number corresponds to 2 sides out of 6. If the top side of the die shows a 5 or a 6, we conclude that team 2 wins. By rolling the die many times, about 50% of the throws will give team 1 as the winner, 17% a draw and 33% a win for team 2. Each throw corresponds to one simulation in our exercise, and we run 50,000 of them. We go forward game by game in each simulation until we get to the winner of the competition, then we move on to the next simulation, until we reach the 50,000th.
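The die-rolling analogy maps directly onto weighted random sampling. The sketch below is a minimal illustration for a single match, assuming the probabilities of the example (50%, 17%, 33%); the seed and object names are ours, and the real exercise chains such draws match after match to simulate the whole competition.

```r
set.seed(42)  # arbitrary seed, only to make the example reproducible

# Possible outcomes of the match and their probabilities (example above)
outcomes <- c("team 1 wins", "draw", "team 2 wins")
probs <- c(0.50, 0.17, 0.33)

# One draw per simulation: 50,000 "throws of the die" for this single match
n_sim <- 50000
draws <- sample(outcomes, size = n_sim, replace = TRUE, prob = probs)

# Share of simulations ending with each outcome, in %
shares <- table(factor(draws, levels = outcomes)) / n_sim
print(round(100 * shares, 1))
```

With this many draws, the observed shares should land close to the probabilities given by the model, while any single draw can still produce the less likely result.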