On the occasion of Euro 2008 and the 2010 World Cup, the Oberhausen oracle (better known as “Paul the Octopus”) made the headlines. His accurate predictions of the German team’s results at Euro 2008 and of the winner of the 2010 World Cup (Spain) are still etched in people’s memories. With some colleagues (Enora Belz, Romain Gaté, Vincent Malardé and Jimmy Merlet), we tried to carry on the work of the late Paul the Octopus and predict the outcome of the upcoming matches of the 2018 World Cup. To do this, we rely on the results of past World Cup and continental cup matches.^{1}
Note: the display is optimized for reading on a computer; some graphics are not accessible on a mobile phone.
How does it work?
First of all, for the most curious readers, a much more detailed account of the approach we adopted to make these forecasts is given in a working paper (only in French at the moment), available at this address:
http://egallic.fr/Recherche/Worldcup_2018/worldcup.html.
Nine models to forecast results
To keep it simple, eight supervised learning methods are used to predict the results of upcoming matches. Their names may be familiar to you: k-nearest neighbours, naive Bayes classification, classification trees, random forests, stochastic gradient boosting, boosted logistic regression, support vector machines and artificial neural networks. We also have a ninth model, which we have named “combination”. It uses the eight previous models to improve the forecasts. As it performs slightly better than the others, it is the one we prefer.
Simulations launched to predict World Cup results
To predict the possible World Cup results, we simulate the competition a large number of times, advancing match by match. The reason is as follows. When we make a forecast for a match between a team 1 and a team 2, our models show us a probability for each possible outcome. Here is an example:
 Team 1 wins with a 50% probability;
 the match ends in a draw with a 17% probability;
 Team 2 wins with a 33% probability.
Even if the model tells us that the most probable outcome is the victory of team 1, this does not mean that in reality team 1 will necessarily win. It is just more likely to win by our estimates.
In our simulations, to account for the possible (but rarer) scenarios in which team 2 wins, we decide the outcome of each match at random, giving more weight to the more probable outcomes. In this example, it is like rolling a six-sided die and observing the result. The victory of team 1 has a probability of 50%, so we assign it 50% of the faces (3 out of 6): if the die shows a 1, 2 or 3, we conclude that team 1 wins the game. A draw has a probability of 17%, i.e. roughly 1 face out of 6: if the die shows a 4, we conclude that the game ends in a draw. Finally, team 2 wins with a probability of 33%, which corresponds to 2 faces out of 6: if the die shows a 5 or a 6, we conclude that team 2 wins. By rolling the die many times, about 50% of the throws give team 1 as the winner, 17% a draw and 33% a win for team 2.
Each throw corresponds to the simulation of one match. We run 50,000 simulations of the competition: in each one, we go forward game by game until we get to the winner of the competition, then we move on to the next simulation, until we reach the 50,000th.
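The die roll described above amounts to a weighted random draw. Here is a minimal sketch in Python, using the example probabilities from the text; the function names and the fixed seed are our own choices for illustration, not the code used for the actual forecasts.

```python
import random
from collections import Counter

# Example probabilities from the text: team 1 wins, draw, team 2 wins.
OUTCOMES = ["team 1 wins", "draw", "team 2 wins"]
PROBABILITIES = [0.50, 0.17, 0.33]


def draw_outcome(rng):
    """Draw one match outcome, weighted by the forecast probabilities."""
    return rng.choices(OUTCOMES, weights=PROBABILITIES, k=1)[0]


def outcome_shares(n_draws, seed=2018):
    """Repeat the draw n_draws times and return the share of each outcome."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    counts = Counter(draw_outcome(rng) for _ in range(n_draws))
    return {outcome: counts[outcome] / n_draws for outcome in OUTCOMES}


shares = outcome_shares(50_000)
for outcome, share in shares.items():
    print(f"{outcome}: {share:.1%}")
```

With 50,000 draws, the printed shares should land close to 50%, 17% and 33%, which is the behaviour the die analogy describes.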
What are the model predictions based on?
Forecasts are based on actual data from competitive international football matches (friendly matches are excluded) since August 1993. The full set of variables is described in our working paper. We use the results of previous matches, the rank of team 1 in the FIFA World Ranking, the difference between that rank and the rank of team 2, the offensive and defensive form of each team (the average number of goals scored and conceded in the last three matches), the type of match (a world competition such as the World Cup, or a continental competition such as the European Nations Cup), the phase of the competition (preliminary or final), the month, the year and the continent.
Awesome, I can use these predictions to make online bets then?
At your own risk: forecasting is not the same as knowing. Even if the results of past matches have some predictive power, the result of a match is determined by the talent of the players, of course, but also by a share of chance.
When we apply our models to new matches, which were not used for estimation, they predict the correct result in about 60% of cases. They are therefore wrong in the remaining 40%. By comparison, picking one of the three outcomes (team 1 wins, draw, team 2 wins) purely at random is correct only a third of the time, or about 33%.
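To make the comparison concrete, here is a small sketch of how such a hit rate can be computed on matches kept aside from estimation. The ten matches below are made up purely for illustration (they are not our data), and the labels "1", "D" and "2" are our own shorthand for a team 1 win, a draw and a team 2 win.

```python
def accuracy(predicted, actual):
    """Share of matches whose predicted outcome equals the observed one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)


# Made-up toy hold-out set: "1" = team 1 wins, "D" = draw, "2" = team 2 wins.
actual = ["1", "D", "2", "1", "1", "2", "D", "1", "2", "1"]
predicted = ["1", "1", "2", "1", "2", "2", "1", "1", "1", "1"]

model_accuracy = accuracy(predicted, actual)

# A uniform random guess among three outcomes is right with probability 1/3,
# whatever the true outcome is.
random_baseline = 1 / 3

print(f"model: {model_accuracy:.0%}, random guess: {random_baseline:.0%}")
```

The toy predictions were chosen so that 6 out of 10 are correct, mirroring the order of magnitude quoted above.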
But then, are these forecasts bad?
Predicting the results of a football match with so few variables in our models is a difficult exercise. However, even with many more variables, as online betting operators can afford, the predictive quality of the models would be far from perfect. At least that is what we can read in the academic literature on this subject.
Simply put, our forecasts are based on probabilities. The real result of the 2018 World Cup will probably be different from what we are proposing here. The idea is rather that, if this exercise were repeated a very large number of times, our predictions would do better than picking the winner purely at random.
And where are these forecasts?
They’re coming! We offer several types:
 group match forecasts, which give for each match the probabilities of each outcome;
 the probability of winning the World Cup for each team;
 the probability of being eliminated in each round, depending on how far the team has already progressed in the competition;
 probable paths.
Group matches: what probabilities for each outcome?
For the group matches, we already know which teams will meet. All we have to do is ask our models for the forecast of each match. There is just one small catch: to make a forecast, our models rely on past results, notably through the offensive and defensive form variables and the outcomes of the last three games. For the offensive and defensive form variables, we set the values to the last ones observed and keep them unchanged throughout the competition. The outcomes of the last three games, on the other hand, are updated after each match.
Without further ado, here are the results. For a given match, the graph below shows the probabilities of a victory of team 1 (on the left), a draw (in the middle) or a victory of team 2 (on the right). By default, the graph shows the results for the opening match of the competition, between Russia and Saudi Arabia; to change matches, simply use the menu at the top left of the graph. Our favorite model (the dropdown menu on the right shows the results of the other models) gives Russia as the winner of this match with a probability of 53.38%. The probability of a draw is lower (27.03%) and that of a Saudi Arabia win is lower still (19.59%).
Who’s gonna win the world cup?
After each team has played its three games, the group rankings are calculated. Points are awarded after each match: 3 for a win, 1 for a draw, 0 for a loss. Once all the group matches have been forecast, the teams in each group are ranked by the number of points obtained over their three matches. In the event of a tie, FIFA regulations state that the goal difference over all group matches is decisive. If teams are still tied, the greater number of goals scored is used to separate them. If there is still a tie, other criteria based on the number of goals are used and, as a last resort, FIFA draws lots. As the models in this study do not predict the number of goals, none of these criteria can be applied, except the random draw. Therefore, whenever teams are tied in a group ranking, we draw lots to decide between them.
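The ranking rule used in the simulations (points first, then a random draw between tied teams) can be sketched as follows. This is an illustrative sketch only: the group below is hypothetical, and the function and variable names are ours.

```python
import random

POINTS = {"win": 3, "draw": 1, "loss": 0}


def rank_group(results, rng):
    """Rank teams by points; teams level on points are separated by drawing lots.

    results maps each team to the list of its three outcomes.
    """
    points = {team: sum(POINTS[o] for o in outcomes)
              for team, outcomes in results.items()}
    # One random lot per team: it only matters when points are equal.
    lots = {team: rng.random() for team in results}
    ranking = sorted(results, key=lambda team: (points[team], lots[team]),
                     reverse=True)
    return ranking, points


# Hypothetical group: A beats everyone, B and C draw and both beat D.
group = {
    "Team A": ["win", "win", "win"],
    "Team B": ["loss", "draw", "win"],
    "Team C": ["loss", "draw", "win"],
    "Team D": ["loss", "loss", "loss"],
}

ranking, points = rank_group(group, random.Random(2018))
print(ranking, points)
```

Team B and Team C both end on 4 points here, so which of the two finishes second depends only on the draw, as in our simulations.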
For the subsequent phases of the competition, we simply follow the schedule set by FIFA, which pairs group winners with runners-up in the Round of 16: the first in Group A against the second in Group B, the first in Group C against the second in Group D, etc. The winners continue to the quarterfinals, then the semifinals and eventually the final.
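As an illustration, the Round of 16 pairings can be generated from the group letters. In this sketch, the labels "1A", "2B", etc. and the function name are our own notation, and the mirrored ties (first in Group B against second in Group A, and so on) spell out what the "etc." stands for in the FIFA format.

```python
def round_of_16_pairings(groups="ABCDEFGH"):
    """Pair each group winner with the runner-up of the neighbouring group."""
    if len(groups) % 2:
        raise ValueError("groups must come in pairs")
    winners_first, winners_second = [], []
    for i in range(0, len(groups), 2):
        left, right = groups[i], groups[i + 1]
        winners_first.append((f"1{left}", f"2{right}"))    # e.g. 1A v 2B
        winners_second.append((f"1{right}", f"2{left}"))   # e.g. 1B v 2A
    return winners_first + winners_second


pairings = round_of_16_pairings()
for first, second in pairings:
    print(f"{first} v {second}")
```

In each simulation, the two slots of a pairing are filled with the teams that finished first and second in the corresponding simulated groups.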
The table below shows the probability of victory for each team. Our favourite model gives us Brazil as the team with the highest probability (19%) of winning the 2018 World Cup. Next come Germany (14%) and Spain (11%).
Beware! This does not mean that Brazil will finish first, Germany second and Spain third. These probabilities are calculated by counting the number of simulations in which each country wins the competition and dividing it by the total number of simulations. However, there is a good chance that the winner is among the top 5: together, these five teams win in about 62% of our simulations.
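The calculation is a simple ratio. In the sketch below, the title counts are not taken from our output files: they are back-computed from the percentages published in Table 1 (percentage multiplied by 500, since there are 50,000 simulations), and only the top five teams are listed.

```python
N_SIMULATIONS = 50_000

# Number of simulated titles, back-computed from the published percentages.
title_counts = {
    "Brazil": 9_562,
    "Germany": 7_261,
    "Spain": 5_322,
    "France": 4_854,
    "Portugal": 4_124,
}


def win_probabilities(counts, n_simulations):
    """Percentage of simulations won by each team."""
    return {team: 100 * count / n_simulations for team, count in counts.items()}


probabilities = win_probabilities(title_counts, N_SIMULATIONS)
top_five_share = sum(probabilities.values())

for team, probability in probabilities.items():
    print(f"{team}: {probability:.3f}%")
print(f"Top five together: {top_five_share:.1f}%")
```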
Team  Probability of winning (%) 

Brazil  19.124 
Germany  14.522 
Spain  10.644 
France  9.708 
Portugal  8.248 
Switzerland  6.936 
Belgium  6.708 
England  5.386 
Poland  3.702 
Peru  3.072 
Denmark  2.472 
Argentina  2.252 
Croatia  1.718 
Uruguay  1.632 
Mexico  1.396 
Colombia  0.632 
Tunisia  0.402 
Sweden  0.230 
Egypt  0.208 
Iceland  0.160 
Costa Rica  0.136 
Russia  0.102 
IR Iran  0.100 
Senegal  0.076 
Morocco  0.074 
Nigeria  0.064 
Japan  0.058 
Australia  0.056 
Saudi Arabia  0.056 
Serbia  0.050 
Korea Republic  0.040 
Panama  0.036 
Table 1. Estimated probability of winning the 2018 World Cup.
How far will my favorite team go?
Let’s focus on one team at a time. What are its risks of being knocked out in the group phase? In the Round of 16? In the quarterfinals? In the final? To answer these questions, we look again at the results of our simulations. For each team, we count the number of simulations in which it is eliminated in each phase, then divide that number by the total number of simulations. This gives the proportion of simulations in which each team goes out in the group phase, the Round of 16, the quarterfinals, etc.
The graph below shows Argentina by default. Among our 50,000 simulations, 20.8% saw Argentina finish 3rd or 4th in its group and thus stop after its first three matches; 37.65% ended in the Round of 16 for Argentina, 23.64% in the quarterfinals, 12% in the semifinals and 3.65% in the final. As in the previous table, we find that 2.25% of the simulations give Argentina as the winner of the World Cup.
To see the results for another team, as before, simply use the dropdown menu at the top left of the graph.
What happens if we now look at the distribution of outcomes conditional on a given team having already passed a stage? To answer this question, choose a phase already passed in the dropdown menu at the top right of the graph. Let’s take the example of Argentina again and see what happens if it gets past the Round of 16 (select the value Round of 16 in the right menu). The results are as follows: in the simulations in which Argentina qualified for the quarterfinals, it lost that quarterfinal in 57% of the cases. In 29% of the cases, it reached the semifinals but was immediately defeated. Argentina won the cup in 5% of the simulations in which it got past the Round of 16.
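These conditional figures can be recovered from the unconditional ones given earlier for Argentina: we keep only the stages after the one already passed and rescale them so that they sum to 100%. A minimal sketch, in which the stage labels and the function name are ours:

```python
STAGES = ["group stage", "round of 16", "quarterfinal",
          "semifinal", "final", "champion"]

# Share of simulations (%) in which Argentina's run ends at each stage
# ("champion" means winning the cup), as quoted in the text.
argentina = {
    "group stage": 20.8,
    "round of 16": 37.65,
    "quarterfinal": 23.64,
    "semifinal": 12.0,
    "final": 3.65,
    "champion": 2.25,
}


def conditional_on_passing(distribution, passed_stage):
    """Rescale the distribution to the simulations that got past passed_stage."""
    later_stages = STAGES[STAGES.index(passed_stage) + 1:]
    total = sum(distribution[stage] for stage in later_stages)
    return {stage: 100 * distribution[stage] / total for stage in later_stages}


conditional = conditional_on_passing(argentina, "round of 16")
for stage, share in conditional.items():
    print(f"{stage}: {share:.1f}%")
```

Argentina gets past the Round of 16 in 41.54% of the simulations (23.64 + 12 + 3.65 + 2.25), and dividing each remaining share by this total gives roughly 57%, 29%, 9% and 5%.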
What are the most likely paths for my team?
Having the odds of winning the World Cup or losing in the quarterfinals or finals is all well and good, but it doesn’t tell us what the likely paths of each team in the competition are.
Be careful: understanding the graphs that follow can be a little tricky. It is easy to take shortcuts, and the resulting interpretation is then completely wrong.
To find out which opponents a team may face, we use the simulations to follow the possible paths of each team. Figure 3 shows, in the form of a tree, all the paths obtained during the 50,000 simulations for each of the top 5 teams. A team’s tree is made up of a root (the name of the team) and leaves (the phases of play and potential opponents), linked together by branches. The size of a leaf is proportional to the number of simulations in which the event described by the leaf was observed. This number is given on the second line of the label that appears when hovering over a leaf.
Thus, for the tree of France (displayed by default; use the menu above the graph to display the tree of another country), the root indicates that the tree refers to 50,000 simulations. The following leaves show the ranking obtained at the end of the group phase: 27,526 cases in which France finished first in its group, 12,755 in which it finished second, and 9,735 in which it did not get past the group phase (7,109 third and 2,626 last). Clicking on a leaf whose label gives the ranking at the end of the group matches (First, Second, Third or Fourth) displays the rest of the competition. For example, clicking on the First leaf for France reveals four potential opponents in the Round of 16: Argentina, Croatia, Iceland and Nigeria. Croatia’s leaf is the largest, which reflects the fact that if France finishes first in its group, its most likely opponent in the Round of 16 is Croatia. Clicking from leaf to leaf reveals the different possibilities for France’s route (you can zoom with the mouse wheel or the touchpad).
We offer another way of representing the possible paths of each team, this time for all the competitors (and no longer only the 5 teams with the highest probability of winning the cup). This other representation, called “Sunburst”, is perhaps a little less intuitive at first glance. Here is how it works. The reasoning is the same as for the previous graph. After selecting a team (by default, France is displayed), the different phases of the competition for that team are displayed as rings. Each ring is split in proportion to the number of simulations in which the corresponding outcome (shown when the mouse hovers over the ring) is observed. Clicking on a ring portion hides the other portions, to make viewing and navigation easier. To display the hidden rings again, simply click on the central circle of the graph. At any time, the path leading to the current view can be read from the arrows at the top of the graph.
We would like to point out that the reasoning used to read the two previous graphs does not necessarily lead to the most probable outcome: the process is step by step, and many possible outcomes are therefore left out once a choice has been made. Let us take an example to clarify this point. Consider a three-stage competition: group matches, a semifinal and a final. Suppose, for simplicity, that 100 simulations have been performed and that the results are as shown in the probability tree below.
If we follow the reasoning used previously to describe a team’s path through the competition, we proceed as follows: the team finishes first in its group and thus reaches the semifinal. Given this, it wins its semifinal in 20 simulations and loses it in 15. We then consider that it reaches the final, which it wins in 15 simulations. Thus, this most likely path proclaims the team the winner of the tournament. However, this is not the most likely outcome. Indeed, if we look closely at the tree, this team fails to win the competition in 83 cases out of 100. Its probability of not winning is much higher than its probability of winning. In summary, the most likely path does not necessarily lead to the most likely outcome of the competition.
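The same point can be checked with a few lines of code. The sketch below uses a hypothetical set of 100 simulated paths: only the figures quoted in the text (20 semifinal wins and 15 semifinal losses after finishing first, 15 final wins on that branch, 83 simulations without the title) come from our example, and all the other counts are made up so that the totals add up.

```python
from collections import Counter

# Hypothetical counts of simulated paths (they sum to 100).
paths = {
    ("first in group", "wins semifinal", "wins final"): 15,
    ("first in group", "wins semifinal", "loses final"): 5,   # made up
    ("first in group", "loses semifinal"): 15,
    ("second in group", "wins semifinal", "wins final"): 2,   # made up
    ("second in group", "wins semifinal", "loses final"): 8,  # made up
    ("second in group", "loses semifinal"): 23,               # made up
    ("eliminated in group",): 32,                             # made up
}


def step_by_step_path(paths):
    """Follow the most frequent event at each stage, as when reading the tree."""
    prefix = ()
    while True:
        next_events = Counter()
        for path, count in paths.items():
            if len(path) > len(prefix) and path[:len(prefix)] == prefix:
                next_events[path[len(prefix)]] += count
        if not next_events:
            return prefix
        prefix += (next_events.most_common(1)[0][0],)


most_likely_path = step_by_step_path(paths)
total = sum(paths.values())
titles = sum(count for path, count in paths.items() if path[-1] == "wins final")

print(" > ".join(most_likely_path))
print(f"wins the title in {titles} of {total} simulations")
```

Reading the tree step by step ends on "wins final", while the team actually wins the title in only 17 of the 100 simulations.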
Who are we?
We are junior researchers in economics, members of the Centre de Recherche en Économie et Management. We are also part of an association, named PROJECT (PROmotion des Jeunes ÉConomistes en Thèse, literally, Promotion of Young Economists in Thesis).
In alphabetical order:
Why did we work on such a project?
Statistical learning techniques are currently not widely used in academic economics. Some researchers are trying to convince their peers that economics could benefit from advances made in other disciplines with statistical tools related to big data. To increase our knowledge of these techniques, we decided to use this World Cup year to test different methods on real data. The results we obtained led us to believe that it could be interesting to share them.

1. European Nations Cup, African Nations Cup, Copa America, etc.
Do you have any github repo or way to reproduce this analysis?
It is coming (soon, I hope).
Hello,
Really good work. I am a data miner at Golden eyes (a research consultancy). I have a rather silly question: how do you embed your dynamic charts? To put it simply, in the data world PowerPoint unfortunately remains the most common way of sharing results (which are often too statistical); your way of doing things is exactly what I am looking for.
I hope to get a reply from you. Keep up the good work. Best regards, Valentin MAes
Hello,
The charts are made with R, using the {plotly} package.
Best regards