Forecasting the 2022 World Cup
How I used Elo Ratings and Monte Carlo simulation to enter the Essex Department of Government World Cup Prediction Contest
Published: 21 Nov 2022
I recently entered a World Cup prediction contest in the Department of Government at Essex University. This isn’t super-abnormal, except that my football knowledge is marginally less than my chances of home ownership.
Not being willing to let that stop me, I decided to turn to my skills in programming and data science to see if they could help me. After doing some research, I settled on building an Elo rating system to generate my predictions for me.
If you’re from the world of chess, you’ll be familiar with Elo rating systems. The basic idea is you give everyone a starting rating. Then, as they play games against each other, their Elo ratings get updated. If you beat a player with a much lower rating, you don’t gain many points. If you beat a player with a much higher rating, you gain many points. And vice versa for losing points.
Elo rating systems have a lot of appeal for football prediction. They implicitly handle the time factor: as teams get better and worse, their Elo ratings adjust accordingly. They also come with a ready-made formula for producing win probabilities.
The one thing they may handle poorly in this case is rating teams across continents. Since teams play more often within their own continent, the predictions here may suffer somewhat from a lack of updates across continents outside major events.
Throughout this blog post, I’ll share some of my code snippets from R. However, since all of the code is available in a GitHub repo at this link, I won’t set it out in full here.
Data
Before implementing any kind of predictive algorithm or model, I need data to base it on. Fortunately, there’s a nice dataset of most (all?) historic international football matches, currently maintained by Mart Jürisoo and available on Kaggle [1].
There’s also a nice dataset of fixture dates available on the creatively named Fixture Dates website [2], which nicely avoids most of the need to manually code these.
I therefore used these two datasets throughout the modelling process.
Elo Ratings
There are two main decisions to be made when building elo ratings:
- Start values
- Computing the updates
I’ll now go through both of these in turn.
Initial Ratings
In my procedure, I started all teams with a rating of 1500. The World Elo Ratings use slightly different starting values [3], but that should ultimately have a relatively marginal effect.
team_ratings <- data.frame(
  team = unique(c(matches$home_team, matches$away_team)),
  rating = 1500
)
I chose 1500 simply because it’s a relatively standard choice - no special reasons or motivations here. Regardless, after roughly 30 games, ratings should reach their ‘correct’ value. And there are a lot more than 30 here!
Update Algorithm
Once that was done, the big thing to implement was the iterative update algorithm. At the time of writing there are 44,060 matches in the dataset, starting in 1872.
The eloratings.net update is calculated as [3]
\[R_n = R_o + K \times (W - W_e)\]
where \(R_n\) is the new Elo rating, \(R_o\) is the old Elo rating, \(K\) is an importance value, \(W\) is the result, and \(W_e\) is the expected result. These last three terms obviously need some defining.
Starting with the importance value, \(K\), this essentially determines the size of the update. At about 1, there’s barely any update. At 100, the update is usually too large. 30 is a fairly standard value [4]. The eloratings.net ratings use the following values, conditional on the match being played [3]:
- 60 for World Cup finals
- 50 for continental championship finals and major intercontinental tournaments
- 40 for World Cup and continental qualifiers and major tournaments
- 30 for all other tournaments
- 20 for friendly matches
From here, in the eloratings.net calculation \(K\) is adjusted based on the goal difference in the game [4]. These adjustments are proportional: for games where the goal difference is 2, \(K\) is increased by \(\frac{1}{2}\) of its value (so multiplied by 1.5). Where the goal difference is 3, it’s increased by \(\frac{3}{4}\). Where the goal difference is 4 or more, it is increased by \(\frac{3}{4} + \frac{N-3}{8}\), where \(N\) is the goal difference.
I used a value of 30 across all games, but kept the goal multiplier as set out above. This was largely because after some initial trial and error along with some quick eyeballing of results (and comparing them to the World Elo Ratings), this just seemed to produce the most sensible results. We’ll see how far off that statement is at the end of December!
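As a rough sketch of the goal-difference adjustment described above (not the exact code from the repo), the multiplier can be written as a small function. The name `goal_multiplier` and the variable `base_k` are my own labels for illustration.

```r
# Hypothetical helper: multiplier applied to K for a given goal difference,
# following the eloratings.net rule described above.
goal_multiplier <- function(goal_diff) {
  n <- abs(goal_diff)
  ifelse(n <= 1, 1,
         ifelse(n == 2, 1.5,
                ifelse(n == 3, 1.75,
                       1.75 + (n - 3) / 8)))
}

base_k <- 30  # flat importance value used in this post

# Adjusted K for goal differences of 0 to 5
adjusted_k <- base_k * goal_multiplier(0:5)
adjusted_k
```

With a flat \(K\) of 30, a one-goal win or a draw leaves the update size untouched, while a five-goal win doubles it.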
The observed result \(W\) is simple in its definition:
- 1 for a win
- 0.5 for a tie
- 0 for a loss
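A minimal sketch of how this could be coded from the match scores, from the home team’s point of view. The helper name `match_result` is hypothetical, and the idea that scores arrive as separate home and away columns is an assumption about the dataset layout.

```r
# Hypothetical helper: observed result W for the home team.
# The away team's result is simply 1 minus this value.
match_result <- function(home_score, away_score) {
  ifelse(home_score > away_score, 1,
         ifelse(home_score == away_score, 0.5, 0))
}

# A home win, a draw, and a home loss
example_results <- match_result(c(2, 1, 0), c(0, 1, 3))
example_results
```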
The expected result \(W_e\) is calculated as
\[W_e = \frac{1}{10^{-dr/400} + 1}\]
where \(dr\) is the difference in ratings between the teams, plus 100 for the home team.
You could compute this for both teams (remembering to flip the sign of the 100 for the away team), but since the two values sum to 1, the away team’s expected result is just 1 minus the home team’s.
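To make the formula concrete, here is a small worked sketch for two equally rated teams, followed by a single rating update with \(K = 30\) and no goal-difference adjustment. The function name `expected_result` and the 1500 ratings are illustrative choices, not values from the real data.

```r
# Expected result for the home team; home_adv = 100 mirrors the formula above
expected_result <- function(rating_home, rating_away, home_adv = 100) {
  dr <- rating_home - rating_away + home_adv
  1 / (10^(-dr / 400) + 1)
}

# Two equally rated teams: the home side is favoured purely by the +100
w_e_home <- expected_result(1500, 1500)
w_e_away <- 1 - w_e_home

# One update with K = 30 after a home win (W = 1 for home, 0 for away)
k <- 30
new_home <- 1500 + k * (1 - w_e_home)
new_away <- 1500 + k * (0 - w_e_away)

round(c(w_e_home = w_e_home, w_e_away = w_e_away), 3)
c(new_home = new_home, new_away = new_away)
```

The home side’s expected result comes out at roughly 0.64, and because both teams share the same \(K\), whatever one team gains the other loses.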
So, with all the pieces in place, I implemented a loop in R over all matches (minus one with missing scores):
# Loop over matches in chronological order
for (i in seq_len(nrow(matches))) {
  # Teams
  home_team <- matches$home_team[i]
  away_team <- matches$away_team[i]
  # Current ratings
  rating_home <- team_ratings$rating[team_ratings$team == home_team]
  rating_away <- team_ratings$rating[team_ratings$team == away_team]
  # Compute expected result for both teams (+100 for the home team)
  expected_home <- 1 / (1 + 10^(-(rating_home - rating_away + 100) / 400))
  expected_away <- 1 - expected_home
  # Update ratings
  team_ratings$rating[team_ratings$team == home_team] <- rating_home +
    matches$importance[i] * (matches$result[i] - expected_home)
  team_ratings$rating[team_ratings$team == away_team] <- rating_away +
    matches$importance[i] * (1 - matches$result[i] - expected_away)
}
# World Cup Team Elo Ratings
teams <- team_ratings %>% filter(team %in% unique(c(fixtures$`Home Team`, fixtures$`Away Team`)))
Here are the top 5 teams:
teams %>% arrange(rating) %>% `[`(32:1,) %>% head(5)
team rating
30 Brazil 2217.258
29 Argentina 2175.373
28 Spain 2092.173
27 Netherlands 2089.409
26 Belgium 2049.297
Reassuringly, these are the same top 5 as eloratings.net [3], with the same rank order. The scores are slightly different, but this shouldn’t dramatically alter things.
Monte Carlo Simulation
With Elo ratings calculated, the final task was to use them to run a Monte Carlo simulation of the World Cup. A Monte Carlo simulation draws random outcomes from a probability model many times over and summarises the results across all of the draws. Since there’s a lot of randomness in football results, this is useful for my purposes!
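As a toy illustration of the idea (separate from the tournament code), here is a sketch that simulates a single match many times. The win probability of 0.64 and the seed are arbitrary choices for the example.

```r
set.seed(2022)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical win probability for one match
p_win <- 0.64

# Draw the match result 25,000 times: 1 = win, 0 = loss
n_sims <- 25000
draws <- sample(c(0, 1), n_sims, replace = TRUE, prob = c(1 - p_win, p_win))

# The share of simulated wins should sit close to p_win
mean(draws)
```

The same logic scales up: simulate the whole tournament many times and count how often each team reaches each stage.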
For those not in the know, the World Cup stages are done as follows:
- The first stage is the ‘group stage’
  - 32 teams in 8 groups
  - 3 points for a win, 1 point for a draw
  - 1st and 2nd place in the group go forward
  - 48 games total
- The teams that go forward play in the ‘last 16’ stage
  - 8 games
  - The pattern is Group A 1st vs Group B 2nd, Group B 1st vs Group A 2nd
- Winners from the last 16 play in the 4 quarter-finals
- Winners from the quarters play in the 2 semi-finals
- Winners from the semi-finals play in the final
That’s a lot of predictions to make! So, Monte Carlo to the rescue.
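As a quick check on the group-stage count above, this sketch builds the round-robin pairings for one group with base R’s `combn`. Which side is labelled ‘home’ is arbitrary here.

```r
# Group A at this World Cup
group_a <- c("Netherlands", "Ecuador", "Senegal", "Qatar")

# Every pair of teams meets once: choose(4, 2) = 6 games per group
group_fixtures <- t(combn(group_a, 2))
colnames(group_fixtures) <- c("home_team", "away_team")
group_fixtures

# Eight groups of six games each
games_per_group <- nrow(group_fixtures)
group_stage_games <- 8 * games_per_group
group_stage_games
```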
I did include one simplification: I skipped the points system for the group stage, and simply sampled winners and losers. Hopefully this isn’t too drastic a simplification. However, it nicely avoids the issue of deciding thresholds for loss, win, and draw on a 0-1 range. Obviously if I wanted to predict points, it would be a different story.
Since I don’t however, I used the Elo expected win formula, but dropped the +100 component for home teams. Typically home teams do tend to do better in the World Cup, but I suspected this may not be the case in Qatar (and at the time of writing this instinct seems initially vindicated by the first game - the first time the hosts have lost the opening game).
I therefore coded up a fairly hefty for loop for the simulation. Given more time, I would have refactored it to (mostly) re-use a single function. As-is, there’s a lot of code duplication. It starts with probabilities for the 48 group-stage games, which can simply be generated from the Elo ratings. From here, it works out the next fixtures based on 1st and 2nd places, computes new probabilities, then rinses and repeats.
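To give a flavour of the group-stage step, here is a minimal sketch of simulating one group with win/loss sampling only. The ratings are made up, and ranking by number of wins with random tie-breaks is an assumption for illustration; the repo may order teams differently.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Illustrative ratings, not the ones computed in this post
ratings <- c(A = 1900, B = 1800, C = 1700, D = 1600)

# All six pairings in a group of four
games <- t(combn(names(ratings), 2))

# Win probability for the first-listed team, with no home advantage
p_first <- 1 / (1 + 10^(-(ratings[games[, 1]] - ratings[games[, 2]]) / 400))

# Sample a winner for each game (no draws, as in the simplification above)
first_wins <- runif(nrow(games)) < p_first
winners <- unname(ifelse(first_wins, games[, 1], games[, 2]))

# Rank teams by number of wins; ties are broken at random (an assumption)
wins <- table(factor(winners, levels = names(ratings)))
group_order <- names(wins)[order(-as.integer(wins), runif(length(wins)))]
group_order
```

The first two entries of `group_order` would then feed into the last 16 fixtures.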
Here’s an example from the loop for the quarter finals:
# Pair up the last 16 winners and attach each side's Elo rating
fixtures_quarter <- data.frame(
  home_team = teams_16[seq(1, 7, 2)],
  away_team = teams_16[seq(2, 8, 2)]
) %>%
  left_join(teams %>% rename(home_team = team, rating_home = rating),
            by = "home_team") %>%
  left_join(teams %>% rename(away_team = team, rating_away = rating),
            by = "away_team") %>%
  # Expected win probability, with no +100 home advantage
  mutate(home_win = 1 / (1 + 10^(-(rating_home - rating_away) / 400)))

# Simulate quarter results (1 = 'home' team wins, 0 = 'away' team wins)
num_quarter <- 4
wins_quarter <- rep(NA_real_, num_quarter)
for (i in seq_len(num_quarter)) {
  wins_quarter[i] <- sample(c(0, 1), 1,
                            prob = c(1 - fixtures_quarter$home_win[i], fixtures_quarter$home_win[i]))
}

# Vector of quarter-final winners
teams_quarter <- vector(mode = "character", length = num_quarter)
for (i in seq_len(num_quarter)) {
  teams_quarter[i] <- ifelse(wins_quarter[i] == 1, fixtures_quarter$home_team[i], fixtures_quarter$away_team[i])
}
I ran this simulation 25,000 times. I would have preferred to run it for longer, but in the end I needed to run it pretty close to the wire (i.e. before the first game had begun) to get my submission to the Essex contest in on time. Had I had more time, I probably would’ve run it for something closer to about 100,000 times.
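For what it’s worth, the refactor mentioned above might look something like the sketch below: one function that plays a knockout round, called repeatedly until a single team remains. The name `simulate_round`, the eight placeholder teams, and their ratings are all hypothetical.

```r
# Hypothetical helper: play one knockout round.
# `teams_in` is a character vector in bracket order (1 v 2, 3 v 4, ...);
# `ratings` is a named numeric vector of Elo ratings.
simulate_round <- function(teams_in, ratings) {
  home <- teams_in[seq(1, length(teams_in), by = 2)]
  away <- teams_in[seq(2, length(teams_in), by = 2)]
  home_win <- 1 / (1 + 10^(-(ratings[home] - ratings[away]) / 400))
  unname(ifelse(runif(length(home)) < home_win, home, away))
}

set.seed(42)  # arbitrary seed for reproducibility

# Illustrative bracket of eight teams with made-up ratings
ratings <- setNames(seq(2000, 1650, by = -50), paste0("Team", 1:8))
remaining <- names(ratings)

# Quarter-finals, semi-finals and final all reuse the same function
while (length(remaining) > 1) {
  remaining <- simulate_round(remaining, ratings)
}
remaining
```

Wrapping the whole tournament in a function like this would also make it easier to raise the number of runs later.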
Results
Here, I’m going to focus mostly on results relevant to the Essex prediction contest, with a lot of reference to the points system in that contest. Where relevant, however, I’ll note some files in the GitHub repo that contain some extra predictions. Since at a later point I may produce a second blog post with some Brier scores for this prediction and others, I’ll be including some probabilities.
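For readers unfamiliar with Brier scores: the score is the mean squared difference between forecast probabilities and what actually happened (1 if the event occurred, 0 if not), so lower is better. Here is a minimal sketch using the Group A last 16 probabilities from the table further down; the outcomes are hypothetical placeholders, not real results.

```r
# Forecast probabilities of reaching the last 16 (Group A, as proportions)
forecast <- c(Netherlands = 0.805, Ecuador = 0.559, Senegal = 0.447, Qatar = 0.190)

# Hypothetical outcomes for illustration only: 1 = progressed, 0 = did not
outcome <- c(Netherlands = 1, Ecuador = 1, Senegal = 0, Qatar = 0)

# Brier score: mean squared error of the probabilities
brier <- mean((forecast - outcome)^2)
brier
```

A forecaster who says 50% for everything always scores 0.25, which makes a handy baseline.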
Group Rank Order
The first requirement was to predict the rank order of teams in the group stages, at 2pts a pop. Individual game predictions were made using Elo ratings alone, and are in the results/results_01_group_fixtures.csv file.
Here are my predicted group order results (with teams ordered from 1st to last):
Group | Order | Probability (%) |
---|---|---|
Group A | Netherlands, Ecuador, Senegal, Qatar | 15.8 |
Group B | England, Iran, USA, Wales | 9.9 |
Group C | Argentina, Mexico, Poland, Saudi Arabia | 17.9 |
Group D | France, Denmark, Australia, Tunisia | 13.5 |
Group E | Spain, Germany, Japan, Costa Rica | 13.8 |
Group F | Belgium, Croatia, Morocco, Canada | 12.5 |
Group G | Brazil, Switzerland, Serbia, Cameroon | 21.9 |
Group H | Portugal, Uruguay, South Korea, Ghana | 15.2 |
All in all, each rank order has a fairly low probability. This shouldn’t be surprising - each group has 24 different possible outcomes! Group B looks like the one most likely to produce interesting results, while group G seems the most predictable (though by no means settled according to the simulation).
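A short sketch of where the 24 comes from, and of how simulated orders can be turned into probabilities. The four simulated orders here are placeholders; the real run had 25,000 of them per group.

```r
# Number of ways to order four teams
n_orders <- factorial(4)
n_orders

# Hypothetical simulated group orders, one string per simulation run
sim_orders <- c("A, B, C, D", "A, C, B, D", "A, B, C, D", "B, A, C, D")

# Share of runs producing each order, most frequent first
order_probs <- sort(prop.table(table(sim_orders)), decreasing = TRUE)
order_probs[1]
```

For comparison, picking one of the 24 orders at random would be right only about 4.2% of the time, so even the 9.9% for Group B is a fair improvement on guessing.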
Team Progression Points
The next set of points was for predicting which teams would get through to the last 16 stage, with one point each. These were based on the above rank order submission.
It’s worth presenting these together with the later rounds, as most of the other predictions took on this character. There were 2 points for each team reaching the quarter-finals, 2 points for each team reaching the semi-finals, 2 points for each team reaching the final, and 3 points for predicting the winner.
Here are all the probabilities I have for these quantities:
Team | Last 16 | Quarter Finals | Semi Finals | Final | Win |
---|---|---|---|---|---|
Brazil | 90.2 | 68.6 | 50 | 34.9 | 25.1 |
Argentina | 91 | 65.9 | 47.8 | 28.8 | 18.7 |
Spain | 78.5 | 51.8 | 27.6 | 15.9 | 8.5 |
France | 77.3 | 46.5 | 29.1 | 15.1 | 7.1 |
Belgium | 77.3 | 44.9 | 23.8 | 13.2 | 6.2 |
Netherlands | 80.5 | 51.2 | 25.9 | 12.8 | 6.3 |
Portugal | 71.8 | 34.1 | 18.1 | 9.5 | 3.9 |
England | 68.8 | 40.3 | 19.9 | 8.6 | 3.6 |
Uruguay | 70.5 | 33.1 | 17.1 | 8.5 | 3.5 |
Denmark | 64.7 | 32.5 | 17.7 | 7.7 | 2.9 |
Germany | 60.6 | 32.7 | 14.4 | 6.8 | 2.6 |
Switzerland | 52.4 | 25.4 | 11.6 | 5.1 | 2 |
Iran | 56.7 | 28.8 | 11.9 | 4.5 | 1.5 |
Croatia | 57.4 | 26.3 | 10.3 | 4.5 | 1.6 |
Ecuador | 55.9 | 25.9 | 9.8 | 3.6 | 1.1 |
Serbia | 43.9 | 19.3 | 7.6 | 3 | 0.9 |
USA | 45.9 | 20.7 | 7.6 | 2.6 | 0.8 |
Mexico | 46.1 | 17.5 | 7 | 2 | 0.5 |
South Korea | 44 | 14.5 | 5.5 | 1.9 | 0.6 |
Senegal | 44.7 | 18 | 5.8 | 1.8 | 0.4 |
Japan | 35.7 | 14.5 | 4.7 | 1.8 | 0.6 |
Poland | 40.8 | 14.3 | 5.4 | 1.6 | 0.4 |
Morocco | 36.6 | 12.5 | 3.6 | 1.2 | 0.3 |
Australia | 30.6 | 10 | 3.7 | 1 | 0.2 |
Wales | 28.6 | 10.3 | 3 | 0.9 | 0.2 |
Tunisia | 27.3 | 7.9 | 2.9 | 0.7 | 0.1 |
Costa Rica | 25.3 | 8.8 | 2.5 | 0.7 | 0.1 |
Canada | 28.7 | 8.6 | 2.2 | 0.6 | 0.1 |
Saudi Arabia | 22 | 5.4 | 1.5 | 0.3 | 0 |
Qatar | 19 | 4.7 | 1 | 0.2 | 0 |
Cameroon | 13.5 | 3.1 | 0.6 | 0.1 | 0 |
Ghana | 13.7 | 2.1 | 0.4 | 0.1 | 0 |
Except for reaching the last 16, I simply took the top 8/4/2/1 for the respective predictions in the contest.
Multiplying points by probabilities, my expected score so far is roughly 26.6 (though naturally with a lot of uncertainty either way), out of a total possible score of 63. That doesn’t sound too great, but my expectation is that given the overall unpredictability, most other contestants won’t have a high expected score either.
Of course, this expectation is conditional on the probabilities being correct - and there’s a good chance they aren’t, and that there are at least some issues in the Elo ratings. Hopefully however these errors average out, keeping the above expectation intact.
Since a lot of events are rare, this expected value is probably on the high side - it’s more likely below this than above it.
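The expected score can be reconstructed from the two tables above. This sketch assumes 2 points per correctly ordered group and uses the probabilities for the picks described (top two per predicted group order for the last 16, then the top 8/4/2/1 by probability). It is a reconstruction for illustration rather than the calculation from the repo.

```r
# Probabilities (in %) copied from the tables above for the picks submitted
group_order <- c(15.8, 9.9, 17.9, 13.5, 13.8, 12.5, 21.9, 15.2)
last_16 <- c(80.5, 55.9, 68.8, 56.7, 91.0, 46.1, 77.3, 64.7,
             78.5, 60.6, 77.3, 57.4, 90.2, 52.4, 71.8, 70.5)
quarters <- c(68.6, 65.9, 51.8, 51.2, 46.5, 44.9, 40.3, 34.1)
semis <- c(50.0, 47.8, 29.1, 27.6)
final <- c(34.9, 28.8)
winner <- 25.1

# Points per correct pick multiplied by the probability of being right
expected <- c(
  group_order = 2 * sum(group_order),
  last_16     = 1 * sum(last_16),
  quarters    = 2 * sum(quarters),
  semis       = 2 * sum(semis),
  final       = 2 * sum(final),
  winner      = 3 * winner
) / 100

# Maximum score if every pick is correct
max_score <- 2 * 8 + 1 * 16 + 2 * 8 + 2 * 4 + 2 * 2 + 3 * 1

round(expected, 2)
round(sum(expected), 1)
max_score
```

Most of the expected points come from the last 16 and quarter-final picks, simply because there are more of them and they are individually more likely.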
Extras
The Essex contest also asked some other questions. Namely:
- Top Goal Scorer (2 pts)
- Number of goals scored by the top scorer (bonus 1pt)
- Number of red cards (tie breaker)
- Whether the correlation between group stage points will be (2pts):
  - Positive and statistically significant
  - Positive and statistically insignificant
  - Negative and statistically significant
  - Negative and statistically insignificant
For the first one, the best betting odds are on Harry Kane, who was also the last World Cup’s top scorer. I therefore picked him, with 6 goals - the same as the previous World Cup.
I also used the same number of red cards from the previous World Cup: 4.
Finally, for the last question I found a report from 2010 indicating that for that World Cup this was positive and statistically significant [5]. So that was my answer.
Conclusion
I learned a lot putting together this Elo system: both some really interesting things about Elo ratings, and some other things about football.
Hopefully these predictions do okay. At some point, I may try and find some other predictions to compare them against with Brier scores. Until then, best of luck to the other contestants!
Footnotes
1. Football matches dataset: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017
2. Fixture dates website: https://fixturedownload.com/
3. Elo Ratings website: http://eloratings.net/about
4. Blog post by Edouard Mathieu implementing a similar project for 2018, using the elo R package: https://edomt.github.io/Elo-R-WorldCup/
5. PDF page 5/report page 3, if you’re interested: https://www.pwc.com/gx/en/issues/economy/global-economy-watch/assets/pdfs/global-economy-watch-june-2014-how-to-win-the-world-cup.pdf