Forecasting the 2022 World Cup

How I used Elo Ratings and Monte Carlo simulation to enter the Essex Department of Government World Cup Prediction Contest

I recently entered a World Cup prediction contest in the Department of Government at Essex University. This isn’t especially unusual, except that my football knowledge is marginally smaller than my chances of home ownership.

Not willing to let that stop me, I decided to turn to my programming and data science skills to see if they could help. After some research, I settled on building an Elo rating system to generate my predictions for me.

If you’re from the world of chess, you’ll be familiar with an Elo rating system. The basic idea is that you give everyone a starting rating. Then, as they play games against each other, their ratings get updated. If you beat a player with a much lower rating, you don’t gain many points. If you beat a player with a much higher rating, you gain many. And vice versa for losing points.

Elo rating systems have a lot of appeal for football prediction. They implicitly handle the time factor: as teams get better or worse, their Elo scores rise and fall accordingly. They also come with a ready-made formula for producing win probabilities.

The one thing they may handle poorly in this case is rating teams across continents. Since teams play more often within their own continent, cross-continental ratings are rarely updated outside major events, and the predictions here may suffer somewhat as a result.

Throughout this blog post, I’ll share some of my R code snippets. However, since all of the code is available in a GitHub repo, I won’t set it out in full here.

Data

Before implementing any kind of predictive algorithm or model, I need data to base it on. Fortunately, there’s a nice dataset of most (if not all) historic international football matches, currently maintained by Mart Jürisoo and available on Kaggle1.

There’s also a nice dataset of fixture dates available on the creatively named Fixture Dates website2, which nicely avoids most of the need to manually code these.

I therefore used these two datasets throughout the modelling process.

Elo Ratings

There are two main decisions to be made when building Elo ratings:

  1. the initial rating each team starts with; and
  2. the update algorithm applied after each match.

I’ll now go through both of these in turn.

Initial Ratings

In my procedure, I started all teams with a rating of 1500. The eloratings.net rankings used some slightly different starting values3, but that should ultimately have a relatively marginal effect.

team_ratings <- data.frame(
  team = unique(c(matches$home_team, matches$away_team)),
  rating = 1500
)

I chose 1500 simply because it’s a relatively standard choice - no special reasons or motivations here. Regardless, after roughly 30 games, ratings should reach their ‘correct’ value. And there are a lot more than 30 games per team here!

Update Algorithm

With that done, the main thing to implement was the iterative update algorithm. At the time of writing there are 44,060 matches in the dataset, starting in 1872.

The eloratings.net update is calculated as3

\[R_n = R_o + K \times (W - W_e)\]

where \(R_n\) is the new elo rating, \(R_o\) is the old elo rating, \(K\) is an importance value, \(W\) is the result, and \(W_e\) is the expected result. These last three terms obviously need some defining.

Starting with the importance value, \(K\): this essentially determines the size of the update. At about 1, there’s barely any update; at 100, the update is usually too large. 30 is a fairly standard value4. eloratings.net uses values that vary with the type of match being played3, from friendlies at the low end up to World Cup finals matches at the top.

From here, in the eloratings.net calculation, \(K\) is adjusted based on the goal difference in the game4. Where the goal difference is 2, \(K\) is increased by \(\frac{1}{2}\); where it is 3, by \(\frac{3}{4}\); and where it is \(N \geq 4\), by \(\frac{3}{4} + \frac{N-3}{8}\).
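As a quick sketch of that adjustment (a helper of my own for illustration, not code from the repo):

```r
# Hypothetical helper implementing the goal-difference adjustment described
# above: K grows by 1/2 for a two-goal win, 3/4 for three goals, and
# 3/4 + (N - 3)/8 for N >= 4 goals.
adjust_k <- function(k, goal_diff) {
  if (goal_diff == 2) {
    k * (1 + 1 / 2)
  } else if (goal_diff == 3) {
    k * (1 + 3 / 4)
  } else if (goal_diff >= 4) {
    k * (1 + 3 / 4 + (goal_diff - 3) / 8)
  } else {
    k
  }
}

adjust_k(30, 1)  # 30
adjust_k(30, 2)  # 45
adjust_k(30, 5)  # 60
```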

I used a value of 30 across all games, but kept the goal multiplier as set out above. This was largely because after some initial trial and error along with some quick eyeballing of results (and comparing them to the World Elo Ratings), this just seemed to produce the most sensible results. We’ll see how far off that statement is at the end of December!

The observed result \(W\) is simple in its definition: 1 for a win, 0.5 for a draw, and 0 for a loss.

The expected result \(W_e\) is calculated as

\[W_e = \frac{1}{10^{-dr/400} + 1}\]

where \(dr\) is the difference in ratings between the teams, plus 100 for the home team.

You could compute this for both teams (remembering to flip the sign of the 100 for the away team), but the two values sum to 1, so one calculation is enough.
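To make that concrete, here’s a quick check with made-up ratings (not teams from the dataset), showing the two expected results summing to 1:

```r
# Illustrative ratings (made up for the example)
rating_home <- 2000
rating_away <- 1900

# Rating difference includes +100 for the home team...
dr_home <- rating_home - rating_away + 100
we_home <- 1 / (1 + 10^(-dr_home / 400))

# ...and the sign of the 100 flips for the away team
dr_away <- rating_away - rating_home - 100
we_away <- 1 / (1 + 10^(-dr_away / 400))

we_home + we_away  # sums to 1
```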

So, with all the pieces in place, I implemented a loop in R over all matches (minus one with missing scores):

# Loop over matches, updating ratings after each game.
# `matches$importance` holds the (goal-adjusted) K value for the game, and
# `matches$result` the home team's result (1 = win, 0.5 = draw, 0 = loss).
for (i in 1:nrow(matches)) {
  
  # Teams
  home_team <- matches$home_team[i]
  away_team <- matches$away_team[i]
  
  # Current ratings
  rating_home <- team_ratings$rating[team_ratings$team == home_team]
  rating_away <- team_ratings$rating[team_ratings$team == away_team]
  
  # Expected result for both teams (+100 home advantage)
  expected_home <- 1 / (1 + 10^(-(rating_home - rating_away + 100) / 400))
  expected_away <- 1 - expected_home
  
  # Update ratings
  team_ratings$rating[team_ratings$team == home_team] <- rating_home + matches$importance[i] * (matches$result[i] - expected_home)
  team_ratings$rating[team_ratings$team == away_team] <- rating_away + matches$importance[i] * (1 - matches$result[i] - expected_away)
  
}

# World Cup Team Elo Ratings
teams <- team_ratings %>% filter(team %in% unique(c(fixtures$`Home Team`, fixtures$`Away Team`)))

Here are the top 5 teams:

teams %>% arrange(rating) %>% `[`(32:1,) %>% head(5)

          team   rating
30      Brazil 2217.258
29   Argentina 2175.373
28       Spain 2092.173
27 Netherlands 2089.409
26     Belgium 2049.297

Reassuringly, these are the same top 5 as eloratings.net3, with the same rank order. The scores are slightly different, but this shouldn’t dramatically alter things.

Monte Carlo Simulation

With Elo ratings calculated, the final task was to use them to produce a Monte Carlo simulation of the World Cup. A Monte Carlo simulation is one in which many random draws are made from a model, with the distribution of outcomes then summarised. Since there’s a lot of randomness in football results, this is useful for my purposes!

For those not in the know, the World Cup runs as follows: a group stage of eight groups of four teams, from which the top two in each group advance; then knockout rounds of 16, quarter-finals, semi-finals, and a final.

That’s a lot of predictions to make! So, Monte Carlo to the rescue.

I did include one simplification: I skipped the points system for the group stage, and simply sampled winners and losers. Hopefully this isn’t too drastic a simplification. However, it nicely avoids the issue of deciding thresholds for loss, draw, and win on a 0-1 range. Obviously if I wanted to predict points, it would be a different story.

Since I don’t, however, I used the Elo expected-win formula, but dropped the +100 component for home teams. Home teams typically do better in the World Cup, but I suspected this might not be the case in Qatar (and at the time of writing this instinct seems initially vindicated by the first game - the first time the hosts have lost their opening game).
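Under those choices, a single knockout match can be simulated as follows (a sketch with a hypothetical `simulate_match` helper and a toy ratings data frame, not the repo’s exact code):

```r
# Hypothetical helper: pick a winner from two Elo ratings, using the
# expected-win formula without the +100 home advantage.
simulate_match <- function(team_a, team_b, ratings) {
  ra <- ratings$rating[ratings$team == team_a]
  rb <- ratings$rating[ratings$team == team_b]
  p_a <- 1 / (1 + 10^(-(ra - rb) / 400))
  if (runif(1) < p_a) team_a else team_b
}

# Toy ratings for illustration
toy_ratings <- data.frame(team = c("A", "B"), rating = c(2000, 1800))
set.seed(42)
simulate_match("A", "B", toy_ratings)
```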

I therefore coded up a fairly hefty for loop for the simulation. Given more time, I would have refactored it to (mostly) re-use a single function; as-is, there’s a lot of code duplication. It starts with probabilities for the 48 group-stage games, which can be generated directly from the Elo ratings. From there, the next fixtures are worked out based on first and second places, new probabilities computed, and so on.

Here’s an example from the loop for the quarter finals:

# Quarter-final fixtures from the last-16 winners, with ratings joined on
fixtures_quarter <- data.frame(
  home_team = c(teams_16[seq(1, 7, 2)]),
  away_team = c(teams_16[seq(2, 8, 2)])
) %>% left_join(teams %>% 
                  rename(home_team = team,
                         rating_home = rating),
                by = "home_team") %>%
  left_join(teams %>% 
              rename(away_team = team,
                     rating_away = rating),
            by = "away_team") %>%
  mutate(home_win = 1 / (1 + 10^(-(rating_home - rating_away)/400)))

# Simulate quarter-final results
num_quarter <- 4
wins_quarter <- rep(NA_real_, num_quarter)
for (i in 1:num_quarter) {
  wins_quarter[i] <- sample(c(0, 1), 1, prob = c(1 - fixtures_quarter$home_win[i], fixtures_quarter$home_win[i]))
}

# Vector of quarter-final winners
teams_quarter <- vector(mode = 'character', length = num_quarter)
for (i in 1:num_quarter) {
  teams_quarter[i] <- ifelse(wins_quarter[i] == 1, fixtures_quarter$home_team[i], fixtures_quarter$away_team[i])
}

I ran this simulation 25,000 times. I would have preferred to run it for longer, but in the end I needed to run it pretty close to the wire (i.e. just before the first game began) to get my submission in to the Essex contest on time. With more time, I probably would’ve run it closer to 100,000 times.
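The 25,000 runs then reduce to probabilities by simple counting. As a sketch with a stand-in `winners` vector (the real one comes from the simulation loop, not from the `sample` call below):

```r
# Stand-in for the simulation output: one champion per run
n_sims <- 25000
set.seed(1)
winners <- sample(c("Brazil", "Argentina", "Spain"), n_sims,
                  replace = TRUE, prob = c(0.5, 0.3, 0.2))

# Win probabilities are just relative frequencies across runs
win_probs <- sort(table(winners) / n_sims, decreasing = TRUE)
round(win_probs * 100, 1)
```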

Results

Here, I’m going to focus mostly on results relevant to the Essex prediction contest, with a lot of reference to the points system in that contest. Where relevant, however, I’ll note some files in the GitHub repo that contain some extra predictions. Since at a later point I may produce a second blog post with some Brier scores for this prediction and others, I’ll also be including some probabilities.

Group Rank Order

The first requirement was to predict the rank order of teams in the group stages, at 2pts a pop. Individual game predictions were made using elo ratings alone, and are in the results/results_01_group_fixtures.csv file.

Here are my predicted group order results (with teams ordered from 1st to last):

Group     Predicted order (1st to 4th)               Probability (%)
Group A   Netherlands, Ecuador, Senegal, Qatar       15.8
Group B   England, Iran, USA, Wales                   9.9
Group C   Argentina, Mexico, Poland, Saudi Arabia    17.9
Group D   France, Denmark, Australia, Tunisia        13.5
Group E   Spain, Germany, Japan, Costa Rica          13.8
Group F   Belgium, Croatia, Morocco, Canada          12.5
Group G   Brazil, Switzerland, Serbia, Cameroon      21.9
Group H   Portugal, Uruguay, South Korea, Ghana      15.2

All in all, each rank order has a fairly low probability. This shouldn’t be surprising - each group has 24 different possible orderings! Group B looks like the one most likely to produce interesting results, while Group G seems the most predictable (though by no means settled according to the simulation).

Team Progression Points

The next set of points was for predicting which teams would get through to the last 16 stage, with one point each. These were based on the above rank order submission.

It’s worth presenting these together, as most of the other predictions took the same form. There were 2 points for each team reaching the quarter-finals, 2 points for each team reaching the semi-finals, 2 points for each team reaching the final, and 3 points for predicting the winner.

Here are all the probabilities I have for these quantities (all figures are percentages):

Team  Last 16  Quarter Finals  Semi Finals  Final  Win
Brazil 90.2 68.6 50 34.9 25.1
Argentina 91 65.9 47.8 28.8 18.7
Spain 78.5 51.8 27.6 15.9 8.5
France 77.3 46.5 29.1 15.1 7.1
Belgium 77.3 44.9 23.8 13.2 6.2
Netherlands 80.5 51.2 25.9 12.8 6.3
Portugal 71.8 34.1 18.1 9.5 3.9
England 68.8 40.3 19.9 8.6 3.6
Uruguay 70.5 33.1 17.1 8.5 3.5
Denmark 64.7 32.5 17.7 7.7 2.9
Germany 60.6 32.7 14.4 6.8 2.6
Switzerland 52.4 25.4 11.6 5.1 2
Iran 56.7 28.8 11.9 4.5 1.5
Croatia 57.4 26.3 10.3 4.5 1.6
Ecuador 55.9 25.9 9.8 3.6 1.1
Serbia 43.9 19.3 7.6 3 0.9
USA 45.9 20.7 7.6 2.6 0.8
Mexico 46.1 17.5 7 2 0.5
South Korea 44 14.5 5.5 1.9 0.6
Senegal 44.7 18 5.8 1.8 0.4
Japan 35.7 14.5 4.7 1.8 0.6
Poland 40.8 14.3 5.4 1.6 0.4
Morocco 36.6 12.5 3.6 1.2 0.3
Australia 30.6 10 3.7 1 0.2
Wales 28.6 10.3 3 0.9 0.2
Tunisia 27.3 7.9 2.9 0.7 0.1
Costa Rica 25.3 8.8 2.5 0.7 0.1
Canada 28.7 8.6 2.2 0.6 0.1
Saudi Arabia 22 5.4 1.5 0.3 0
Qatar 19 4.7 1 0.2 0
Cameroon 13.5 3.1 0.6 0.1 0
Ghana 13.7 2.1 0.4 0.1 0

Except for reaching the last 16, I simply took the top 8/4/2/1 for the respective predictions in the contest.

Multiplying points by probabilities, my expected score so far is roughly 26.6 (though naturally with a lot of uncertainty either way), out of a total possible score of 63. That doesn’t sound too great, but my expectation is that given the overall unpredictability, most other contestants won’t have a high expected score either.
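For reference, the expected-score arithmetic is just a sum of points times probabilities. A toy version (made-up numbers, not the full contest tables):

```r
# Toy example of the expected-score calculation: points on offer for each
# prediction, multiplied by the predicted probability it comes true.
points <- c(2, 1, 2, 3)                  # e.g. a group order, last 16, quarters, winner
probs  <- c(0.158, 0.902, 0.686, 0.251)  # predicted probability of each coming true
sum(points * probs)                      # expected points from these four predictions
```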

Of course, this expectation is conditional on the probabilities being correct - and there’s a good chance they aren’t, and that there are at least some issues in the Elo ratings. Hopefully, however, these errors average out, keeping the above expectation intact.

Since a lot of the events are rare, this expected value is probably on the high side - my actual score is more likely to fall below it than above it.

Extras

The Essex contest also asked some other questions. Namely:

For the first one, the best betting odds are on Harry Kane, who was also the last World Cup’s top scorer. I therefore put him down, with 6 goals - the same as in the previous World Cup.

I also used the same number of red cards as the previous World Cup: 4.

Finally, for the last question I found a report from 2010 indicating that for that World Cup this was positive and statistically significant5. So that was my answer.

Conclusion

I learned a lot putting together this Elo system - both some really interesting things about Elo ratings, and some other things about football.

Hopefully these predictions do okay. At some point, I may try to find some other predictions to compare them against with Brier scores. Until then, best of luck to the other contestants!

Footnotes

  1. Football matches dataset: https://www.kaggle.com/datasets/martj42/international-football-results-from-1872-to-2017 

  2. Fixture dates website: https://fixturedownload.com/ 

  3. Elo Ratings website: http://eloratings.net/about

  4. Blog post by Edouard Mathieu implementing a similar project for 2018, using the elo R package: https://edomt.github.io/Elo-R-WorldCup/

  5. PDF page 5/report page 3, if you’re interested: https://www.pwc.com/gx/en/issues/economy/global-economy-watch/assets/pdfs/global-economy-watch-june-2014-how-to-win-the-world-cup.pdf