Multiplayer Elo

The Elo rating system is designed for two-player games, like chess.

It would be nice to have a similar rating system for games with more players. Unfortunately, generalization of the Elo system to n-player (multiplayer) games has proven difficult.


Simple Multiplayer Elo (SME)

Generalization may not be necessary, though. I've come up with a simple and effective way to apply the two-player Elo system to multiplayer scenarios:
  • At the end of a game, make a list of all the players and sort it by performance.
  • Think of each player as having played two matches: a loss vs. the player right above him on the list, and a win vs. the player right below him.
  • Update each player's rating accordingly using the two-player Elo equations.
I call this method "Simple Multiplayer Elo" (SME) and am making it public domain.

If you decide to use SME, I would appreciate an e-mail!


Performance

To measure the performance of SME, I did the following experiment:
  • Imagine 10 players with "true" Elo ratings of 1100, 1200, ..., 2000.
  • Each player is assigned a random initial SME rating of 1500 ± 1.
  • Multiple rounds of an imaginary game are simulated:
    • The players are assigned random game scores according to their true ratings. The scores have normal distributions with μ = true rating, σ² = 200.
    • The predictive ability of the SME ratings is calculated by enumerating the 45 possible pairings between the 10 players, counting the number of times the higher-rated player beat the lower-rated player, and dividing by 45.
    • The predictive ability of the true ratings is calculated using the same method. This is the ideal predictive ability.
    • The SME ratings are updated according to the SME method, described above.
I ran this experiment millions of times and averaged the results:

After round... SME predictive ability Ideal predictive ability
50.0% (random) 84.5%
1 62.3% 84.5%
2 72.5% 84.5%
3 77.3% 84.5%
4 79.2% 84.5%
5 80.3% 84.5%
10 82.2% 84.5%
20 83.3% 84.5%
1000 83.9% 84.5%

You can see a graph of this curve here.

I think this experiment shows that SME only needs a small number of rounds to achieve near-ideal predictive ability, and is thus a strong candidate for any application that requires a multiplayer rating system.


Compared to Microsoft's TrueSkill

A well-known multiplayer rating system is Microsoft's TrueSkill (web site, paper). It's a much more complicated system that's based on Bayesian inference.

I ran the experiment described above using Jesse Buesking's implementation of TrueSkill (GitHub):

Round SME TrueSkill
1 62.3% 63.4%
2 72.5% 68.7%
3 77.3% 71.7%
4 79.2% 73.7%
5 80.3% 75.1%
10 82.2% 78.7%
20 83.3% 81.1%
100 83.9% 83.7%
1000 83.9% 84.4%

SME approaches ideal accuracy much faster than TrueSkill. Eventually TrueSkill produces ratings that are almost perfect (after hundreds of rounds) but at that point the difference is marginal.


Files

sme_graph.png Graph of SME predictive ability