Chess Ratings — Move Over Elo
databuff writes "Less than 24 hours ago, Jeff Sonas, the creator of the Chessmetrics rating system, launched a competition to find a chess rating algorithm that performs better than the official Elo rating system. The competition requires entrants to build their rating systems based on the results of more than 65,000 historical chess games. Entrants then test their algorithms by predicting the results of another 7,809 games. Already three teams have managed to create systems that make more accurate predictions than the official Elo approach. It's not a surprise that Elo has been outdone -- after all, the system was invented half a century ago, before we could easily crunch large amounts of historical data. However, it is a big surprise that Elo has been bettered so quickly!"
However, it is a big surprise that Elo has been bettered done so quickly!
Absolutely. I can almost guarantee no one thought that Elo would have been bettered done so quickly.
William of Ockham had no beard. The most likely explanation is that it was chewed off by squirrels every morning.
Elo-L
Not really. Jeff Sagarin has had two systems of rating sports teams for a while now. One, ELO_CHESS, is based purely on win-loss, while the other, PURE POINTS, takes into account margin of victory. According to him, the latter is better at predicting future results. From his analysis:
Less than 24 hours ago, the readers of Slashdot launched a competition to find an editing algorithm that performs better than the official "editors" of the site. The competition requires entrants to build their comment systems based on the results of over 9,000 historical submissions. Entrants then test their algorithms by predicting the results of the next 7,809 dup^H^H^Hstories. Already three teams have managed to create systems that make more accurate predictions than the official /. approach. It's not a surprise that Timothy has been outdone -- after all, he was invented half a century ago before English had been standardized. However, it is no big surprise that Slashdot has been bettered done so quickly! The winner: Texas Instruments!
Haida Manga
Looking at the table, the differences in predictive power are small enough that it's not obvious they aren't due to chance alone. There needs to be some calculation showing that the differences are meaningful, to validate the claim that the alternative methods actually extract more information than Elo does. Perhaps there is enough inherent randomness in chess that even simple predictive models can extract most of the systematics, so that what remains after Elo is mostly noise?
That number is "Root Mean Square Error", so lower is better
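For anyone who wants to see what that metric does, here is a minimal sketch of root mean square error over predicted and actual game scores. It assumes per-game scores of 1, 0.5, and 0 for a win, draw, and loss; the function name is made up for illustration, and the contest's exact aggregation is not described in this thread.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Predicting "coin flip" for one win and one loss misses by 0.5 each time.
print(rmse([0.5, 0.5], [1.0, 0.0]))
```

A perfect predictor scores 0, which is why lower is better.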
Indeed, Sagarin has shown that applying Elo in sports where the winner is based on points scored is not optimal, since the average margin of victory is a better predictor of strength than won-loss record. But this has nothing to do with applying the Elo method to its original setting of chess, where the outcome of the game is only "win/draw/loss" and there is no margin of victory.
Are the better entries as transparent? Elo's a pretty simple way to do this: add or subtract a few points from the rating after a win or a loss, scaled by the relative difference of the ratings. Would anyone understand (other than "It's a neural net") the ratings produced by these competitors? Would anything human be able to calculate them?
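To make the "simple" part concrete, here is a rough sketch of the textbook logistic form of the Elo update. The 400-point scale is the commonly published one; k=20 is only an illustrative K-factor, and the function names are my own. Official federation tables differ in the details.

```python
def expected_score(rating_a, rating_b):
    """Expected score for player A (between 0 and 1) under the logistic Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=20.0):
    """Return both new ratings after one game.

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k=20 is an assumed K-factor, not an official value.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta

# Evenly matched players: the winner takes half of k from the loser.
print(update(1500, 1500, 1.0))
# A 2200 player losing to a 1400 player.
print(update(2200, 1400, 0.0))
```

Under these assumptions, a single loss by the 2200 player to the 1400 player costs a little under 20 points, so one bad night moves the rating but nowhere near the opponent's level.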
Also, are the new models' improvements in prediction statistically significant? Or are they just fitting the noise? Both the training dataset and the test dataset seem rather small to me.
Finally, and most importantly, how stable are the ratings? If I'm drunk and lose to a "patzer", do I go down to his level? Fairness of tournaments having small numbers of games has a lot to do with rating stability (unless we're assuming a population periodically beset by huge random shifts in ability).
All-in-all, there's a lot of problems coming up with a good rating system. Opening the dataset to the world, saying "Have at it!", and looking at a single scorecard based solely on predictability is nowhere near sufficient.
That is all.
Ah man, no matter how inadequate the Elo system may be for chess, it's much worse seeing it applied to other games where it doesn't belong, which happens regrettably often. The trouble is that the Elo system depends on the premise that nothing affects the outcome of a game other than the skill of each player (and who gets the white pieces).
In chess, that assumption is a pretty good approximation to reality, since every tournament game is run the same way. But many games do have variations in rules or format across different events, such as different maps or races in a real-time strategy game, or different card pools in Magic: The Gathering. Then Elo ratings are biased by how often a player has the chance to play to his strong areas. Players in turn are compelled to game the system: "I should avoid this event because they're using Format X and my rating will stay stronger if I stick to Format Y." The Elo system is meant precisely to obviate that kind of gamesmanship: chess players should need to think only about the strengths of their opponents, which (in principle) will be weighted fairly when calculating rating adjustments. But if there are other competitive factors, which is true for most any popular game invented in the last 30 years, Elo ratings become that much less meaningful.
"This algorithm runs in constant time. Come on, 2,147,483,648 is a constant..."
The Elo Benchmark was submitted a second time. I wrote to Sonas about this. Apparently the rating system has to be seeded. He tried a different approach to calculating seed ratings and this performed better - pushing him one place higher in the rankings.
Three teams done bettered Elo with betterer done algorithms, and the submitter is surprised that it was bettered done so quickly. I'm done. Was that better?
He sounds like Lady Macbeth on crack.
Here’s the problem with Battle.net 2.0: 2002’s Warcraft III: Reign of Chaos is one of the most underrated video games ever created. And that’s before you learn its online apparatus is the foundation for modern matchmaking, where Blizzard Entertainment should get royalties every time you brag about your X-Box Live Trueskill rating. (Then again, I shouldn’t be giving Blizzard ideas right now.)
Here’s how Warcraft III matchmaking worked: Everyone starts at level one. The maximum level is fifty. You play players within six levels of your own. Win five games, gain a level. Lose five games, lose a level. The penalty for losing is reduced during levels one to nine. Thus, players who win half their games will become level ten.
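The claim that a 50% player ends up at level ten can be checked with a toy model of the rules as quoted. This is only a sketch under stated assumptions: the quote says the loss penalty is "reduced" below level ten without giving a number, so the half-size penalty here is a guess, the function name is invented, and the "within six levels" opponent matching is not modeled at all.

```python
def play_ladder(results, reduced_loss=0.5, games_per_level=5, max_level=50):
    """Return the final ladder level after a sequence of results (True = win).

    progress counts net wins banked toward the next level; a loss costs a
    full point from level 10 up and reduced_loss (assumed 0.5) below that.
    """
    level = 1
    progress = 0.0
    for won in results:
        if won:
            progress += 1.0
        else:
            progress -= reduced_loss if level < 10 else 1.0
        if progress >= games_per_level:
            if level < max_level:
                level += 1
                progress -= games_per_level
            else:
                progress = games_per_level  # pinned at the top level
        elif progress <= -games_per_level:
            if level > 1:
                level -= 1
                progress += games_per_level
            else:
                progress = -games_per_level  # pinned at the bottom level
    return level

# A player who alternates win, loss, win, loss... for 1000 games.
fifty_percent = [game % 2 == 0 for game in range(1000)]
print(play_ladder(fifty_percent))
```

In this model the alternating player climbs while losses are cheap and then stalls once wins and losses cancel exactly, which is the behaviour the quote describes.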
It was simple and transparent. That was the hook, and people choked on it. It turned Warcraft III ladder play into what ICCUP serves for Starcraft players, a stomping ground so competitive that climbing the food chain gave you a shot at the guys who played for a living. That’s what a good online gaming system does.
The quote comes from Battle.net 2.0: The Antithesis of Consumer Confidence. I would encourage you to read the entire thing, but for reasons completely unrelated to this thread.
Not relevant specifically to this story, but I always laugh at the story of how a prisoner manipulated the Elo system via closed pool ratings inflation.
Short summary: said prisoner only played against other prisoners, who he'd trained. Due to careful scheduling of the games, he rose from his true strength (probably sub-master) to being the second-highest rated player in the U.S. in 1996.