Chess Ratings — Move Over Elo
databuff writes "Less than 24 hours ago, Jeff Sonas, the creator of the Chessmetrics rating system, launched a competition to find a chess rating algorithm that performs better than the official Elo rating system. The competition requires entrants to build their rating systems based on the results of more than 65,000 historical chess games. Entrants then test their algorithms by predicting the results of another 7,809 games. Already three teams have managed create systems that make more accurate predictions than the official Elo approach. It's not a surprise that Elo has been outdone — after all, the system was invented half a century ago before we could easily crunch large amounts of historical data. However, it is a big surprise that Elo has been bettered so quickly!"
Looking at the table, the differences in predictive power are small enough that it's not obvious they aren't due to chance alone; there needs to be some calculation that shows that the differences are meaningful validating the claim that the alternative methods actually extract more information than Elo does. Perhaps there is enough inherent randomness in Chess that even simple predictive models can extract most of the systematics so that what remains after Elo is mostly noise?
Are the better entries as transparent? ELO's a pretty simple way do do this - add or subtract a few points from the rating based on a win or a loss based on the relative difference of the ratings. Would anyone understand (other than "It's a neural net") the ratings produced by these competitors? Would anything human be able to calculate them?
Also, are the new models' improvements in prediction statistically relevant? Or are they just fitting the noise? Both the training dataset and the test dataset seem rather small to me.
Finally, and most importantly, how stable are the ratings? If I'm drunk and lose to a "patzer", do I go down to his level? Fairness of tournaments having small numbers of games has a lot to do with rating stability (unless we're assuming a population periodically beset by huge random shifts in ability).
All-in-all, there's a lot of problems coming up with a good rating system. Opening the dataset to the world, saying "Have at it!", and looking at a single scorecard based solely on predictability is nowhere near sufficient.
That is all.
Much as my lower UID implies this comment should be more valuable than your high UID comment.
I used to think of myself as having a particularly high UID... until I realized that mine is actually lower than a majority of the total UIDs. Weirded me out a little. There are UIDs that are farther from the 1,000,000 mark than I am from Taco.
William of Ockham had no beard. The most likely explanation is that it was chewed off by squirrels every morning.
Winning with only a king and a bishop remaining is no "better" than winning with all your pieces remaining. A win is a win. That said, winning a game while having many more pieces remaining than one's opponent may imply that the difference between your skill and your opponent's is greater than if you won with only a kind and bishop left. There may be some merit to working that into an algorithm if the goal is to predict the outcome of future matches.
Another data point that might be valuable is simply "how many moves did the game take before checkmate"? Without any other knowledge, the guy who beats me in 10 moves is likely to be a better player than the guy who takes 50 moves to beat me.