Chess Ratings — Move Over Elo
databuff writes "Less than 24 hours ago, Jeff Sonas, the creator of the Chessmetrics rating system, launched a competition to find a chess rating algorithm that performs better than the official Elo rating system. The competition requires entrants to build their rating systems based on the results of more than 65,000 historical chess games. Entrants then test their algorithms by predicting the results of another 7,809 games. Already three teams have managed to create systems that make more accurate predictions than the official Elo approach. It's not a surprise that Elo has been outdone; after all, the system was invented half a century ago, before we could easily crunch large amounts of historical data. However, it is a big surprise that Elo has been bettered so quickly!"
However, it is a big surprise that Elo has been bettered done so quickly!
Absolutely. I can almost guarantee no one thought that Elo would have been bettered done so quickly.
William of Ockham had no beard. The most likely explanation is that it was chewed off by squirrels every morning.
However, it is a big surprise that Elo has been bettered done so quickly!"
*facepalm*
Not that they'd use it, but it certainly couldn't hurt.
If brevity is the soul of wit, then how does one explain Twitter?
Elo-L
Not really. Jeff Sagarin has had two systems of rating sports teams for a while now. One, ELO_CHESS, is based purely on win-loss, while the other, PURE POINTS, takes into account margin of victory. According to him, the latter is better at predicting future results. From his analysis:
Already three teams have managed to create systems that make more accurate predictions than the official Elo approach.
1 EdR* 0.729125
2 whiteknight* 0.731656
3 Elo Benchmark* 0.738107 <-- The "official Elo approach"
Maybe we're counting from zero and they forgot to put it on the leaderboard?
[Fuck Beta]
o0t!
I can't think of anything other than 70's cheese and largest white afro up until the release of Bobobo-bo Bo-bobo.
Less than 24 hours ago, the readers of Slashdot launched a competition to find an editing algorithm that performs better than the official "editors" of the site. The competition requires entrants to build their comment systems based on the results of over 9,000 historical submissions. Entrants then test their algorithms by predicting the results of the next 7,809 dup^H^H^Hstories. Already three teams have managed to create systems that make more accurate predictions than the official /. approach. It's not a surprise that Timothy has been outdone -- after all, he was invented half a century ago before English had been standardized. However, it is no big surprise that Slashdot has been bettered done so quickly! The winner: Texas Instruments!
Haida Manga
Organized crime members linked to gambling rackets have been indicted for kidnapping a busload of nerds after they refused to program similar algorithms in exchange for Warcraft game time and photoshopped Natalie Portman porn.
We all know that's not true though. They totally would have done it.
Looking at the table, the differences in predictive power are small enough that it's not obvious they aren't due to chance alone; there needs to be some calculation showing that the differences are meaningful, validating the claim that the alternative methods actually extract more information than Elo does. Perhaps there is enough inherent randomness in chess that even simple predictive models can extract most of the systematics, so that what remains after Elo is mostly noise?
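One common way to check whether a gap like that could be chance is a paired bootstrap over the test games. The sketch below is only an illustration of that idea, not anything the contest itself is stated to do; the function name, the resample count, and the 95% level are all assumptions.

```python
import math
import random

def paired_bootstrap_rmse_diff(pred_a, pred_b, actual, n_resamples=2000, seed=0):
    """Return a rough 95% percentile interval for RMSE(a) - RMSE(b).

    The same resampled games are used for both systems, so the
    comparison stays paired. All parameters here are illustrative.
    """
    rng = random.Random(seed)
    n = len(actual)
    # Per-game squared errors for each system.
    sq_a = [(p - y) ** 2 for p, y in zip(pred_a, actual)]
    sq_b = [(p - y) ** 2 for p, y in zip(pred_b, actual)]
    diffs = []
    for _ in range(n_resamples):
        # Draw n games with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        rmse_a = math.sqrt(sum(sq_a[i] for i in idx) / n)
        rmse_b = math.sqrt(sum(sq_b[i] for i in idx) / n)
        diffs.append(rmse_a - rmse_b)
    diffs.sort()
    lo = diffs[int(0.025 * n_resamples)]
    hi = diffs[int(0.975 * n_resamples) - 1]
    return lo, hi
```

If the returned interval comfortably contains zero, the data at hand cannot separate the two systems; if it sits entirely below zero, system `a` looks better on this test set.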
That number is the "Root Mean Square Error", so lower is better.
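For anyone unfamiliar with the metric, here is a minimal sketch of a root mean square error over predicted game scores. It works per game on made-up numbers; how the contest actually aggregates games before scoring is not described here, so treat that part as an assumption.

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and actual scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical data: White's predicted score vs. the actual result
# (1 = White win, 0.5 = draw, 0 = Black win).
predicted = [0.7, 0.5, 0.4]
actual = [1.0, 0.5, 0.0]
print(rmse(predicted, actual))
```

A perfect predictor would score 0, which is why the smaller numbers on the leaderboard are the better ones.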
Well, everyone knows that arena is serious business.
Indeed, Sagarin has shown that applying Elo in sports where the winner is based on points scored is not optimal, since the average margin of victory is a better predictor of strength than won-loss record. But this has nothing to do with applying the Elo method to its original setting of chess, where the outcome of the game is only "win/draw/loss" and there is no margin of victory.
Are the better entries as transparent? ELO's a pretty simple way to do this - add or subtract a few points from the rating after a win or a loss, depending on the difference between the two ratings. Would anyone understand (other than "It's a neural net") the ratings produced by these competitors? Would anything human be able to calculate them?
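To show how simple that update is, here is a minimal sketch of the usual logistic form of Elo. The K-factor of 20 is just an illustrative assumption (federations use various values), and the function names are made up for this example.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B (logistic Elo curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update(rating_a, rating_b, score_a, k=20):
    """Return both players' new ratings after one game.

    score_a is 1 for a win by A, 0.5 for a draw, 0 for a loss.
    k is an assumed K-factor, not an official value.
    """
    exp_a = expected_score(rating_a, rating_b)
    change = k * (score_a - exp_a)
    # What A gains, B loses, and vice versa.
    return rating_a + change, rating_b - change

# Two equally rated players; A wins.
print(update(1500, 1500, 1))
```

Because the change is capped at K points per game, a single upset loss moves a rating only a little; with the assumed K of 20, a 2000-rated player losing to a 1200-rated one drops by just under 20 points.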
Also, are the new models' improvements in prediction statistically significant? Or are they just fitting the noise? Both the training dataset and the test dataset seem rather small to me.
Finally, and most importantly, how stable are the ratings? If I'm drunk and lose to a "patzer", do I go down to his level? Fairness of tournaments having small numbers of games has a lot to do with rating stability (unless we're assuming a population periodically beset by huge random shifts in ability).
All-in-all, there's a lot of problems coming up with a good rating system. Opening the dataset to the world, saying "Have at it!", and looking at a single scorecard based solely on predictability is nowhere near sufficient.
That is all.
Since the Elo system is not designed to predict future performance (it's designed to capture current relative rankings), then is it really surprising that programs designed to predict future performance are better at it?
I don't think so. The time I'd spend on this project is worth a bit more than $50...
Does Timothy even glance at the stories he approves or is it pure pin the tail on the donkey?
Timothy's e-mail address is timothy@monkey.org according to his home page. Tired of the half-assed submissions where he couldn't bother to read it over before submitting it for millions to read? Send him an e-mail.
After all, it's not like other ideas haven't already been created in the meantime to address Elo's perceived shortcomings, right?
Timothy bettered done goofed
Ah man, no matter how inadequate the Elo system may be for chess, it's much worse seeing it applied to other games where it doesn't belong, which happens regrettably often. The trouble is that the Elo system depends on the premise that nothing affects the outcome of a game other than the skill of each player (and who gets the white pieces).
In chess, that assumption is a pretty good approximation to reality, since every tournament game is run the same way. But many games do have variations in rules or format across different events, such as different maps or races in a real-time strategy game, or different card pools in Magic: The Gathering. Then Elo ratings are biased by how often a player has the chance to play to his strong areas. Players in turn are compelled to game the system: "I should avoid this event because they're using Format X and my rating will stay stronger if I stick to Format Y." The Elo system is meant precisely to obviate that kind of gamesmanship: chess players should need to think only about the strengths of their opponents, which (in principle) will be weighted fairly when calculating rating adjustments. But if there are other competitive factors, which is true for most any popular game invented in the last 30 years, Elo ratings become that much less meaningful.
"This algorithm runs in constant time. Come on, 2,147,483,648 is a constant..."
Three teams done bettered Elo with betterer done algorithms, and the submitter is surprised that it was bettered done so quickly. I'm done. Was that better?
He sounds like Lady Macbeth on crack.
Man, I was like WTF? Cheese ratings? Got confused by seeing the Pac-Man icon.
by TheSpoom (715771) Uncaring Linux user here. I have nothing to add to this but please continue. *munches popcorn*
I believe the algorithm used by Microsoft to match players for X-Box games was already beating Elo before this competition. They have a description of their algorithm at http://research.microsoft.com/en-us/projects/trueskill/
by Beethoven
It looks to me like the data-set is rather small and so are the differences in the results. I don't see a clear winner yet by any means.
Steven, 2,156 Elo at my best.
Fuck me, I'd forgotten what a pile of shite Deacon Park South Texas were. Thanks a bastarding bunch for reminding me, you heiferflap.
WTF? Is this what happens when some late-1980s Scottish bands get mixed up in a transporter with a popular animation series?
:-/
If something that tenuous links to "Real Gone Kid" in your head, you must have some major trauma
"Slashdot - News and Chat Sites Deviant". (Click "homepage" link above for details).
The Glicko chess rating system and its successor Glicko2 (creative, huh?) are better than Elo and have been around for years. Various online chess sites use it, as does the Australian Chess Federation.
Absolutely. I can almost guarantee no one thought that Elo would have been bettered done so quickly.
Is it because elo would have been bettered done so quickly that you came to me?
The problem is not just to find another _method_ to predict game results, but to construct and evaluate a better workable scientific model of chess ability. That's hard, because the criterion 'game result' is possibly not a valid indicator of the quality of play, or of the stability of playing strength over time (that is, reliability). To estimate these criteria, it is necessary to design the data collection, as scientists do in experimental design, for example.
In addition, the available tests of logistic models, like ELO, are not sufficient.
1 Elo Benchmark 0.723834 3 6:03pm, Wednesday 4 August 2010
2 EdROpen 0.729125 2 11:47pm, Tuesday 3 August 2010
3 whiteknightOpen 0.731656 4 2:29am, Wednesday 4 August 2010
Some people die at 25 and aren't buried until 75. -Benjamin Franklin
Not relevant specifically to this story, but I always laugh at the story of how a prisoner manipulated the Elo system via closed pool ratings inflation.
Short summary: said prisoner only played against other prisoners, who he'd trained. Due to careful scheduling of the games, he rose from his true strength (probably sub-master) to being the second-highest rated player in the U.S. in 1996.
Advice: on VPS providers
The problem with this kind of modeling is that many "good fitting" algorithms would, if implemented, change the system itself. There's more to competition chess than just the rules on how to move pieces. For example, while a game in isolation would almost always be played to win, there are many times that because of information from ratings (or due to the method of the tournament) you would start the game being equally happy to draw, which will affect how you play.
Now, even if the difference in the number of pieces remaining (e.g.) is a much better predictor of who will win than the ELO system, if you were ever to actually implement it you would no longer be playing the game the ELO system was trying to track--suddenly you have made players more conservative, not as willing to sacrifice pieces for a better mating position. Possibly some would say you had ruined the game.
When things get complex, multiply by the complex conjugate.
Given that ELO is relatively simple, it is more surprising that more complex algorithms with the benefit of access to a lot of historical data only marginally outperform it. That is, the transparency and simplicity of ELO combined with a relatively accurate outcome is better.
The differences are indeed quite small, but it seems obvious that you should be able to do better than ELO by splitting it into two parts:
Games played as White and games played as Black.
In fact, this seems so obvious that I suspect there's something I have overlooked! :-)
As the contest site mentions, there's a very significant advantage to White, enough so that in their training data set White has 30+% win vs 20+% for Black.
I suggest that taking the normal ELO-predicted outcome and then biasing it according to this known historical trend, would have to result in slightly better predictions than the naked ELO number.
Terje
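The White-bias idea above can be sketched by shifting the rating difference before computing the expected score. This is only one way to do it; the function names and the 0.55 average White score used in the example are assumptions chosen for illustration (roughly what 30% wins, 20% losses and 50% draws would give), not figures from the contest data.

```python
import math

def white_bonus_from_score(avg_white_score):
    """Rating-point bonus that makes equal-rated players match a
    given average White score (must be strictly between 0 and 1)."""
    return 400.0 * math.log10(avg_white_score / (1.0 - avg_white_score))

def expected_white_score(white_rating, black_rating, white_bonus=0.0):
    """Logistic Elo expectation for White, with an optional colour bonus."""
    diff = (white_rating + white_bonus) - black_rating
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

# Hypothetical: White averages 0.55 between equally rated players.
bonus = white_bonus_from_score(0.55)
print(round(bonus, 1))
print(expected_white_score(1500, 1500, bonus))
```

With a bonus of zero this reduces to the plain Elo prediction, so the adjustment can be switched on and off to compare the two on the same test games.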
"almost all programming can be viewed as an exercise in caching"
What we need now is a chess rating system rating system. Then chess rating systems can compete with each other and be rated as to how well they rate chess.
...
Currently I'm ranked 7th (Turn to Stone - yeah I suck), but managed to beat a 6th ranked player after a close match (luckily he lived up to his title), so I should hopefully rank up soon. Can't wait to meet the Mr Wood and Mr Lynne at the ranking ceremony which is held every month at The Ship Inn, Frimley, UK.