Turning Data Science Into a Spectator 'Sport'
vu1986 writes "Kaggle has a 'predictive-modeling competition platform that makes public the competitors in invite-only private competitions. Think of it like watching a major tournament in golf or tennis, where you can watch the best in the world shoot it out to see whose algorithms are king. Kaggle's tagline is "We're making data science a sport." Maybe now it can make data science a spectator sport.'"
Well, according to that model, soon enough you'll start seeing Slashdot comments scrolling across the bottom of the screen on ESPN 7.
For over a year now I've been working on the Heritage Health Prize that Kaggle is running. It's a fantastic way to learn data science and tackle real-world problems with real data and a co-op-etitive spirit. The forums and winning solutions are great for learning the art, and if you've never used R, it's a great opportunity to learn it and talk to people who have a ton of experience in the area.
I've entered a couple of Kaggle competitions, but I'm kinda put off by the opaque results.
After the first one ended (predict HIV progression), the released full dataset indicated that the data had been sorted before it was separated into train and test sets. IOW, after being sorted by length, all the short sequences were put into the training set, and the longer ones into the test set. This mistake may have invalidated the competition, and I strongly suspect it would have invalidated any paper written about the results.
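To make the problem concrete, here is a minimal Python sketch of the difference between cutting a length-sorted list and shuffling before the cut. The sequences are made up (lengths 1 to 100), and the names split_sorted, split_shuffled and mean_len are hypothetical helpers for illustration, not anything from Kaggle or the competition's actual pipeline.

```python
import random

def split_sorted(seqs, n_train):
    """Flawed split: sort by length, then cut, so short sequences land in train."""
    ordered = sorted(seqs, key=len)
    return ordered[:n_train], ordered[n_train:]

def split_shuffled(seqs, n_train, seed=0):
    """Shuffle first, so length has no bearing on which set a sequence lands in."""
    shuffled = list(seqs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

def mean_len(seqs):
    """Average sequence length, used here as a crude check for distribution shift."""
    return sum(len(s) for s in seqs) / len(seqs)

# Made-up sequences with lengths 1..100 (an assumption, not the competition data).
seqs = ["A" * n for n in range(1, 101)]

bad_train, bad_test = split_sorted(seqs, 50)
ok_train, ok_test = split_shuffled(seqs, 50)

print("sorted split:  ", mean_len(bad_train), mean_len(bad_test))  # 25.5 75.5
print("shuffled split:", mean_len(ok_train), mean_len(ok_test))
```

With the sorted split, every training sequence is shorter than every test sequence, so a model is scored only on inputs unlike anything it was trained on. Comparing a simple summary statistic across the two sets, as mean_len does here, is a cheap way to spot that kind of mismatch once the full data is available.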
More recently, the organizers of one competition stated flatly in the forums that they would release the entire data set once the competition had ended, but then didn't. I inquired about this, and a Kaggle data scientist replied saying "we almost never release the test data".
I'm not sure that Kaggle is all that scientific. If the full dataset can't be examined after the competitions close, there's no way to verify the results.