Slashdot Mirror


Augmenting Data Beats Better Algorithms

eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"

4 of 179 comments

  1. Depends on the Problem by roadkill_cr · · Score: 4, Insightful

    I think it heavily depends on what kind of data you're mining.

    I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.

    Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is, generally the better a fit you can make with your algorithm.

    1. Re:Depends on the Problem by teh+moges · · Score: 4, Insightful

      Think less in sheer numbers and more in density. If there are 200 million possible 'combinations' (say, 50,000 customers and 4000 movies in a Netflix-like situation), then with 10 million data samples, we only have 5% of the possible data. This means that if we are predicting inside the data scope, we are predicting into an unknown field that is 19 times larger than the known.
      Say we were looking at 100 million samples instead. Suddenly we have 50% of the possible data, and our unknown field is the same size as the known field. Much more likely to get a result then.
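
The density arithmetic above can be checked with a short sketch. The function name `coverage` and its parameters are made up for illustration, and it assumes every customer/movie pair counts as one possible data point, as in the comment:

```python
def coverage(n_customers, n_movies, n_ratings):
    """Return (total cells, fraction known, unknown-to-known ratio)."""
    total = n_customers * n_movies          # every possible customer/movie pair
    known = n_ratings / total               # share of pairs we have a rating for
    unknown_to_known = (total - n_ratings) / n_ratings
    return total, known, unknown_to_known


# The two scenarios from the comment: 10 million and 100 million samples.
for n_ratings in (10_000_000, 100_000_000):
    total, known, ratio = coverage(50_000, 4_000, n_ratings)
    print(f"{total:,} cells, {known:.0%} known, unknown/known = {ratio:g}")

# prints:
# 200,000,000 cells, 5% known, unknown/known = 19
# 200,000,000 cells, 50% known, unknown/known = 1
```

The ratio in the last column is what shrinks as density rises. The raw sample count alone says nothing until it is divided by the size of the grid.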

  2. Um, Yes? by randyest · · Score: 4, Insightful

    Of course. Why wouldn't more (or better) relevant data that applies on a case-by-case basis provide better results than an "improved algorithm" (what does that mean, really?) that applies generally and globally?

    I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.

    --
    everything in moderation

  3. Five stars by CopaceticOpus · · Score: 5, Insightful

    If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4-star category, and the nuances of the data are lost.

    How to translate the entire experience of watching a movie into a lone number is a separate issue.