Augmenting Data Beats Better Algorithms

Posted by kdawson on Tuesday April 1, 2008 @07:10AM from the tell-it-to-the-dhs dept.

eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"

Depends on the Problem by roadkill_cr · 2008-04-01 07:14 · Score: 4, Insightful

I think it heavily depends on what kind of data you're mining.

I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.

Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.
1. Re:Depends on the Problem by teh moges · 2008-04-01 10:01 · Score: 4, Insightful
  
  Think less in sheer numbers and more in density. If there are 200 million possible 'combinations' (say, 50,000 customers and 4000 movies in a Netflix-like situation), then with 10 million data samples, we only have 5% of the possible data. This means that if we are predicting inside the data scope, we are predicting into an unknown field that is 19 times larger than the known.
  Say we were looking at 100 million samples: suddenly we have 50% of the possible data, and our unknown field is the same size as the known field. Much more likely to get a result then.
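
The density arithmetic above can be checked with a minimal sketch. The function name `coverage` and its parameters are made up for illustration, and it assumes every customer/movie pair is rated at most once:

```python
def coverage(n_users, n_items, n_ratings):
    """Return (density, unknown_to_known_ratio) for a user-by-item rating grid.

    Assumption for illustration: each (user, item) pair is rated at most once,
    so n_ratings counts distinct known cells.
    """
    total = n_users * n_items  # every possible customer/movie combination
    if not 0 < n_ratings <= total:
        raise ValueError("n_ratings must be between 1 and n_users * n_items")
    density = n_ratings / total  # fraction of cells that are known
    unknown_to_known = (total - n_ratings) / n_ratings
    return density, unknown_to_known


# The two scenarios from the comment: 50,000 customers and 4000 movies.
sparse = coverage(50_000, 4_000, 10_000_000)
dense = coverage(50_000, 4_000, 100_000_000)
print(sparse)
print(dense)
```

With 10 million samples the unknown part of the grid is 19 times the known part; with 100 million the two are the same size.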
Um, Yes? by randyest · 2008-04-01 07:15 · Score: 4, Insightful

Of course. Why wouldn't more (or better) relevant data that applies on a case-by-case basis provide more improved results than an "improved algorithm" (what does that mean, really?) that applies generally and globally?

I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.

--
everything in moderation
Hold on a sec... by peacefinder · 2008-04-01 07:23 · Score: 4, Funny

"What do you think? Will more data usually perform better than a better algorithm?"

I need more data.

--
With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
Five stars by CopaceticOpus · 2008-04-01 07:24 · Score: 5, Insightful

If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.

How to translate the entire experience of watching a movie into a lone number is a separate issue.
Re:attn computer scientists: stop renaming stuff by Anonymous Coward · 2008-04-01 07:25 · Score: 5, Funny

you guys are nothing more than glorified engineers. Computer scientists are not glorified engineers. They're the butt of engineers' jokes too.
Re:Is it just me that is surprised here? by gnick · 2008-04-01 07:27 · Score: 5, Informative

The netflix challenge is to arrive at a better algorithm with the supplied data. Actually, the rules explicitly allow supplementing the data set and Netflix points out that they explore external data sets as well.

--
He's getting rather old, but he's a good mouse.
Re:Heuristics?? by EvanED · 2008-04-01 07:33 · Score: 5, Informative

One would hope that the thing that calculates the heuristic is an algorithm. See Wikipedia.
Re:attn computer scientists: stop renaming stuff by Freeside1 · 2008-04-01 07:45 · Score: 5, Funny

Say what you want about computer scientists, but without them you'd probably be complaining on a chalkboard.
Re:attn computer scientists: stop renaming stuff by jank1887 · 2008-04-01 07:45 · Score: 4, Funny

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics - CliffsNotes edition.
Re:attn computer scientists: stop renaming stuff by JasonKChapman · 2008-04-01 08:02 · Score: 5, Funny

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics without tenure.

--
Sorry, I'm a writer. That makes you raw material.
Re:attn computer scientists: stop renaming stuff by Arthur B. · 2008-04-01 08:50 · Score: 5, Funny

"machine learning" is just statistical inference

Riiight. And mathematical research is just finding a Hamiltonian cycle in a graph defined by the set of axioms used.

--
\u262D = \u5350
This does not mean what I think you think it means by aibob · 2008-04-01 09:09 · Score: 4, Informative

I am a graduate student in computer science, emphasizing the use of machine learning.

The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.

The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.

But that does not mean that the algorithm does not matter: for the problems I work on, using logistic regression or support vector machines outperforms naive Bayes by 10% to 30%, which is huge. So if you want good performance, you try a few different algorithms to see what works.

Adding more training data does not always help either, if the distributions of the data are significantly different. You are much better off using the data to design better features which represent/summarize the data.

In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.