Augmenting Data Beats Better Algorithms

← Back to Stories (view on slashdot.org)

Augmenting Data Beats Better Algorithms

Posted by kdawson on Tuesday April 1, 2008 @07:10AM from the tell-it-to-the-dhs dept.

eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"

26 of 179 comments (clear)

Min score:

Reason:

Sort:

Heuristics?? by nategoose · 2008-04-01 07:13 · Score: 1, Insightful

Aren't these heuristics and not algorithms?
1. Re:Heuristics?? by glwtta · 2008-04-01 10:39 · Score: 2, Insightful
  
  Algorithms must have the same correct results by definition.
  
  Since we are obviously talking about the "goodness" of the results produced by the algorithm, I think it's pretty safe to assume that the broader definition of "algorithm" is being used.
  
  --
  sic transit gloria mundi
2. Re:Heuristics?? by EvanED · 2008-04-01 13:59 · Score: 3, Insightful
  
  Algorithms are ranked on their resource usage.
  Not always. Approximation algorithms are often ranked on their accuracy. Online algorithms are often ranked on something called the competitive ratio. Randomized algorithms are usually ranked on their resource uses, but all three of these needn't be optimal (in the context of an optimization problem) -- or produce correct results (in the context of a decision problem).
  
  Algorithms must have the same correct results by definition.
  [citation needed]
Depends on the Problem by roadkill_cr · 2008-04-01 07:14 · Score: 4, Insightful

I think it heavily depends on what you're kind of data your mining.

I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.

Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.
1. Re:Depends on the Problem by Brian+Gordon · 2008-04-01 07:52 · Score: 2, Insightful
  
  It's not always going to be more important. There's really no difference between a sample of 10 million and a sample of 100 million.. at that point it's obviously more effective to put work into improving the algorithm.. but that turning point (again obviously) would come way before 10 million samples of data. It's a balance.
2. Re:Depends on the Problem by RedHelix · 2008-04-01 08:20 · Score: 3, Insightful
  
  Well, yeah, augmenting data can produce more reliable results than better algorithms. If a legion of film buffs went through every single film record on Netflix's database and assigned "recommendable" films to it, then went and looked up the rental history of every Netflix user and assigned them individual recommendations, you would probably end up with a recommendation system that beats any algorithm. The dataset here would be ENORMOUS. But the reason algorithms exist is so that doesn't have to happen. i like turtles
3. Re:Depends on the Problem by roadkill_cr · 2008-04-01 08:30 · Score: 3, Insightful
  
  It's true that you lose some anonymity, but there is so much to gain. To be perfectly honest, I'm completely fine with rating products on Amazon.com and Netflix - I only go to these sites to shop for products and movies, so why not take full advantage of their recommendation system? If I am in consumer mode, I want the salesman to be as competent as possible.
  
  Anyways, if you're paranoid about data on you being used - there's a less well-known field of recommender systems which uses implicit data gathering which can be easily setup on any site. For example, it might say that because you clicked on product X many times today, you're probably in want of it and they can use that data. Of course, implicit data gathering is more faulty than explicit data gathering, but it just goes to show that if you spend time on the internet, websites can always use your data for their own means.
4. Re:Depends on the Problem by teh+moges · 2008-04-01 10:01 · Score: 4, Insightful
  
  Think less in sheer numbers and more in density. If there are 200 million possible 'combinations' (say, 50,000 customers and 4000 movies in a Netflix-like situation), then with 10 million data samples, we only have 5% of the possible data. This means that if we are predicting inside the data scope, we are predicting into an unknown field that is 19 times larger then the known.
  Say we were looking at 100 million fields, suddenly we have 50% of the possible data, and our unknown field is the same size as the known field. Much more likely to get a result then.
5. Re:Depends on the Problem by epine · 2008-04-01 11:52 · Score: 2, Insightful
  
  It seems to be a bad day for science writing. The piece on rowing a galley was a joke. And now we're being told that one data mining problem with a dominant low-hanging return on augmenting data represents a general principle.
  
  The Netflick data shouldn't be regarded as representative of anything. That data set has shockingly low dimensionality. So far as I know, they make no attempt to differentiate what kind of enjoyment the viewer obtained from the movie, or even determine whether the movie was viewed in a solo or group situation. They don't ask "who was your favorite character / actor / actress". Nor do they follow up on aging opinions: "which of these two movies would you presently rate higher?" so the corroboration factor is zero.
  
  I'm pretty fussy about the movies I rent. The worst movie I've endured this year was "Night at the Museum", which was loaned to me. I managed to get through it at the 1.4x speed setting on a slow evening.
  
  As bad as it was, I wouldn't rate it less than a 3. I'd like to save 1 and 2 for outright incompetence. Was "Museum" a manipulative piece of crap? Absolutely. I'd tick that box in a heartbeat. Did I feel personally soiled by Genghis' emotional discharge? I've been showering for days. From what I've read about Genghis, the only way to get him to discharge would have been to lock him in a room with Sacagawea.
  
  If you give "Museum" a three for competence squandered, what do you give Soderberg's "Solaris"? I'm glad I watched it. It was interesting to see what they did with the sets, and to find out whether anything ever happens (spoiler: no). I still recall the intensity of the black woman, though unfortunately her fine acting served no real purpose. While I was happy to rent it, it also earned a place on my list of movies least likely to rent twice.
  
  Really, Netflick deserves five gold stars for having created the least augmented opinion stream since baby spit out his brussel sprouts.
I think better is subjective... by 3p1ph4ny · 2008-04-01 07:14 · Score: 3, Insightful

In problems like minimizing lateness et. al. "better" can be simply defined as "closer to optimal" or "fewer time units late."

Here, better means different things to different people. The more data you have gives you a larger set of people, and probably a more accurate definition of better for a larger set of people. I'm not sure you can really compare the two.
attn computer scientists: stop renaming stuff by Anonymous Coward · 2008-04-01 07:15 · Score: 0, Insightful

"machine learning" is just statistical inference
"page rank algorithm" is just an eigenvalue calculation.

i know you computer scientists like playing mathematician, but there's a reason why you're the butt of mathematicians jokes. because you guys are nothing more than glorified engineers.
1. Re:attn computer scientists: stop renaming stuff by Metasquares · 2008-04-01 08:09 · Score: 3, Insightful
  
  And nonlinear dimensionality reduction is just nonconvex trace optimization coupled with kernel principal component analysis (fine, call it "singular value decomposition") using Mercer's theorem to map the resulting dot product through a kernel function (usually represented as a Hermitian positive semidefinite Gram matrix), yielding an inner product space of higher (possibly infinite) dimensionality in which the original problem is linearly separable.
  
  Now take this description and write an algorithm that performs it efficiently. And you use PageRank as an example, so let's call "efficient" "performs as well as Google on the entire web's worth of data".
  
  If you can't do this, perhaps you should reconsider your view of computer scientists. There's no reason whatsoever to play up the boundaries between two very related fields. Arbitrary boundaries in knowledge are already bad enough; they need to be knocked down, not reinforced.
2. Re:attn computer scientists: stop renaming stuff by egyptiankarim · 2008-04-01 09:18 · Score: 2, Insightful
  
  Without mathematics, chemistry and physics would be boring. Without chemistry and physics engineering would be impossible. Without engineering, computer science would be useless. Without computer science, today's best designers would be bored. Without today's best designers, many questions of logic would go unpondered. Logic is rooted in mathematics.
  
  I think we all need each other, folks :)
  
  --
  Eek!
3. Re:attn computer scientists: stop renaming stuff by Anonymous Coward · 2008-04-01 09:41 · Score: 1, Insightful
  
  Computer engineers are people who design your CPUs, motherboards, hard drives, etc. Computer engineering is a specialization within electrical engineering, even if people constantly abuse the term incorrectly.
  
  If they're not engineers, then what are they?
Um, Yes? by randyest · 2008-04-01 07:15 · Score: 4, Insightful

Of course. Why wouldn't more (or bettter) relevant data that applies on a case-y-case basis provide more improved results than a "improved algorithm" (what does that mean, really?) that applied generally and globally?

I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.

--
everything in moderation
Is it just me that is surprised here? by zappepcs · 2008-04-01 07:16 · Score: 1, Insightful

What do you think? Will more data usually perform better than a better algorithm?" Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

I read the article in question here and can say that I'm surprised that this is even a question.

--
Support NYCountryLawyer RIAA vs People
"Better data" not "more data" by Anonymous Coward · 2008-04-01 07:19 · Score: 1, Insightful

Just having more data to process doesn't produce better results in this sort of field.

Look at the application. Netflix alone VS Netflix+IMDB. The second not only has more data, but it has "better" data in terms of having more human decision inputs applied to it thus weighting the data to produce more correct results.

But if you looked at it this way Netflix 2007 data VS Netflix 2006-2007 data I don't think you would find a significant difference in results. This is the same "type" of data, only more of it, where as the former is a practical example of data fusion.

Char-Lez
More vs Better by Mikkeles · 2008-04-01 07:19 · Score: 3, Insightful

Better data is probably most important and having more data makes having better data more likely. It would probably make sense to analyse the impact of each datum on the accuracy of the ruslt, then choose a better algorithm using the most influential data. That is, a simple algorithm on good data is better than a great algorithm on mediocre data.

--
Great minds think alike; fools seldom differ.
All things being equal... by Just+Some+Guy · 2008-04-01 07:21 · Score: 3, Insightful

One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better nearly as well as the best algorithm on the boards for the $1 million challenge.
And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.

--
Dewey, what part of this looks like authorities should be involved?
Five stars by CopaceticOpus · 2008-04-01 07:24 · Score: 5, Insightful

If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.

How to translate the entire experience of watching a movie into a lone number is a separate issue.
For the Sake of Discussion by eldavojohn · 2008-04-01 07:29 · Score: 3, Insightful

Well, for the sake of discussion I will try to give you an example so that you might pick it apart.
"more data" More data means that you understand directors and actors/actresses often do a lot of the same work. So for every movie that the user likes, you weight their stars they gave it with a name. Then you cross reference movies containing those people using a database (like IMDB). So if your user loved The Sting and Fight Club, they will also love Spy Games which had both Redford & Pitt starring in it.
"better algorithm" If you naively look at the data sets, you can imagine that each user represents a taste set and that high correlations between two movies in several users indicates that a user who has not seen the second movie will most likely enjoy it. So if 1,056 users who saw 12 Monkeys loved Donnie Darko but your user has only seen Donnie Darko, highly recommend them 12 Monkeys.

You could also make an elaborate algorithm that uses user age, sex & location ... or even a novel 'distance' algorithm that determines how far away they are from liking 12 Monkeys based on their highly ranked other movies.

Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!

--
My work here is dung.
1. Re:For the Sake of Discussion by Anonymous Coward · 2008-04-01 09:25 · Score: 1, Insightful
  
  The critical failure of this example points out the flaw in the original premise:
  You used an algorithm to sort out the "data" that you are using. The act of weighting and comparing the different properties of the data you have IS an algorithm. Period.
  
  No algorithm can operate without data, and data is useless without an algorithm to DO something with it.
  
  More data will give a better statistical reading of the data, and help eliminate 'bad data points', so in many cases more data can be better... but that depends on the algorithm used, the quantity of data, etc.
  
  I would suggest the original person simply take a couple courses in computer science. There they will see classic examples of situations in which two algorithms are compared, and how in some situations one will excel in speed with limited data, and others will excel in speed with prolific data.
This is assuming... by jd · 2008-04-01 07:41 · Score: 2, Insightful

...that algorithms and data are, in fact, different animals. Algorithms are simply mapping functions, which can in turn be entirely represented as data. A true algorithm represents a set of statements which, when taken as a collective whole, will always be true. In other words, it's something that is generic, across-the-board. Think object-oriented design - you do not write one class for every variable. Pure data will contain a mix of the generic and the specific, with no trivial way to always identify which is which, or to what degree.
Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.
The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.
For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Recommendations Systems and subjectivity by mlwmohawk · 2008-04-01 07:52 · Score: 3, Insightful

I have written two recommendations systems and have taken a crack at the Netflix prize (but have been hard pressed to make time for the serious work.)

The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.

First, the data is bogus. The preferences are "aggregates" of rental behaviors, whole families are represented by single accounts. Little 16 year old Tod, likes different movies than his 40 year old dad. Not to mention his toddler sibling and mother. A single account may have Winnie the Pooh and Kill Bill. Obviously, you can't say that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed by this, it could be that parents of young children want the thrill of vicarious killing, but I digress)

The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.

Recommendation systems tend to like a lot of data, but not what you think. People will say, if you need more data, why just have 1-5 and not 1-10? Well, that really isn't much more added data it is just greater granularity of the same data. Think of it like "color depth" vs "resolution" on a video monitor.

My last point about recommendations is that people have moods are are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine and numbers are higher. The next day could have had a crappy day and had to deal with it sober, the numbers are different.

You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.

That being said, the data is fun to work with!!
One Trivial Result, One Big Assumption by fygment · 2008-04-01 08:03 · Score: 3, Insightful

Two things. The first is that it is tritely obvious that adding more data improves your results. But there are two possible mechanisms at work. On the one hand add more of the same data ie. just make your original database larger with more entries. That form of augmentation will hopefully give you more insight into the underlying distribution of the data. On the other hand you can augment the existing data. In the latter you are really adding extra dimensions/features/attributes to the data set. That's what seems to be alluded to in the article i.e. the students are adding extra features to the original data set. The success of the technique is a trivial result which depends very much on whether the features you add are discriminating or not. In this case, the IMDB presumably added discriminating features. However, if it had not, then "improved algorithms" would have had the upper hand.

The second thing about the claim seems to be that there is always additional information actually available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business additional data is often just not available (physically or for cost reasons). Consequently, improving your algorithms is all you can do. Similarly in academia (say a computer science department) the assumption is often that you are trying to improve your algorithms while assuming that you have all the data available.

--
"Consensus" in science is _always_ a political construct.
Does nobody know Shannon anymore? by eigengott · 2008-04-01 09:52 · Score: 2, Insightful

It's pretty simple: If you have random noise your algorithm can be as good as you want - you still get no useful information out of it. On the other hand, if the "more data" actually contains additional information, your entropy goes up and with a given algorithm you get better results. Bent to the extreme you just get the desired output as additional information and you can reduce your algorithm to just print it (should be O(1)).