Augmenting Data Beats Better Algorithms
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
I think it heavily depends on what you're kind of data your mining.
I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.
Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.
In problems like minimizing lateness et. al. "better" can be simply defined as "closer to optimal" or "fewer time units late."
Here, better means different things to different people. The more data you have gives you a larger set of people, and probably a more accurate definition of better for a larger set of people. I'm not sure you can really compare the two.
Of course. Why wouldn't more (or bettter) relevant data that applies on a case-y-case basis provide more improved results than a "improved algorithm" (what does that mean, really?) that applied generally and globally?
I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.
everything in moderation
This reminds me of those articles who say that the amount of data humanity has archived is so much data that nobody could possibly use it in a lifetime. I think what people fail to remember is this: the point is to have available data just-in-case you need to reference it in the future. Nobody watches security tapes in full. The review the day or hour that the robbery occured. Does that mean we should stop recording everything? No. Let's keep archiving.
Combine that with the speed at which computers are getting more efficient - and I see no reason to just keep piling up this crap. More is always better. (More efficient might be better- but add the two together, and you're unstoppable)
Belief? Hope? Preference?The Existential Vortex
I can see that more data (especially more varied data) could be better than a tweaked algorithm. Especially in machine learning, I see many people publish papers on a new method that does 1% better than preexisting methods.
Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.
Obviously, there will be a big jump when going from a simplistic method (say linear regression) to a more sophisticated method (say SVM's). But going from one type of SVM to another slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.
Better data is probably most important and having more data makes having better data more likely. It would probably make sense to analyse the impact of each datum on the accuracy of the ruslt, then choose a better algorithm using the most influential data. That is, a simple algorithm on good data is better than a great algorithm on mediocre data.
Great minds think alike; fools seldom differ.
And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.
Dewey, what part of this looks like authorities should be involved?
"What do you think? Will more data usually perform better than a better algorithm?"
I need more data.
With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.
How to translate the entire experience of watching a movie into a lone number is a separate issue.
He's getting rather old, but he's a good mouse.
You could also make an elaborate algorithm that uses user age, sex & location
Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!
My work here is dung.
One would hope that the thing that calculates the heuristic is an algorithm. See wikipedia.
Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.
The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.
For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
A machine with swap enabled will always have more throughput than a machine without. It's a better use of the resources available. However, replace that swap space with the same amount of RAM, and of course that will be even better. Some use this as an argument against swap space, but it's not a fair comparison, since you can enable swap space in the RAM increased machine and increase throughput even more.
So when I think of this recommendation system, a better algorithm is like having swap space enabled. It's a more sophisticated use of the data you have. Having more data is like having more RAM. And of course the best option is to have more reference data and a better algorithm. It's not an exclusive disjunction, and it's silly to think it has to be.
Say what you want about computer scientists, but without them you'd probably be complaining on a chalkboard.
Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics - CliffsNotes edition.
What noobery. You're confusing the "what" with the "how". Finding eigenvalues is part of a particular page rank algorithm. It's not THE page rank algorithm. Likewise, statistical inference is part of particular "machine learning" systems. It's not THE system. Using statistical inference alone will give you crude (albeit good, with enough training data) baselines to work from in some applications such as automatic text translation, but you'll need more than that to overcome issues like data sparseness, etc.
I know anonymous cowards like playing expert, but there's a reason why you're the butt of so many jokes here -- only thing you're usually expert in is misinformation and disingenuity.
I like basketball!!1!
I have written two recommendations systems and have taken a crack at the Netflix prize (but have been hard pressed to make time for the serious work.)
The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.
First, the data is bogus. The preferences are "aggregates" of rental behaviors, whole families are represented by single accounts. Little 16 year old Tod, likes different movies than his 40 year old dad. Not to mention his toddler sibling and mother. A single account may have Winnie the Pooh and Kill Bill. Obviously, you can't say that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed by this, it could be that parents of young children want the thrill of vicarious killing, but I digress)
The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.
Recommendation systems tend to like a lot of data, but not what you think. People will say, if you need more data, why just have 1-5 and not 1-10? Well, that really isn't much more added data it is just greater granularity of the same data. Think of it like "color depth" vs "resolution" on a video monitor.
My last point about recommendations is that people have moods are are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine and numbers are higher. The next day could have had a crappy day and had to deal with it sober, the numbers are different.
You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.
That being said, the data is fun to work with!!
Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics without tenure.
Sorry, I'm a writer. That makes you raw material.
Two things. The first is that it is tritely obvious that adding more data improves your results. But there are two possible mechanisms at work. On the one hand add more of the same data ie. just make your original database larger with more entries. That form of augmentation will hopefully give you more insight into the underlying distribution of the data. On the other hand you can augment the existing data. In the latter you are really adding extra dimensions/features/attributes to the data set. That's what seems to be alluded to in the article i.e. the students are adding extra features to the original data set. The success of the technique is a trivial result which depends very much on whether the features you add are discriminating or not. In this case, the IMDB presumably added discriminating features. However, if it had not, then "improved algorithms" would have had the upper hand.
The second thing about the claim seems to be that there is always additional information actually available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business additional data is often just not available (physically or for cost reasons). Consequently, improving your algorithms is all you can do. Similarly in academia (say a computer science department) the assumption is often that you are trying to improve your algorithms while assuming that you have all the data available.
"Consensus" in science is _always_ a political construct.
And nonlinear dimensionality reduction is just nonconvex trace optimization coupled with kernel principal component analysis (fine, call it "singular value decomposition") using Mercer's theorem to map the resulting dot product through a kernel function (usually represented as a Hermitian positive semidefinite Gram matrix), yielding an inner product space of higher (possibly infinite) dimensionality in which the original problem is linearly separable.
Now take this description and write an algorithm that performs it efficiently. And you use PageRank as an example, so let's call "efficient" "performs as well as Google on the entire web's worth of data".
If you can't do this, perhaps you should reconsider your view of computer scientists. There's no reason whatsoever to play up the boundaries between two very related fields. Arbitrary boundaries in knowledge are already bad enough; they need to be knocked down, not reinforced.
Riiiht. And mathematical research is just finding a Hamiltonian cycle in a graph defined by the set of axioms used.
\u262D = \u5350
I am a graduate student in computer science, emphasizing the use of machine learning.
The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.
The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.
But that does not mean that the algorithm does not matter - for the problems I work on, using logistic regression or support vector machines outperforms naive bayes by 10% - 30%, which is huge. So if you want good performance, you try a few different algorithms to see what works.
Adding more training data does not always help either, if the distributions of the data are significantly different. You are much better off using the data to design better features which represent/summarize the data.
In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.
Without mathematics, chemistry and physics would be boring. Without chemistry and physics engineering would be impossible. Without engineering, computer science would be useless. Without computer science, today's best designers would be bored. Without today's best designers, many questions of logic would go unpondered. Logic is rooted in mathematics.
:)
I think we all need each other, folks
Eek!
I tend to agree that augmenting data helps improve the model if the model is not yet overwhelmed with data, but you have to have a decent model to begin with or it won't work. Additionally, the payoff of additional data added to the model is a diminishing return as the amount of data available begins to overwhelm any given model. In other words, the more data you collect and put into your model, the more expensive, time consuming, and difficult it becomes to continue to rely on the original model.
In linear regression models for forecasting there is what's known as a "variable inflation factor". This factor helps a statistician know when their linear regression model is beginning to perform poorly when too much data is in the equation because different variables (containing different, but inter-related data) will eventually begin to conflict with one another.
For the Netflix thing, this could show up as a problem if the model is trying to recommend which movie you should rent next based on actors/actresses in previous movies you've watched, which movies you rated higher than others, which genres those highly rated movies were in, which actors/actresses you had rated highly, and which movies those highly rated actors/actresses had been in that you hadn't seen yet. It's quite likely that someone like Kevin Bacon has been in some romantic comedy with another one of your favorite actors or actresses, but you absolutely hate horror movies and he's in a "horror" film with that same actor or actress. The recommendation model would likely try to recommend a movie to you based on three positives (a favorite film and two separate favorite actors) because there's only one negative in the equation. (your hatred for horror movies) This is a very simplistic example, but that's the problem of too much data with too simplistic of an algorithm. A linear regression might have this problem, but if one were to build in an additional bit of algorithm magic that made sure horror movies were "filtered out" or severely punished for being in the horror genre before looking for other factors like favorite actors/actresses in movies then the algorithm would perform better. But then, of course, additional types of data would be needed to adequately "fill in the gaps" for the new monster algorithm that you've created.
Aren't these heuristics and not algorithms?
Lets not be overly pedantic: a heuristic is a type of algorithm, in casual speech.
sic transit gloria mundi
It's pretty simple: If you have random noise your algorithm can be as good as you want - you still get no useful information out of it. On the other hand, if the "more data" actually contains additional information, your entropy goes up and with a given algorithm you get better results. Bent to the extreme you just get the desired output as additional information and you can reduce your algorithm to just print it (should be O(1)).
In this particular case I think that the distinction is important. Saying that something is a better algorithm doesn't imply that it gives a better result(s) as all correct results are semantically the same. Algorithms are ranked on their resource usage. Heuristics are ranked on the perceived goodness of their results. Algorithms must have the same correct results by definition.
Algorithms must have the same correct results by definition.
Since we are obviously talking about the "goodness" of the results produced by the algorithm, I think it's pretty safe to assume that the broader definition of "algorithm" is being used.
sic transit gloria mundi
Lets not be overly pedantic: a heuristic is a type of algorithm, in casual speech.
"In casual speech"? That's just wrong... a heuristic is a type of algorithm, period. (Assuming it meets the other requirements of being an algorithm, such as termination.) That it doesn't produce an optimal result doesn't enter into it. [In this post I say "doesn't produce" as a shorthand for "isn't guaranteed to produce".]
CS theorists talk about randomized algorithms. They don't produce an optimal result. CS theorists talk about online algorithms. They don't produce an optimal result. CS theorists talk about approximation algorithms. They don't produce an optimal result.
Producing an optimal result isn't a requirement of being an algorithm. Heuristics are just algorithms that tend to produce useful results most of the time. In fact, Wikipedia page for the CS notion of a heuristic is called "heuristic algorithm."
Algorithms are ranked on their resource usage.
Not always. Approximation algorithms are often ranked on their accuracy. Online algorithms are often ranked on something called the competitive ratio. Randomized algorithms are usually ranked on their resource uses, but all three of these needn't be optimal (in the context of an optimization problem) -- or produce correct results (in the context of a decision problem).
Algorithms must have the same correct results by definition.
[citation needed]
Adapted from a joke I saw on Jester the other day:
A physicist, a computer scientist and a mathematician are sharing a hotel room. It must have bad wiring or something.
Late at night when they're all asleep a small fire starts in the room. The smell of smoke wakes the physicist. He gets up, notices the fire and looking round the room, sees a bucket and a sink. He calculates how much water will be required, fills the bucket with precisely that much, douses the flames and goes back to bed.
A little later, another small fire starts. This time the smell of smokes wakes the computer scientist. He wakes up and sees the flames. He looks around and sees the bucket and the sink. He reasons that calculating the quantity of water required would take at least as long as filling the bucket, so he fills it right up, douses the flames and goes back to bed.
Again there is a fire. This time the mathematician smells the smoke and wakes up. He sees the flames, sees the bucket and the sink. He exclaims "there is a solution!" and goes back to bed.
Chernobyl 'not a wildlife haven' - BBC News