Augmenting Data Beats Better Algorithms
eldavojohn writes "A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better — nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
Aren't these heuristics and not algorithms?
I think it depends heavily on what kind of data you're mining.
I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.
Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.
In problems like minimizing lateness et al., "better" can be simply defined as "closer to optimal" or "fewer time units late."
Here, better means different things to different people. The more data you have gives you a larger set of people, and probably a more accurate definition of better for a larger set of people. I'm not sure you can really compare the two.
Unless the solution has a provably optimal algorithm, more data is always going to beat a better algorithm. Trivial example: The data includes the answers to the question...
That doesn't mean a better algorithm is useless, though. If the data isn't available, you're kinda up a creek.
"machine learning" is just statistical inference
"page rank algorithm" is just an eigenvalue calculation.
i know you computer scientists like playing mathematician, but there's a reason why you're the butt of mathematicians' jokes: you guys are nothing more than glorified engineers.
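For readers who want to see what the "eigenvalue calculation" remark refers to, here is a minimal power-iteration sketch. It is not Google's implementation: the four-page link graph, the damping factor of 0.85, the iteration count, and the function name `pagerank` are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate the dominant eigenvector of a small link graph.

    links maps each page to the list of pages it links to; every
    linked page must also appear as a key. Returns page -> score,
    with the scores summing to 1.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web in which most pages link to "a".
toy_web = {"a": ["b"], "b": ["a"], "c": ["a"], "d": ["a", "b"]}
scores = pagerank(toy_web)
```

Repeatedly applying the link matrix to a score vector is the textbook way to approximate its dominant eigenvector, which is why the ranking can be described as an eigenvalue problem.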
Of course. Why wouldn't more (or better) relevant data that applies on a case-by-case basis provide more improved results than an "improved algorithm" (what does that mean, really?) that applies generally and globally?
I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.
everything in moderation
Yes. Yes it will.
This reminds me of those articles that say the amount of data humanity has archived is so large that nobody could possibly use it in a lifetime. I think what people fail to remember is this: the point is to have the data available just in case you need to reference it in the future. Nobody watches security tapes in full. They review the day or hour that the robbery occurred. Does that mean we should stop recording everything? No. Let's keep archiving.
Combine that with the speed at which computers are getting more efficient, and I see no reason not to just keep piling up this crap. More is always better. (More efficient might be better, but add the two together and you're unstoppable.)
Belief? Hope? Preference?The Existential Vortex
I read the article in question here and can say that I'm surprised that this is even a question.
Support NYCountryLawyer RIAA vs People
I can see that more data (especially more varied data) could be better than a tweaked algorithm. Especially in machine learning, I see many people publish papers on a new method that does 1% better than preexisting methods.
Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.
Obviously, there will be a big jump when going from a simplistic method (say linear regression) to a more sophisticated method (say SVM's). But going from one type of SVM to another slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.
Algorithms are nothing more than efficient representations of data.
Algorithms and data are just the two extreme ends of a continuum.
A piece of pertinent data is worth a thousand (code) lines of speculation.
Strange things are afoot at the Circle-K.
Just having more data to process doesn't produce better results in this sort of field.
Look at the application. Netflix alone VS Netflix+IMDB. The second not only has more data, but it has "better" data in terms of having more human decision inputs applied to it thus weighting the data to produce more correct results.
But if you looked at it this way, Netflix 2007 data VS Netflix 2006-2007 data, I don't think you would find a significant difference in results. This is the same "type" of data, only more of it, whereas the former is a practical example of data fusion.
Char-Lez
Better data is probably most important, and having more data makes having better data more likely. It would probably make sense to analyse the impact of each datum on the accuracy of the result, then choose a better algorithm using the most influential data. That is, a simple algorithm on good data is better than a great algorithm on mediocre data.
Great minds think alike; fools seldom differ.
Is it just me, or wouldn't it make even more sense to use both? It's like asking which would you choose to make a room brighter, the floor lamp or the overhead light? My guess is that both lights together would produce the most brightness.
I think that's the principle behind Metascore (though it seems vague at the moment).
http://www.metascore.org/
Massive amounts of data and massive amounts of recursive algorithms.
And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.
Dewey, what part of this looks like authorities should be involved?
Is it just me, or is it pretty obvious that this all just depends on the algorithm and the data?
Like I could "augment" the data with worthless or misleading data, and get the same or worse results. If I have a huge set of really good and useful data, I can get better results without making my algorithm more advanced. And no matter how advanced my algorithm is, it won't return good results if it doesn't have sufficient data.
When a challenge is put out to improve these algorithms, it's really because these companies are operating with limited and/or bad data. They have to deal with crap data and people trying to game the system. They can't pull data from other sites because they don't own the other sites' data. They can't necessarily track their own customers' searches and compile that because (sometimes) their customers would be outraged at the "invasion of privacy".
"What do you think? Will more data usually perform better than a better algorithm?"
I need more data.
With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.
How to translate the entire experience of watching a movie into a lone number is a separate issue.
You could also make an elaborate algorithm that uses user age, sex & location
Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!
My work here is dung.
I would suggest that one go for both better algorithms AND more/better data.
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest
...the algorithm wasn't 'better' enough.
The last sentence of TFA sums up the non-usefulness of the result: "Of course, you have to be judicious in your choice of the data to add to your data set."
I refer you to the question of training Bayesian data sets for anti-spam: should you classify every single email, or only the ones that are "clearly" well-defined? Without a good algorithm to extract the search terms, the additional data just poisons the data sets, reducing the effectiveness of the filter.
See also any decent physiological study, in which "extraneous" factors are "corrected". Without enough data pruning, you have a correlation like the study that showed that losing weight, and keeping it off, reduces life expectancy. They didn't correct for the terminally ill, who lost weight as a result of their conditions. However, do too much pruning, and you have the controversial Harvard study, which reached the "common sense" conclusion almost at the expense of the data.
For more examples of massaging data using a bad algorithm, see studies that demonstrate a better TCO by going Microsoft.
In short, adding additional data is no guarantee of good results. The students clearly got lucky in finding a similar data set on a well-researched topic, based on an established taxonomy rather than a murky preference rating.
So, the upshot is to look at both approaches and take the best course of action for your needs.
Ruby Neural Evolution of Augmenting Topologies
How do you define a "better" algorithm? Well, a better algorithm is an algorithm that works better in the field. That may seem obvious, but it is not at all. Usually it is not possible to test an algorithm deeply enough until its development is finished. On the other hand you would rather not spend a lot of time developing an algorithm that is not good enough. Hence the quality of algorithms is often deduced from some indicators, like small test samples. Finally, as the general theory improves, the difference in performance between the top-ranking algorithms decreases, and may start to depend quite strongly on the subset of the total population to which they are actually applied. We cannot simply say that "given two algorithms, the best one is the one which performs better on all possible samples;" we should rather say "the best one is the one which performs better on most of the real world samples." You can clearly see how impractical this definition actually is; this is why finding a good ranking algorithm requires constant tuning, as they do at Google. A better algorithm may not be so much better, or may lack generality when tested in the real world. More data always helps.
this post contain no useful information, no need to mod it down
I mean, if we balloon up to 10,000 feet, the problem really is, where do you put the extra data? Do you encode it in an algorithm, or do you have less code but more dynamic data? Given that POV, then, it stands to reason the best place to put the extra data is outside of the code, so that it is easier and less costly to modify.
This is my sig.
In case you haven't heard yet, April Fool's day has been postponed to May.
Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.
The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.
For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
A machine with swap enabled will always have more throughput than a machine without. It's a better use of the resources available. However, replace that swap space with the same amount of RAM, and of course that will be even better. Some use this as an argument against swap space, but it's not a fair comparison, since you can enable swap space in the RAM increased machine and increase throughput even more.
So when I think of this recommendation system, a better algorithm is like having swap space enabled. It's a more sophisticated use of the data you have. Having more data is like having more RAM. And of course the best option is to have more reference data and a better algorithm. It's not an exclusive disjunction, and it's silly to think it has to be.
In the long term, if gamed Data determines hidden features of an algorithm's output, that output will not be completely understood in case it needs to be analyzed.
I've seen this on several systems over the years, where legacy programmers tweak the data just a bit to affect sort order, etc., and it leads to nightmares when you try to actually understand what's really happening in order to replace its functionality.
There's no hard rule but beware, Data has no comments, so you'll never completely understand all the actions of your algorithm.
Google Page Rank probably suffers from this.
It is obvious that both will help. Your first big chunk of augmenting data will help a lot, as will your first few algorithm adjustments. As you go forward, however, you will get smaller and smaller returns for each new tweak to the algorithm and each new set of data. It seems obvious after these results that the best course is BOTH.
Personally, I'm more likely to watch a movie based on genre, producer, director, writers, actors... Especially with plot specifics like era and technology.
The more data you have, the more likely your results are going to be significant. I think we already knew this. ;)
Really though, it's the "design fix" vs the "statistics fix" (or the algorithm fix in this case) and a proper design always beats a crappy design with statistical band aids.
That begs the question... most of us use pseudonyms on here.
How do you know you aren't socializing with closet
Plus, of course, you're sort of stuck with yourself
I showed my boss a salary survey...
Luckily for me he fell for it!
How much is your data worth? Back it up now.
I have written two recommendations systems and have taken a crack at the Netflix prize (but have been hard pressed to make time for the serious work.)
The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.
First, the data is bogus. The preferences are "aggregates" of rental behaviors; whole families are represented by single accounts. Little 16-year-old Tod likes different movies than his 40-year-old dad, not to mention his toddler sibling and mother. A single account may have Winnie the Pooh and Kill Bill. Obviously, you can't say that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed by this; it could be that parents of young children want the thrill of vicarious killing, but I digress.)
The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.
Recommendation systems tend to like a lot of data, but not what you think. People will say, if you need more data, why just have 1-5 and not 1-10? Well, that really isn't much more added data it is just greater granularity of the same data. Think of it like "color depth" vs "resolution" on a video monitor.
My last point about recommendations is that people have moods and are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine, and the numbers are higher. The next day that person could have had a crappy day and had to deal with it sober, and the numbers are different.
You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.
That being said, the data is fun to work with!!
The team with more data performed better, probably because their data allowed them to clearly differentiate between movies using a far more significant dimension than the given ratings-per-movie dimension.
The fundamental idea is to be able to identify clusters of movies, or users (who like a certain type of movie), and the idea of clusters is built on some form of distance. When you add a new dimension to your feature vector, you get a chance to identify groups of entities better, using that dimension. You may do worse as well: a new dimension may blur the lines between groups. Genre looks like a good label for identifying groups of movies. Trying to do the same with more complex methods, using only ratings, is harder.
More data does not necessarily mean you'll do better. It has to allow you to identify differences better; it should either contain or add a dimension with "good" data. It seems team B went directly for generating a more relevant data set for the problem at hand.
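The effect of adding a dimension can be sketched with a toy nearest-neighbour lookup. Everything here is made up for illustration: the three movies, their mean ratings, and the two genre flags are hypothetical values, not Netflix or IMDB data, and plain Euclidean distance is only one possible choice of distance.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(name, features):
    """Return the other movie whose feature vector is closest to `name`."""
    others = (m for m in features if m != name)
    return min(others, key=lambda m: euclidean(features[name], features[m]))


# Hypothetical features using ratings only: (mean rating,)
ratings_only = {
    "kids_cartoon": (4.1,),
    "revenge_thriller": (4.0,),
    "family_musical": (3.6,),
}

# Same movies with two assumed genre flags appended:
# (mean rating, is_family, is_violent)
with_genre = {
    "kids_cartoon": (4.1, 1.0, 0.0),
    "revenge_thriller": (4.0, 0.0, 1.0),
    "family_musical": (3.6, 1.0, 0.0),
}

print(nearest("kids_cartoon", ratings_only))
print(nearest("kids_cartoon", with_genre))
```

With ratings alone the cartoon sits next to the thriller because their mean ratings are close. Once the genre flags are appended, the musical becomes its nearest neighbour. A badly chosen extra column could just as easily pull unrelated movies together, which is the "blur the lines" case mentioned above.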
The quality (accuracy) of the result is a function of how much data you put in and how you operate on it, but entering more data can yield a much greater improvement in the quality of the output than a better algorithm.
"When information is power, privacy is freedom" - Jah-Wren Ryel
Two things. The first is that it is tritely obvious that adding more data improves your results. But there are two possible mechanisms at work. On the one hand, you can add more of the same data, i.e. just make your original database larger with more entries. That form of augmentation will hopefully give you more insight into the underlying distribution of the data. On the other hand, you can augment the existing data. In the latter case you are really adding extra dimensions/features/attributes to the data set. That's what seems to be alluded to in the article, i.e. the students are adding extra features to the original data set. The success of the technique is a trivial result which depends very much on whether the features you add are discriminating or not. In this case, the IMDB presumably added discriminating features. However, if it had not, then "improved algorithms" would have had the upper hand.
The second thing about the claim seems to be that there is always additional information actually available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business additional data is often just not available (physically or for cost reasons). Consequently, improving your algorithms is all you can do. Similarly in academia (say a computer science department) the assumption is often that you are trying to improve your algorithms while assuming that you have all the data available.
"Consensus" in science is _always_ a political construct.
Would you rather know more or be smarter?
Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.
You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.
I wonder if accessing foreign web sites is legal in this competition though, because that definitely alters the complexion of the problem.
To say google succeeded by expanding their data pool is an oversimplification, because not only did they select what they felt was most important, they ignored what they felt was not. Intelligent selection took place to set their initial conditions for their algorithm. So it isn't just data augmentation. It is the ability to augment data relative to a goal, and this is much deeper than just "more data" vs "algorithm". In fact, you can also find situations where algorithms are used to make these intelligent selections, in which case the selection process can be as or more important than just the sheer volume of available data alone.
When I was in college "augmented data" was a tactful way of saying "faked results"
Sent from my ASR33 using ASCII
Slashcrap's threading system threads your comment in the wrong thread.
Secondly, your moronic link would only fool Slashdot moderators.
"I've got mod points".
Who gives a fuck?
In a data mining context, an algorithm extracts, modifies or creates data from an existing data set.
Think of it this way.. algorithm is to verb as data is to noun.
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.
>> Will more data usually perform better than a better algorithm?
of course! more data, more signals
more signals, more clouds
more clouds, more rain
more rain, more marijuana
more marijuana, better performance!
what was the point?
For every problem, there is an optimal solution (okay... there are many optimal solutions, depending on what you are trying to optimize for). If you want to do better than that algorithm, you must break the model. That means that you must either modify the inputs or modify the assumptions of the model. For example, the fastest way to sort arbitrary data that can only be compared pairwise takes O(n*log(n)) time. To do any better, you must break the model by making assumptions about the range and precision of the data. Then you can do it in O(n).
So for the data in Netflix, there is an optimal algorithm. To do better, you must include additional data. This particular problem is interesting because it is nearly impossible to determine what the "optimal" algorithm is since it is based on psychological factors. However, the fact that they are seeking out smart people to figure this out indicates that we are probably pretty close to optimal, so maybe we need to start including more information and changing the model.
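The sorting example above can be made concrete with counting sort. This is a minimal sketch that assumes the keys are non-negative integers no larger than a known bound; the parameter name `max_value` is invented here. Under that assumption the running time is O(n + max_value), and no two elements are ever compared with each other.

```python
def counting_sort(values, max_value):
    """Sort integers in the range 0..max_value without comparing elements."""
    counts = [0] * (max_value + 1)
    for v in values:
        if not 0 <= v <= max_value:
            # The speed-up relies on the range assumption, so enforce it.
            raise ValueError(f"{v} is outside 0..{max_value}")
        counts[v] += 1
    result = []
    for value, count in enumerate(counts):
        result.extend([value] * count)
    return result


print(counting_sort([3, 1, 4, 1, 5], max_value=5))
```

The extra knowledge about the inputs (their range) is what buys the speed-up, which is the same move as bringing in outside data for the recommender.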
I am always looking for more data, from new people, from different countries.
I think I am making up for my slow algorithm in my head, or maybe all this data is slowing me down.
Actually it is making no decisions and having a cloud of maybes instead of deciding what rules I want to live by is the problem.
Be Free: Free Software Tuition
yes this data is useful, but you can't use it in the contest:
http://www.netflix.com/TermsOfUse
see also:
http://www.netflixprize.com/community/viewtopic.php?id=98
http://www.netflixprize.com/community/viewtopic.php?id=20
http://www.netflixprize.com/community/viewtopic.php?id=14
note that this makes sense. more/better data would help ANY decent algorithm. they want a better one, and they're judging you on a baseline. so they'd naturally limit your input options.
http://kered.org
Now there's a simple algorithm that works. And beats even page rank.
I would say that a richer set of (relevant) data would generally generate a better result than an improvement of algorithm. Granted, different statistical models and algorithms do work better on certain kinds of data (there's almost an art to picking a good model).
But, as a past professor of mine was fond of saying, "the best data wins."
No data is enough when you have BAD algorithms...
Lisias.
I've seen a great many cases where developing better algorithms caused better performance (and better algorithms rather than better data, in fact, account for the vast majority of Computer Science research papers out there), so certainly it can't only be better data. Additionally, what about the times when you need a better algorithm to take advantage of the additional data? Or what about when you combine the better algorithm with the better data?
This article is a completely false dichotomy.
Robert Pike- "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming".
If you are still making drastic changes to your basic approach, you can make great gains from a new algorithm. This is true whether you are measuring speed, memory efficiency, or accuracy. But once you are talking about "tweaking" the algorithm, additional data is going to be a better return for effort in a single area.
... although I'd think intuitively that some data sets could be well-represented by predictive algorithms as well. For a model of a general data-set however, I think a good analogy is that of a lossy compressed graphic. A good compression algorithm can certainly reconstruct a decent likeness of the original graphic, depending on the density of the original data and the algorithm used, maybe even a copy that would not be perceived as different by the human eye (or some other measure). But that lossy copy will never be quite as accurate as the original graphic, by definition. I guess another way to look at it would be that any predictive algorithm is, in some sense, statistical. While the statistics may be incredibly accurate, they can never predict the statistical anomalies. This seems pretty fundamentally intuitive to me, but I wouldn't even know where to begin going about proving it ...
comments?
I am a graduate student in computer science, emphasizing the use of machine learning.
The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.
The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.
But that does not mean that the algorithm does not matter - for the problems I work on, using logistic regression or support vector machines outperforms naive bayes by 10% - 30%, which is huge. So if you want good performance, you try a few different algorithms to see what works.
Adding more training data does not always help either, if the distributions of the data are significantly different. You are much better off using the data to design better features which represent/summarize the data.
In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.
From a designer-with-rudimentary-programming-abilities point of view, logic would dictate that having more data would always make sorting easier. I mean, digging through 10,000 results based on one piece of information will always be slower than if we have 2 more sets of data that bring the search pool down to 10. A clever 10 year old could tell you that. Can't this be put down to common sense?
The only way the better algorithm would win out is if the extra data is moot and can't shrink the search pool by any meaningful amount, e.g.:
USEFUL AUGMENTED DATA - George W Bush:
- war monger
- president
- started war in iraq
- kicked out of airforce for coke addiction
- Initials G, W, and B
USELESS AUGMENTED DATA - George W Bush:
- wears shoes
- doesn't often wear hats
- likes turkey
but once again I'm sure everybody knows that.
The question itself is a little like asking a football coach whether a run play is better than a pass play. There's no objective answer; any coach who even answers the question is really expressing an opinion that one or the other is over-valued, rather than that one is just 'better' in all situations.
Same with data vs. algorithms.
One thing I haven't even seen mentioned--which surprises me--is that it's well known that more data will often make an algorithm perform worse. It's not that the data's bad, it simply produces spurious connections or obscures real ones that were apparent in the smaller set. The idea that more data always produces better results is as incorrect as the idea that more training, or more complexity, is better. There's a point not just of diminishing returns, but of negative ones.
Which is the other reason the question makes me scratch my head. The implicit assumption seems to be that the more 'sophisticated' algorithm is inherently 'better'. But algorithms, especially for this sort of problem, are good or bad based on their results; there's no abstract 'superiority' for an algorithm that makes it better for all problems. (See the 'No Free Lunch' theorem.)
As an empirical matter, I wouldn't doubt that computer science students are prone to approach every problem as thinking that if they just program more, they can solve everything. After all, downloading data is not sexy. So the instructor's post is a decent corrective to that. But trying to abstract some rule ('data is better than programming') is not helpful.
Of course, the example above uses IMDB data as said external data set, at which point you run up against the real world...
IMDB is owned by Amazon.com, which is trying to compete in the same business space as Netflix.
Now, it is conceivable that IMDB/Amazon could be induced to $upply their competitor with data to make them a better competitor, but it seems to me that this would be an unlikely alliance.
It's pretty simple: If you have random noise your algorithm can be as good as you want - you still get no useful information out of it. On the other hand, if the "more data" actually contains additional information, your entropy goes up and with a given algorithm you get better results. Bent to the extreme you just get the desired output as additional information and you can reduce your algorithm to just print it (should be O(1)).
It doesn't matter if you have the best algorithm in the world that can calculate how many times you're going to go pee a year and a day from now, you can't forget the first rule of equations:
Garbage in = Garbage out
Period. Therefore, to a point, better data will yield better results than a better algorithm, and this is a very obvious result.
Seriously I'm beginning to think there is some sort of connection between memory space used and CPU time. I might be on to something here, so I'll submit a story to Slashdot when I get some ads up.
I'm pretty sure the best approach (if there are no other constraints) is to use both - more and better data, and the best algorithm, will give the best results won't it? If you have other goals - like getting the best result for a fixed amount of money or time - then you look at compromising.
As a long time NetFlix user I suppose I have contributed to the sample. I order movies for myself, my wife, and my five year old daughter. Good luck trying to profile my buying activity.
director of Google Research, "a large corpus of data can be much more valuable than an efficient algorithm" - Unreliable reference
I think it obviously depends on the quality, and above all the relevance of the data, to the problem you're trying to solve.
____
nico
Nico-Live
The tradeoff between the number of examples used and the complexity of the classification algorithm is already a much-explored branch of machine learning known as Computational Learning Theory or PAC (Probably Approximately Correct) Learning. Bounds on the expected error rate of a system can be given as a function of the complexity of the classifier and the number of examples seen. Typically, the number of examples needed to guarantee with probability 1 - delta that the error is lower than some threshold gamma is lower bounded by 1/gamma^2 * log(k / delta), where k is (hand-wavingly) the complexity of the classifier family. See http://www.stanford.edu/class/cs229/notes/cs229-notes4.pdf for a more formal treatment.
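To get a feel for how that bound scales, here is a small sketch that evaluates the expression exactly as written in the comment above. The function name `samples_needed` is made up, the logarithm is assumed to be natural, and the constant factors are an assumption: formal statements of the bound carry extra constants, so treat the output as an order-of-magnitude guide only.

```python
import math


def samples_needed(gamma, delta, k):
    """Evaluate (1 / gamma^2) * log(k / delta), rounded up.

    gamma: error threshold, delta: allowed failure probability,
    k: (hand-wavy) size of the classifier family.
    """
    if not (gamma > 0 and 0 < delta < 1 and k >= 1):
        raise ValueError("need gamma > 0, 0 < delta < 1, k >= 1")
    return math.ceil(math.log(k / delta) / gamma ** 2)


# Halving gamma roughly quadruples the number of examples,
# while a 1000x richer classifier family adds comparatively little.
print(samples_needed(0.10, 0.05, 1_000))
print(samples_needed(0.05, 0.05, 1_000))
print(samples_needed(0.10, 0.05, 1_000_000))
```

The asymmetry is the point: the requirement grows only logarithmically in the complexity of the classifier family but quadratically as the error threshold tightens.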
The story ends with "Will more data usually perform better than a better algorithm?" All that does is expose the ignorance of the experimenter, and that of the writer. It is a well-known and easily demonstrated principle that more data and an algorithm are interchangeable; you can always do it either way. In "Software Physics," it was called the "space/time" tradeoff (i.e., the "space" in memory occupied by the data, the "time" occupied by the process).
Imagine that you need to frequently obtain the square root of an integer. If your integers range from 0..100 (with results from 0..10), all you need is a 101-entry lookup table with the values to whatever precision you require. However, the table grows larger when you have millions of potential values (potentially infinitely large, in fact). This is a "data intensive" solution (with a negligible process...the table lookup is, after all, a process); call it 99% data/1% process. It is also possible to create a program using any of the various time-tested algorithms which, when given a value, will compute the square root. This is a "process intensive" solution (with a negligible amount of data...typically some constants); call it 99% process/1% data.
Now, between the extremes of 1/99 and 99/1, there is an infinite variety of solutions, each one customized and tailored to the specific needs of the application, each with its own data/process trade-offs. To claim that one is inherently superior to the other is an exercise in futility (or ignorance, depending on your viewpoint). The correct answer is, "It depends..."
What's totally unspoken in this report is the PROCESS involved in identifying and gathering the additional "data." It just exposes the original work (and the original reporter of that work here) as performed by incompetents who have nothing constructive to say on the subject. They might do some research in the field before drawing sweeping conclusions with no basis in fact.
(And, a demonstration is NOT a proof.)
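For anyone who wants to see the two extremes of the square-root example side by side, a minimal Python sketch follows. The names (SQRT_TABLE, sqrt_lookup, sqrt_newton) are made up for illustration, and Newton's method is just one pick among the "time-tested algorithms" the comment alludes to; the comment itself does not name one.

```python
import math

# "99% data / 1% process": precompute all 101 answers for the integers 0..100.
SQRT_TABLE = [math.sqrt(n) for n in range(101)]


def sqrt_lookup(n: int) -> float:
    """Data-intensive version: the only 'process' is an index into the table."""
    if not 0 <= n <= 100:
        raise ValueError("table only covers 0..100")
    return SQRT_TABLE[n]


def sqrt_newton(n: float, tolerance: float = 1e-12) -> float:
    """Process-intensive version: Newton's iteration, with only a few constants."""
    if n < 0:
        raise ValueError("square root of a negative number")
    if n == 0:
        return 0.0
    x = n if n >= 1 else 1.0      # starting guess at or above the true root
    for _ in range(100):          # hard cap so the loop always terminates
        nxt = 0.5 * (x + n / x)   # average the guess with n / guess
        if abs(nxt - x) <= tolerance * nxt:
            return nxt
        x = nxt
    return x


if __name__ == "__main__":
    print(sqrt_lookup(49), sqrt_newton(49))
```

The table answers instantly but only inside the range it was built for; the iteration covers any non-negative input at the cost of a few arithmetic steps per call. Everything in between (say, a coarse table used to seed the iteration) is a point on the same data/process spectrum.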
Speed requirements? Sure, you can build a decision tree/table from millions of entries from IMDb, but can you generate results in the required time based on changing user preferences for millions of users? How dangerous is a so-so suggestion versus a "please wait, I'm working" message?
Space requirements? Do you have the machine memory and disk space required by your algorithm? Millions of users, thousands of movies, and millions of viewed movies and ratings? When you build your data structures used for your predictions, you probably don't want to start from scratch and recalculate it each time the user updates some of their preferences.
Does the algorithm scale well? Is its performance predictable given the size of the data set? It might work great with tons of data, but does it perform well with less data? There might be times you need the algorithm to run really quickly with mediocre results, and other times you want detailed analysis with a larger dataset. Plus, what is the risk of suggesting an unenjoyable movie? If the risk is low, then a simpler algorithm and/or one that uses less data might be beneficial.
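On the point about not recalculating from scratch every time a user updates something, here is a tiny Python sketch of the idea. It assumes the structure being maintained is nothing fancier than a per-movie average rating; the class and variable names are hypothetical, and nothing here is claimed to be how Netflix or any contest entry actually does it.

```python
from collections import defaultdict


class RunningAverage:
    """Keeps a mean up to date in O(1) per new rating, without storing the ratings."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def add(self, rating: float) -> None:
        # Incremental mean: new_mean = old_mean + (rating - old_mean) / new_count
        self.count += 1
        self.mean += (rating - self.mean) / self.count


# Hypothetical store: one running average per movie title.
movie_averages = defaultdict(RunningAverage)


def record_rating(movie: str, rating: float) -> float:
    """Fold one new rating into the stored average and return the updated mean."""
    movie_averages[movie].add(rating)
    return movie_averages[movie].mean


if __name__ == "__main__":
    for stars in (4, 2, 3):
        print(record_rating("Some Movie", stars))
```

Each update touches one small record instead of rescanning millions of ratings, which is the kind of property you want to check for before committing to a heavier model.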
So if I leverage the results of another algorithm so that I can make mine simpler that is now seen as adding data?
A scale of five or ten should not make too much of a difference. The difficult part (according to a Wired article) is figuring out the anchoring effect. If you've seen a lot of good movies lately, something mediocre will rate 2 stars, but if you've seen a lot of bad movies lately (ditch that significant other!) then a mediocre movie will more likely receive a three-star rating from you.
Stop the brainwash
So using an old saw --which comes first, the chicken or the egg? Or is there a superior question, how did chickens come to exist in the evolutionary chain? One question unanswerable, or a series of data driven questions that might eventually yield a definitive answer...
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...
And I didn't even get modded up back when I said that.
Can anyone tell me how to set my sig on Slashdot?
In a data mining context, an algorithm extracts, modifies or creates data from an existing data set.
Ultimately, everything is data.... a Turing machine doesn't know "algorithms"....it's a state machine and it's all data to it. So, yeah, although we like to pretend that code and data are separate things, the very bedrock of theory that computer science sits on says that it's all the same, and ultimately, when we choose a data driven architecture or an algorithmically heavy one, we're really just choosing where to make the investment of codifying information.
verb vs noun
both are just parts of a grammar, which, overall is just data. look at a very simple language for a text adventure game below. commas indic
noun -> (torch|gold|sword|goblin|door)
verb -> (take|drop|attack)
direction -> (north|south|east|west)
move -> go direction
action -> verb noun
S -> (move|action)
It's all data...we merely invent noun and verb to classify things. But we could just as easily have written
inventoriable -> torch|gold|sword
inventorying -> take|drop
inventory -> inventorying inventoriable
having no noun and verb at all in our language... in fact, we could theoretically write a text adventure engine with a grammar and a few primitives to run a state machine in the background that describes how things are related...it would be all data, essentially.
all I'm saying is that sometimes, to bring it back to a data mining context, it might make sense to think about the system (data being mined + algorithm) as part of a larger soup of information and then assign to each depending on one's preferences... giving up that proven interchangeability because it is presently good practice seems awfully risky...
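To make the "it would be all data" point concrete, here is a small Python sketch in which the text-adventure grammar from this comment is stored as a plain dictionary and one generic matcher walks it. The dictionary layout and the function names (match, accepts) are my own assumptions for illustration; only the rules themselves come from the comment.

```python
# Each rule maps to a list of alternatives; each alternative is a sequence of
# rule names or literal words. Anything that is not a key is a literal word.
GRAMMAR = {
    "noun": [["torch"], ["gold"], ["sword"], ["goblin"], ["door"]],
    "verb": [["take"], ["drop"], ["attack"]],
    "direction": [["north"], ["south"], ["east"], ["west"]],
    "move": [["go", "direction"]],
    "action": [["verb", "noun"]],
    "S": [["move"], ["action"]],
}

# The same words regrouped with no noun or verb rules at all.
INVENTORY_GRAMMAR = {
    "inventoriable": [["torch"], ["gold"], ["sword"]],
    "inventorying": [["take"], ["drop"]],
    "inventory": [["inventorying", "inventoriable"]],
}


def match(grammar, symbol, words, pos=0):
    """Return the set of positions where `symbol` can end if it starts at `pos`."""
    if symbol not in grammar:  # literal word: consume exactly one token
        return {pos + 1} if pos < len(words) and words[pos] == symbol else set()
    ends = set()
    for alternative in grammar[symbol]:
        positions = {pos}
        for part in alternative:
            positions = {e for p in positions for e in match(grammar, part, words, p)}
        ends |= positions
    return ends


def accepts(grammar, sentence, start="S"):
    """True if the whole sentence can be derived from the start rule."""
    words = sentence.split()
    return len(words) in match(grammar, start, words)


if __name__ == "__main__":
    print(accepts(GRAMMAR, "go north"))
    print(accepts(INVENTORY_GRAMMAR, "take gold", start="inventory"))
```

The matcher never mentions nouns, verbs, or inventory; swapping one dictionary for the other changes what the "language" is without touching a line of the code. (The sketch assumes the grammar has no left-recursive rules, which holds for both grammars here.)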
This is my sig.
There's now a follow-up post (http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html) addressing some of the comments in this thread.