Augmenting Data Beats Better Algorithms

Heuristics?? by nategoose · 2008-04-01 07:13 · Score: 1, Insightful

Aren't these heuristics and not algorithms?

Re:Heuristics?? by EvanED · 2008-04-01 07:33 · Score: 5, Informative

One would hope that the thing that calculates the heuristic is an algorithm. See Wikipedia.
Re:Heuristics?? by glwtta · 2008-04-01 09:25 · Score: 2, Informative

Aren't these heuristics and not algorithms?

Let's not be overly pedantic: a heuristic is a type of algorithm, in casual speech.

--
sic transit gloria mundi
Re:Heuristics?? by nategoose · 2008-04-01 10:23 · Score: 2, Interesting

In this particular case I think that the distinction is important. Saying that something is a better algorithm doesn't imply that it gives better results, as all correct results are semantically the same. Algorithms are ranked on their resource usage. Heuristics are ranked on the perceived goodness of their results. Algorithms must have the same correct results by definition.
Re:Heuristics?? by glwtta · 2008-04-01 10:39 · Score: 2, Insightful

Algorithms must have the same correct results by definition.

Since we are obviously talking about the "goodness" of the results produced by the algorithm, I think it's pretty safe to assume that the broader definition of "algorithm" is being used.

--
sic transit gloria mundi
Re:Heuristics?? by EvanED · 2008-04-01 13:55 · Score: 2, Informative

Let's not be overly pedantic: a heuristic is a type of algorithm, in casual speech.

"In casual speech"? That's just wrong... a heuristic is a type of algorithm, period. (Assuming it meets the other requirements of being an algorithm, such as termination.) That it doesn't produce an optimal result doesn't enter into it. [In this post I say "doesn't produce" as a shorthand for "isn't guaranteed to produce".]

CS theorists talk about randomized algorithms. They don't produce an optimal result. CS theorists talk about online algorithms. They don't produce an optimal result. CS theorists talk about approximation algorithms. They don't produce an optimal result.

Producing an optimal result isn't a requirement of being an algorithm. Heuristics are just algorithms that tend to produce useful results most of the time. In fact, the Wikipedia page for the CS notion of a heuristic is called "heuristic algorithm."
Re:Heuristics?? by EvanED · 2008-04-01 13:59 · Score: 3, Insightful

Algorithms are ranked on their resource usage.
Not always. Approximation algorithms are often ranked on their accuracy. Online algorithms are often ranked on something called the competitive ratio. Randomized algorithms are usually ranked on their resource usage, but all three of these needn't be optimal (in the context of an optimization problem) -- or produce correct results (in the context of a decision problem).

Algorithms must have the same correct results by definition.
[citation needed]
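
To make the "ranked on accuracy" point concrete, here is a minimal sketch of a textbook approximation algorithm: the matching-based vertex cover, which is guaranteed to return a cover at most twice the optimal size. The function names and the star-graph example are illustrative assumptions, not anything from the article.

```python
def vertex_cover_2approx(edges):
    """Return a vertex cover at most twice the size of an optimal one."""
    cover = set()
    for u, v in edges:
        # If neither endpoint is covered yet, take both endpoints.
        if u not in cover and v not in cover:
            cover.add(u)
            cover.add(v)
    return cover


def is_vertex_cover(edges, cover):
    """Check that every edge has at least one endpoint in the cover."""
    return all(u in cover or v in cover for u, v in edges)


# Star graph: vertex 0 joined to four leaves. The optimal cover is {0}.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
cover = vertex_cover_2approx(star)
optimal_size = 1  # known by inspection for this particular graph
ratio = len(cover) / optimal_size
```

It terminates and always returns a valid cover, so it is an algorithm; what it gets ranked on is the ratio between its answer and the optimum, not just its running time.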
Re:Heuristics?? by glwtta · 2008-04-01 14:58 · Score: 1

Producing an optimal result isn't a requirement of being an algorithm.

If you are feeling overly pedantic (like the OP) it can be; i.e., an algorithm must provide a solution to a problem, and an approximation is not the same as a solution (in the CS sense).

But the whole thing is just the kind of nitpicking that only someone who is really proud of having taken Intro to Complexity and Computability Theory recently would engage in.

--
sic transit gloria mundi
Re:Heuristics?? by glwtta · 2008-04-01 15:04 · Score: 1

Words have meanings, and scientific/mathematical words have very precise meanings. Asshole.

And to make things interesting, many words have several meanings, especially scientific terms that also have far more common, and less precise, general meanings. It's not that hard a concept to grasp, really.

--
sic transit gloria mundi
Re:Heuristics?? by tuomoks · 2008-04-01 19:17 · Score: 1

Correct! I have to say I'm not amazed. We did the same kind of ratings a long time ago, guess where: right, in insurance, as part of risk management. Is a red-headed under-30 less of a risk than a blond of the same age? Better yet, what will their next accident be and how costly? Trying to predict human behavior, its causes and its results, has been around a long time. The same was done, for example, for ships we insured worldwide, but it wasn't just information about the shipping company: we did background checks and collected information on the captain, the crew, the companies using or shipping the goods, etc. The more information, the more accurate the risk estimates.
To the topic, the more information you get, external or internal, the better the estimate, proven in my mind (and in your insurance rates).
Now, this creates an interesting dilemma: how much information can you get, and how much information collection will the targets tolerate? Another subject!
Re:Heuristics?? by The+Clockwork+Troll · 2008-04-01 21:23 · Score: 1

OK, but "exponential" pretty much always means "exponential" and is never a reasonable synonym for quadratic, geometric, superexponential, or anything growing non-exponentially.

Intellectually lazy or uninformed people frequently use "exponential" to mean "growing more quickly than my operative mathematics skill can bound".

--

There are no karma whores, only moderation johns
Re:Heuristics?? by SnowZero · 2008-04-01 21:23 · Score: 1

If you are feeling overly pedantic (like the OP) it can be; ie an algorithm must provide a solution to a problem, and an approximation is not the same as a solution (in the CS sense).

You have to be careful in defining "the problem". Deterministic computers only execute algorithms. However the problem that the algorithm solves may not be the actual problem you really care about. When those two problem definitions differ, but an algorithm for the former is an approximation or useful substitute for an algorithm solving the true problem, what you've got is a heuristic.

Say I want to choose the best 10 students out of 100 for competing in a math competition. Clearly there's no algorithm for that. What I can do is test them all on last year's test, put the scores in an array, sort the array, and take the first ten. That's an algorithm, and it solves a problem exactly (get the top ten scoring students), but it's not the problem I really care about (choose the ten best for the competition). It is however, a heuristic.
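
The test-score procedure above can be sketched in a few lines. The student names and scores here are made-up placeholder data, and `top_scorers` is a hypothetical helper name.

```python
def top_scorers(scores, k=10):
    """Return the names of the k students with the highest test scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


# Placeholder data: 100 students with distinct scores between 0 and 100.
scores = {f"student_{i:03d}": (i * 37) % 101 for i in range(100)}
team = top_scorers(scores)
```

The code solves "top ten by last year's score" exactly. Whether those ten are the best picks for the competition is the part no amount of sorting can answer.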

A problem from my own work is "random" number generators for implementations of randomized algorithms; I use a pseudo-random number generator (PRNG), which is an algorithm, since it'll compute an exact result given its mathematical definition. But that problem is meaningless to me; I really want true random numbers. So if I use the pseudo-random numbers to yield "random" numbers, then in that context it's only a heuristic. The PRNG behaves like a true random number generator under a bunch of statistical tests, but in reality it isn't even a true approximation. That doesn't make it any less useful though; a good PRNG can thus be an incredibly useful surrogate for the unattainable true RNG (on a deterministic computer).
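
A quick sketch of why a PRNG is "an algorithm with an exact result": given the same seed, Python's standard `random.Random` reproduces the same sequence every time. The `sample` helper and the seed values are arbitrary choices for illustration.

```python
import random


def sample(seed, n=5):
    """Draw n pseudo-random floats from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


a = sample(42)
b = sample(42)  # same seed: identical sequence, fully deterministic
c = sample(43)  # different seed: a different, equally deterministic sequence
```

Nothing here is random in the true sense; the output only looks that way to whatever consumes it, which is the sense in which it acts as a heuristic stand-in.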
Re:Heuristics?? by Planesdragon · 2008-04-02 00:36 · Score: 1

OK, but "exponential" pretty much always means "exponential"

FWIW, "exponential" doesn't always mean "rises according to an exponent of the past value." It also means "rising with a rate of growth that increases rapidly over time."

Welcome to the English language, where words mean what they are defined by the user to mean. This may or may not reveal the speaker to be an asshat, but trying to say that it's an incorrect usage is a sure way to reveal yourself as one.
Re:Heuristics?? by electrictroy · 2008-04-02 02:32 · Score: 1

When I first read this article I was confused about what they meant. But after I thought about it for a while, it seemed self-evident. Using a consumer example: A 1920x1080 HD-DVD is going to look better than a 720x480 SD-DVD with a 1920x1080 upscaling algorithm applied to it. That's fairly self-evident.

More data will produce better results.

--
The government is not your daddy. Its purpose is not to raid middle-class neighbors' wallets and give it to you.
Re:Heuristics?? by somersault · 2008-04-02 05:18 · Score: 1

Exactly. I want to tag this as 'duh'. The challenge was to mine the original data set though, so I'm not sure if this would even be a valid entry?

Here's how the competition works. Netflix has provided a large data set that tells you how nearly half a million people have rated about 18,000 movies. Based on these ratings, you are asked to predict the ratings of these users for movies in the set that they have not rated. [emphasis mine]

--
which is totally what she said
Re:Heuristics?? by The+Clockwork+Troll · 2008-04-02 15:26 · Score: 1

Spelling, grammar, and colloquial usages that don't match up with their strict definition - all fine - if they are unambiguous.

I nitpick the reckless use of "exponential" because it implies a certain level of growth or complexity that may overstate or understate reality, making it difficult to appreciate or consider fairly the magnitude of the matter being discussed.

By way of contrast, many people casually use big-O notation when it would be more precise to use theta notation to describe asymptotic growth, but one can usually discern what was meant even without much context.

But when a word like "exponential" is abused, you don't immediately know what you're dealing with.

--

There are no karma whores, only moderation johns

Depends on the Problem by roadkill_cr · 2008-04-01 07:14 · Score: 4, Insightful

I think it heavily depends on what kind of data you're mining.

I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, so I'm not sure if this one case study is enough to apply the idea to all algorithms.

Though, in a way, this is sort of a "duh" result - data mining relies on lots of good data, and the more there is generally the better a fit you can make with your algorithm.

Re:Depends on the Problem by TooMuchToDo · 2008-04-01 07:38 · Score: 1

Exactly. An algorithm can't see what isn't there, so the more data you have, the better your result will be. You can of course improve upon the algorithm, but the quality/quantity of data is always going to be more important.
Re:Depends on the Problem by ubrgeek · 2008-04-01 07:51 · Score: 1

Isn't that similar to the posting about Berkeley's joke recommender from the other day? Rate jokes and it then suggests ones you should like. I tried it and I don't know if the pool from which the jokes are pulled is shallow, but the ones it returned after I finished "calibrating" it were terrible and not along the lines of what I would have assumed the system thought I would find funny.

--
Bark less. Wag more.
Re:Depends on the Problem by Brian+Gordon · 2008-04-01 07:52 · Score: 2, Insightful

It's not always going to be more important. There's really no difference between a sample of 10 million and a sample of 100 million. At that point it's obviously more effective to put work into improving the algorithm, but that turning point (again obviously) would come way before 10 million samples of data. It's a balance.
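
One rough way to see the diminishing return: if all you are estimating is a simple average, the standard error shrinks only with the square root of the sample size. This sketch assumes independent samples and a rating standard deviation of 1.0, both of which are illustrative assumptions.

```python
import math


def standard_error(std_dev, n):
    """Standard error of a sample mean over n independent samples."""
    return std_dev / math.sqrt(n)


se_10m = standard_error(1.0, 10_000_000)
se_100m = standard_error(1.0, 100_000_000)
improvement = se_10m / se_100m  # how much tighter the estimate gets
```

Ten times the data buys roughly a 3.16x tighter estimate here, and both errors are already far below anything a star rating can resolve. A real recommender estimates far more than one number, so treat this as intuition only.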
Re:Depends on the Problem by RedHelix · 2008-04-01 08:20 · Score: 3, Insightful

Well, yeah, augmenting data can produce more reliable results than better algorithms. If a legion of film buffs went through every single film record on Netflix's database and assigned "recommendable" films to it, then went and looked up the rental history of every Netflix user and assigned them individual recommendations, you would probably end up with a recommendation system that beats any algorithm. The dataset here would be ENORMOUS. But the reason algorithms exist is so that doesn't have to happen. i like turtles
Re:Depends on the Problem by blahplusplus · 2008-04-01 08:23 · Score: 3, Interesting

"I worked for a while on the Netflix prize, and if there's one thing I learned it's that a recommender system almost always gets better the more data you put into it, ...."

Ironically enough, you'd think they'd adopt the wikipedia model where their customers can simply vote thumbs up vs thumbs down to a small list of recommendations every time they visit their site.

All this convenience comes at a cost though, you're basically giving people insight into your personality and who you are and I'm sure many "Recommendation engines" easily double as demographic data for advertisers and other companies.
Re:Depends on the Problem by roadkill_cr · 2008-04-01 08:30 · Score: 3, Insightful

It's true that you lose some anonymity, but there is so much to gain. To be perfectly honest, I'm completely fine with rating products on Amazon.com and Netflix - I only go to these sites to shop for products and movies, so why not take full advantage of their recommendation system? If I am in consumer mode, I want the salesman to be as competent as possible.

Anyways, if you're paranoid about data on you being used - there's a less well-known field of recommender systems which uses implicit data gathering which can be easily set up on any site. For example, it might say that because you clicked on product X many times today, you're probably in want of it and they can use that data. Of course, implicit data gathering is more faulty than explicit data gathering, but it just goes to show that if you spend time on the internet, websites can always use your data for their own means.
Re:Depends on the Problem by egyptiankarim · 2008-04-01 09:09 · Score: 1

This may not be as intuitive as you state, and I think there may be two different things under discussion here.

While mining of data for movie recommendations becomes easier as you add n rankings from user x, it isn't intuitively obvious that better recommendations may result from knowing more about movie y.

In other (perhaps more confusing) words, augmenting the data type movie to include attributes like release date, executive producer, and another 100 specific details may reveal more relationships about a viewer's movie preferences than simply having them rank another 100 movies.

In even yet another set of words, it's not about increasing the number of samples, it's about enhancing the resolution of existing points. This is maybe not as intuitively obvious as what you were suggesting.

At least, this is what I think the article was getting at :)

--
Eek!
Re:Depends on the Problem by teh+moges · 2008-04-01 10:01 · Score: 4, Insightful

Think less in sheer numbers and more in density. If there are 200 million possible 'combinations' (say, 50,000 customers and 4000 movies in a Netflix-like situation), then with 10 million data samples, we only have 5% of the possible data. This means that if we are predicting inside the data scope, we are predicting into an unknown field that is 19 times larger than the known.
Say we were looking at 100 million fields, suddenly we have 50% of the possible data, and our unknown field is the same size as the known field. Much more likely to get a result then.
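
The arithmetic above, as a small sketch. The `coverage` function name is made up; the numbers are the ones from this comment.

```python
def coverage(known, customers, movies):
    """Return (density, unknown-to-known ratio) for a ratings matrix."""
    possible = customers * movies
    density = known / possible
    unknown_to_known = (possible - known) / known
    return density, unknown_to_known


sparse = coverage(10_000_000, 50_000, 4_000)   # 5% known, unknown is 19x larger
dense = coverage(100_000_000, 50_000, 4_000)   # 50% known, unknown is 1x
```

Note the ratio is unknown over known (95% / 5% = 19), not one over the density, which would give 20.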
Re:Depends on the Problem by leenks · 2008-04-01 10:59 · Score: 1

Or we over-fit to the training data and end up performing badly in the real world when trends change (eg new style of film production appears)
Re:Depends on the Problem by epine · 2008-04-01 11:52 · Score: 2, Insightful

It seems to be a bad day for science writing. The piece on rowing a galley was a joke. And now we're being told that one data mining problem with a dominant low-hanging return on augmenting data represents a general principle.

The Netflix data shouldn't be regarded as representative of anything. That data set has shockingly low dimensionality. So far as I know, they make no attempt to differentiate what kind of enjoyment the viewer obtained from the movie, or even determine whether the movie was viewed in a solo or group situation. They don't ask "who was your favorite character / actor / actress". Nor do they follow up on aging opinions: "which of these two movies would you presently rate higher?" so the corroboration factor is zero.

I'm pretty fussy about the movies I rent. The worst movie I've endured this year was "Night at the Museum", which was loaned to me. I managed to get through it at the 1.4x speed setting on a slow evening.

As bad as it was, I wouldn't rate it less than a 3. I'd like to save 1 and 2 for outright incompetence. Was "Museum" a manipulative piece of crap? Absolutely. I'd tick that box in a heartbeat. Did I feel personally soiled by Genghis' emotional discharge? I've been showering for days. From what I've read about Genghis, the only way to get him to discharge would have been to lock him in a room with Sacagawea.

If you give "Museum" a three for competence squandered, what do you give Soderbergh's "Solaris"? I'm glad I watched it. It was interesting to see what they did with the sets, and to find out whether anything ever happens (spoiler: no). I still recall the intensity of the black woman, though unfortunately her fine acting served no real purpose. While I was happy to rent it, it also earned a place on my list of movies least likely to rent twice.

Really, Netflix deserves five gold stars for having created the least augmented opinion stream since baby spit out his Brussels sprouts.
Re:Depends on the Problem by Brian+Gordon · 2008-04-01 13:03 · Score: 1

5% is a 20th of 100, not a 19th :)
Re:Depends on the Problem by CastrTroy · 2008-04-01 13:08 · Score: 1

I don't think it's about density or sheer numbers. It's about how much other data you have about the data you're looking at. In this case, they augmented the netflix data with data from IMDB. By having more data about the movies that were being rated, such as actors, producers, directors, year of filming, and other information, they were better able to recommend movies. If the only thing you know about a movie is who voted for it, but not why, then you are going to have a hard time recommending movies to others.
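
A minimal sketch of what "augmenting" means here: joining each bare rating with extra facts about the movie. The record layout, the movie ids, and the metadata fields are all invented for illustration; the real Netflix and IMDB formats are not shown in this thread.

```python
def augment(ratings, metadata):
    """Attach per-movie metadata to each (user, movie, stars) rating."""
    rows = []
    for user, movie, stars in ratings:
        extra = metadata.get(movie, {})  # movies without metadata stay bare
        rows.append({"user": user, "movie": movie, "stars": stars, **extra})
    return rows


# Hypothetical sample data.
ratings = [("alice", "m1", 5), ("alice", "m2", 1), ("bob", "m3", 4)]
metadata = {
    "m1": {"genre": "comedy", "year": 1999, "director": "Director A"},
    "m2": {"genre": "horror", "year": 2004, "director": "Director B"},
}
rows = augment(ratings, metadata)
```

With only the first three fields, all a model knows is who rated what. With the extra columns it can at least start guessing at why.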

--

Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Re:Depends on the Problem by teh+moges · 2008-04-01 16:41 · Score: 1

Take 5% - the 'known', leaves 95% - the 'unknown'. The unknown is 19 times larger than the known.
Re:Depends on the Problem by The_reformant · 2008-04-01 21:56 · Score: 1

I actually wondered about whether only ever recommending the current unseen top blockbusters to customers would be a similar and more effective netflix algorithm. By definition these movies have a high likelihood of being enjoyed (since that is the criterion you have selected them on)
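
That "just recommend the top unseen blockbusters" idea is easy to sketch as a baseline. Everything below (function name, the `min_ratings` cutoff, the sample data) is a hypothetical illustration, not how Netflix does it.

```python
from collections import defaultdict


def popular_unseen(ratings, user, k=3, min_ratings=2):
    """Recommend the k best-rated movies the user has not rated yet."""
    totals = defaultdict(lambda: [0, 0])  # movie -> [sum of stars, count]
    seen = set()
    for rater, movie, stars in ratings:
        totals[movie][0] += stars
        totals[movie][1] += 1
        if rater == user:
            seen.add(movie)
    candidates = [
        (total / count, movie)
        for movie, (total, count) in totals.items()
        if movie not in seen and count >= min_ratings
    ]
    candidates.sort(reverse=True)  # highest average rating first
    return [movie for _, movie in candidates[:k]]


# Hypothetical ratings: (user, movie, stars).
ratings = [
    ("a", "m1", 5), ("b", "m1", 4),
    ("a", "m2", 2), ("b", "m2", 3), ("c", "m2", 1),
    ("b", "m3", 5), ("c", "m3", 5),
]
for_a = popular_unseen(ratings, "a")
for_c = popular_unseen(ratings, "c")
```

It ignores personal taste entirely, which is why it works as a sanity-check baseline: a personalized system that cannot beat this is not earning its keep.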

--
I have discovered a truly remarkable sig which this post is too small to contain.
Re:Depends on the Problem by deander2 · 2008-04-02 05:18 · Score: 1

but the netflix prize has 500,000 users, 20,000 movies and 10,000,000 ratings. that's 10,000,000,000 possible ratings, making the given 0.1%.

but of course you're only asked to predict a subset of ~3,000,000. (still a lot for the given data, but hey, it's $1,000,000 ;)

--
http://kered.org

I think better is subjective... by 3p1ph4ny · 2008-04-01 07:14 · Score: 3, Insightful

In problems like minimizing lateness et al., "better" can be simply defined as "closer to optimal" or "fewer time units late."

Here, better means different things to different people. Having more data gives you a larger set of people, and probably a more accurate definition of better for a larger set of people. I'm not sure you can really compare the two.

Re:I think better is subjective... by moderatorrater · 2008-04-01 07:37 · Score: 1

In this case, better is well defined. They're looking for a system that can take a certain data set and use it to predict another data set. Ultimately, the quality of picks is determined by the user. For this contest, they've got data sets that they can use to determine which is the best method.
Re:I think better is subjective... by phkhd · 2008-04-01 08:26 · Score: 1

More to the point, the demographic is not for the perfect Pepsi, but for the perfect Pepsis. Malcolm Gladwell had a talk at a TED conference, where he explains demographic research that has been done to show that there is usually not a perfect solution, but rather several near optimal solutions. Probably not applicable to all data-mining applications, but certainly appropriate for anything relating to subjective tastes.

Um, Yes? by randyest · 2008-04-01 07:15 · Score: 4, Insightful

Of course. Why wouldn't more (or better) relevant data that applies on a case-by-case basis provide more improved results than an "improved algorithm" (what does that mean, really?) that applies generally and globally?

I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.

--
everything in moderation

Re:Um, Yes? by robbyjo · 2008-04-01 07:24 · Score: 1

It's a simple application of the Rao-Blackwell theorem at work. Making use of useful information (in this case, movie genre) makes the estimate more precise.

--
Error 500: Internal sig error
Re:Um, Yes? by canajin56 · 2008-04-01 10:04 · Score: 2, Funny

I think we need much, much more rigorous definitions of "more data" and "better algorithm" in order to discuss this in any meaningful way.
So what you are saying is, to answer the question, we need more data?

--
ASCII stupid question, get a stupid ANSI

This reminds me by FredFredrickson · 2008-04-01 07:16 · Score: 2, Interesting

This reminds me of those articles that say that the amount of data humanity has archived is so much data that nobody could possibly use it in a lifetime. I think what people fail to remember is this: the point is to have available data just-in-case you need to reference it in the future. Nobody watches security tapes in full. They review the day or hour that the robbery occurred. Does that mean we should stop recording everything? No. Let's keep archiving.

Combine that with the speed at which computers are getting more efficient - and I see no reason not to just keep piling up this crap. More is always better. (More efficient might be better - but add the two together, and you're unstoppable)

--
Belief? Hope? Preference?The Existential Vortex

Is it just me that is surprised here? by zappepcs · 2008-04-01 07:16 · Score: 1, Insightful

"What do you think? Will more data usually perform better than a better algorithm?"

Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

I read the article in question here and can say that I'm surprised that this is even a question.

--
Support NYCountryLawyer RIAA vs People

Re:Is it just me that is surprised here? by gnick · 2008-04-01 07:27 · Score: 5, Informative

The netflix challenge is to arrive at a better algorithm with the supplied data.

Actually, the rules explicitly allow supplementing the data set and Netflix points out that they explore external data sets as well.

--
He's getting rather old, but he's a good mouse.
Re:Is it just me that is surprised here? by geminidomino · 2008-04-01 08:40 · Score: 1

What do you think? Will more data usually perform better than a better algorithm?" Duh... the algorithm can ONLY be as good as the data supplied to it. Better data always improves performance in this type of problem. The netflix challenge is to arrive at a better algorithm with the supplied data. Adding more data gives you a richer data set to choose from. This is obvious, no?

I read the article in question here and can say that I'm surprised that this is even a question.

Good point. There doesn't appear to be any mention of the improvement of supplemented data AND an improved algorithm.
Re:Is it just me that is surprised here? by cavemanf16 · 2008-04-01 09:24 · Score: 2, Informative

I tend to agree that augmenting data helps improve the model if the model is not yet overwhelmed with data, but you have to have a decent model to begin with or it won't work. Additionally, the payoff of additional data added to the model is a diminishing return as the amount of data available begins to overwhelm any given model. In other words, the more data you collect and put into your model, the more expensive, time consuming, and difficult it becomes to continue to rely on the original model.

In linear regression models for forecasting there is what's known as a "variance inflation factor". This factor helps a statistician know when their linear regression model is beginning to perform poorly because too many inter-related variables are in the equation: different variables (containing different, but inter-related data) will eventually begin to conflict with one another.
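
For the curious, a minimal sketch of that factor in the simplest case of two predictors, where the R-squared from regressing one predictor on the other is just the squared Pearson correlation, so VIF = 1 / (1 - r^2). The function names and sample numbers are made up; with more predictors you would regress each one on all the others instead.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """Variance inflation factor when the model has exactly two predictors."""
    r = pearson(x1, x2)
    return 1.0 / (1.0 - r * r)  # undefined if the predictors are perfectly collinear


# Two nearly redundant predictors versus two unrelated ones (made-up data).
redundant = vif_two_predictors(
    [1, 2, 3, 4, 5, 6, 7, 8],
    [1.1, 1.9, 3.2, 3.8, 5.1, 6.0, 6.9, 8.2],
)
unrelated = vif_two_predictors([1, 2, 3, 4], [1, -1, -1, 1])
```

A value near 1 means the predictor carries its own information; a common rule of thumb treats values above roughly 5 to 10 as a sign that two variables are telling the model the same thing.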

For the Netflix thing, this could show up as a problem if the model is trying to recommend which movie you should rent next based on actors/actresses in previous movies you've watched, which movies you rated higher than others, which genres those highly rated movies were in, which actors/actresses you had rated highly, and which movies those highly rated actors/actresses had been in that you hadn't seen yet. It's quite likely that someone like Kevin Bacon has been in some romantic comedy with another one of your favorite actors or actresses, but you absolutely hate horror movies and he's in a "horror" film with that same actor or actress. The recommendation model would likely try to recommend a movie to you based on three positives (a favorite film and two separate favorite actors) because there's only one negative in the equation (your hatred for horror movies). This is a very simplistic example, but that's the problem of too much data with too simplistic of an algorithm. A linear regression might have this problem, but if one were to build in an additional bit of algorithm magic that made sure horror movies were "filtered out" or severely punished for being in the horror genre before looking for other factors like favorite actors/actresses in movies then the algorithm would perform better. But then, of course, additional types of data would be needed to adequately "fill in the gaps" for the new monster algorithm that you've created.
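
A toy sketch of the three-positives-versus-one-negative problem. The scoring scheme, field names, and movie record are all invented for illustration; no real recommender is this crude.

```python
def additive_score(movie, prefs):
    """Naive additive model: each matching signal counts +1 or -1."""
    score = sum(1 for actor in movie["actors"] if actor in prefs["liked_actors"])
    if movie["similar_to"] in prefs["liked_movies"]:
        score += 1
    if movie["genre"] in prefs["hated_genres"]:
        score -= 1
    return score


def filtered_score(movie, prefs):
    """Rule out hated genres first, then fall back to the additive model."""
    if movie["genre"] in prefs["hated_genres"]:
        return float("-inf")
    return additive_score(movie, prefs)


# Hypothetical viewer and movie.
prefs = {
    "liked_actors": {"Actor A", "Actor B"},
    "liked_movies": {"Favorite Film"},
    "hated_genres": {"horror"},
}
horror_movie = {
    "actors": ["Actor A", "Actor B"],
    "similar_to": "Favorite Film",
    "genre": "horror",
}
naive = additive_score(horror_movie, prefs)
strict = filtered_score(horror_movie, prefs)
```

The additive model comes out at +2 and happily recommends the horror film. The filter kills it outright, at the cost of needing reliable genre data for every title.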
Re:Is it just me that is surprised here? by glitch23 · 2008-04-01 16:03 · Score: 1

Actually, the rules explicitly allow supplementing the data set and Netflix points out that they explore external data sets as well.
However for purposes of NetFlix verifying a contestant's algorithm to see how much better or worse it is compared to NetFlix's current algorithm they use a separate, internal data set whose contents are not known to contestants.

--
this nation, under God, shall have a new birth of freedom. -- Lincoln, Gettysburg Address
Re:Is it just me that is surprised here? by gnick · 2008-04-01 16:56 · Score: 1

However for purposes of NetFlix verifying a contestant's algorithm to see how much better or worse it is compared to NetFlix's current algorithm they use a separate, internal data set whose contents are not known to contestants.

Of course. External data sets are used for training predictors. Not verifying accuracy... Wasn't that largely the point of TFA?

--
He's getting rather old, but he's a good mouse.

Too a large extent ... by haluness · 2008-04-01 07:17 · Score: 2, Interesting

I can see that more data (especially more varied data) could be better than a tweaked algorithm. Especially in machine learning, I see many people publish papers on a new method that does 1% better than preexisting methods.

Now, I won't deny that algorithmic advances are important, but it seems to me that unless you have a better understanding of the underlying system (which might be a physical system or a social system) tweaking algorithms would only lead to marginal improvements.

Obviously, there will be a big jump when going from a simplistic method (say linear regression) to a more sophisticated method (say SVMs). But going from one type of SVM to another slightly tweaked version of the fundamental SVM algorithm is probably not as worthwhile as sitting down and trying to understand what is generating the observed data in the first place.

Re:Too a large extent ... by Artuir · 2008-04-01 08:43 · Score: 1

You know, one time I had the luxury of working on a blade server set up to use both forms of feeds. Ultimately I found when compiling my AI dataset, each subchannel was coherently placed within 5 arcs of true accuracy. The AI was able to do very well on the turing test as a result and my boss was quite pleased.

Has anyone attempted to use a KVM setup to see if this improves the data augmentation at all?
Re:Too a large extent ... by __aaahtg7394 · 2008-04-01 09:00 · Score: 2, Interesting

I see many people publish papers on a new method that does 1% better than preexisting methods.

If that 1% is from 95% to 96% accuracy, it's actually a 20% improvement in error rates! I know this sounds like an example from "How to Lie With Statistics," but it is the correct way to look at this sort of problem.

It's like n-9s uptime. Each nine in your reliability score costs geometrically more than the last; the same sort of thing holds for the scores measured in ML training.
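
Both numbers worked out as a small sketch. The function names are made up, and the downtime figure assumes a 365-day year.

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Fraction of the old error rate removed by the new method."""
    old_error = 1.0 - old_accuracy
    new_error = 1.0 - new_accuracy
    return (old_error - new_error) / old_error


def downtime_minutes_per_year(nines):
    """Allowed downtime per 365-day year at the given number of nines."""
    unavailability = 10.0 ** (-nines)
    return unavailability * 365 * 24 * 60


gain = relative_error_reduction(0.95, 0.96)  # about 0.20, i.e. 20% fewer errors
three_nines = downtime_minutes_per_year(3)   # about 525.6 minutes
four_nines = downtime_minutes_per_year(4)    # about 52.56 minutes
```

Each extra nine cuts the downtime budget by a factor of ten, which is the same shape as squeezing the last few points of error out of a model.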

Assuming the algorithm isn't evil by Lije+Baley · 2008-04-01 07:19 · Score: 1

A piece of pertinent data is worth a thousand (code) lines of speculation.

--
Strange things are afoot at the Circle-K.

"Better data" not "more data" by Anonymous Coward · 2008-04-01 07:19 · Score: 1, Insightful

Just having more data to process doesn't produce better results in this sort of field.

Look at the application. Netflix alone VS Netflix+IMDB. The second not only has more data, but it has "better" data in terms of having more human decision inputs applied to it thus weighting the data to produce more correct results.

But if you looked at it this way, Netflix 2007 data VS Netflix 2006-2007 data, I don't think you would find a significant difference in results. This is the same "type" of data, only more of it, whereas the former is a practical example of data fusion.

Char-Lez

More vs Better by Mikkeles · 2008-04-01 07:19 · Score: 3, Insightful

Better data is probably most important and having more data makes having better data more likely. It would probably make sense to analyse the impact of each datum on the accuracy of the result, then choose a better algorithm using the most influential data. That is, a simple algorithm on good data is better than a great algorithm on mediocre data.

--
Great minds think alike; fools seldom differ.

Re:More vs Better by Mushdot · 2008-04-01 09:38 · Score: 1

I agree here, though when humans are involved I think it can be difficult to get accurate data and the skill is in asking for information which has the least subjectiveness.

To give an example closer to the topic, I watched Beowulf last night. After watching the film I was left with a feeling that he wasn't the hero I assumed he was (never having known the real story except a vague knowledge he was some sort of kick-ass old English hero).

I spent a while doing some research and discovered that the film is basically nothing like the real story, except for the fact there's a demon, witch and dragon in it. Oh, and the name Beowulf. This really pissed me off as I have a real problem with Hollywood 'interpretations' of historical facts/legends because the general public will assume that they have watched the real thing.

Anyway, I then thought that review sites such as IMDB ought to have extra points given or deducted depending on accuracy to the original story (depending on whether it is based on a story or not), but additional information like that would be very dependent on a) the user knowing the original story or researching it, and b) whether they could be bothered to go into so much detail. And even then a user may think they are judging accurately whereas in reality they are simply colouring the results, so it is a very difficult area to tackle.

All things being equal... by Just+Some+Guy · 2008-04-01 07:21 · Score: 3, Insightful

One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better, nearly as well as the best algorithm on the boards for the $1 million challenge.

And the teams were identically talented? In my CS classes, I could have hand-picked teams that could make O(2^n) algorithms run quickly and others that could make O(1) take hours.

--
Dewey, what part of this looks like authorities should be involved?

Re:All things being equal... by Just+Some+Guy · 2008-04-01 09:33 · Score: 1

Golf clap for almost (but not quite) getting the point of that.

--
Dewey, what part of this looks like authorities should be involved?
Re:All things being equal... by The_reformant · 2008-04-01 22:05 · Score: 1

Depends what the O(1) algorithm is really, doesn't it? I doubt even the most talented people could make the algorithm "sleep for 100 years" faster than a poor implementation of the travelling salesman problem on, say, 1000 nodes.

--
I have discovered a truly remarkable sig which this post is too small to contain.

Obvious? by nine-times · 2008-04-01 07:22 · Score: 1

Is it just me, or is it pretty obvious that this all just depends on the algorithm and the data?

Like I could "augment" the data with worthless or misleading data, and get the same or worse results. If I have a huge set of really good and useful data, I can get better results without making my algorithm more advanced. And no matter how advanced my algorithm is, it won't return good results if it doesn't have sufficient data.

When a challenge is put out to improve these algorithms, it's really because these companies are operating with limited and/or bad data. They have to deal with crap data and people trying to game the system. They can't pull data from other sites because they don't own the other sites' data. They can't necessarily track their own customers' searches and compile that because (sometimes) their customers would be outraged at the "invasion of privacy".

Hold on a sec... by peacefinder · 2008-04-01 07:23 · Score: 4, Funny

"What do you think? Will more data usually perform better than a better algorithm?"

I need more data.

--
With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd

Re:Hold on a sec... by Archangel+Michael · 2008-04-01 07:52 · Score: 1

... or a better algorithm

This is classic XOR thinking that permeates our society. "One or the other, but not both" is rarely a correct option. It is mostly for boolean operations, which this is clearly not. This is clearly an AND function. More Data AND a Better Algorithm is actually the most correct answer. "Which helps more?" is a silly question except for deciding how resources should be split between improving the two, along with how much easier one is than the other.

--
Agent K: A *person* is smart. People are dumb, stupid, panicky animals, and you know it.
Re:Hold on a sec... by peacefinder · 2008-04-01 08:10 · Score: 1

"How can you be sure that you're not actually in need of a better algorithm?"

I was optimizing for humor.

--
With reasonable men I will reason; with humane men I will plead; but to tyrants I will give no quarter. -- William Lloyd
Re:Hold on a sec... by aug24 · 2008-04-01 22:39 · Score: 1

Or a better algorithm.

--
You're only jealous cos the little penguins are talking to me.

Five stars by CopaceticOpus · 2008-04-01 07:24 · Score: 5, Insightful

If more data is helpful, then Netflix is really hurting themselves with their 5-star rating system. I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low. Many movies that range from a 7/10 to a 9/10 get lumped into that 4 star category, and the nuances of the data are lost.
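
A rough sketch of the lumping effect, for anyone who wants to see it in numbers. The bin boundaries and the movie titles below are my own made-up assumptions about how one person might squeeze a 10-point opinion into five stars; this is not how Netflix defines its stars.

```python
# Hypothetical personal mapping from a 10-point opinion to a 5-star rating.
# Each tuple is (lowest score, highest score, stars awarded).
STAR_BINS = [(1, 2, 1), (3, 4, 2), (5, 6, 3), (7, 9, 4), (10, 10, 5)]

def to_stars(score):
    """Return the star rating for an integer score from 1 to 10."""
    for low, high, stars in STAR_BINS:
        if low <= score <= high:
            return stars
    raise ValueError("score must be an integer from 1 to 10")

# Three made-up movies with clearly different 10-point opinions.
opinions = {"Movie A": 7, "Movie B": 8, "Movie C": 9}
stars = {title: to_stars(score) for title, score in opinions.items()}

distinct_before = len(set(opinions.values()))  # distinct opinions on the 10-point scale
distinct_after = len(set(stars.values()))      # distinct ratings after compression
```

With this mapping all three movies land on the same star value, so whatever separated a 7 from a 9 never reaches the recommender.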

How to translate the entire experience of watching a movie into a lone number is a separate issue.

Re:Five stars by edcheevy · 2008-04-01 07:53 · Score: 1

You're absolutely correct, the more you compress categories the more you essentially throw away data. The flip side is the average customer whose mind would be blown if you let them rate a movie on a 1 to 1000 scale (or something equally ridiculous). Most of us would chunk that down into a more meaningful range anyway.

I'm afraid I don't have the linkage, but I seem to recall research on the Likert scale (typically a 1-5 or 1-7 scale) that actually found larger scales really didn't add much beyond 1 through 7 or 1 through 9. That said "not adding much" may still be worth a million dollars to Netflix if that "not much" is still better than what they've got (and doesn't scare people by offering too many choices).
Re:Five stars by Chris+Burke · 2008-04-01 07:57 · Score: 1

I'd only give 5 stars to a really amazing movie, but to only give 3/5 stars to a movie I enjoyed feels too low.

I don't think this is a problem of it being a 1-5 scale instead of 1-10. It's not like there's really that much information given by scoring a movie a 7 instead of 8, since it's all subjective anyway and on any given day those scores could have been reversed.

I think it's more the extremely common situation where people don't want to give an "average" score, so you get score inflation such that only the top half of the scale is ever used except for palpably bad movies. So even in a 1-10 scale, you're only really using 6-10. You might as well use a 0-5 scale, where "1" means good, and a 0 is anything worse than that.

I personally try to solve this by firmly keeping in mind the idea that the middle score should be for the "average" movie. If I'm never giving out scores that are 3 or lower, then I'm not rating them correctly. Unfortunately, I'm the only one who does this, which just means that I'm giving movies lower scores than everyone else even if I felt the same about the movie. So it's not much of a solution. ;)
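
For what it's worth, one common way a system could compensate for this is to subtract each user's own average before comparing ratings. That is my assumption about a possible fix, not something I know Netflix does, and the users and ratings below are invented for illustration.

```python
# Hypothetical ratings: two users who feel the same about each movie,
# but one of them uses the middle of the scale and the other inflates.
RATINGS = {
    "harsh_rater": {"Movie A": 3, "Movie B": 2, "Movie C": 4},
    "generous_rater": {"Movie A": 4, "Movie B": 3, "Movie C": 5},
}

def mean_centered(user_ratings):
    """Express each rating as an offset from that user's own average."""
    mean = sum(user_ratings.values()) / len(user_ratings)
    return {title: stars - mean for title, stars in user_ratings.items()}

centered = {user: mean_centered(r) for user, r in RATINGS.items()}
```

After centering, both users show the same pattern (average, below average, above average), so the lower raw scores no longer look like a different opinion.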

--

The enemies of Democracy are
Re:Five stars by areReady · 2008-04-01 08:02 · Score: 1

Well, I suppose they should use a 100-point scale, so you don't have to lump all those 71-79s together in the 7's when there could be much more delineation between them. Or 1,000 points. Obviously this breaks down at some point. Five stars isn't necessarily bad, the correlation between positive and negative ratings is still very useful.
Re:Five stars by RingDev · 2008-04-01 08:22 · Score: 1

Hey, if you can find a link to that research, please post it. I swear I caught a glimpse of similar work years ago in college, but have long since lost it. And it just so happens that now I'm working for an R&D company focusing on mental health testing, and the topic of scale sizes comes up on occasion (especially when planning surveys for patients with mental illnesses). Anyways, I've had squat for luck tracking down that paper.

-Rick

--
"Most people in the U.S. wouldn't know they live in a tyrannical state if it walked up and grabbed their junk." - MyFirs
Re:Five stars by pavon · 2008-04-01 08:59 · Score: 1

My interpretation is not that I don't like rating things average, but that selection bias means that I only watch things that I expect to like, and more often than not that turns out to be the case. Every now and then I'll end up disliking a movie that I had high hopes for, or watch a movie I know I won't like with someone else, but for the most part I enjoy the (few) movies I see. And since you only rate the films you've watched, the majority of ratings by the majority of people will be positive.

But that's okay. In this context, the information from rating comes from clustering according to what movies you liked, and the extent to which you liked them isn't as important. Most of the info that netflix has about your viewing habits is binary - did you rent it or not. The main purpose of ranking is just to let them know about movies you rented and didn't like, or movies you watched outside of netflix. So, even if the vast majority of your movies are rated okay, good, or great, that is really all you need.

My iTunes ratings on the other hand, are another issue :)
Re:Five stars by lemnar · 2008-04-01 09:36 · Score: 1

It's just a UI decision - you can rate with half stars.

Re:attn computer scientists: stop renaming stuff by Anonymous Coward · 2008-04-01 07:25 · Score: 5, Funny

you guys are nothing more than glorified engineers. Computer scientists are not glorified engineers. They're the butt of engineers' jokes too.

For the Sake of Discussion by eldavojohn · 2008-04-01 07:29 · Score: 3, Insightful

Well, for the sake of discussion I will try to give you an example so that you might pick it apart.

"more data" More data means that you understand directors and actors/actresses often do a lot of the same work. So for every movie that the user likes, you weight the stars they gave it with a name. Then you cross reference movies containing those people using a database (like IMDB). So if your user loved The Sting and Fight Club, they will also love Spy Game, which had both Redford & Pitt starring in it.

"better algorithm" If you naively look at the data sets, you can imagine that each user represents a taste set and that high correlations between two movies in several users indicates that a user who has not seen the second movie will most likely enjoy it. So if 1,056 users who saw 12 Monkeys loved Donnie Darko but your user has only seen Donnie Darko, highly recommend them 12 Monkeys.

You could also make an elaborate algorithm that uses user age, sex & location ... or even a novel 'distance' algorithm that determines how far away they are from liking 12 Monkeys based on their highly ranked other movies.
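
To make the naive co-occurrence idea concrete, here is a minimal sketch. The user IDs and the sets of liked movies are invented, and counting shared likes is only one crude way to measure the "correlation" I mean; a serious entry would do something much more careful.

```python
from collections import Counter
from itertools import permutations

# Hypothetical data: the set of movies each user rated highly.
LIKED = {
    "u1": {"12 Monkeys", "Donnie Darko"},
    "u2": {"12 Monkeys", "Donnie Darko", "Fight Club"},
    "u3": {"12 Monkeys", "Donnie Darko"},
    "u4": {"Fight Club", "The Sting"},
}

def co_like_counts(liked):
    """Count, for each ordered pair (a, b), how many users liked both."""
    counts = Counter()
    for movies in liked.values():
        for a, b in permutations(sorted(movies), 2):
            counts[(a, b)] += 1
    return counts

def recommend(seen, liked):
    """Rank unseen movies by how often they were co-liked with seen ones."""
    counts = co_like_counts(liked)
    scores = Counter()
    for (a, b), n in counts.items():
        if a in seen and b not in seen:
            scores[b] += n
    return [title for title, _ in scores.most_common()]

# A user who has only seen Donnie Darko.
picks = recommend({"Donnie Darko"}, LIKED)
```

In this toy data 12 Monkeys comes out on top because three users liked it together with Donnie Darko, against one for Fight Club.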

Honestly, I could provide endless ideas for 'better algorithms' although I don't think any of them would even come close to matching what I could do with a database like IMDB. Hell, think of the Bayesian token analysis you could do on the reviews and message boards alone!
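
And here is roughly what I mean by the "more data" approach, as a sketch only. The cast table is a hand-typed stand-in for whatever a real IMDB lookup would return (the placeholder movie and "Actor X" are made up), and summing star ratings per person is just the simplest weighting I could think of.

```python
from collections import defaultdict

# Hypothetical cast table standing in for an IMDB cross-reference.
CAST = {
    "The Sting": {"Robert Redford", "Paul Newman"},
    "Fight Club": {"Brad Pitt", "Edward Norton"},
    "Spy Game": {"Robert Redford", "Brad Pitt"},
    "Some Other Movie": {"Actor X"},
}

def person_weights(user_ratings):
    """Sum the user's star ratings over every person credited in a rated movie."""
    weights = defaultdict(int)
    for title, stars in user_ratings.items():
        for person in CAST[title]:
            weights[person] += stars
    return weights

def score_unseen(user_ratings):
    """Score each unrated movie by the total weight of its cast."""
    weights = person_weights(user_ratings)
    return {
        title: sum(weights[person] for person in cast)
        for title, cast in CAST.items()
        if title not in user_ratings
    }

ratings = {"The Sting": 5, "Fight Club": 5}
scores = score_unseen(ratings)
best = max(scores, key=scores.get)
```

The movie that shares cast with both five-star titles gets the highest score, with no cleverness in the algorithm at all; the extra table does the work.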

--
My work here is dung.

Re:For the Sake of Discussion by Anonymous Coward · 2008-04-01 09:25 · Score: 1, Insightful

The critical failure of this example points out the flaw in the original premise:
You used an algorithm to sort out the "data" that you are using. The act of weighting and comparing the different properties of the data you have IS an algorithm. Period.

No algorithm can operate without data, and data is useless without an algorithm to DO something with it.

More data will give a better statistical reading of the data, and help eliminate 'bad data points', so in many cases more data can be better... but that depends on the algorithm used, the quantity of data, etc.

I would suggest the original person simply take a couple courses in computer science. There they will see classic examples of situations in which two algorithms are compared, and how in some situations one will excel in speed with limited data, and others will excel in speed with prolific data.
Re:For the Sake of Discussion by Plutonite · 2008-04-01 15:12 · Score: 1

You could also make an elaborate algorithm that uses user age, sex & location

That's just more data, IMHO, and nothing to do with the algorithm - you'd just be running the learner over more fields. What is a "better" algorithm? In formal terms, the "better" algorithm will classify with a higher accuracy during performance (the phase after the learner has "learnt") than another one using the same data and in a consistent manner (i.e. not for some particular sample).

I am only vaguely familiar with the netflix prize but I think you are asking a rather open-ended question here. Relevant data always improves classifiers, and some classifiers are better than others depending on the domain. Talking in relative terms isn't going to achieve much.
Re:For the Sake of Discussion by Plutonite · 2008-04-01 15:51 · Score: 1

I'd also like to say that the analogy with pagerank is a little off-base. I realize this is a Stanford professor, but trust me neither the machine learning people nor the information retrieval folks know what the other side is talking about at a deep level, mostly because IR is a hack-ish, "sciencified" topic (I'm quoting a very well-known man in the field) while statistical inference is a little more formal. They each have completely different goals and challenges, though they do overlap in places.

Simply put: weighting results from a queried index is one problem, trying to classify a movie rating from a training set is another. And the approaches of each regarding heuristics/hacks are similarly worlds apart.

Will more data usually perform better than a bette by sm62704 · 2008-04-01 07:32 · Score: 1

I would suggest that one go for both better algorithms AND more/better data.

--
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest

apparently... by spune · 2008-04-01 07:33 · Score: 1

...the algorithm wasn't 'better' enough.

The punch line by shewfig · 2008-04-01 07:34 · Score: 1

The last sentence of TFA sums up the non-usefulness of the result: "Of course, you have to be judicious in your choice of the data to add to your data set."

I refer you to the question of training Bayesian data sets for anti-spam: should you classify every single email, or only the ones that are "clearly" well-defined? Without a good algorithm to extract the search terms, the additional data just poisons the data sets, reducing the effectiveness of the filter.

See also any decent physiological study, in which "extraneous" factors are "corrected". Without enough data pruning, you have a correlation like the study that showed that losing weight, and keeping it off, reduces life expectancy. They didn't correct for the terminally ill, who lost weight as a result of their conditions. However, do too much pruning, and you have the controversial Harvard study, which reached the "common sense" conclusion almost at the expense of the data.

For more examples of massaging data using a bad algorithm, see studies that demonstrate a better TCO by going Microsoft.

In short, adding additional data is no guarantee of good results. The students clearly got lucky in finding a similar data set on a well-researched topic, based on an established taxonomy rather than a murky preference rating.

To augment or algorithm is the question? by flajann · 2008-04-01 07:35 · Score: 1

It really depends on a number of factors. I don't think anyone can make a general claim for one over the other. A smart algorithm can beat data augmentation in some cases. Of course, creating the algorithm is the crux of the matter, one that is harder to put a definition on.

So, the upshot is to look at both approaches and take the best course of action for your needs.

--
Ruby Neural Evolution of Augmenting Topologies

It depends on your definition of "better" by paulatz · 2008-04-01 07:36 · Score: 1

How do you define a "better" algorithm? Well, a better algorithm is an algorithm that works better in the field. It may seem obvious, but it is not at all. Usually it is not possible to test an algorithm deeply enough until its development is finished. On the other hand, you would rather not spend a lot of time developing an algorithm that is not good enough. Hence the quality of algorithms is often deduced from some indicators, like small test samples. Finally, as the general theory improves, the difference in performance between the top-ranking algorithms decreases, and may start to depend quite strongly on the subset of the total population to which they are actually applied. We cannot simply say that "given two algorithms, the best one is the one which performs better on all possible samples;" we should rather say "the best one is the one which performs better on most of the real world samples." You can clearly see how impractical this definition actually is; this is why finding a good ranking algorithm requires constant tuning, as they do at Google. A better algorithm may not be so much better, or may lack generality when tested in the real world. More data always helps.

--
this post contain no useful information, no need to mod it down

Isn't an algorithm just data? by tjstork · 2008-04-01 07:37 · Score: 1

I mean, if we balloon up to 10,000 feet, the problem really is, where do you put the extra data? Do you encode it in an algorithm, or do you have less code but more dynamic data. Given that POV, then, it stands to reason the best place to put the extra data is outside of the code, so that it is easier and less costly to modify.

--
This is my sig.

Re:attn computer scientists: stop renaming stuff by jank1887 · 2008-04-01 07:40 · Score: 1

ooohhh... can we start on computer engineers next ??

This is assuming... by jd · 2008-04-01 07:41 · Score: 2, Insightful

...that algorithms and data are, in fact, different animals. Algorithms are simply mapping functions, which can in turn be entirely represented as data. A true algorithm represents a set of statements which, when taken as a collective whole, will always be true. In other words, it's something that is generic, across-the-board. Think object-oriented design - you do not write one class for every variable. Pure data will contain a mix of the generic and the specific, with no trivial way to always identify which is which, or to what degree.

Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both.

The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal.

For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)

Re:This is assuming... by The_reformant · 2008-04-01 22:09 · Score: 1

...that algorithms and data are, in fact, different animals. Algorithms are simply mapping functions, which can in turn be entirely represented as data. A true algorithm represents a set of statements which, when taken as a collective whole, will always be true. In other words, it's something that is generic, across-the-board. Think object-oriented design - you do not write one class for every variable. Pure data will contain a mix of the generic and the specific, with no trivial way to always identify which is which, or to what degree. Thus, an algorithm-driven design should always out-perform data-driven designs when knowledge of the specific is substantially less important than knowledge of the generic. Data-driven designs should always out-perform algorithm-driven design when the reverse is true. A blend of the two designs (in order to isolate and identify the nature of the data) should outperform pure implementations following either design when you want to know a lot about both. The key to programming is not to have one "perfect" methodology but to have a wide range at your disposal. For those who prefer mantras, have the serenity to accept the invariants aren't going to change, the courage to recognize the methodology will, and the wisdom to apply the difference.
That's complete nonsense: the class of algorithms which are mapping functions, and which can be wholly represented by a finite data set, is a tiny proportion of the algorithm space.

--
I have discovered a truly remarkable sig which this post is too small to contain.

A bit like swap vs. real memory by etymxris · 2008-04-01 07:41 · Score: 2, Informative

A machine with swap enabled will always have more throughput than a machine without. It's a better use of the resources available. However, replace that swap space with the same amount of RAM, and of course that will be even better. Some use this as an argument against swap space, but it's not a fair comparison, since you can enable swap space in the RAM increased machine and increase throughput even more.

So when I think of this recommendation system, a better algorithm is like having swap space enabled. It's a more sophisticated use of the data you have. Having more data is like having more RAM. And of course the best option is to have more reference data and a better algorithm. It's not an exclusive disjunction, and it's silly to think it has to be.

Data has no comments by freejamesbrown · 2008-04-01 07:42 · Score: 1

In the long term, if gamed Data determines hidden features of an algorithm's output, that output will not be completely understood in case it needs to be analyzed.

I've seen this on several systems over the years where legacy programmers tweak the data just a bit to affect sort order, etc., and it leads to nightmares when you try to actually understand what's really happening in order to replace its functionality.

There's no hard rule but beware, Data has no comments, so you'll never completely understand all the actions of your algorithm.

Google Page Rank probably suffers from this.

Re:attn computer scientists: stop renaming stuff by Freeside1 · 2008-04-01 07:45 · Score: 5, Funny

Say what you want about computer scientists, but without them you'd probably be complaining on a chalkboard.

Re:attn computer scientists: stop renaming stuff by jank1887 · 2008-04-01 07:45 · Score: 4, Funny

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics - CliffsNotes edition.

Diminishing Returns by areReady · 2008-04-01 07:47 · Score: 1

It is obvious that both will help. Your first big chunk of augmenting data will help a lot, as will your first few algorithm adjustments. As you go forward, however, you will get smaller and smaller returns for each new tweak to the algorithm and each new set of data. It seems obvious after these results that the best course is BOTH.

Sounds reasonable. by Orig_Club_Soda · 2008-04-01 07:48 · Score: 1

Personally, I'm more likely to watch a movie based on genre, producer, director, writers, actors... Especially with plot specifics like era and technology.

Re:attn computer scientists: stop renaming stuff by etymxris · 2008-04-01 07:48 · Score: 1

And chemists are just doing heuristic physics, and biologists are just doing heuristic chemistry.

All of math reduces to logic and set theory but you don't see philosophers snubbing their noses at mathematicians. I agree that "computer science" is a misnomer in many ways, but "algorithm" in this case is not a misnomer. Yes, it's readily apparent that all algorithms can ultimately be represented mathematically, but that means no more than the reduction of math to logic.

Really? by edcheevy · 2008-04-01 07:48 · Score: 1

The more data you have, the more likely your results are going to be significant. I think we already knew this. ;)

Really though, it's the "design fix" vs the "statistics fix" (or the algorithm fix in this case) and a proper design always beats a crappy design with statistical band aids.

Re:attn computer scientists: stop renaming stuff by Sciros · 2008-04-01 07:51 · Score: 2, Informative

What noobery. You're confusing the "what" with the "how". Finding eigenvalues is part of a particular page rank algorithm. It's not THE page rank algorithm. Likewise, statistical inference is part of particular "machine learning" systems. It's not THE system. Using statistical inference alone will give you crude (albeit good, with enough training data) baselines to work from in some applications such as automatic text translation, but you'll need more than that to overcome issues like data sparseness, etc.

I know anonymous cowards like playing expert, but there's a reason why you're the butt of so many jokes here -- only thing you're usually expert in is misinformation and disingenuity.

--
I like basketball!!1!

Recommendations Systems and subjectivity by mlwmohawk · 2008-04-01 07:52 · Score: 3, Insightful

I have written two recommendations systems and have taken a crack at the Netflix prize (but have been hard pressed to make time for the serious work.)

The article is informative and generally correct, however, having done this sort of stuff on a few projects, I have some problems with the netflix data.

First, the data is bogus. The preferences are "aggregates" of rental behaviors; whole families are represented by single accounts. Little 16-year-old Tod likes different movies than his 40-year-old dad, not to mention his toddler sibling and mother. A single account may have Winnie the Pooh and Kill Bill. Obviously, you can't say that people who like Kill Bill tend to like Winnie the Pooh. (Unless of course there is a strange human behavioral factor being exposed by this; it could be that parents of young children want the thrill of vicarious killing, but I digress.)

The IMDB information about genre is interesting as it is possibly a good way to separate some of the aggregation.

Recommendation systems tend to like a lot of data, but not what you think. People will say, if you need more data, why just have 1-5 and not 1-10? Well, that really isn't much more added data it is just greater granularity of the same data. Think of it like "color depth" vs "resolution" on a video monitor.

My last point about recommendations is that people have moods and are not as predictable as we may wish. On an aggregate basis, a group of people is very predictable. A single person setting his/her preferences one night may have had a good day and a glass of wine, and the numbers are higher. The next day they could have had a crappy day and had to deal with it sober, and the numbers are different.
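
A quick simulation sketch of the mood point, under assumptions I am making up for illustration: a "true" opinion of 3.5 stars and independent Gaussian mood noise with a standard deviation of one star. Real moods are certainly not this tidy.

```python
import random
import statistics

random.seed(0)      # fixed seed so the run is repeatable
TRUE_SCORE = 3.5    # hypothetical "real" opinion of a movie
MOOD_NOISE = 1.0    # assumed standard deviation of day-to-day mood swings

def one_rating():
    """One rating from one person on one night: true score plus mood noise."""
    return random.gauss(TRUE_SCORE, MOOD_NOISE)

def spread_of_mean(group_size, trials=2000):
    """Standard deviation of the group-average rating across many trials."""
    means = []
    for _ in range(trials):
        ratings = [one_rating() for _ in range(group_size)]
        means.append(sum(ratings) / group_size)
    return statistics.pstdev(means)

single = spread_of_mean(1)    # how much one person's rating wanders
crowd = spread_of_mean(100)   # how much a 100-person average wanders
```

Under these assumptions the average of 100 people wanders far less than any one person does, which is why the aggregate looks predictable while the individual does not.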

You can't make a system that will accurately predict responses of a single specific individual at an arbitrary time. Let alone based on an aggregated data set. That's why I haven't put much stock in the Netflix prize. Maybe someone will win it, but I have my doubts. A million dollars is a lot of money, but there are enough vagaries in what qualifies as a success to make it a lottery or a sham.

That being said, the data is fun to work with!!

more data in which dimension? by Fuzuli · 2008-04-01 07:56 · Score: 1

The team with more data performed better, probably because their data allowed them to clearly differentiate between movies using a far more significant dimension than the given ratings-per-movie dimension.
The fundamental idea is to be able to identify clusters of movies, or users (who like a certain type of movie), and the idea of clusters is built on some form of distance. When you add a new dimension to your feature vector, you get a chance to identify groups of entities better, using that dimension. You may do worse as well; a new dimension may blur the lines between groups. Genres for movies look like a good label for identifying groups of movies. Trying to do the same with more complex methods, using only ratings, is harder.
More data does not necessarily mean you'll do better: it has to allow you to identify differences better, so it should either contain or add a dimension with "good" data. It seems team B went directly for generating a more relevant data set for the problem at hand.
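
A small sketch of what a new dimension does to distance. The movies, the ratings, and the 0/1 genre encoding are all invented for illustration, and plain Euclidean distance is just an assumed choice of metric.

```python
import math

# Hypothetical feature vectors: (average rating, genre flag),
# where the genre flag is 0.0 for drama and 1.0 for horror.
MOVIES = {
    "Drama A": (4.1, 0.0),
    "Drama B": (4.0, 0.0),
    "Horror A": (4.1, 1.0),
}

def distance(a, b, dims):
    """Euclidean distance between two vectors over the chosen dimensions."""
    return math.sqrt(sum((a[i] - b[i]) ** 2 for i in dims))

def nearest(title, dims):
    """Find the closest other movie using only the given dimensions."""
    others = (t for t in MOVIES if t != title)
    return min(others, key=lambda t: distance(MOVIES[title], MOVIES[t], dims))

by_rating_only = nearest("Drama A", dims=[0])     # ratings dimension only
with_genre = nearest("Drama A", dims=[0, 1])      # ratings plus genre
```

On ratings alone the horror film is the nearest neighbour of "Drama A" because the averages happen to match; once the genre dimension is included, the other drama is nearest. A badly chosen dimension could just as easily push the wrong movies together.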

Captain Obvious Says: by GameboyRMH · 2008-04-01 08:00 · Score: 1

The quality (accuracy) of the result is a function of how much data you put in and how you operate on it, but entering more data can yield a much greater improvement in the quality of the output than a better algorithm.

--
"When information is power, privacy is freedom" - Jah-Wren Ryel

Re:attn computer scientists: stop renaming stuff by JasonKChapman · 2008-04-01 08:02 · Score: 5, Funny

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics without tenure.

--
Sorry, I'm a writer. That makes you raw material.

Re:attn computer scientists: stop renaming stuff by agentultra · 2008-04-01 08:02 · Score: 1

sigh.

It sounds like you've got a hammer and look at everything as nails.

You might want to take a trip outside your ivory tower.

Synonyms happen to be a way of abstracting complexity out of the language used so that laypersons can understand, or at least talk about, the concepts and such that we "glorified engineers" use. It's really so the marketing guys have something to sell other than "eigenvalue calculation."

I suppose it's beneath you, but average people should be able to have a chance at grasping what we do, even if it's not in its most pure and exact form. It doesn't mean that we "engineers" are all ignorant of the actual mathematical terms. It just means we have to adopt language to deal with people who are involved with the product of our endeavors who may not understand what statistical inference means, but can at least grasp the idea by using the term "machine learning."

geez.

One Trivial Result, One Big Assumption by fygment · 2008-04-01 08:03 · Score: 3, Insightful

Two things. The first is that it is tritely obvious that adding more data improves your results. But there are two possible mechanisms at work. On the one hand, you can add more of the same data, i.e. just make your original database larger with more entries. That form of augmentation will hopefully give you more insight into the underlying distribution of the data. On the other hand, you can augment the existing data. In the latter you are really adding extra dimensions/features/attributes to the data set. That's what seems to be alluded to in the article, i.e. the students are adding extra features to the original data set. The success of the technique is a trivial result which depends very much on whether the features you add are discriminating or not. In this case, the IMDB presumably added discriminating features. However, if it had not, then "improved algorithms" would have had the upper hand.

The second thing about the claim seems to be that there is always additional information actually available. The comment is made that academia and business don't seem to appreciate the value of augmenting the data. That is false. In business additional data is often just not available (physically or for cost reasons). Consequently, improving your algorithms is all you can do. Similarly in academia (say a computer science department) the assumption is often that you are trying to improve your algorithms while assuming that you have all the data available.

--
"Consensus" in science is _always_ a political construct.

Re:attn computer scientists: stop renaming stuff by Fishbulb · 2008-04-01 08:04 · Score: 1

...and a physicist is nothing without alcohol.

Q.E.D., beer.

Depends on the problem. by v(*_*)vvvv · 2008-04-01 08:05 · Score: 1, Interesting

Would you rather know more or be smarter?

Knowledge is power, and the ultimate in information is the answer itself. If the answer is accessible, then by all means access it.

You cannot compare algorithms unless the initial conditions are the same, and this usually includes available information. In other words, algorithms make the most out of "what you have". If what you have can be expanded, then by all means you should expand it.

I wonder if accessing foreign web sites is legal in this competition though, because that definitely alters the complexion of the problem.

To say google succeeded by expanding their data pool is an oversimplification, because not only did they select what they felt was most important, they ignored what they felt was not. Intelligent selection took place to set their initial conditions for their algorithm. So it isn't just data augmentation. It is the ability to augment data relative to a goal, and this is much deeper than just "more data" vs "algorithm". In fact, you can also find situations where algorithms are used to make these intelligent selections, in which case the selection process can be as or more important than just the sheer volume of available data alone.

Re:attn computer scientists: stop renaming stuff by jank1887 · 2008-04-01 08:08 · Score: 1

since when do engineers not get tenure? and to boot, we get sizable research dollars.

Re:attn computer scientists: stop renaming stuff by Metasquares · 2008-04-01 08:09 · Score: 3, Insightful

And nonlinear dimensionality reduction is just nonconvex trace optimization coupled with kernel principal component analysis (fine, call it "singular value decomposition") using Mercer's theorem to map the resulting dot product through a kernel function (usually represented as a Hermitian positive semidefinite Gram matrix), yielding an inner product space of higher (possibly infinite) dimensionality in which the original problem is linearly separable.

Now take this description and write an algorithm that performs it efficiently. And you use PageRank as an example, so let's call "efficient" "performs as well as Google on the entire web's worth of data".

If you can't do this, perhaps you should reconsider your view of computer scientists. There's no reason whatsoever to play up the boundaries between two very related fields. Arbitrary boundaries in knowledge are already bad enough; they need to be knocked down, not reinforced.

Re:attn computer scientists: stop renaming stuff by raddan · 2008-04-01 08:11 · Score: 1

Mod parent flamebait. Right, and as a holder of a philosophy degree, I don't understand why you nitwit mathematicians can't get it through your thick skulls that "statistical inference" is just yet another flawed manifestation of the Cartesian dichotomy. See where I'm going with this?

Why all the hate, people? Different disciplines have different terminology. Sure, there are probably some mathematical generalizations for common computer science problems. And there are probably some CS generalizations for common accounting problems. But you know why actual traveling salesmen don't call their travels the Traveling Salesman problem? They don't fucking care, and for the most part, it doesn't matter to them.

Now that I'm a CS student, I can appreciate where my current field and where my former field overlap. In my book, nobody who puts their mind to work is the butt of anybody else's joke.

Augmented? by Anne+Thwacks · 2008-04-01 08:11 · Score: 1

When I was in college "augmented data" was a tactful way of saying "faked results"

--
Sent from my ASR33 using ASCII

Re:attn computer scientists: stop renaming stuff by Deepness+In+The+Sky · 2008-04-01 08:12 · Score: 1

So... if you have a degree in Mathematics and Computer Science, are you the butt of your own jokes?

Re:Ask Slashdot by apt142 · 2008-04-01 08:17 · Score: 1

Depends. It really depends on the specifics.

--
Star Pirates

Um, no. by emmons · 2008-04-01 08:18 · Score: 1

In a data mining context, an algorithm extracts, modifies or creates data from an existing data set.

Think of it this way.. algorithm is to verb as data is to noun.

--
Do you even know anything about perl? -- AC Replying to Tom Christiansen post.

This is a rule of algorithms by MobyDisk · 2008-04-01 08:22 · Score: 1

For every problem, there is an optimal solution (okay... there are many optimal solutions, depending on what you are trying to optimize for). If you want to do better than that algorithm, you must break the model. That means that you must either modify the inputs or modify the assumptions of the model. For example, the fastest way to sort arbitrary data that can only be compared pairwise takes O(n*log(n)) time. To do any better, you must break the model by making assumptions about the range and precision of the data. Then you can do it in O(n).
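To illustrate the "break the model" point, here is a minimal sketch of counting sort. It assumes the data are non-negative integers no larger than a known bound; the name counting_sort and the parameter max_value are made up for the example. Under that assumption the running time is O(n + k), where k is the size of the value range, which is linear in n when k is small relative to n.

```python
def counting_sort(values, max_value):
    """Sort non-negative integers in O(n + k) time, where k = max_value + 1.

    Assumption (not true of arbitrary data): every value is an integer
    in the range 0..max_value. No pairwise comparisons are performed.
    """
    counts = [0] * (max_value + 1)
    for v in values:
        counts[v] += 1          # tally how often each value occurs

    result = []
    for v, c in enumerate(counts):
        result.extend([v] * c)  # emit each value as many times as it was seen
    return result


if __name__ == "__main__":
    print(counting_sort([3, 1, 4, 1, 5, 9, 2, 6], max_value=9))
```

The speedup comes entirely from the extra assumption about the input range; drop that assumption and the comparison-based lower bound applies again.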

So for the data in netflix, there is an optimal algorithm. To do better, you must include additional data. This particular problem is interesting because it is nearly impossible to determine what the "optimal" algorithm is since it is based on psychological factors. However, the fact that they are seeking out smart people to figure this out indicates that we are probably pretty close to optimal, so maybe we need to start including more information and changing the model.

Making up for being slow - or being slow. by totierne · 2008-04-01 08:22 · Score: 1

I am always looking for more data, from new people, from different countries.
I think I am making up for my slow algorithm in my head, or maybe all this data is slowing me down.
Actually, the problem is making no decisions and having a cloud of maybes instead of deciding what rules I want to live by.

--
Be Free: Free Software Tuition

against the terms of the prize by deander2 · 2008-04-01 08:25 · Score: 1

yes this data is useful, but you can't use it in the contest:
http://www.netflix.com/TermsOfUse

see also:
http://www.netflixprize.com/community/viewtopic.php?id=98
http://www.netflixprize.com/community/viewtopic.php?id=20
http://www.netflixprize.com/community/viewtopic.php?id=14

note that this makes sense. more/better data would help ANY decent algorithm. they want a better one, and they're judging you on a baseline. so they'd naturally limit your input options.

--
http://kered.org

Re:attn computer scientists: stop renaming stuff by Hoi+Polloi · 2008-04-01 08:31 · Score: 1

What was that? Sorry, I was busy admiring my fat IT paycheck.

--
It is by the juice of the coffee bean that thoughts acquire speed, the teeth acquire stains. The stains become a warning

One answer: Kevin Bacon by recharged95 · 2008-04-01 08:33 · Score: 1

Now there's a simple algorithm that works. And beats even page rank.

The Best Data Wins by dj+e-rock · 2008-04-01 08:34 · Score: 1

I would say that a richer set of (relevant) data would generally generate a better result than an improvement of algorithm. Granted, different statistical models and algorithms do work better on certain kinds of data (there's almost an art to picking a good model).

But, as a past professor of mine was fond of saying, "the best data wins."

What about lambda calculus ? by S3D · 2008-04-01 08:39 · Score: 1

i know you computer scientists like playing mathematician, but there's a reason why you're the butt of mathematicians jokes. because you guys are nothing more than glorified engineers.

And category theory applied to functional programming ?

Algorithms help too by kabloom · 2008-04-01 08:41 · Score: 1

I've seen a great many cases where developing better algorithms caused better performance (and better algorithms rather than better data, in fact, account for the vast majority of Computer Science research papers out there), so certainly it can't only be better data. Additionally, what about the times when you need a better algorithm to take advantage of the additional data? Or what about when you combine the better algorithm with the better data?

This article is a completely false dichotomy.

Re:In case you haven't heard by sm62704 · 2008-04-01 08:49 · Score: 1

Secondly, your moronic link would only fool Slashdot moderators.

*Woosh*

--
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest

Re:attn computer scientists: stop renaming stuff by Arthur+B. · 2008-04-01 08:50 · Score: 5, Funny

"machine learning" is just statistical inference

Riiiht. And mathematical research is just finding a Hamiltonian cycle in a graph defined by the set of axioms used.

--
\u262D = \u5350

Re:attn computer scientists: stop renaming stuff by colinrichardday · 2008-04-01 08:51 · Score: 1

So if John Nash was a purposeless physicist, how did he get a Nobel Prize in economics?

*sigh* by sm62704 · 2008-04-01 08:57 · Score: 1

Humorless tards.

UnNews:April Fool's Day postponed to May
From Uncyclopedia, the content-free encyclopedia.
This article is part of UnNews, your source for up-to-the-microsecond misinformation.

1 April 2008

WASHINGTON, DC -- Congress has passed a bill officially postponing April Fool's Day, originally on April 1, to May 1. Additionally, pranks on the traditional date will be a federal and capital offense.

Naturally, pranksters and liars all over the United States are flabbergasted, shocked, and whining.

President George W. Bush said, "This is a national blasphemy to a major Western celebration, and I will veto this bill... APRIL FOOL!", attempting a poor April Fool's prank, and subsequently signing the bill into effect.

I. M. Luvinitt, of Kansas City, not in Kansas, says, "March 31, 2008 is a date which will live in infamy. Yesterday Congress attacked pranksters, liars, and mischief-causing brats -- a vital and necessary part of our society."

However, many are actually relieved. One anonymous citizen says, "I have one extra month to enjoy sleeping in without being woken up by loud noises, one more month to not hear about any fake products or events."

Another one says, "How does it matter? If you're looking for fake events to laugh about, there's always Uncyclopedia!"

Others, like Hu Ah-yu of Los Angeles, CA, are less concerned about the pranks: "What will become of the name? April Fool's Day is called as such because it's in April! What do we call it now? May Fool's Day? It's congealed! It's an absolutely hideous name!"

The Dow fell sharply at the news, however, in fears that consumer spending would further decrease, especially on the joke shop front. "We are looking at a potential 70% decrease in sales," says Gene Tornaparte, head of the Association of Pranksters for Roaring Insane Laughter (APRIL). "We think this may be rather bad for the shops, and even worse for the economy."

The exact reason the bill was passed is not clearly known as of yet. Speculation is already rising amongst the pundits, however.

"I think," says expert analyst Stephen Colbert, "that this is further evidence that respect for great Americans and great American traditions is in sharp decline. And you know what that means: bears will take over this glorious nation, and you won't exist!"

Less conservative pundits and organisations are maintaining less radical views. The Democratic National Committee released an official statement: "We are sad to hear that April Fool's Day is now actually in May. We see this as a sign of further incompetence of the current administration, and we believe that a Democratic President and a Democratic Congress will push the date back to April 1."

One child was so shocked that all he could calmly say was, "Is this an April Fool's joke?"

Mod this one troll too. ;P

--
mcgrew's razor: Never attribute to stupidity that which can be explained by greedy self-interest

unix philosophy wins again by fermatslittletheorem · 2008-04-01 09:01 · Score: 1

Robert Pike- "Data dominates. If you've chosen the right data structures and organized things well, the algorithms will almost always be self-evident. Data structures, not algorithms, are central to programming".

I think more data will always win ... by strings42 · 2008-04-01 09:03 · Score: 1

... although I'd think intuitively that some data sets could be well-represented by predictive algorithms as well. For a model of a general data-set however, I think a good analogy is that of a lossy compressed graphic. A good compression algorithm can certainly reconstruct a decent likeness of the original graphic, depending on the density of the original data and the algorithm used, maybe even a copy that would not be perceived as different by the human eye (or some other measure). But that lossy copy will never be quite as accurate as the original graphic, by definition. I guess another way to look at it would be that any predictive algorithm is, in some sense, statistical. While the statistics may be incredibly accurate, they can never predict the statistical anomalies. This seems pretty fundamentally intuitive to me, but I wouldn't even know where to begin going about proving it ... comments?

This does not mean what I think you think it means by aibob · 2008-04-01 09:09 · Score: 4, Informative

I am a graduate student in computer science, emphasizing the use of machine learning.

The sound bite conclusion of this blog post is that algorithms are a waste of time and that you are better off adding more training data.

The reality is that a lot of really smart people have been trying to come up with better algorithms for classification, clustering, and (yes) ranking for a very long time. Unless you are already familiar with the field, you really are unlikely to invent something new that will work better than what is already out there.

But that does not mean that the algorithm does not matter - for the problems I work on, using logistic regression or support vector machines outperforms naive bayes by 10% - 30%, which is huge. So if you want good performance, you try a few different algorithms to see what works.

Adding more training data does not always help either, if the distributions of the data are significantly different. You are much better off using the data to design better features which represent/summarize the data.

In other words, the algorithm is not unimportant, it just isn't the place your creative work is going to have the highest ROI.

Re:attn computer scientists: stop renaming stuff by egyptiankarim · 2008-04-01 09:18 · Score: 2, Insightful

Without mathematics, chemistry and physics would be boring. Without chemistry and physics engineering would be impossible. Without engineering, computer science would be useless. Without computer science, today's best designers would be bored. Without today's best designers, many questions of logic would go unpondered. Logic is rooted in mathematics.

I think we all need each other, folks :)

--
Eek!

Meaningless Question by avilliers · 2008-04-01 09:28 · Score: 1

The question itself is a little like asking a football coach whether a run play is better than a pass play. There's no objective answer; any coach who even answers the question is really expressing an opinion that one or the other is over-valued, rather than that one is just 'better' in all situations.

Same with data vs. algorithms.

One thing I haven't even seen mentioned--which surprises me--is that it's well known that more data will often make an algorithm perform worse. It's not that the data's bad, it simply produces spurious connections or obscures real ones that were apparent in the smaller set. The idea that more data always produces better results is as incorrect as the idea that more training, or more complexity, is better. There's a point not just of diminishing returns, but of negative ones.

Which is the other reason the question makes me scratch my head. The implicit assumption seems to be that the more 'sophisticated' algorithm is inherently 'better'. But algorithms, especially for this sort of problem, are good or bad based on their results; there's no abstract 'superiority' for an algorithm that makes it better for all problems. (See the 'No Free Lunch' theorem.)

As an empirical matter, I wouldn't doubt that computer science students are prone to approach every problem as thinking that if they just program more, they can solve everything. After all, downloading data is not sexy. So the instructor's post is a decent corrective to that. But trying to abstract some rule ('data is better than programming') is not helpful.

Re:attn computer scientists: stop renaming stuff by Anonymous Coward · 2008-04-01 09:41 · Score: 1, Insightful

Computer engineers are people who design your CPUs, motherboards, hard drives, etc. Computer engineering is a specialization within electrical engineering, even if people constantly abuse the term incorrectly.

If they're not engineers, then what are they?

Re:attn computer scientists: stop renaming stuff by russotto · 2008-04-01 09:49 · Score: 1

Right, and as a holder of a philosophy degree

I thought you guys all starved to death fighting over the silverware?

Does nobody know Shannon anymore? by eigengott · 2008-04-01 09:52 · Score: 2, Insightful

It's pretty simple: If you have random noise your algorithm can be as good as you want - you still get no useful information out of it. On the other hand, if the "more data" actually contains additional information, your entropy goes up and with a given algorithm you get better results. Bent to the extreme you just get the desired output as additional information and you can reduce your algorithm to just print it (should be O(1)).
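A rough sketch of the same idea in code, using empirical mutual information as the measure of "useful information". The function name and the tiny hand-built samples are assumptions for illustration only: a column that is independent of the target carries zero bits about it, while a column that is the answer itself carries everything.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits from paired samples."""
    n = len(xs)
    px = Counter(xs)             # marginal counts of X
    py = Counter(ys)             # marginal counts of Y
    pxy = Counter(zip(xs, ys))   # joint counts of (X, Y)
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


labels = [0, 1, 0, 1, 0, 1, 0, 1]   # what we want to predict
noise = [0, 0, 1, 1, 0, 0, 1, 1]    # independent of labels in this sample
answer = list(labels)               # the "extreme" case: the answer itself

if __name__ == "__main__":
    print(mutual_information(noise, labels))
    print(mutual_information(answer, labels))
```

With the noise column no algorithm can beat guessing; with the answer column the "algorithm" reduces to printing the input.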

Best Algorithm in the World? by mmyrfield · 2008-04-01 10:04 · Score: 1

It doesn't matter if you have the best algorithm in the world that can calculate how many times you're going to go pee a year and a day from now, you can't forget the first rule of equations:

Garbage in = Garbage out

Period. Therefore, to a point, better data will yield better results than a better algorithm and this is a very obvious result.

Re:attn computer scientists: stop renaming stuff by DavidShor · 2008-04-01 10:11 · Score: 1

Something tells me that philosophers could not understand a word of advanced Math logic. Of course, neither could most Mathematicians...

both? by dropbearsrus · 2008-04-01 10:31 · Score: 1

I'm pretty sure the best approach (if there are no other constraints) is to use both - more and better data, and the best algorithm, will give the best results won't it? If you have other goals - like getting the best result for a fixed amount of money or time - then you look at compromising.

Lack of multiple queues means lousy sample data by dcraid · 2008-04-01 10:52 · Score: 1

As a long time NetFlix user I suppose I have contributed to the sample. I order movies for myself, my wife, and my five year old daughter. Good luck trying to profile my buying activity.

According to Peter Norvig, by copdk4 · 2008-04-01 11:30 · Score: 1

director of Google Research, "a large corpus of data can be much more valuable than an efficient algorithm" - Unreliable reference

More data ? by nsebban · 2008-04-01 11:59 · Score: 1

I think it obviously depends on the quality, and above all the relevance of the data, to the problem you're trying to solve.

--
____
nico
Nico-Live

Re:attn computer scientists: stop renaming stuff by jank1887 · 2008-04-01 12:12 · Score: 1

now, you just take all that kumbaya B.S. and get back in the corner where you belong... we don't tolerate that kind of stuff in here.

Re:Depends on the Problem (It's The Wrong Question by CAOgdin · 2008-04-01 14:23 · Score: 1

The story ends with "Will more data usually perform better than a better algorithm?" All that does is expose the ignorance of the experimenter, and that of the writer. It is a well-known and easily demonstrated principle that more data and an algorithm are interchangeable; you can always do it either way. In "Software Physics," it was called the "space/time" tradeoff (i.e., the "space" in memory occupied by the data, the "time" occupied by the process).

Imagine that you need to frequently obtain the square-root of an integer. If your integers range from 0..100 (with results from 0..10), all you need is a 101-entry lookup table with the values to whatever precision you require. However, the table grows larger when you have millions of potential values (actually, potentially infinitely large). This is a "data intensive" solution (with a negligible process...the table lookup is, after all, a process); call it 99% data/1% process.

It is also possible to create a program using any of the various time-tested algorithms which, when given a value, will compute the square-root. This is a "process intensive" solution (with a negligible amount of data...typically some constants); call it 99% process/1% data.

Now, between the extremes of 1/99 and 99/1, there is an infinite variety of solutions, each one customized and tailored to the specific needs of the application, each with its own data/process trade-offs. To claim that one is inherently superior to the other is an exercise in futility (or ignorance, depending on your viewpoint). The correct answer is, "It depends..."

What's totally unspoken in this report is the PROCESS involved in identifying and gathering the additional "data." It just exposes the original work (and the original reporter of that work here) as performed by incompetents who have nothing constructive to say on the subject. They might do some research in the field before drawing sweeping conclusions with no basis in fact.
(And, a demonstration is NOT a proof.)
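For what it's worth, the two ends of that square-root tradeoff might look something like the sketch below. The names SQRT_TABLE, sqrt_lookup and sqrt_newton are invented for the example, and Newton's method is just one of the "time-tested algorithms" one could pick; building the table with math.sqrt is a shortcut for brevity.

```python
import math

# Data-intensive end: precompute every answer for 0..100 (101 entries).
SQRT_TABLE = [math.sqrt(n) for n in range(101)]


def sqrt_lookup(n):
    """Almost all data, almost no process: a single table index."""
    return SQRT_TABLE[n]


def sqrt_newton(n, tolerance=1e-12):
    """Process-intensive end: Newton's iteration, only a few constants stored."""
    if n == 0:
        return 0.0
    x = float(n)                      # initial guess
    while True:
        nxt = 0.5 * (x + n / x)       # Newton step for f(x) = x*x - n
        if abs(nxt - x) <= tolerance * nxt:
            return nxt
        x = nxt


if __name__ == "__main__":
    for n in (0, 2, 49, 100):
        print(n, sqrt_lookup(n), sqrt_newton(n))
```

The lookup version stops working the moment an input falls outside 0..100, while the Newton version handles any non-negative value at the cost of a few iterations per call; anything in between (say, a coarse table used to seed the iteration) is a different point on the same tradeoff.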

Re:attn computer scientists: stop renaming stuff by Metasquares · 2008-04-01 14:45 · Score: 1

It goes beyond just needing each other. We're fundamentally doing the same thing, just focusing on different applications of it.

Re:attn computer scientists: stop renaming stuff by Anonymous Coward · 2008-04-01 15:41 · Score: 1, Funny

Nothing compared to physicists. When was the last time you guys got a multibillion dollar facility to discover (if you're lucky) a few other particles nobody gives a rat's ass about?

Re:attn computer scientists: stop renaming stuff by mapleneckblues · 2008-04-01 15:46 · Score: 1

but we just learned a few days back that beer kills research ability. Sigh, slashdot is so confusing these days...

Hang on a minute! by dsmatthews · 2008-04-01 17:53 · Score: 1

So if I leverage the results of another algorithm so that I can make mine simpler that is now seen as adding data?

Re:attn computer scientists: stop renaming stuff by mollymoo · 2008-04-01 21:26 · Score: 2, Funny

i know you computer scientists like playing mathematician, but there's a reason why you're the butt of mathematicians jokes. because you guys are nothing more than glorified engineers.

Adapted from a joke I saw on Jester the other day:

A physicist, a computer scientist and a mathematician are sharing a hotel room. It must have bad wiring or something.

Late at night when they're all asleep a small fire starts in the room. The smell of smoke wakes the physicist. He gets up, notices the fire and looking round the room, sees a bucket and a sink. He calculates how much water will be required, fills the bucket with precisely that much, douses the flames and goes back to bed.

A little later, another small fire starts. This time the smell of smoke wakes the computer scientist. He wakes up and sees the flames. He looks around and sees the bucket and the sink. He reasons that calculating the quantity of water required would take at least as long as filling the bucket, so he fills it right up, douses the flames and goes back to bed.

Again there is a fire. This time the mathematician smells the smoke and wakes up. He sees the flames, sees the bucket and the sink. He exclaims "there is a solution!" and goes back to bed.

--
Chernobyl 'not a wildlife haven' - BBC News

Anchoring effect! by Jeppe+Salvesen · 2008-04-01 23:54 · Score: 1

A scale of five or ten should not make too much of a difference. The difficult part (according to a Wired article) is figuring out the anchoring effect. If you've seen a lot of good movies lately, something mediocre will rate 2 stars, but if you've seen a lot of bad movies lately (ditch that significant other!) then a mediocre movie will more likely receive a three-star rating from you.

--

Stop the brainwash

Hello? duh!! more data yields better algorithms by CodeShark · 2008-04-02 01:33 · Score: 1

Most of the data shows that Newtonian Physics really explains much of the physical universe really well. So if we leave out Einstein's experiments we can usually get along just peachy. But include Einstein's rules in your algorithms and calculations and they will ALWAYS be superior to simple Newtonian physics in those areas where "more data" proves the calculations, and the calculations themselves yield more data.

So using an old saw --which comes first, the chicken or the egg? Or is there a superior question, how did chickens come to exist in the evolutionary chain? One question unanswerable, or a series of data driven questions that might eventually yield a definitive answer...

--
...Open Source isn't the only answer -- but it's almost always a better value than the alternatives...

Told You So by Phat_Tony · 2008-04-02 02:41 · Score: 1

And I didn't even get modded up back when I said that.

--
Can anyone tell me how to set my sig on Slashdot?

Um, yes? by tjstork · 2008-04-02 03:30 · Score: 1

In a data mining context, an algorithm extracts, modifies or creates data from an existing data set.

Ultimately, everything is data.... a Turing machine doesn't know "algorithms"....it's a state machine and it's all data to it. So, yeah, although we like to pretend that code and data are separate things, the very bedrock of theory that computer science sits on says that it's all the same, and ultimately, when we choose a data driven architecture or an algorithmically heavy one, we're really just choosing where to make the investment of codifying information.

verb vs noun

both are just parts of a grammar, which, overall is just data. look at a very simple language for a text adventure game below. commas indic

noun -> (torch|gold|sword|goblin|door)
verb -> (take|drop|attack)
direction -> (north|south|east|west)
move -> go direction
action -> verb noun
S -> (move|action)

It's all data...we merely invent noun and verb to classify things. But we could just as easily have written

inventoriable -> torch|gold|sword
inventorying -> take|drop
inventory -> inventorying inventoriable

having no noun and verb at all in our language... in fact, we could theoretically write a text adventure engine with a grammar and a few primitives to run a state machine in the background that describes how things are related...it would be all data, essentially.
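A minimal sketch of what "the grammar is just data" could look like, assuming a toy two-word command format. The table names TERMINALS and RULES and the helper functions are hypothetical; the word lists come from the first grammar above.

```python
# The grammar lives entirely in these two tables; the engine below
# knows nothing about nouns, verbs, or directions.
TERMINALS = {
    "noun": {"torch", "gold", "sword", "goblin", "door"},
    "verb": {"take", "drop", "attack"},
    "direction": {"north", "south", "east", "west"},
}
RULES = {
    "move": ("go", "direction"),   # move -> go direction
    "action": ("verb", "noun"),    # action -> verb noun
}


def matches(symbol, word):
    """A symbol is either a category name from TERMINALS or a literal word."""
    if symbol in TERMINALS:
        return word in TERMINALS[symbol]
    return symbol == word


def parse(command):
    """Return (rule_name, *words) for the first rule that fits, else None."""
    words = command.lower().split()
    for name, symbols in RULES.items():
        if len(words) == len(symbols) and all(
            matches(s, w) for s, w in zip(symbols, words)
        ):
            return (name, *words)
    return None


if __name__ == "__main__":
    print(parse("go north"))
    print(parse("take torch"))
    print(parse("eat gold"))
```

Swapping in the inventoriable/inventorying version of the grammar would mean editing only the two tables; parse and matches stay untouched.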

all I'm saying is, that, sometimes, to bring it back to a data mining context, it might make sense to think about the system as (data being mined + algorithm) as part of a larger soup of information and then assign to each depending on one's preferences... giving up that proven interchangeability because it is presently good practice seems awful risky...

--
This is my sig.

Re:attn computer scientists: stop renaming stuff by Grizzlysmit · 2008-04-03 05:26 · Score: 1

Mathematics is physics without purpose, Chemistry is physics without thought, Engineering is physics - CliffsNotes edition.

Mathematics is a form of poetry, all applications are just icing on the cake.

Or to put it another way the purpose of Mathematics is Mathematics, and the fact that you can solve real world problems with it is just a lucky break, but is quite beside the point.

--
in my life God comes first.... but Linux is pretty high after that :-D
Francis Smit

Follow-up post by anand_rajaraman · 2008-04-03 07:20 · Score: 1

There's now a follow-up post (http://anand.typepad.com/datawocky/2008/04/data-versus-alg.html) addressing some of the comments in this thread.

Slashdot Mirror
