Programming Collective Intelligence
Joe Kauzlarich writes "In 2006, the on-line movie rental store Netflix proposed a $1 million prize to whoever
could write a movie recommendation algorithm that offered a ten
percent improvement over their own. As of this writing, the
intriguingly-named Gravity
and Dinosaurs team holds first place by a slim margin of 0.07
percent over BellKor,
their algorithm an 8.82 percent
improvement on the Netflix benchmark. So, the question remains,
how do they write these so-called recommendation algorithms? A new
O'Reilly book gives us a thorough introduction to the basics of this
and similar lucrative sciences." Keep reading for the rest of Joe's review.
Programming Collective Intelligence
author: Toby Segaran
pages: 334
publisher: O'Reilly Media Inc.
rating: 9/10
reviewer: Joe Kauzlarich
ISBN: 9780596529321
summary: Introduction to data mining algorithms and techniques
Among the chief ideological mandates of the Church of Web 2.0 is that
users need not click around to locate information when that
information can be brought to the users. This is achieved by
leveraging 'collective intelligence,' that is, in terms of
recommendation systems, by computationally analyzing statistical
patterns of past users to make as-accurate-as-possible guesses about
the desires of present users. Amazon, Google and certainly many other
organizations, in addition to Netflix, have successfully edged out
more traditional competitors on this basis, the latter failing to pay
attention to the shopping patterns of users and forcing customers to
locate products in a trial-and-error manner as they would in, say, a
Costco. As a further illustration, if I go to the movie shelf at Best
Buy, and look under 'R' for Rambo, no one's going to come up to
me and say that the Die Hard Trilogy now has a special-edition
release on DVD and is on sale. I'd have to accidentally pass the 'D'
section and be looking in that direction in order to notice it. Amazon
would immediately tell me, without bothering to mention that Gone
With The Wind has a new special edition.
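To make the idea concrete, here is a minimal sketch of recommending from the ratings of past users. Everything in it is an assumption made for illustration: the user names, the ratings, and the choice of a distance-based similarity score. It is not a description of what Netflix or Amazon actually run.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale: user -> {movie: score}.
ratings = {
    "alice": {"Rambo": 5, "Die Hard": 5, "Gone With The Wind": 1},
    "bob":   {"Rambo": 4, "Die Hard": 4},
    "carol": {"Rambo": 1, "Die Hard": 2, "Gone With The Wind": 5},
    "me":    {"Rambo": 5},
}


def similarity(a, b):
    """Score in (0, 1] from the distance over shared movies; 0.0 if none are shared."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = sqrt(sum((a[m] - b[m]) ** 2 for m in shared))
    return 1.0 / (1.0 + distance)


def recommend(user, ratings):
    """Rank movies the user has not rated by similarity-weighted average score."""
    weighted_totals = {}
    similarity_totals = {}
    for other, their_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(ratings[user], their_ratings)
        if sim <= 0.0:
            continue
        for movie, score in their_ratings.items():
            if movie in ratings[user]:
                continue  # only suggest movies the user has not rated yet
            weighted_totals[movie] = weighted_totals.get(movie, 0.0) + sim * score
            similarity_totals[movie] = similarity_totals.get(movie, 0.0) + sim
    ranked = [(movie, weighted_totals[movie] / similarity_totals[movie])
              for movie in weighted_totals]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for movie, score in recommend("me", ratings):
        print(f"{movie}: {score:.2f}")
```

With this made-up data, a user who has only rated Rambo highly gets Die Hard ranked well ahead of Gone With The Wind, because the users who agree with him about Rambo also liked Die Hard.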
Programming Collective Intelligence is far more than a guide to building recommendation systems. Author Toby Segaran is not a commercial product vendor, but a director of software development for a computational biology firm, doing data-mining and algorithm design (so apparently there is more to these 'algorithms' than just their usefulness in recommending movies?). Segaran takes us on a friendly and detailed tour through the field's tool chest, covering the following topics in some depth:
Recommendation Systems
Discovering Groups
Searching and Ranking
Document Filtering
Decision Trees
Price Models
Genetic Programming
... and a lot more
As you can see, the subject matter stretches into the higher levels of mathematics and academia, but Segaran successfully keeps the book intelligible to most software developers, and the examples are written in the easy-to-follow Python language. Further chapters cover more advanced topics, like optimization techniques, and many of the more complex algorithms are deferred to the appendix.
The third chapter of the book, 'Discovering Groups,' deserves some explanation and may enlighten you as to how the book may be of some use in day-to-day software designs. Suppose you have two sets of data that are related by a 'JOIN', such as customers and the movies they browse. Certain customers may spend more time browsing certain subsets of movies. 'Discovering Groups' refers to the computational process of recognizing these patterns and sectioning data into groups. In terms of music or movies, these groups would represent genres. The marketing team may thus become aware that jazz enthusiasts buy more music at sale prices than do listeners of contemporary rock, or that listeners of late-'60s jazz also listen to '70s prog, or similar such trends.
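As a rough illustration of sectioning data into groups, here is a small k-means sketch. The customers, their weekly browsing hours per genre, and the shortcut of seeding the centroids with the first k customers are all assumptions made for this example; a real implementation would usually pick the starting centroids at random and would not be limited to two genres.

```python
# Hypothetical data: customer -> (hours browsing jazz, hours browsing rock) per week.
browsing = {
    "ann": (9.0, 1.0),
    "ben": (8.0, 2.0),
    "cat": (1.0, 9.0),
    "dan": (2.0, 8.0),
    "eve": (7.5, 1.5),
}


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise average of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def kmeans(data, k, rounds=10):
    """Group the keys of `data` into k clusters of nearby points.

    Simplification for this sketch: the first k entries seed the centroids,
    which keeps the result deterministic.
    """
    names = list(data)
    centroids = [data[name] for name in names[:k]]
    groups = []
    for _ in range(rounds):
        # Assignment step: each customer joins the nearest centroid.
        groups = [[] for _ in range(k)]
        for name in names:
            nearest = min(range(k),
                          key=lambda i: squared_distance(data[name], centroids[i]))
            groups[nearest].append(name)
        # Update step: move each centroid to the mean of its members.
        # An empty group keeps its previous centroid.
        centroids = [mean([data[n] for n in group]) if group else centroids[i]
                     for i, group in enumerate(groups)]
    return groups


if __name__ == "__main__":
    for group in kmeans(browsing, k=2):
        print(group)
```

On this toy data the two groups that emerge are the jazz browsers (ann, ben, eve) and the rock browsers (cat, dan), which is the kind of genre split the chapter's techniques are meant to uncover without being told the genres in advance.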
Certainly the applications of the tools that Programming Collective Intelligence provides are broader than my imagination can handle. Insurance companies, airlines and banks are all part of massive industries that rely on precise knowledge of consumer trends and can certainly make use of the data-mining knowledge introduced in this book.
I have no major complaints about the book, particularly because it fills a gap in popular knowledge with no precursor of which I'm aware. Presentation-wise, even though Python is easy to read, pseudo-code is more timeless and even easier to read. You can't cut & paste from a paper book into a Python interpreter anyway. It might have been more appropriate to use pseudo-code in print and keep the example code on the website (I'm sure it's there anyway).
If you ever find yourself browsing or referencing your algorithms text from college or even seriously studying algorithms for fun or profit, then I would highly recommend this book depending on your background in mathematics and computer science. That is, if you have a strong background in the academic study of related research, then you might look elsewhere, but this book, certainly suitable as an undergraduate text, is probably the best one for relative beginners that is going to be available for a long time.
You can purchase Programming Collective Intelligence from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.
So, the question remains, how do they write these so-called recommendation algorithms?
For now I'm more interested to know how they quantify these improvements.
You just got troll'd!
But the teams that are good continue to refine their algorithms and do better and better. The top teams continue to be at the top over the life of the competition. Also, you can't compare this to the stock market. If company A is doing well now, there is no guarantee that they will still be doing well in 2 or 3 years. However, if you liked a movie, you will probably always like the movie. Sure tastes change, but a lot less than the stock market.
Anthropic principle: We see the universe the way it is because if it were different we would not be here to see it.
Of course people who still decide to rate The Wedding Singer seven stars can throw the whole thing off, like on iTunes where *no* album scores under a four or a five. But that's the problem isn't it, humans are entering these things. Not only do differences in taste have to be considered, but also differences in how people view the rating scale, what their current mood while entering the information is, etc.
Perhaps more effective data can be mined from people's purchasing choices, since we know that what people say and do are often not the same. I think that's why I like Amazon's "most people who viewed this item ended up purchasing:" and then it lists the three most popular options. Their recommendations are fairly solid, if redundant, overall.
Anyway, it's hard to do anything correctly with a large number of average humans.
Think of it more like marketing, because that's exactly what it is. They are basically showing you billboards of other movies you may have an interest in. This algorithm decides which billboards are to be shown to you. Now, if the algorithm is 0.1 percent better at deciding which billboards to show you, does that really matter to you as an individual? Not at all. Does it matter to Netflix across a userbase of millions of people? Absolutely. Hence this contest.
Silly. What they are doing is smart. The grad school can compete and win the money if it chooses. In the event the University or the greedy code geeks fail to produce, it costs Netflix nothing. With your thinking, it costs them money whether results are produced or not. I guess that is why you do not run Netflix :)
I think that is the point - academia has been studying this for decades and has yet to produce meaningful results. I'm not saying that universities haven't contributed their fair share of technological advances through the years, but doing so in a practical and timely manner isn't exactly what they're known for. When business and/or money gets thrown into the mix, the pace of progress tends to rapidly accelerate.
X Prize Foundation
Millennium Problems
2008 Templeton Prize
Netflix could have saved a boatload of money by throwing some cash at a university with an established AI group and asking them to research the current state-of-the-art
According to the Netflix site there are currently 35558 contestants on 29326 teams from 170 different countries. They could have thrown any amount of money at any university and still not received the kind of effort they've seen to date. I'd say their million dollars is money well spent.
Seriously. It's a trend to create websites with more dynamic and shared content. That's it. No church, no ideology, no 2.0.
This is an attempt to bring out new solutions.
Eivind.
Doubting the existence of evolution is like doubting the existence of China: It just shows that you're uninformed.
Actually, I don't think they care whether you like the movie or not... I think the point is to maximize the movies out to subscribers and minimize the movies stored in a warehouse. If I have 1,000 movies in inventory and only 100 are "active", I have 900 movies taking up space. I also have customers who are waiting on one of the 100 movies to become available so they can watch it. If I recommend to you one of the 900, you get to watch a movie while waiting for one of the 100 popular titles, which means you aren't sitting there complaining about how long it takes to get a movie from Netflix. Of course, if you like the obscure movie that was recommended, you'll be more likely to take a chance on the next obscure movie that gets recommended, thus my 900 movies are in circulation, keeping people from hating my service and coincidentally not taking up space in my warehouse.
Layne