Programming Collective Intelligence
Joe Kauzlarich writes "In 2006, the on-line movie rental store Netflix proposed a $1 million prize to whoever
could write a movie recommendation algorithm that offered a ten
percent improvement over their own. As of this writing, the
intriguingly-named Gravity
and Dinosaurs team holds first place by a slim margin of 0.07
percent over BellKor,
their algorithm an 8.82 percent
improvement on the Netflix benchmark. So, the question remains,
how do they write these so-called recommendation algorithms? A new
O'Reilly book gives us a thorough introduction to the basics of this
and similar lucrative sciences." Keep reading for the rest of Joe's review.
Programming Collective Intelligence
author: Toby Segaran
pages: 334
publisher: O'Reilly Media Inc.
rating: 9/10
reviewer: Joe Kauzlarich
ISBN: 9780596529321
summary: Introduction to data mining algorithms and techniques
Among the chief ideological mandates of the Church of Web 2.0 is that
users need not click around to locate information when that
information can be brought to the users. This is achieved by
leveraging 'collective intelligence,' that is, in terms of
recommendation systems, by computationally analyzing statistical
patterns of past users to make as-accurate-as-possible guesses about
the desires of present users. Amazon, Google and certainly many other
organizations, in addition to Netflix, have successfully edged out
more traditional competitors on this basis, the latter failing to pay
attention to the shopping patterns of users and forcing customers to
locate products in a trial and error manner as they would in, say, a
Costco. As a further illustration, if I go to the movie shelf at Best
Buy, and look under 'R' for Rambo, no one's going to come up to
me and say that the Die Hard Trilogy now has a special-edition
release on DVD and is on sale. I'd have to accidentally pass the 'D'
section and be looking in that direction in order to notice it. Amazon
would immediately tell me, without bothering to mention that Gone
With The Wind has a new special edition.
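To make "analyzing statistical patterns of past users" concrete, here is a minimal sketch in Python of one simple approach: weight other users' ratings by how closely their tastes match mine. The users, ratings, and function names are all made up for illustration, and the distance-based similarity is just one of several possible choices; nothing here is claimed to be what Amazon or Netflix actually runs.

```python
from math import sqrt

# Hypothetical star ratings (1-5); the users and scores are invented.
ratings = {
    "alice": {"Rambo": 5, "Die Hard": 4, "Gone With The Wind": 1},
    "bob":   {"Rambo": 4, "Die Hard": 5, "Predator": 5},
    "carol": {"Rambo": 1, "Gone With The Wind": 5, "Casablanca": 4},
    "me":    {"Rambo": 5, "Predator": 4, "Casablanca": 2},
}

def similarity(a, b):
    """Return a score in (0, 1]: 1 means identical ratings on shared movies."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = sqrt(sum((a[m] - b[m]) ** 2 for m in shared))
    return 1.0 / (1.0 + distance)

def recommend(user, ratings):
    """Rank movies the user has not rated by similarity-weighted average score."""
    mine = ratings[user]
    totals, weights = {}, {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = similarity(mine, theirs)
        if sim <= 0:
            continue
        for movie, score in theirs.items():
            if movie in mine:
                continue  # skip movies the user has already rated
            totals[movie] = totals.get(movie, 0.0) + sim * score
            weights[movie] = weights.get(movie, 0.0) + sim
    # Highest predicted score first
    return sorted(((totals[m] / weights[m], m) for m in totals), reverse=True)

recs = recommend("me", ratings)
for score, movie in recs:
    print(f"{movie}: {score:.2f}")
```

With this toy data, the Rambo fan is steered toward Die Hard well ahead of Gone With The Wind, because the users who agree with me about Rambo also rated Die Hard highly.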
Programming Collective Intelligence is far more than a guide to building recommendation systems. Author Toby Segaran is not a commercial product vendor, but a director of software development for a computational biology firm, doing data-mining and algorithm design (so apparently there is more to these 'algorithms' than just their usefulness in recommending movies?). Segaran takes us on a friendly and detailed tour through the field's toolchest, covering the following topics in some depth:
Recommendation Systems
Discovering Groups
Searching and Ranking
Document Filtering
Decision Trees
Price Models
Genetic Programming
... and a lot more
As you can see, the subject matter stretches into the higher levels of mathematics and academia, but Segaran successfully keeps the book intelligible to most software developers, and the examples are written in the easy-to-follow Python language. Further chapters cover more advanced topics, like optimization techniques, and many of the more complex algorithms are deferred to the appendix.
The third chapter of the book, 'Discovering Groups,' deserves some explanation and may show how the book can be of use in day-to-day software design. Suppose you have two sets of data related by a 'JOIN', such as customers and the movies they browse. Certain customers may spend more time browsing certain subsets of movies. 'Discovering Groups' refers to the computational process of recognizing these patterns and partitioning the data into groups. In terms of music or movies, these groups would represent genres. The marketing team may thus become aware that jazz enthusiasts buy more music at sale prices than do listeners of contemporary rock, or that listeners of late-'60s jazz also listen to '70s prog, or similar trends.
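As a rough illustration of what discovering groups means in code, here is a minimal k-means sketch in Python. K-means is one common clustering technique, picked here for brevity rather than because the chapter is known to use it in this form. The customer names, browsing hours, and function names are hypothetical, and the initialization (the first k points) is deliberately naive so the result is deterministic.

```python
# Hypothetical data: hours each customer spent browsing (jazz, rock).
customers = {
    "ann": (9, 1),
    "ben": (1, 8),
    "cat": (8, 2),
    "dan": (2, 9),
    "eve": (7, 1),
}

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20):
    """Assign each point to one of k groups; returns a list of group indices."""
    centroids = [list(p) for p in points[:k]]  # naive, deterministic start
    assignment = [None] * len(points)
    for _ in range(iterations):
        # Step 1: attach every point to its nearest centroid
        new_assignment = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the groups are stable
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

names = list(customers)
groups = kmeans([customers[n] for n in names], k=2)
by_name = dict(zip(names, groups))
print(by_name)
```

On this toy data the jazz browsers (ann, cat, eve) land in one group and the rock browsers (ben, dan) in the other, without anyone telling the program what a genre is.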
Certainly the applications of the tools that Programming Collective Intelligence provides are broader than my imagination can handle. Insurance companies, airlines and banks are all part of massive industries that rely on precise knowledge of consumer trends and can certainly make use of the data-mining knowledge introduced in this book.
I have no major complaints about the book, particularly because it fills a gap in popular knowledge with no precursor of which I'm aware. Presentation-wise, even though Python is easy to read, pseudo-code is more timeless and even easier to read. You can't cut & paste from a paper book into a Python interpreter anyway. It may have been more appropriate to use pseudo-code in print and keep the example code on the website (I'm sure it's there anyway).
If you ever find yourself browsing or referencing your algorithms text from college or even seriously studying algorithms for fun or profit, then I would highly recommend this book depending on your background in mathematics and computer science. That is, if you have a strong background in the academic study of related research, then you might look elsewhere, but this book, certainly suitable as an undergraduate text, is probably the best one for relative beginners that is going to be available for a long time.
So, the question remains, how do they write these so-called recommendation algorithms?
For now I'm more interested to know how they quantify these improvements.
You just got troll'd!
I was initially intrigued by recommendation algorithms. Sadly, it's easy to get them up to a certain point and then almost impossible to make them any better. At least for movies. Netflix rates almost everything between 2.5 and 4 stars. Movies it rates 1 or 2 stars, I wouldn't have considered watching anyway. It never rates anything 5 stars. And for things between 3 and 4 stars, I seem as likely to really like a 3-star rated item as I am to not really like a 4-star rated item. So why is Netflix paying a million bucks to change that 3 to a 3.1 or 2.9?
It's turtles all the way down!
The numbers in the summary don't match up with the numbers on Netflix's leaderboard:
BellKor: 9.08%
Gravity/Dinosaurs: 8.82%
BigChaos: 8.80%
How are they defining this 10% improvement? How do they judge it? And how can they get it down to things like 0.07%? There have to be user test groups involved, and I can't believe they're that objective. A 10% increase in rentals, in click-throughs, in user agreement that the recommendations are helpful? What?
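For what it's worth, the Netflix Prize scored entries by root-mean-square error (RMSE) between predicted and actual star ratings on held-out data, and the percentages are reductions in that error relative to Netflix's own system, so no user test groups are involved. A minimal sketch in Python, with made-up ratings and hypothetical function names:

```python
from math import sqrt

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def percent_improvement(baseline_rmse, candidate_rmse):
    """Percentage by which the candidate reduces the baseline's error."""
    return 100.0 * (baseline_rmse - candidate_rmse) / baseline_rmse

# Invented numbers for illustration only; not real Netflix data.
actual    = [4, 3, 5, 2, 1]            # ratings users really gave
baseline  = [3, 3, 4, 3, 2]            # what the existing system predicted
candidate = [3.5, 3, 4.5, 2.5, 1.5]    # what a contestant predicted

base_err = rmse(baseline, actual)
cand_err = rmse(candidate, actual)
improvement = percent_improvement(base_err, cand_err)
print(f"baseline RMSE {base_err:.4f}, candidate RMSE {cand_err:.4f}, "
      f"improvement {improvement:.1f}%")
```

Because the score is a plain arithmetic comparison against ratings the contestants never see, differences of a few hundredths of a percent can be measured, though whether they matter to a viewer is a separate question.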
There are now 35535 entries in the Netflix competition. If they all used roughly the same algorithm, with some randomness in the tuning variables, we'd expect to see results about like what we've seen. I think we're looking at noise here.
The same phenomenon shows up with mutual funds. Some outperform the market, some don't, but prior year results are not good predictors of future results.
I was at the Borders and was looking for something to pass the weekend, and I'd been doing some sound effects library work, so I took a look at this.
It has a lot of statistics; it's essentially a statistics-in-use book, with code examples in Python of all of the algorithms. That said, it makes all of the topics very accessible, and proposes many different ways of solving different wisdom-of-crowds type problems, and gives you enough knowledge so you'd be able to hear someone pitch you their dataset, and you'd be able to say "Oh, you wanna do full-text relevance ranking" or "You need a decision tree for that" or "you just want the correlation." The book very much has a sort of statistics-as-swiss-army-knife approach.
Also, I'm not Pythonic, but I was able to translate all of the algorithms into Ruby as I went, even turning the list comprehensions into the Rubyish block/yield equivalents, so his style is not too idiomatic.
Don't blame me, I voted for Baltar.
A million dollars? This is what happens when business people dabble in science. Artificial Intelligence grad students and professors have been studying these kinds of problems for decades. Netflix could have saved a boatload of money by throwing some cash at a university with an established AI group and asking them to research the current state-of-the-art. The only reason to put up that kind of money is to generate publicity, and I'm not really sure that worked.
The problem is where you put your algorithm. If you wait until customers are paying for their items (as at Amazon), you can add suggestions in the shopping cart: "people who bought this book also bought this book," or "we have a sale: two books, one of which you already have, plus this one, for less"...
This can only be done with a shopping-cart style, whereas Netflix has to wait for customers to select a movie before it can recommend anything. Seriously, they should partner up with Amazon:
"People who rented this movie from Netflix also bought this book from..." lol!
If I had to choose whether to bet my million bucks on some cushy grant-wallowing researchers or some hungry self-motivated code geeks, I'd pick the latter.
A work that expires before its copyright never enters the public domain and thus enjoys eternal copyright protection.
When was this written? According to the leaderboard, http://www.netflixprize.com//leaderboard BellKor is leading by 0.26 and has been leading for several months.
Seriously. It's a trend to create websites with more dynamic and shared content. That's it. No church, no ideology, no 2.0.
I've read this book, and let me say I found it to be a superb introduction to the topic. It teaches you different methods applicable to a lot of different situations. In fact, after reading it, I decided to build my own social news site based on user recommendation. However, I had to research a lot into the field before coming up with a good and fast algorithm. That's the only flaw I found in the book: all the algorithms are poorly implemented (although this may be for the sake of clarity).
I came across this book browsing through Safari Books Online's titles, and was almost halfway through the book before I was able to get hold of an actual copy. While the main focus of the book is on data mining (definitely not only recommendation algorithms, it also shows how Google's PageRank algorithm works, how to mine user data from Facebook and write matching algorithms etc.) it provides a good introduction to pattern recognition in general. It shows you how to write a simple neural network in Python, how to write a Bayes classifier for spam filtering, and even touches on Support Vector Machines (SVMs). What I really love about the book is that everything is explained by means of code examples, with the actual math theory in an appendix for those of us more mathematically inclined. You can literally sit with the book next to the computer and reproduce the code as you go along.
The Netflix competition, in principle, is an example of an interesting class of prediction algorithms. There is a lot of good work in academia in this area, and on the face of it one might be surprised that no one has beaten Netflix yet.
Unfortunately Netflix restricts the data that can be applied to prediction. You have to use their data which includes only movie title and genre. A much better job could be done if something like the Internet Movie Database were fused with the title selection information. This would allow the algorithm to predict based on actors, directors and detailed genre. For example, I see all movies directed by John Woo. Given that I've seen all of his movies, it's not hard to predict that I'm going to see his next movie.
Could you not just add an extra box on the rating section that asks for the customer's mood? Say a box that says rate this film 1-5 stars. Below that, a drop-down with the most common moods: happy, sad, angry, annoyed. It seems to me a big factor in how you rate a film is your current mood. If you're in a good mood you're more likely to be forgiving of a film; in a bad mood you're going to be critical. This extra information might help you determine the accuracy of a given rating. I'm sure a study could help determine just how much a given mood can affect a rating, +/- so many points. Seems to make sense to me, but what the hell do I know?
see: http://developers.slashdot.org/article.pl?sid=08/04/01/189230
"A teacher is offering empirical evidence that when you're mining data, augmenting data is better than a better algorithm. He explains that he had teams in his class enter the Netflix challenge, and two teams went two different ways. One team used a better algorithm while the other harvested augmenting data on movies from the Internet Movie Database. And this team, which used a simpler algorithm, did much better -- nearly as well as the best algorithm on the boards for the $1 million challenge. The teacher relates this back to Google's page ranking algorithm and presents a pretty convincing argument. What do you think? Will more data usually perform better than a better algorithm?"
She loves me: 09F911029D74E35BD84156C5635688C0 She loves me not: 09F911029D74E35BD84156C5635688BF
All I know about these recommendation algorithms is that they're a bit crazy. I have had The L Word recommended because I liked Alias, 24, and Roswell.
Of course maybe The L Word is about lesbian alien spies with super powers. Huh. I'm gonna go check it out.
I have also read Collective Intelligence. I think I enjoyed it significantly more than the Slashdot reviewer. Here is my review:
~~~~
Have you ever wondered how:
* Google comes up with its search results
* Amazon recommends you books/movies/music
* spam filters decide good from bad
Well, Toby Segaran not only explains these topics and more in Collective Intelligence, but he does so in a way accessible to software developers that haven't worked on machine-learning problems before. He even provides working Python code for all the algorithms.
Oh, and Collective Intelligence reads incredibly well. I could not wait to get home and get back to it -- and when I went in to work the next morning, I usually had a new idea or two of how to improve our software. I also started implementing the most important examples in Groovy to make sure I got it.
If you are a Senior Software Engineer or "better," this is a must-read. Proper application of the algorithms in this book is a great way to simplify your system and avoid getting nickel-and-dimed to death with new ways to prioritize/categorize/slice-and-dice your domain data.