Programming Collective Intelligence

← Back to Stories (view on slashdot.org)

Programming Collective Intelligence

Posted by samzenpus on Wednesday April 16, 2008 @04:32AM from the the-movies-you-want-to-watch dept.

Joe Kauzlarich writes "In 2006, the on-line movie rental store Netflix proposed a $1 million prize to whomever could write a movie recommendation algorithm that offered a ten percent improvement over their own. As of this writing, the intriguingly-named Gravity and Dinosaurs team holds first place by a slim margin of .07 percent over BellKor, their algorithm an 8.82 percent improvement on the Netflix benchmark. So, the question remains, how do they write these so-called recommendation algorithms? A new O'Reilly book gives us a thorough introduction to the basics of this and similar lucrative sciences." Keep reading for the rest of Joe's review. Programming Collective Intelligence author Toby Segaran pages 334 publisher O'Reilly Media Inc. rating 9/10 reviewer Joe Kauzlarich ISBN 9780596529321 summary Introduction to data mining algorithms and techniques Among the chief ideological mandates of the Church of Web 2.0 is that users need not click around to locate information when that information can be brought to the users. This is achieved by leveraging 'collective intelligence,' that is, in terms of recommendations systems, by computationally analyzing statistical patterns of past users to make as-accurate-as-possible guesses about the desires of present users. Amazon, Google and certainly many other organizations, in addition to Netflix, have successfully edged out more traditional competitors on this basis, the latter failing to pay attention to the shopping patterns of users and forcing customers to locate products in a trial and error manner as they would in, say, a Costco. As a further illustration, if I go to the movie shelf at Best Buy, and look under 'R' for Rambo, no one's going to come up to me and say that the Die Hard Trilogy now has a special-edition release on DVD and is on sale. I'd have to accidentally pass the 'D' section and be looking in that direction in order to notice it. Amazon would immediately tell me, without bothering to mention that Gone With The Wind has a new special edition.

Programming Collective Intelligence is far more than a guide to building recommendation systems. Author Toby Segaran is not a commercial product vendor, but a director of software development for a computational biology firm, doing data-mining and algorithm design (so apparently there is more to these 'algorithms' than just their usefulness in recommending movies?). Segaran takes us on a friendly and detailed tour through the field's toolchest, covering the following topics in some depth:
Recommendation Systems
Discovering Groups
Searching and Ranking
Document Filtering
Decision Trees
Price Models
Genetic Programming
... and a lot more

As you can see, the subject matter stretches into the higher levels of mathematics and academia, but Segaran successfully keeps the book intelligible to most software developers and examples are written in the easy-to-follow Python language. Further chapters cover more advanced topics, like optimization techniques and many of the more complex algorithms are deferred to the appendix.

The third chapter of the book, 'Discovering Groups,' deserves some explanation and may enlighten you as to how the book may be of some use in day-to-day software designs. Suppose you have a collection of data that is interrelated by a 'JOIN' in two sets of data. For example, certain customers may spend more time browsing certain subsets of movies. 'Discovering Groups' refers to the computational process of recognizing these patterns and sectioning data into groups. In terms of music or movies, these groups would represent genres. The marketing team may thus become aware that jazz enthusiasts buy more music at sale prices than do listeners of contemporary rock, or that listeners of late-60's jazz also listen to 70's prog, or similar such trends.

Certainly the applications of such tools as Programming Collective Intelligence provides us are broader than my imagination can handle. Insurance companies, airlines and banks are all part of massive industries that rely on precise knowledge of consumer trends and can certainly make use of the data-mining knowledge introduced in this book.

I have no major complaints about the book, particularly because it fills a gap in popular knowledge with no precursor of which I'm aware. Presentation-wise, even though Python is easy to read, pseudo-code is more timeless and even easier to read. You can't cut & paste from a paper book into a Python interpreter anyway. It may 've been more appropriate to use pseudo-code in print and keep the example code on the website (I'm sure it's there anyway).

If you ever find yourself browsing or referencing your algorithms text from college or even seriously studying algorithms for fun or profit, then I would highly recommend this book depending on your background in mathematics and computer science. That is, if you have a strong background in the academic study of related research, then you might look elsewhere, but this book, certainly suitable as an undergraduate text, is probably the best one for relative beginners that is going to be available for a long time.

You can purchase Programming Collective Intelligence from amazon.com. Slashdot welcomes readers' book reviews -- to see your own review here, read the book review guidelines, then visit the submission page.

9 of 74 comments (clear)

Min score:

Reason:

Sort:

Numbers? by drquoz · 2008-04-16 04:51 · Score: 2, Informative

The numbers in the summary don't match up with the numbers on Netflix's leaderboard:

BellKor: 9.08%
Gravity/Dinosaurs: 8.82%
BigChaos: 8.80%
I bought this book by iluvcapra · 2008-04-16 05:03 · Score: 4, Informative

I was at the Borders and was looking for something to pass the weekend, and I'd been doing some sound effects library work, so I took a look at this.

It has a lot of statistics; it's essentially a statistics-in-use book , with code examples in Python of all of the algorithms. That said, it makes all of the topics very accessible, and proposes many different ways of solving different wisdom-of-crowds type problems, and gives you enough knowledge so you'd be able to hear someone pitch you their dataset, and you'd be able to say "Oh, you wanna do full-text relevance ranking" or "You need decision tree for that" or "you just want the correlation." The book very much has a sort of statistics-as-swiss-army-knife approach.

Also, I'm not Pythonic, but I was able to translate all of the algorithms into Ruby as I went, even turning the list comprehensions into the Rubyish block/yield equivalents, so his style is not too idiomatic.

--
Don't blame me, I voted for Baltar.
1. Re:I bought this book by StarfishOne · 2008-04-16 06:58 · Score: 2, Informative
  
  Very nice summary! I own the book and I must say that it's very nice and accessible.
  
  The examples are practical and described quite well, even if ones math skills are not that great.
  
  And the example in Python are almost looking pseudo-code like, even if one has little to no Python skills, the language is not a huge barrier.
  
  5 stars out of 5!
  
  The reviews at amazing are also quite quite good:
  
  http://www.amazon.com/review/product/0596529325/ref=pd_bbs_sr_1_cm_cr_acr_txt?_encoding=UTF8&showViewpoints=1
  
  23 ratings at this moment, 20x5 stars, 1x4 star, 1x3 star.
Re:How is it quantified by Otter · 2008-04-16 05:07 · Score: 3, Informative

Let's say I have a dataset where 1000 people have each reviewed 20 movies. If I give you a set with five reviews blanked out for each person, how accurately can you predict them from the other 15?

--
What I'm listening to now on Pandora...
Re:How is it quantified by robizzle · 2008-04-16 05:13 · Score: 5, Informative

Which improvements? The Netflix competition?

They basically have a large dataset consisting of User, Movie, Rating. Of this set, they split it into two data sets. In the smaller subset they removed the ratings and didn't release these to the public. They didn't modify the larger subset at all. They had cinematch make predictions on the smaller subset (without having been told the real predictions) and use this as the baseline. Next, people that compete in the competition make predictions on the missing data and improvements can be calculated. They calculate the percent improvement as 100 * [Submission's Error] / [Cinematch's Error]

There are a number of ways to calculate the error but for the Netflix competition they use MASE (Mean Average Squared Error). Basically you take the sum of the squared difference between what was predicted and what the real rating was then divide it by the number of ratings.

Detailed information can be found on the Netflix Prize rules page and there are a number of good posts on the forums as well.
Re:Who Cares About 0.1 Stars Difference? by Jynx77 · 2008-04-16 06:20 · Score: 2, Informative

I think they are paying 50K a year out to the top team. Not sure if that's got a time limit on it. I guess the pub is good.

--
It's turtles all the way down!
Re:With 35535 entrants, this may just be noise by glyph42 · 2008-04-16 06:45 · Score: 2, Informative

You should read the competition rules. The test set is so enormous that you would need 2^something_huge entries to see the results we've seen based on randomness. I did a back-of-the-envelope calculation at the beginning of the competition to see if a random search would be feasible to win the prize, and it's not. Not in a million years. Literally.

--
Music speeds up when you yawn, but does not change pitch.
Good introduction to pattern recognition by Gendor · 2008-04-16 07:32 · Score: 2, Informative

I came across this book browsing through Safari Books Online's titles, and was almost halfway through the book before I was able to get hold of an actual copy. While the main focus of the book is on data mining (definitely not only recommendation algorithms, it also shows how Google's PageRank algorithm works, how to mine user data from Facebook and write matching algorithms etc.) it provides a good introduction to pattern recognition in general. It shows you how to write a simple neural network in Python, how to write a Bayes classifier for spam filtering, and even touches on Support Vector Machines (SVMs). What I really love about the book is that everything is explained by means of code examples, with the actual math theory in an appendix for those of us more mathematically inclined. You can literally sit with the book next to the computer and reproduce the code as you go along.
Re:How is it quantified by Gorobei · 2008-04-16 13:45 · Score: 3, Informative

For now I'm more interested to know how they quantify these improvements.

Quantification is fun field in itself, and by no means trivial. As other posters have noted, there are many leave-n-out approaches: basically, divide the dataset into a training set and a test set, and rank by how accurately the code predicts the test set given the training set.

These types of tests are good in that they are easy to understand by the judges and participants. The problem, of course, is that over repeated trials, information about the test set leaks out in the scoring, and the participants slowly overfit their algorithms to the test set based on scoring feedback (in the extreme case, there is no training data, only test data - the winning algorithms are just maps of matched test inputs to correct outputs.)

Even if you manage to ameliorate this problem (e.g by requiring submission of a function that will be applied to an unknown training set to produce a set of predictions,) there is still the risk that the high scoring functions are not very useful (e.g. predicting someones rating of "The Matrix" is easy and has a low RMS error, but do you even care about error from most peoples rating of "Mr Lucky," most have never heard of it?)

So, to be really useful, you want your rating (objective) function to be weighted by usefulness from the point of view of your business (e.g. yes, everyone like the current blockbuster, but will John Q Random be happy geting "Bringing Up Baby" instead?) Here, "happy" is defined as maximizing profits for the firm :)

So, you often a prize with a simple (but wrong) objective function. Then offer the winners a chance a real money if they work on the actual hard problems the firm is facing (this is what we do on Wall St, anyway ;)