How Google Is Solving Its Book Problem

Posted by samzenpus on Wednesday November 3, 2010 @11:57PM from the beyond-the-card-catalog dept.

Pickens writes "Alexis Madrigal writes in The Atlantic that Google's famous PageRank algorithm can't be deployed to search through the 15 million books that Google has already scanned because books don't link to each other in the way that webpages do. Instead, Google's new book search algorithm, called 'Rich Results,' looks at word frequency, how closely your query matches the title of a book, web search frequency, recent book sales, the number of libraries that hold the title, how often an older book has been reprinted, and 100 other signals. 'There is less data about books than web pages, but there is more structure to it, and there's less spam to contend with,' writes Madrigal. Yet the focus on optimizing an experience from vast amounts of data remains. 'You want it to have the standard Google quality as much as possible,' says Matthew Gray, lead software engineer for Google Books. '[You want it to be] a merger of relevance and utility based on all these things.'"
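
For readers wondering what merging signals like these might look like in practice, here is a minimal sketch of a weighted-sum ranker. Everything in it is an assumption for illustration: the `BookSignals` fields, the weights, and the idea that each signal is pre-normalized to the range 0 to 1 are all hypothetical. The article does not say how Google actually combines its signals.

```python
from dataclasses import dataclass


@dataclass
class BookSignals:
    """Hypothetical per-book signals, each assumed pre-normalized to [0, 1]."""
    title: str
    word_frequency: float        # how often the query terms appear in the text
    title_match: float           # how closely the query matches the title
    web_search_frequency: float  # how often the book is searched for on the web
    recent_sales: float          # recent book sales
    library_holdings: float      # number of libraries that hold the title
    reprint_count: float         # how often an older book has been reprinted


# Illustrative weights only; these are invented and sum to 1.0.
WEIGHTS = {
    "word_frequency": 0.25,
    "title_match": 0.30,
    "web_search_frequency": 0.15,
    "recent_sales": 0.10,
    "library_holdings": 0.10,
    "reprint_count": 0.10,
}


def score(book: BookSignals) -> float:
    """Combine the signals into one relevance score via a weighted sum."""
    return sum(weight * getattr(book, name) for name, weight in WEIGHTS.items())


def rank(books: list[BookSignals]) -> list[BookSignals]:
    """Return the books ordered from highest to lowest score."""
    return sorted(books, key=score, reverse=True)


if __name__ == "__main__":
    candidates = [
        BookSignals("Obscure Pamphlet", 0.9, 0.2, 0.1, 0.0, 0.1, 0.0),
        BookSignals("Well-Known Classic", 0.6, 0.9, 0.8, 0.5, 0.9, 0.9),
    ]
    for book in rank(candidates):
        print(f"{score(book):.3f}  {book.title}")
```

A weighted sum is only the simplest way to merge signals; a production system would more likely learn the combination from data, but the sketch shows why structured signals such as library holdings and reprints can stand in for the link graph that books lack.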

3 of 58 comments

Why can't the text of these books be clearer? by bogaboga · 2010-11-04 00:25 · Score: 2, Interesting

I have always wondered why the text in these books is not clear. The blurry fonts make my eyes hurt, and surely Google can create a better interface for the main page. Just 1 million dollars could do so much if an expert were hired to revamp the site. Come on, Google!
1. Re:Why can't the text of these books be clearer? by ortholattice · 2010-11-04 01:44 · Score: 4, Interesting
  
  As someone studying certain specialized math books from the 1800s and early 1900s, I had great expectations for Google Books, since they offer downloadable PDFs for public domain works. However, the focus quality of many (most?) of them is so incredibly poor that things like tiny subscripts are illegible blobs, making them essentially useless.
  While plain text solves this problem for novels, it is useless for math books, because OCR renders the equations (which are the essence of the book) as garbage characters. And it's not clear how one would communicate them as plain text anyway, unless the OCR were extremely sophisticated and generated, say, LaTeX output.
  Thankfully, some of the ones I need are in the University of Michigan Historical Mathematics Collection, at much higher quality. But for the ones that are not there, I've used the Google PDF as a last resort; at least I can get an overview, if a somewhat unpleasant one to read. But for books I actually want to study, when Google's is the only copy I can find online, I've ended up making my own scan from a library copy (which, if done with care, is better quality than even the U Mich. version).
  However, scanning physically stresses these old books. I think it is sad that I have to repeat what Google has done, when they (presumably) could have scanned them with high quality with a little more effort or better equipment with automatic focusing. In some cases, the books have been in the rare book section of the university library, which can't be checked out, and making copies of the whole book locally is frowned upon because of possible damage and sometimes, depending on the book's condition, not allowed.
Google Book Metadata by Anonymous Coward · 2010-11-04 00:26 · Score: 1, Interesting

I'm not sure Google can correlate the kinds of data they are talking about because their book metadata (author, title, edition, etc.) is so inaccurate. I often find books in Google Books through a text search that can't be located by author or title searches.