How Google Is Solving Its Book Problem

Posted by samzenpus on Wednesday November 3, 2010 @11:57PM from the beyond-the-card-catalog dept.

Pickens writes "Alexis Madrigal writes in the Atlantic that Google's famous PageRank algorithm can't be deployed to search through the 15 million books that Google has already scanned because books don't link to each other in the way that webpages do. Instead Google's new book search algorithm called 'Rich Results' looks at word frequency, how closely your query matches the title of a book, web search frequency, recent book sales, the number of libraries that hold the title, how often an older book has been reprinted, and 100 other signals. 'There is less data about books than web pages, but there is more structure to it, and there's less spam to contend with,' writes Madrigal. Yet the focus on optimizing an experience from vast amounts of data remains. 'You want it to have the standard Google quality as much as possible,' says Matthew Gray, lead software engineer for Google Books. '[You want it to be] a merger of relevance and utility based on all these things.'"
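As a rough illustration of what combining signals like these could look like, here is a minimal sketch of a weighted-sum ranker. The signal names follow the summary above; the weights, the assumption that each signal is pre-normalized to the range 0 to 1, and the `score_book` and `rank` helpers are all hypothetical and are not Google's actual formula.

```python
# Hypothetical weights: the article names the signals but not how they are combined.
WEIGHTS = {
    "word_frequency": 0.30,
    "title_match": 0.30,
    "web_search_frequency": 0.15,
    "recent_sales": 0.10,
    "library_holdings": 0.10,
    "reprint_count": 0.05,
}


def score_book(signals):
    """Weighted sum of signals, each assumed to be pre-normalized to [0, 1]."""
    return sum(WEIGHTS[name] * signals.get(name, 0.0) for name in WEIGHTS)


def rank(books):
    """Return book titles sorted by descending score."""
    return sorted(books, key=lambda title: score_book(books[title]), reverse=True)


# Made-up example data, purely for illustration.
books = {
    "Exact title match, rarely held": {
        "title_match": 1.0,
        "word_frequency": 0.4,
        "library_holdings": 0.1,
    },
    "Loose match, widely held": {
        "title_match": 0.2,
        "word_frequency": 0.6,
        "library_holdings": 0.9,
        "reprint_count": 0.8,
    },
}

print(rank(books))
```

A real system with around a hundred signals would presumably learn the weights from data rather than hand-pick them; the point here is only that books are scored from structured metadata rather than from a link graph.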

58 comments

Rainbows End by Toe, The · 2010-11-03 23:59 · Score: 2, Funny

But do they really have to shred all the books just to scan them?
1. Re:Rainbows End by ikkonoishi · 2010-11-04 00:32 · Score: 0
  
  Does it matter? It's not like these are one-of-a-kind Tomes of Utter Significance. Besides, once scanned, they can be reprinted if needed.
2. Re:Rainbows End by Anonymous Coward · 2010-11-04 00:49 · Score: 1, Informative
  
  No. Speculation on Google's process based on a patent filing.
  I seem to recall an article that was more than speculation, but I couldn't find it while searching. The 2003 entry for the Google books history also points toward it being a non-destructive process.
3. Re:Rainbows End by Samantha Wright · 2010-11-04 01:06 · Score: 4, Informative
  
  Wait! I'm undoing all my mod points because I just realised that no, you're quite wrong. The printing process wouldn't be the same for the older books, and some of them have survived hundreds of years before we came along and scanned them.
  
  However, the story about books being cut up for scanning was about microfilm. I think it was an institution in Texas whose library was cutting them up mentioned as an aside in a submission about how they were converting their library into a lounge and computer lab.
  
  --
  Bio questions? Ask me to start a Q&A journal. Computer analogies available for most topics!
4. Re:Rainbows End by delinear · 2010-11-04 01:38 · Score: 5, Insightful
  
  I would guess (read, hope) that while the process means books which are commonly available might be handled in the quick yet destructive manner, books which are more rare or have historic significance beyond the data would be treated much more carefully (at the lower end of the scale, someone with a hand scanner maybe, at the upper end perhaps even people manually transcribing). Ultimately, though, while I think it's a crime for a book to be destroyed, if it's a choice between it mouldering away in a basement somewhere until it falls apart or Google destroying it early in the interest of preserving the data, surely it's better that the ideas rather than the physical object are preserved (I appreciate in reality it's not just a black and white either/or choice).
5. Re:Rainbows End by bughunter · 2010-11-04 01:42 · Score: 1
  
  Slashdot doesn't have a "+1 Obscure" moderation, probably because nothing is obscure on /., so I'm just gonna drop you a shout and friend you.
  
  --
  I can see the fnords!
6. Re:Rainbows End by Merpy · 2010-11-04 01:49 · Score: 5, Informative
  
  Google doesn't destroy the books, they've got a patent on "unbending" the pages. http://news.cnet.com/8301-11386_3-10232931-76.html
7. Re:Rainbows End by Mathinker · 2010-11-04 02:20 · Score: 1
  
  Hm, it doesn't have a "+1 Thanks for the conundrum" moderation, either. Oh, well. :-)
8. Re:Rainbows End by perryizgr8 · 2010-11-04 03:30 · Score: 1
  
  Actually, in Rainbows End one of the protagonists mentions that he prefers Google's non-destructive scanning over the new shredding method.
  So it's not Google who are shredding books. They are the ones saving the books.
  
  --
  Wealth is the gift that keeps on giving.
9. Re:Rainbows End by bughunter · 2010-11-04 04:03 · Score: 1
  
  Google greater scooch-a-mout ...
  
  --
  I can see the fnords!
10. Re:Rainbows End by ElizabethGreene · 2010-11-04 04:27 · Score: 2, Informative
  
  But do they really have to shred all the books just to scan them?
  No. A book scanning machine is capable of scanning a book non-destructively. My unsubstantiated guess is that they are less harmful to the book than your average reader.
  You can build one if you'd like. Instructable The automated page turners on the commercial models are awesome. Youtube video
11. Re:Rainbows End by Mathinker · 2010-11-04 07:58 · Score: 1
  
  That was a pretty quick spoiler. I guess you young'uns just don't understand how it is when yer brain gets all fuzzy and slow.
  (I didn't even notice the comment title. I am embarrassed to admit that I actually remember that book being one of the last dead-tree science fiction books I bought, but I have never gotten around to reading it, and am not sure if I could even still find it, because my life suddenly mutated in unexpected ways quite soon after the purchase. A good excuse, I suppose, to go looking for it now.)
How does one write ... by gphilip · 2010-11-04 00:02 · Score: 1

in the Atlantic?
1. Re:How does one write ... by MrHanky · 2010-11-04 00:22 · Score: 2, Funny
  
  With a pen.
2. Re:How does one write ... by Anonymous Coward · 2010-11-04 00:41 · Score: 1, Informative
  
  With a fountain pen.
3. Re:How does one write ... by JustOK · 2010-11-04 01:11 · Score: 4, Funny
  
  there's a whole branch of science that studies writing and drawing while in an ocean, it's called oceanographics
  
  --
  rewriting history since 2109
4. Re:How does one write ... by goombah99 · 2010-11-04 01:52 · Score: 1
  
  scratch it on an iceberg.
  
  --
  Some drink at the fountain of knowledge. Others just gargle.
5. Re:How does one write ... by goombah99 · 2010-11-04 02:55 · Score: 1
  
  in the Atlantic?
  using underwater paper like this.
  
  --
  Some drink at the fountain of knowledge. Others just gargle.
6. Re:How does one write ... by Anonymous Coward · 2010-11-04 03:54 · Score: 0
  
  In the navy!
7. Re:How does one write ... by bughunter · 2010-11-04 04:15 · Score: 1
  
  With a watered-down representation of a niche, minority, or extreme viewpoint, apparently.
  
  --
  I can see the fnords!
VSM by elsurexiste · 2010-11-04 00:20 · Score: 1

For a second I thought they were merely using VSM: http://en.wikipedia.org/wiki/Vector_space_model . As I read further, I was happily proven wrong. :)

--
I rarely respond to comments. Also, don't ask for clarifications: a brain and Google are faster, believe me!
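For readers who haven't met the vector space model mentioned above, here is a minimal sketch: each text becomes a term-frequency vector, and a query is matched against documents by cosine similarity. The tokenization (lowercase, whitespace split) and the helper names `tf_vector` and `cosine` are simplifying assumptions for illustration; real systems typically add weighting such as TF-IDF.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace-split)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = tf_vector("huckleberry finn")
titles = ["adventures of huckleberry finn", "tom sawyer abroad"]

# Print each title with its similarity to the query.
for title in titles:
    print(title, round(cosine(query, tf_vector(title)), 3))
```

Plain VSM only sees the words on the page, which is why the extra signals described in the story (sales, library holdings, reprints) go beyond it.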
1. Re:VSM by mcgrew · 2010-11-04 00:35 · Score: 1
  
  Their book search has improved greatly since they started. A year ago I was looking for Huckleberry Finn and the first result was amazon.com. This was annoying, as that book is in the public domain.
  They seem to have fixed it. The first result now is Wikipedia, the second a study guide, and the third is the book itself hosted at the University of Virginia.
  Shopping is at the bottom of the page. I'm pleased!
  
  --
  Free Martian Whores!
2. Re:VSM by delinear · 2010-11-04 01:42 · Score: 3, Insightful
  
  I suspect this is as much to do with the uptake in ebook readers as any change to the search indexing. Previously, if you were searching for this book you probably had a very specific interest in it and often wanted to buy a copy, now the people searching are more likely looking for free reading material, so the ranks have adjusted to accommodate that (since "people looking for free stuff" is a much wider market than "people with interest in a particular book", so it's easy to swing the ranking in favour of the former).
Scientific books by sourcerror · 2010-11-04 00:22 · Score: 1

I think it should work well for scientific monographs, as they contain a lot of references to each other but don't usually get reprinted. [citation needed]
1. Re:Scientific books by jank1887 · 2010-11-04 00:30 · Score: 2, Informative
  
  they already do that via Google Scholar. Scientific paper searches often (maybe not often enough) bring up textbook references. I know searching through regular Google does quite frequently.
Why can't the text of these books be clearer? by bogaboga · 2010-11-04 00:25 · Score: 2, Interesting

I have always wondered why the text in these books is not clear. The blurry fonts make my eyes hurt and surely, Google can create a better interface for the main page. Just 1 million dollars can do so much if some expert were hired to revamp the site. Come on Google!
1. Re:Why can't the text of these books be clearer? by AdmiralXyz · 2010-11-04 00:37 · Score: 4, Informative
  
  It's because the book-scanning process is completely automated. I can't find a link to it, but I remember a Slashdot or Wired article about Google's automatic book-scanning machine. Basically it's too difficult to adjust for perfect focus for every book.
  
  I wouldn't worry about it though: Google is doing OCR on all these books, and they'll presumably replace the images with plain-text equivalents at some point (more searchable, portable, etc.) That's my hope, anyway.
  
  --
  Dislike the Electoral College? Lobby your state to join the National Popular Vote Interstate Compact.
2. Re:Why can't the text of these books be clearer? by inode_buddha · 2010-11-04 00:42 · Score: 1
  
  I would indeed like that, but it'll be interesting to see how they could OCR my copy of DaVinci's manuscripts, particularly when the pages alternate between Latin and English, with illustrations.
  
  --
  C|N>K
3. Re:Why can't the text of these books be clearer? by Anonymous Coward · 2010-11-04 00:42 · Score: 0
  
  The OCR currently supports only Latin script, but Google scans books in every language?
4. Re:Why can't the text of these books be clearer? by multipartmixed · 2010-11-04 00:45 · Score: 1
  
  Yes, and a significant fraction of the books they are going to scan are DaVinci's Manuscripts!
  
  --
  
  Do daemons dream of electric sleep()?
5. Re:Why can't the text of these books be clearer? by grumbel · 2010-11-04 01:26 · Score: 2, Insightful
  
  It's because the book-scanning process is completely automated.
  I doubt it, it is not exactly hard to get a book that is at a rather fixed distance into focus. Anyway, the reason why the fonts are blurry isn't the focus to begin with, the images that Google shows are simply extremely low resolution. Why they are in such a low resolution I have no idea.
6. Re:Why can't the text of these books be clearer? by ortholattice · 2010-11-04 01:44 · Score: 4, Interesting
  
  As someone studying certain specialized math books from the 1800's and early 1900's, I had great expectations for Google books, since they offer downloadable PDFs for public domain works. However, the focus quality of many (most?) of them is so incredibly poor that things like tiny subscripts are illegible blobs, making them essentially useless.
  While plain text solves this problem for novels, it is useless for math books, because OCR renders the equations (which are the essence of the book) as garbage characters. And it's not clear how one would communicate them as plain text anyway, unless the OCR was extremely sophisticated and generated say LaTeX output.
  Thankfully, some of the ones I need are in the University of Michigan Historical Mathematics Collection, with a much higher quality. But for the ones that are not there, I've used the Google pdf as a last resort - at least I can get an overview, if somewhat unpleasant to read. But for books I actually want to study, I've ended up making my own scan from a library copy (which, if done with care, is better quality than even the U Mich. version) when Google's is the only one I can find on-line.
  However, scanning physically stresses these old books. I think it is sad that I have to repeat what Google has done, when they (presumably) could have scanned them with high quality with a little more effort or better equipment with automatic focusing. In some cases, the books have been in the rare book section of the university library, which can't be checked out, and making copies of the whole book locally is frowned upon because of possible damage and sometimes, depending on the book's condition, not allowed.
7. Re:Why can't the text of these books be clearer? by Covalent · 2010-11-04 01:51 · Score: 1
  
  It's because the book-scanning process is completely automated.
  I doubt it, it is not exactly hard to get a book that is at a rather fixed distance into focus. Anyway, the reason why the fonts are blurry isn't the focus to begin with, the images that Google shows are simply extremely low resolution. Why they are in such a low resolution I have no idea.
  Imagine the storage required for that many hi-res images when low-res works well enough. That's why.
  
  --
  Great warrior...hrmph! Wars not make one great.
8. Re:Why can't the text of these books be clearer? by icebraining · 2010-11-04 02:01 · Score: 2, Informative
  
  One way they do it is through reCaptcha. When you're typing them, you're also helping the OCR process.
  
  --
  Dilbert RSS feed
9. Re:Why can't the text of these books be clearer? by bouldin · 2010-11-04 02:28 · Score: 1
  
  Maybe the font is intentionally blurry so you can't use your own OCR to scrape the book from Google.
10. Re:Why can't the text of these books be clearer? by Anonymous Coward · 2010-11-04 02:48 · Score: 0
  
  Also, it is better to have poor quality initial versions available than to have nothing while they develop the proper automated processes to produce very high quality versions rather than trying to incrementally improve quality from book to book.
11. Re:Why can't the text of these books be clearer? by Anonymous Coward · 2010-11-04 03:05 · Score: 0
  
  While plain text solves this problem for novels, it is useless for math books, because OCR renders the equations (which are the essence of the book) as garbage characters. And it's not clear how one would communicate them as plain text anyway, unless the OCR was extremely sophisticated and generated say LaTeX output.
  Sounds like an excellent research project.
  Imagine if you could scribble math stuff with a pen on a Wacom board and have it be understood by the computer. Every time that I sit down to do some math I always end up daydreaming about such a system.
12. Re:Why can't the text of these books be clearer? by perryizgr8 · 2010-11-04 03:41 · Score: 1
  
  Actually, the Windows 7 Math Input Panel works very nicely.
  
  --
  Wealth is the gift that keeps on giving.
13. Re:Why can't the text of these books be clearer? by tlhIngan · 2010-11-04 03:50 · Score: 1
  
  It's because the book-scanning process is completely automated.
  I doubt it, it is not exactly hard to get a book that is at a rather fixed distance into focus. Anyway, the reason why the fonts are blurry isn't the focus to begin with, the images that Google shows are simply extremely low resolution. Why they are in such a low resolution I have no idea.
  Well, actually it is. Google's book scanners use two digital cameras to take photos of both pages at once, rather than a much clearer scanner system. Those photos are then automatically cropped to remove the background (scanner hardware: the book holder, platen frame, etc.), passed through a reverse-page remover (to remove traces of the text on the other side, which can surprisingly leak through), and run through a deskewer to straighten the words and lines so OCR has an easier time. Going through all this on a relatively low-resolution "scan" leads to even lower-resolution book images after processing.
  Plus it's optimized for throughput, so large digital camera photos mean each page takes longer to transfer the images from, so you don't want to do pictures at too high res. Plus if the book is smaller than what the full frame of a book is supposed to accomplish, you're starting with a lower resolution picture to begin with (the "zoom" isn't adjusted - the FOV is kept constant for the largest book the scanner can handle).
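The fixed field-of-view point can be made concrete with a little arithmetic. This is a minimal sketch with made-up numbers: the 4000-pixel sensor width, the 16-inch field of view, and the 5-inch page are assumptions for illustration, not figures for Google's actual scanners.

```python
def pixels_per_inch(sensor_pixels_wide, fov_width_in):
    """Pixels per inch at the page when the camera's field of view is fixed."""
    return sensor_pixels_wide / fov_width_in


def pixels_on_page(sensor_pixels_wide, fov_width_in, page_width_in):
    """Horizontal pixels that land on a page narrower than the field of view."""
    return pixels_per_inch(sensor_pixels_wide, fov_width_in) * page_width_in


# Assumed: a 4000-pixel-wide sensor framing a 16-inch-wide area for the largest book.
ppi = pixels_per_inch(4000, 16)
# A 5-inch-wide page uses only part of that frame; the rest is cropped as background.
small_page_pixels = pixels_on_page(4000, 16, 5)

print(ppi, small_page_pixels)
```

Under these assumptions the small page keeps under a third of the sensor's width before any deskewing or resampling, which fits the explanation above of why small books end up looking worse.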
14. Re:Why can't the text of these books be clearer? by DerekLyons · 2010-11-04 03:59 · Score: 1
  
  Basically it's too difficult to adjust for perfect focus for every book.
  Huh? Autofocus is not a new technology.
15. Re:Why can't the text of these books be clearer? by kryliss · 2010-11-04 07:13 · Score: 1
  
  Sounds like a two-person project. One person holds the book up shaped like an L. Use a digital camera to take pictures of each page. If a page is too "curved", try using the glass from a picture frame to hold it down. Make sure to use indirect lighting and no flash; if needed, set the ISO speed on your camera to a good setting. You could even set up some kind of tripod that points the camera straight down and set the timer.
  
  --
  --- If the bible proves the existence of God, then Superman comics prove the existence of Superman.
16. Re:Why can't the text of these books be clearer? by waterbear · 2010-11-04 12:18 · Score: 1
  
  the book-scanning process is completely automated.
  Well if it really is automated, how comes it that some of the scanned pages show part of the hand (complete with finger rings!) of the person who was doing the scanning? It looks as if the scanning was done by someone who didn't realise that the text can't be read if there's a hand between the page and the scanner-glass!
  I reckon that's a manual process.
  -wb-
17. Re:Why can't the text of these books be clearer? by WillAdams · 2010-11-04 12:19 · Score: 1
  
  There are a couple of projects which OCR math properly:
  http://inftyreader.org/
  http://research.cs.queensu.ca/drl//ffes/
  William
  
  --
  Sphinx of black quartz, judge my vow.
18. Re:Why can't the text of these books be clearer? by Anonymous Coward · 2010-11-05 12:13 · Score: 0
  
  You seem to be implying that Google is providing the books at the highest quality they have. They're spending tens, if not hundreds, of millions of dollars to scan, make available, and fight lawsuits about, these scanned books.
  I wouldn't be all that surprised if they have each page stored as a 10 MB, 20 MP image, but only, currently, present them in enough quality to satisfy the 99.999999% of people that don't read 100-year-old math texts.
Google Book Metadata by Anonymous Coward · 2010-11-04 00:26 · Score: 1, Interesting

I'm not sure Google can correlate the kinds of data they are talking about because their book metadata (author, title, edition, etc.) is so inaccurate. I often find Google books based on text search that can't be located in author or title searches.
eBooks Contribute to Global Warming by Anonymous Coward · 2010-11-04 00:30 · Score: 0

Because paper books sequester carbon.
aiming for "standard Google quality"? by a2wflc · 2010-11-04 00:39 · Score: 1, Funny

I hope they aren't trying to get experts-exchange as 8 of my top 10 book results.
Re:Books Contribute to Global Warming by WillAdams · 2010-11-04 00:43 · Score: 4, Informative

You're not taking into consideration the energy required to make the book, or to transport it to the marketplace. The amount of carbon sequestered in the physical pages of a book is insignificant in comparison.
The production of a book releases 8.85 lbs. of CO_2:
http://latimesblogs.latimes.com/emeraldcity/2008/06/paper-vs-paperl.html
Here's a page which indicates most CO_2 production is for energy:
http://www.eia.doe.gov/oiaf/1605/ggrpt/carbon.html
And here's a page which indicates that CO_2 production is a much larger problem for the manufacturing of electronics:
http://www.energybulletin.net/node/49730
w/ a ratio of 12 to 1 for energy usage to weight, so my PRS-505 weighs roughly 9 ozs., so presumably required 108 ounces of fuel to manufacture (on-going energy usage is trivial and not considered)
http://www.epa.gov/oms/climate/420f05001.htm
gives us a figure of 19.4 pounds of CO_2 per gallon of gasoline which equals roughly 16.36875 pounds of CO_2 to make the ebook reader.
So getting two books for the Sony should make it roughly break even, and each printed book beyond that which is not purchased should result in a net reduction of CO_2 emissions, since the energybulletin.net page indicates that the embodied energy usage for electronics is much greater than the lifetime usage.

--
Sphinx of black quartz, judge my vow.
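The arithmetic in the post above can be reproduced in a few lines. This sketch only uses the figures the post cites; the one added assumption is the 128-ounces-per-gallon conversion, which is what the post's 16.36875 result implies (it treats the 108 ounces of fuel as fluid ounces, a simplification).

```python
OZ_PER_GALLON = 128          # US fluid ounces per gallon, implied by the post's result
READER_WEIGHT_OZ = 9         # Sony PRS-505 weight, per the post
ENERGY_TO_WEIGHT_RATIO = 12  # fuel-to-product weight ratio cited in the post
CO2_LB_PER_GALLON = 19.4     # gasoline figure cited in the post
CO2_LB_PER_BOOK = 8.85       # printed-book figure cited in the post

# Fuel needed to manufacture the reader, in ounces and then gallons.
fuel_oz = READER_WEIGHT_OZ * ENERGY_TO_WEIGHT_RATIO
fuel_gal = fuel_oz / OZ_PER_GALLON

# CO2 attributed to the reader, and how many printed books emit the same amount.
reader_co2_lb = fuel_gal * CO2_LB_PER_GALLON
break_even_books = reader_co2_lb / CO2_LB_PER_BOOK

print(fuel_oz, fuel_gal, round(reader_co2_lb, 5), round(break_even_books, 2))
```

The break-even ratio comes out a little under two, which matches the post's "roughly break even" at two books.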
Re:Books Contribute to Global Warming by Anonymous Coward · 2010-11-04 00:51 · Score: 1, Funny

My printing press is fueled by the frantic posts of trolled know-it-alls.
" 'There is less data about books than web pages.. by unitron · 2010-11-04 01:59 · Score: 2, Informative

Shouldn't that be "are fewer data"?

--
I see even classic Slashdot is now pretty much unusable on dial up anymore.
Re:Books Contribute to Global Warming by cherokee158 · 2010-11-04 02:14 · Score: 1

Did you include the energy cost of the manufacturing and disposal of the batteries that will power your e-reader? How many batteries and e-readers do you expect to consume during the lifespan of a typical book?
Link == citation by Compaqt · 2010-11-04 02:19 · Score: 1

Books don't link to each other?
What are citations and footnotes?

--
I'm not a lawyer, but I play one on the Internet. Blog
1. Re:Link == citation by Anonymous Coward · 2010-11-04 06:23 · Score: 1, Informative
  
  CS Lewis and Tolkien weren't really known for their citations.
  (Now Tolkien might well be known for his appendices, but that is totally different.)
Re:Books Contribute to Global Warming by WillAdams · 2010-11-04 02:38 · Score: 1

Batteries are included in the initial production weight and the battery is a small fraction of that weight --- an e-ink screen reader uses so little power that one needs to recharge every week or so, so batteries last for _years_ --- if one does replace the battery the old one contains materials which are valuable enough to warrant recycling, so the environmental impact is minimal as stated in my post.
An ebook reader which used typical batteries would be a really bad idea and if there are any such, I hope they get loaded w/ rechargeable batteries.
William

--
Sphinx of black quartz, judge my vow.
what's also nice is it's on a dual core Cortex-A9 by Locutus · 2010-11-04 05:23 · Score: 1

The Tegra2 kit I messed with was 1GHz with 1GB of RAM, and it wasn't optimized but ran Ubuntu great. Can't wait.

LoB

--
"Anyone who stands out in the middle of a road looks like roadkill to me." --Linus
Yes, but... by Anonymous Coward · 2010-11-04 05:47 · Score: 0

To be pedantic yes, it should be.
However, in the common lexicon data has become more of an indefinite noun; few would actually use the singular datum at any point. Thus it becomes natural to talk about it in indefinite terms (is less data) rather than the correct definite terms (are fewer data).