Google Book Scanning Efforts Not Open Enough?
An anonymous reader writes to mention the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based foundation is planning on announcing a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."
Scanning a book is easy, it simply involves taking pictures. You can splice the spine off an take pictures of each page or use one of the panoply of non-destructive machines to correct the page warping effects of an open book. This is not particularly hard or expensive.
Damn straight. The OCR process is the hardest part, of course they wouldn't allow access to highly valuable text to others. They might have a million books "scanned" this year but each page has to be OCRed. Most people don't decouple those operations and assume that after scanning the hard part is over. Say each book has 300 pages, so we're talking about running 300 million pages of text through OCR. Now you've got a real problem. How does one know if a page of a book is OCRed correctly? You can pay a human or even a large team of humans to QA the text but even then you can only spot check here and there. A 99.99% correct OCR program will mess up on the equivalent of 150,000 pages of text a year (spread out more or less uniformly across the 300 million). Also, not all pages of books are scanable (pictures, weird fonts, weird page layouts), and then there are headaches with keeping track of the related editions of a books, multiple editions of books, displaying pictures in the reader you don't have copyright to (which I think always gets glossed over with these sorts of articles), 10 digit to 13 digit ISBNs, etc. So yes, they aren't going to allow access to the text to others, because it's hard and expensive to do so because you can only automate so much if you want to the ensure accuracy of the text itself (I think Google does). If they opened the text up what stops the competitors from simply adding the data into their search engines after the difficult part is over? Google does no evil but they aren't stupid.
I'm a kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) google's prject (whatever that means...).
Project Gutenburg is open and non proprietary (ASCII text) and has been for quite a while.
After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.
Do you even lift?
These aren't the 'roids you're looking for.