Slashdot Mirror


Google Book Scanning Efforts Not Open Enough?

An anonymous reader writes to mention that the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based Alfred P. Sloan Foundation is planning to announce a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."

5 of 113 comments

  1. Good! by SatanicPuppy · · Score: 4, Insightful

    The more the merrier!

    Ideally we could set up a few hundred digital libraries that would all hold some percentage of the catalog, so that any 5 would be able to duplicate the entire catalog. That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.

    I'd definitely like to see some not-for-profits get involved.
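The "any 5 libraries can rebuild the catalog" idea above can be sketched in a few lines. This is only an illustration under a loud assumption: it uses plain replication (whole copies of each book), which the comment does not specify. Under that assumption a book must sit in at least N - 5 + 1 of N libraries, so that it is missing from at most 4. The function names and the round-robin placement are hypothetical.

```python
from itertools import combinations

def assign_replicas(num_books, num_libraries, k):
    """Place each book in (num_libraries - k + 1) libraries, round-robin.

    A book is then absent from only k - 1 libraries, so any k libraries
    must include at least one holder. Assumes k <= num_libraries.
    """
    copies = num_libraries - k + 1
    holdings = [set() for _ in range(num_libraries)]
    for book in range(num_books):
        for offset in range(copies):
            holdings[(book + offset) % num_libraries].add(book)
    return holdings

def any_k_cover(holdings, num_books, k):
    """Brute-force check that every group of k libraries holds the full catalog."""
    catalog = set(range(num_books))
    return all(set().union(*group) == catalog
               for group in combinations(holdings, k))

# Toy numbers (hypothetical): 20 books spread over 8 libraries.
holdings = assign_replicas(num_books=20, num_libraries=8, k=5)
print(any_k_cover(holdings, 20, 5))  # True: any 5 libraries suffice
print(any_k_cover(holdings, 20, 4))  # False: some group of 4 misses a book
```

With plain replication each library ends up holding a large share of the catalog (about half in this toy case, and nearly all of it with a few hundred libraries). Getting the same guarantee while each site stores only a small fraction would take an erasure-coding scheme, which this sketch does not attempt.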

    --
    ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
  2. Scanning a book is easy... by creative_Righter · · Score: 5, Insightful
    "Already facing a legal challenge for alleged copyright infringement, Google Inc.'s crusade to build a digital library has triggered a philosophical debate with an alternative project promising better online access to the world's books, art and historical documents."

    Scanning a book is easy; it simply involves taking pictures. You can slice the spine off and take pictures of each page, or use one of the panoply of non-destructive machines that correct the page-warping effects of an open book. This is not particularly hard or expensive.

    "The latest tensions revolve around Google's insistence on chaining the digital content to its Internet-leading search engine and the nine major libraries that have aligned themselves with the Mountain View-based company."

    Damn straight. The OCR process is the hardest part, so of course they wouldn't allow others access to the highly valuable text. They might have a million books "scanned" this year, but each page still has to be OCRed. Most people don't decouple those operations and assume that once the scanning is done the hard part is over. Say each book has 300 pages, so we're talking about running 300 million pages of text through OCR. Now you've got a real problem. How does one know whether a page of a book was OCRed correctly? You can pay a human or even a large team of humans to QA the text, but even then you can only spot check here and there. A 99.99% correct OCR program will mess up on the equivalent of 30,000 pages of text a year (spread out more or less uniformly across the 300 million). Also, not all pages of books are scannable (pictures, weird fonts, weird page layouts), and then there are headaches with keeping track of related and multiple editions of a book, displaying pictures in the reader that you don't hold the copyright to (which I think always gets glossed over in these sorts of articles), 10-digit to 13-digit ISBNs, etc. So yes, they aren't going to allow others access to the text, because producing it is hard and expensive: you can only automate so much if you want to ensure the accuracy of the text itself (I think Google does). If they opened the text up, what would stop competitors from simply adding the data to their search engines after the difficult part is over? Google does no evil, but they aren't stupid.
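Two of the points above are simple enough to work through in code: the back-of-the-envelope OCR error count and the 10-digit to 13-digit ISBN conversion. This is a rough sketch, not anything Google is known to run. The function names are made up for illustration, the book and page counts are the comment's round numbers, and the 99.95% case is included only for comparison. The ISBN conversion follows the standard rule (prefix 978, drop the old check digit, recompute with alternating weights 1 and 3).

```python
from fractions import Fraction

def error_equivalent_pages(books, pages_per_book, accuracy):
    """Pages' worth of text expected to be wrong at a given OCR accuracy.

    `accuracy` is passed as a decimal string so Fraction keeps it exact.
    """
    total_pages = books * pages_per_book
    return total_pages * (1 - Fraction(accuracy))

def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 using the 978 prefix."""
    core = "978" + isbn10.replace("-", "")[:9]  # drop the old check digit
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)

# One million books of 300 pages each = 300 million pages.
print(error_equivalent_pages(1_000_000, 300, "0.9999"))  # 30000
print(error_equivalent_pages(1_000_000, 300, "0.9995"))  # 150000
print(isbn10_to_isbn13("0-306-40615-2"))                 # 9780306406157
```

Even at the lower error count, the errors are scattered across the whole corpus, which is why spot checks only catch a small sample of them.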

  3. Re:Google's got a long way to go . . . by bcrowell · · Score: 4, Informative
    I think you're vastly overestimating the added benefit from scanning books from more libraries after the first few:
    1. Most libraries' collections are very similar to most other libraries' collections, and the greatest overlap occurs with the books that are the most important.
    2. This is all about PD stuff, since OCA isn't proposing to do anything that is still in copyright. Less ephemeral works (the kind typically preserved in library collections a century later) generally all had their copyrights renewed in the U.S., so that means we're only talking about pre-1923 materials. Since Congress keeps on extending copyright terms, probably nothing from 1923 on is ever going to enter the public domain. That means we're talking about the publishing world of 1922, which was vastly smaller than today's publishing world. Amazon.com has on the order of 10^6 books. To get a feel for the size of the publishing industry in past decades, try browsing through the catalog of renewals; the number of books published was extremely small in the early 19th century.
    3. There are many books that won't be in any library's collection, simply because they weren't considered very valuable. You could digitize a thousand libraries and never find them. Handwriting manuals from 1893. Trashy novels. Etc. In fact, there are a lot of books from the 1930s-1950s that are now PD because their copyrights were never renewed, but you're not going to find them in libraries' collections, and it's very unlikely that anyone will ever be interested in them.
  4. Project Gutenberg by larry+bagina · · Score: 5, Interesting

    I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) Google's project (whatever that means...).

    Project Gutenberg is open and non-proprietary (ASCII text) and has been for quite a while.

    After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.
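As a loose illustration of the proofreading step described above, the sketch below lists the word-level differences between raw OCR output and a volunteer's corrected text. It is not the actual Distributed Proofreaders software: volunteers there work from the page image, and the function name and sample strings here are invented. It only shows how the resulting corrections could be summarized with Python's standard `difflib`.

```python
import difflib

def flag_differences(ocr_text, proofread_text):
    """Return (ocr_words, corrected_words) pairs wherever the two texts differ."""
    ocr_words = ocr_text.split()
    good_words = proofread_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=good_words, autojunk=False)
    return [
        (" ".join(ocr_words[i1:i2]), " ".join(good_words[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"  # keep only replaced, inserted, or deleted runs
    ]

# Hypothetical OCR output with two typical misreads ("b" as "h", "m" as "rn").
ocr = "It was the hest of tirnes, it was the worst of times"
fixed = "It was the best of times, it was the worst of times"
print(flag_differences(ocr, fixed))
# [('hest', 'best'), ('tirnes,', 'times,')]
```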

    --
    Do you even lift?

    These aren't the 'roids you're looking for.

  5. the books aren't going anywhere... by the+packrat · · Score: 4, Insightful

    You folks do realise that Google returns the books after they scan them, so they'll still be in the libraries afterwards, right? So how does this reduce their availability?

    --
    Nihil Illegitemi Carborvndvm