Slashdot Mirror


Google Book Scanning Efforts Not Open Enough?

An anonymous reader writes to mention the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based foundation is planning on announcing a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."

6 of 113 comments

  1. Re:Google's got a long way to go . . . by bcrowell · · Score: 4, Informative
    I think you're vastly overestimating the added benefit from scanning books from more libraries after the first few:
    1. Most libraries' collections are very similar to most other libraries' collections, and the greatest overlap occurs with the books that are the most important.
    2. This is all about PD stuff, since OCA isn't proposing to do anything still in copyright. Less ephemeral works (the kind typically preserved in library collections a century later) generally all had their copyrights renewed in the U.S., so that means we're only talking about pre-1923 materials. Since Congress keeps on extending copyright terms, probably nothing from 1923 on is ever going to enter the public domain. That means we're talking about the publishing world of 1922, which was vastly smaller than today's publishing world. Amazon.com has on the order of 10^6 books. To get a feel for the size of the publishing industry in past decades, try browsing through the catalog of renewals; the number of books published was extremely small in the early 19th century.
    3. There are many books that won't be in any library's collection, simply because they weren't considered very valuable. You could digitize a thousand libraries, and never find them. Handwriting manuals from 1893. Trashy novels. Etc. In fact, there are a lot of books from the 1930's-1950's that are now PD, because they never had their copyrights renewed, but you're not going to find them in libraries' collections, and in fact it's very unlikely that anyone will ever be interested in them.
  2. Please do a better job, not just a bigger job by Ankh · · Score: 2, Informative

    Most of these people focus on English-language books printed in the 19th and early 20th centuries, because (1) it's usually easy to determine copyright status, and (2) if you go earlier you get the tall "s" (ſ in utf-8) which no OCR program today seems able to handle, so the scanning cost is increased.

    Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.

    There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail. Compare, for example, the Alice in Wonderland pictures on my own site with the Project Gutenberg ones: you can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.

    Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
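
The back-of-envelope Nyquist estimate mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the 0.04 mm line pitch (one engraved line plus one gap), the 7300-pixel long side for a 40-megapixel sensor, and the 10-inch page are hypothetical figures, not measurements taken from the comment.

```python
MM_PER_INCH = 25.4


def min_dpi_for_pitch(pitch_mm: float) -> float:
    """Minimum scan resolution (dpi) for a repeating line pattern.

    Nyquist: each line-plus-gap period needs at least two samples,
    so the sampling rate is 2 / pitch samples per mm.
    """
    if pitch_mm <= 0:
        raise ValueError("pitch_mm must be positive")
    return 2 * MM_PER_INCH / pitch_mm


def camera_dpi(pixels_long_side: int, page_long_side_in: float) -> float:
    """Effective dpi when one camera frame covers the whole page."""
    if page_long_side_in <= 0:
        raise ValueError("page_long_side_in must be positive")
    return pixels_long_side / page_long_side_in


if __name__ == "__main__":
    # Hypothetical engraving: lines and gaps of about 0.02 mm each.
    needed = min_dpi_for_pitch(0.04)
    # Hypothetical 40-megapixel sensor (about 7300 x 5480 pixels)
    # photographing a page whose long side is 10 inches.
    available = camera_dpi(7300, 10.0)
    print(f"needed:    about {needed:.0f} dpi")
    print(f"available: about {available:.0f} dpi")
```

With those assumed numbers the required resolution comes out a little above 1200 dpi, while a single camera frame of the page lands well below it, which is the shape of the argument the comment makes. Real engravings should be measured before relying on either figure.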

    I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site where people can submit corrections. Maybe use a wiki for the transcription?

    Liam

    --
    Live barefoot!
    free engravings/woodcuts
  3. Re:Good! by Anonymous Coward · · Score: 2, Informative

    Yeah, that's why we don't have any idea what anybody might have said (or meant) more than a thousand years ago.

  4. money by Anonymous Coward · · Score: 1, Informative

    "And how, pray, are they supposed to survive without the adverts?"

    Don't know about you, but I would pop for a yearly subscription for a *good quality* search engine that had a toggle for "with adverts" or "no adverts". Not sure how much I would spend; that would depend on how good they were at filtering out link farms, etc., but some reasonable fee to have the option of no ads. And then websites might have an inducement to restrict use of ads to at least the interior pages and not the main public-facing page. Ads there just suck.

    Right now I would classify the free Google search with ads as being of medium quality until you get good at it, adding a lot of -restrict this and that word to your query and learning wild cards, domain restrictions, etc. In fact, I wish Google had one simple option on their main page: split the search bar in two by default, with one side for the words/phrases you are looking for and the other side for what you want to immediately filter out. For example, if you add -sale, you eliminate a lot of commercial sites. Dogsquat simple, hardly anyone does it.

    Google is good once you learn to use it; used by default, the way most people use it, it's just a fancy yellow pages.
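
The two-box search bar described in this comment amounts to turning a list of wanted terms and a list of unwanted terms into one query string. A minimal Python sketch follows; the function names `quote_term` and `build_query` are hypothetical, and the only search syntax it relies on is the `-term` exclusion the comment mentions plus double quotes around multi-word phrases.

```python
def quote_term(term: str) -> str:
    """Wrap multi-word phrases in double quotes so they stay together."""
    term = term.strip()
    return f'"{term}"' if " " in term else term


def build_query(include: list[str], exclude: list[str]) -> str:
    """Combine the 'looking for' box and the 'filter out' box into one query."""
    wanted = [quote_term(t) for t in include if t.strip()]
    # Each unwanted term gets a leading minus sign, as in "-sale".
    unwanted = ["-" + quote_term(t) for t in exclude if t.strip()]
    return " ".join(wanted + unwanted)


if __name__ == "__main__":
    print(build_query(["used", "bicycle"], ["sale"]))
```

A front end with two text boxes would only need to split each box on commas or whitespace and pass the two lists to `build_query`; nothing else about the search engine has to change.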

  5. Re:Good! by Korin43 · · Score: 2, Informative

    Not to mention that the whole "decaying medium" argument is ridiculous. If a hard drive fails, replace it. If you get something better than hard drives, copy it. It's not like big servers only keep the information in one specific place. There are usually copies.

  6. Re:Just Open Source It? by CleverBoy · · Score: 2, Informative

    Exactly right. All these comments about "must show ads over it" pretty much miss the point. Google's project allows you to SEARCH all the books it's scanning, and even so, it's drawn the ire of copyright holders. Imagine if they said... "Oh, yes... we're OPEN SOURCING all of our scanning results for unfettered public consumption." No judge in the world... nuff said. Open sourcing the actual methodology would not serve much purpose, although it's worth noting that they open-sourced some OCR software earlier. Very well received too. Gift horses and such, blah blah blah.