Google Book Scanning Efforts Not Open Enough?
An anonymous reader writes to mention that the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based Alfred P. Sloan Foundation is planning to announce a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."
The more the merrier!
Ideally we could set up a few hundred digital libraries that would all hold some percentage of the catalog, so that any 5 would be able to duplicate the entire catalog. That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.
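For what it's worth, the "any 5 can rebuild the catalog" idea can be checked on a toy scale. The sketch below is only an illustration under made-up assumptions (40 books, 8 libraries, plain whole-book copies, and the hypothetical helper names `assign_holdings` and `any_k_cover`); it is not how any real archive distributes its holdings. The key observation is that any 5 libraries cover everything exactly when no book is missing from more than 4 of them.

```python
from itertools import combinations


def assign_holdings(num_books, num_libraries, k):
    """Give every book to all libraries except k - 1 of them (round-robin)."""
    if not 1 <= k <= num_libraries:
        raise ValueError("k must be between 1 and num_libraries")
    holdings = [set() for _ in range(num_libraries)]
    for book in range(num_books):
        # The k - 1 libraries that do NOT receive this book.
        skipped = {(book + i) % num_libraries for i in range(k - 1)}
        for lib in range(num_libraries):
            if lib not in skipped:
                holdings[lib].add(book)
    return holdings


def any_k_cover(holdings, num_books, k):
    """Brute-force check: does every group of k libraries hold the full catalog?"""
    full = set(range(num_books))
    return all(set().union(*group) == full
               for group in combinations(holdings, k))


# Toy numbers (assumptions): 40 books, 8 libraries, any 5 must suffice.
holdings = assign_holdings(num_books=40, num_libraries=8, k=5)
print(any_k_cover(holdings, 40, 5))   # every group of 5 libraries
print(any_k_cover(holdings, 40, 4))   # every group of 4 libraries
print([len(h) for h in holdings])     # books held per library
```

With plain copies, each library ends up holding (N - 4)/N of the catalog, so with a few hundred libraries every one of them stores nearly everything. Getting each library's share down to a small percentage would take something like erasure coding, which this sketch does not attempt.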
I'd definitely like to see some not-for-profits get involved.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
Google's big mistake was to try to do both PD and copyrighted books. Regardless of the legal merits (which are complicated), it was just a stupid business decision to waste effort on doing copyrighted books in general, on an opt-out basis. The controversy about the copyrighted books has dragged the PD books down with it. Part of the fallout from the lawsuit has been that Google has done everything it could to hide from users the fact that the service even exists. The whole thing is actually an abject failure, so it doesn't make me worry that Google will somehow get too powerful. Anyway, AFAIK Google doesn't claim any IP rights on their scans of PD books, so they actually don't have any control at all -- other people can take the scans and do whatever they want with them. Google is in the advertising business, not the publishing business.
Find free books.
Anyone else find the irony here funny? Google is on the side of keeping this a closed-circuit project, and MS is part of the alliance trying to make it open.
It's funny. Laugh.
Can't Google just Open Source the project?
Well, the source of the code running the project wouldn't be that helpful, it's the content we're after.
And presuming you meant Google opening the content.... well I doubt it... they want to sell ads on the content after all!
Don't forget, Google, nice though they are, haven't given out code/content/etc. for any of their "crown jewels".
There are shills on slashdot. Apparently, I'm one of them.
It would also be pretty naive and stupid to rely on references from only one source if you're doing research. You'd want to check out multiple sources to get the full picture. Of course, there is a growing problem among college students: an increasing number of them believe that if it's not available on the web, it doesn't exist. Such students might find themselves somewhat "enlightened" if they walked over to the library and cracked open a book or journal from, say, before 1995.
I bet they won't.
There is nothing sexy or secret about the methods of scanning, but they must have put an imperial frickton of money into the process...To give the fruit of that much money away would be irresponsible to their shareholders...At least until they've made their money back with it.
Scanning a book is easy; it simply involves taking pictures. You can slice the spine off and take pictures of each page, or use one of the panoply of non-destructive machines that correct the page-warping effects of an open book. This is not particularly hard or expensive.
Damn straight. The OCR process is the hardest part, so of course they wouldn't allow others access to the highly valuable text. They might have a million books "scanned" this year, but each page has to be OCRed. Most people don't decouple those operations and assume that after scanning the hard part is over. Say each book has 300 pages, so we're talking about running 300 million pages of text through OCR. Now you've got a real problem. How does one know if a page of a book is OCRed correctly? You can pay a human or even a large team of humans to QA the text, but even then you can only spot check here and there. A 99.99% correct OCR program will mess up on the equivalent of 30,000 pages of text a year (spread out more or less uniformly across the 300 million). Also, not all pages of books are scannable (pictures, weird fonts, weird page layouts), and then there are headaches with keeping track of the related and multiple editions of a book, displaying pictures in the reader that you don't have copyright to (which I think always gets glossed over in these sorts of articles), 10-digit to 13-digit ISBNs, etc. So yes, they aren't going to allow others access to the text, because producing it is hard and expensive: you can only automate so much if you want to ensure the accuracy of the text itself (I think Google does). If they opened the text up, what stops the competitors from simply adding the data into their search engines after the difficult part is over? Google does no evil, but they aren't stupid.
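The error estimate in the comment above is simple arithmetic, so here is a back-of-the-envelope sketch using the comment's own inputs (1 million books, 300 pages each). The function name `ocr_page_errors` and the "errors per 10,000 pages" parameter are made up for illustration; treating accuracy as a per-page-equivalent rate is an assumption, since real OCR accuracy is usually quoted per character.

```python
def ocr_page_errors(books, pages_per_book, errors_per_10k_pages):
    """Return (total pages, expected page-equivalents with OCR errors)."""
    total_pages = books * pages_per_book
    bad_pages = total_pages * errors_per_10k_pages / 10_000
    return total_pages, bad_pages


# Inputs from the comment: 1 million books a year, 300 pages each.
total, bad_9999 = ocr_page_errors(1_000_000, 300, errors_per_10k_pages=1)  # 99.99% correct
_, bad_9995 = ocr_page_errors(1_000_000, 300, errors_per_10k_pages=5)      # 99.95% correct

print(total, bad_9999, bad_9995)
```

On these assumptions, 300 million pages at 99.99% accuracy come to 30,000 bad page-equivalents a year, while a figure of 150,000 corresponds to 99.95% accuracy. Either way it is far more than a team of spot-checkers can catch.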
Google 'Do No Evil' ... is about the biggest lie perp'ed on mankind. Google is the last company aside from the obvious M$ that I would want to control anything. They are about inflated stock, and making you see ads online. Are we all that stupid that we believe Google-ganda?
Oh, do calm down... They never claimed "we do absolutely no evil whatsoever", it's more like - the founders happen to think that "evil should not be done". What's a lie about that? Also, how does inflated stock make them evil?
And how, pray, are they supposed to survive without the adverts? Never mind the fact that Google didn't actually come up with online advertising but were pretty much the first ones to run targeted, non-offensive (as in, no flashing banners, pop-ups, etc.) ads.
I'm no Google fanboy, although I happily use many of their services. But I don't think there's anything inherently wrong with them, and I find it somewhat sad to see this paranoid drivel modded up to +3 Insightful.
Basilisk Digital
I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) Google's project (whatever that means...).
Project Gutenberg is open and non-proprietary (ASCII text) and has been for quite a while.
After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.
Do you even lift?
These aren't the 'roids you're looking for.
Hear, hear! Books want to be open! I find that when books can be open, as they should, they become much more accessible to people than if they were kept closed.
---GEC
I'm but the humble pupil, seeking to snatch the scratchbuilt pebble from the master's fully articulated hand
You folks do realise that Google returns the books after they scan them, so they'll still be in the libraries afterwards, right? So how does this reduce their availability?
Nihil Illegitemi Carborvndvm
Most of these people focus on English-language books printed in the 19th and early 20th centuries, because (1) it's usually easy to determine copyright status, and (2) if you go earlier you get the tall "s" (ſ in UTF-8) which no OCR program today seems able to handle, so the scanning cost is increased.
Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.
There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail, and indeed, compare for example the Alice in Wonderland pictures on my own site with the Project Gutenberg ones. You can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.
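Here is roughly what that back-of-the-envelope calculation looks like. Every number below is an assumption picked for illustration, not a measurement: a hypothetical 9 x 6 inch page, a hypothetical 40-megapixel sensor of about 7300 x 5480 pixels, and engraving detail as fine as 600 line pairs per inch. The only principle used is Nyquist: you need at least two samples per line pair.

```python
def min_dpi_for_detail(line_pairs_per_inch):
    """Nyquist: at least two samples are needed per line pair."""
    return 2 * line_pairs_per_inch


def camera_dpi(sensor_px_long, sensor_px_short, page_long_in, page_short_in):
    """Effective dpi when a single shot has to cover the whole page."""
    return min(sensor_px_long / page_long_in, sensor_px_short / page_short_in)


def megapixels_needed(dpi, page_long_in, page_short_in):
    """Pixels (in millions) required to cover the page at the given dpi."""
    return (dpi * page_long_in) * (dpi * page_short_in) / 1e6


# Assumed values, for illustration only.
needed = min_dpi_for_detail(600)             # hypothetical engraving detail
achieved = camera_dpi(7300, 5480, 9, 6)      # hypothetical 40 MP sensor, 9 x 6 in page
required_mp = megapixels_needed(needed, 9, 6)

print(needed, round(achieved), round(required_mp, 2))
```

Under these assumptions a single 40-megapixel shot gets only about 800 dpi on the page, well short of 1200 dpi, and covering the page at 1200 dpi would take nearly twice as many pixels.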
Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site where people can submit corrections. Maybe use a wiki for the transcription?
Liam
Live barefoot!
free engravings/woodcuts
Quite the opposite. If they give it away, then I can set up ePhil House o' Classic Literature and reap the benefits of that advertising in place of Google. I can show less advertising because I don't have that nasty overhead of scanning the books. Google's need is to make it available to consumers in exchange for "eyeballs" but keep it away from me. Hammer away on Google's servers and they will cut you off. I ran operations at a company that performed such meta-searches and used to be able to tell you with a high degree of precision where that line was (which we considered business intelligence and thus wouldn't tell you unless you worked there).
And for the record there is no requirement that they give away the content to show you advertising; they choose to do so because a free service attracts more "eyeballs" than a paid service. It's up to management to decide which combination of advertising vs. subscription fees nets the most profit. Since Google best understands the "charge by advertising" model, they have a predilection for the "advertising-only supported" model.
So your grade for Google 101 is an F
You are in a maze of twisted little posts, all alike.
Did someone break their legs?
See that big building downtown with all the books in it?
Oh wait, get up from your desk, go outside (yes I know, it burns...), get on the bus and go downtown.
OK, now see the big building with the strange letters "LIBRARY" on the front? OK, that's the one, go inside... see all the books?
Now go up to the attendant at the desk and tell them your name and address and show a piece of photo ID. The nice person will give you a card that you can use to borrow books.
What's a book? OK, it's many pages of paper bound together, usually with glue and string. On each of these "pages" you will find ink (a dye) in the pattern of letters that form words and sentences and paragraphs.
Usually, these "books" tell a story or provide organised information.
Now go ahead, pick one out - they'll even let you take it home for a week or two so you can read it. For free!
You can browse the stacks (a colloquialism for those big shelves with books on them) which are organised according to a system known as the Dewey Decimal System. You can use a revolutionary piece of technology known as a "card catalog" to indicate the position of the title you seek on the stacks (though many libraries have this same catalog searchable from computer terminals).
It's revolutionary, I know. But there you have it, free information and entertainment, enough to last a lifetime, with a "less restrictive approach".
Enjoy.
One thing that bugs the heck out of me with Google is their "Oh, we will do this because we have the rights" attitude, yet if you want to use their stuff you need EXPLICIT permission. http://www.google.com/permissions/index.html
" All of Google's trademarks, logos, web pages, screen shots, or other distinctive features ("Google Brand Features") are protected by applicable trademark, copyright, and other intellectual property laws. If you would like to use any of Google Brand Features on your website, in an advertisement, in an article or book, or reproduce them anywhere else, you must first receive Google's permission. We've tried to make this process as painless as possible."
Funny: Google wants you to get permission, as if there were no such thing as fair use, YET they want publishers to opt out...
Google is hypocritical!
"You can't make a race horse of a pig"
"No," said Samuel, "but you can make very fast pig"
Exactly right. All these comments about "must show ads over it" pretty much miss the point. Google's project allows you to SEARCH all the books it's scanning, and even so, it's drawn the ire of copyright holders. Imagine if they said... "Oh, yes... we're OPEN SOURCING all of our scanning results for unfettered public consumption." No judge in the world... nuff said. Open sourcing the actual methodology would not serve much purpose, although it's worth noting that they open sourced some OCR software earlier. Very well received too. Gift horses and such, blah blah blah.