Google Book Scanning Efforts Not Open Enough?
An anonymous reader writes to mention the Washington Post is reporting that the Open Content Alliance is taking the latest shot at Google's book scanning program. Complaining that having all of the books under the "control" of one corporation wouldn't be open enough, the New York-based foundation is planning on announcing a $1 million grant to the Internet Archive to achieve the same end. From the article: "A splinter group called the Open Content Alliance favors a less restrictive approach to prevent mankind's accumulated knowledge from being controlled by a commercial entity, even if it's a company like Google that has embraced 'Don't Be Evil' as its creed. 'You are talking about the fruits of our civilization and culture. You want to keep it open and certainly don't want any company to enclose it,' said Doron Weber, program director of public understanding of science and technology for the Alfred P. Sloan Foundation."
The more the merrier!
Ideally we could set up a few hundred digital libraries that would all hold some percentage of the catalog, so that any 5 would be able to duplicate the entire catalog. That way, in the event of a catastrophe or some kind of weird global event, it would be more likely that an uncorrupted copy could be found.
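A rough sketch of what "any 5 can rebuild the catalog" would demand if done with plain replication: with N libraries, every book has to sit in at least N - 5 + 1 of them, otherwise some group of 5 could miss it. The function names and the round-robin placement below are made up for illustration, and the demo uses 8 libraries rather than a few hundred so the brute-force check stays small.

```python
from itertools import combinations


def assign_books(num_books, num_libraries, k):
    """Place each book in (num_libraries - k + 1) libraries, round-robin."""
    copies = num_libraries - k + 1
    holdings = [set() for _ in range(num_libraries)]
    for book in range(num_books):
        for offset in range(copies):
            holdings[(book + offset) % num_libraries].add(book)
    return holdings


def any_k_cover_catalog(holdings, num_books, k):
    """Brute-force check that every group of k libraries holds every book."""
    full_catalog = set(range(num_books))
    return all(
        set().union(*(holdings[i] for i in group)) == full_catalog
        for group in combinations(range(len(holdings)), k)
    )


holdings = assign_books(num_books=100, num_libraries=8, k=5)
print(any_k_cover_catalog(holdings, 100, 5))  # every group of 5 is enough
print(any_k_cover_catalog(holdings, 100, 4))  # some group of 4 is not
```

With hundreds of libraries and a threshold of 5, this scheme means nearly every library holds nearly everything, so a real design would more likely use some form of erasure coding than straight copies.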
I'd definitely like to see some not-for-profits get involved.
ad logicam Claiming a proposition is false because it was presented as the conclusion of a fallacious argument.
Can't Google just Open Source the project?
That way we don't have different companies and foundations duplicating each other's work, but all the results are still open and accessible to everybody.
Google's big mistake was to try to do both PD and copyrighted books. Regardless of the legal merits (which are complicated), it was just a stupid business decision to waste effort on doing copyrighted books in general, on an opt-out basis. The controversy about the copyrighted books has dragged the PD books down with it. Part of the fallout from the lawsuit has been that Google has done everything it could to hide from users the fact that the service even exists. The whole thing is actually an abject failure, so it doesn't make me worry that Google will somehow get too powerful. Anyway, AFAIK Google doesn't claim any IP rights on their scans of PD books, so they actually don't have any control at all -- other people can take the scans and do whatever they want with them. Google is in the advertising business, not the publishing business.
Find free books.
Anyone else find the irony here funny? Google is on the side of keeping this a closed-circuit project and MS is part of the alliance trying to make it open.
Its funny. Laugh.
It would also be pretty naive and stupid to rely on references from only one source if you're doing research. You'd want to check out multiple sources to get the full picture. Of course, there is a growing problem, quite common nowadays among an increasing number of college students, that they believe that if it's not available on the web, it doesn't exist. Such students might find themselves somewhat "enlightened" if they walked over to the library and cracked open a book or journal from, say, before 1995.
...when you can copy. If Google is going to make the data freely available, why pay people to start another scanning program when you can pay people to wait for Google to finish, have them go to the Google page and simply press CTRL-A, CTRL-C and then CTRL-V into their own page? Scan complete!
There is no "I disagree" mod for a reason. Flamebait, Troll, and Overrated are not substitutes.
Since when were Google "in control" for being allowed to show excerpts of a book for the advertisement of the companies allowing them to carry their books?
Beware: In C++, your friends can see your privates!
One million dollars? Even if you focus that solely on the contents of the Library of Congress, that will be, what, five cents per book?
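Working the five-cent figure backwards shows what collection size it assumes. This is only the arithmetic implied by the two numbers in the comment above, not a verified count of any library's holdings; the function name is made up for illustration.

```python
def implied_book_count(grant_dollars, cents_per_book):
    """Number of books a grant covers at a given per-book cost, in whole cents."""
    return (grant_dollars * 100) // cents_per_book


# $1 million at five cents per book
print(implied_book_count(1_000_000, 5))
```

So "five cents per book" is the same as assuming roughly twenty million books.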
Scanning a book is easy; it simply involves taking pictures. You can slice the spine off and take pictures of each page, or use one of the panoply of non-destructive machines that correct the page-warping effects of an open book. This is not particularly hard or expensive.
Damn straight. The OCR process is the hardest part, so of course they wouldn't allow others access to highly valuable text. They might have a million books "scanned" this year, but each page has to be OCRed. Most people don't decouple those operations and assume that after scanning the hard part is over. Say each book has 300 pages, so we're talking about running 300 million pages of text through OCR. Now you've got a real problem. How does one know if a page of a book is OCRed correctly? You can pay a human or even a large team of humans to QA the text, but even then you can only spot check here and there. A 99.99% correct OCR program will mess up on the equivalent of 150,000 pages of text a year (spread out more or less uniformly across the 300 million). Also, not all pages of books are scannable (pictures, weird fonts, weird page layouts), and then there are headaches with keeping track of related editions of a book, multiple editions of books, displaying pictures in the reader that you don't have copyright to (which I think always gets glossed over in these sorts of articles), 10-digit to 13-digit ISBNs, etc. So yes, they aren't going to allow others access to the text, because it's hard and expensive to produce: you can only automate so much if you want to ensure the accuracy of the text itself (I think Google does). If they opened the text up, what would stop competitors from simply adding the data to their search engines after the difficult part is over? Google does no evil but they aren't stupid.
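The scale argument can be checked with a few lines of arithmetic. The sketch below uses only the figures from the comment (a million books, 300 pages each) and assumes, for simplicity, that the accuracy figure applies per page; the comment may have meant per character or something else. Under that per-page reading, 99.99% works out to 30,000 bad pages a year, and the 150,000 figure corresponds to 99.95%. Either way it is far too many to proofread by hand.

```python
from fractions import Fraction

BOOKS_PER_YEAR = 1_000_000
PAGES_PER_BOOK = 300
TOTAL_PAGES = BOOKS_PER_YEAR * PAGES_PER_BOOK  # 300 million pages


def bad_pages(total_pages, accuracy):
    """Pages expected to come out wrong if `accuracy` is a per-page success rate."""
    return total_pages * (1 - accuracy)


# Exact fractions avoid floating-point noise in the percentages.
print(bad_pages(TOTAL_PAGES, Fraction(9999, 10000)))  # 99.99% per page
print(bad_pages(TOTAL_PAGES, Fraction(9995, 10000)))  # 99.95% per page
```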
is about the biggest lie perp'ed on mankind. Google is the last company aside from the obvious M$ that I would want to control anything. They are about inflated stock, and making you see ads online. Are we all that stupid that we believe Google-ganda?
... and copyright extension then, since that is also dominating our culture now...?
yeah, thought not. copyright enforcement is only demanded by those who can control it, and it's sheer brilliance that they turned a civil law issue into a criminal one and thus got the gov't to pay the copyright holder's costs!
Google 'Do No Evil' ... is about the biggest lie perp'ed on mankind. Google is the last company aside from the obvious M$ that I would want to control anything. They are about inflated stock, and making you see ads online. Are we all that stupid that we believe Google-ganda?
Oh, do calm down... They never claimed "we do absolutely no evil whatsoever", it's more like - the founders happen to think that "evil should not be done". What's a lie about that? Also, how does inflated stock make them evil?
And how, pray, are they supposed to survive without the adverts? Never mind the fact that Google didn't actually come up with online advertising but were pretty much the first ones to run targeted, non-offensive (as in, no flashing banners, pop-ups, etc.) ads.
I'm no Google fanboy, although I happily use many of their services. But I don't think there's anything inherently wrong with them, and I find it somewhat sad to see this paranoid drivel modded up to +3 Insightful.
Basilisk Digital
Oh damn! You really nailed Google there. They're all about making you see ads. Oh man, they're never going to live that tongue-lashing down. I bet their PR people are going nuts trying to figure out how to clean this mess up.
Are you angry because Google suspended the SOAP API? Or are you just a grumpy troll?
It breaks my pluginses, my precious!
Outer darkness time for you ;)
I'm kind of baffled why people are talking about starting up new projects or Open Sourcing (tm) Google's project (whatever that means...).
Project Gutenberg is open and non-proprietary (ASCII text) and has been for quite a while.
After scanning, they use a distributed proofreading system where volunteers compare a scanned page image to the OCR text for errors. If you've got some free time, consider helping out.
Do you even lift?
These aren't the 'roids you're looking for.
The only thing left is for the two CEOs to have a drink and snicker about the foolish peasants known as Internet users. They control the content, access, and media.
No, I don't think that they'll hold to it forever. I suspect that once the founders are gone, things will erode until that motto will go the way of the dinosaur except for its PR function.
That said, based on what they're *doing* (and not what they're merely saying), they're at least making a reasonable effort to live up to an ideal, and that's a hell of a lot more than I can say for any other corp.
In other words, I'll retain some loyalty to Google so long as it shows some loyalty to us. Like I said, they'll probably let us down someday and that'll be the time to ditch them, but at the same time, it's stupid not to enjoy the good while it lasts.
You folks do realise that Google returns the books after they scan them so they'll still be in the libraries afterwards right? So how does this reduce their availability?
Nihil Illegitemi Carborvndvm
Most of these people focus on English-language books printed in the 19th and early 20th centuries, because (1) it's usually easy to determine copyright status, and (2) if you go earlier you get the tall "s" (ſ in UTF-8) which no OCR program today seems able to handle, so the scanning cost is increased.
Scanning with a flat-bed scanner basically wrecks the binding. So the books probably need to be rebound afterwards, or can be discarded.
There are photography setups (e.g. Phase One has one) but the resolution is too low, even with a 40 megapixel medium-format camera (yes, they are used for this). A little high-school mathematics (e.g. Nyquist) and the back of an envelope, combined with some measurements, will show that if you scan engravings at under 1200dpi, you will lose a lot of detail, and indeed, compare for example the Alice in Wonderland pictures on my own site with the Project Gutenberg ones. You can read the engraver's signature on most of the ones I have. Yes, the bandwidth needed to host higher resolution images is greater (which is why I have ads, sorry). But it's worth it.
Some of these books will never be scanned again. Even for OCR, 400dpi grayscale seems a minimum for footnotes and other small text even in English.
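The back-of-the-envelope version can be sketched like this. The 40 megapixel figure and the 1200dpi and 400dpi thresholds come from the comment; the 4:3 sensor aspect ratio and the 9 x 6 inch page are assumptions picked only for illustration, and the helper names are made up.

```python
import math


def sensor_pixels(megapixels, aspect_w=4, aspect_h=3):
    """Pixel counts along the long and short sides for an assumed aspect ratio."""
    total = megapixels * 1_000_000
    long_side = math.sqrt(total * aspect_w / aspect_h)
    short_side = math.sqrt(total * aspect_h / aspect_w)
    return long_side, short_side


def effective_dpi(megapixels, page_long_in, page_short_in):
    """Resolution on the page when one camera frame covers the whole page."""
    long_px, short_px = sensor_pixels(megapixels)
    return min(long_px / page_long_in, short_px / page_short_in)


def finest_line_pair_mm(dpi):
    """Nyquist: a dark/light line pair needs at least two samples across it."""
    return 2 / dpi * 25.4


print(round(effective_dpi(40, page_long_in=9, page_short_in=6)))  # well under 1200
print(round(finest_line_pair_mm(1200), 3))  # line-pair spacing resolvable at 1200dpi
print(round(finest_line_pair_mm(400), 3))   # line-pair spacing resolvable at 400dpi
```

Under those assumptions a single 40 megapixel frame lands at a bit over 800dpi on the page, which is why it falls short for fine engraving work even though it is plenty for ordinary body text.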
I'd also like to see more interfaces like the Project Gutenberg Distributed Proofreaders' site where people can submit corrections. Maybe use a wiki for the transcription??
Liam
Live barefoot!
free engravings/woodcuts
In 25 years they will determine that Google's library is incomplete and start OCR shotgunning books down camera-filled canvas chutes.
A brief protest will be launched, but all the kids will be too busy with their new fangled wearables and feelie parks to care.
I'll just use my special getting high powers one more time...
Did someone break their legs?
See that big building downtown with all the books in it?
Oh wait, get up from your desk, go outside (yes I know, it burns...), get on the bus and go downtown.
OK, now see the big building with the strange letters "LIBRARY" on the front? OK, that's the one, go inside... see all the books?
Now go up to the attendant at the desk and tell them your name and address and show a piece of photo ID. The nice person will give you a card that you can use to borrow books.
What's a book? OK, it's many pages of paper bound together, usually with glue and string. On each of these "pages" you will find ink (a dye) in the pattern of letters that form words and sentences and paragraphs.
Usually, these "books" tell a story or provide organised information.
Now go ahead, pick one out - they'll even let you take it home for a week or two so you can read it. For free!
You can browse the stacks (a colloquialism for those big shelves with books on them) which are organised according to a system known as the Dewey Decimal System. You can use a revolutionary piece of technology known as a "card catalog" to indicate the position of the title you seek on the stacks (though many libraries have this same catalog searchable from computer terminals).
It's revolutionary, I know. But there you have it, free information and entertainment, enough to last a lifetime, with a "less restrictive approach".
Enjoy.
Google is scaning books why arent they doing the same for music and vids so we can get them for FREE?!?
i dont care about books thats too much work!!! i want free music and vids!!! it dont cost them nothing when its on p2pand besides they dont pay the artists nyway and copyrights are bad bcause they infring on our right to be entertaine. its a hole nu paradim and they should be LISTENING to us!!!
its all ones and 0's and they are tryin to CHARGE us for them, can you beleve it?!?
I am so fucking angry right now, I can barly type! Fucking greedy corps! Google shuld fiht them! They should just make ALL music and vids FREE, and FUCK the other corps that are so gredy!
They have BILLIONS of dollars and if they arnt GREEDY, THEY do that for us!
FUCK the RIAA! FUCK the MPAA! FUCK, FUCK, FUCK!
I WANT TO BE ENTERTAIND FOR FREE! I DESERVE IT BECAUSE I DO!
YOU ARE GREEDY IF YOU DONT LET ME DO IT!
Fucking greedy corps! Its all shit anyway! I wouldnt pay for it anyhow, so I should be able to just download it for free because it isnt worth anything! I PAY my internet!
And im not rich like those fucking bastards! I shouldnt have to pay for music and vids they already have enugh money! I should get it for FREE!
"And how, pray, are they supposed to survive without the adverts?"
Don't know about you, but I would pop for a yearly subscription for a *good quality* search engine that had a toggle for "with adverts" or "no adverts". Not sure how much I would spend; that would depend on how good they were at filtering out link farms, etc., but some reasonable fee to have the option of no ads. And then websites might have an inducement to restrict use of ads to at least the interior pages and not the main public-facing page. Ads there just suck.
Right now I would classify the free google search with ads as being of medium quality until you get good at it with a lot of -restrict this and that word added to your query and learning wild cards and domain restrictions, etc. In fact, I wish google had one simple option on their main page, split their search bar in two by default, one side is for words/phrases you are looking for, the other side is what you want to immediately filter out. For example if you add -sale, you eliminate a lot of commercial sites. Dogsquat simple, hardly anyone does it.
Google is good once you learn to use it. Used the default way, like most people use it, though, it's just a fancy yellow pages.
whoever modded the gp down is obviously a fanboi, a faggot, a rump roaster, a dicksucker, a fucktard and a bush supporter. fucking faggots ruin it for everyone else with their ass fucking aids disease.
and if you're a fag reading this you're useless and you're a shithead. go fuck yourself.
> Hammer away on Google's servers and they will cut you off
It'd be hard for them to defend against a bandwidth-limited, widely distributed effort.
Anyone want a crack at writing "UnfoldingClassics@Home" ?
One thing that bugs the heck out of me with Google is their "Oh, we will do this because we have the rights" attitude, yet if you want to use their stuff you need EXPLICIT permission. http://www.google.com/permissions/index.html
" All of Google's trademarks, logos, web pages, screen shots, or other distinctive features ("Google Brand Features") are protected by applicable trademark, copyright, and other intellectual property laws. If you would like to use any of Google Brand Features on your website, in an advertisement, in an article or book, or reproduce them anywhere else, you must first receive Google's permission. We've tried to make this process as painless as possible."
Funny, Google wants you to get permission, and they are saying there's no such thing as fair use. YET they want publishers to opt out...
Google is hypocritical!
"You can't make a race horse of a pig"
"No," said Samuel, "but you can make a very fast pig"
Given Microsoft's history on intellectual property, the complaints of the OCA would be a lot more credible if Microsoft weren't a part of it.
it was just a stupid business decision to waste effort on doing copyrighted books in general, on an opt-out basis. The controversy about the copyrighted books has dragged the PD books down with it
The number of books that are clearly out of copyright is actually quite small (most books are in a gray area), so doing just them isn't very useful.
But more important: what is being "dragged down"? There's a lot of chest beating by people with strong interest in keeping control over printed materials and distribution channels, but nothing really substantial has happened.
It takes many thousands of years for even uncommon languages to disappear. And if they were even remotely similar to our own, they can be deciphered without any advanced knowledge. So, I'd be worried about the long-term chances of a complex language like Chinese to be preserved, but anything with Latin roots, that uses a small alphabet should do fine.
A thousand years for a language to disappear? All it takes is a generation who doesn't speak it and it might as well be considered gone. A language is often two interdependent parts - spoken and written. Often - but not always. You could take the shining example of the Canadian approach to the First Nations peoples in the last century, where students were forced to learn exclusively in English, rather than their native tongue. An entire generation suddenly loses contact with the language of their parents. That would be devastating enough for, say, French speakers. Now consider that most of the First Nations languages and dialects have no written form. Needless to say, in hindsight apologies have been made, but it certainly wiped out dialects that had survived centuries until then.
I think the corollary in IT is also important. Any physical media which is not used for a generation of technology (maybe less than 10 years) quickly becomes difficult to read as the machinery required to read it fails. Wait thirty years and it will cost you many times over to retrieve that information. The only hope for a lot of old data is to constantly move it onto the ubiquitous storage of the day, time after time. Anything missed will, sooner rather than later, be lost.
Cheers,
Toby Haynes
Anything I post is strictly my own thoughts and doesn't necessarily have anything to do with the opinions of IBM.
Surely we can speed up this process by simply asking the publishers to make available the original digital Latex or SGML files for all books printed since the late 70s right?
Why invest hundreds of hours on scan/ocr/qa for texts which already exist in a digital format?
A fool throws a stone into a well and a thousand sages can not remove it.
That is, if we imagine a digital archive to function like its plain-paper counterpart: huge underground stores with shelves full of discs.
But if we're a little bit realistic we should realise that, in the current age of internet and digital information, the data doesn't have to remain fixed on a specific medium. The ability to make perfect copies is basically inherent to the nature of digital data.
The problem of preservation is no longer preserving a single old medium, but keeping a copy of the data as the storage medium is progressively upgraded.
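A minimal sketch of that kind of migration, assuming the old and new media are simply two mounted directories: copy every file across and compare checksums so a bad copy is caught at the moment of the move rather than decades later. The function names and the choice of SHA-256 are illustrative assumptions, not any archive's actual procedure.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Checksum a file in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(old_medium, new_medium):
    """Copy every file to the new medium and verify each copy by checksum."""
    old_medium, new_medium = Path(old_medium), Path(new_medium)
    verified = {}
    for source in sorted(old_medium.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(old_medium)
        target = new_medium / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies contents and timestamps
        checksum = sha256_of(target)
        if checksum != sha256_of(source):
            raise IOError(f"copy of {relative} does not match the original")
        verified[relative.as_posix()] = checksum
    return verified
```

Keeping the returned checksum list alongside the data gives the next migration something to verify against.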
Think about it: every time you upgrade a hard disk in your computer, you keep your old data (you either copy your old partition or copy your files). Some of the files you have kept around out of nostalgia may come from very old computers that can't be found anymore. (On the system I'm writing on, I still have some games I programmed in BASIC when I was a kid a long time ago. The original floppy may have rotted, but there's still a copy of the
According to your argument, software could NOT be found for old vintage computers, home computers, game consoles and arcade machines, because most of the discs have rotted, the ROM boards may be broken and/or not readable by any modern hardware, etc.
But in reality you can google for any classic emulation site and still find disc and ROM images. Digital data is easy to copy around. The medium may have changed: data was moved from ROMs and 8" / 5.25" / 2" floppies to hard disks, then to images inside ZIP files on the internet.
Granted, the medium itself will never again be a medium. The single biggest problem that we will face is the readers. For all this marvellous "survival through digital copy" to function, the data needs to be accessed and copied in the first place.
Sadly, with all the DRM systems that are appearing and restricting the possibility of copying digital data, preservation will be much more difficult.
DRM : Bringing you a new dark age.
"Sufficiently advanced satire is indistinguishable from reality." - [Tips: 1DrYakQDKCQ6y52z6QbnkxHXAocMZJE61o ]
It's called "Lots of Copies Keeps Stuff Safe" and it's even got standards and software.
I just read
You could read this as.. What?? Money? We want money, give us some money. We want our share. Why can't we have our share?? Wahhhhh. We want money.
Mean what you say...say what you mean.
Just how does Google scanning a book prevent anyone else from doing the same? Does Google own the only copy? I doubt that. This seems like much ado about nothing, or an outright grab to force Google to share what they put the effort into creating in the first place. And I'll bet the sharing is expected to be Free.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
If Google (or Microsoft's www.Books.Live.Com) wants to open this up to us, they can do one of 2 things:
(1) provide a complete index, possibly sortable, so I can have an easy set of links to mirror
(2) send a backup to a company that will sell the DVD version of their collection. One company makes money selling $1 DVDs at dollar stores, and we all see 4-disc sets of John Wayne videos at Wal-Mart for $5.50. What if Microsoft or Google sent such a company a backup of their book collection once a year, and complete sets of, say, 25 DVD-9 discs (2TB) were available for $50? Bean counters can raise the price, but as long as we are free to copy them, I'm sure that universities would be willing to buy
Andy Out!
... a company like Google that has embraced 'Don't Be Evil' as its creed
Now that you mention it, so have the Christian Church, the Muslims and in fact most of the other religions. As have such magnificent luminaries as George Bush and Tony Blair. Well, more or less.
Moral: You can't trust people that try to use that kind of 'creed' as a selling point.
http://gutenberg.org/
Last modified this month.
I think Project Gutenberg is still around.
There is a fine line between recklessness and courage... -- Paul McCartney