Google Books As "Train Wreck" For Scholars
Following up on our earlier discussion, here's more detail on Geoffrey Nunberg's argument that Google Books could prove detrimental to academics and other scholars. Recently Nunberg gave a talk at a conference claiming that the metadata in Google Books is riddled with errors and is classified in a scheme unfit for scholarly use. This blog post was fleshed out somewhat a few days later in the Chronicle of Higher Education. Quoting from the latter: "Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, [and] Stephen King's Christine... A search on 'internet' in books written before 1950 and turns up 527 hits. ... [Google blames some errors on the originating libraries.] ...the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves. ... In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore." The head of metadata for Google Books, Jon Orwant, has responded in detail to Nunberg's complaints in a comment on the original blog post, and says his team has already fixed the errors that Nunberg so helpfully pointed out.
> The odd thing about complaining about this is, what are they comparing
> to? A hypothetical perfect online database that doesn't exist anyways?
That's exactly why this article is little more than some long-winded trolling. So the metadata is wrong... As long as the books themselves are perfectly fine (which they seem to be), you can always check the metadata yourself. I have to think that as far as Google is concerned (and 99+% of its users) the metadata isn't nearly as important as the data itself. Once the data is collected you can always fix the rest.
Expect a new "tagging game" in the next year or two to manually correct these errors.
And this is no exception. Before Google Books you had access to books from various libraries, books you owned, books you could borrow from friends (*shock* *gasp* copyright infringement), books you could buy and books from non-Google online sources. Now you have access to all of those and additionally Google Books. Even if Google Books is 99% "piece of shit" (which in my experience is simply not true, but nevertheless) you still have the 1% potentially useful material available that wasn't available before, so you win.
This is much like Google itself.
Google's brilliance, and woe, is its sloppy imprecision.
You type in a query. It returns a bunch of stuff. Quite a lot of it is irrelevant and is perceived as not meeting the requirements of the search, but you don't mind because all you care about is that it finds what you want, not that it finds other stuff. Unfortunately, Google is so good that it tricks you into believing that it always finds everything that matches your query. But, of course, there's no way to find out what it _missed_.
I've personally noticed and been puzzled by the publication dates. I'd noticed it particularly with periodicals. What seems to be the case here is that Google is very prone to give the date that a journal began publication as the publication date of every article that has ever appeared in that journal.
Wikipedia editors are well aware of the dangers of using Google hit counts as data. It's amusing to see that there are 1,930,000 hits on "Ghandi" compared to 22,900,000 for "Gandhi" and conclude that Gandhi's name is misspelled 10% of the time... or to notice, as I have, that that percentage is increasing and project the year in which "Ghandi" must inevitably become the accepted spelling... but it is, as they say, "for amusement purposes only."
"How to Do Nothing," kids activities, back in print!
Definitely. It's like, "Oh, look, I found an error. If I had done this, that error wouldn't be there!!" And to that I respond, then do it yourself. YOU go tack metadata onto the 100 million books they have, you smug egocentric bastard.
And, of course, he completely ignores the 999,999 proper entries compared to the 1 error. Google seems to know there are lots of problems here, and they're not going to get it right on the first pass. But having a first pass at all is better than nothing.
Yes, having all of the world's literature available for instant full text search sounds disastrous for scholars.
They pushed the copyright term to over a hundred years (just to make sure they will make money off writers even after they are dead), and now comes our big brother Google into the ring to resurrect all the OUT OF COPYRIGHT books -- meaning those dead books that publishers no longer exclusively distribute. What an offense against the poor publishers. Google is creating a real e-Library of enormous proportions of virtually free books, what a threat. I bet I am not the only one who wants to see Newton's books on physics e-published again and searchable.
Sorry if I sound bitter, but I spent a lot of time reading this crap, and very little of it was as insightful or interesting as even my classmates' comments.
That sounds like more of a you problem than an academia problem. If you don't enjoy using a work's minutiae to accuse perfectly innocent authors of misogyny, innuendo, (to add a couple you forgot) blatant colonialism or latent homosexuality, what the fuck were you doing in an English Lit program? The rest of us live for that shit.
As someone who should not have majored in English Literature in college
There. I fixed it for you.
How about good old fashioned legwork? It *is* possible to make sure that the metadata is consistent with the facts, but that involves doing actual research and verification such as academics have been doing for hundreds of years.
Then you have very low standards indeed. There's absolutely no reason why a single entity had to / has to scan all the world's back catalog on their own as fast as they can. It's pure commercial greed, and leads to the garbage we have on the net today.
What is needed is an open standard for scanned works, with minimum resolution, minimum quality, and minimum verified metadata such as subject, author, publisher, year etc. All those are trivially listed on the title page of every book. All one has to do is open the damn book and flip a few pages, but that appears to be too hard for some people.
This is a long-term project for humanity. There's absolutely no point in having crappy scans with garbage metadata available quickly today, when it could be available correctly with good quality in, say, five years. It's also a perfect case for crowdsourcing, with some real standards to ensure quality.
The current dreck that's online only causes duplication and waste. Take a look someday at archive.org (for example), and see how many copies of the same book are available, if it's a popular book. You'll typically find 5-10 scanned versions, by Google, Microsoft, and various local library projects, in black and white or colour, none of which is truly good quality: broken characters, pages with dark margins, missing pages, typos or incorrect titles, wrong authors etc.
Why did they bother?