Google Books As "Train Wreck" For Scholars

Posted by kdawson on Monday September 7, 2009 @11:52AM from the mishmash-wrapped-in-a-muddle dept.

Following up on our earlier discussion, here's more detail on Geoffrey Nunberg's argument that Google Books could prove detrimental to academics and other scholars. Recently Nunberg gave a talk at a conference claiming that the metadata in Google Books is riddled with errors and is classified in a scheme unfit for scholarly use. This blog post was fleshed out somewhat a few days later in the Chronicle of Higher Education. Quoting from the latter: "Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, [and] Stephen King's Christine... A search on 'internet' in books written before 1950 turns up 527 hits. ... [Google blames some errors on the originating libraries.] ...the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves. ... In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore." The head of metadata for Google Books, Jon Orwant, has responded in detail to Nunberg's complaints in a comment on the original blog post — and says his team has already fixed the errors that Nunberg so helpfully pointed out.


Re:Who needs metadata any more by timeOday · 2009-09-07 12:03 · Score: 5, Interesting

To read the article, it is mostly a problem for people who are essentially studying trends in metadata itself, such as the emergence of some particular word over time. As for the "oddball" categorizations, I agree: why would anybody browse the "technology" section of a collection with millions of titles?
The odd thing about complaining about this is, what are they comparing it to? A hypothetical perfect online database that doesn't exist anyway? The article says Google got it wrong in some cases where, e.g., the Harvard Library got it right. OK, that's an issue for all of us deciding whether to search on our nearest computer or at the Harvard library.
To me, Google's project was a long time coming - somebody had to scan the world's back catalog. Maybe it would be better if governments had done it, but (and this is the point) they didn't. Google is.
Anonymous Coward by Anonymous Coward · 2009-09-07 12:28 · Score: 5, Interesting

Google has scanned many volumes of the Laws of Indiana, which go back to 1816. These are the session laws of the Indiana General Assembly and have never been copyrighted. However, Google has arbitrarily decided not to make most post-1922 volumes it has digitized, and even some pre-1922 volumes (e.g. 1877, 1893, 1895, 1909, 1917 and 1918), available, using the claim of copyright.
Google has done all the decision-making here. Anyone who might object to the classification of one of these volumes as copyrighted and thus available in "snippet-view only" presumably would have the burden of proving the contrary. (And where would you even start? Who would you contact? I have seen nothing on this.)
Once (or if) the settlement is approved early this fall, Google's "rights" attach to these volumes. If I understand correctly, at that point any individual who wishes to access one of these volumes of Indiana's session laws not already in "full view" will have to pay for it, and for the money will obtain only individual rights, NOT the right to make it freely available to others.
Broader implications: Finally, this analysis has been limited to volumes of Indiana session laws, but surely similar situations exist more broadly.
For more on this, see this Aug. 2, 2009 Indiana Law Blog entry: http://indianalawblog.com/archives/2009/08/courts_my_probl.html
Why Isn't Google Books A Library? by LifesABeach · 2009-09-07 12:31 · Score: 4, Interesting

With all the class act talent that Google hires right out of college, why can't Google create its own Public Library on the Internet? Chrome could be the entryway to any book that is in the Public Domain, or available with the author's written permission. Turning the page of a book could be as simple as the [Back] or [Next] button. The "Card Catalog" would be a No-Brainer. No Library goes through this many hops. There's even translation to other languages, Braille, and Audio; from my viewpoint, this SHOULD be the challenge, not what word category is or isn't. If it's a case of "buy the book", then buy 10 copies of "Gone with the Wind" and allow ONLY up to 10 readers to read "Gone with the Wind". Google could even have a "Google Online Library Card"; this is where the company hums "Ka-Ching".
Card catalogs by dpbsmith · 2009-09-07 12:49 · Score: 5, Interesting

Tangential, but "card catalogs." Ha! I once had a compelling need to look up an article in the Occasional Papers of the Bingham Oceanographic Collection. So I went to the card catalog.
It wasn't under O. It wasn't under P. It wasn't under B. It wasn't under C.
It was under N.
Why? Because, naturally, as of course everybody knows, the Bingham Oceanographic Collection is part of the Peabody Museum. Which is part of Yale. Which (drum roll...)... ...is in New Haven.
The great thing here is that you can't even say there was an error in the card catalog, unless filing something under a heading that is perfectly correct, but under which nobody would dream of looking for it, is considered an error.

--
"How to Do Nothing," kids activities, back in print!
Re:Who needs metadata any more by Potor · 2009-09-07 12:49 · Score: 5, Interesting

Exactly. And the whole argument totally ignores the fact that these books are now easily available.
Shock horror: I am a liberal arts scholar. And Google Books has helped me incredibly in a project I am doing on an 18th-century scholar. I have original texts in various editions at my fingertips, wonderful reference books (including a dozen 18th- and 19th-century Latin grammars), and serious secondary literature. Not all of these are fully posted on Google Books, but now I know what books to check out of the library, or even buy.
As an arts scholar, I love Google Books.
Re:"scholarly" information by moosesocks · 2009-09-07 14:12 · Score: 3, Interesting

Actually, the GP's got a good point. Back in college, I took a number of humanities courses whenever I could squeeze them into my schedule.
I can say from firsthand experience that there are a lot of "scholarly" articles that are complete and total crap. When writing papers, I'd frequently peruse JSTOR for pertinent articles about my topic, keeping an eye out for particularly good articles, as well as the heinously bad ones. Picking apart and systematically disproving a bad paper published in a "good" journal was an easy ticket to an 'A' on the paper.
These papers, of course, were certainly the exception. Most scholarly papers I encounter are humbling in their brilliance. However, I've seen more than a few bad journal articles, as well as quite a few blog entries that would be worthy of scholarly publication. It's hard to make any generalizations about the validity of certain sources of information.
Unfortunately, Physics wasn't quite as easy to bullshit. (Random aside: The physical sciences certainly have their fair share of bad journal articles, especially in light of the fact that printed media is a terrible means by which to communicate scientific results. It's a cruel irony that the www was invented to enable collaboration and information exchange between scientists, but is rarely (if ever) used for that purpose. Also, any use of the word 'trivial' or its synonyms needs to be punishable by death.)
PS. Don't judge our writing abilities based upon our Slashdot comments. I'm sure the GP had his own reasons for majoring in English, even though literary discourse is often trite and contrived.

--
-- If you try to fail and succeed, which have you done? - Uli's moose
Re:Incredible arrogance of the "scholar" by grcumb · 2009-09-07 16:26 · Score: 3, Interesting

And I think he's entirely off-base. Nose-in-the-air "Scholars" like this gentleman fail to recognize that Google's efforts are about making material available to "the rest of us" who don't have access to those major research libraries. And categorical indexing of material makes complete and total sense if you expect to have non-PhD sorts searching for it.
You're fighting the wrong battle here. It's easy to find any number of legitimately nasty things about 'Scholars' and 'Academics' and elitism in general. But arguing for proper classification in Google Books is not one of them.
For several years I was an avid amateur of Information Retrieval. Classification (and other useful organisational models) of information into related collections is essential when you don't know what keywords you're looking for. This is especially important with historical works, where the use of 21st Century names, terms and other common keywords is next to useless.
Google search is useful when you know what you're searching for. But knowing what to look for in Google Books is an entirely different matter. Categorisation matters here.
By using a classification system that is designed for book sellers, Google's chosen a very poor set of criteria. Not only will most of the titles be poorly characterised (and thus harder to find), the effort required to find them increases with their rarity or uniqueness. These aren't always a measure of importance or interest, but often enough, they are.
Asking Google to consider a proven, effective and well-understood categorisation system is not being snooty; it's an effort to suggest - as we geeks often do - that there might actually be a correct way to perform this task.
Sometimes what looks like 'arrogance' is actually the state of being right about something when no one else will listen.

--
Crumb's Corollary: Never bring a knife to a bun fight.
Re:Who needs metadata any more by introspekt.i · 2009-09-07 18:37 · Score: 3, Interesting

You act like the technology and processes used to generate this catalog are going to remain deficient indefinitely. You ignore the fact that consumer demand for better (metadata|accuracy|whathaveyou) will drive improvements in the technology. In the meantime, we get access to the early iterations of the technology and the benefits it can provide today.

What is needed is an open standard for scanned works, with minimum resolution, minimum quality, and minimum verified metadata such as subject, author, publisher, year etc.
Necessity is the mother of invention. Wait for one to pop up, or go make one up. Nobody's stopping you.

All those are trivially listed on the title page of every book. All one has to do is open the damn book and flip a few pages, but that appears to be too hard for some people.
Opening the covers of every possible resource you use is quite easy when you have a discrete, present set of resources to thumb through. What if your resources aren't present, are high in number, or (lo!) are undefined...because you don't even know what exactly it is you're looking for?

This is a long term project for humanity. There's absolutely no point in having crappy scans with garbage metadata available quickly today, when it could be available correctly with good quality in say five years.
I think you're absolutely wrong. It's naive to assume we can just have an instant rubber-meets-the-road system available in x years without rigorous testing and input on the part of users. No point? Hah! This is absolutely the best way to go about things! Let the system work itself out with angry users pushing technicians to improve archives to have the best working system in the end. The Google system is hardly "done" and it's only going to get better with time.

The current dreck that's online only causes duplication and waste. Take a look someday at archive.org (for example), and see how many copies of the same book are available, if it's a popular book.
God forbid we have multiple copies of popular books in different archives.

black and white or colour none of which is truly good quality: broken characters, pages with dark margins, missing pages, typos or incorrect titles, wrong authors etc.
Quality is relative. Why prohibit use because we lack perfection?

Why did they bother?
Why did you bother? Why did I bother? Why does anybody bother? Probably because we all feel like it.