Google Books As "Train Wreck" For Scholars
Following up on our earlier discussion, here's more detail on Geoffrey Nunberg's argument that Google Books could prove detrimental to academics and other scholars. Recently Nunberg gave a talk at a conference claiming that the metadata in Google Books is riddled with errors and is classified in a scheme unfit for scholarly use. This blog post was fleshed out somewhat a few days later in the Chronicle of Higher Education. Quoting from the latter: "Start with publication dates. To take Google's word for it, 1899 was a literary annus mirabilis, which saw the publication of Raymond Chandler's Killer in the Rain, The Portable Dorothy Parker, [and] Stephen King's Christine... A search on 'internet' in books written before 1950 turns up 527 hits. ... [Google blames some errors on the originating libraries.] ...the libraries can't be responsible for books mislabeled as Health and Fitness and Antiques and Collectibles, for the simple reason that those categories are drawn from the Book Industry Standards and Communications codes, which are used by the publishers to tell booksellers where to put books on the shelves. ... In short, Google has taken a group of the world's great research collections and returned them in the form of a suburban-mall bookstore." The head of metadata for Google Books, Jon Orwant, has responded in detail to Nunberg's complaints in a comment on the original blog post, and says his team has already fixed the errors that Nunberg so helpfully pointed out.
...when you have Search? Pick your own keywords.
Do not mock my vision of impractical footwear
I already read that quote about "inaccurate metadata" and "1899 was a literary annus mirabilis" half an hour ago when the first article was posted.
"I disapprove of what you say, but I will defend to the death your right to say it." - historian Evelyn Beatrice Hall
How? If you don't like it just ignore it.
So, the argument is that the new system is bad because it may have errors or bad data?
Were card catalogs immune to this? It's a database. It's only as good as what you put into it. A bad database is not useful. It just means someone needs to do it better. Honestly, if anything this seems like an argument that the database shouldn't be proprietary. It should be open to everyone so that someone can always make a better version of the metadata with the same base data.
"It's a piece of shit" shouldn't be the same argument as "nobody should even try it". The Wright brothers didn't exactly start out with a 747 or an F-35.
The road to tyranny has always been paved with claims of necessity.
We are trying to correctly amalgamate information about all the books in the world. (Which numbered precisely 168,178,719 when we counted them last Friday.)
- Jon Orwant (Google)
why does that number seem incredibly low to me?
How we know is more important than what we know.
Folks are afraid their citations might actually be checked in context? Or that equal access to public domain content gives the professional little more than a buttload of competition?
I have found Google Books invaluable for genealogy research, though I admit that their metadata and the file names are messed up. If you find several different volumes of a set, you have to rename them when you save them and be careful not to overwrite.
One huge gripe is that the PDFs do not include the OCR'd text so one can search within it. This is a huge oversight. I hope they will correct that someday.
Still, Google Books is the best solution that has come along. I hope they continue to improve it.
As someone who majored in English Literature in college, I can tell you that academics love getting their panties in a bunch over what is Scholarly Publication and what is not. Some teachers will actually have special assignments that have to be written entirely using Scholarly sources, or in response to a Scholarly article.
Before the advent of the internet, I can see how it might have been useful to have an in-group comprised of people who had some sort of qualifications to write about something, but it seems antiquated in light of the ease with which we can independently verify claims.
Usually, if someone's going to write something that's actually useful, they'll write an actual book. Soon thereafter, a bunch of "Scholars" will come along and write a bunch of journal articles and tell us all about how the useful work was one of three things: misogynistic, code for a religious statement, or arcane, carefully-hidden innuendo.
Sorry if I sound bitter, but I spent a lot of time reading this crap, and very little of it was as insightful or interesting as even my classmates' comments.
Google has scanned many volumes of the Laws of Indiana, which go back to 1816. These are the session laws of the Indiana General Assembly and have never been copyrighted. However, Google has arbitrarily decided not to make most post-1922 volumes it has digitized, and even some pre-1922 volumes (e.g. 1877, 1893, 1895, 1909, 1917 and 1918), available, using the claim of copyright.
Google has done all the decision-making here. Anyone who might object to the classification of one of these volumes as copyrighted and thus available in "snippet-view only" presumably would have the burden of proving the contrary. (And where would you even start? Who would you contact? I have seen nothing on this.)
Once (or if) the settlement is approved early this fall, Google's "rights" attach to these volumes. If I understand correctly, at that point any individual who wishes to access one of these volumes of Indiana's session laws not already in "full view" will have to pay for it, and for the money will obtain only individual rights, NOT the right to make it freely available to others.
Broader implications: Finally, this analysis has been limited to volumes of Indiana session laws, but surely similar situations exist more broadly.
For more on this, see this Aug. 2, 2009 Indiana Law Blog entry: http://indianalawblog.com/archives/2009/08/courts_my_probl.html
And this is no exception. Before google books you had access to books from various libraries, books you owned, books you could borrow from friends (*shock* *gasp* copyright infringement), books you could buy and books from non-google online sources. Now you have access to all of those and additionally google books. Even if google books is 99% "piece of shit" (which in my experience is simply not true, but nevertheless) you still have the 1% potentially useful material available that wasn't available before, so you win.
like shelving 'Life of an Iceberg' under biographies, but by and large they strive to be and are correct. If they mess up, some other library will fix the error. Libraries' cataloging data is usually centralized by OCLC so that the data is uniform throughout the country as other libraries pull from this central source for their own catalogs. Libraries also use a recognized and standardized subject scheme with a controlled vocabulary, not just a bunch of meta tags. Cataloging librarians are a rare and little-recognized breed of people who spend their entire professional lives trying to make it easier to gain access to material. The result is an organized body of knowledge--not just a heap of books on the floor in no particular order, like the Internet--and Google. For Google to blame libraries for their troubles is like blaming the Machinist Mates on the Titanic for crashing the ship into an iceberg. There, full circle. How did that happen?
How about a moderation of -1 pedantic.
With all the class act talent that Google hires right out of college, why can't Google create its own Public Library on the Internet? Chrome could be the entry way to any book that is in the Public Domain, or by the Author's written permission. Turning the page of a book could be as simple as the [Back], or [Next] button. The "Card Catalog" would be a No-Brainer. No Library goes through this many hops. There's even translation to other languages, Braille, and Audio; from my viewpoint, this SHOULD be the challenge, not what word category is or isn't. If it's a case of "buy the book", then buy 10 copies of "Gone with the Wind", and ONLY allow up to 10 readers to ONLY read "Gone with the Wind". Google could even have a "Google Online Library Card"; this is where the company hums "Ka-Ching".
The inline replies are written with a smug sense of self-entitlement as though he and other "scholars" are the only legitimate users of Google Books. It's NOT about you - you are not going to create enough adsense hits to make this whole thing worthwhile (or turn a profit).
... is that academics can't rely on Google Books to make their bibliographies, because the publication date and authorship information, which are used in all citation styles (MLA, Harvard, etc.), are incorrect on Google Books for an apparently large number of books. Categories aren't used in citations, they're used by searchers.
Jon Orwant of Google said that 1899 was a placeholder year for unknown publication dates, as provided by some of their metadata providers... which leads me to ask if they sanitise their data or do any research into publication dates themselves!
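For what it's worth, the kind of sanitising asked about above could start very small. Here is a minimal sketch in Python, assuming only what Orwant's reply states (that 1899 arrives as a stand-in for "unknown"). The record layout and the function name are hypothetical, and nothing here reflects how Google actually processes its feeds.

```python
from typing import Optional

# Assumption: 1899 is the only placeholder year, as described in the comment
# above. A real feed may use other sentinel values.
PLACEHOLDER_YEARS = {1899}

def year_needs_review(year: Optional[int]) -> bool:
    """Return True when a year is missing or matches a known placeholder."""
    return year is None or year in PLACEHOLDER_YEARS

# Hypothetical records, shaped only for this illustration.
records = [
    {"title": "Christine", "year": 1899},
    {"title": "Untitled pamphlet", "year": None},
    {"title": "Some later edition", "year": 1983},
]

flagged = [r["title"] for r in records if year_needs_review(r["year"])]
print(flagged)
```

Note that this flags rather than deletes: a book genuinely published in 1899 would be caught too, so somebody (or some second source) still has to confirm the date.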
This is much like Google itself.
Google's brilliance, and woe, is its sloppy imprecision.
You type in a query. It returns a bunch of stuff. Quite a lot of it is irrelevant and is perceived as not meeting the requirements of the search, but you don't mind because all you care about is that it finds what you want, not that it finds other stuff. Unfortunately, Google is so good that it tricks you into believing that it always finds everything that matches your query. But, of course, there's no way to find out what it _missed_.
I've personally noticed and been puzzled by the publication dates. I'd noticed it particularly with periodicals. What seems to be the case here is that Google is very prone to give the date that a journal began publication as the publication date of every article that has ever appeared in that journal.
Wikipedia editors are well aware of the dangers of using Google hit counts as data. It's amusing to see that there are 1,930,000 hits on "Ghandi" compared to 22,900,000 for "Gandhi" and conclude that Gandhi's name is misspelled 10% of the time... or to notice, as I have, that that percentage is increasing and project the year in which "Ghandi" must inevitably become the accepted spelling... but it is, as they say, "for amusement purposes only."
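For the curious, the arithmetic behind that back-of-the-envelope figure is easy to check. A small sketch using the two hit counts quoted above; the function name is made up for this example, and the numbers are hit counts, not a measurement of how people actually spell the name.

```python
def misspelling_share(misspelled_hits: int, correct_hits: int) -> float:
    """Fraction of all hits (both spellings) that use the misspelling."""
    return misspelled_hits / (misspelled_hits + correct_hits)

ghandi_hits = 1_930_000    # hits for "Ghandi", as quoted above
gandhi_hits = 22_900_000   # hits for "Gandhi", as quoted above

share = misspelling_share(ghandi_hits, gandhi_hits)
ratio = ghandi_hits / gandhi_hits

print(f"share of all hits: {share:.1%}")   # about 7.8%
print(f"ratio to 'Gandhi': {ratio:.1%}")   # about 8.4%
```

Either way you slice it the result is a bit under the "10%" in the comment, which only underlines the point that these counts are for amusement purposes.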
"How to Do Nothing," kids activities, back in print!
Yes, having all of the world's literature available for instant full text search sounds disastrous for scholars.
Where are we going and why are we in a handbasket?
Tangential, but "card catalogs." Ha! I once had a compelling need to look up an article in the Occasional Papers of the Bingham Oceanographic Collection. So I went to the card catalog.
It wasn't under O. It wasn't under P. It wasn't under B. It wasn't under C.
It was under N.
Why? Because, naturally, as of course everybody knows, the Bingham Oceanographic Collection is part of the Peabody Museum. Which is part of Yale. Which (drum roll...)... ...is in New Haven.
The great thing here is that you can't even say there was an error in the card catalog, unless filing something under a heading that is perfectly correct, but under which nobody would dream of looking for it, is considered an error.
"How to Do Nothing," kids activities, back in print!
*grabs popcorn*
They pushed the copyright law to over a hundred years (just to make sure they will make money off writers even after they are dead), now comes our big brother Google to the ring to resurrect all the OUT OF COPYRIGHT books -- meaning those dead books that publishers no longer exclusively distribute. What an offense against the poor publishers. Google is creating a real e-Library of enormous proportions of virtually free books, what a threat. I bet I am not the only one who wants to see Newton's books on physics e-published again and searchable.
The impression I get from these stories is that once Google scans them, no one else can. Is that somehow the case?
"Thanks for all the money you paid to us. We've used it to buy off ISO among other things" -Microsoft
Please give it a rest, anyone can scan all the books they want and post them online. The only problem is that the law hasn't established an efficient way to get the right to post books online. If Google had tried to do this under the current laws they would have had to figure out who owned the rights to every book. Imagine how much the internet would suck if search engines had to do the same thing.
Also to get back to the topic at hand, it looks like they are trying to fix this as best they can and libraries have errors in them, it happens. zomg.
Not the fault of the publishers who "mislabel" their books. Google should be ashamed. Bad Google!
Of course perhaps the "Academics" should get off their arses and actually do some real research instead of taking everything provided to them by Google books as fact. But of course - Google made them do whatever it is that they do wrong.
And we haven't even gotten into what Microsoft's involvement is!
If Google books was done on an iphone then everything would be ok...
I wish these people would just quit their whining about Google and its book scanning. If you don't like what Google is doing go scan them yourselves. Google is creating something that never existed before - a large repository of the history of books in digital, searchable, available form - and all I hear is complaining. I don't believe that Google has an exclusive on this. I don't believe that their agreements to scan books preclude anyone else from undertaking the same project. And with technology improving in capability and cost every year it might even be cheaper for a latecomer to duplicate this feat. BUT SHUT UP ABOUT IT! If you don't like it then just go away and pretend it never existed and absolutely nothing else in your life will have changed.
Personally, I'm glad Google is doing it. Think of the screams of pain if it was Microsoft doing this.
"It's the height of ridiculousness to say for those 9 lines you get hundreds of millions."
As an aspiring academic half way through a philosophy Ph. D., I find Nunberg's argument pretty absurd. Google books is a godsend for academics, and would be much more so if there was full access to their entire catalog rather than "limited previews" for most books. I have used Google books countless times to quickly check out whether a book is relevant to my research, or to get the gist of an author's argument without having to trudge down to the library. I know many others who do this as well. In all this time I've never even looked at Google's metadata. No decent academic would rely on such information, as there are far more reliable methods: such as actually checking what's written in the book, which yes, Google scans in.
Get on it, Google.
... that Nunberg needs to get laid more. I can just imagine the man there banging away on his keyboard about this outrage all backed up.
Users... the only thing keeping 1st level support from being the bottom feeders.
In inline comments to the Google head guy's reply to the original blog entry, I find:
Google: Geoff asks why we decided to infer BISAC subjects in the first place. There is only one reason: we thought our end users would find it useful.
Scholar: The question is, why did you think end-users would find this useful? Which end-users did you talk to about this? I don't think you'd find a whole lot of scholars who would embrace the idea of using the BISAC classifications in place of other library classification schemes. In fact, why would anybody think that a scheme designed for organizing the shelves of a Barnes & Noble outlet would be appropriate for a collection assembled out of the holdings of major research libraries?
I read this as "any book that can be found in the holdings of a major research library is only of interest to scholars." And I think he's entirely off-base. Nose-in-the-air "Scholars" like this gentleman fail to recognize that Google's efforts are about making material available to "the rest of us" who don't have access to those major research libraries. And categorical indexing of material makes complete and total sense if you expect to have non-PhD sorts searching for it.
I happen to be a scholar, in some sense, of one particular science. If I want to read some classic literature that has absolutely nothing to do with my science, should I be denied access because I'm not a scholar of that?
Village idiot in some extremely smart villages.
This may be a trite point, but yes, Google does err. Google also does a better job than most companies at going back and fixing their errors. This, being an online database, is pretty easy to correct. If by some principle the scholarship potential of this otherwise unavailable information was irredeemably corrupted, then yes, I'd worry. Instead, it sounds like a pretty amazing project which happens to be in beta.
This could be the stupidest and most disingenuous argument I've encountered all year. I guess I'll never know since the metadata is not at my finger tips. This might be a good argument for getting the metadata right. It isn't a good argument for tossing the virtual books out with the bathwater.
So no I won't get off your lawn. We're better off without scholars who'd rather hoard information. Begone!
These posts express my own personal views, not those of my employer
Which is incredibly helpful for anybody interested in printed materials before 1966...
Sounds like Google are doing their best to fix the problems. What I couldn't quite figure out is why bad data is overriding usually good data like Harvard. Maybe they need to give reliability rankings or something. We are 84% sure this date is right (because it came from Harvard), but there is a 10% chance this one is right (because some other place said that), and a 6% chance of this one (because some guys in Korea said it). Have the option to search only best guesses or all guesses.
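The reliability-ranking idea above can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the weights are the commenter's example numbers (84%, 10%, 6%), the years are invented, and the function names are hypothetical. It is not a description of anything Google does.

```python
from collections import defaultdict

def rank_dates(claims):
    """Combine (source, year, reliability) claims into (year, share) pairs.

    Reliabilities for the same year are summed, then normalised so the
    shares add up to 1. Most likely year first.
    """
    totals = defaultdict(float)
    for _source, year, reliability in claims:
        totals[year] += reliability
    total = sum(totals.values())
    if total == 0:
        return []
    shares = [(year, weight / total) for year, weight in totals.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)

def candidate_years(claims, best_only=True):
    """Years to match a search against: the best guess only, or all guesses."""
    ranked = rank_dates(claims)
    if best_only:
        return [ranked[0][0]] if ranked else []
    return [year for year, _share in ranked]

# Hypothetical claims about one book; the weights are illustrative only.
claims = [
    ("Harvard", 1983, 0.84),
    ("another library", 1899, 0.10),
    ("third provider", 1989, 0.06),
]

for year, share in rank_dates(claims):
    print(f"{year}: {share:.0%}")
```

The `best_only` switch corresponds to the "search only best guesses or all guesses" option suggested above. The hard part, of course, is coming up with defensible reliability numbers in the first place.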
Google Scholar has big problems of its own, as far as being "scholarly." Citations that Google's non-expert review staff believe represent a given technology get promoted, while citations Google's nonexperts don't seem to want to recognize but which actually were the origin of the technology are suppressed.
It's called the IPL. www.ipl.org. It has public domain works in the categories you'd expect to find them. (i.e. Gutenberg content)
www.refdesk.org is similar but for reference.
Why did they bother?
Why did you bother to comment on it? If you don't like it - don't use it.
You are clearly ignorant of the key problem with the Google books settlement (as it currently stands), which is that Google and only Google will be given the right to reproduce orphaned works. I assume the morons tagging this "caveat emptor" are also ignorant of this.
So your glib remark should more correctly read, "if you don't like it, never have access to millions of pages of orphaned copyright works again because Google has an exclusive licence to reproduce them electronically". Which doesn't quite work as well, really, does it?
Read Pynchon.
While it's unlikely that Google's scanning technology is as dramatic as the one in Vinge's novel, there appear to be striking similarities. I wonder if Larry Page or Sergey Brin have read it.
Having read the original blog post, this is clearly the vituperative rant of an imagined-wronged academic, of a kind with which I am all too familiar.
Google is doing the hard work of scanning and attaching some meta-data. Once that is done (a) more meta-data can be added and (b) errors fixed. Additional meta-data will be needed as there are TWO academic classifications for English, and many more for non-English languages.
This is just stupid carping, by those who would rather retain control of their bailiwick.
He does not seem to realize that google will be able to search the content making the meta-data somewhat less important.
There is no reason for you to post this comment here when you could have put together a properly formed and documented essay in a couple of months. There was no reason for Newton to come up with his theory of gravity when in a few centuries Einstein would come up with a more complete theory.
This is a long term project for humanity. We damn well better start now rather than waiting to do it right. Bad data can be cross compared and corrected. Data which has not been digitized at all is completely useless (towards the purpose of having digitized data). In the time it took you to complain about it you could have pulled up a few scans, and done some good old fashioned legwork in the form of copying it out in ASCII and redrawing the illustrations like clerks of old.
Because only licensed entities can create a copy of these works and at this moment in time, only Google has PAID for the license to do this.
You can do it yourself if you wish: just stump up the $125 million and buy a license.
If you don't like Google having the sole rights to commercial exploitation of this work, why aren't you complaining about Marvel having sole right to the graphic novels of Stan Lee etc? Or Warner Bros having sole rights to "The Matrix"?
the problem can ONLY be fixed by
a) forcing the copyright owners to give up the licensing for their works and make it PD
b) forcing Google to pay for everyone else to have the rights
(b) isn't going to happen.
(a) isn't going to happen unless you kill off copyrights.
Perhaps someone should point out to Mr. Nunberg (if one can get past his ceaseless caterwauling) that the books digitized come from LIBRARIES, and if scholars find their digitization, cataloging, or other minutiae somehow insufficient, they can always go back to said LIBRARIES and do their research the old fashioned way?
Some complaints just ring with irrelevance and immaturity. Complaining when someone has gone to great effort and expense to GIVE you something where you had nothing before, simply because they didn't organize it the way you might have, or because of some errors in the process seems...weak. Very weak.
-Styopa
As long as the books themselves are perfectly fine (which they seem to be),
Well, some are really good and well scanned, but others are a mess. From some organizations that do the scanning, you get missing pages and mangled pages. You get pages where the person doing the scanning sometimes put their hand between the page and the glass, so you can read the rings on their fingers but not the text on the page. (Books scanned at NY Public Library for example.) If ever there is a fold-out, you get at max half of it.
The Google Books organization doesn't seem to want to know. There is a mechanism for reporting single-page defects, but when 50 defects occur in a book it gets hard to work through them all using the button-clicks. I tried it for two books and also sent a message to Google Books; there was an automated reply and no action after several months.
So much for 'As long as the books themselves are perfectly fine ....', I'm afraid.
-wb-
librarians make mistakes too. if you went to any given library database and did funky searches, like 'show me all the maps you have', you will get all sorts of crap back.
JSTOR had people bitching about them too...guess what? turns out the wonderful paper libraries had fucked up shitty catalogs, and not only that, their collections were themselves missing issues, missing pages, missing all sorts of stuff.
JSTOR actually has a paper 'backup' of everything they scanned... they get it from universities who throw out their old junk to make way for new junk. and JSTORs collection is more 'pristine' than anything the libraries had to begin with.
oh wait... im sorry nobody can criticize the obvious stupid hypocrisy of the giant linux penis club.
oh wait.. dont > 99.99% of every website on the planet have a title? isnt that meta data....??
oh wait.. again, i apologize, oh great open source lords of azeroth.
the cataloger sucked, or the card for the title was missing.
go look up any book/journal from that time. they are by title and author, not 'city it came from' unless it is a special geographic card catalog.
If Google's service isn't sufficient for your research needs, THEN DON'T FUCKING USE IT. Dear god....
Don't take life so seriously. No one makes it out alive.
Oh, there, I think, I disagree. I once read a book entitled "Indexing, The Art Of," about how book indexes are created, and it was an eye-opener.
Conversely, there's nothing more useless than a completely computer-generated book index. You're looking for a topic that's discussed in three substantial sections and mentioned in passing fifty times, and the index lists fifty-three page numbers because the computer doesn't know which are the important ones.
The same principle probably applies to card catalogs and other indexes. Indexing is a deeply human activity; the person doing the indexing has to have a feeling for importance and organization and be able to guess how the user probably thinks about things.
P. S. Our local public library's card catalog used to have all of the first world war material listed under "Great European War 1914-18", though fortunately someone had stuck in "SEE" cards under "First World War" and "World War I."
"How to Do Nothing," kids activities, back in print!
"Scholars" make up less than 1% of the 'net-using population. So sorry if it inconveniences you "scholars" that Google Books, et al., is organized for the convenience of the great unwashed masses. Of course, all things should be organized specifically for the convenience of under 1% of the user base, esp. if that 1% has "PhD" after their names. And wear funny clothes, too. The day Google or anyone other public info service goes out of their way to organize information for your convenience vs. mine is the day I find a different service. And guess what? 99%+ of the user base is like me. So we have your ivory tower a$$es out-voted. ("O Democracy, what a terrible toll thou takest upon our tweed jackets and curiously high-socks...") Suck it up. Ppppfffffftttt..... :-P
I heard an author talk on The Discovery of Air at the local bookstore. The book is about the correspondence between Priestley and Thomas Jefferson about Priestley's scientific ideas. This author talk was the first time I heard an author say that Google Books was an important reference source for him. This is a sweet spot for Google Books: 19th and early 20th century books out of copyright, but captured by Google's university library digitization effort.
I hate to be so cynical, but there was a huge uptick in negative articles on Slashdot about Google as soon as Microsoft started their anti-Google PR effort in DC. Now I see at least one anti-Google article on Slashdot every day. Is Slashdot falling for an extensive trolling effort from MS?
More info available from previous Slashdot article...
Nevertheless, there are many parallels between drafting laws and writing programs, most of all the 'unforeseen effects at run-time' one. It's such a good analogy that I'm surprised it's not used more often.
As someone who works in Higher Ed with many librarians and academics, I can tell you that librarians are taking any stand to justify their existence in this "new fangled" Internet world. I have some advice: get on the boat, or find a new career because your world is evaporating quickly.
When we left the Dewey Decimal System in the dirt, it was a mistake. It is almost always a mistake to let vendors -- any of them -- define standards. Sadly, we are (and have been) heading in that direction for a long time... mostly due to convenience, the balance to laziness.