Google To Digitize Much of Harvard's Library

Posted by timothy on Monday December 13, 2004 @06:46PM from the that's-a-lot-of-library dept.

FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."

17 of 296 comments

Will it be like google scholar? by baronben · 2004-12-13 18:53 · Score: 5, Interesting

Ever since they introduced Google Scholar, I've been wanting something like this for my university. For those of you who don't know, finding articles on a subject can be a pain in the ass, as subjects are indexed on several different systems (depending on subject, date, and journal). None of them, not one, has a decent interface or gets results that are as good as Google's. Google Scholar lets you search through academic texts, but it's limited to what's available, usually working papers or pre-publication drafts. If there is some way that Google could team up with academic publishers to index as many journals and texts as possible, this would make everyone's life a lot better.
I think this is a great start. There's incredible profit here too; universities spend millions on catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.

--
Sleep is for the weak!
1. Re:Will it be like google scholar? by ISEENOEVIL · 2004-12-13 19:07 · Score: 2, Interesting
  
  As long as we don't end up with Google coming in and picking up these prestigious library resources, Yahoo getting another set, and Microsoft picking up still more. I have a feeling some of these resources want to be universally accessible. This is one step closer, but still not close enough if you have to use 3+ different major search engines. My library fees that are tacked onto tuition would actually be used if I could use my preferred search engine to access everything my university is paying so much for in one place. As it stands now, I cringe when I have to navigate our electronic resources.
  
  -Stormy
2. Re:Will it be like google scholar? by Txiasaeia · 2004-12-13 19:11 · Score: 4, Interesting
  
  "If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr."
  Or finding that perfect article in the MLA database, only to find out that nobody in Canada subscribes to the journal, nor does anybody have the journal on fulltext. I'd rather have a more comprehensive fulltext database in plaintext rather than digitalised copies of everything anyway - makes searching a hellova lot easier.
  
  --
  Condemnant quod non intellegunt.
3. Re:Will it be like google scholar? by belg4mit · 2004-12-13 21:19 · Score: 2, Interesting
  
  Also try Scirus from the folks at FAST. I've often had better luck there than on Google.
  
  --
  Were that I say, pancakes?
4. Re:Will it be like google scholar? by Rich0 · 2004-12-14 00:16 · Score: 2, Interesting
  
  The one thing that something like Google is lacking is persistent result sets. When I do serious searching I usually start with broad terms and figure out what it takes to narrow things down to a scale that I'm willing to work with.
  
  Good quality search engines have lots of qualities that Google lacks. You could search for two words located within 3 words of each other. You could search for these two words within 3 words of each other while two other words don't occur within 6 words of each other. Indexes are generally well thought out and vocabularies are sometimes controlled.
  
  Google allows many of these features, but they're cumbersome to use. If I ran two searches and I want to merge the results, I have to copy down everything I did and try to concoct some kind of advanced search which combines the two sets of parameters. In a decent professional search tool you just ask it to return "set 1 or set 2", giving you a set 3 that has any item that appeared in either. This is powerful and easy to use, and there is no comparison with Google.
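
  To make the two features above concrete, here is a minimal Python sketch, not any real search product's API. The names `within`, `SearchSession`, `search`, `union`, and `get` are hypothetical, the three sample documents are made up, and "within N words" is assumed to mean that the word positions differ by at most N.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def within(text, a, b, max_gap):
    """True if words a and b occur at most max_gap positions apart (assumed definition)."""
    words = tokenize(text)
    pos_a = [i for i, w in enumerate(words) if w == a]
    pos_b = [i for i, w in enumerate(words) if w == b]
    return any(abs(i - j) <= max_gap for i in pos_a for j in pos_b)


class SearchSession:
    """Keeps every result set from a session so earlier sets can be combined later."""

    def __init__(self, docs):
        self.docs = docs   # mapping of document id -> text
        self.sets = []     # persistent result sets, numbered from 1

    def search(self, predicate):
        """Run a search, store the matching ids, and return the new set number."""
        self.sets.append({d for d, text in self.docs.items() if predicate(text)})
        return len(self.sets)

    def union(self, m, n):
        """Store "set m or set n" as a new set and return its number."""
        self.sets.append(self.sets[m - 1] | self.sets[n - 1])
        return len(self.sets)

    def get(self, n):
        """Return the stored result set with number n."""
        return self.sets[n - 1]


# Made-up sample documents for illustration only.
docs = {
    "a": "Harvard library digitizes rare books",
    "b": "The library at Harvard is large",
    "c": "Google scans books for the Michigan library",
}
session = SearchSession(docs)
s1 = session.search(lambda t: within(t, "harvard", "library", 3))
s2 = session.search(lambda t: within(t, "google", "books", 3))
s3 = session.union(s1, s2)  # "set 1 or set 2"
print(sorted(session.get(s3)))
```

  Because each result set stays addressable by number, merging two earlier searches is a single call instead of a rebuilt query. The "but not within 6 words" case could be expressed in the same style as a predicate such as `lambda t: within(t, "x", "y", 3) and not within(t, "p", "q", 6)`.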
  
  Don't get me wrong, I'm glad Google is going into this business. I no longer have free access to just browse the literature any time I feel like it, and this tool would provide that. I just don't think that they'll close down the commercial operations anytime soon.
  
  Personally, I think that all articles written using federal funding should be released into the public domain. The NIH could sponsor journals if none of the commercial journals are willing to publish works that have no copyright. If my tax dollars were used to pay for a study on bumblebee migration patterns, then I should be able to thumb through the report whether or not some bureaucrat thinks that I have a need to know the results. And doing so should not require a trip to some non-public library halfway around the country...
The Fight against Plagiarism by manmanic · 2004-12-13 19:04 · Score: 5, Interesting

One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape would also be able to check whether content was lifted from printed material, as well as from the web.
Re:Are these volumes stored as text or pictures? by robla · 2004-12-13 19:09 · Score: 3, Interesting

I would hope they handle it just like catalog.google.com
How will the books be scanned? by supersat · 2004-12-13 19:14 · Score: 2, Interesting

About two months ago, Jeff Dean (an employee of Google) gave a talk at the University of Washington about the inner workings of Google. One thing he mentioned was Google Print and how they scan books: they slice 'em up into individual pages, and then feed them through a scanner. This doesn't seem like an acceptable way to archive a library's collection. So, how are they scanning them in? Why not use this method for Google Print?
Re:Flipside: The false positive problem by Gori · 2004-12-13 20:15 · Score: 2, Interesting

Well, there are such things as references.

Using work of other people in academic work is not only possible, but greatly encouraged. Just make sure that it is very clear what comes from whom.

In many ways, science is done exactly as Open Source software. Take what you need, modify and improve it where appropriate, and make sure you give full credit where due.

As a teacher, I have given full points to papers that had hardly any text of their own, as long as the sources were properly referenced and used together to make a valid point not made by any of those sources.

So I do not think students should bother staying below the radar. Just reference everything, and voila, you are doing science.

--
Complexity is a measure of our ignorance...
It's about Time! by Shafe · 2004-12-13 20:16 · Score: 2, Interesting

I've been emailing them asking them to do this for years. I'm glad someone is finally doing it! There is only one problem: how do they get past copyright violations? I tried to get Cornell to do this on campus, but they said a lot of their volumes (periodicals, in particular) were still under copyright and hence cannot be scanned. No, it doesn't make any sense to let these carbon books literally fall apart when we can preserve them forever digitally, but that's the name of the game.

Someone hurry up with nanostorage so I can store the entire content of human knowledge on a postage stamp (with nanosecond seek time and gigabyte transfer speeds, of course)
Mailing Lists by lousyd · 2004-12-13 20:22 · Score: 2, Interesting

Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups".

--
If aspiration is a virtue, achievement cannot be a vice.
Re:Nice! by Anonymous Coward · 2004-12-13 20:29 · Score: 1, Interesting

But also: PG books are full of errors, and there is no source info or scans available to fix against in any sort of easy way. Many books, such as Wealth of Nations, went through a number of editions during the author's lifetime. It would be nice to have the various early editions for collation. And oftentimes new editions come out long after the death of the author with bullshit editorial changes in order to claim a new copyright. A library like Harvard will have many of the first several editions of classic works.
Re:University of California is anti-digital by JoshuaDFranklin · 2004-12-13 20:33 · Score: 2, Interesting

Got a link for that policy?

Ever tried a Freedom of Information Act (FOIA) request? Strange as it may seem, that apparently works in the State of Washington.
Re:U of Michigan by truesaer · 2004-12-13 21:16 · Score: 2, Interesting

Actually, I see that it is Stanford with 8 million items that will get to claim itself as the largest, followed by Michigan with 7 million. I don't know why Harvard is getting any props at all with only 40k items. Here is what I found most interesting in the article though:

The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail -- the task would be impossible, says John Wilkin, the U-M associate librarian who is heading the project.

"Going as fast as we can with the traditional means of doing this, it would take us about 1,600 years to do all 7 million volumes," he said. "Google will do it in six years."

Under the agreement, the library will get a digital copy of every book scanned. With those copies, the library can prepare special research projects, virtual exhibitions and more relevant scholarly and academic material for its students and faculty.

"If we were to do this job ourselves, it would probably cost us $600 million," Wilkin said. "That's just the human cost of preparing the material for scanning, packing it up and sending it out to vendors and then quality-control checking of the results. This is easily a billion-dollar effort."

Items will start appearing in 2005 with completion predicted for 2010. Can you imagine how many libraries there are out there? The information that could be gathered seems endless. I'm guessing they'll come up with a good way to detect duplicates in future libraries, but as anyone who has wandered through a University library knows there are a LOT of shady books that seem like they haven't been widely published and there are a LOT of things that were self published by academics in the University itself (theses, postdoc research, etc).
Why journals are expensive. by commodoresloat · 2004-12-13 22:02 · Score: 4, Interesting

The reason there are so few copies is because they are so expensive. Chicken and Egg.
No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.
No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at kinkos, expect them to pay 35 cents a page or more for the copyrights alone.
And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology goes into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.
I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back real progress toward something really useful to scholars in a systematic way. At least until those perspectives change significantly.
Re:Oxford University gets every UK book published by Andrew+Aguecheek · 2004-12-13 22:47 · Score: 2, Interesting

Yep, fell foul of this one the other day. The National Library of Wales happens to be situated in Aberystwyth, on the same hill as the University. (Which, by the way, is a bitch to climb in the mornings... do not apply for sea-front residences unless you are sure of your fitness!) Aaaaanyway, as the librarian there tactfully explained to me: one hell of a lot of books are published every year, and there's only so much space in the place... and they like to have a Welsh Language copy too!

--
Tomorrow, I may eat another house plant
Re:Nice! by Anonymous Coward · 2004-12-14 02:41 · Score: 0, Interesting

Worth noting that this project is putting a LOT of people out of work. Literally, they are laying off almost their entire library staff (I know a few..). Wonder if that'll be in the FAQ?