Google To Digitize Much of Harvard's Library

Posted by timothy on Monday December 13, 2004 @06:46PM from the that's-a-lot-of-library dept.

FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."

Ivy Exchange by Anonymous Coward · 2004-12-13 19:04 · Score: 1, Informative

I know Brown has been digitizing all journals coming in for a while...

On another note, all the Ivies except Haavad participate in an interlibrary loan program. There are over 40 million bound volumes overall. Check it out here.
Re:Nice! by RollingThunder · 2004-12-13 19:18 · Score: 4, Informative

Well, there's the Distributed Proofreaders project for Project Gutenberg... but PG isn't a "we must be the source" attitude from what I've seen. As far as PG is concerned, the more eBooks, the better.

DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.
Reminds me of the U of Michigan and U. Microfilms by Ungrounded+Lightning · 2004-12-13 19:21 · Score: 2, Informative

Back around the '60s or so the University of Michigan cut a similar deal with University Microfilms.

U Microfilms set up and ran a microfilming operation in the library system, microfilming everything that wasn't under copyright (and much that was with permission of the copyright holders, such as several large newspapers and many magazines and other periodicals), along with much of the University's records. Rare books, etc.

(If I have this right) the U got microfilm prints of the documents for free and didn't have to pay for the microfilming of its records. University Microfilms made its money by selling microfilms of the various publications (forwarding royalties, where appropriate, to the copyright holders). The rare books, for instance, could now be studied on microfilm with no further stress on the original, and their content became available at many other colleges and libraries. Good deal all around.

University Microfilms was founded by a regent, who was later slammed for conflict of interest. He dropped out of the Board of Regents but the business deal continued.

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
University of California is anti-digital by dananderson · 2004-12-13 19:24 · Score: 5, Informative

This is great. Compare this pro-digitalization attitude of Harvard, Stanford, and others with the University of California's (UC's) anti-digital position.
For books in Special Collections, they won't allow copies to be digitalized unless they are (1) paid a fee to scan the book (fair enough) and (2) paid a royalty to post the book to the web.
The royalty amounts to hundreds or thousands of dollars per book (about $100/page or image). This allows the libraries to act as a "profit center" for the universities. This policy applies to all UC campuses (I've tried UCB, UCLA, UCI, UCSD).
This is true even though the book is in the public domain (because they have physical possession and nobody can make copies without signing a license agreement). This is true even if you're using the book for non-commercial purposes (such as free posting to the web).
Something is wrong here. People donate to UC libraries (either books or money) for the public good. They don't donate so the library can start a business licensing public-domain books.
Despite that, I have been able to scan many books (by using books in open stacks or purchasing them). These books concern Yosemite history and are at http://www.yosemite.ca.us/history/
"Slice and scan" is used for new books only by dananderson · 2004-12-13 19:31 · Score: 3, Informative

I'm not familiar with Google Print, but "slice and scan" is typically used for new books only. That's because there are multiple copies of the book available and the paper is usually flat and dust-free.
For older books, most archivists use a cradle and photograph the pages. It's easier on the book, requires no slicing, and there's no scanner to clog with dust.
The disadvantage is the scanner operators need a little bit more training, but that's not a big problem.
1. Re:"Slice and scan" is used for new books only by brunogirin · 2004-12-14 01:42 · Score: 2, Informative
  
  Also, the "slice and scan" method is much, much faster because you can feed the whole book in one go to a high volume scanner and hey presto! it comes out with all the scans in minutes rather than spending hours photographing and scanning each page individually. But of course, "slice and scan" is a destructive method (destructive for the book) so only makes sense if the printed book is not a rare item.
Text of Dec 13th Email by olvr · 2004-12-13 19:34 · Score: 5, Informative

December 13, 2004

Dear Colleague,

I am writing today with news of an exciting new project within the Harvard libraries. As all of us know, Harvard's is the world's preeminent university library. Its holdings of over 15 million volumes are the result of nearly four centuries of thoughtful and comprehensive collecting. While those holdings are of primary importance to Harvard students and faculty, we have, for several years, been considering ways to make the collections more useful and accessible to scholars around the world. Now we are about to begin a project that can further that global goal and, at the same time, can greatly enhance access to Harvard's vast library resources for our students and faculty.

We have agreed to a pilot project that will result in the digitization of a substantial number of volumes from the Harvard libraries. The pilot will give the University a great deal of important data on a possible future large-scale digitization program for most of the books in the Harvard collections. The pilot is a small but extremely significant first step that can ultimately provide both the Harvard community and the larger public with a revolutionary new information location tool to find materials available in libraries.

The pilot project will be done in collaboration with Google. The project will link Harvard's library collections with Google's resources and its cutting-edge technology. The pilot project, which will be announced officially tomorrow, is the result of more than a year of careful consultation at many levels of the University. We could not have achieved a meaningful pilot project without the efforts of the Harvard Corporation; the President, Provost, Chief Information Officer, and Office of General Counsel; the University Library Council; and senior managers within the College Library and the University Library.

A full description of the pilot program follows here, with further materials available on the Harvard home page tomorrow.

With best regards,
Sidney Verba
Carl H. Pforzheimer University Professor and
Director of the University Library

Project Description:
Harvard's Pilot Project with Google

Harvard University is embarking on a collaboration with Google that could harness Google's search technology to provide to both the Harvard community and the larger public a revolutionary new information location tool to find materials available in libraries. In the coming months, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, an FAQ detailing the Harvard pilot program with Google will be available at http://hul.harvard.edu.

The Harvard pilot will provide the information and experience on which the University can base a decision to launch a large-scale digitization program. Any such decision will reflect the fact that Harvard's library holdings are among the University's core assets, that the magnitude of those holdings is unique among university libraries anywhere in the world, and that the stewardship of these holdings is of paramount importance. If the pilot is deemed successful, Harvard will explore a long-term program with Google through which the vast majority of the University's library books would be digitized and included in Google's searchable database. Google will bear the direct costs of digitization in the pilot project.

By combining the skills and library collections of Harvard University with the innovative search skills and capacity of Google, a long-term program has the potential to create an important public good. According to Harvard President Lawrence H. Summers, "Harvard has the greate
Re:Nice! by happyemoticon · 2004-12-13 19:36 · Score: 4, Informative

I happen to work for one.

It's focused on putting otherwise one-of-a-kind materials online for preservation and ease of access, rather than Byron: The Critical Anthology or Cather on the Rye. It's kind of a mammoth, inefficient bureaucracy, though; I don't agree with some of the practices (such as sending texts off to India to be scrivened, rather than just using OCR software), they're very, very slow to incorporate data, and there are a lot of other problems which stem from the fact that most of them are not computer people, but MIMS holders (librarians).

The fact that Google is doing it gives me hope. Hell, maybe I can jump ship.
Oxford University gets every UK book published by aegilops · 2004-12-13 19:37 · Score: 3, Informative

The library of the University of Oxford, i.e. the Bodleian Library, was the first "copyright" library in the UK - one of only three - which means that it automatically gets a copy of every book published in the UK.

Aegilops
1. Re:Oxford University gets every UK book published by Jon+Chatow · 2004-12-13 20:33 · Score: 4, Informative
  
  Actually, they don't automatically get copies. They have the right to get one, but they don't have much space, so they only get copies of publications that they feel like getting. The British Library would be a more interesting one to team up with, as they get a copy of every publication...
  
  --
  James F.
2. Re:Oxford University gets every UK book published by Anonymous Coward · 2004-12-13 20:33 · Score: 1, Informative
  
  No it does not. It has the right to get every book published, but it has to ask for them within a year of publication. Only the British Library gets the books automatically under law.
Amen by lavaface · 2004-12-13 19:54 · Score: 2, Informative

It was just a matter of time before a project of this scope got off the ground. I would like to see them team up with Project Gutenberg (and perhaps archive.org) to provide images of the material. Throw in the little transcoder and perhaps wikipedia and we will soon have a killer information resource that can be cross-referenced to silly proportions. This is a boon for research. Projects like this and the public library of science will add much to collective knowledge. It would also be nice to see them team up with the newspaper project! Next stop--public domain LOC!!!

--
harmonious design
Screenshot by BReflection · 2004-12-13 20:15 · Score: 3, Informative

Screenshot of the service from John Battelle's Searchblog.

--
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
second only to the Library of Congress. . . by Leonig+Mig · 2004-12-13 20:24 · Score: 2, Informative

... are you sure? Doesn't it mean (as is so often the case) "within the United States"? What about the British Library? What about the Bodleian at Oxford?

--
i'm trying to give up sigs.
1. Re:second only to the Library of Congress. . . by julesh · 2004-12-13 21:28 · Score: 2, Informative
  
  Apparently the Bodleian only has 7.2 million volumes, so this is larger than that collection.
  
  The British Library apparently has "150 million items" according to their web site, but a large number of these are not books (they claim, for instance, to have 8 million stamps). But, I'm pretty sure they have more than 15 million books.
  
  Whether or not they have more books than the Library of Congress is an interesting question.
2. Re:second only to the Library of Congress. . . by Steve+Cox · 2004-12-13 21:33 · Score: 2, Informative
  
  According to the British Library's website, it contains 150 million items and gains a further 3 million each year (but it doesn't distinguish between items and volumes - they collect any published item, and receive a copy of EVERY published item in the UK and Ireland).
  
  The Bodleian has only 7 million volumes.
  
  I would suspect that the British Library is substantially larger than Stanford's, but the Library of Congress is recognised as the largest library in the world.
  
  Steve.
U of Michigan by truesaer · 2004-12-13 20:30 · Score: 4, Informative

It looks like the largest portion of this will be 7 million items from the University of Michigan (compared to only 40,000 from Harvard). Good article from the Detroit Free Press.
Re:15 million volumes? by pmc · 2004-12-13 20:49 · Score: 5, Informative

The Library of Congress is the largest library in the world, with nearly 128 million items on approximately 530 miles of bookshelves.

The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.

For /. readers 1 BL = 1.17 LoC
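The conversion above is just the ratio of the two item counts quoted in this comment. A minimal sketch of that arithmetic follows; the function and constant names are made up for illustration, and it compares raw "items", not books or shelf miles.

```python
# Item counts as quoted in the comment above, in millions of items.
LOC_ITEMS_MILLIONS = 128  # "nearly 128 million items" (Library of Congress)
BL_ITEMS_MILLIONS = 150   # "150 million items" (British Library)


def library_ratio(a_items: float, b_items: float) -> float:
    """Return the size of library A expressed in units of library B."""
    return a_items / b_items


ratio = library_ratio(BL_ITEMS_MILLIONS, LOC_ITEMS_MILLIONS)
print(f"1 BL = {ratio:.2f} LoC")  # prints "1 BL = 1.17 LoC"
```

Since both libraries count stamps, maps, and other non-book items, this says nothing about which one holds more bound volumes.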
Re:Will it be like google scholar? by treerex · 2004-12-14 00:27 · Score: 2, Informative

I've been using CiteSeer for years in my research, and still prefer it over Google Scholar.

For computing research CiteSeer and the ACM DL are the two places to go. Scholar may obviate the need for going to both places, someday, but for now it needs to mature a bit.
Re:Why journals are expensive. by Anonymous Coward · 2004-12-14 00:52 · Score: 3, Informative

A link which backs the "greedy bastards" theory :
http://math.berkeley.edu/~kirby/journals.html
New York Times article by sporktoast · 2004-12-14 02:03 · Score: 3, Informative

For what it is worth, there was an article in the Painted Lady about it today.

--
In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.
Re:Nice! by Charles+Franks · 2004-12-14 02:06 · Score: 4, Informative

Actually we do save the images. Many of the initial projects' images are saved on CDs, but anything from the last few years will make its way to the 'Open Library System', which is an image archive of the DP page scans. You can find a pre-alpha version at: http://www.pgdp.org/ols There are images for about 1,000 projects there, with many more pending me getting around to importing them. Lots of work to be done, developers welcome. Charles Franks, Founder, Distributed Proofreaders
Re:Why journals are expensive. by commodoresloat · 2004-12-14 06:00 · Score: 4, Informative

The prestige of a journal is related to the difficulty of getting an article past peer review, not to the fact of the journal being available online or in paper. So there is no "trick" at all other than for the prestigious journals that already exist to start making content available online or in other electronic form.
As for fulltext articles, try JSTOR if you want to see how to do it wrong. Page by page in gif format, and some huge pdfs with all pictures and no ability to process text. Useless!! Yes you can print it out but then I'd just as soon get the hardcopy in the first place.
FOIA fees by KMSelf · 2004-12-14 06:32 · Score: 2, Informative

FYI, FOIA isn't free, though the fees are pretty nominal. $0.10/page, $18/hr, after the first 100 pages, with a significant educational discount.
The thought of having a spook do my photocopying for me just sounds.... Hrm. Ironic?
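
A rough sketch of how those numbers might add up, under one reading of the comment: the first 100 pages are free, each page after that costs $0.10, and search time is billed at $18/hr. That reading is an assumption, the function name is hypothetical, and the educational discount is left out because no figure is given for it.

```python
# Figures from the comment above; how they combine is an assumption.
FREE_PAGES = 100      # assumed: the first 100 pages are not charged
PER_PAGE_FEE = 0.10   # dollars per page beyond the free allowance
HOURLY_RATE = 18.00   # dollars per hour of staff time


def foia_fee_estimate(pages: int, search_hours: float = 0.0) -> float:
    """Estimate a request's cost in dollars (educational discount not modeled)."""
    billable_pages = max(0, pages - FREE_PAGES)
    total = billable_pages * PER_PAGE_FEE + search_hours * HOURLY_RATE
    return round(total, 2)


# Example: 250 pages and 2 hours of staff time.
print(f"${foia_fee_estimate(250, 2):.2f}")  # prints "$51.00"
```

Actual fee schedules vary by agency and requester category, so treat this only as an illustration of the per-page plus per-hour structure.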

--
What part of "gestalt" don't you understand?