Google To Digitize Much of Harvard's Library

Posted by timothy on Monday December 13, 2004 @06:46PM from the that's-a-lot-of-library dept.

FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."

22 of 296 comments

ads by clovercase · 2004-12-13 18:52 · Score: 5, Funny

will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?
1. Re:ads by IntelliTubbie · 2004-12-13 19:48 · Score: 5, Funny
  
  will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?
  
  Yes, but it'll be mixed in with ads for V14gr4, male "enhancement", and Nigerian wealth opportunities. When the scientists complain, the humanities faculty will protest that spam is a perfectly valid epistemology, and that the scientists' attempt to impose an orthodoxy of "truth" in advertising is simply a power grab to extend Western, white male hegemony. At which point, the scientists will defect to MIT's library down the street.
  
  Cheers,
  IT
  
  --
  Power corrupts. PowerPoint corrupts absolutely.
Re:Not Just Harvard by BizidyDizidy · 2004-12-13 18:53 · Score: 4, Funny

Also according to the summary, Einstein.

--
The safest way to approach lava is to have another person with you and he goes first.
Will it be like google scholar? by baronben · 2004-12-13 18:53 · Score: 5, Interesting

Ever since they introduced Google Scholar, I've been wanting something like this for my university. For those of you who don't know, finding articles on a subject can be a pain in the ass, as subjects are indexed on several different systems (depending on subject, date, and journal). None of them, not one, has a decent interface or gets results that are as good as Google. Google Scholar lets you search through academic texts, but it's limited to what's available, usually working papers or pre-published drafts. If there is some way that Google could team up with academic printers to index as many journals and texts as possible, this would make everyone's life a lot better.
I think this is a great start. There's incredible profit here too: universities spend millions on catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.

--
Sleep is for the weak!
1. Re:Will it be like google scholar? by Txiasaeia · 2004-12-13 19:11 · Score: 4, Interesting
  
  "If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr."
  Or finding that perfect article in the MLA database, only to find out that nobody in Canada subscribes to the journal, nor does anybody have the journal on fulltext. I'd rather have a more comprehensive fulltext database in plaintext rather than digitalised copies of everything anyway - makes searching a hellova lot easier.
  
  --
  Condemnant quod non intellegunt.
2. Re:Will it be like google scholar? by baronben · 2004-12-13 19:22 · Score: 4, Insightful
  
  That's a great point that I think should be addressed (it has been a bit, with some free online journals, but nothing major). In the world of digital publishing, why do journals cost thousands of dollars a year? It's certainly not the costs: academics pay the journals to defray the cost of publishing, and editors and referees generally get only an honorarium, if anything.
  
  Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so expensively that many colleges can't even afford them?
  
  --
  Sleep is for the weak!
So... by Anonymous Coward · 2004-12-13 18:53 · Score: 4, Funny

If I download a book, when do I have to upload it again? What is the late fee if I forget?
The Fight against Plagiarism by manmanic · 2004-12-13 19:04 · Score: 5, Interesting

One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape would also be able to check whether content was lifted from printed material, as well as from the web.
Re:Are these volumes stored as text or pictures? by Txiasaeia · 2004-12-13 19:14 · Score: 4, Insightful

I think you're missing the point. I'm not so much concerned with getting rid of dead tree books (I love reading paper books for enjoyment); I would, on the other hand, prefer all my academic sources to be electronic. As I mentioned in reply to another poster, it's a huge pain to look something up on MLA or Expanded ASAP only to find out that my university doesn't carry it and the interlibrary loan system can't get it for two or three weeks because it's backlogged as it is. I couldn't care less about the spiffy fonts and typesetting; give me the plaintext so I can get my research done!

--
Condemnant quod non intellegunt.
Re:Nice! by RollingThunder · 2004-12-13 19:18 · Score: 4, Informative

Well, there's the Distributed Proofreaders project for Project Gutenberg... but PG doesn't have a "we must be the source" attitude from what I've seen. As far as PG is concerned, the more eBooks, the better.

DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.
University of California is anti-digital by dananderson · 2004-12-13 19:24 · Score: 5, Informative

This is great. Compare this pro-digitalization attitude of Harvard, Stanford, and others with the University of California's (UC's) anti-digital position.
For books in Special Collections, they won't allow copies to be digitalized unless they are (1) paid a fee to scan the book (fair enough) and (2) paid a royalty to post the book to the web.
The royalty amounts to hundreds or thousands of dollars per book (about $100/page or image). This allows the libraries to act as a "profit center" for the universities. This policy applies to all UC campuses (I've tried UCB, UCLA, UCI, UCSD).
This is true even though the book is in the public domain (because they have physical possession and nobody can make copies until you sign a license agreement). This is true even if you're using the book for non-commercial purposes (such as free posting to the web).
Something is wrong here. People donate to UC libraries (either books or money) for the public good. They don't donate so the library can start a business licensing public-domain books.
Despite that, I have been able to scan many books (by using books in open stacks or purchasing them). These books concern Yosemite history and are at http://www.yosemite.ca.us/history/
Text of Dec 13th Email by olvr · 2004-12-13 19:34 · Score: 5, Informative

December 13, 2004

Dear Colleague,

I am writing today with news of an exciting new project within the Harvard libraries. As all of us know, Harvard's is the world's preeminent university library. Its holdings of over 15 million volumes are the result of nearly four centuries of thoughtful and comprehensive collecting. While those holdings are of primary importance to Harvard students and faculty, we have, for several years, been considering ways to make the collections more useful and accessible to scholars around the world. Now we are about to begin a project that can further that global goal and, at the same time, can greatly enhance access to Harvard's vast library resources for our students and faculty.

We have agreed to a pilot project that will result in the digitization of a substantial number of volumes from the Harvard libraries. The pilot will give the University a great deal of important data on a possible future large-scale digitization program for most of the books in the Harvard collections. The pilot is a small but extremely significant first step that can ultimately provide both the Harvard community and the larger public with a revolutionary new information location tool to find materials available in libraries.

The pilot project will be done in collaboration with Google. The project will link Harvard's library collections with Google's resources and its cutting-edge technology. The pilot project, which will be announced officially tomorrow, is the result of more than a year of careful consultation at many levels of the University. We could not have achieved a meaningful pilot project without the efforts of the Harvard Corporation; the President, Provost, Chief Information Officer, and Office of General Counsel; the University Library Council; and senior managers within the College Library and the University Library.

A full description of the pilot program follows here, with further materials available on the Harvard home page tomorrow.

With best regards,
Sidney Verba
Carl H. Pforzheimer University Professor and
Director of the University Library

Project Description:
Harvard's Pilot Project with Google

Harvard University is embarking on a collaboration with Google that could harness Google's search technology to provide to both the Harvard community and the larger public a revolutionary new information location tool to find materials available in libraries. In the coming months, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, an FAQ detailing the Harvard pilot program with Google will be available at http://hul.harvard.edu.

The Harvard pilot will provide the information and experience on which the University can base a decision to launch a large-scale digitization program. Any such decision will reflect the fact that Harvard's library holdings are among the University's core assets, that the magnitude of those holdings is unique among university libraries anywhere in the world, and that the stewardship of these holdings is of paramount importance. If the pilot is deemed successful, Harvard will explore a long-term program with Google through which the vast majority of the University's library books would be digitized and included in Google's searchable database. Google will bear the direct costs of digitization in the pilot project.

By combining the skills and library collections of Harvard University with the innovative search skills and capacity of Google, a long-term program has the potential to create an important public good. According to Harvard President Lawrence H. Summers, "Harvard has the greate
Re:Nice! by happyemoticon · 2004-12-13 19:36 · Score: 4, Informative

I happen to work for one.

It's focused on putting otherwise one-of-a-kind materials online for preservation and ease of access, rather than Byron: The Critical Anthology or Catcher in the Rye. It's kind of a mammoth, inefficient bureaucracy, though; I don't agree with some of the practices (such as sending texts off to India to be scrivened, rather than just using OCR software), they're very, very slow to incorporate data, and there are a lot of other problems which stem from the fact that most of them are not computer people, but MIMS holders (librarians).

The fact that Google is doing it gives me hope. Hell, maybe I can jump ship.
Both Images & Uncorrected OCR should be availa by dananderson · 2004-12-13 19:38 · Score: 4, Insightful

Typically, both page images and uncorrected OCR are made available. Correcting OCR is too labor-intensive for thousands of books.
The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later.
The page images are tied with the uncorrected OCR so you can see exactly what's there.
For an example, see books at University of Michigan's Making of America (MoA) Exhibit, which has thousands of 19th century books and periodicals available.
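
To illustrate the indexing point above, here is a minimal Python sketch of a keyword index built straight from uncorrected OCR text. It is not how Google or Making of America actually index anything; the sample pages, the typo "Yosemlte", and the helper names `tokenize`, `build_index`, and `search` are all made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphabetic keywords."""
    return re.findall(r"[a-z]+", text.lower())

def build_index(pages):
    """Map each keyword to the set of page numbers whose OCR text contains it."""
    index = defaultdict(set)
    for page_no, ocr_text in pages.items():
        for word in tokenize(ocr_text):
            index[word].add(page_no)
    return index

def search(index, query):
    """Return the pages that contain every keyword in the query."""
    words = tokenize(query)
    if not words:
        return set()
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return hits

# Hypothetical uncorrected OCR output; page 2 has one typo ("Yosemlte").
pages = {
    1: "Early visitors described the Yosemite Valley.",
    2: "Hotels and trails of the Yosemlte Valley.",
}
index = build_index(pages)

print(search(index, "valley"))         # both pages are found
print(search(index, "trails valley"))  # page 2 is still reachable
print(search(index, "yosemite"))       # only page 1; the typo hides page 2
```

A single bad word only hides a page from queries for that exact word; every other keyword on the page still indexes normally, and the linked page image lets a reader check what the scan really says.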
Do no evil. by nels_tomlinson · 2004-12-13 19:48 · Score: 4, Funny

Their corporate motto is ``do no evil'', and we've all applauded that, but this is such a great thing that I think we could give them a pass on at least one evil act.
Maybe they could do something really evil to Microsoft, and then we could say: ``Well, you digitized Harvard's library, so we'll let it pass this time.''

--
See what I've been reading.
U of Michigan by truesaer · 2004-12-13 20:30 · Score: 4, Informative

It looks like the largest portion of this will be 7 million items from the University of Michigan (compared to only 40,000 from Harvard). Good article from the Detroit Free Press.
Re:Oxford University gets every UK book published by Jon+Chatow · 2004-12-13 20:33 · Score: 4, Informative

Actually, they don't automatically get copies. They have the right to get one, but they don't have much space, so they only get copies of publications that they feel like getting. The British Library would be a more interesting one to team up with, as they get a copy of every publication...

--
James F.
Re:15 million volumes? by pmc · 2004-12-13 20:49 · Score: 5, Informative

The Library of Congress is the largest library in the world, with nearly 128 million items on approximately 530 miles of bookshelves.

The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.

For /. readers 1 BL = 1.17 LoC
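
The conversion above can be checked in a couple of lines of Python. The item counts are simply the figures quoted in this comment, not independently verified, and the constant names are made up.

```python
# Item counts as quoted in the comment above (not independently verified).
LOC_ITEMS = 128_000_000  # Library of Congress: "nearly 128 million items"
BL_ITEMS = 150_000_000   # British Library: "150 million items"

ratio = BL_ITEMS / LOC_ITEMS
print(f"1 BL = {ratio:.2f} LoC")  # prints "1 BL = 1.17 LoC"
```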
Why journals are expensive. by commodoresloat · 2004-12-13 22:02 · Score: 4, Interesting

The reason there are so few copies is because they are so expensive. Chicken and Egg.
No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.
No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at Kinko's, expect them to pay 35 cents a page or more for the copyrights alone.
And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology going into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.
I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back real progress for something really useful to scholars in a systematic way. At least until those perspectives change significantly.
1. Re:Why journals are expensive. by commodoresloat · 2004-12-14 06:00 · Score: 4, Informative
  
  The prestige of a journal is related to the difficulty of getting an article past peer review, not to the fact of the journal being available online or in paper. So there is no "trick" at all other than for the prestigious journals that already exist to start making content available online or in other electronic form.
  As for fulltext articles, try JSTOR if you want to see how to do it wrong. Page by page in gif format, and some huge pdfs with all pictures and no ability to process text. Useless!! Yes you can print it out but then I'd just as soon get the hardcopy in the first place.
Re:15 million volumes? by commodoresloat · 2004-12-13 22:11 · Score: 4, Funny

The British Library (www.bl.uk) has 150 million items
He means just books and such. It's not fair counting umbrellas.
Re:Nice! by Charles+Franks · 2004-12-14 02:06 · Score: 4, Informative

Actually we do save the images. Many of the initial projects' images are saved on CDs, but anything from the last few years will make its way to the 'Open Library System', which is an image archive of the DP page scans. You can find a pre-alpha version at: http://www.pgdp.org/ols There are images for about 1,000 projects there, with many more pending me getting around to importing them. Lots of work to be done, developers welcome.

Charles Franks
Founder, Distributed Proofreaders