Google To Digitize Much of Harvard's Library

Posted by timothy on Monday December 13, 2004 @06:46PM from the that's-a-lot-of-library dept.

FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."

15 of 296 comments

One more reason... by Anonymous Coward · 2004-12-13 18:51 · Score: 2, Insightful

to never leave my apartment.
get your scuba gear... by uighur · 2004-12-13 18:57 · Score: 2, Insightful

because it's time to dive into the deep web. Projects like this are the key to unlocking the vast stores of important information which are currently not readily accessible online. Personally I'd like to see a Google-run free access LexisNexis project.
Images and formatting? by MacFury · 2004-12-13 19:00 · Score: 2, Insightful

I should RTFA but what about images and general formatting? I suppose you could find the relevant text, then try and get the physical book...but if you could view the book in its original formatting...that would be sweet.
Just how much storage space will all this data consume? It seems like a massive undertaking.
Are these volumes stored as text or pictures? by wealthychef · 2004-12-13 19:03 · Score: 2, Insightful

I am ambivalent about this. Will the books be stored as text to enable searching? If so, given that part of a book's character is its font and typesetting, will ALL the flavor of these books really be captured, the way it would be if you read them in print? Something seems likely to be "lost in translation" here.

--
Currently hooked on AMP
1. Re:Are these volumes stored as text or pictures? by clovercase · 2004-12-13 19:08 · Score: 3, Insightful
  
  i think your comments would be salient if they were going to scan the documents and then BURN the originals. putting massive content on the web for free is the best way to push content all over the world. some internet user in sri lanka doesn't have the bandwidth to download images of the pages, and would never have the opportunity to view the actual documents in a library at harvard. if everyone digitized all the valuable content (and i presume that much of the content in harvard's libraries is valuable), and made it freely available, the world would be a much better place. would you be satisfied if there was a link on each page to view an image of the actual page?
2. Re:Are these volumes stored as text or pictures? by Txiasaeia · 2004-12-13 19:14 · Score: 4, Insightful
  
  I think you're missing the point. I'm not so much concerned with getting rid of dead tree books (I love reading paper books for enjoyment); I would, on the other hand, prefer all my academic sources to be electronic. As I mentioned in reply to another poster, it's a huge pain to look something up on MLA or Expanded ASAP only to find out that my university doesn't carry it and the interlibrary loan system can't get it for two or three weeks because it's backlogged as it is. I couldn't care less about the spiffy fonts and typesetting; give me the plaintext so I can get my research done!
  
  --
  Condemnant quod non intellegunt.
Flipside: The false positive problem by rsborg · 2004-12-13 19:15 · Score: 2, Insightful

Ok, so this is just a bit of devil's advocate, but what happens if you just *happen* to have a writing style similar to someone else who was printed before... what if you read something, and unknowingly wrote something in a similar vein in your essay? I assume you could check it yourself, but then that would just introduce extra cost to even write the essay in the first place... or worse, the plagiarists could just "tweak" their papers, ensuring that they're "below the radar" by changing enough style to not be recognizable...

--
Make sure everyone's vote counts: Verified Voting
Re:Will it be like google scholar? by baronben · 2004-12-13 19:22 · Score: 4, Insightful

That's a great point, one that I think should be addressed (it has been a bit, with some free online journals, but nothing major). In the world of digital publishing, why do journals cost thousands of dollars a year? It's certainly not the costs: academics pay the journals to defray the cost of publishing, and editors and referees generally get only an honorarium, if anything.

Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so expensively that many colleges can't even afford them?

--
Sleep is for the weak!
Re:Nice! by dvdeug · 2004-12-13 19:37 · Score: 2, Insightful

DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.

Are they really going to provide proofread texts? A novel might only take a couple hours to process, but math is going to take hand markup, and some of the more complex critical editions are a bear. Even at only 2 hours a book (and that's not including scanning time), 4 million volumes adds up to 8 million man-hours or a million man-days. At seven bucks an hour that's 56 million dollars. I expect we'll get scans and OCR, but no hand work; there will still be a place for DP. In fact, we'll be better off, with a huge source of scans to work from.
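
The arithmetic above can be checked with a quick back-of-the-envelope sketch. Every input is one of the comment's own guesses (4 million volumes, 2 hours per book, $7 an hour); the 8-hour man-day is an assumption implied by "a million man-days", and the function name is made up for illustration.

```python
# Back-of-the-envelope check of the proofreading estimate.
# All inputs are guesses taken from the comment, not measured figures.
VOLUMES = 4_000_000    # volumes assumed in the comment
HOURS_PER_BOOK = 2     # optimistic proofreading time per volume
HOURS_PER_DAY = 8      # assumed length of one man-day
WAGE_PER_HOUR = 7      # dollars per hour


def proofreading_estimate(volumes, hours_per_book, hours_per_day, wage):
    """Return (man_hours, man_days, total_cost_in_dollars)."""
    man_hours = volumes * hours_per_book
    man_days = man_hours / hours_per_day
    cost = man_hours * wage
    return man_hours, man_days, cost


man_hours, man_days, cost = proofreading_estimate(
    VOLUMES, HOURS_PER_BOOK, HOURS_PER_DAY, WAGE_PER_HOUR
)
print(f"{man_hours:,} man-hours, {man_days:,.0f} man-days, ${cost:,}")
```

Doubling the per-book time, which seems plausible for math or critical editions, doubles the cost, so the 56 million figure is better read as a floor than as a forecast.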
Both Images & Uncorrected OCR should be availa by dananderson · 2004-12-13 19:38 · Score: 4, Insightful

Typically, both page images and uncorrected OCR are made available. Correcting OCR is too labor-intensive for thousands of books.
The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later.
The page images are tied with the uncorrected OCR so you can see exactly what's there.
For an example, see books at University of Michigan's Making of America (MoA) Exhibit, which has thousands of 19th century books and periodicals available.
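
A toy sketch of why a few OCR typos rarely hurt keyword indexing: a word usually appears more than once on a page, so one garbled occurrence still leaves the page findable. The sample pages and the misspellings "libraiy" and "Harvaid" are made up for illustration, and this is not a description of how Google or MoA actually index.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_id)
    return index


# Hypothetical uncorrected OCR output with two simulated typos.
pages = {
    1: "The Harvard libraiy holds many volumes. The library is old.",
    2: "Harvaid students use the library every day.",
}

index = build_index(pages)
print(sorted(index["library"]))  # [1, 2]: the typo on page 1 does not hide the page
print(sorted(index["harvard"]))  # [1]: page 2 is missed, its only occurrence was garbled
```

The second lookup shows the failure case: a keyword that occurs only once on a page and gets garbled is lost until someone corrects the OCR.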
Dead authors tell no tales . . . till now by dananderson · 2004-12-13 19:44 · Score: 2, Insightful

This will be sweet. I just hope that we don't get too many authors getting pissed.
Only public-domain books will be scanned. In all or most cases the authors are dead. However, this will revive a great body of work and widen access to many.
One class of author that may be pissed is those who take older works and just slap a foreword or introduction on the front and collect royalties. I've seen this done for many histories. But authors of today's works can count on royalties for themselves, their children, and their grandchildren (if the book is still selling). The copyright term is too long in the U.S., but that's another story . . .
False positives can be double-checked manually by wrinkledshirt · 2004-12-13 20:12 · Score: 2, Insightful

The professor can just wait until the match comes up, and then double-check at that point.

You'd want to do a thorough overview of any potential instance of cheating anyway. A quick run-through would determine whether or not a paper happened to contain an identical sentence clause or three identical paragraphs.

I think the bigger problem would be the second one you described -- that students could plagiarize and then go through each paragraph, changing the wording slightly so as to avoid positive matches. Still, you could argue that this is pretty much what academics is anyway, just with footnotes and a bibliography.
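
A minimal sketch of that evasion problem, assuming the checker compares overlapping three-word runs (often called shingles). This is one common technique, not necessarily what any real plagiarism service does; the function names and sample sentences are made up.

```python
def shingles(text, n=3):
    """Return the set of n-word runs in text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a, b, n=3):
    """Jaccard similarity of the two texts' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


source = "the library holds fifteen million volumes in its collection"
copied = "the library holds fifteen million volumes in its collection"
reworded = "the library keeps fifteen million books within its collection"

print(overlap(source, copied))    # 1.0: verbatim copy is caught
print(overlap(source, reworded))  # 0.0: three swapped words break every shingle
```

Shorter shingles catch more rewording but also flag more innocent matches, which is the false positive trade-off raised earlier in the thread.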

--
--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...
Re:Will it be like google scholar? by AlanS2002 · 2004-12-13 21:25 · Score: 0, Insightful

In addition to the other reply to this, there might also be the case of journals which are published by professional organisations being used to defray the cost of running such organisations. You'll also find individual subscription prices being much cheaper than institutional subscription prices. I'd posit a guess that the institutional subscription holders are in some part subsidising the individual subscription holders.

--
Not all conservatives are stupid,
but it is true that most stupid people are conservative.
- Hume
Re:Will it be like google scholar? by tootlemonde · 2004-12-14 03:11 · Score: 2, Insightful
Good quality search engines have lots of qualities that Google lacks.
One solution is to use Google to locate a superset of the target articles and then use a more powerful search engine to winnow the Google result set. For an individual, this approach would mean maintaining a personal index of the articles, but that is a problem of storage space and bandwidth, both of which are relatively cheap.
The two main problems that Google solves are
- having access to the articles in the first place
- reducing the number of possible articles to a manageable level
One could imagine a plugin for browsers that would add the additional search facilities to a google search. Until then, Google Hacks will get you started.
Re:Why journals are expensive. by DarkSarin · 2004-12-14 07:00 · Score: 2, Insightful

I wasn't saying that the prestige of the journal had anything to do with the medium, but that there is a lot of name recognition.

JSTOR varies in quality from journal to journal--some are actually okay, while others suck. I know that I have gotten PDFs from JSTOR, but I wonder if that is a function of JSTOR or the amount that a person/institution is paying for access.

Most journals that I have dealt with online where I had to pay (because the university wasn't a subscriber) wanted between $15 and $25 for a single article. This is a LOT of money, and sometimes (if you aren't in a hurry), it is easier to contact the author and ask for a reprint--they usually have them, and if they are like many researchers, they are glad to send you a copy, provided you explain what you are doing.

There is a trick to it--the current prestigious journals ARE NOT going to go to a low/no cost format for publishing online until there are one or two major competitors who are seen as valid (peer-review) and prestigious. The prestige factor is huge and rests largely on (as you mention) the peer review process AND who is publishing in the journal. Sorry, but Robert Sternberg doesn't generally publish in just any old journal--he has one or two that he will send a manuscript to, and go from there.

When my thesis advisor (who wrote two chapters for the Handbook of Research Methods in Industrial Psychology) publishes, he typically sends stuff first to the Journal of Occupational Behavior, not DarkSarin's Online Journal of Amateur Psychology or Commoderesloat's Journal of Human Weirdness. Why? Because no one has EVER heard of those journals, and if he puts that on his vita, it won't make any difference to the next folks wanting to hire him for his research ability (not that he's going anywhere--he's a full professor).

But when the next university sees that he has published 10 articles in the Journal of Occupational Behavior (JOB), they say, "Hey, this guy is getting published in one of the top 10 journals in Behavioral Psychology, he's probably pretty good!" They will then probably hire him.

But when that same university interviews me, and I put down that I published 123 articles in DarkSarin's Journal of Computer Gaming Psychology, they are going to say, "Wow, I've never heard of that journal--is it peer reviewed? Is it attached to a professional association (APA, MPA, SIOP, etc)? Has anybody here heard of it? Does anyone who's any good publish in that journal?" If you are REALLY lucky, they MIGHT take the time to look up the answers, but chances are slim if the position is getting very many applicants (and if it isn't, it probably isn't paying very well!).

The long and the short of it is that there is little, if any, financial pressure to offer content online for free, and that is unlikely to change without competition. There is unlikely to be much competition, because few young researchers are going to put their career on the line by publishing in any but the most prestigious journals that they can possibly get an article into. Older researchers are already in the habit of sending articles to certain journals, and so they aren't likely to change either.

There isn't a good, quick, easy solution to this, and anyone who says that there is needs to have their head checked. Sorry.

--
"We don't know what we are doing, but we are doing it very carefully,..." Wherry, R.J. Personnel Psychology (1995)