Google To Digitize Much of Harvard's Library

Nice! by mind21_98 · 2004-12-13 18:47 · Score: 1

But aren't there projects that are already doing this?

--
US businesses that currently accept chip and PIN/signature

Re:Nice! by ravenspear · 2004-12-13 18:49 · Score: 1

Already digitizing the Harvard library?

No.
Re:Nice! by Meetch · 2004-12-13 18:50 · Score: 1

More targets to avoid the pressure of /.ing?
Re:Nice! by RollingThunder · 2004-12-13 19:18 · Score: 4, Informative

Well, there's the Distributed Proofreaders project for Project Gutenberg... but PG doesn't have a "we must be the source" attitude from what I've seen. As far as PG is concerned, the more eBooks, the better.

DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.
Re:Nice! by happyemoticon · 2004-12-13 19:36 · Score: 4, Informative

I happen to work for one.

It's focused on putting otherwise one-of-a-kind materials online for preservation and ease of access, rather than Byron: The Critical Anthology or Cather on the Rye. It's kind of a mammoth, inefficient bureaucracy, though; I don't agree with some of the practices (such as sending texts off to India to be scrivened, rather than just using OCR software), they're very, very slow to incorporate data, and there are a lot of other problems which stem from the fact that most of them are not computer people, but MIMS holders (librarians).

The fact that Google is doing it gives me hope. Hell, maybe I can jump ship.
Re:Nice! by dvdeug · 2004-12-13 19:37 · Score: 2, Insightful

DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort.

Are they really going to provide proofread texts? A novel might only take a couple hours to process, but math is going to take hand markup, and some of the more complex critical editions are a bear. Even at only 2 hours a book (and that's not including scanning time), 4 million volumes adds up to 8 million man-hours or a million man-days. At seven bucks an hour that's 56 million dollars. I expect we'll get scans and OCR, but no hand work; there will still be a place for DP. In fact, we'll be better off, with a huge source of scans to work from.
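
A quick check of the arithmetic above, as a minimal sketch. The 4 million volumes, 2 hours per book and $7 an hour come from the comment; the 8-hour man-day is an assumption implied by the "million man-days" figure.

```python
# Back-of-the-envelope proofreading cost, using the figures quoted above.
VOLUMES = 4_000_000      # from the comment
HOURS_PER_BOOK = 2       # from the comment (optimistic, excludes scanning)
WAGE_PER_HOUR = 7        # dollars, from the comment
HOURS_PER_DAY = 8        # assumption: one man-day is an 8-hour shift


def proofreading_estimate(volumes=VOLUMES, hours_per_book=HOURS_PER_BOOK,
                          wage=WAGE_PER_HOUR, hours_per_day=HOURS_PER_DAY):
    """Return total man-hours, man-days and wage cost for hand proofreading."""
    man_hours = volumes * hours_per_book
    return {
        "man_hours": man_hours,
        "man_days": man_hours // hours_per_day,
        "cost_usd": man_hours * wage,
    }


estimate = proofreading_estimate()
print(estimate)
```

Under those assumptions the numbers in the comment hold up: 8 million man-hours, a million man-days, and 56 million dollars.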
Re:Nice! by Anonymous Coward · 2004-12-13 20:29 · Score: 1, Interesting

But also: PG books are full of errors, and there is no source info or scans available to fix against in any sort of easy way. Many books, such as Wealth of Nations, went through a number of editions during the author's lifetime. It would be nice to have the various early editions for collation. And often times new editions come out long after the death of the author with bullshit editorial changes in order to claim a new copyright. A library like Harvard will have many of the earliest editions of classic works.
Re:Nice! by Anonymous Coward · 2004-12-13 23:53 · Score: 1, Funny

or Cather on the Rye
You're not the spell checker, huh?
Re:Nice! by kalidasa · 2004-12-14 00:38 · Score: 1

He just OCRed it, that's all. Seriously, there's a reason that stuff is sent to Asia to be hand-input rather than OCRed: someone typing a language they don't know is always more accurate than OCR, and therefore the practice reduces editing time. (Someone typing a language they DO know is a different issue: they tend to "improve" things, albeit unintentionally: they read "affect" and type "effect," etc.)
Re:Nice! by RollingThunder · 2004-12-14 00:57 · Score: 1

Very true - DP demonstrates that it's not trivial to OCR these texts without error. Part of me was assuming that Google must have some amazing improvements up their sleeve to manage it in a completely automated fashion.

Having looked at their Catalogs beta, I still suspect they just may have... not only is the text OCR'd well enough to search it, but it even highlights the words in the text. They could certainly have hand-proofed, but that strikes me as not a very Google thing to do.
Re:Nice! by RollingThunder · 2004-12-14 00:59 · Score: 1

The PG books may be full of errors, but hopefully the raw scans from DP will be kept after the OCR, doubleproof, and postproduction is complete. That way you can still go back and see what the heck a given item really said.

I should scrounge around their forums and see if they state what the final disposition of the scans is.
Re:Nice! by advocate_one · 2004-12-14 01:21 · Score: 1

scan it twice on different hardware, OCR both scans and do a diff on the text files to find the errors.
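
A minimal sketch of that idea using Python's standard `difflib`. The function name `ocr_disagreements` and the two sample strings are made up for illustration; real input would be the text output of two separate OCR passes.

```python
import difflib


def ocr_disagreements(text_a, text_b):
    """List (line number in A, lines from A, lines from B) where two OCR passes differ."""
    lines_a = text_a.splitlines()
    lines_b = text_b.splitlines()
    matcher = difflib.SequenceMatcher(a=lines_a, b=lines_b, autojunk=False)
    suspects = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Any non-matching region is a candidate error for a human to check.
            suspects.append((i1 + 1, lines_a[i1:i2], lines_b[j1:j2]))
    return suspects


# Hypothetical output of two OCR passes over the same page.
scan_one = "It was the best of times,\nit was the worst of tirnes,\nit was the age of wisdom"
scan_two = "It was the best of times,\nit was the worst of times,\nit was the age of wisdom"

for line_no, from_a, from_b in ocr_disagreements(scan_one, scan_two):
    print(line_no, from_a, from_b)
```

A diff like this only flags places where the two passes disagree; an error that both passes make identically would slip through.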

--
Donald 'Duck' Dunn: We had a band powerful enough to turn goat piss into gasoline.
Re:Nice! by Forge · 2004-12-14 01:58 · Score: 1

Q: What is the world's largest English Speaking Democracy?

A: If you said "USA", you just removed a Billion people from the planet. That's right. English is India's primary language.

So much for that "More accurate than OCR" stuff.

--
--= Isn't it surprising how badly I spell ?
Re:Nice! by Charles+Franks · 2004-12-14 02:06 · Score: 4, Informative

Actually we do save the images. Many of the initial projects' images are saved on CDs, but anything from the last few years will make its way to the 'Open Library System', which is an image archive of the DP page scans. You can find a pre-alpha version at: http://www.pgdp.org/ols

There are images for about 1,000 projects there, with many more pending me getting around to importing them. Lots of work to be done, developers welcome.

Charles Franks
Founder, Distributed Proofreaders
Re:Nice! by jhutch2000 · 2004-12-14 02:18 · Score: 1

That works wonderfully for two separate human typed in files... For two computer OCR'd files, the differences are almost completely non-existent and it doesn't really get you anywhere. (This assumes, of course, that neither scan had a major hiccup that caused bogus data on one of the runs.)

The problem with OCR is that while it is 95-97% accurate on clean, clear type (and this makes it sufficient for search purposes like Google), it becomes maddening to read a work with that many errors. I've seen many "experts" say that at least "4 9's" (99.99% accurate) is the minimum and I've seen "5 9's" (99.999%) or "6 9's" (99.9999%) proposed. Some checks from about a year ago showed the current DP output somewhere around 5 9's (99.999% accurate).
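
To put those percentages in reader terms, here is a rough sketch of errors per page at each accuracy level. The 2,000 characters per page is purely an assumption for illustration; the accuracy figures are the ones quoted above.

```python
from fractions import Fraction

CHARS_PER_PAGE = 2000  # assumption: a rough figure for a printed page


def errors_per_page(accuracy, chars_per_page=CHARS_PER_PAGE):
    """Expected wrong characters per page; accuracy is a decimal string like '0.97'."""
    # Fraction keeps the arithmetic exact (no floating-point rounding noise).
    return chars_per_page * (1 - Fraction(accuracy))


for accuracy in ["0.95", "0.97", "0.9999", "0.99999", "0.999999"]:
    errs = errors_per_page(accuracy)
    print(f"{accuracy}: {float(errs):g} errors per page")
```

With that page size, 95-97% accuracy means dozens of wrong characters on every page, while "5 9's" works out to roughly one error every fifty pages.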
Re:Nice! by Anonymous Coward · 2004-12-14 02:41 · Score: 0, Interesting

Worth noting that this project is putting a LOT of people out of work. Literally, they are laying off almost their entire library staff (I know a few..). Wonder if that'll be in the FAQ?
Re:Nice! by kalidasa · 2004-12-14 03:01 · Score: 1

Nice try, but not all billion people in India speak English, as I know from visiting friends who have guests who don't speak English. But thank you for playing.
Re:Nice! by kalidasa · 2004-12-14 03:07 · Score: 1

Come to think of it, though, nearly everyone who is doing work like this would (and yes, there are probably more English speakers in India than in any other country). So I concede that you have a point. But it is nevertheless a better idea to have people do data entry if they don't know the language, if you are dealing with a simple writing system like Latin/English.
Re:Nice! by RollingThunder · 2004-12-14 03:58 · Score: 1

Wow, cool! Thanks for the reply. I've only got about 120 pages under my belt so far but I'm certainly enjoying it.
Re:Nice! by Squalish · 2004-12-14 04:08 · Score: 1

Octavo?

--
People in Soviet Russia, however, appear to be afflicted with amusing juxtapositions of the aforementioned situation
Re:Nice! by westlake · 2004-12-14 04:30 · Score: 1

And often times new editions come out long after the death of the author with bullshit editorial changes in order to claim a new copyright
You can't casually dismiss the modern editions of classic texts published by Penguin Books, The Library of America, etc., which are a pleasure to read and more respectful to their sources than the 19th century editions typical of Project Gutenberg.
Re:Nice! by happyemoticon · 2004-12-14 04:39 · Score: 1

A spellchecker plugin would save me so many flames.
Re:Nice! by KillerDeathRobot · 2004-12-14 05:30 · Score: 1

In my experience though, many of them don't speak English very well.

This is not meant in a derogatory way, simply an observation. I don't speak Punjabi or Hindi at all so I'm hardly in a position to judge.

--
Thinkin' Lincoln - a web comic of presidential proportions
Re:Nice! by severoon · 2004-12-14 05:53 · Score: 1

Ok, so here's my idea to fix this. Presumably, they're indexing all this information so people will actually use it. OCR errors are not like many other kinds of errors, in that they're easy to recognize most times (o instead of a zero, etc).

So, I would advocate loading the texts in two places...one in the searchable DB and one in a wiki format. See which one gets fixed more quickly and more accurately. :-)

--
but have you considered the following argument: shut up.
Re:Nice! by severoon · 2004-12-14 05:57 · Score: 1

I am in a position to judge. I don't speak Hindi or Punjabi either, but I damn well want accurately digitized texts! Does that make me racist or something?

When is all this PC non-judgmental crap going to stop? They're being hired to do a job. They either do it well or they don't, and we have every right to judge the result as consumers of their work.

--
but have you considered the following argument: shut up.
Re:Nice! by Anonymous Coward · 2004-12-14 06:45 · Score: 0

You're in a position to judge work you haven't even seen?
Re:Nice! by severoon · 2004-12-14 11:39 · Score: 1

Yes, I am in that position. In much the same way my boss is in a position to judge the work I haven't yet completed (or haven't even yet been assigned). We are the consumers of the information--if we're not the ones to say whether it's done well or not, then who will?

This is a discussion of roles, not actual work product.

--
but have you considered the following argument: shut up.
Re:Nice! by KillerDeathRobot · 2004-12-14 12:35 · Score: 1

I was being non-judgmental because I wasn't talking about their abilities to speak English as relating to doing any job. The point was made that India probably has the most English-speakers in the world, and I was simply commenting that it's not their native language and I think that the statistics might be skewed because people being counted as English-speakers might only barely qualify as such.

--
Thinkin' Lincoln - a web comic of presidential proportions
Re:Nice! by Anonymous Coward · 2004-12-14 14:37 · Score: 0

For the most part, data entry workers are not writing or editing, they're just copying. People who know a language are more likely to copy idea for idea, while those who don't are more likely to copy character for character. Mind you, this only works with an alphabetic (maybe a syllabic) language, but it's been done, folks have done the studies, and this has been demonstrated to work.
Re:Nice! by Forge · 2004-12-15 11:08 · Score: 1

Having dealt with many Indian natives both on tech support lines and in person, I am in a position to judge. The ones with high level education (I.e. University graduates) have excellent written English and can be called on to spell check or proofread documents. Spoken English is difficult for many of them however. Below that there is a wide gap to the illiterate masses with pore English skills (Written or oral).

In case anyone wants to scream racism, My only surviving grandparent is an Indian. His parents came to Jamaica at the end of the 1800s looking for a better life.

--
--= Isn't it surprising how badly I spell ?
Re:Nice! by Anonymous Coward · 2004-12-16 06:23 · Score: 0

pore != poor

HTH

One more reason... by Anonymous Coward · 2004-12-13 18:51 · Score: 2, Insightful

to never leave my apartment.

Re:One more reason... by Em+Adespoton · 2004-12-14 06:54 · Score: 1

This isn't just Harvard that's involved...

ads by clovercase · 2004-12-13 18:52 · Score: 5, Funny

will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?

Re:ads by IntelliTubbie · 2004-12-13 19:48 · Score: 5, Funny

will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?

Yes, but it'll be mixed in with ads for V14gr4, male "enhancement", and Nigerian wealth opportunities. When the scientists complain, the humanities faculty will protest that spam is a perfectly valid epistemology, and that the scientists' attempt to impose an orthodoxy of "truth" in advertising is simply a power grab to extend Western, white male hegemony. At which point, the scientists will defect to MIT's library down the street.

Cheers,
IT

--
Power corrupts. PowerPoint corrupts absolutely.
Re:ads by tsm_sf · 2004-12-13 20:49 · Score: 2, Funny

Yeah, Theodoric of York has always held himself in pretty high esteem.

--
Literalism isn't a form of humor, it's you being irritating.

Google Cars by Zilverfire · 2004-12-13 18:53 · Score: 2, Funny

Google is diversifying extravagantly, pretty soon all of us geeks will be driving google cars that can cross reference the library of congress

--
"Could you put that in a memo entitled, SHIT I ALREADY KNOW!" - Sarge

Re:Google Cars by Televisor · 2004-12-13 19:12 · Score: 1

No, that'd involve going outside.
Re:Google Cars by Anonymous Coward · 2004-12-13 21:16 · Score: 1, Funny

What is this 'outside' of which you speak?
Re:Google Cars by Anonymous Coward · 2004-12-13 21:29 · Score: 0

Yeah, but can they run linux?

*runs*
Re:Google Cars by TimothyTimothyTimoth · 2004-12-14 00:13 · Score: 1

What is this 'speak' of which you type?

--
It doesn't matter which ape activates the Monolith
Re:Google Cars by nadadogg · 2004-12-14 04:44 · Score: 1

Keep in mind that we are not all shut-in geeks, some of us are also jogging/camping/hiking geeks, like me :) I work on a computer 40 hours a week, then go to class, I tend to get out and active on the weekend, to avoid having the dreaded "teddy bear" look.

--
i use linux and windows oh god how can i have an opinion
Re:Google Cars by Anonymous Coward · 2004-12-14 10:23 · Score: 0

I've heard that its big features include some sort of giant, flaming orb and vast hordes of normal people. Even worse, there are people there to prevent you from fixing those problems...

Re:Not Just Harvard by BizidyDizidy · 2004-12-13 18:53 · Score: 4, Funny

Also according to the summary, Einstein.

--
The safest way to approach lava is to have another person with you and he goes first.

Will it be like google scholar? by baronben · 2004-12-13 18:53 · Score: 5, Interesting

Ever since they introduced Google Scholar, I've been wanting something like this for my university. For those of you who don't know, finding articles on a subject can be a pain in the ass, as subjects are indexed on several different systems (depending on subject, date, and journal). None of them, not one, has a decent interface or gets results that are as good as google. Google scholar lets you search through academic texts, but it's limited to what's available, usually working papers or pre-published drafts. If there is some way that google could team up with Academic printers to index as many journals and texts as possible, this would make everyone's life a lot better.

I think this is a great start. There's incredible profit here too; universities spend millions for catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.

--
Sleep is for the weak!

Re:Will it be like google scholar? by adeydas · 2004-12-13 19:00 · Score: 1

it's happening all over... IITs in India are digitising libraries too... i guess it won't be long before everything will start with e...
Re:Will it be like google scholar? by ISEENOEVIL · 2004-12-13 19:07 · Score: 2, Interesting

As long as we don't end up with Google picking up one set of these prestigious library resources, Yahoo getting another set, and Microsoft picking up still more. I have a feeling some of these resources are meant to be universally accessible. This is one step closer, but still not close enough if you have to use 3+ different major search engines. My library fees that are tacked onto tuition would actually be put to use if I could use my preferred search engine to access everything my university is paying so much for in one place. As it stands now I cringe when I have to navigate our electronic resources.

-Stormy
Re:Will it be like google scholar? by Txiasaeia · 2004-12-13 19:11 · Score: 4, Interesting

"If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr."
Or finding that perfect article in the MLA database, only to find out that nobody in Canada subscribes to the journal, nor does anybody have the journal on fulltext. I'd rather have a more comprehensive fulltext database in plaintext rather than digitalised copies of everything anyway - makes searching a hellova lot easier.

--
Condemnant quod non intellegunt.
Re:Will it be like google scholar? by baronben · 2004-12-13 19:22 · Score: 4, Insightful

That's a great point that I think should be addressed (it has a bit, with some free-online journals, but nothing major). In the world of digital publishing, why do journals cost thousands of dollars a year? It's certainly not the costs: academics pay the journals to defray the cost of publishing, and editors and referees generally get only an honorarium, if anything.

Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so expensively that many colleges can't even afford them?

--
Sleep is for the weak!
Re:Will it be like google scholar? by blincoln · 2004-12-13 20:50 · Score: 1

Sure, the company needs to get some money to cover the costs of printing, distribution, and other things, plus the associations that sponsor the journal want some money to help hold conferences, but why, oh why, must they price journals so expensively that many colleges can't even afford them?

Printing a publication is expensive. Journals are advertisement-free, which is why they cost so much. I used to work for a student newspaper and it was ridiculous how much money we were paid for ads. Without that revenue, high subscription costs are the only way to go.

--
"...always new atoms but always doing the same dance, remembering what the dance was yesterday." -Richard Feynman
Re:Will it be like google scholar? by aussie_a · 2004-12-13 21:10 · Score: 1

Actually it's quite probable we'll lose the e and print books will be known as "traditional books", "paper books" or "print books" in common language. The e is used to differentiate the minority from the norm, once it is the norm it is likely it'll be lost.
Re:Will it be like google scholar? by belg4mit · 2004-12-13 21:19 · Score: 2, Interesting

Also try Scirus from the facts at FAST. I've often had better luck there than on google.

--
Were that I say, pancakes?
Re:Will it be like google scholar? by belg4mit · 2004-12-13 21:22 · Score: 1

s/fact/folk/

--
Were that I say, pancakes?
Re:Will it be like google scholar? by AlanS2002 · 2004-12-13 21:25 · Score: 0, Insightful

In addition to the other reply to this, there might also be the case of journals which are published by professional organisations being used to defray the cost of running such organisations. You'll also find individual subscription prices being much cheaper than institutional subscription prices, I'd posit a guess that the institutional subscription holders are in some part subsidising the individual subscription holders.

--
Not all conservatives are stupid,
but it is true that most stupid people are conservative.
- Hume
Re:Will it be like google scholar? by deimtee · 2004-12-13 21:28 · Score: 1

Not true. Printing is cheap. With digital presses even runs of less than a hundred are economical.
Editing, layout and publishing may cost more but the cost of printing and binding a couple of hundred pages is only a few dollars per copy, and goes down as the print run goes up. Factor in handling/postage/shipping etc, and you are looking at a cost of less than $10 per issue.
The reason they are so expensive is the overheads are spread across so few copies.
The reason there are so few copies is because they are so expensive.
Chicken and Egg.

--
I'm guessing that wasn't on their radar screen...
Re:Will it be like google scholar? by Anonymous Coward · 2004-12-13 21:39 · Score: 0

>That's a great point that I think should be addressed (it has a bit, with some free-online journals, but nothing major). In the world of digital publishing, why do journals cost thousands of dollars a year? It's certainly not the costs: academics pay the journals to defray the cost of publishing, and editors and referees generally get only an honorarium, if anything.

I'm sure they think you're overpaid too, son.
Re:Will it be like google scholar? by zebs · 2004-12-13 23:48 · Score: 1

More e-mail is sent than conventional post....probably... but I don't see e-mail being called mail any time soon
Re:Will it be like google scholar? by Rich0 · 2004-12-14 00:16 · Score: 2, Interesting

The one thing that something like google is lacking is persistent result sets. When I do serious searching I usually start with broad terms and figure out what it takes to narrow things down to a scale that I'm willing to work with.

Good quality search engines have lots of qualities that Google lacks. You could search for two words located within 3 words of each other. You could search for these two words within 3 words of each other while two other words don't occur within 6 words of each other. Indexes are generally well-thought-out and vocabularies are sometimes controlled.
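
A minimal sketch of what such a proximity query could look like. The helper names (`within`, `matches`) and the sample search terms are made up for illustration, and "within N words" is taken to mean the word positions differ by at most N; real search tools may define it differently.

```python
import re


def positions(words, term):
    """Indexes at which term occurs in the word list."""
    return [i for i, word in enumerate(words) if word == term]


def within(text, term_a, term_b, max_distance):
    """True if term_a and term_b occur at most max_distance words apart."""
    words = re.findall(r"\w+", text.lower())
    pos_a = positions(words, term_a.lower())
    pos_b = positions(words, term_b.lower())
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


def matches(text):
    # Hypothetical query: "digital" within 3 words of "library",
    # while "google" does NOT occur within 6 words of "ads".
    return (within(text, "digital", "library", 3)
            and not within(text, "google", "ads", 6))


print(matches("Harvard is building a digital research library for scholars"))
print(matches("The digital library shows Google text ads"))
```

The first sample satisfies both conditions; the second is rejected because the two excluded words sit too close together.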

Google allows many of these features, but they're cumbersome to use. If I ran two searches and I want to merge the results I have to be copying down everything I did, and try to concoct some kind of advanced search which combines the two sets of parameters. In a decent professional search tool you just ask it to return "set 1 or set 2" - giving you a set 3 that has any item that appeared in either. This is powerful and easy to use, and there is no comparison with google.
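
The "set 1 or set 2" operation described above maps directly onto ordinary set algebra. A small sketch, with made-up document IDs standing in for saved result sets:

```python
# Hypothetical saved result sets; the IDs are invented for illustration.
set_1 = {"doc-17", "doc-42", "doc-99"}   # results of the first search
set_2 = {"doc-42", "doc-7"}              # results of the second search

set_3 = set_1 | set_2        # "set 1 or set 2": anything in either
in_both = set_1 & set_2      # "set 1 and set 2": only the overlap
only_first = set_1 - set_2   # "set 1 not set 2"

print(sorted(set_3))
print(sorted(in_both))
print(sorted(only_first))
```

The point of persistent result sets is that these combinations work on stored results, so the original queries never have to be reconstructed.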

Don't get me wrong, I'm glad Google is going into this business. I no longer have free access to just browse the literature any time I feel like it, and this tool would provide that. I just don't think that they'll close down the commercial operations anytime soon.

Personally, I think that all articles written using federal funding should be released into the public domain. The NIH could sponsor journals if none of the commercial journals are willing to publish works that have no copyright. If my tax dollars were used to pay for a study on bumblebee migration patterns, then I should be able to thumb through the report whether or not some bureaucrat thinks that I have a need to know the results. And doing so should not require a trip to some non-public library halfway around the country...
Re:Will it be like google scholar? by trifakir · 2004-12-14 00:25 · Score: 1

I myself use both mail and e-mail for e-mail and snail mail for mail.
Re:Will it be like google scholar? by treerex · 2004-12-14 00:27 · Score: 2, Informative

I've been using CiteSeer for years in my research, and still prefer it over Google Scholar.

For computing research CiteSeer and the ACM DL are the two places to go. Scholar may obviate the need for going to both places, someday, but for now it needs to mature a bit.
Re:Will it be like google scholar? by miu · 2004-12-14 01:06 · Score: 1

In a business environment 'snail mail' sounds a bit slangy - I have heard 'postal mail' in meetings and conference calls to make it obvious what type of mail is being discussed.

--

[Set Cain on fire and steal his lute.]
Re:Will it be like google scholar? by failedlogic · 2004-12-14 02:33 · Score: 1

Not only that but the search engines - particularly the ones in Social Science - are very slow, hard to search, not indexed properly and crash frequently (as always, when I most need it). MedLine, IMO, is probably *the* best one there. I wish Google would replace some of those search engines with their own technology.

I concur. Google should team up with Academic publishers to increase the quality and quantity of the searchable information. And maybe if there is a bit of competition, the other databases will also improve.
Re:Will it be like google scholar? by tootlemonde · 2004-12-14 03:11 · Score: 2, Insightful
Good quality search engines have lots of qualities that Google lacks.
One solution is to use google to locate a superset of the target articles and then use a more powerful search engine to winnow the google result set. For an individual, this approach would mean maintaining a personal index of the articles but that is a problem of storage space and bandwidth which is relatively cheap.
The two main problems that google solves are
- having access to the articles in the first place
- reducing the number of possible articles to a manageable level
One could imagine a plugin for browsers that would add the additional search facilities to a google search. Until then, Google Hacks will get you started.
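
A rough sketch of the "personal index" half of that approach: a tiny inverted index over articles already fetched, which can then be winnowed with stricter queries. The `articles` dictionary is a made-up stand-in for the superset located through Google, and `build_index` and `winnow` are hypothetical names.

```python
import re
from collections import defaultdict


def build_index(articles):
    """Map each lowercase word to the set of article IDs containing it."""
    index = defaultdict(set)
    for article_id, text in articles.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(article_id)
    return index


def winnow(index, required, excluded=()):
    """Article IDs containing every required word and none of the excluded ones."""
    required = [word.lower() for word in required]
    if not required:
        return set()
    hits = set.intersection(*(index.get(word, set()) for word in required))
    for word in excluded:
        hits -= index.get(word.lower(), set())
    return hits


# Hypothetical superset of articles found by a broad search.
articles = {
    "a1": "OCR accuracy on scanned library books",
    "a2": "Library funding and journal subscription prices",
    "a3": "Improving OCR accuracy with hand proofreading",
}
index = build_index(articles)
print(sorted(winnow(index, ["ocr", "accuracy"])))
print(sorted(winnow(index, ["ocr"], excluded=["proofreading"])))
```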
Re:Will it be like google scholar? by Politburo · 2004-12-14 04:40 · Score: 1

Journals are advertisement-free, which is why they cost so much. I used to work for a student newspaper and it was ridiculous how much money we were paid for ads. Without that revenue, high subscription costs are the only way to go.

It seems like you're making a few assumptions, or not providing us with the full story.

It's entirely possible that your newspaper was able to charge high ad fees due to demand, as opposed to being forced to charge high ad rates due to cost. The college demographic is highly sought after by advertisers, so putting ads in a paper that you know will be read almost exclusively by your target audience sounds like it would be a good move.

As siblings mentioned, the act of printing is very cheap.
Re:Will it be like google scholar? by OddWeapon · 2004-12-14 05:00 · Score: 1

Those of us who are computer scientists do not have this problem... I guess because the ACM and IEEE make their indexes crawlable, if not the article texts themselves (although these are often also mirrored somewhere freely accessible)...
Re:Will it be like google scholar? by Eccles · 2004-12-14 05:03 · Score: 1

but I don't see e-mail being called mail any time soon.

Might I mention one of the most recognized phrases of recent history, "You've got mail"?

--
Ooh, a sarcasm detector. Oh, that's a real useful invention.

hah by usernotfound · 2004-12-13 18:53 · Score: 1, Funny

Doesn't matter if they do Purdue's, I think we have the 11th worst library in the Big10. I already use Google for my papers, anyways.

--
You call it excessive, I call it ambitious.

Re:hah by supimmike · 2004-12-13 23:53 · Score: 1

Exactly...

For my gen eds I am trying to figure out which courses do not require written papers to skip the whole library deal.

So... by Anonymous Coward · 2004-12-13 18:53 · Score: 4, Funny

If I download a book, when do I have to upload it again? What is the late fee if I forget?

Re:So... by GreatBunzinni · 2004-12-14 02:38 · Score: 1

Maybe the downloads are distributed via bittorrent, which takes care of that.

--
Slashdot, fix your code or at least hire someone who is competent at it to do it for you.

Let me be the first to say by slinky259 · 2004-12-13 18:55 · Score: 1

That is funking awesome!

~stephen

http://slinky259.blogspot.com

Google to cache the Universe by sjrstory · 2004-12-13 18:56 · Score: 3, Funny

Seeing as Google cached the entire Internet (the last page of the Internet can be seen here: http://www.google.ca/search?q=cache:dQrQDn0dHW8J:www.1112.net/lastpage.html+the+end+of+the+Internet&hl=en&client=firefox-a), Google is now looking to cache everything else in the Universe :)

Re:Google to cache the Universe by PingPongBoy · 2004-12-14 07:06 · Score: 1

According to the Google spokesperson "Ever since we acquired a 1 Googlebyte hard drive, we've been obsessed with filling it with data generated by others. Did you want to search Microsoft's cache of Google?"

--
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.

get your scuba gear... by uighur · 2004-12-13 18:57 · Score: 2, Insightful

because it's time to dive into the deep web. Projects like this are the key to unlocking the vast stores of important information which are currently not readily accessible online. Personally I'd like to see a Google-run free access LexisNexis project.

Re:get your scuba gear... by burns210 · 2004-12-13 20:42 · Score: 1

scholar.google.com

They are getting there.

15 million volumes? by Anonymous Coward · 2004-12-13 18:58 · Score: 3, Funny

Please, give me the values in standard metrics, like Libraries of Congress!

Re:15 million volumes? by Anonymous Coward · 2004-12-13 19:10 · Score: 0

Please, give me the values in standard metrics, like Libraries of Congress!

If they were written on stone tablets, these volumes would weigh approximately 75 million long cwt (hundredweights), which is the equivalent of 9.8 billion slugs, or roughly 8.42 billion lbs avoirdupois. Hope this helps.
Re:15 million volumes? by HoneyBunchesOfGoats · 2004-12-13 19:26 · Score: 2, Funny

From Fascinating Facts About the Library of Congress:

The Library of Congress is the largest library in the world, with nearly 128 million items on approximately 530 miles of bookshelves. The collections include more than 29 million books and other printed materials, 2.7 million recordings, 12 million photographs, 4.8 million maps, 5 million music items and 57 million manuscripts.

So to answer your question, it's about 0.52 LoC if you count only the books. :)
Re:15 million volumes? by Afrosheen · 2004-12-13 20:48 · Score: 1

I wonder how many miles of classified documents they or the Pentagon have under wraps, just waiting to be discovered?
Re:15 million volumes? by pmc · 2004-12-13 20:49 · Score: 5, Informative

The Library of Congress is the largest library in the world, with nearly 128 million items on approximately 530 miles of bookshelves.

The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.

For /. readers 1 BL = 1.17 LoC
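
Both conversion factors quoted in this subthread can be checked in a couple of lines. The counts are the ones given in the comments (15 million Harvard volumes, 29 million LoC books, 128 million LoC items, 150 million BL items); the `ratio` helper is just for illustration.

```python
def ratio(part, whole):
    """Size of one collection relative to another, rounded to two decimals."""
    return round(part / whole, 2)


harvard_in_loc_books = ratio(15_000_000, 29_000_000)   # Harvard volumes vs. LoC books
bl_in_loc_items = ratio(150_000_000, 128_000_000)      # BL items vs. LoC items

print(f"Harvard = {harvard_in_loc_books} LoC (books only)")
print(f"1 BL = {bl_in_loc_items} LoC (all items)")
```

Note that the two figures use different units: the first compares books only, the second compares all catalogued items.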
Re:15 million volumes? by commodoresloat · 2004-12-13 22:06 · Score: 1

The Library of Congress is the largest library in the world
How many Libraries of Congress is it?
Re:15 million volumes? by commodoresloat · 2004-12-13 22:11 · Score: 4, Funny

The British Library (www.bl.uk) has 150 million items
He means just books and such. It's not fair counting umbrellas.
Re:15 million volumes? by kalidasa · 2004-12-14 00:43 · Score: 1

I suspect that on a word-to-word basis, the LoC would come out ahead: many of the items in the BL might be very short (for instance, is each papyrus fragment counted as a separate item? Many of those have only part of one word on them.). At any rate, the claim that Harvard Libraries is second only to the LoC would only be credible if they're talking about the US, because I'm pretty sure that both the BL and the Bibliotheque Nationale are bigger than Harvard Libraries.
Re:15 million volumes? by RealErmine · 2004-12-14 01:58 · Score: 1

4.8 million maps

4.8 million maps? If the library was any good they'd only need one.

--
Dewey, you fool! Your decimal system has played right into my hands!
Re:15 million volumes? by clambake · 2004-12-14 02:34 · Score: 2, Funny

The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.

For /. readers 1 BL = 1.17 LoC

Sorry, I still don't understand... Could you express that in terms of how many shuttle explosions would be required to completely destroy one BL?
Re:15 million volumes? by Thomas+Miconi · 2004-12-14 05:07 · Score: 1

The Library of Congress is the largest library in the world

How many Libraries of Congress is it?

For you it will be about 0.8 Libraries of Congress, cos the relativistic speeds you'll reach on being booted from here will make it look smaller.

*Prepares deuterium-deuterium fusion tokaboot...*

Thomas-
Re:15 million volumes? by Anonymous Coward · 2004-12-14 06:52 · Score: 0

This is only because the British Pound is strengthening against the dollar.

Images and formatting? by MacFury · 2004-12-13 19:00 · Score: 2, Insightful

I should RTFA but what about images and general formatting? I suppose you could find the relevant text, then try and get the physical book...but if you could view the book in its original formatting...that would be sweet.

Just how much storage space will all this data consume? It seems like a massive undertaking.

Re:Images and formatting? by Anonymous Coward · 2004-12-13 21:34 · Score: 0

There is no FA. Did no one notice? Here is one
Re:Images and formatting? by bazzman · 2004-12-13 21:55 · Score: 1

Not RTFA. RTFL (Read The F... Library)

Re:first kumquat! by Anonymous Coward · 2004-12-13 19:00 · Score: 0

Are you trying to google bomb 'kumquat'? If so, the effort so far looks rather weak.

Money to blow! by Anonymous Coward · 2004-12-13 19:01 · Score: 1

Wow, so I guess Google doesn't know what to do with their IPO money and is just blowing it on a me-too project!

Re:Money to blow! by hyperlinx · 2004-12-13 21:17 · Score: 1

How about seeing Google as a business, and this being a next step in providing information, while making money off of side ads...this only further entrenches Google in the information provider environment and makes good business sense, and shows innovative thought....the key questions deal with the costs/methods to access the library's works and the format of searching for and viewing the data. Also, if this is linked to the scholar.google searches I see it as a step forward for anyone doing research, especially on uncommon topics where limited resources exist.

--
In /.space, no one can hear you SCREAM!

Re:Not Just Harvard by ravenspear · 2004-12-13 19:02 · Score: 1

Also according to the summary, Einstein.

Yes but the FS is starting to go the way of the FA as far as the number of actual readers is concerned. I admit to occasionally falling victim to this unfortunate disease myself. Sometimes I only read the headline, and with some of the YRO ones that take up nearly the whole width of my 1280px wide monitor, sometimes I can't even get through all of that.

Are these volumes stored as text or pictures? by wealthychef · 2004-12-13 19:03 · Score: 2, Insightful

I am ambivalent about this. Will the books be stored as text to enable searching? If so, given that part of a book's character is its font and typesetting, will ALL the flavor of these books really be captured, in the same way that it would be to read them? Something seems likely to be "lost in translation" here.

--
Currently hooked on AMP

Re:Are these volumes stored as text or pictures? by clovercase · 2004-12-13 19:08 · Score: 3, Insightful

i think your comments would be salient if they were going to scan the documents and then BURN the originals. putting massive content on the web for free is the best way to push content all over the world. some internet user in sri lanka doesn't have the bandwidth to download images of the pages, and would never have the opportunity to view the actual documents in a library at harvard. if everyone digitized all the valuable content (and i presume that much of the content in harvard's libraries is valuable), and made it freely available, the world would be a much better place. would you be satisfied if there was a link on each page to view an image of the actual page?
Re:Are these volumes stored as text or pictures? by robla · 2004-12-13 19:09 · Score: 3, Interesting

I would hope they handle it just like catalog.google.com
Re:Are these volumes stored as text or pictures? by Txiasaeia · 2004-12-13 19:14 · Score: 4, Insightful

I think you're missing the point. I'm not so much concerned with getting rid of dead tree books (I love reading paper books for enjoyment); I would, on the other hand, prefer all my academic sources to be electronic. As I mentioned in reply to another poster, it's a huge pain to look something up on MLA or Expanded ASAP only to find out that my university doesn't carry it and the interlibrary loan system can't get it for two or three weeks because it's backlogged as it is. I couldn't care less about the spiffy fonts and typesetting; give me the plaintext so I can get my research done!

--
Condemnant quod non intellegunt.
Re:Are these volumes stored as text or pictures? by Anonymous Coward · 2004-12-13 19:51 · Score: 0

For non public domain works they will probably only provide access to a low resolution image.
Re:Are these volumes stored as text or pictures? by Anonymous Coward · 2004-12-13 20:57 · Score: 0

That might be ok for a lot of subjects.....but in a lot of humanities subjects your argument is plain wrong. Artists such as William Blake have used their own printing techniques and done their own illustrations. Certain texts are limited to very few or even one copy. Reproductions may be OK for some purposes but for some research it is necessary to refer to the original.

So ideally an electronic library, at least for some books, will contain the original scans as well as the OCR. Of course the scans are nice, but they still don't convey textures, and they may miss detail or contain artifacts. Full colour accuracy is not possible even with carefully calibrated equipment, because the gamuts of available monitors do not encompass the gamuts of all possible printing techniques. (Please make us better monitor scientist nerds)

I'm quite sure that these requirements are being considered by the relevant librarians who are extremely protective of their rare books.
Re:Are these volumes stored as text or pictures? by Anonymous Coward · 2004-12-13 20:59 · Score: 1, Funny

RTFA, smartass. The article clearly points out that they will indeed be burning the books after digitization and the librarians will be executed. The buildings will then be sold to Walmart. Google ads (including, rather surprisingly, pornographic ones) will be placed on each non-searchable, unindexed Flash-based web page and all chapter headers will blink. Like George Bush's web site, the pages will not be viewable from foreign countries and a permanent centralized record will be maintained of all user IPs at the Dept. of Homeland Security.
Re:Are these volumes stored as text or pictures? by supabeast! · 2004-12-13 21:06 · Score: 1

"...will ALL the flavor of these books really be captured, in the same way that it would be to read them?"

For the vast majority of the people who will ever use the tool, that won't matter. Most of the world's libraries don't hold onto old scholarly stuff indefinitely, assuming that they ever bought a lot of the obscure stuff. It seems likely that because this will be limited to public domain works, most of them will be old and hard to find, so anyone looking at them will quite likely have had no way to access them previously.

Even if something is lost in the conversion, the users will still be getting access to things they never would have had access to previously.
Re:Are these volumes stored as text or pictures? by ragnar · 2004-12-14 02:07 · Score: 1

It all depends on the nature of your research. I work in the field of Humanities Computing. While I represent the CS wing (as opposed to the Humanities), examples abound where the formatting and visual properties of a literary object are essential for research. Separation of content and presentation is a misnomer for the people I work with.

That said, I think any digitization that makes the materials available is a good thing, but it may not serve the research needs for everyone.

--
-- Solaris Central - http://w
Re:Are these volumes stored as text or pictures? by PingPongBoy · 2004-12-14 07:11 · Score: 1

would you be satisfied if there was a link on each page to view an image of the actual page?

Not quite. I need Tank to download the actual meanings of the information into my mind complete with examples, cross-indexing, case studies, interpretations, and optimized instructions to arbitrary goal solving.

--
Know your pads. One time pad: good for cryptography. Two timing pad: where to take your mistress.
Re:Are these volumes stored as text or pictures? by Pendersempai · 2004-12-14 08:07 · Score: 1

I don't know what field you're in, but in some areas of math (for example) the typesetting is absolutely essential to getting any value out of the work. I'm not disagreeing with you, though -- I totally agree in the vast majority of cases.
Re:Are these volumes stored as text or pictures? by Anonymous Coward · 2004-12-14 15:57 · Score: 0

the gamuts of available monitors do not encompass the gamuts of all possible printing techniques. (Please make us better monitor scientist nerds)

Your wish is their command: http://www.debevec.org/HDRI2004/

Yeah but Harvard? by Anonymous Coward · 2004-12-13 19:04 · Score: 1, Funny

Everyone knows that Harvard sucks.

Ivy Exchange by Anonymous Coward · 2004-12-13 19:04 · Score: 1, Informative

I know Brown has been digitizing all journals coming in for a while...

On another note, all the Ivies except Haavad participate in an interlibrary loan program. There's over 40 million bound volumes overall. Check it out here.

Re:Ivy Exchange by Anonymous Coward · 2004-12-13 21:47 · Score: 0

That's because journals now come electronically, idiot. The few that are still delivered in paper form usually include electronic access as well, which the respective library can then include in their indices.

The Fight against Plagiarism by manmanic · 2004-12-13 19:04 · Score: 5, Interesting

One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape would also be able to check whether content was lifted from printed material, as well as from the web.

Re:The Fight against Plagiarism by hussar · 2004-12-13 19:50 · Score: 0, Offtopic

It will also be interesting to see if anyone runs a project to see how much of the historical material was lifted from earlier writers. For example, how much did the US founding fathers "borrow" from other published works? A number of the early US statesmen attended Harvard (the second President of the United States, John Adams springs to mind), and it would be interesting to see how much, if any, of their writing was copied. John Locke's writing influenced much of the political opinion around the time of the founding. Did he "contribute" more than we know?

--

Bureaucracy loves company.

Crapload of storage... by killa62 · 2004-12-13 19:06 · Score: 1

So does this mean that the movies/audiotapes will be archived too. That's a crapload of storage.

Loebs!!!!! by canicus · 2004-12-13 19:07 · Score: 1

Maybe they'll put the Loebs up! No more $20 a pop when you live in a really obscure town.

Re:Loebs!!!!! by Jonathan · 2004-12-14 13:16 · Score: 1

Have you discovered Perseus? Actually somewhat more useful than the Loebs -- you can click on words to see the translation, case, etc.
Re:Loebs!!!!! by canicus · 2004-12-14 15:02 · Score: 1

Oh yes, and thanks anyway for mentioning them.

I was thinking "downloadable" when I posted that, since I'm on a dial in and reading on Perseus can be a pain in the rumpside, but now that I'm thinking about it, it probably won't be downloadable.

pHe4r the kUmqu4t! by Anonymous Coward · 2004-12-13 19:08 · Score: 0

I know, it's been slow going so far. But, you do what you have to.

That said... kumquat!

Labour force by stonda · 2004-12-13 19:10 · Score: 1

So what do they have for the task itself, little children from foreign countries?

How will the books be scanned? by supersat · 2004-12-13 19:14 · Score: 2, Interesting

About two months ago, Jeff Dean (an employee of Google) gave a talk at the University of Washington about the inner workings of Google. One thing he mentioned was Google Print and how they scan books: they slice 'em up into individual pages, and then feed them through a scanner. This doesn't seem like an acceptable way to archive a library's collection. So, how are they scanning them in? Why not use this method for Google Print?

Re:How will the books be scanned? by uunh+haun · 2004-12-14 03:21 · Score: 1

There are a number of different types of book scanners available. Libraries already scan tons of books for things like interlibrary loan and digitization projects.
Re:How will the books be scanned? by Anonymous Coward · 2004-12-14 07:17 · Score: 0

and I quote

Sources at Google said the book scanner used in the process was developed in-house and is not commercially available. While Google would give no further comments on the equipment, a statement from Harvard said evaluators at the university thought Google's scanning process "is much gentler with books than other high-speed processes in use today."

and I hate it when I forget my passwords. WrennM
Re:How will the books be scanned? by WEFUNK · 2004-12-14 07:39 · Score: 1

Google's scanning process "is much gentler with books than other high-speed processes in use today."

My guess? Based on all the descriptions that point to this being something different, gentler, much much faster, I bet they're using a technique that can handle reading text at very sharp angles -- so they only need to quickly "flip" through a fanned out book without even laying it out flat.

This also sounds more like something that Google could develop in house with all of their collected PhD's in AI, since the key technological challenge would not be the scanning hardware, but the algorithms for accurately recognizing elongated text across a background of varying darkness (from the edge of the page to the binding). If so, this might preclude any hope for high quality images though.

Any thoughts?

--
My next sig will be ready soon, but friends can beat the rush!

But will you be allowed to copy the materials? by Animats · 2004-12-13 19:15 · Score: 1

Or will they try to lock them up with an EULA, the DMCA, and some eBook system?

Re:But will you be allowed to copy the materials? by QuantumG · 2004-12-13 19:34 · Score: 1

well even if they do try to lock em up I can't see how they'd win a case if you were copying material that is in the public domain.

--
How we know is more important than what we know.
Re:But will you be allowed to copy the materials? by Anonymous Coward · 2004-12-13 21:24 · Score: 0

Well they could, with DRM. Can't see how that'd happen with google though. (web and drm don't go along very well)

Flipside: The false positive problem by rsborg · 2004-12-13 19:15 · Score: 2, Insightful

Ok, so this is just a bit of devil's advocate, but what happens if you just *happen* to have a writing style similar to someone else who was printed before... what if you read something, and unknowingly wrote something in a similar vein in your essay? I assume you could check it yourself, but then that would just introduce extra cost to even write the essay in the first place... or worse, the plagiarists could just "tweak" their papers ensuring that they're "below the radar" by changing enough style to not be recognizable...

--
Make sure everyone's vote counts: Verified Voting

Re:Flipside: The false positive problem by Gori · 2004-12-13 20:15 · Score: 2, Interesting

Well, there are such things as references.

Using work of other people in academic work is not only possible, but greatly encouraged. Just make sure that it is very clear what comes from whom.

In many ways, science is done exactly as Open Source software. Take what you need, modify and improve it where appropriate, and make sure you give full credit where due.

As a teacher, I have given full points to papers that have hardly any text of their own, as long as the sources are properly referenced and used together to make a valid point not made by any one of them.

So I do not think students should bother staying below the radar. Just reference everything, and voila, you are doing science

--
Complexity is a measure of our ignorance...
Re:Flipside: The false positive problem by tgibbs · 2004-12-14 12:27 · Score: 1

Ok, so this is just a bit of devil's advocate, but what happens if you just *happen* to have a writing style similar to someone else who was printed before... what if you read something, and unknowingly wrote something in a similar vein in your essay?

An accusation of plagiarism is a serious matter in academia. You need multiple lines of word-for-word identity before anybody is likely to raise a flag. There is essentially zero chance that you'll be accused of plagiarism on the basis of an accidental resemblance. A greater risk is that you might have read something with a particularly unusual and striking turn of phrase somewhere and forgotten it, and wrote it down genuinely believing that it was original. That might raise suspicions, but even here it is unlikely that the passage would be long enough for anybody to actually accuse you.

As far as plagiarists "tweaking" their papers, you can't really prevent this, but doing a decent job of tweaking starts to become as much work as writing something original, and indeed, at some point the work becomes merely derivative rather than plagiarized. You may get a C for a paper that is unoriginal, but you won't be flunked or hauled up before an academic review board.
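
To make the "multiple lines of word-for-word identity" threshold concrete, here is a rough sketch of a check that only flags long verbatim runs. This is an assumption about how such a tool could work, not a description of Copyscape or any real service; the function names and the 8-word window are hypothetical choices.

```python
import re

def word_ngrams(text, n):
    """Return the set of all runs of n consecutive words in text (case-insensitive)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_runs(essay, source, n=8):
    """Return the n-word runs that appear verbatim in both essay and source."""
    return word_ngrams(essay, n) & word_ngrams(source, n)

source = ("It was the best of times, it was the worst of times, "
          "it was the age of wisdom")
copied = ("As I always say: it was the best of times, "
          "it was the worst of times, honestly.")
similar = "Those years were the best of times for some and the worst for others."

print(len(shared_runs(copied, source)) > 0)   # long verbatim run is flagged
print(len(shared_runs(similar, source)) == 0) # a shared short phrase is not
```

With a long window like this, an accidental resemblance or a single borrowed phrase falls well below the threshold, while a lifted sentence does not. Shortening the window catches lightly tweaked text, at the cost of more false positives.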

I beg your pardon... by Anonymous Coward · 2004-12-13 19:17 · Score: 0

Make that the SECOND largest library system after the Library of Congress! University of California Library system is the largest.

Re:I beg your pardon... by hussar · 2004-12-13 19:55 · Score: 0

That may be true, but it is not all in one place. It is strategically scattered along the San Andreas Fault.

--

Bureaucracy loves company.
Re:I beg your pardon... by julesh · 2004-12-13 21:44 · Score: 1

Not to mention the British Library, which is larger than the Library of Congress.

But of course, that doesn't come after the LoC, so whether that makes the story factually inaccurate, or just misleading, is an interesting question.

Reminds me of the U of Michigan and U. Microfilms by Ungrounded+Lightning · 2004-12-13 19:21 · Score: 2, Informative

Back around the '60s or so the University of Michigan cut a similar deal with University Microfilms.

U Microfilms set up and ran a microfilming operation in the library system, microfilming everything that wasn't under copyright (and much that was with permission of the copyright holders, such as several large newspapers and many magazines and other periodicals), along with much of the University's records. Rare books, etc.

(If I have this right) the U got microfilm prints of the documents for free and didn't have to pay for the microfilming of its records. University Microfilms made its money by selling microfilms of the various publications (forwarding royalties, where appropriate, to the copyright holders). The rare books, for instance, could now be studied on microfilm with no further stress on the original, and their content became available at many other colleges and libraries. Good deal all around.

University Microfilms was founded by a regent, who was later slammed for conflict of interest. He dropped out of the Board of Regents but the business deal continued.

--
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way

clinton by sewagemaster · 2004-12-13 19:22 · Score: 1

if this is clinton's "library" that's to be "googlized" and "digitized", then that'll be an interesting "shot"... ;)

Homer: mmmmmm digitized google....

--
my blog

Sweet job Harvard by jtbauki · 2004-12-13 19:24 · Score: 1

That is awesome. I just wonder how the book publishers will respond? Imagine being able to read any textbook without paying for it? How will those textbook publishers who keep raising prices and reprinting books with "new" editions make money... I'm imagining an RIAA-like attack on online books. Watch out Google!

Education should be free. Especially now that information can be distributed so cheaply and so efficiently

On a side note, I believe the government should create some standardized books. For example, calculus. It's the same equations and theorems that each school teaches and it hasn't changed much in a long, long time. Teachers can dictate which parts to emphasize. We can have a committee of well-established professors write the book in the same way that any other calc book is written. The book can undergo revision every 10 years or more. Think of the money that can be saved for students!!! It can even be available online! Of course some books can't be standardized like history, where different viewpoints produce different versions of history.

Re:Sweet job Harvard by hussar · 2004-12-13 20:17 · Score: 1

Education should be free. Especially now that information can be distributed so cheaply and so efficiently

You have confused information and education. Information is the raw material. Education is the (never-quite-completely-) finished product.

You have also discounted the value of teachers and professors guiding you through the mass of information available in order to help you to use it to get an education. Their efforts are valuable and worthy of compensation. They can be paid for by your tuition fees, by a tax on all of us (in which case, you owe us some benefit in return for your studies), or a combination of the two. But, there is no free (as in beer) education.

--

Bureaucracy loves company.
Re:Sweet job Harvard by aconbere · 2004-12-13 20:22 · Score: 1

ugh... I don't know if anyone else has had experience with "standardization" done by the government, but I know that my school in junior high switched to standardized texts and classroom materials, which, frankly, were just terrible. This is just asking for failure.

(dons tin foil cap) Horace Mann was one of the founding fathers of public education in the US. His primary reason for promoting public education? "It is the best way to indoctrinate the youth with protestant republican (like "the republic" not "the party") values." If you look at our public education system, this type of thinking still holds true in many areas. It's true of the No Child Left Behind policy, it's true of state-mandated curriculum, it runs rampant. Examples are clearly evident in history textbooks, but can also find their way into math texts.

We all know that there are better ways of teaching math than what's going on in the States. We can look at how mathematics works in China and say, wow, they teach children basic fundamentals of sets and numbers, then teach them to apply that to arithmetic. But the current wealth of standardized texts in math continues to press the same old: memorize the theorem, do a million of the same problems, forget the theory, move on. Without the wealth and variety offered by competing textbooks, and freedom of education, we are doomed to continue down the same path to eternity.

Once we let government dictate how we learn, how do we regain control over content? We all know how bureaucracy works; do we want it to take that long to update textbooks? Do we want it to be that hard to get corrections made? Education needs the competition of distributed textbooks. Too many are crap; schools and teachers need the power to decide what textbook is proper for their curriculum. The education system in America is in sad, sad shambles, and this is hardly the way to go about fixing it.

Anders

Harvard is a small player, actually by organum · 2004-12-13 19:24 · Score: 1

40,000 volumes. Compared to 8 million for Stanford and 7 million for Michigan. The latter already has almost 20 million pages online.

Re:Harvard is a small player, actually by gmcgath · 2004-12-14 03:15 · Score: 1

What's amusing is that I'm a programmer for the Harvard libraries, and I only found out about it this morning in the Globe. They're going to tell us about it officially in a meeting this afternoon.
Re:Harvard is a small player, actually by xipho · 2004-12-14 04:32 · Score: 1

Mod parent up. The slashdot bit is misleading, 40,000 books in the pilot is hardly 15,000,000. This is a pilot project only.

--

only infrmatn esentil to understandn mst b tranmitd
Re:Harvard is a small player, actually by noFilter · 2004-12-14 05:26 · Score: 1

Every officer got the email 12/13 at 7PM. Are you a temp or a non-exempt?
Re:Harvard is a small player, actually by gmcgath · 2004-12-14 05:36 · Score: 1

No, I just don't check my work mail at night.

University of California is anti-digital by dananderson · 2004-12-13 19:24 · Score: 5, Informative

This is great. Compare this pro-digitalization attitude of Harvard, Stanford, and others with the University of California's (UC's) anti-digital position.

For books in Special Collections, they won't allow copies to be digitalized unless they are (1) paid a fee to scan the book (fair enough) and (2) paid a royalty to post the book to the web.

The royalty amounts to hundreds or thousands of dollars per book (about $100/page or image). This allows the libraries to act as a "profit center" for the universities. This policy applies to all UC campuses (I've tried UCB, UCLA, UCI, UCSD).

This is true even though the book is in the public domain (because they have physical possession and nobody can make copies until you sign a license agreement). This is true even if you're using the book for non-commercial purposes (such as free posting to the web).

Something is wrong here. People donate to UC libraries (either books or money) for the public good. They don't donate so the library can start a business licensing public-domain books.

Despite that, I have been able to scan many books (by using books in open stacks or purchasing them). These books concern Yosemite history and are at http://www.yosemite.ca.us/history/

Re:University of California is anti-digital by rritterson · 2004-12-13 20:23 · Score: 1

Too bad I can't mod you up, because I just had to reply instead. I go to UCB and often hear us brag about how we have the second largest 'public' collection in the nation (or is it world?), after the Library of Congress (Harvard is bigger, but is privately owned). It makes me quite sad that this is our policy, if what you say is true. Donations to the library are down, funding is short, and access to many journals has been cut. Digitizing books would save money and resources, and benefit everyone. Public universities exist for the public good, not for state profit. Heck, it's hard to get into the library without a UC-issued ID, and even then you can't take anything out of the library. Now, just how many out of the 15 million books can one student body actually use at one time?

--
-Ryan
AUWYHSTOT (Acronyms are Useless When You Have to Spell Them Out Too)
Re:University of California is anti-digital by JoshuaDFranklin · 2004-12-13 20:33 · Score: 2, Interesting

Got a link for that policy?

Ever tried a Freedom of Information Act (FOIA) request? Strange as it may seem, that apparently works in the State of Washington.
Re:University of California is anti-digital by novakyu · 2004-12-13 23:33 · Score: 1

Heck, it's hard to get into the library without a UC-issued ID, and even then you can't take anything out of the library.
Well, the undergraduate library (Moffitt/Main Stack) does require a UC ID, but I think if you don't have one and you have a valid (i.e. "scholarly") reason for access to the library, you can talk to the people at the main library (Doe Library). So, I don't think it's such an unreasonable policy, especially considering what a large place the main stack is and how difficult it would be to keep that place maintained if anybody from anywhere could get in. (And most departmental libraries don't have such restrictions.)
Now, what's been bothering me more is that there are special sections of the library called "graduate library" (and I believe all of classics library is like that....) to which undergraduates are not allowed unless accompanied by a patron^H^H^H^H^H^Hgraduate student. I think you can get special permission for that, too, from your department or something if you have a reason to go there, but still.
Now, just how many out of the 15 millions of books can one student body actually use at one time?
Heh--the organization of the main stack is horrible, and the Bancroft Library--well, if it's non-circulating, couldn't it at least open longer? But, what I've found is that, well, the fewer the students who use the library as a library (not as some sort of exam-time study hall), the easier it is for me to check out books (without recalling) and hold onto them (without being recalled). :)
Re:University of California is anti-digital by kalidasa · 2004-12-14 00:51 · Score: 1

I don't see how they can impose a royalty on the TEXT of a book from their special collections, if the book was published before 1923. On their images (product of their scanning), sure, as they create the images and so own the copyright on the images; but they don't own the copyright on the original books.
Re:University of California is anti-digital by Pendersempai · 2004-12-14 08:04 · Score: 1

That is sheer brilliance. I wish to death I had mod points for you, but all I can offer is my absolute astonishment at your (or whoever thought of that's) insight.
Re:University of California is anti-digital by dananderson · 2004-12-14 08:51 · Score: 1
Here's some links to policies, which appear pretty uniform for all UC campuses--that is you pay a fee for reproduction of public domain works in their possession. I've tried all 3 libraries and they either say no, or don't even bother replying (for both personal visits and a snail mail letter):

A FOIA request may help, if it applies to UC. I'm not familiar with it, but isn't that a federal law that applies to the U.S. government? I don't know if California has an equivalent. IANAL, but I'll look into it and see if there's a California state version of FOIA.
Re:University of California is anti-digital by dananderson · 2004-12-14 08:55 · Score: 1

They can impose a royalty once you sign a licence agreement to do so. If you don't, you can copy public domain, but they won't give you access to do so. Sort of a Catch-22.
Here's a quote from a California Library Association webpage "Identifying Public Domain Sources to Borrow From" by Mary Minowis: "Though controversial, licensing agreements have been upheld, even when they protect works that are not copyrightable [i.e., public domain]. See ProCD, Inc. v. Zeidenberg, 86 F.3d 1447 (7th Cir. Wis. 1996)."

Planned for 05' by after · 2004-12-13 19:29 · Score: 0

Microsoft will do the same

"Slice and scan" is used for new books only by dananderson · 2004-12-13 19:31 · Score: 3, Informative

I'm not familiar with Google Print, but "slice and scan" is usually used for new books only. That's because there are multiple copies of the book available and the paper is usually flat and dust free.

For older books, most archivists use a cradle and photograph the pages. It's easier on the book, requires no slicing, and there's no scanner to clog with dust.

The disadvantage is the scanner operators need a little bit more training, but that's not a big problem.

Re:"Slice and scan" is used for new books only by brunogirin · 2004-12-14 01:42 · Score: 2, Informative

Also, the "slice and scan" method is much, much faster because you can feed the whole book in one go to a high volume scanner and hey presto! it comes out with all the scans in minutes rather than spending hours photographing and scanning each page individually. But of course, "slice and scan" is a destructive method (destructive for the book) so only makes sense if the printed book is not a rare item.

don't forget... by kaedemichi255 · 2004-12-13 19:33 · Score: 1

don't forget that amazon has the "search inside the book" feature that has been available for a few years now. i guess the main difference is that google is targeting a lot of academic sources, while amazon gets its database of book texts from publishers. if the two were combined... then maybe they could form ibdb.com, the Internet Books Database ;)

Re:don't forget... by Sein · 2004-12-14 01:43 · Score: 1

Err, didn't Amazon lease Google's tech for that?

Their entire library? by Anonymous Coward · 2004-12-13 19:33 · Score: 0

I bet we are going to find Goosebumps books in it. Millions of them.

Re:Their entire library? by Anonymous Coward · 2004-12-13 22:17 · Score: 0

You forgot the babysitters club.

Text of Dec 13th Email by olvr · 2004-12-13 19:34 · Score: 5, Informative

December 13, 2004

Dear Colleague,

I am writing today with news of an exciting new project within the Harvard libraries. As all of us know, Harvard's is the world's preeminent university library. Its holdings of over 15 million volumes are the result of nearly four centuries of thoughtful and comprehensive collecting. While those holdings are of primary importance to Harvard students and faculty, we have, for several years, been considering ways to make the collections more useful and accessible to scholars around the world. Now we are about to begin a project that can further that global goal -- and, at the same time, can greatly enhance access to Harvard's vast library resources for our students and faculty.

We have agreed to a pilot project that will result in the digitization of a substantial number of volumes from the Harvard libraries. The pilot will give the University a great deal of important data on a possible future large-scale digitization program for most of the books in the Harvard collections. The pilot is a small but extremely significant first step that can ultimately provide both the Harvard community and the larger public with a revolutionary new information location tool to find materials available in libraries.

The pilot project will be done in collaboration with Google. The project will link Harvard's library collections with Google's resources and its cutting-edge technology. The pilot project, which will be announced officially tomorrow, is the result of more than a year of careful consultation at many levels of the University. We could not have achieved a meaningful pilot project without the efforts of the Harvard Corporation; the President, Provost, Chief Information Officer, and Office of General Counsel; the University Library Council; and senior managers within the College Library and the University Library.

A full description of the pilot program follows here, with further materials available on the Harvard home page tomorrow.

With best regards,
Sidney Verba
Carl H. Pforzheimer University Professor and
Director of the University Library

Project Description:
Harvard's Pilot Project with Google

Harvard University is embarking on a collaboration with Google that could harness Google's search technology to provide to both the Harvard community and the larger public a revolutionary new information location tool to find materials available in libraries. In the coming months, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, an FAQ detailing the Harvard pilot program with Google will be available at http://hul.harvard.edu.

The Harvard pilot will provide the information and experience on which the University can base a decision to launch a large-scale digitization program. Any such decision will reflect the fact that Harvard's library holdings are among the University's core assets, that the magnitude of those holdings is unique among university libraries anywhere in the world, and that the stewardship of these holdings is of paramount importance. If the pilot is deemed successful, Harvard will explore a long-term program with Google through which the vast majority of the University's library books would be digitized and included in Google's searchable database. Google will bear the direct costs of digitization in the pilot project.

By combining the skills and library collections of Harvard University with the innovative search skills and capacity of Google, a long-term program has the potential to create an important public good. According to Harvard President Lawrence H. Summers, "Harvard has the greate

This is great! by Goldrush80 · 2004-12-13 19:36 · Score: 1

This will be sweet. I just hope that we don't get too many authors getting pissed.

Re:This is great! by tepples · 2004-12-13 20:13 · Score: 1

They're DEAD for cricket's sake. How can they get drunk?

Oxford University gets every UK book published by aegilops · 2004-12-13 19:37 · Score: 3, Informative

The library of the University of Oxford, i.e. the Bodleian Library, was the first "copyright" library in the UK - one of only three - which means that it automatically gets a copy of every book published in the UK.

Aegilops

Re:Oxford University gets every UK book published by Jon+Chatow · 2004-12-13 20:33 · Score: 4, Informative

Actually, they don't automatically get copies. They have the right to get one, but they don't have much space, so they only get copies of publications that they feel like getting. The British Library would be a more interesting one to team up with, as they get a copy of every publication...

--
James F.
Re:Oxford University gets every UK book published by Anonymous Coward · 2004-12-13 20:33 · Score: 1, Informative

No it does not. It has the right to get every book published, but it has to ask for them within a year of publication. Only the British Library gets the books automatically under law.
Re:Oxford University gets every UK book published by Andrew+Aguecheek · 2004-12-13 22:47 · Score: 2, Interesting

Yep, fell foul of this one the other day. The National Library of Wales happens to be situated in Aberystwyth, on the same hill as the University. (Which, by the way, is a bitch to climb in the mornings... do not apply for sea-front residences unless you are sure of your fitness!) Aaaaanyway, as the librarian there tactfully explained to me: one hell of a lot of books are published every year, and there's only so much space in the place... and they like to have a Welsh Language copy too!

--
Tomorrow, I may eat another house plant
Re:Oxford University gets every UK book published by illtud · 2004-12-14 06:17 · Score: 1

Aaaaanyway, as the librarian there tactfully explained to me: one hell of a lot of books are published every year, and there's only so much space in the place... and they like to have a Welsh Language copy too!

We get copies of most books which are widely published. We certainly don't get a "Welsh Language Copy" (about .0001% of UK published books are available in both languages). Any books missing from the collection which you think should be there can be requested for acquisition if the request is made while the book is still available from the publishers.

Just what percentage... by EvilMidnightBomber · 2004-12-13 19:38 · Score: 1

Google will provide online access to the full text of those works that are in the public domain

Just what percentage of the current works are public domain?

Re:Just what percentage... by julesh · 2004-12-13 21:40 · Score: 1

Google will provide online access to the full text of those works that are in the public domain

Just what percentage of the current works are public domain?

With a catalogue that size, probably most of them. The number of new books published per year isn't actually all that huge -- even if you acquired everything published in the US, I would expect it to take a long time for you to reach 15 million items.

Note that, for instance, the LoC has 29 million books, which is understood to be a significant fraction of every book published in the US since 1800.

Both Images & Uncorrected OCR should be availa by dananderson · 2004-12-13 19:38 · Score: 4, Insightful

Typically, both page images and uncorrected OCR are made available. Correcting OCR is too labor-intensive for thousands of books.

The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later.

The page images are tied with the uncorrected OCR so you can see exactly what's there.
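To make the idea concrete, here is a minimal sketch of a keyword index built straight from uncorrected OCR, with each hit pointing back at its page image. The page texts, file names, and function names are all invented for illustration; this is an assumption about the general approach, not how MoA or Google actually do it.

```python
import re
from collections import defaultdict

# Hypothetical uncorrected OCR output, keyed by page-image file name.
# The typos ("hiftory", "Mich1gan") stand in for typical OCR errors.
ocr_pages = {
    "page_0001.tif": "A hiftory of the territory of Michigan and its early settlers",
    "page_0002.tif": "The settlers of Mich1gan built roads through the territory",
    "page_0003.tif": "Early roads and the history of trade on the lakes",
}

def build_index(pages):
    """Map each lowercased word to the set of page images it appears on."""
    index = defaultdict(set)
    for image, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(image)
    return index

def search(index, keyword):
    """Return the page images whose OCR text contains the keyword."""
    return sorted(index.get(keyword.lower(), set()))

index = build_index(ocr_pages)
print(search(index, "michigan"))   # page 2 is missed because of its typo
print(search(index, "territory"))  # words the OCR got right are still found
```

A typo costs you one hit on one page, but a book that mentions a keyword more than once will usually still turn up, and the page image lets the reader check what is really printed there.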

For an example, see books at University of Michigan's Making of America (MoA) Exhibit, which has thousands of 19th century books and periodicals available.

All well and good, except by sulli · 2004-12-13 19:42 · Score: 1, Offtopic

Harvard Sucks

(they admit it themselves!)

--

sulli
RTFJ.

Re:All well and good, except by dkleinsc · 2004-12-14 01:32 · Score: 1

Hey, at least they don't go to a Safety School

--
I am officially gone from /. Long live http://www.soylentnews.com/

they are just doing what they said they will by rasmajx · 2004-12-13 19:43 · Score: 1

Google has always emphasized what their purpose is: to organize the world's information to be useful and to serve us.

baooooooo

Dead authors tell no tales . . . till now by dananderson · 2004-12-13 19:44 · Score: 2, Insightful

This will be sweet. I just hope that we don't get too many authors getting pissed.

Only public-domain books will be scanned. In all or most cases the authors are dead. However, this will revive a great body of work and widen access to many.

One class of author that may be pissed is those who take older works and just slap a foreword or introduction on the front and collect royalties. I've seen this done for many histories. But authors of today's works can count on royalties for themselves, their children, and their grandchildren (if the book is still selling). The copyright term is too long in the U.S., but that's another story . . .

Speaking of education... by quivrnglps · 2004-12-13 19:44 · Score: 1

As of 9 am on December 14, a FAQ detailing the Harvard pilot program...

Don't you mean an FAQ?

Seriously though, I can't help but wonder if projects such as this will help or hurt the overall literacy of the populace. It seems to me that the ability to extract excerpts quickly without having to peruse the context could lead to a less educated society. Some of the most interesting facts I have learned have been things I've accidentally run across in a book while looking for something else.

Don't get me wrong, I fully support the idea of having quick access to any information that might be needed. I am simply speculating that some other steps might need to be taken to ensure that future generations still benefit from the subtleties of knowledge that come from reading a book.

Just a thought.

-Daniel

Re:Speaking of education... by belg4mit · 2004-12-13 20:59 · Score: 1

It'd only be "an" if you're lame enough to spell out the acronym. If you pronounce it "fak" it's clearly "a fak", not "an fak".

--
Were that I say, pancakes?
Re:Speaking of education... by usernotfound · 2004-12-13 21:07 · Score: 0

We have an English department here at Purdue? I thought I took Sleep108... I guess that explains why I always woke up with an essay to catch my drool.

--
You call it excessive, I call it ambitious.
Re:Speaking of education... by Anonymous Coward · 2004-12-13 21:12 · Score: 0

No.

an Frequently Asked Questions doesn't make sense.

while a Frequently Asked Question.. might

You're both wrong. I agree. stop doing the hw for the dumbass young people. they are now retarded thanks to spell check.

I may sound like a troll... by Anonymous Coward · 2004-12-13 19:45 · Score: 0

But does it work under Lynx?

F1rst tr011!

"Girls seem to go for sensitive-type guys, so you've always got to act like you're listening to whatever it is they're yapping about, and pretend you give a rat's butt about stupid stuff like flowers and recycling. Oh yeah, be sure to wear plenty of aftershave!" Homer J. Simpson

Do no evil. by nels_tomlinson · 2004-12-13 19:48 · Score: 4, Funny

Their corporate motto is ``do no evil'', and we've all applauded that, but this is such a great thing that I think we could give them a pass on at least one evil act.

Maybe they could do something really evil to Microsoft, and then we could say: ``Well, you digitized Harvard's library, so we'll let it pass this time.''

--
See what I've been reading.

Re:Do no evil. by swiftstream · 2004-12-14 01:18 · Score: 1

Is anything one does to Microsoft evil, though?

Apart from, like, selling out to them, I mean.

--
Be a PATRIOT--because the only thing we have to fear is the lack thereof.
Re:Do no evil. by RealAlaskan · 2004-12-14 05:22 · Score: 1

Is anything one does to Microsoft evil, though?
Apart from, like, selling out to them, I mean.
Well, imagine Hannibal Lecter, and Microsoft. And a bottle of Chianti.
No, I guess you're right.

--
See what I've been reading.
Re:Do no evil. by Anonymous Coward · 2004-12-14 06:34 · Score: 0

I've been saying for years that Google is going to be the first major evil empire. Digitizing all information, caching information from 8 billion websites, and with the new Keyhole program they got, digitizing the surface of the earth. We'll bow to Sergey Brin and his evil associates as they laugh maniacally in pseudo-futuristic clothing.

Like my grandpa always said...Never trust a man what's made of gas.

Amen by lavaface · 2004-12-13 19:54 · Score: 2, Informative

It was just a matter of time before a project of this scope got off the ground. I would like to see them team up with Project Gutenberg (and perhaps archive.org) to provide images of the material. Throw in the little transcoder and perhaps wikipedia and we will soon have a killer information resource that can be cross-referenced to silly proportions. This is a boon for research. Projects like this and the public library of science will add much to collective knowledge. It would also be nice to see them team up with the newspaper project! Next stop--public domain LOC!!!

--
harmonious design

False positives can be double-checked manually by wrinkledshirt · 2004-12-13 20:12 · Score: 2, Insightful

The professor can just wait until the match comes up, and then double-check at that point.

You'd want to do a thorough overview of any potential instance of cheating anyway. A quick run-through would determine whether or not a paper happened to contain an identical sentence clause or three identical paragraphs.

I think the bigger problem would be the second one you described -- that students could plagiarize and then go through each paragraph, changing the wording slightly so as to avoid positive matches. Still, you could argue that this is pretty much what academics is anyway, just with footnotes and a bibliography.
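For what it's worth, here is a rough sketch of why exact-match checking catches straight copying but not light rewording. It compares overlapping three-word sequences ("shingles"), which is one common technique and only an assumption about how such a checker might work; the sample sentences and function names are made up.

```python
import re

def shingles(text, n=3):
    """Return the set of n-word sequences (shingles) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(paper, source, n=3):
    """Fraction of the paper's shingles that also appear in the source."""
    p, s = shingles(paper, n), shingles(source, n)
    return len(p & s) / len(p) if p else 0.0

source = "the library holds fifteen million volumes collected over four centuries"
copied = "the library holds fifteen million volumes collected over four centuries"
reworded = "the library keeps fifteen million books gathered over four centuries"

print(overlap(copied, source))    # identical text: every shingle matches
print(overlap(reworded, source))  # three swapped words break most shingles
```

With these samples the copied sentence scores 1.0 and the reworded one only 0.125, so a threshold tuned to flag verbatim copying would let the reworded version through. A human double-check would still be needed either way.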

--

--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...

Re:What the fuck is wrong with you? by tarunthegreat2 · 2004-12-13 20:15 · Score: 0, Flamebait

Because in Soviet Russia, the Library (and Lord Dredd from Captain Power) digitizes YOU!
Also because in Soviet Russia, real life tolerates YOU!
I, for one, welcome our PDF-making adsense-offering overlords.

--
My Favourite Meme

Screenshot by BReflection · 2004-12-13 20:15 · Score: 3, Informative

Screenshot of the service from John Battelle's Searchblog.

--
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"

It's about Time! by Shafe · 2004-12-13 20:16 · Score: 2, Interesting

I've been emailing them asking them to do this for years. I'm glad someone is finally doing it! There is only one problem: how do they get past copyright violations? I tried to get Cornell to do this on campus, but they said a lot of their volumes (periodicals, in particular) were still under copyright and hence cannot be scanned. No, it doesn't make any sense to let these carbon books literally fall apart when we can preserve them forever digitally, but that's the name of the game.

Someone hurry up with nanostorage so I can store the entire content of human knowledge on a postage stamp (with nanosecond seek time and gigabyte transfer speeds, of course)

Re:It's about Time! by burns210 · 2004-12-13 21:43 · Score: 1

According to a couple of articles (Google News has a bunch), non-copyrighted works will have their entire content viewable (Oxford, for instance, is only allowing pre-1901 books to be scanned). Books still under copyright will still show up in your results if relevant, but will only show the sentence (or two) or page (or two) surrounding the particular search term... with links to buy the book online.

Mailing Lists by lousyd · 2004-12-13 20:22 · Score: 2, Interesting

Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups".

--
If aspiration is a virtue, achievement cannot be a vice.

Re:Mailing Lists by FuturePastNow · 2004-12-13 20:39 · Score: 1

"Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups"."

You're a spammer, aren't you?

--
Give a man fire, and you warm him for the night. Set a man on fire, and you warm him for the rest of his life.
Re:Mailing Lists by burns210 · 2004-12-13 21:40 · Score: 1

It looks like this Digital Library is going to be part of Google Print, and be a special top-ranked entry on normal web searches...

I would like to see a Library section, of all the books scanned in (preferably text, with an image linked to, rather than an image you read off of).

Also, I think it would be neat to see a mailing-lists section, as an extension of their new google-groups2 system if possible.

Lastly, a blog search would be neat, though tricky. Being able to do an 'Opinion' section or 'Blog' section would pull a lot of the non-factual (though valid opinion) entries out of the main results.

Now we need an Image search 2 (they haven't recrawled the image database in several months, according to Google), and a Video search beta.

Then GDS beta 2 that includes my Gmail account results... but that is getting ahead of myself :)
Re:Mailing Lists by burns210 · 2004-12-13 21:54 · Score: 1

Heck, what is next, irc logs?
Re:Mailing Lists by sethstorm · 2004-12-13 22:33 · Score: 1

No, Google's exclusive nature would preclude them from logging IRC.

--
Twitter supports and protects racists - by smearing their critics with the "Hate Speech" label.
Re:Mailing Lists by TheUncleBob · 2004-12-13 23:37 · Score: 1

Google already does index mailing lists through their indexing of mailing-list archive sites like
http://www.mail-archive.com/
http://archives.neohapsis.com/
http://readlist.com/
http://marc.theaimsgroup.com/
Some of the sites have their own search, and some have a nice readable interface. Take your pick. Though I'm sure hundreds of mailing lists aren't indexed anywhere, perhaps that's what gmail is for ;-)

Re:What the fuck is wrong with you? by Anonymous Coward · 2004-12-13 20:22 · Score: 0

You must be new here.

New York Times article and print.google.com by dixon · 2004-12-13 20:23 · Score: 0

Better story at the New York Times. There's also http://print.google.com and the odd http://www.google.com/print/

second only to the Library of Congress. . . by Leonig+Mig · 2004-12-13 20:24 · Score: 2, Informative

... are you sure? Doesn't it mean (as is so often the case) "within the United States"? What about the British Library? What about the Bodleian at Oxford?

--
i'm trying to give up sigs.

Re:second only to the Library of Congress. . . by julesh · 2004-12-13 21:28 · Score: 2, Informative

Apparently the Bodleian only has 7.2 million volumes, so this is larger than that collection.

The British Library apparently has "150 million items" according to their web site, but a large number of these are not books (they claim, for instance, to have 8 million stamps). But, I'm pretty sure they have more than 15 million books.

Whether or not they have more books than the Library of Congress is an interesting question.
Re:second only to the Library of Congress. . . by burns210 · 2004-12-13 21:29 · Score: 1

Actually the British Library is something like 30 million books greater than the Library of Congress. Harvard is second largest in the US. First is the Library of Congress, and worldwide (as far as I know) is the British Library.
Re:second only to the Library of Congress. . . by Steve+Cox · 2004-12-13 21:33 · Score: 2, Informative

According to the British Library's website, it contains 150 million items and gains a further 3 million each year (but it doesn't distinguish between items and volumes - they collect any published item, and receive a copy of EVERY published item in the UK and Ireland).

The Bodleian has only 7 million volumes.

I would suspect that the British Library is substantially larger than Stanford's, but the Library of Congress is recognised as the largest library in the world.

Steve.
Re:second only to the Library of Congress. . . by Steve+Cox · 2004-12-13 21:39 · Score: 1

Blimey. It's amazing how many people come back with the same facts at the same time.

I can't wait by Anonymous Coward · 2004-12-13 20:28 · Score: 0

for Miskatonic University's library to get the same treatment, mwuhahahaha.....

U of Michigan by truesaer · 2004-12-13 20:30 · Score: 4, Informative

It looks like the largest portion of this will be 7 million items from the University of Michigan (compared to only 40,000 from Harvard). Good article from the Detroit Free Press.

Re:U of Michigan by truesaer · 2004-12-13 21:16 · Score: 2, Interesting

Actually, I see that it is Stanford, with 8 million items, that will get to claim to be the largest, followed by Michigan with 7 million. I don't know why Harvard is getting any props at all with only 40k items. Here is what I found most interesting in the article though:

The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail -- the task would be impossible, says John Wilkin, the U-M associate librarian who is heading the project.

"Going as fast as we can with the traditional means of doing this, it would take us about 1,600 years to do all 7 million volumes," he said. "Google will do it in six years."

Under the agreement, the library will get a digital copy of every book scanned. With those copies, the library can prepare special research projects, virtual exhibitions and more relevant scholarly and academic material for its students and faculty.

"If we were to do this job ourselves, it would probably cost us $600 million," Wilkin said. "That's just the human cost of preparing the material for scanning, packing it up and sending it out to vendors and then quality-control checking of the results. This is easily a billion-dollar effort."
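Running the numbers quoted above gives a feel for the scale. This is only back-of-the-envelope arithmetic on Wilkin's figures; the 365-days-a-year scanning schedule is my own assumption, not something the article states.

```python
volumes = 7_000_000          # U-M volumes, from the article
traditional_years = 1_600    # Wilkin's estimate for traditional methods
google_years = 6             # Wilkin's estimate for Google
in_house_cost = 600_000_000  # Wilkin's in-house cost estimate, in dollars

speedup = traditional_years / google_years
per_year_traditional = volumes / traditional_years
per_year_google = volumes / google_years
per_day_google = per_year_google / 365   # assumption: scanning every day
cost_per_volume = in_house_cost / volumes

print(f"speedup: about {speedup:.0f}x")
print(f"traditional pace: {per_year_traditional:,.0f} volumes/year")
print(f"Google pace: {per_year_google:,.0f} volumes/year, "
      f"{per_day_google:,.0f} volumes/day")
print(f"in-house cost: ${cost_per_volume:,.2f} per volume")
```

So the claimed speedup is roughly 267x, which works out to something like 3,200 volumes a day, against an in-house cost of about $86 per volume.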

Items will start appearing in 2005 with completion predicted for 2010. Can you imagine how many libraries there are out there? The information that could be gathered seems endless. I'm guessing they'll come up with a good way to detect duplicates in future libraries, but as anyone who has wandered through a University library knows there are a LOT of shady books that seem like they haven't been widely published and there are a LOT of things that were self published by academics in the University itself (theses, postdoc research, etc).
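As a guess at what duplicate detection could look like, here is a minimal sketch that matches catalog records on a normalized title, author, and year. The records, field layout, and function names are hypothetical, and real catalogs would need much fuzzier matching than this.

```python
import re
import unicodedata

def normalize(s):
    """Strip accents, lowercase, drop punctuation, collapse whitespace."""
    s = unicodedata.normalize("NFKD", s)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = re.sub(r"[^a-z0-9 ]+", " ", s.lower())
    return " ".join(s.split())

def record_key(title, author, year):
    """Build a comparison key that ignores cosmetic catalog differences."""
    return (normalize(title), normalize(author), year)

def find_new(already_scanned, candidates):
    """Return the candidate records not matched by anything already scanned."""
    seen = {record_key(*record) for record in already_scanned}
    return [r for r in candidates if record_key(*r) not in seen]

# Hypothetical catalog records: (title, author, year)
michigan = [
    ("Moby-Dick; or, The Whale", "Melville, Herman", 1851),
    ("Walden", "Thoreau, Henry David", 1854),
]
stanford = [
    ("Moby Dick, or the Whale", "MELVILLE, Herman", 1851),
    ("Leaves of Grass", "Whitman, Walt", 1855),
]

print(find_new(michigan, stanford))
```

Here only the Whitman record comes back as new, since the two Moby-Dick entries collapse to the same key. Self-published theses and the like would mostly have no match anywhere, which is exactly why they are worth scanning.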
Re:U of Michigan by ArsSineArtificio · 2004-12-14 03:33 · Score: 1

I don't know why Harvard is getting any props at all with only 40k items.

Since the Harvard library is so old, probably some of those 40,000 volumes are much rarer items than Michigan or Stanford are contributing.

--
All employees must wash hands before seeking equitable relief.
Re:U of Michigan by truesaer · 2004-12-14 19:58 · Score: 1

Since the Harvard library is so old, probably some of those 40,000 volumes are much rarer items than Michigan or Stanford are contributing.

I doubt it... These seem to be the regular library systems they're talking about. Any of the really old and rare stuff will be kept by specialized departments and museums and I doubt they would agree to a high speed OCR processing operation on that stuff.

Re:Not Just Harvard by rhennigan · 2004-12-13 20:39 · Score: 1

Soon we are going to start seeing people saying "I didn't RTFS, but...". I think this shows us the direction we are all headed with /.

Berkeley? Yale? by tavilach · 2004-12-13 20:41 · Score: 1

Stanford only has 6,865,158 books, and the University of Michigan only has 6,973,162. What about schools like Berkeley and Yale?

--

"Give me a lever long enough and a fulcrum on which to place it, and I shall move the world." -Archimedes

Re:Berkeley? Yale? by CountrySon · 2004-12-13 23:56 · Score: 0

Berkeley and Yale? I'd guess that, even if they're larger, they're not that much larger. It may be because the founders have ties to UMich and Stanford. Who knows?

how will this be better than 'grep' by edsarkiss · 2004-12-13 20:49 · Score: 1

not a cynical criticism but a certified curiosity ... without hyperlinks between pages and other metadata that comes with the web domain, how will Google add value to finding materials above and beyond what a fancy multi-indexed grep could provide?

put another way, aside from "full text search" and "online page image retrieval", what other operations could be put into place to make this a valuable service?

--

SIGUSR1

Re:how will this be better than 'grep' by fuzzybunny · 2004-12-13 23:33 · Score: 1

To use a completely oversimplified analogy, the end result probably won't be much different than a big fat 'grep'.

However, grepping through 15 million volumes of text and making an attempt at ranking results by relevance through a fancy perl script probably would require a bit of time and resources :-)
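A toy sketch of the difference, assuming a plain inverted index with TF-IDF scoring (a textbook ranking scheme, and certainly not a claim about what Google runs). The three "books" and all function names are invented.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical miniature corpus standing in for 15 million volumes.
docs = {
    "sailing.txt": "the ship sailed past a whale",
    "whaling.txt": "the whale the ship and the whale hunt",
    "farming.txt": "the farm and the plough",
}

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def grep(docs, term):
    """Linear scan: every query rereads every document, no ranking."""
    return [name for name, text in docs.items() if term in tokenize(text)]

def build_index(docs):
    """Inverted index: word -> {document: term count}, built once."""
    index = defaultdict(dict)
    for name, text in docs.items():
        for word, count in Counter(tokenize(text)).items():
            index[word][name] = count
    return index

def ranked_search(index, n_docs, query):
    """Score documents by summed TF-IDF of the query words, best first."""
    scores = Counter()
    for word in tokenize(query):
        postings = index.get(word, {})
        if not postings:
            continue
        idf = math.log(n_docs / len(postings)) + 1.0
        for name, tf in postings.items():
            scores[name] += tf * idf
    return [name for name, _ in scores.most_common()]

index = build_index(docs)
print(grep(docs, "whale"))                       # matches in file order
print(ranked_search(index, len(docs), "whale"))  # most relevant first
```

Both find the same two files, but the grep has to reread everything per query and returns them in file order, while the index is built once and puts the book that is actually about whales on top.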

--
Cole's Law: Thinly sliced cabbage

Re:Not Just Harvard by Anonymous Coward · 2004-12-13 21:15 · Score: 0

I didn't RTFT, but...

In a few years ... by Reez · 2004-12-13 21:23 · Score: 1

I think Google will start a project codenamed GSkyNet or something ...

One step closer... by darth_silliarse · 2004-12-13 21:35 · Score: 1

...to Google "becoming" the internet....

--
I've noticed that everyone who is for abortion has already been born - Ronald Reagan

Only 15 million volumes? by Anonymous Coward · 2004-12-13 21:39 · Score: 0

Only 15 million volumes?! That's much lower than I expected for a University of such stature. University of Waterloo has about 10 million volumes, University of British Columbia has about 10 million volumes evenly split between books and microfiche, University of Toronto has about 14 million holdings, etc.

Re:Only 15 million volumes? by Anonymous Coward · 2004-12-13 21:57 · Score: 0

Sorry, we don't carry the "Rush Million Book Collection".

But it's the same damn book every time by Anonymous Coward · 2004-12-13 22:00 · Score: 0

The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.

Yeah, but it's the same story:
"How We Lost the Americas, India, the Middle East, Ireland, but kept the Falklands"

That's okay, the LOC pays homage to the Brits by having a copy or two of Shakespeare, I hear.

Re:But it's the same damn book every time by loadquo · 2004-12-13 22:11 · Score: 1

The Falklands is important: it was the only place in the Empire where penguins lived!
Re:But it's the same damn book every time by pommiekiwifruit · 2004-12-13 22:35 · Score: 1

Well, Argentina did invade the Falklands since they thought there was oil there. It must have turned out to be more expensive than people thought though.
Speaking of disputed areas with oil, I wonder how the Spratly Islands (between China, the Philippines, and Vietnam) are going at the moment...
Re:But it's the same damn book every time by gnalre · 2004-12-13 23:53 · Score: 1

No, that's not right.

The penguins were developing WMDs

--
Choose your allies carefully, it is highly unlikely you will be held accountable for the actions of your enemies
Re:But it's the same damn book every time by henrygb · 2004-12-14 00:58 · Score: 1

The Empire is a little smaller now so South African penguins are now free (and happy in their black and white plumage), but officially the British Antarctic Territory is still part of it.
Re:But it's the same damn book every time by gomoX · 2004-12-14 03:01 · Score: 1

That's a big assumption. Here in Argentina a lot of people argue in favor of the Falklands War by saying that they're on our continental platform and a bunch of other stuff that doesn't make a lot of sense.
Actually I've just done a bit of Google research and there's a lot of mumbling about oil indeed. I've always thought it was a damn childish war whatsoever.

--
My english is sow-sow. Sowhat?
Re:But it's the same damn book every time by pommiekiwifruit · 2004-12-14 08:15 · Score: 1

they're on our continental platform
So is Chile...
My personal theory is that one reason why Thatcher defended the Falkland Islanders (apart from pride, duty, potential oil, strategic location, political gain etc.) was that if she hadn't, Chinese tanks might have moved into Hong Kong (thinking Britain would do nothing) and that would have got very messy. As it was the colony was handed over peacefully.
Didn't Argentina offer every person on the Falkland Islands US$1 million to leave at one stage? (They turned it down - should have tried *before* the war, not after :-)

Why journals are expensive. by commodoresloat · 2004-12-13 22:02 · Score: 4, Interesting

The reason there are so few copies is because they are so expensive. Chicken and Egg.

No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.

No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at kinkos, expect them to pay 35 cents a page or more for the copyrights alone.

And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases, but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology goes into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.

I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back real progress toward something really useful to scholars in a systematic way. At least until those perspectives change significantly.

Re:Why journals are expensive. by Anonymous Coward · 2004-12-14 00:52 · Score: 3, Informative

A link which backs the "greedy bastards" theory :
http://math.berkeley.edu/~kirby/journals.html
Re:Why journals are expensive. by DarkSarin · 2004-12-14 03:21 · Score: 1

Yikes! Which journals are you looking at? I use LexisNexis on occasion for legal research, as well as PsycINFO (EBSCOhost) for Psych research (frequently), and I can access full text of most articles that I need. I have more than 100 full text articles in various formats saved on my HDD (mostly .pdf and .txt) that I have acquired through these sources.

Some of the publishing companies are greedy, that I won't deny, but generally I have text access when I need it for newer articles. For older stuff, printed before electronic layout was the way to go, you usually wind up with a pdf that's nothing more than a bunch of scanned images (worthless).

For what it's worth, however, I tend to print all my sources anyway (makes it easier to read on trips & such).

The concept of free online publishing is limited by a number of things--the very first one being prestige. Publishing an article in Journal of Applied Psychology (JAP) is considered to be difficult and therefore prestigious. Thus if you are trying for a high-paying faculty position, you want to have a lot of pubs in the more respected journals. This leads a lot of researchers to do so, and ignore the possibility of publishing in a more "user friendly" format.

I suspect this will remain the case until a number of VERY respected researchers (and this holds true across academic fields) start using the online tools to publish openly and freely. Then it can slowly become prestigious to publish online. It will take a SOLID method of peer review, access control, and whatnot to help it along, but it is possible. The trick, however, is getting the Cohens, Danny Kahnemans (psychology researchers) and Stephen Hawkings of the world to publish their research in these venues instead of whatever journal is most prestigious at the time. Good luck with that!

--
"We don't know what we are doing, but we are doing it very carefully,..." Wherry, R.J. Personnel Psychology (1995)
Re:Why journals are expensive. by commodoresloat · 2004-12-14 06:00 · Score: 4, Informative

The prestige of a journal is related to the difficulty of getting an article past peer review, not to the fact of the journal being available online or in paper. So there is no "trick" at all other than for the prestigious journals that already exist to start making content available online or in other electronic form.
As for fulltext articles, try JSTOR if you want to see how to do it wrong. Page by page in gif format, and some huge pdfs with all pictures and no ability to process text. Useless!! Yes you can print it out but then I'd just as soon get the hardcopy in the first place.
Re:Why journals are expensive. by DarkSarin · 2004-12-14 07:00 · Score: 2, Insightful

I wasn't saying that the prestige of the journal had anything to do with the medium, but that there is a lot of name recognition.

JSTOR varies in quality from journal to journal--some are actually okay, while others suck. I know that I have gotten pdf's from JSTOR, but I wonder if that is a function of JSTOR or the amount that a person/institution is paying for access.

Most journals that I have dealt with online where I had to pay (because the university wasn't a subscriber) wanted between $15 and $25 for a single article. This is a LOT of money, and sometimes (if you aren't in a hurry), it is easier to contact the author and ask for a reprint--they usually have them, and if they are like many researchers, they are glad to send you a copy, provided you explain what you are doing.

There is a trick to it--the current prestigious journals ARE NOT going to go to a low/no cost format for publishing online until there are one or two major competitors who are seen as valid (peer-review) and prestigious. The prestige factor is huge and rests largely on (as you mention) the peer review process AND who is publishing in the journal. Sorry, but Robert Sternberg doesn't generally publish in just any old journal--he has one or two that he will send a manuscript to, and go from there.

When my thesis advisor (who wrote two chapters for the Handbook of Research Methods in Industrial Psychology) publishes, he typically sends stuff first to the Journal of Occupational Behavior, not DarkSarin's Online Journal of Amateur Psychology or Commoderesloat's Journal of Human Weirdness. Why? Because no one has EVER heard of those journals, and if he puts that on his vita, it won't make any difference to the next folks wanting to hire him for his research ability (not that he's going anywhere--he's a full professor).

But when the next university sees that he has published 10 articles in the Journal of Occupational Behavior (JOB), they say, "Hey, this guy is getting published in one of the top 10 journals in Behavioral Psychology, he's probably pretty good!" They will then probably hire him.

But when that same university interviews me, and I put down that I published 123 articles in DarkSarin's Journal of Computer Gaming Psychology, they are going to say, "Wow, I've never heard of that journal--is it peer reviewed? Is it attached to a professional association (APA, MPA, SIOP, etc)? Has anybody here heard of it? Does anyone who's any good publish in that journal?" If you are REALLY lucky, they MIGHT take the time to look up the answers, but chances are slim if the position is getting very many applicants (and if it isn't, it probably isn't paying very well!).

The long and the short of it is that there is little, if any, financial pressure to offer content online for free, and that is unlikely to change without competition. There is unlikely to be much competition, because few young researchers are going to put their career on the line by publishing in any but the most prestigious journals that they can possibly get an article into. Older researchers are already in the habit of sending articles to certain journals, and so they aren't likely to change either.

There isn't a good, quick, easy solution to this, and anyone who says that there is needs to have their head checked. Sorry.

--
"We don't know what we are doing, but we are doing it very carefully,..." Wherry, R.J. Personnel Psychology (1995)
Re:Why journals are expensive. by commodoresloat · 2004-12-14 09:14 · Score: 1

When my thesis advisor (who wrote two chapters for the Handbook of Research Methods in Industrial Psychology) publishes, he typically sends stuff first to the Journal of Occupational Behavior, not DarkSarin's Online Journal of Amateur Psychology or Commoderesloat's Journal of Human Weirdness.
Actually, your advisor just sent us a review essay for the Journal of Human Weirdness... ;)
There isn't a good, quick, easy solution to this, and anyone who says that there is needs to have their head checked. Sorry.
I didn't say there was an easy solution. The only thing that would make it easy is eliminating the element of greed. This is human social and intellectual progress we're talking about, no matter what field you're in, and it is short-sighted to put access to that progress in the hands of institutions who only answer to the bottom line. I understand they have to make a living, and I understand that a certain level of greed is integral to human nature, but it is incumbent upon those who are led by things other than greed -- e.g. professional organizations and editorial boards -- to take a more active role in making sure the work is more easily disseminated.
I just joined the publication staff of a journal in my field (not Human Weirdness) and when the idea of making the journal accessible online came up in a meeting, everyone was in favor of it. Yet the only online version that had been available up till that point was a crippled JSTOR version -- gifs of scans of bad photocopies of the paper journal. The desire to see it made more available was there but up until this point nobody had pressured the publishing company to make a real pdf available. I'm not sure how the publishing company will respond when we press them on this point (and the goal of the journal is now to be available in html too) but I do know how they will respond if nobody presses the issue....
Re:Why journals are expensive. by potat0man · 2004-12-14 11:44 · Score: 1

. . .even many liberal arts journals are far from cheap.

thanks...

Bill Daley M.A. Liberal Arts
Re:Why journals are expensive. by jc42 · 2004-12-14 12:09 · Score: 1

... current prestigious journals ARE NOT going to go to a low/no cost format for publishing online until there are one or two major competitors who are seen as valid (peer-review) and prestigious.

Perhaps. But it may happen sooner than you think.

Presumably I don't have to describe Science and Nature to anyone here, other than to note that both are peer-reviewed and prestigious. Both are fully available online now. At present, they are obviously in a transitional state. As of 2004, you can get full online access only if you're a "member", and in both cases, membership includes a print copy of their flagship journal. (Both organizations produce a number of other journals.)

This is clearly transitional, because like most computer-fluent subscribers, I've found that the printed copy mostly just sits on a table now until I throw it away. Their web sites are good enough that I can read them from my Mac about as easily as from the printed copy. And I can copy-and-paste passages to email, something that comes in very handy at times. And I can get to previous issues from anywhere that I have Net access.

Like most subscribers, I don't save the printed copies, because they're weekly (and thick), and I simply don't have the storage space for all the years that I'd like to have kept. It seems obvious to me that the sensible thing to do is to offer subscribers like me an online-only membership, for a price that's discounted by the printing and mailing costs.

My guess is that they'll start doing this within a year or so.

(They may have already, but their online subscription/membership information is sufficiently confusing that I can't tell. ;-)

--
Those who do study history are doomed to stand helplessly by while everyone else repeats it.

Knowing the background of Google... by sethstorm · 2004-12-13 22:05 · Score: 1

Well, it seems Google felt comfortable dealing with people of equally exclusive nature. When they start indexing colleges that don't require your soul (Yale+Harvard), $20k/year (all of them), or a bribe to the right person to get in (Stanford+MIT), Google will be doing no evil. But since their background precedes them, you might as well count on them doing evil, even if it looks like misplaced philanthropy.

--
Twitter supports and protects racists - by smearing their critics with the "Hate Speech" label.

A Canticle for Google by DollyTheSheep · 2004-12-13 22:38 · Score: 1

Wow, I'm really impressed. Together with Google Scholar, this will lift academic research considerably.

Now if they only could contact German libraries like "Bayerische Staatsbibliothek" http://www.bsb-muenchen.de/, too....

"second only to the Library of Congress" by Clansman · 2004-12-13 22:39 · Score: 1

However not if you take into account libraries such as the British Library which has 150 million items - this is bigger than Congress and Harvard combined.

Granted, some of these are just stamps :-)

Re:What the fuck is wrong with you? by Anonymous Coward · 2004-12-13 23:31 · Score: 0

In Korea, only old people repeat tired, canned jokes. :-)

money by stormi · 2004-12-14 00:15 · Score: 0

people with library cards view books for 'free'.... people with library cards also never return the books and supply the library with lots of money to keep everything in order. in a way it's still a paid service.

i am guessing that something like this will require a lot of money, perhaps even after it's finished to fix this or that... (someone pays someone for webspace a lot of the time... and labor...)

where is the money coming from?

also seems kind of like napster but with books....

--
"if only i had known i would have been a locksmith." -albert einstein

michigan even worse?! by _randy_64 · 2004-12-14 00:21 · Score: 1

As an Ohio State alumnus, I have to say - I didn't even know they had books in Michigan! Didn't think anyone there could read.

--
I mod down all the "free iPod"-sig losers.

Re:michigan even worse?! by Anonymous Coward · 2004-12-14 02:12 · Score: 1, Funny

Wow... you found the internet, congratulations

Help gives 400 Bad Request error by dinojemr · 2004-12-14 00:28 · Score: 1

Why is requesting help such a bad request? I think it is perfectly reasonable.

Spam? by Anders+Andersson · 2004-12-14 00:29 · Score: 1

According to an e-mail sent today to Harvard students

I assume the poster is talking about an actual e-mail message sent to all Harvard students, not a mere press release.

Was this e-mail message sent by Google, or by Harvard themselves? Either way, does Harvard permit their students to opt out from being spammed with the details of every agreement they make with third parties?

I work in university IT support. Don't you just love it when your university makes a deal with some company to distribute their software to staff and students, after which said company sees fit to spam all your students telling them to contact you immediately in order to have you install that software for them, in a manner contrary to the procedures already established by IT support?

MOD PARENT UP +5 Interesting by Anonymous Coward · 2004-12-14 00:31 · Score: 0

This is the most interesting issue about the article. Will the digital versions of the libraries' books that are in the public domain and are scanned by Google be allowed to be freely copied, or will they be released only under a typical restrictive copyright licence (as they would be legally entitled to do since a digital copy of an old work which is in the public domain is considered in law to be a new work subject to copyright)?

That's not how copyright law works. A digital copy of an old work which is in the public domain is considered in law to be a new work that can be fully protected by copyright, even though the copyright on the original work has expired.

Re:copyright law by Animats · 2004-12-14 05:07 · Score: 1
A digital copy of an old work which is in the public domain is considered in law to be a new work that can be fully protected by copyright, even though the copyright on the original work has expired.
Wrong. Bridgeman vs. Corel, 36 F. Supp. 2d 191.
However, there's the "Corbis copyright hack":
- Add digital watermarking information to the content.
- Register a copyright on the watermarking information.
- Use the DMCA's anti-circumvention provisions to prevent removal of the watermarking information.
- Claim that public domain content that's been through this process cannot legally be copied.
Re:copyright law by Anonymous Coward · 2004-12-14 13:22 · Score: 0

That's an interesting case. However, current legal opinion in EU jurisdictions doesn't follow your US precedent (which pre-dates the recent EU directives on copyrights and databases) and asserts that copyright applies to digital copies of public-domain works. Moreover, the WIPO Broadcast Treaty, if adopted internationally, will eventually force the US to give similar protection to digital copies of public-domain works.
Re:copyright law by Animats · 2004-12-14 15:10 · Score: 1

The WIPO Broadcast Treaty, if it goes anywhere, only covers "broadcasting" on Government-licensed stations. Some British museum group paid a lawyer in the UK to write an opinion that Bridgeman vs. Corel didn't apply broadly, but that's just a brief.
In practice, nobody seems to have taken a copyright claim on a photo or scan of a public domain image to court since Bridgeman vs. Corel. The museum community seems to have accepted Bridgeman vs. Corel. As it turns out, selling slides of pictures hasn't declined, the coffee-table book business is still good, and it just isn't an economic issue.
Re:copyright law by Anonymous Coward · 2004-12-15 03:16 · Score: 0

No, you're wrong again. The exclusive rights that would be created by the WIPO Broadcast Treaty include the right of reproduction, regardless of method, to the fullest extent of its current implications. Secondly, the only correct interpretation under current EU law is that digital copies of public-domain materials are themselves entitled to copyright protection.

Re:So after Iraq ... the UK is next? by Shawn+Parr · 2004-12-14 00:38 · Score: 0, Troll

Apparently you have never seen footage of Dubya making a public appearance.

I for one have no fear that our President will ever be personally concerned about a library.

Unless of course we put some oil and commie^H^H^H^H^H^Hterrorists in one for him to go after.

--
Shawn's Tech Articles

Universal Library by Tzinger · 2004-12-14 01:06 · Score: 1

If you google the words "universal library" you'll find this link http://www.ul.cs.cmu.edu/html/ at Carnegie Mellon. Why is Google doing something different?

--
"If all the American people want is security, let them live in prisons." Eisenhower

Championing the commons by mankey+wanker · 2004-12-14 01:08 · Score: 1

Every time we capitulate to money and power and grant new extensions to existing IP laws, this is exactly what we lose - we lose that material that belongs to everyone as a whole and to which we all have a right held in common.

I love moves forward like this. Perhaps if people understood what it meant to access knowledge and information at whim they wouldn't be so keen to keep extending privately held rights any further than is reasonable.

I live for the day when people count down the days until something enters the public domain. There are so many great works of art and knowledge that could gain new life from such enthusiasm.

How about Web of Science? by Bohnanza · 2004-12-14 01:21 · Score: 1

If there is some way that google could team up with Academic printers to index as many journals and texts as possible, this would make everyone's life a lot better.

If you're willing to pay, this is exactly what Web of Science does. It contains just about every article from every journal for the last hundred years.

WoS uses citation indexing, as ISI has done for many years, since well before Google came into existence. You can find newer articles by finding those which have cited the old article you're looking at.
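
For anyone who hasn't used one, here is a minimal sketch of the idea behind a citation index. This is not how WoS or ISI actually implement it; the function names and the paper identifiers are made up for illustration. It inverts each paper's reference list so that, starting from an old article, you can walk forward to the newer work that cites it.

```python
from collections import defaultdict


def build_cited_by(references):
    """Invert a paper -> references mapping into a paper -> citing papers mapping."""
    cited_by = defaultdict(set)
    for paper, refs in references.items():
        for ref in refs:
            cited_by[ref].add(paper)
    return cited_by


def later_work(paper, cited_by):
    """Return every paper that cites `paper` directly or through a chain of citations."""
    seen = set()
    stack = [paper]
    while stack:
        current = stack.pop()
        for citer in cited_by.get(current, ()):
            if citer not in seen:
                seen.add(citer)
                stack.append(citer)
    return seen


# Hypothetical data: each key lists the papers it cites.
references = {
    "smith1990": ["jones1970"],
    "lee2001": ["smith1990"],
    "park2003": ["jones1970", "lee2001"],
    "kim2004": ["other1985"],
}

cited_by = build_cited_by(references)
print(sorted(cited_by["jones1970"]))               # papers citing jones1970 directly
print(sorted(later_work("jones1970", cited_by)))   # direct and indirect citers
```

The lookup is cheap once the inverted mapping exists; the expensive part in practice is extracting clean reference lists from the articles in the first place.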

--

-----

Sorry, I'm only a 1336 h4x0r.

Citeseer by simon2263 · 2004-12-14 01:28 · Score: 1

Citeseer (www.citeseer.com) is fairly similar, providing keyword search on scientific articles. It also caches copies of papers for easy (!) access. Its disadvantages are: (i) you have to submit references manually by providing citeseer with URLs, (ii) it sometimes generates garbage titles of papers (don't know why) and (iii) after you've submitted a URL, it takes forever for citeseer to index them.

Muck Fichigan by ChefJRD · 2004-12-14 01:53 · Score: 0, Troll

I can't find anything about it online, but while at Champaign we were told that the University of Illinois' library system was the second largest in the country, behind Harvard. Oh yeah, and our basketball team is #1 :-) And sorry to all you Michigan folk out there, it's just been a while since I've been able to say that!!

Re:Muck Fichigan by Anonymous Coward · 2004-12-14 03:13 · Score: 0

They told you a lot of things at UIUC.

But Michigan is a leader in digitizing books and journals and Larry Page is an alum. And your basketball team will not last the season.

Here, for instance, is a page that claims that Yale is the second largest University collection.
http://www.cftech.com/BrainBank/OTHER REFERENCE/LIBRARIESANDMUSEUMS/MajLibUSCan.html
Of course, the country's largest library is indisputably the LOC. Harvard is just the largest University Library. Maybe UIUC is the second largest public University library (after Berkeley).

Note that Michigan press releases claim that the Google project will digitize about 7 million volumes in its collection, while the above link says U-M has just over 6 million volumes. Apparently, I'm going to have to return my 1.8 million overdue books to get this show on the road.
Re:Muck Fichigan by ChefJRD · 2004-12-14 03:16 · Score: 1

"The University Library is the largest academic library at a public university in the United States. It ranks behind Harvard and Yale as the third largest academic library in the United States. Today, with holdings of more than ten million volumes, the library has strengths in many areas ranging from hard sciences to the humanities." from http://www.library.uiuc.edu/mortenson/aboutus.htm
Re:Muck Fichigan by truesaer · 2004-12-17 07:38 · Score: 1

Hmm, well none of your 10 million are being digitized. And our football team kicked your team's ass as usual this year. Man do they suck.
Re:Muck Fichigan by ChefJRD · 2004-12-23 16:51 · Score: 0

This is true......our football sucks.....but basketball isn't doing all that bad :-) But I didn't go there for the sports, I went there for the engineering degree.

who cares by RagingChipmunk · 2004-12-14 01:58 · Score: 1

Really, who cares if they index the Library of Mich? Who is going to use it? Why don't they index the entire collection of Playboy? Now that would be a great use of technology - nothing I hate more than trying to find all the photographs in a set...

--
The only PT Boat Journal on the web: http://www.PT171.org

Re:who cares by twistedcubic · 2004-12-14 06:12 · Score: 1

It's funny that you asked about Playboy. Years ago I was doing research on the life of a famous author, and so I read every interview of him in print, including one in Playboy from the sixties. My school's library had all the back issues on microfilm, so I went to view it thinking that Playboy was probably tamer back then. Nope, it wasn't, and everyone on that floor of the library who walked past my screen (it was a large screen) thought that I was so desperate for porn I had to "read" Playboy on microfilm at the library.

New York Times article by sporktoast · 2004-12-14 02:03 · Score: 3, Informative

For what it is worth, there was an article in the Painted Lady about it today.

--
In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.

Re:New York Times article by The+Fun+Guy · 2004-12-14 05:49 · Score: 1

The New York Times (http://nytimes.com/) is nicknamed "The Gray Lady" (http://en.wikipedia.org/wiki/New_York_Times), or if one is feeling expansive, The Great Gray Lady. The nickname stems from the Times' decision in the 1950s to stick to plain black and white text when other newspapers started using color for illustrations, photos, ads, etc. However, the Times went color in the late 1990s, so the nickname is not really very apt anymore.

A "Painted Lady" is an ornate style of house popular in the late 19th century, a tattooed whore, and a type of butterfly, among other things.

--
The man who does not read good books has no advantage over the man who cannot read them. - Mark Twain
Re:New York Times article by Anonymous Coward · 2004-12-14 09:50 · Score: 0

For a "fun guy", you seem really humor-deficient. Even *I* could tell that was a joke. Or were you just explaining it for other folks who might not get it?
Maybe "The Pedantic Guy" would be a better name...

Google becomes Epic... by Cplus · 2004-12-14 02:38 · Score: 1

I just watched one of the most interesting pieces of media I've seen in a while. It's basically a look back at the growth of media and the growth of Google from 2014. If you've got 8 minutes to spare on something that might be fascinating, check it out.

Epic

--
"Share your knowledge. It's a way to achieve immortality." -- Dalai Lama

Re:Not Just Harvard by Anonymous Coward · 2004-12-14 02:50 · Score: 0

I didn't RYFP (Read Your F'n Post) but...

Will they be available in DJVU format? by Anonymous Coward · 2004-12-14 02:53 · Score: 0

Will these books be available in DJVU format or just in plain text?

DJVU is a compression format that stores the scanned image along with text and links the text to the image so that a DJVU reader can be used to search for text and highlight it on the image with a rectangle. This is far better than PDF for scanned books, especially older books with beautiful typography and engravings.
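
To make the "text linked to the image" part concrete, here is a rough sketch of what a hidden text layer boils down to. This is not the actual DjVu file format or any real reader's API; the `Word` class, the `find_rectangles` function and the coordinates are invented for illustration. Each recognized word carries the rectangle it occupies on the scanned page, so a search hit can be turned into a highlight.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCR'd word plus the rectangle it occupies on the page image (pixels)."""
    text: str
    x: int
    y: int
    width: int
    height: int


def find_rectangles(page_words, query):
    """Return the bounding boxes of every word matching `query`, ignoring case."""
    needle = query.lower()
    return [
        (w.x, w.y, w.width, w.height)
        for w in page_words
        if w.text.lower() == needle
    ]


# Hypothetical page: three words with made-up coordinates.
page = [
    Word("Century", 120, 80, 210, 40),
    Word("Dictionary", 340, 80, 260, 40),
    Word("century", 95, 400, 130, 28),
]

# A reader would draw a highlight over each returned rectangle.
print(find_rectangles(page, "century"))
```

The image stays untouched, which is why the typography and engravings survive; the text only exists to be searched.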

I especially like using the New Century dictionary on a laptop with 1 gig of RAM where I can hold it sideways like a book. Look at this site to see what I mean http://www.leoyan.com/century-dictionary.com/index.html

DJVU rocks IMHO.

Not just Harvard by boodaman · 2004-12-14 03:18 · Score: 1

Google is also doing Michigan's library (the University of Michigan, that is). Seven million volumes.

An announcement is forthcoming today.

Detroit Free Press article: http://www.freep.com/money/tech/mwend14e_20041214.htm

romance of the "stacks" by peter303 · 2004-12-14 03:38 · Score: 1

The thing I remember most about the Harvard Widener main library is floors upon floors, walls upon walls of bookshelves. Some areas smelled like no one had been there in a century. Lots of stories of ghosts, perpetual students living there into old age, and lusty encounters between patrons. This atmosphere was captured in the movies Love Story and The Paper Chase. Other old, large libraries have these stacks too. Where will the romance be when all this is turned into "bits"?

Re:Both Images & Uncorrected OCR should be ava by drooling-dog · 2004-12-14 03:44 · Score: 2, Funny

For an example, see books at University of Michigan's Making of America (MoA) Exhibit, which has thousands of 19th century books and periodicals available.

I see they've recently added the complete run of the Journal of the U.S. Association of Charcoal Iron Workers. If I'd known that, I could've saved a bundle on gift subscriptions...

Re:Both Images & Uncorrected OCR should be ava by Greenisus · 2004-12-14 04:16 · Score: 1

For correcting OCR, the text could be put in a wiki next to the page images, and whoever needs to read it first can correct it....

Re:Not Just Harvard by KrugalSausage · 2004-12-14 04:16 · Score: 1

i kan t reed. :(

Archive.org by jdt112 · 2004-12-14 04:17 · Score: 1

Archive.org is attempting to build a free online repository of movies, audio, and books. It's a pretty impressive collection considering all of the intellectual property barriers that they've gotten past.

Will it just be for harvard????? by Anonymous Coward · 2004-12-14 04:33 · Score: 0

While studing in boston. I thought i would go over to the harvard libary and have a look around.
Yet unfortunaly i was not able to take out any books. Only harvard students. All the other universities in boston have a partnership allowing sudents from the differt universities to use their books but not harvard??
why i am not sure but i guess we will have to wait and see.
ps i know i cant spell.
I also have an inferiorty complex for going to Tufts with all the other harvard rejects
cheers

Re:Will it just be for harvard????? by hether · 2004-12-14 06:04 · Score: 1

According to a BBC story, so far it will be the full libraries of Michigan and Stanford universities, as well as archives at Harvard, Oxford and the New York Public Library.

http://news.bbc.co.uk/1/hi/technology/4094271.stm

--

Most people would die sooner than think; in fact, they do.

Re:Not Just Harvard by noFilter · 2004-12-14 05:35 · Score: 1

I realize this is a little known fact, but the British Library has many more items than the Library of Congress.

Re:Both Images & Uncorrected OCR should be ava by Anonymous Coward · 2004-12-14 05:42 · Score: 0

Dananderson said
"The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later."

I don't know if I can agree with you here. Oxford is talking about its entire collection prior to 1901. Harvard is talking about very old books as well. Older books and odd type and print-faces have, in my experience, and that of friends, a much higher typo rate than merely 5%.

(my experience = closing in on 7000 pages on DP. - and I concentrate on the old stuff - odd print and long f's of Englifh. :) )

Just my 2-cents. The more typos - and un-proofed text - the harder it will be to index.
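
A back-of-the-envelope sketch of why the error rate matters so much for indexing. The thread doesn't say whether "5%" means per word or per character, so this assumes a per-character rate and assumes errors are independent (real OCR errors tend to cluster, so treat it as illustration only). Under those assumptions a word of n characters comes through intact with probability (1 - p)^n.

```python
def intact_word_fraction(char_error_rate, word_length):
    """Probability that a word survives OCR with no wrong character,
    assuming independent errors at a fixed per-character rate."""
    return (1.0 - char_error_rate) ** word_length


# Compare a few assumed per-character error rates for an 8-letter keyword.
for rate in (0.01, 0.05, 0.10):
    share = intact_word_fraction(rate, 8)
    print(f"char error rate {rate:.0%}: about {share:.0%} of 8-letter words intact")
```

If the 5% is per character, roughly a third of 8-letter keywords would be mangled, and long f's and odd typefaces only push the rate up.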

Re:Not Just Harvard by BizidyDizidy · 2004-12-14 05:44 · Score: 1

My cat's breath smells like cat food.

(In other words, did you have some reason for saying this to me?)

--
The safest way to approach lava is to have another person with you and he goes first.

But will they include the X cage? by Anonymous Coward · 2004-12-14 05:52 · Score: 0

I wonder if they will include the contents of the "x cage", a locked set of stacks supposedly containing "pornography". Having been in this thing (I worked as a book shelver as a freshman), it actually isn't so much porn but "sensitive material". A whole bunch of it is stuff published by the Third Reich, for example.

Re:Not Just Harvard by noFilter · 2004-12-14 05:56 · Score: 1

Humor? Another oft-mentioned phrase in the summary.

More than just Harvard by hether · 2004-12-14 06:07 · Score: 1

According to a BBC article today, so far they'll be digitizing the full libraries of Michigan and Stanford universities, as well as archives at Harvard, Oxford and the New York Public Library.

Some of the places are limiting their participation to just certain collections though.

--

Most people would die sooner than think; in fact, they do.

Re:Not Just Harvard by BizidyDizidy · 2004-12-14 06:14 · Score: 1

There's apparently some confusion.

Here you go

--
The safest way to approach lava is to have another person with you and he goes first.

FOIA fees by KMSelf · 2004-12-14 06:32 · Score: 2, Informative

FYI, FOIA isn't free, though the fees are pretty nominal. $0.10/page, $18/hr, after the first 100 pages, with a significant educational discount.

The thought of having a spook do my photocopying for me just sounds.... Hrm. Ironic?

--

What part of "gestalt" don't you understand?

Re:Both Images & Uncorrected OCR should be ava by Eccles · 2004-12-14 07:41 · Score: 1

Ah, so you were my Secret Santa!

--
Ooh, a sarcasm detector. Oh, that's a real useful invention.

Archive.org and DjVu by Anonymous Coward · 2004-12-14 08:12 · Score: 0

Google should use DjVu to distribute the documents, just like the Million Books Project at the Internet Archive.

Small text and footnotes by nycsubway · 2004-12-14 08:20 · Score: 1

I've had a lot of trouble getting my books scanned by Google and Amazon.com. I believe it is because the font is small (5-6pt). However, the type is clearly printed and if it is scanned at any decent dpi it should be fine to be run through OCR.

If Google is using a quick scanning/OCR method, then some small text, especially footnotes, in some of the Harvard books will be lost.

Also, if they want to index and search the books, it will be stored as text. Since Google is doing the scanning I'd imagine it's exactly what they would do. They may keep an image of the page and highlight the areas where search terms appear.

--
http://github.com/gbook/nidb

Re: Uncorrected OCR Not always useful. by Stuart+Poss · 2004-12-14 09:26 · Score: 1

While this may be true for some kinds of research it is definitely not for others. In biology one typically needs to have a correct identification before the biology of a given species can be investigated. Identification is closely tied to the names given by various authors, but is not identical with it.

If you do taxonomy, uncorrected OCR would be of little value, as a single letter difference is sufficient to treat the scientific names of organisms as different. One needs to know the precise spelling used as well as the context in which it is used to understand the concept of species the author had in mind (or sometimes the identification of the organism in question, depending on the complexity of the taxonomy).

Presently, if you look for all that is known about a species on Google, you will often get a lot of records that relate to that species. However, you may also get records that relate to another species whose name is a homonym, or to one that has been misidentified (incorrect name applied). You will not get the records in which the name was misspelled, or in which the species was treated under another name that, while technically a correct identification, is a junior/senior synonym of the name you searched for.

Uncritical use of taxonomic names resulting from uncorrected OCR in a research context has the potential to create considerable taxonomic confusion. In a world that is rapidly losing biodiversity, and where saving what remains rests largely on the correct identification of the organisms being studied, this is not an insignificant matter. IMHO it actually represents one of the greatest challenges facing mankind, particularly if you consider that we are often at the top of the food chain and we really know very little about the myriads of organisms that make up the ecology upon which we depend for survival but which most of us simply take for granted. Such ignorance will not serve us well in a future (or present) in which our environment has been (is being) severely degraded.

Hence, the recommendation that the uncorrected OCR be tied to the original image file from which it was produced is critical.
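As a rough illustration of why the link to the image matters, here is a minimal sketch that compares OCR'd names against a checklist of accepted spellings and never silently corrects a near miss, handing back the scan to be checked by eye instead. The `OcrName` record, the file name, the checklist, and the distance threshold are all assumptions made up for the example, not any real taxonomic database or API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class OcrName:
    """A scientific name exactly as the OCR produced it, tied to its source scan."""
    text: str
    image_file: str  # hypothetical path to the page image
    page: int


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def classify(record: OcrName, checklist: List[str], max_distance: int = 2) -> Tuple[str, List[str], str]:
    """Return (status, candidate names, image to consult); near misses are flagged, not fixed."""
    if record.text in checklist:
        return ("exact", [record.text], record.image_file)
    near = [n for n in checklist if edit_distance(record.text, n) <= max_distance]
    if near:
        return ("check image", near, record.image_file)
    return ("unknown", [], record.image_file)


# Illustrative checklist and a made-up OCR error ("e" read as "c").
checklist = ["Homo sapiens", "Pan troglodytes"]
result = classify(OcrName("Homo sapicns", "scan_0042.png", 42), checklist)
```

A one-letter difference could be an OCR slip, an author's misspelling, or a genuinely different name, and only the page image can settle which, so the sketch returns the image reference in every case.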

Not downloadable? (Big disappointment) by MinorTr · 2004-12-14 10:10 · Score: 1

Initially I was incredibly enthusiastic when I heard about this. But then I found the following in the FAQ at the "Google Print" page:

6. What can I do with a book that I find using Google Print?

Well, you can browse a few pages, learn more about the topics explored by the book, buy it, find it at a library, or commit a selection to memory. Browser printing and image copying functions are disabled on Google Print content pages.

If images can't be copied for public domain stuff, and they can't be downloaded from the site to be used for any purpose, then this is a lot less free than it sounds.

Re:Not downloadable? (Big disappointment) by trovaxo · 2004-12-14 14:07 · Score: 1

It's the copyright problem. Most books since 1968 are in a perpetual, non-expiring state of copyright - the "Mickey Mouse Effect".

In defense of JSTOR by tgibbs · 2004-12-14 11:56 · Score: 1

As for fulltext articles, try JSTOR if you want to see how to do it wrong. Page by page in gif format, and some huge pdfs with all pictures and no ability to process text. Useless!! Yes you can print it out but then I'd just as soon get the hardcopy in the first place.

I'd certainly rather have searchable text. But a lot of this stuff is not always easy to find even if you are at a university. Due to limited space, many university libraries have moved older journals to offsite storage, and there are significant delays in accessing these materials. Simply being able to rapidly access full text and view/print it is a big improvement.

Re:In defense of JSTOR by commodoresloat · 2004-12-14 19:36 · Score: 1

True ... but we wouldn't be in that situation if universities were still buying hardcopies of things. I don't mind the shift to electronic forms of publication (though I don't want it to completely replace print), but it doesn't make sense to make the shift if you're not actually gaining capabilities. I think the university libraries have talked themselves into buying the worst of both worlds -- more expensive access to crippled forms of publication. Of course, I would not want us to go backwards here and not get the electronic pubs, but it's frustrating to know what the capabilities of technology are and see it intentionally crippled.
Re:In defense of JSTOR by tgibbs · 2004-12-16 06:51 · Score: 1

True ... but we wouldn't be in that situation if universities were still buying hardcopies of things.

Most university libraries are years old and of fixed size. The literature is expanding exponentially. So universities are faced with the choice of diverting funds and land from other projects to construct additional libraries, or going to off-site storage, with electronic access making up for the loss in convenience.

Libraries Online by trovaxo · 2004-12-14 13:39 · Score: 1

The big problem is copyright. Every copyright since 1968 is essentially perpetual - so no recent works are available. The perpetuity is due to the "Mickey Mouse Effect": Disney Corp has managed to get Congress to periodically extend the date on which a copyright expires, to preserve the value of its Mickey Mouse franchise.

Why is this being done by the private sector? by Anonymous Coward · 2004-12-14 14:28 · Score: 0

This worries me. It should worry you too.

Our ability to effectively use the internet as a general knowledge/interest database - as we have become accustomed to doing - depends upon the private sector.

Now we are seeing a monumental effort to bring even more of our history and our culture - digitally - chiefly under the control of the private sector.

I see a future in which physical libraries are only a second thought, and Google is looked to as the chief arbiter of information - trivial and scholarly.

Are the ramifications of this clear?

For a fraction of a fraction of the cost of invading Iraq, we could have, as a nation, done this great thing for ourselves. Instead, our public institutions are left to work in league with a shady company, legally bound to do what is profitable over what is ethical in so far as the law does not intervene.

Is it not a much greater world in which every man, woman, and child has free access to the knowledge - the power - that disarms dictators, than the one in which we sacrifice our young men, women, and children to set up new dictators?

Slashdot Mirror

296 comments