Google To Digitize Much of Harvard's Library
FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."
But aren't there projects that are already doing this?
US businesses that currently accept chip and PIN/signature
to never leave my apartment.
will there be ads for particle accelerators, scanning tunneling microscopes and tokamaks in the margins?
Google is diversifying extravagantly. Pretty soon all of us geeks will be driving Google cars that can cross-reference the Library of Congress.
"Could you put that in a memo entitled, SHIT I ALREADY KNOW!" - Sarge
Also according to the summary, Einstein.
The safest way to approach lava is to have another person with you and he goes first.
I think this is a great start. There's incredible profit here too: universities spend millions on catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.
Sleep is for the weak!
Doesn't matter if they do Purdue's, I think we have the 11th worst library in the Big10. I already use Google for my papers, anyways.
You call it excessive, I call it ambitious.
If I download a book, when do I have to upload it again? What is the late fee if I forget?
That is funking awesome!
~stephen
http://slinky259.blogspot.com
Seeing as Google cached the entire Internet (the last page of the Internet can be seen here): http://www.google.ca/search?q=cache:dQrQDn0dHW8J:www.1112.net/lastpage.html+the+end+of+the+Internet&hl=en&client=firefox-a
Google is now looking to cache everything else in the Universe :)
because it's time to dive into the deep web. Projects like this are the key to unlocking the vast stores of important information which are currently not readily accessible online. Personally I'd like to see a Google-run free-access LexisNexis project.
Please, give me the values in standard metrics, like Libraries of Congress!
Just how much storage space will all this data consume? It seems like a massive undertaking.
Are you trying to google bomb 'kumquat'? If so, the effort so far looks rather weak.
Wow, so I guess Google doesn't know what to do with their IPO money and is just blowing it on a me-too project!
Yes but the FS is starting to go the way of the FA as far as the number of actual readers is concerned. I admit to occasionally falling victim to this unfortunate disease myself. Sometimes I only read the headline, and with some of the YRO ones that take up nearly the whole width of my 1280px wide monitor, sometimes I can't even get through all of that.
I am ambivalent about this. Will the books be stored as text to enable searching? If so, given that part of a book's character is its font and typesetting, will ALL the flavor of these books really be captured, in the same way that it would be to read them? Something seems likely to be "lost in translation" here.
Currently hooked on AMP
Everyone knows that Harvard sucks.
I know Brown has been digitizing all journals coming in for a while...
On another note, all the Ivies except Haavad participate in an interlibrary loan program. There are over 40 million bound volumes overall. Check it out here.
One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape would also be able to check whether content was lifted from printed material, as well as from the web.
So does this mean that the movies/audiotapes will be archived too? That's a crapload of storage.
Maybe they'll put the Loebs up! No more $20 a pop when you live in a really obscure town.
I know, it's been slow going so far. But, you do what you have to.
That said... kumquat!
So what do they have for the task itself, little children from foreign countries?
About two months ago, Jeff Dean (an employee of Google) gave a talk at the University of Washington about the inner workings of Google. One thing he mentioned was Google Print and how they scan books: they slice 'em up into individual pages, and then feed them through a scanner. This doesn't seem like an acceptable way to archive a library's collection. So, how are they scanning them in? Why not use this method for Google Print?
Or will they try to lock them up with an EULA, the DMCA, and some eBook system?
Ok, so this is just a bit of devil's advocate, but what happens if you just *happen* to have a writing style similar to someone else who was printed before... what if you read something, and unknowingly wrote something in a similar vein in your essay? I assume you could check it yourself, but then that would just introduce extra cost to even write the essay in the first place... or worse, the plagiarists could just "tweak" their papers, ensuring that they're "below the radar" by changing enough style to not be recognizable...
Make sure everyone's vote counts: Verified Voting
Make that the SECOND largest library system after the Library of Congress! University of California Library system is the largest.
Back around the '60s or so the University of Michigan cut a similar deal with University Microfilms.
U Microfilms set up and ran a microfilming operation in the library system, microfilming everything that wasn't under copyright (and much that was with permission of the copyright holders, such as several large newspapers and many magazines and other periodicals), along with much of the University's records. Rare books, etc.
(If I have this right) the U got microfilm prints of the documents for free and didn't have to pay for the microfilming of its records. University Microfilms made its money by selling microfilms of the various publications (forwarding royalties, where appropriate, to the copyright holders). The rare books, for instance, could now be studied on microfilm with no further stress on the original, and their content became available at many other colleges and libraries. Good deal all around.
University Microfilms was founded by a regent, who was later slammed for conflict of interest. He dropped out of the Board of Regents but the business deal continued.
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
if this is clinton's "library" that's to be "googlized" and "digitized", then that'll be an interesting "shot"... ;)
Homer: mmmmmm digitized google....
my blog
That is awesome. I just wonder how the book publishers will respond? Imagine being able to read any textbook without paying for it? How will those textbook publishers who keep raising prices and reprinting books with "new" editions make money... I'm imagining an RIAA-like attack on online books. Watch out Google!
Education should be free. Especially now that information can be distributed so cheaply and so efficiently
On a side note, I believe the government should create some standardized books. For example, calculus. It's the same equations and theorems that each school teaches and it hasn't changed much in a long, long time. Teachers can dictate which parts to emphasize. We can have a committee of well-established professors write the book in the same way that any other calc book is written. The book can undergo revision every 10 years or more. Think of the money that can be saved for students!!! It can even be available online! Of course some books can't be standardized, like history, where different viewpoints produce different versions of history.
40,000 volumes. Compared to 8 million for Stanford and 7 million for Michigan. The latter already has almost 20 million pages online.
For books in Special Collections, they won't allow copies to be digitalized unless they are (1) paid a fee to scan the book (fair enough) and (2) paid a royalty to post the book to the web.
The royalty amounts to hundreds or thousands of dollars per book (about $100/page or image). This allows the libraries to act as a "profit center" for the universities. This policy applies to all UC campuses (I've tried UCB, UCLA, UCI, UCSD).
This is true even though the book is in the public domain (because they have physical possession and nobody can make copies until you sign a license agreement). This is true even if you're using the book for non-commercial purposes (such as free posting to the web).
Something is wrong here. People donate to UC libraries (either books or money) for the public good. They don't donate so the library can start a business licensing public-domain books.
Despite that, I have been able to scan many books (by using books in open stacks or purchasing them). These books concern Yosemite history and are at http://www.yosemite.ca.us/history/
Microsoft will do the same
For older books, most archivists use a cradle and photograph the pages. It's easier on the book, requires no slicing, and there's no scanner to clog with dust.
The disadvantage is the scanner operators need a little bit more training, but that's not a big problem.
don't forget that amazon has the "search inside the book" feature that has been available for a few years now. i guess the main difference is that google is targeting a lot of academic sources, while amazon gets its database of book texts from publishers. if the two were combined... then maybe they could form ibdb.com, the Internet Books Database ;)
I bet we are going to find Goosebumps books in it. Millions of them.
December 13, 2004
Dear Colleague,
I am writing today with news of an exciting new project within the Harvard libraries. As all of us know, Harvard's is the world's preeminent university library. Its holdings of over 15 million volumes are the result of nearly four centuries of thoughtful and comprehensive collecting. While those holdings are of primary importance to Harvard students and faculty, we have, for several years, been considering ways to make the collections more useful and accessible to scholars around the world. Now we are about to begin a project that can further that global goal and, at the same time, can greatly enhance access to Harvard's vast library resources for our students and faculty.
We have agreed to a pilot project that will result in the digitization of a substantial number of volumes from the Harvard libraries. The pilot will give the University a great deal of important data on a possible future large-scale digitization program for most of the books in the Harvard collections. The pilot is a small but extremely significant first step that can ultimately provide both the Harvard community and the larger public with a revolutionary new information location tool to find materials available in libraries.
The pilot project will be done in collaboration with Google. The project will link Harvard's library collections with Google's resources and its cutting-edge technology. The pilot project, which will be announced officially tomorrow, is the result of more than a year of careful consultation at many levels of the University. We could not have achieved a meaningful pilot project without the efforts of the Harvard Corporation; the President, Provost, Chief Information Officer, and Office of General Counsel; the University Library Council; and senior managers within the College Library and the University Library.
A full description of the pilot program follows here, with further materials available on the Harvard home page tomorrow.
With best regards,
Sidney Verba
Carl H. Pforzheimer University Professor and
Director of the University Library
Project Description:
Harvard's Pilot Project with Google
Harvard University is embarking on a collaboration with Google that could harness Google's search technology to provide to both the Harvard community and the larger public a revolutionary new information location tool to find materials available in libraries. In the coming months, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, an FAQ detailing the Harvard pilot program with Google will be available at http://hul.harvard.edu.
The Harvard pilot will provide the information and experience on which the University can base a decision to launch a large-scale digitization program. Any such decision will reflect the fact that Harvard's library holdings are among the University's core assets, that the magnitude of those holdings is unique among university libraries anywhere in the world, and that the stewardship of these holdings is of paramount importance. If the pilot is deemed successful, Harvard will explore a long-term program with Google through which the vast majority of the University's library books would be digitized and included in Google's searchable database. Google will bear the direct costs of digitization in the pilot project.
By combining the skills and library collections of Harvard University with the innovative search skills and capacity of Google, a long-term program has the potential to create an important public good. According to Harvard President Lawrence H. Summers, "Harvard has the greate
This will be sweet. I just hope that we don't get too many authors getting pissed.
The library of the University of Oxford, i.e. the Bodleian Library, was the first "copyright" library in the UK - one of only three - which means that it automatically gets a copy of every book published in the UK.
Aegilops
"Google will provide online access to the full text of those works that are in the public domain." Just what percentage of the current works are public domain?
The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later.
The page images are tied with the uncorrected OCR so you can see exactly what's there.
For an example, see books at University of Michigan's Making of America (MoA) Exhibit, which has thousands of 19th century books and periodicals available.
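A minimal sketch of the point about uncorrected OCR, assuming a plain inverted index; it is not a description of how Google or MoA actually index. The page texts and the simulated typos ("Yoseniite", "va11ey") are made up for illustration. A keyword that survives OCR on a page still leads a searcher to it, while a word that is garbled on a page is simply missed there until the text is corrected.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercase word to the set of page numbers it appears on."""
    index = defaultdict(set)
    for page_no, text in pages.items():
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(page_no)
    return index

# Hypothetical uncorrected OCR output; page 2 contains two simulated typos.
ocr_pages = {
    1: "The Yosemite valley was first surveyed in the spring.",
    2: "Visitors to the Yoseniite va11ey praised the Yosemite falls.",
    3: "The railroad reached the coast that year.",
}

index = build_index(ocr_pages)
print(sorted(index["yosemite"]))  # [1, 2]  (the clean occurrence on page 2 is found)
print(sorted(index["valley"]))    # [1]     (the garbled "va11ey" on page 2 is missed)
```

Because the page images stay tied to the text, a hit on the OCR text can still show the reader exactly what was printed.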
(they admit it themselves!)
sulli
RTFJ.
Google has always emphasized its purpose: to organize the world's information to be useful and to serve us.
baooooooo
Only public-domain books will be scanned. In all or most cases the authors are dead. However, this will revive a great body of work and widen access to many.
One class of author who may be pissed is those who take older works, just slap a foreword or introduction on the front, and collect royalties. I've seen this done for many histories. But authors of today's works can count on royalties for themselves, their children, and their grandchildren (if the book is still selling). The copyright term is too long in the U.S., but that's another story . . .
As of 9 am on December 14, a FAQ detailing the Harvard pilot program...
Don't you mean an FAQ?
Seriously though, I can't help but wonder if projects such as this will help or hurt the overall literacy of the populace. It seems to me that the ability to extract excerpts quickly without having to peruse the context could lead to a less educated society. Some of the most interesting facts I have learned have been things I've accidentally run across in a book while looking for something else.
Don't get me wrong, I fully support the idea of having quick access to any information that might be needed. I am simply speculating that some other steps might need to be taken to ensure that future generations still benefit from the subtleties of knowledge that come from reading a book.
Just a thought.
-Daniel
But does it work under Lynx?
F1rst tr011!
"Girls seem to go for sensitive-type guys, so you've always got to act like you're listening to whatever it is they're yapping about, and pretend you give a rat's butt about stupid stuff like flowers and recycling. Oh yeah, be sure to wear plenty of aftershave!" Homer J. Simpson
Maybe they could do something really evil to Microsoft, and then we could say: ``Well, you digitized Harvard's library, so we'll let it pass this time.''
See what I've been reading.
It was just a matter of time before a project of this scope got off the ground. I would like to see them team up with Project Gutenberg (and perhaps archive.org) to provide images of the material. Throw in the little transcoder and perhaps wikipedia and we will soon have a killer information resource that can be cross-referenced to silly proportions. This is a boon for research. Projects like this and the public library of science will add much to collective knowledge. It would also be nice to see them team up with the newspaper project! Next stop--public domain LOC!!!
harmonious design
The professor can just wait until the match comes up, and then double-check at that point.
You'd want to do a thorough overview of any potential instance of cheating anyway. A quick run-through would determine whether or not a paper happened to contain an identical sentence clause or three identical paragraphs.
I think the bigger problem would be the second one you described -- that students could plagiarize and then go through each paragraph, changing the wording slightly so as to avoid positive matches. Still, you could argue that this is pretty much what academics is anyway, just with footnotes and a bibliography.
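To make both cases concrete, here is a rough sketch, assuming a detector that compares overlapping five-word sequences ("shingles"). This is one common technique and not necessarily what Copyscape or any real service does; the function names and sample sentences are invented for the example.

```python
import re

def shingles(text, n=5):
    """Return the set of n-word sequences (shingles) in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(essay, source, n=5):
    """Fraction of the essay's shingles that also occur in the source."""
    essay_sh = shingles(essay, n)
    if not essay_sh:
        return 0.0
    return len(essay_sh & shingles(source, n)) / len(essay_sh)

source = "The rare books could now be studied on microfilm with no further stress on the original."
copied = "The rare books could now be studied on microfilm with no further stress on the original."
tweaked = "Rare volumes can now be examined on film without any added strain on the originals."

print(overlap(copied, source))   # 1.0
print(overlap(tweaked, source))  # 0.0
```

A verbatim lift scores 1.0, while the reworded version scores 0.0 even though the idea is the same, which is exactly the gap described above. A match is only a flag for the professor to double-check, not a verdict.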
--------
Bleah! Heh heh heh... BLEAH BLEAH!!! Ha ha ha ha...
Because in Soviet Russia, the Library (and Lord Dredd from Captain Power) digitizes YOU!
Also because in Soviet Russia, real life tolerates YOU!
I, for one, welcome our PDF-making adsense-offering overlords.
My Favourite Meme
Screenshot of the service from John Battelle's Searchblog.
python -c "x='python -c %sx=%s; print x%%(chr(34),repr(x),chr(34))%s'; print x%(chr(34),repr(x),chr(34))"
I've been emailing them asking them to do this for years. I'm glad someone is finally doing it! There is only one problem: how do they get past copyright violations? I tried to get Cornell to do this on campus, but they said a lot of their volumes (periodicals, in particular) were still under copyright and hence cannot be scanned. No, it doesn't make any sense to let these carbon books literally fall apart when we can preserve them forever digitally, but that's the name of the game.
Someone hurry up with nanostorage so I can store the entire content of human knowledge on a postage stamp (with nanosecond seek time and gigabyte transfer speeds, of course)
Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups".
If aspiration is a virtue, achievement cannot be a vice.
You must be new here.
Better story at the New York Times. There's also http://print.google.com and the odd http://www.google.com/print/
i'm trying to give up sigs.
for Miskatonic University's library to get the same treatment, mwuhahahaha.....
It looks like the largest portion of this will be 7 million items from the University of Michigan (compared to only 40,000 from Harvard). Good article from the Detroit Free Press.
Soon we are going to start seeing people saying "I didn't RTFS, but...". I think this shows us the direction we are all headed with /.
Stanford only has 6,865,158 books, and the University of Michigan only has 6,973,162. What about schools like Berkeley and Yale?
"Give me a lever long enough and a fulcrum on which to place it, and I shall move the world." -Archimedes
not a cynical criticism but a certified curiosity ... without hyperlinks between pages and other metadata that comes with the web domain, how will Google add value to finding materials above and beyond what a fancy multi-indexed grep could provide?
put another way, aside from "full text search" and "online page image retrieval", what other operations could be put into place to make this a valuable service?
SIGUSR1
I didn't RTFT, but...
I think Google will start a project codenamed GSkyNet or something ...
...to Google "becoming" the internet....
I've noticed that everyone who is for abortion has already been born - Ronald Reagan
Only 15 million volumes?! That's much lower than I expected for a University of such stature. University of Waterloo has about 10 million volumes, University of British Columbia has about 10 million volumes evenly split between books and microfiche, University of Toronto has about 14 million holdings, etc.
The British Library (www.bl.uk) has 150 million items (but fewer bookshelves) so the claim of "largest" is a bit dubious.
Yeah, but it's the same story:
"How We Lost the Americas, India, the Middle East, Ireland, but kept the Falklands"
That's okay, the LOC pays homage to the Brits by having a copy or two of Shakespeare, I hear.
No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.
No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at kinkos, expect them to pay 35 cents a page or more for the copyrights alone.
And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology going into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.
I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back real progress toward something really useful to scholars in a systematic way. At least until those perspectives change significantly.
Well, it seems Google felt comfortable dealing with people of equally exclusive nature. When they start indexing colleges that don't require your soul (Yale+Harvard), $20k/year (all of them), or a bribe to the right person to get in (Stanford+MIT), Google will be doing no evil. But since their background precedes them, you might as well count on them doing evil, even if it looks like misplaced philanthropy.
Twitter supports and protects racists - by smearing their critics with the "Hate Speech" label.
Wow, I'm really impressed. Together with Google Scholar, this will lift academic research considerably.
Now if only they could contact German libraries like the "Bayerische Staatsbibliothek" http://www.bsb-muenchen.de/, too....
However not if you take into account libraries such as the British Library which has 150 million items - this is bigger than Congress and Harvard combined.
:-)
Granted, some of these are just stamps
In Korea, only old people repeat tired, canned jokes. :-)
i am guessing that something like this will require a lot of money, perhaps even after it's finished to fix this or that... (someone pays someone for webspace a lot of the time... and labor...)
where is the money coming from?
also seems kind of like napster but with books....
"if only i had known i would have been a locksmith." -albert einstein
As an Ohio State alumnus, I have to say - I didn't even know they had books in michigan! Didn't think anyone there could read.
I mod down all the "free iPod"-sig losers.
Why is requesting help such a bad request? I think it is perfectly reasonable.
I assume the poster is talking about an actual e-mail message sent to all Harvard students, not a mere press release.
Was this e-mail message sent by Google, or by Harvard themselves? Either way, does Harvard permit their students to opt out from being spammed with the details of every agreement they make with third parties?
I work in university IT support. Don't you just love it when your university makes a deal with some company to distribute their software to staff and students, after which said company sees fit to spam all your students telling them to contact you immediately in order to have you install that software for them, in a manner contrary to the procedures already established by IT support?
This is the most interesting issue about the article. Will the digital versions of the libraries' books that are in the public domain and are scanned by Google be allowed to be freely copied, or will they be released only under a typical restrictive copyright licence (as they would be legally entitled to do since a digital copy of an old work which is in the public domain is considered in law to be a new work subject to copyright)?
That's not how copyright law works. A digital copy of an old work which is in the public domain is considered in law to be a new work that can be fully protected by copyright, even though the copyright on the original work has expired.
I for one have no fear that our President will ever be personally concerned about a library.
Unless of course we put some oil and commie^H^H^H^H^H^Hterrorists in one for him to go after.
Shawn's Tech Articles
If you google the words "universal library" you'll find this link http://www.ul.cs.cmu.edu/html/ at Carnegie Mellon. Why is Google doing something different?
"If all the American people want is security, let them live in prisons." Eisenhower
Every time we capitulate to money and power and grant new extensions to existing IP laws, this is exactly what we lose - we lose that material that belongs to everyone as a whole and to which we all have a right held in common.
I love moves forward like this. Perhaps if people understood what it meant to access knowledge and information at whim they wouldn't be so keen to keep extending privately held rights any further than is reasonable.
I live for the day when people count down the days until something enters the public domain. There are so many great works of art and knowledge that could gain new life from such enthusiasm.
If you're willing to pay, this is exactly what Web of Science does. It contains just about every article from every journal for the last hundred years.
WoS uses citation indexing, as ISI has done for many years, since well before Google came into existence. You can find newer articles by finding those which have cited the old article you're looking at.
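As an illustration of the idea (not of how WoS or ISI implement it), a citation index can be sketched as an inverted mapping: each paper's reference list is turned around so that you can look up which later papers cite a given one. The paper IDs and the `build_citation_index` helper are hypothetical.

```python
from collections import defaultdict

def build_citation_index(references):
    """Invert 'paper -> papers it cites' into 'paper -> papers that cite it'."""
    cited_by = defaultdict(set)
    for paper, cited in references.items():
        for target in cited:
            cited_by[target].add(paper)
    return cited_by

# Hypothetical reference lists; the paper IDs are made up for illustration.
references = {
    "smith1970": [],
    "jones1985": ["smith1970"],
    "lee1999": ["smith1970", "jones1985"],
    "park2004": ["lee1999"],
}

cited_by = build_citation_index(references)
print(sorted(cited_by["smith1970"]))  # ['jones1985', 'lee1999']
```

Starting from an old article, repeated lookups in `cited_by` walk forward in time through the literature.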
-----
Sorry, I'm only a 1336 h4x0r.
Citeseer (www.citeseer.com) is fairly similar, providing keyword search on scientific articles. It also caches copies of papers for easy (!) access. Its disadvantages are: (i) you have to submit references manually by providing citeseer with URLs, (ii) it sometimes generates garbage titles of papers (don't know why) and (iii) after you've submitted a URL, it takes forever for citeseer to index them.
I can't find anything about it online, but while at Champaign we were told that the University of Illinois' library system was the second largest in the country, behind Harvard. Oh yeah, and our basketball team is #1 :-)
And sorry to all you Michigan folk out there, it's just been a while since I've been able to say that!!
Really, who cares if they index the Library of Mich? Who is going to use it? Why don't they index the entire collection of Playboy? Now that would be a great use of technology - nothing I hate more than trying to find all the photographs in a set...
The only PT Boat Journal on the web: http://www.PT171.org
For what it is worth, there was an article in the Painted Lady about it today.
In a related story, the IRS has recently ruled that the cost of Windows upgrades can NOT be deducted as a gambling loss.
I just watched one of the most interesting pieces of media I've seen in a while. It's basically a look back at the growth of media and the growth of google from 2014. If you've got 8 minutes to spare on something that might be fascinating, check it out.
Epic
"Share your knowledge. It's a way to achieve immortality." -- Dalai Lama
I didn't RYFP (Read Your F'n Post) but...
Will these books be available in DJVU format or just in plain text?
DJVU is a compression format that stores the scanned image along with text and links the text to the image so that a DJVU reader can be used to search for text and highlight it on the image with a rectangle. This is far better than PDF for scanned books, especially older books with beautiful typography and engravings.
I especially like using the New Century dictionary on a laptop with 1 gig of RAM where I can hold it sideways like a book. Look at this site to see what I mean: http://www.leoyan.com/century-dictionary.com/index.html
DJVU rocks IMHO.
Google is also doing Michigan's library (the University of Michigan, that is). Seven million volumes.
An announcement is forthcoming today.
Detroit Free Press article: http://www.freep.com/money/tech/mwend14e_20041214.htm
The thing I remember most about Harvard's main library, Widener, is floors upon floors, walls upon walls of bookshelves. Some areas smelled like no one had been there in a century. Lots of stories of ghosts, perpetual students living there into old age, and lusty encounters between patrons. This atmosphere was captured in the movies Love Story and The Paper Chase. Other old, large libraries have these stacks too. Where will the romance be when all this is turned into "bits"?
I see they've recently added the complete run of the Journal of the U.S. Association of Charcoal Iron Workers. If I'd known that, I could've saved a bundle on gift subscriptions...
For correcting OCR, the text could be put in a wiki next to the page images, and whoever needs to read it first can correct it....
i kan t reed. :(
Archive.org is attempting to build a free online repository of movies, audio, and books. It's a pretty impressive collection considering all of the intellectual property barriers that they've gotten past.
While studing in boston. I thought i would go over to the harvard libary and have a look around.
Yet unfortunaly i was not able to take out any books. Only harvard students. All the other universities in boston have a partnership allowing sudents from the differt universities to use their books but not harvard??
why i am not sure but i guess we will have to wait and see.
ps i know i cant spell.
I also have an inferiorty complex for going to Tufts with all the other harvard rejects
cheers
I realize this is a little known fact, but the British Library has many more items than the Library of Congress.
Dananderson said:
"The uncorrected OCR is very useful for indexing (by Google or others), as the 5% or fewer typos are not enough to interfere with indexing keywords. Uncorrected OCR can also be corrected later."
I don't know if I can agree with you here. Oxford is talking about its entire collection prior to 1901. Harvard is talking about very old books as well. Older books with odd typefaces and printing have, in my experience and that of friends, a much higher typo rate than merely 5%.
(My experience = closing in on 7000 pages on DP, and I concentrate on the old stuff: odd print and the long f's of Englifh.)
Just my 2 cents. The more typos and un-proofed text there are, the harder it will be to index.
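To put rough numbers on the disagreement, here is a back-of-envelope sketch in Python. It is not a measurement of anything. It assumes the quoted rate is per character and that errors are independent of each other, neither of which anyone in this thread actually stated; the function name and the sample rates are made up for illustration.

```python
def word_survival(char_error_rate: float, word_length: int) -> float:
    """Chance that every character of a word is read correctly,
    assuming independent per-character OCR errors (an assumption)."""
    return (1.0 - char_error_rate) ** word_length


# Illustrative rates only: a clean modern page, the quoted 5%, and a
# guess at a rough old typeface.
for rate in (0.01, 0.05, 0.15):
    intact = word_survival(rate, 8)
    print(f"{rate:.0%} character errors: an 8-letter keyword survives "
          f"{intact:.0%} of the time")
```

Under those assumptions even 5% per character costs about a third of the 8-letter keywords, so whether "5% or fewer" is harmless depends a lot on whether it means characters or words.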
My cat's breath smells like cat food.
(In other words, did you have some reason for saying this to me?)
I wonder if they will include the contents of the "x cage", a locked set of stacks supposedly containing "pornography". Having been in this thing (I worked as a book shelver as a freshman), it actually isn't so much porn but "sensitive material". A whole bunch of it is stuff published by the Third Reich, for example.
Humor? Another oft-mentioned phrase in the summary.
According to a BBC article today, so far they'll be digitizing the full libraries of Michigan and Stanford universities, as well as archives at Harvard, Oxford and the New York Public Library.
Some of the places are limiting their participation to just certain collections though.
Most people would die sooner than think; in fact, they do.
There's apparently some confusion.
Here you go
FYI, FOIA isn't free, though the fees are pretty nominal. $0.10/page, $18/hr, after the first 100 pages, with a significant educational discount.
The thought of having a spook do my photocopying for me just sounds.... Hrm. Ironic?
What part of "gestalt" don't you understand?
Ah, so you were my Secret Santa!
Ooh, a sarcasm detector. Oh, that's a real useful invention.
Google should use DjVu to distribute the documents, just like the Million Books Project at the Internet Archive.
I've had a lot of trouble getting my books scanned by Google and amazon.com. I believe it is because the font is small (5-6pt). However, the type is clearly printed, and if it is scanned at any decent dpi it should be fine to run through OCR.
If Google is using a quick scanning/OCR method, then some small text, especially footnotes, in some of the Harvard books will be lost.
Also, if they want to index and search the books, they will have to be stored as text. Since Google is doing the scanning, I'd imagine that's exactly what they would do. They may keep an image of the page and highlight the areas where search terms appear.
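A minimal sketch of that text-plus-image idea, in Python. This is a guess at the general approach, not how Google actually does it: it assumes the OCR step emits each word with a pixel bounding box on the page scan. `OcrWord` and `find_highlights` are hypothetical names, and the coordinates are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # left, top, right, bottom, in pixels


@dataclass(frozen=True)
class OcrWord:
    text: str  # the word as recognized by OCR
    box: Box   # where that word sits on the page image


def find_highlights(words: List[OcrWord], term: str) -> List[Box]:
    """Return the boxes of every word matching the search term,
    ignoring case and surrounding punctuation."""
    needle = term.casefold()
    return [
        w.box
        for w in words
        if w.text.strip(".,;:!?\"'()").casefold() == needle
    ]


# Invented sample page: four words on one line of a scan.
page = [
    OcrWord("The", (10, 10, 40, 25)),
    OcrWord("library,", (45, 10, 110, 25)),
    OcrWord("the", (115, 10, 140, 25)),
    OcrWord("Library.", (145, 10, 210, 25)),
]
print(find_highlights(page, "library"))
```

The search runs against the text, and the viewer draws the returned rectangles over the page image, so the reader still sees the original print.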
http://github.com/gbook/nidb
While this may be true for some kinds of research it is definitely not for others. In biology one typically needs to have a correct identification before the biology of a given species can be investigated. Identification is closely tied to the names given by various authors, but is not identical with it.
If you do taxonomy, uncorrected OCR would be of little value, as a single letter difference is sufficient to treat the scientific names of organisms as different. One needs to know the precise spelling used as well as the context in which it is used to understand the concept of species the author had in mind (or sometimes the identification of the organism in question, depending on the complexity of the taxonomy).
Presently, if you look for all that is known about a species on google, you will often get a lot of records that relate to that species. However, you may also get records that either relate to another species whose name is a homonym or one that has been misidentified (incorrect name applied). You will not get all the records of the species for which an incorrect spelling of the name has been applied or which was treated under another name, which while technically a correct identification is a junior/senior synonym of the species you searched for.
Uncritical use of taxonomic names resulting from uncorrected OCR in a research context has the potential to create considerable taxonomic confusion. In a world that is rapidly losing biodiversity, and where saving what remains rests largely on the correct identification of the organisms being studied, this is not an insignificant matter. IMHO it actually represents one of the greatest challenges facing mankind, particularly if you consider that we are often at the top of the food chain and we really know very little about the myriads of organisms that make up the ecology upon which we depend for survival but which most of us simply take for granted. Such ignorance will not serve us well in a future (or present) in which our environment has been (is being) severely degraded.
Hence, the recommendation that the uncorrected OCR be tied to the original image file from which it was produced is critical.
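As a rough illustration of why the link back to the image matters, here is a small Python sketch. Everything in it is an assumption for the sake of the example: the two-name "accepted" list, the `OcrRecord` layout, the file name, and the 0.85 similarity cutoff are all made up. It uses `difflib.get_close_matches` from the standard library to flag a near miss instead of silently correcting it.

```python
import difflib
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OcrRecord:
    text: str        # the name exactly as OCR read it, uncorrected
    image_file: str  # hypothetical path to the scan it came from


# Tiny stand-in for an authoritative name list.
ACCEPTED = ["Apis mellifera", "Bombus terrestris"]


def check_name(record: OcrRecord) -> Tuple[str, Optional[str], str]:
    """Classify an OCR'd name and always hand back the image to check."""
    if record.text in ACCEPTED:
        return ("exact", record.text, record.image_file)
    close = difflib.get_close_matches(record.text, ACCEPTED, n=1, cutoff=0.85)
    if close:
        # One letter off could be an OCR error or a genuinely different
        # name, so this only flags it for a human to compare with the scan.
        return ("verify against image", close[0], record.image_file)
    return ("unknown", None, record.image_file)


print(check_name(OcrRecord("Apis mellifcra", "page_0412.png")))
```

An exact-match index would simply miss "Apis mellifcra", while an automatic fix could turn a real but similar name into the wrong species. Keeping the image pointer on every record leaves the final call to someone who can read the original page.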
Initially I was incredibly enthusiastic when I heard about this. But then I found the following in the FAQ at the "Google Print" page: 6. What can I do with a book that I find using Google Print? Well, you can browse a few pages, learn more about the topics explored by the book, buy it, find it at a library, or commit a selection to memory. Browser printing and image copying functions are disabled on Google Print content pages. If images can't be copied for public domain stuff, and they can't be downloaded from the site to be used for any purpose, then this is a lot less free than it sounds.
As for fulltext articles, try JSTOR if you want to see how to do it wrong. Page by page in gif format, and some huge pdfs with all pictures and no ability to process text. Useless!! Yes you can print it out but then I'd just as soon get the hardcopy in the first place.
I'd certainly rather have searchable text. But a lot of this stuff is not always easy to find even if you are at a university. Due to limited space, many university libraries have moved older journals to offsite storage, and there are significant delays in accessing these materials. Simply being able to rapidly access full text and view/print it is a big improvement.
The big problem is copyright. Every copyright since 1968 is essentially perpetual - so No Recent Works are available. The perpetuality is due to the "Mickey Mouse Effect" - that Disney Corp has managed to get Congress to periodically extend the date that a copyright expires - to preserve the value of their Mickey Mouse franchise.
This worries me. It should worry you too.
Our ability to effectively use the internet as a general knowledge/interest database - as we have become accustomed to doing - depends upon the private sector.
Now we are seeing a monumental effort to bring even more of our history and our culture - digitally - chiefly under the control of the private sector.
I see a future in which physical libraries are only a second thought, and Google is looked to as the chief arbiter of information - trivial and scholarly.
Are the ramifications of this clear?
For a fraction of a fraction of the cost of invading Iraq, we could have, as a nation, done this great thing for ourselves. Instead, our public institutions are left to work in league with a shady company, legally bound to do what is profitable over what is ethical in so far as the law does not intervene.
Is it not a much greater world in which every man, woman, and child has free access to the knowledge - the power - that disarms dictators, than the one in which we sacrifice our young men, women, and children to set up new dictators?