Google To Digitize Much of Harvard's Library
FJCsar writes "According to an e-mail sent today to Harvard students, Google will collaborate with Harvard's libraries on a pilot project to digitize a substantial number of the 15 million volumes held in the University's extensive library system, which is second only to the Library of Congress in the number of volumes it contains. Google will provide online access to the full text of those works that are in the public domain. In related agreements, Google will launch similar projects with Oxford, Stanford, the University of Michigan, and the New York Public Library. As of 9 am on December 14, a FAQ detailing the Harvard pilot program with Google will be available at hul.harvard.edu."
I think this is a great start. There's incredible profit here too: universities spend millions on catalogue systems. If I could use one interface to search for books, chapters, and articles on a subject, I could spend more time actually learning, and less time looking at the same damn "no results" page on GeoWeb. Grrrr.
Sleep is for the weak!
One reason why this is in the interest of big old universities like Harvard is that it will make it much easier to detect plagiarism in students' essays. If published books were included in Google's index, a plagiarism detection service like Copyscape would also be able to check whether content was lifted from printed material, as well as from the web.
I would hope they handle it just like catalog.google.com.
About two months ago, Jeff Dean (an employee of Google) gave a talk at the University of Washington about the inner workings of Google. One thing he mentioned was Google Print and how they scan books: they slice 'em up into individual pages, and then feed them through a scanner. This doesn't seem like an acceptable way to archive a library's collection. So, how are they scanning them in? Why not use this method for Google Print?
Well, there are such things as references.
Using work of other people in academic work is not only possible, but greatly encouraged. Just make sure that it is very clear what comes from whom.
In many ways, science is done exactly as Open Source software. Take what you need, modify and improve it where appropriate, and make sure you give full credit where due.
As a teacher, I have given full points to papers that had hardly any text of their own, as long as the sources were properly referenced and used together to make a valid point not made by any one of them.
So I do not think students should bother staying below the radar. Just reference everything, and voila, you are doing science.
Complexity is a measure of our ignorance...
I've been emailing them asking them to do this for years. I'm glad someone is finally doing it! There is only one problem: how do they get past copyright violations? I tried to get Cornell to do this on campus, but they said a lot of their volumes (periodicals, in particular) were still under copyright and hence cannot be scanned. No, it doesn't make any sense to let these carbon books literally fall apart when we can preserve them forever digitally, but that's the name of the game.
Someone hurry up with nanostorage so I can store the entire content of human knowledge on a postage stamp (with nanosecond seek time and gigabyte transfer speeds, of course)
Call me mundane, but I want Google to index mailing lists, with a nice interface like their "Groups".
If aspiration is a virtue, achievement cannot be a vice.
But also: PG books are full of errors, and there is no source info or scans available to check them against in any easy way. Many books, such as Wealth of Nations, went through a number of editions during the author's lifetime. It would be nice to have the various early editions for collation. And oftentimes new editions come out long after the death of the author with bullshit editorial changes in order to claim a new copyright. A library like Harvard will have many of the earliest editions of classic works.
Got a link for that policy?
Ever tried a Freedom of Information Act (FOIA) request? Strange as it may seem, that apparently works in the State of Washington.
The size of the U-M undertaking is staggering. It involves the use of new technology developed by Google that greatly speeds the digitizing process. Without that technology -- which Google won't discuss in detail -- the task would be impossible, says John Wilkin, the U-M associate librarian who is heading the project.
"Going as fast as we can with the traditional means of doing this, it would take us about 1,600 years to do all 7 million volumes," he said. "Google will do it in six years."
Under the agreement, the library will get a digital copy of every book scanned. With those copies, the library can prepare special research projects, virtual exhibitions and more relevant scholarly and academic material for its students and faculty.
"If we were to do this job ourselves, it would probably cost us $600 million," Wilkin said. "That's just the human cost of preparing the material for scanning, packing it up and sending it out to vendors and then quality-control checking of the results. This is easily a billion-dollar effort."
Items will start appearing in 2005 with completion predicted for 2010. Can you imagine how many libraries there are out there? The information that could be gathered seems endless. I'm guessing they'll come up with a good way to detect duplicates in future libraries, but as anyone who has wandered through a University library knows, there are a LOT of shady books that seem like they haven't been widely published, and there are a LOT of things that were self-published by academics in the University itself (theses, postdoc research, etc).
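For a sense of scale, the figures Wilkin quotes above (7 million volumes, about 1,600 years by traditional means, six years with Google) can be turned into rough scanning rates. This is only a back-of-the-envelope sketch: the 365-day working year and the function name are my own assumptions, not anything Google or U-M has published.

```python
# Back-of-the-envelope throughput implied by the quoted U-M figures.
# Assumption (not from the article): scanning runs 365 days a year.

VOLUMES = 7_000_000        # volumes quoted for the U-M collection
TRADITIONAL_YEARS = 1_600  # quoted estimate using traditional methods
GOOGLE_YEARS = 6           # quoted estimate for the Google project


def volumes_per_day(volumes, years, days_per_year=365):
    """Average number of volumes that must be scanned per day."""
    return volumes / (years * days_per_year)


traditional_rate = volumes_per_day(VOLUMES, TRADITIONAL_YEARS)
google_rate = volumes_per_day(VOLUMES, GOOGLE_YEARS)
speedup = TRADITIONAL_YEARS / GOOGLE_YEARS

print(f"Traditional: {traditional_rate:.1f} volumes/day")
print(f"Google:      {google_rate:.1f} volumes/day")
print(f"Speedup:     {speedup:.0f}x")
```

Under those assumptions the traditional approach works out to roughly a dozen volumes a day, against something over three thousand a day for the six-year schedule.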
No; the reason there are so few copies is there are so few people who want to read specialized journals. And the small audience only accounts for a small part of what many academic journals charge.
No; the problem is not overhead costs or small audiences. The problem is that the owners of much of that kind of content are greedy bastards. There is no reason for the outrageous price of some journals. Some scientific journal subscriptions are in the tens of thousands; even many liberal arts journals are far from cheap. And if you want to copy an article for your students to buy at Kinko's, expect them to pay 35 cents a page or more for the copyrights alone.
And many of them are worse than the RIAA in terms of access to content electronically. Journal articles are included in databases sold to some universities. You can read articles in some databases but only by loading a .gif of every page one at a time. No copy and paste, no text access at all. So much technology going into preventing the thing from being copied that the online version is actually less useful than the dead tree version rotting on the shelf.
I think this is a great move by Google and Harvard, and I like the idea behind Google Scholar, but I expect this kind of work to be resisted by many journals and professional organizations, to the extent that they have a say in it. This will be a huge boon in terms of the availability of public domain resources, but unfortunately outdated perspectives on intellectual property are likely to hold back anything systematically useful to scholars, at least until those perspectives change significantly.
Yep, fell foul of this one the other day. The National Library of Wales happens to be situated in Aberystwyth, on the same hill as the University. (Which, by the way, is a bitch to climb in the mornings... do not apply for sea-front residences unless you are sure of your fitness!) Aaaaanyway, as the librarian there tactfully explained to me: one hell of a lot of books are published every year, and there's only so much space in the place... and they like to have a Welsh Language copy too!
Tomorrow, I may eat another house plant
Worth noting that this project is putting a LOT of people out of work. Literally, they are laying off almost their entire library staff (I know a few..). Wonder if that'll be in the FAQ?