Why Project Gutenberg Isn't There Yet
option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg, and similar efforts, are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly)."
That's crazy. OCR will always be faster than speech, even if speech recognition ever works, which it currently does not.
I'm not too informed about this topic; feel free to correct me.
If the goal is a universal library, and there is a need for a work force, wouldn't a program initiated at the library level to utilize librarians as a volunteer work force make sense, perhaps as a side project they might be interested in helping along? I think of it as SETI in the library world.. *shrug*
Umm, Project Gutenberg can only legally use public domain works. If you know of any 100+ year old novels typeset in TeX, let's hear about it. Even if a modern reprint was done recently, do you think the publisher would really want to give away all that hard work so that everyone can get it for free instead of buying their spiffy new edition?
Delivering militantly anti-commercial music to all two people who care!
That's not how Project Gutenberg works. Most everything that's on PG is public domain - that means the copyright has expired. Thus, most of the stuff is over 70 years old. They didn't exactly use LaTeX back in the 1930s.
Besides, what I generally use PG for are the classics - Greek/Roman literature, etc... I don't think Plato used UNIX.
It's all got to be somehow entered from dead-tree-format copy. Currently, that pretty much means typing up the entire book.
--
http://nemilar.net - Not your grandmother's soup kitchen
That's pretty much it - most of the books are in the public domain. AFAIK, the rest are all donated by their authors.
From their FAQ:
What books will I find in Project Gutenberg?
We cannot publish any texts still in copyright. This generally means that our texts are taken from books published pre-1923. (It's more complicated than that, as our Copyright Page explains, but 1923 is a good first rule-of-thumb for the U.S.A.)
So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of this century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.
These books are chosen by our volunteers. Simply, a volunteer decides that a certain book should be in the archives, obtains the book and does the work necessary to turn it into an e-text. If you're interested in volunteering, click here.
--
http://nemilar.net - Not your grandmother's soup kitchen
It would make life so much easier if I could search an online library rather than searching the library index. Just think how much space we could save as well, rather than shelves full of books that are basically dead weight 95% of the time.
I think copyright's got to be the biggest hurdle. Publishing houses aren't easily going to be persuaded to put, oh, say, the next Harry Potter book online for free and risk losing millions.
Distributed Proofreaders. Recently discussed on /. as well.
The mechanics of a universal library are simple. The tricky part: hairdressing the free labor.
Karma: Barber
#3 pencils and quadrille pads.
So the point of this post is: why not ask publishers for the material? If it's already public domain, it's not like they'll lose profits, and maybe Project Gutenberg could let them put a little
kind of thing at the top of each book they donate. Plus, maybe it's a tax write-off. I don't know. That said, I'd think it'd be much easier to just type things in than OCR it or use Speech-To-Text.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
Keep in mind the following copyright rules:
1. Works first published before January 1, 1923 with proper copyright notice entered the public domain no later than 75 years from the date copyright was first secured. Hence, all works whose copyrights were secured before 1923 are now in the public domain.
(This is the rule Project Gutenberg uses most often)
Works published from 1923-1977 retain copyright for 95 years. No such works will enter the public domain until 2019.
2. Works first created on or after January 1, 1978 enter the public domain 70 years after the death of the author if the author is a natural person.
(Nothing will enter the public domain under this rule until at least January 1, 2049.)
3. Works first created on or after January 1, 1978 which are created by a corporate author enter the public domain 95 years after publication or 120 years after creation whichever occurs first.
(Nothing will enter the public domain under this rule until at least January 1, 2074.)
4. Works created before January 1, 1978 but not published before that date are copyrighted under rules 2 and 3 above, except that in no case will the copyright on a work not published prior to January 1, 1978 expire before December 31, 2002. If the work is published before December 31, 2002, its copyright will not expire before December 31, 2047.
(This rule copyrights a lot of manuscripts that we would otherwise think of as public domain because of their age.)
5. If a substantial number of copies were printed and distributed in the U.S. prior to March 1, 1989 without a copyright notice, and the work is of entirely American authorship, or was first published in the United States, the work is in the public domain in the U.S.
6. (This rule is complicated, and is seldom applied). Works published before 1964 needed to have their copyrights renewed in their 28th year, or they'd enter into the public domain. Some books originally published outside of the US by non-Americans are exempt from this requirement, under GATT. Works from before 1964 were automatically renewed if ALL of these apply:
At least one author was a citizen or resident of a foreign country (outside the US) that's a party to the applicable copyright agreements. (Almost all countries are parties to these agreements.)
The work was still under copyright in at least one author's "home country" at the time the GATT copyright agreement went into effect for that country (January 1, 1996 for most countries).
The work was first published abroad, and not published in the United States until at least 30 days after its first publication abroad.
This means that we can't simply take electronic versions of modern texts and put them in the archive, because only out-of-copyright books are in there.
In any case, the real obstacle to a useful electronic library isn't labor. It's copyright.
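For anyone who wants to play with the dates, here is a rough Python sketch of rules 1-3 and the 1923-1977 rule from the list above. The function name and parameters are made up for illustration, rules 4-6 are not modeled, and it assumes a term runs through the end of its final calendar year, so the work becomes free on the following January 1. It also uses the publication year as a stand-in for the creation year when deciding whether the post-1977 rules apply.

```python
from typing import Optional


def first_public_domain_year(published: int,
                             author_death: Optional[int] = None,
                             corporate: bool = False,
                             created: Optional[int] = None) -> int:
    """Return the first calendar year a work is in the U.S. public domain.

    Hypothetical helper covering only rules 1-3 and the 1923-1977 rule
    as listed in the comment above. Assumes each term runs through
    December 31 of its last year.
    """
    if published < 1923:
        # Rule 1: at most 75 years from the date copyright was secured.
        return published + 75 + 1
    if published <= 1977:
        # Works published 1923-1977 retain copyright for 95 years.
        return published + 95 + 1
    if corporate:
        # Rule 3: 95 years after publication or 120 years after
        # creation, whichever ends first.
        if created is None:
            created = published  # assumption: created in the year of publication
        return min(published + 95, created + 120) + 1
    if author_death is None:
        raise ValueError("author_death is required for post-1977 personal works")
    # Rule 2: life of the author plus 70 years.
    return author_death + 70 + 1


print(first_public_domain_year(1923))                     # 1923 + 95 + 1
print(first_public_domain_year(1978, author_death=1978))  # 1978 + 70 + 1
print(first_public_domain_year(1978, corporate=True))     # 1978 + 95 + 1
```

The three calls reproduce the "nothing until 2019 / 2049 / 2074" dates quoted in the list, which is a reasonable sanity check on the arithmetic but says nothing about the special cases in rules 4-6.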
Apparently the author of the article missed Distributed Proofreaders. They seem to have survived their Slashdotting and actually retained a good fraction of their new users. This month they've proofed 116,827 pages! (Cut that in half for unique pages, I think) They have completed in their 2(?) years of existence 918 books, and have another 317 being assembled. It really seems like they are only limited by what they can get their hands on in the public domain.
All digital versions of books that publishers have should be requested and maintained in a safe place till their respective copyrights expire so that they can be easily integrated into the public domain.... especially if OCR or speech recognition doesn't get any better any time soon.
---- The geek shall inherit the Earth.
While this comment has been addressed, I'd like to point out that you can get pretty decent output from the Gutenberg texts by importing them into LyX. With just a little bit of work (basically setting up the chapters), LyX will allow you to create good looking PDF, Postscript, HTML, etc, along with the LaTeX source. Combine this with rbmake and you can even read them, complete with hyperlinks, on your eBook (if you have one!)
The best and cheapest way to get existing books on the web is to scan them and compress the images. Compression technology for text images is so good (see DjVu), and storage so cheap nowadays that you are better off just distributing high resolution scans.
This is much more efficient than having volunteers painstakingly transcribe the text or correct OCR mistakes.
OCR can be used for indexing scanned documents, but there is no need to do manual correction. DjVu can compress 300dpi black and white pages of text to 5-25KB. That's less than most HTML pages, and the images look just like the original book.
The Million Book Project at the Internet Archive uses DjVu (as well as other formats).
The open source implementation of DjVu is available on SourceForge.
The article didn't say that OCR was faster than speech, it said that speech was faster than transcribing it.
Come on mods, read more carefully.
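To put the 5-25KB per page figure in perspective, here is a back-of-the-envelope sketch in Python. The function name is made up, and the 300-page book is just an assumed example, not a number from the DjVu project.

```python
def djvu_book_size_mb(pages, kb_per_page_low=5, kb_per_page_high=25):
    """Estimate the size range of a scanned book in megabytes.

    Uses the 5-25 KB per 300dpi black-and-white page figure quoted
    above and treats 1 MB as 1000 KB.
    """
    low_mb = pages * kb_per_page_low / 1000
    high_mb = pages * kb_per_page_high / 1000
    return low_mb, high_mb


# Assumed example: a 300-page, text-only book.
low, high = djvu_book_size_mb(300)
print(f"300 pages: between {low} MB and {high} MB")
```

Under those assumptions a whole book lands in the low single-digit megabytes, so the scans themselves are not much of a storage burden.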
poliglut.org: they're still alive and fighting the man
It is part of the philosophy of Project Gutenberg to publish all of their works in the lowest-level standard format, thus ensuring continued cross-platform, program-independent readability, ad infinitum.
That means *plain* ASCII. Plain ASCII means you could read it in edlin if you really had to.
This is a Good Thing.
This also means that if you wish to format any Project Gutenberg text, in HTML or TeX for publication, you start with a blank slate and can immediately start to work your own will upon the raw text.
This is also a Good Thing.
KFG
---- The devil is in my pants! Look, look!
It seems paradoxical, but there it is. I spend a huge amount of time glued to the screen, reading articles, blogs, forums, FAQs, HOW-TOs, etc. But I don't like it, in fact I find it aggravating.
I am lured and lulled by the vast amount of easy information suitably tailored to my interests, all with an easy to use, intuitive, associational (read: hypertextual) interface. But it is tiring, staring at a flickering, glaring screen for hours. My eyes get dry, and I strain and get tired picking out fuzzy objects when I try to focus at a distance. It's nasty and annoying.
Here is my point about this project. Nobody wants to read books on their computers. Well maybe some do, but I think the vast majority don't. Paper books are easily available and cheap. If you can't find the one you want in a local library or bookstore there are a multitude of ways of ordering them. You don't get tired looking at them, they are actually enjoyable. So why should there be a desire amongst the majority for e-books?
Don't get me wrong, I think it's a good idea, but not one that I, nor I think the majority, will go in for until a better way of presenting them is developed. LCDs are an improvement, but they are still shabby. I don't think a project like this is going to see much public interest until some better presentation medium is found. E-Paper will be needed before the E-book becomes a reality for most people. Some kind of little book-sized unit that you can hold and which will display on a matte, non-glaring, non-luminous surface.
There are a thousand forms of subversion, but few can equal the convenience and immediacy of a cream pie -Noel Godin
There seems to be an interesting recurring theme in human history - we constantly strive to build libraries but we have never yet built one that is quite "good enough".
The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Antony and Cleopatra. I was amused to note that the local University recently received funding approval to rebuild it - grants committees move slowly.
In mediaeval times, monks were the guardians of knowledge and the various monasteries dotted around Europe were oases of learning and knowledge in those times. Knowledge was restricted to the few.
The original Gutenberg made it possible to create huge volumes (literally) of knowledge and disseminate it on a wide scale. Ever since, people in power have sought to control this technology - either through censorship, copyright, or even education (you have to be able to read before a book is of greatest use to you.)
In Victorian England, the mark of a scholarly gentleman was in the breadth of works he maintained in his private library.
Perhaps a new initiative might be Gutenberg@Home, whereby any reader could make an electronic copy of physical works by some convenient, nondestructive means. By keeping such a personal library private, one would not have to worry about copyright laws, even as currently framed.
How much of what is holding us back from building the perfect library is simply our insistence on monetary-related restrictions? How long will it take us to realize that lengthy (in time) and complex or intensive (in resources consumed) PHYSICAL processes are the only ones to which we need to attach a value, that whatever happens in the electronic world should be free, and that the collation, assembly, verification, dissemination and application of the sum of human knowledge is one of the most important things that we could achieve?
STF
Mickey Mouse will never be public domain because MICKEY MOUSE IS A TRADEMARK/LOGO. That would be like forcing IBM to give up their IBM logo/colors/design.
However, *Copyrighted* works should eventually go into public domain. The point is that after you are dead, anything - be it a movie, song, cartoon, book, poem --- whatever --- serves a greater good to mankind than it could to its dead creator. I think that a decade or two is too short of a limit for copyright. If I write a book when I'm 20 years old, I should still be allowed to make money off the sale of that book when I'm 40. But when I'm in the grave, it serves me no use.
Now, it could be said that a person who works hard to create pieces of work like movies or books or songs should be allowed to bestow the revenue from use of that material after the original author is dead. If I write a book that still sells well 20 years after my death, my son and daughter should be allowed to benefit from this copyrighted item in my 'estate'.
But I think that indefinite extensions are ridiculous. I would say that 100 years is bordering on ridiculous. I think that 75 years is reasonable. If I create something when I'm 25, the copyright will outlive me by as much as 25 years.
In fact, I would propose that copyright should be extended to the life of the creator plus 20 years **OR** 50 years. Whichever is less (so if you die two years after the copyright, the copyright is still in effect for another 20 years).
Floppy disks get magnetized, hard drives crash, optical disks get scratched...A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media.
I think this take on media longevity also prevents progress WRT Project Gutenberg. Too many people don't see the point, when they can have the Library of Congress backed up on disk one day but be looking at a screen full of garbage characters the next because someone accidentally yanked the power supply on the server or whathaveyou.
A single $5 paperback book can be propagated more reliably than tens of thousands of dollars worth of networks and storage, although the latter system can admittedly hold a whole library's worth of that single book. But think about the infrastructure required to maintain the latter system. Until we have better media, the costs aren't justifiable, IMHO. It's an idea whose time has not yet come.
I prefer to phrase it, "Thus Project Gutenberg has raced ahead at an amazing rate. In its 32nd year in existence, the collection has 6,267 etexts, averaging almost 200 etexts per year. That works out to about one book every other day. This is more impressive given that in the first twenty years of the project's existence the Internet didn't exist in anywhere near the form we take for granted today. The popularization of the Internet has just accelerated the rate at which Project Gutenberg grows. With the help of Distributed Proofreaders, a project that allows average people to donate small amounts of time to proofread just one page at a time, Project Gutenberg can expect to add over 400 etexts per year. Clearly Project Gutenberg is thriving."
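The arithmetic behind those figures is easy to check. A quick Python sketch, where the variable and function names are mine and the only inputs are the numbers quoted above:

```python
def days_per_etext(etexts_per_year, days_in_year=365):
    """Average number of days between new etexts at a given yearly rate."""
    return days_in_year / etexts_per_year


total_etexts = 6267   # collection size quoted above
years = 32            # years in existence quoted above

per_year = total_etexts / years
print(f"average etexts per year: {per_year:.1f}")
print(f"days per etext so far:   {days_per_etext(per_year):.2f}")
print(f"days per etext at 400/yr: {days_per_etext(400):.2f}")
```

The historical average comes out a little under 200 per year, or a bit less than two days per book, and the projected 400 per year would be slightly better than one book a day.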
Search 2010 Gen Con events
full grown, like Athena springing from the head of Zeus, this criticism is largely valid.
Patience, however, is a virtue. Libraries of public domain works *grow.* Every work added remains. Although it may take many years, even generations, as did the construction of the Giza plateau, over time the pyramid grows toward its apex, another pyramid joins it, a temple is added to the side, and so on.
That's part of the point of Project Gutenberg. Not just to provide an online library but to do so in an immutable manner that only grows over time.
Adding only *one page* to the project is valuable, and that addition remains and is added to by others.
Even brick and mortar libraries can take generations to build. A two hundred year plan only requires patience to complete.
That said, I'm going to take an even more contrarian point of view to the Wired article. The amazing thing I find about Project Gutenberg is how much is already in there. It's already at the point that I think few people could manage to read one half of the texts available in their lifetime, and finding a project to donate is complicated by the fact that the hardest part may not be performing the labor, but simply finding a project that interests you that *hasn't already been done.*
It's already a remarkable collection, and I've had to, on occasion, resort to it because my local library didn't have a lending copy of the work I wanted, but Project Gutenberg could give me free ownership of it.
KFG
Additionally, translations might create practical limitations. If a text was written in ancient Greece and translated to English or some other language in the 20th century, the translation might not be public domain even when the original work is. Of course you are free to read the original text or make a new translation. Anyway, even if a piece of literature is public domain, the translation to your native language might not be.
That's exactly why. Since 1971 a wide variety of encodings and markup languages have existed. 32 years later the only system still trivial to read is plain old ASCII. Project Gutenberg is most interested in preserving the texts themselves. The texts are quite well preserved in ASCII. Sure, some formatting is missing, but it's relatively minor for the majority of books in question. And given the existence of this unformatted text, it's a lot easier to create formatted text than from scratch, so you even get a benefit there.
I think you're a bit confused on semantic markup. By and large publishers aren't interested in the semantics of the documentation, just the formatting.
Search 2010 Gen Con events
That's basically what Distributed Proofreaders does. Except they OCR the book first, so the proofreaders just need to fix the OCR errors. Every page goes through two passes. Then the entire book goes into post-processing where a single person puts all the pages together, and checks for problems that the proofers didn't know how to solve (marked with an asterisk). Once Distributed Proofreaders finishes the book, they pass it on to Project Gutenberg where somebody reviews the whole text again.
Distributed Proofreaders currently has a problem. After the previous Slashdot announcement, they were overwhelmed with volunteers. The volunteers processed books so fast, they were running out of material to work on. Three or four people scan in most of the books. They have been slaving away trying to keep up with the proofers.
Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
Gutenberg did NOT invent the printing press - he invented movable type - a BIG difference.
Before Gutenberg, there were printing presses, BUT you had to carve the master (the plate) for each page, and it could NOT be changed. Other folks had the IDEA of movable type, but what Gutenberg did was figure out a way to make it work (what he did was figure out how to make all the type the same length, so that when you press down, all the type comes in contact with the paper)
Movable type gives you one huge advantage - you can make up a bunch of sets of letters, and reuse them for many pages.
The total irony of this is that movable type is almost never used anymore - we make up a plate for each page. Of course, we are doing it with electronic movable type, but that is neither here nor there. Movable type started to go away with the Linotype machine - which made up one LINE of type at a time.
I think I still have an ingot of linotype metal around somewhere.
-- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
I, and probably many others here, like to read Project Gutenberg books on my Palm/Pocket PC. Whenever I have a little down time I can get that out and choose from a dozen "classic" books to read. Can't do that when the "book" is an 800x600 image, and your screen can only do 320x320 (Sony Clies, Palm Tungsten), 320x240 (PocketPCs, Handera), or 160x160 (almost all Palm and Handspring PDAs).
Plain text, HTML, or XML are much more portable than compressed images. Which is at least partly why Gutenberg uses plain ASCII text; it's readable on literally anything with an alphanumeric display, and by all signs will be for decades, if not centuries or millennia. Good luck finding a GIF or BMP in 100 years, let alone formats nobody's even heard of. I have plenty of pictures I made only a few years ago on an Apple II that can't be read by anything, even when I get them off the 5.25" floppies. Yet I've read code and other things written on computers from the 70s and 80s. ASCII Just Doesn't Die.
Actually I've found the most value from the project is downloading and reading classics. I've downloaded works by people such as: Adam Smith, Nietzsche, Aristotle, Plato, Karl Marx, Oscar Wilde, Thomas More, and various other classic writers. I've found this resource indispensable. It provides high quality texts for free. I probably wouldn't read many works by these authors if I had to purchase them. Unfortunately, I don't have the money to spend on many small works such as these (they're short, but sometimes cost $10-15). I also don't have easy access to a library and I like keeping a copy for my own personal use.
So I find that Project Gutenberg is a very useful resource.
neurostar
Best title for any paper, book or article on the subject: How to wreck a nice beach.
I just found this site a few days ago. Essentially, volunteers can proofread one page at a time, so that huge time commitments of doing an entire book yourself are not required. Worth checking out.
http://texts01.archive.org/dp/
They obviously publish articles written by people with their head up their asses.
Honestly, just what is Mr. J. Bradford DeLong thinking? To characterize Project Gutenberg as a failure is just imbecilic. From PG's own pages, 203 ebooks were released in October 2002. 1975 new books in 2002 (1240 in 2001). It's a lot of work to produce even one book, and PG is churning them out at a pretty good clip for an entirely volunteer effort.
Even as it is, I've found PG to be pretty damned useful. It's kind of nice to be able to grep the collected works of Shakespeare. Or Darwin. Or Conan Doyle. Or H. G. Wells. Or Jules Verne. Or Charles Dickens. Or L. Frank Baum.
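If you don't have grep handy, the same idea is a few lines of Python, precisely because the etexts are plain text. This is only a sketch: the function name is made up, and the three sample lines stand in for a downloaded etext.

```python
import re
from typing import Iterable, Iterator, Tuple


def grep_lines(lines: Iterable[str], pattern: str) -> Iterator[Tuple[int, str]]:
    """Yield (line number, line) for every line matching the pattern.

    Matching is case-insensitive; line numbers start at 1.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line.rstrip("\n")


# Stand-in for a plain-text etext; a real file object works the same way.
sample = [
    "To be, or not to be, that is the question:\n",
    "Whether 'tis nobler in the mind to suffer\n",
    "The slings and arrows of outrageous fortune,\n",
]

for number, line in grep_lines(sample, r"slings"):
    print(f"{number}: {line}")
```

For a real etext you would pass an open file instead of the list, since iterating over a text file also yields one line at a time.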
Despite advances in technology, scanning, OCRing and proofreading books remains a very labor-intensive process, and it is a boring, often thankless process as well. The Million Book Project wants to take a somewhat different approach to providing digital books: they actually scan the books and store them in DjVu format (a very nice format similar to PDF). They can do OCR on it to provide searchable text, but such text doesn't have to be 100% accurate to be effective. Most of the time you print and read the original scans. After all, some publisher went to the trouble of carefully typesetting the book and proofreading it once, so why bother to do it all again?
I first became aware of this project and technology when I met Brewster Kahle as he drove the Internet Bookmobile around the U.S., going to libraries and schools trying to drum up interest in Eldred v. Ashcroft. A compressed version of Alice in Wonderland in DjVu format is about 5 megabytes (the same as a single MP3) including the illustrations and fancy typesetting. He could print and bind a copy of it for about $2 in materials, on demand using an HP laser printer out of the back of the mobile. The binding isn't amazing, but consider the possibility of having literally any book in any small town library in any place in the world. It's an exciting idea, and one that technology is only making easier and cheaper. You can get a decent scanner for $100 (even one small enough to hook to a laptop and take to a library). You can scan a book in an evening. And after you do, the file can be converted to a simple, easy to use format that everyone can use. Forever. One evening. One person. One book.
Despite the setback of Eldred v. Ashcroft, more and more books are going to be made available by the true philanthropists of the world: the volunteers who give something of their own time to make the world a better place. I wonder what Mr. DeLong has done to make the world a better place...
There is much pleasure to be gained in useless knowledge.
That's all for now. Thanks to all the supportive comments in this thread, and to all the constructive criticism. And remember, a page a day is all it takes to contribute!
Greg Newby, Director and CEO
The Project Gutenberg Literary Archive Foundation
www.gutenberg.net