Slashdot Mirror


Why Project Gutenberg Isn't There Yet

option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg and similar efforts are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly)."

17 of 334 comments

  1. Re:Tex? by ProgressiveCynic · · Score: 5, Insightful

    Umm, Project Gutenberg can only legally use public domain works. If you know of any 100+ year old novels typeset in TeX, let's hear about it. Even if a modern reprint was done recently, do you think the publisher would really want to give away all that hard work so that everyone can get it for free instead of buying their spiffy new edition?

    --

    Delivering militantly anti-commercial music to all two people who care!

  2. Re:Cost of labor? by jonman_d · · Score: 5, Informative

    That's pretty much it - most of the books are in the public domain. AFAIK, the rest are all donated by their authors.

    From their FAQ:

    What books will I find in Project Gutenberg?

    We cannot publish any texts still in copyright. This generally means that our texts are taken from books published pre-1923. (It's more complicated than that, as our Copyright Page explains, but 1923 is a good first rule-of-thumb for the U.S.A.)

    So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of this century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.

    These books are chosen by our volunteers. Simply, a volunteer decides that a certain book should be in the archives, obtains the book and does the work necessary to turn it into an e-text. If you're interested in volunteering, click here.

  3. If you do want to help by Anonymous Coward · · Score: 5, Informative

    Distributed Proofreaders. Recently discussed on /. as well.

  4. copyright information by Anonymous Coward · · Score: 5, Informative

    Keep in mind the following copyright rules:

    1. Works first published before January 1, 1923 with proper copyright notice entered the public domain no later than 75 years from the date copyright was first secured. Hence, all works whose copyrights were secured before 1923 are now in the public domain.
    (This is the rule Project Gutenberg uses most often)
    Works published from 1923-1977 retain copyright for 95 years. No such works will enter the public domain until 2019.
    2. Works first created on or after January 1, 1978 enter the public domain 70 years after the death of the author if the author is a natural person.
    (Nothing will enter the public domain under this rule until at least January 1, 2049.)
    3. Works first created on or after January 1, 1978 which are created by a corporate author enter the public domain 95 years after publication or 120 years after creation whichever occurs first.
    (Nothing will enter the public domain under this rule until at least January 1, 2074.)
    4. Works created before January 1, 1978 but not published before that date are copyrighted under rules 2 and 3 above, except that in no case will the copyright on a work not published prior to January 1, 1978 expire before December 31, 2002. If the work is published before December 31, 2002, its copyright will not expire before December 31, 2047.
    (This rule copyrights a lot of manuscripts that we would otherwise think of as public domain because of their age.)
    5. If a substantial number of copies were printed and distributed in the U.S. prior to March 1, 1989 without a copyright notice, and the work is of entirely American authorship, or was first published in the United States, the work is in the public domain in the U.S.
    6. (This rule is complicated, and is seldom applied). Works published before 1964 needed to have their copyrights renewed in their 28th year, or they'd enter into the public domain. Some books originally published outside of the US by non-Americans are exempt from this requirement, under GATT. Works from before 1964 were automatically renewed if ALL of these apply:
       - At least one author was a citizen or resident of a foreign country (outside the US) that's a party to the applicable copyright agreements. (Almost all countries are parties to these agreements.)
       - The work was still under copyright in at least one author's "home country" at the time the GATT copyright agreement went into effect for that country (January 1, 1996 for most countries).
       - The work was first published abroad, and not published in the United States until at least 30 days after its first publication abroad.

    This means that we can't simply take electronic versions of modern texts and put them in the archive, because only out-of-copyright books are in there.
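
    As a rough sketch only, rules 1 to 3 above can be written as a small Python function. The function name is made up for illustration, the publication year stands in for the creation year, and terms are assumed to run to the end of the calendar year, which is what the dates quoted in the rules imply. Rules 4 to 6 are not modeled.

```python
from typing import Optional


def first_public_domain_year(published: int,
                             author_death: Optional[int] = None,
                             corporate: bool = False,
                             created: Optional[int] = None) -> int:
    """First calendar year a work is in the U.S. public domain (rules 1-3 only).

    Hypothetical helper: it assumes a term ends on December 31 of its last
    year, so the work becomes free on January 1 of the following year.
    """
    if published < 1923:
        # Rule 1: at most 75 years from the date copyright was secured.
        return published + 75 + 1
    if published < 1978:
        # Note under rule 1: works from 1923-1977 keep copyright for 95 years.
        return published + 95 + 1
    if corporate:
        # Rule 3: 95 years after publication or 120 after creation,
        # whichever comes first.
        created = published if created is None else created
        return min(published + 95, created + 120) + 1
    if author_death is None:
        raise ValueError("rule 2 needs the author's year of death")
    # Rule 2: 70 years after the death of the author.
    return author_death + 70 + 1


print(first_public_domain_year(1923))                     # 1923-1977 case
print(first_public_domain_year(1978, author_death=1978))  # rule 2
print(first_public_domain_year(1978, corporate=True))     # rule 3
```

    The three calls correspond to the "nothing will enter the public domain until..." notes in the list: 2019, 2049 and 2074.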

  5. Re:Tex? by Matrix · · Score: 5, Informative

    While this comment has been addressed, I'd like to point out that you can get pretty decent output from the Gutenberg texts by importing them into LyX. With just a little bit of work (basically setting up the chapters), LyX will allow you to create good-looking PDF, PostScript, HTML, etc., along with the LaTeX source. Combine this with rbmake and you can even read them, complete with hyperlinks, on your eBook (if you have one!)

  6. just scan and compress by Anonymous Coward · · Score: 5, Interesting

    The best and cheapest way to get existing books on the web is to scan them and compress the images. Compression technology for text images is so good (see DjVu), and storage so cheap nowadays that you are better off just distributing high resolution scans.

    This is a much more efficient way to make books available on the web than having volunteers painstakingly transcribe the text or correct OCR mistakes.

    OCR can be used for indexing scanned documents, but there is no need to do manual correction. DjVu can compress 300dpi black and white pages of text to 5-25KB. That's less than most HTML pages, and the images look just like the original book.

    The Million Book Project at the Internet Archive uses DjVu (as well as other formats).

    The open source implementation of DjVu is available on SourceForge.
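
    To put the 5-25KB per page figure in perspective, here is a back-of-the-envelope estimate in Python. The 300-page book length and the function name are assumptions for illustration, not measurements.

```python
def djvu_book_size_kb(pages, low_kb_per_page=5, high_kb_per_page=25):
    """Return (low, high) estimated size in KB for a scanned book.

    The per-page defaults are the 5-25KB range quoted above for
    300dpi black and white text pages.
    """
    return pages * low_kb_per_page, pages * high_kb_per_page


# Hypothetical 300-page book.
low, high = djvu_book_size_kb(300)
print(f"{low} KB to {high} KB")
```

    Under those assumptions a whole book comes to roughly 1.5 to 7.5 MB of page images.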

  7. The parent is "interesting"? by thac0 · · Score: 5, Informative

    The article didn't say that OCR was faster than speech, it said that speech was faster than transcribing it.

    Come on mods, read more carefully.

    --
    poliglut.org: they're still alive and fighting the man
  8. What's more. . . by kfg · · Score: 5, Informative

    it is part of the philosophy of Project Gutenberg to publish all of their works in the lowest level standard format, thus ensuring continued cross platform, program independent readability, ad infinitum.

    That means *plain* ASCII. Plain ASCII means you could read it in edlin if you really had to.

    This is a Good Thing.

    This also means that if you wish to format any Project Gutenberg text, in HTML or TeX for publication, you start with a blank slate and can immediately start to work your own will upon the raw text.

    This is also a Good Thing.
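
    If you want to check that a text really is plain ASCII before you start formatting it, a minimal Python sketch could look like this. The function name is hypothetical and this is not a Project Gutenberg tool; it just reports every character outside the 7-bit range.

```python
from typing import List, Tuple


def non_ascii_positions(text: str) -> List[Tuple[int, int, str]]:
    """Return (line, column, character) for each non-ASCII character.

    Lines and columns are counted from 1. An empty list means the
    text is plain 7-bit ASCII.
    """
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                hits.append((lineno, col, ch))
    return hits


sample = "plain text\nna\u00efve"
print(non_ascii_positions(sample))
```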

    KFG

  9. Re:Books Are Printed With Computers... by BJH · · Score: 5, Insightful
    I used to be a book editor (at a Japanese publishing company). Let me give you a rundown of the process we followed (I'm sure there are more efficient places than the one I worked at - O'Reilly is well known for their high level of automation).

    1. Get manuscript from author.
       This could be either handwritten or typed. If typed, it's likely to be in either plain text or Word format, but with a lot of errors.

    2. If the manuscript's handwritten, farm it out to a typist.
       We used to pay 0.5 yen a letter for English, 1 yen a character for Japanese.

    3. Once it's data, edit.
       I used to do my editing on a Mac with BBEdit, but this varies a lot between editors - some do it on (shudder) Word, where all the formatting gets in the way.

    4. Reformat it to pass it to the DTP firm.
       When I say 'reformat', I don't mean making things bold or italic - I mean cleaning it up so it's easy to do the next step, which is...

    5. Print out and insert format directions.
       The manuscript is printed out, and you go through it one line at a time adding things like "Line break here" and "Use larger font for this".

    6. Proofs arrive from the DTP firm.
       You go through the proofs, making corrections by hand (i.e., "Move this down one line", etc.)

    7. The DTP firm passes you back the formatted data.
       QuarkXPress is king here. You get the data in a finished form and pass it to the printers.

    8. The printer produces the final proofs.
       You can still make corrections, but these have to be done by the DTP firm, who then give you the updated data.

    9. Last-minute corrections are made.
       This depends on the printer, but quite often these are done by pasting the changes over the top of the printer film (i.e., they're not reflected in the data).

    10. The book is printed.
        Corrections after printing are usually done as described above (pasting changes over the film).

    The problem with this is that the text data held by the editor is now out-of-date in all sorts of ways:
    - It doesn't have the corrections made by the DTP firm.
    - It doesn't have the corrections made by the printer.
    - It doesn't have any formatting.

    QuarkXPress can output the data in other forms, but it's still missing the last-minute changes and after-printing changes, and quite frankly once it's on the market, most publishing companies aren't interested in reworking the data to keep it as text for the next 90 years, so it can be released into the public domain.

  10. In Search of the Perfect Library by drmofe · · Score: 5, Interesting

    There seems to be an interesting recurring theme in human history - we constantly strive to build libraries but we have never yet built one that is quite "good enough".

    The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Antony and Cleopatra. I was amused to note that the local University recently received funding approval to rebuild it - grants committees move slowly.

    In mediaeval times, monks were the guardians of knowledge and the various monasteries dotted around Europe were oases of learning and knowledge in those times. Knowledge was restricted to the few.

    The original Gutenberg made it possible to create huge volumes (literally) of knowledge and disseminate it on a wide scale. Ever since, people in power have sought to control this technology - either through censorship, copyright, or even education (you have to be able to read before a book is of greatest use to you.)

    In Victorian England, the mark of a scholarly gentleman was in the breadth of works he maintained in his private library.

    Perhaps a new initiative might be Gutenberg@Home whereby any reader made an electronic copy of physical works by some convenient, nondestructive means. By keeping such a personal library private, one would not have to worry about copyright laws, even as currently framed.

    How much of what is holding us back from building the perfect library is simply our insistence on monetary-related restrictions? How long will it take us to realize that lengthy (in time) and complex or intensive (in resources consumed) PHYSICAL processes are the only ones to which we need to attach a value, that whatever happens in the electronic world should be free, and that the collation, assembly, verification, dissemination and application of the sum of human knowledge is one of the most important things that we could achieve?

    STF

  11. Re:The REAL Problem by Anonymous Coward · · Score: 5, Insightful

    Mickey Mouse will never be public domain because MICKEY MOUSE IS A TRADEMARK/LOGO. That would be like forcing IBM to give up their IBM logo/colors/design.

    However, *Copyrighted* works should eventually go into public domain. The point is that after you are dead, anything - be it a movie, song, cartoon, book, poem --- whatever --- serves a greater good to mankind than it could to its dead creator. I think that a decade or two is too short of a limit for copyright. If I write a book when I'm 20 years old, I should still be allowed to make money off the sale of that book when I'm 40. But when I'm in the grave, it serves me no use.

    Now, it could be said that a person who works hard to create pieces of work like movies or books or songs should be allowed to pass on the revenue from that material to their heirs after they die. If I write a book that still sells well 20 years after my death, my son and daughter should be allowed to benefit from this copyrighted item in my 'estate'.

    But I think that indefinite extensions are ridiculous. I would say that 100 years is bordering on ridiculous. I think that 75 years is reasonable. If I create something when I'm 25, the copyright will outlive me by as much as 25 years.

    In fact, I would propose that copyright should be extended to the life of the creator plus 20 years **OR** 50 years, whichever is less (so if you die two years after the copyright, the copyright is still in effect for another 20 years).
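
    Read literally, that proposal is the lesser of "remaining life plus 20 years" and a flat 50 years. A minimal Python sketch, with a made-up function name and ages chosen purely for illustration:

```python
def proposed_term_years(age_at_creation: int, age_at_death: int) -> int:
    """Copyright term in years under the proposal above.

    The term is whichever is less: (years the creator lives after
    creating the work) + 20, or a flat 50 years.
    """
    if age_at_death < age_at_creation:
        raise ValueError("creator cannot die before creating the work")
    life_plus_20 = (age_at_death - age_at_creation) + 20
    return min(life_plus_20, 50)


# Creator dies two years after creating the work: 2 + 20 years.
print(proposed_term_years(25, 27))
# Creator lives another 50 years: the flat 50-year cap applies.
print(proposed_term_years(25, 75))
```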

  12. Stupid article. Project Gutenberg doing great. by ChaosDiscord · · Score: 5, Insightful
    "Thus Project Gutenberg has inched ahead at a snail's pace. In its 32nd year of existence, the collection has only 6,267 etexts."

    I prefer to phrase it, "Thus Project Gutenberg has raced ahead at an amazing rate. In its 32nd year in existence, the collection has 6,267 etexts, averaging almost 200 etexts per year. That works out to about one book every other day. This is more impressive given that in the first twenty years of the project's existence the Internet didn't exist in anything near the form we take for granted today. The popularization of the Internet has just accelerated the rate at which Project Gutenberg grows. With the help of Distributed Proofreaders, a project that allows average people to donate small amounts of time to proofread just one page at a time, Project Gutenberg can expect to add over 400 etexts per year. Clearly Project Gutenberg is thriving."
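
    The averages in that rephrasing are easy to check. A quick Python sketch, using only the two figures quoted from the article (the function name is just for illustration):

```python
def average_rate(etexts: int, years: int):
    """Return (etexts per year, days per etext) for a simple average."""
    per_year = etexts / years
    days_per_etext = 365 / per_year
    return per_year, days_per_etext


per_year, days_each = average_rate(6267, 32)
print(f"{per_year:.1f} etexts per year, one every {days_each:.2f} days")
```

    That comes to a little under 196 etexts per year, or one roughly every 1.9 days, which matches "almost 200 per year" and "about one book every other day".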

  13. Re:Cost of labor? by GammaTau · · Score: 5, Informative

    Additionally, translations might create practical limitations. If a text was written in ancient Greece and translated to English or some other language in the 20th century, the translation might not be public domain even when the original work is. Of course you are free to read the original text or make a new translation. Anyway, even if a piece of literature is public domain, the translation into your native language might not be.

  14. That's part of what DP does by smiff · · Score: 5, Informative
    "Why not modify that in such a way as to have available a scanned image of a single page of the book, along with an empty box to enter text?"

    That's basically what Distributed Proofreaders does. Except they OCR the book first, so the proofreaders just need to fix the OCR errors. Every page goes through two passes. Then the entire book goes into post-processing where a single person puts all the pages together, and checks for problems that the proofers didn't know how to solve (marked with an asterisk). Once Distributed Proofreaders finishes the book, they pass it on to Project Gutenberg where somebody reviews the whole text again.

    Distributed Proofreaders currently has a problem. After the previous Slashdot announcement, they were overwhelmed with volunteers. The volunteers processed books so fast, they were running out of material to work on. Three or four people scan in most of the books. They have been slaving away trying to keep up with the proofers.

    Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
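
    For anyone who thinks in code, the workflow described above (OCR first, two proofing passes per page, then one person assembling the book and checking asterisk-marked problems) might be modeled roughly like this in Python. This is a toy sketch, not the site's actual software; every class and function name here is invented.

```python
from dataclasses import dataclass
from typing import List, Tuple

PASSES_REQUIRED = 2  # "Every page goes through two passes"


@dataclass
class Page:
    number: int
    text: str            # starts out as raw OCR output
    passes_done: int = 0


def proof(page: Page, corrected_text: str) -> None:
    """Record one volunteer's proofing pass over a single page."""
    page.text = corrected_text
    page.passes_done += 1


def post_process(pages: List[Page]) -> Tuple[str, List[int]]:
    """Join fully proofed pages in order and list the pages that still
    contain an asterisk, i.e. a problem a proofer could not solve."""
    unfinished = [p.number for p in pages if p.passes_done < PASSES_REQUIRED]
    if unfinished:
        raise ValueError(f"pages still need proofing: {unfinished}")
    ordered = sorted(pages, key=lambda p: p.number)
    flagged = [p.number for p in ordered if "*" in p.text]
    return "\n".join(p.text for p in ordered), flagged


# Tiny made-up example: two pages, each proofed twice.
p1, p2 = Page(1, "Tbe cat"), Page(2, "sat on tbe m*t")
for _ in range(PASSES_REQUIRED):
    proof(p1, "The cat")
    proof(p2, "sat on the m*t")
book, flagged = post_process([p2, p1])
print(flagged)
```

    The post-processor sees page 2 flagged because a proofer left the asterisk in place for someone else to resolve.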

  15. Re:And not going anywhere soon.... by dvdeug · · Score: 5, Insightful

    "Floppy disks get magnetized, hard drives crash, optical disks get scratched... A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media."

    One small book also takes up the space of a hard drive, and can't be redownloaded, or backed up. If my roof leaks, or I have a fire, it will cost me thousands of dollars to replace my books, and some will be hard to impossible to replace. If my hard drive crashes, I redownload the files from Gutenberg, and/or restore them from my backups.

  16. Distributed Proofreaders by Amata · · Score: 5, Informative

    I just found this site a few days ago. Essentially, volunteers can proofread one page at a time, so that huge time commitments of doing an entire book yourself are not required. Worth checking out.

    http://texts01.archive.org/dp/

  17. Semi-official response from Project Gutenberg by gbnewby · · Score: 5, Insightful
    Michael Hart and I are working on a written response that we'll send to Wired and other media, but by then this /. article will be off the front page. So, allow me to make a few comments.
    • Projecting back to 1971, Project Gutenberg has tracked Moore's Law quite precisely. January 2003 will be our most productive month ever, and we are looking forward to continuing to double our rate of new eBooks every 18 months.
    • Project Gutenberg has received some big donations, and we're working on grants and other funding. However, when you do the math you realize that there's essentially no hope for paying for content -- it takes thousands and thousands of people. The hope for "someone" to do it is naive -- the only answer is to figure out ways for "everyone" to work on digitization.
    • While the author makes 6200 books sound like small potatoes, in fact it represents about 1/3 of all eBooks listed in places like the Internet Public Library. Not bad, and it certainly explains why some random book the author wants isn't part of the collection -- there just aren't that many projects working on digitizing literature.
    • Where did the author figure on $750 million, and for what? Over 30 million printed books were registered for copyright in the last 100 years (this doesn't count magazines, recordings, etc.). The notion that $25/book could pay for digitization is not unreasonable. But where do you get the books, and what about copyright? If there's a plan, I'd like to hear it.
    • One more point, to keep this short: We have just under 7000 eBooks (up about 800 from whenever the author did his research!). We have over 1000 active volunteers. The books are in over 20 languages, dozens of formats and, if printed, would fill a small library. We're on track to reach #10,000 in 2003. Via Distributed Proofreading, as mentioned here and in a previous /. story, we can and frequently do complete digitizing a 300 page book in just a few hours. Mr. DeLong, I don't feel apologetic about these numbers at all.
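
    Two of the figures in the points above can be sanity-checked with a few lines of Python. This is only an illustrative sketch: the function names are made up, and the starting rate of 100 eBooks is a placeholder, not a Project Gutenberg statistic.

```python
def digitization_cost(books: int, dollars_per_book: int) -> int:
    """Total cost of digitizing `books` titles at a flat per-book price."""
    return books * dollars_per_book


def projected_rate(rate_now: float, months_ahead: float,
                   doubling_months: float = 18) -> float:
    """Rate of new eBooks after `months_ahead`, assuming the rate
    doubles every `doubling_months` months."""
    return rate_now * 2 ** (months_ahead / doubling_months)


# 30 million registered books at $25 each.
print(digitization_cost(30_000_000, 25))

# Placeholder starting rate of 100 eBooks per period, three years out.
print(projected_rate(100, 36))
```

    The first call gives $750,000,000, which is presumably where the article's figure comes from; the second shows that doubling every 18 months means four times the rate after three years.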

    That's all for now. Thanks to all the supportive comments in this thread, and to all the constructive criticism. And remember, a page a day is all it takes to contribute!

    Greg Newby, Director and CEO
    The Project Gutenberg Literary Archive Foundation
    www.gutenberg.net