Slashdot Mirror


Why Project Gutenberg Isn't There Yet

option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg and similar efforts are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (it's faster to read a book aloud than to transcribe it correctly)."

33 of 334 comments

  1. TeX? by duckpoopy · · Score: 0, Insightful

    Aren't many books typeset using LaTeX? Just post the source.

    --
    word.
    1. Re:TeX? by ProgressiveCynic · · Score: 5, Insightful

      Umm, Project Gutenberg can only legally use public domain works. If you know of any 100+ year old novels typeset in TeX, let's hear about it. Even if a modern reprint was done recently, do you think the publisher would really want to give away all that hard work so that everyone can get it for free instead of buying their spiffy new edition?

      --

      Delivering militantly anti-commercial music to all two people who care!

  2. Cost of labor? by Anonymous Coward · · Score: 3, Insightful

    What about the cost of the books? Unless the only books you have in this "universal library" are old enough to be without copyrights, won't there be a problem in finding funding to buy current day books?

    1. Re:Cost of labor? by Acidic_Diarrhea · · Score: 2, Insightful

      That's exactly it. Check out their website. All the works they currently have and all the ones they want to get are public domain. So it's a big project but one that we can eventually finish since the age of intellectual property that never expires is upon us. Today's books won't ever be in the public domain if the current trend continues.

      --
      I hate liberals. If you are a liberal, do not reply.
    2. Re:Cost of labor? by Nekoi · · Score: 2, Insightful

      Well... the project is a worthy cause. I don't see copyrights being the biggest problem. Most books that are worth reading are older classics anyway. Plus, the project is probably aimed at providing people who have no access to a library with the materials available in one. I mean, if you live in the middle of nowhere, how are you supposed to get new books, let alone a library? On the other hand, an e-library can provide that person with the same material, as long as he/she has an internet connection. As to the cost of labor, there are always people who are willing to put in a little time for a worthy cause. You just need to advertise well.

  3. The REAL Problem by echucker · · Score: 2, Insightful

    So many of the things that people want to read are copyrighted, and won't be available until long after we're dead.

    1. Re:The REAL Problem by Lenbok · · Score: 2, Insightful

      If they become available at all, given the current copyright extension precedents.

    2. Re:The REAL Problem by Anonymous Coward · · Score: 5, Insightful

      Mickey Mouse will never be public domain because MICKEY MOUSE IS A TRADEMARK/LOGO. That would be like forcing IBM to give up their IBM logo/colors/design.

      However, *copyrighted* works should eventually go into the public domain. The point is that after you are dead, anything - be it a movie, song, cartoon, book, poem, whatever - serves a greater good to mankind than it could to its dead creator. I think that a decade or two is too short a limit for copyright. If I write a book when I'm 20 years old, I should still be allowed to make money off the sale of that book when I'm 40. But when I'm in the grave, it serves me no use.

      Now, it could be said that a person who works hard to create pieces of work like movies or books or songs should be allowed to pass on the revenue from that material after he or she is dead. If I write a book that still sells well 20 years after my death, my son and daughter should be allowed to benefit from this copyrighted item in my 'estate'.

      But I think that indefinite extensions are ridiculous. I would say that 100 years is bordering on ridiculous. I think that 75 years is reasonable. If I create something when I'm 25, the copyright will outlive me by as much as 25 years.

      In fact, I would propose that copyright should be extended to the life of the creator plus 20 years **OR** 50 years. Whichever is less (so if you die two years after the copyright, the copyright is still in effect for another 20 years).

    3. Re:The REAL Problem by L. J. Beauregard · · Score: 3, Insightful

      There is some good in letting a copyright extend beyond the author's death. An author may die with children still not yet grown, and his royalties can provide for them. Life plus 20 or maybe 25 should be enough for this.

      Some posthumous works may come out under a life-plus-X term that might have been cast aside under a life-plus-zero term. Life plus 50 is probably more than enough.

      Life plus 70 is absurd and our so-called elected officials should be ashamed of going along with it. And may Sonny Bono *not* rest in peace.

      --
      Ooh, moderator points! Five more idjits go to Minus One Hell!
      Delendae sunt RIAA, MPAA et Windoze
  4. Re:Speech recognition? by Anonymous Coward · · Score: 1, Insightful

    When thinking of free labour, volunteers are most likely able to type faster than they can speak. I'm wondering if it wouldn't be faster to scan it.

  5. Time to Request Digital Copies from Publishers by CaptCanuk · · Score: 4, Insightful

    All digital versions of books that publishers have should be requested and maintained in a safe place till their respective copyrights expire, so that they can be easily integrated into the public domain... especially if OCR or speech recognition doesn't get any better any time soon.

    --
    ---- The geek shall inherit the Earth.
  6. Just daydreaming here. by eniu!uine · · Score: 3, Insightful

    As someone pointed out, the real problem is the copyright issue. Most works are copyrighted and copyrights last for way too long. The Constitution states that copyright should be limited, but when it's lifetime plus 90 years, it may as well be unlimited since we'll all be dead before they expire. There needs to be a grassroots movement to inspire a repeal of some seriously damaging legislation. I feel confident that most Slashdot readers agree about what needs to be done, but we seem too apathetic to actually do something about it. Sometimes I wish someone would post a link that says 'click here to vote for freedom'. If only it were that easy.

    I think an interesting project would be public domain textbooks. Textbooks are grossly overpriced and contain information that is largely available for free. If a community of developers can create an OS like Linux, then the educational community should be able to come up with open textbooks.

  7. Huh? by Tyler Eaves · · Score: 2, Insightful

    Huh? I can type a good bit faster than I can speak.

    --
    TODO: Something witty here...
  8. Transcribing? by bravehamster · · Score: 4, Insightful

    Hah, try transcribing "Huckleberry Finn", or any Dr. Seuss, or better yet, try "Feersum Endjinn" by Iain M. Banks. I'd love to see what a transcriber would do to that one. Given the number of made-up words in literature, catching and correcting the mistakes a transcriber commits would make it less than useless.

    --
    ---- El diablo esta en mis pantalones! Mire, mire!
  9. Re:Books Are Printed With Computers... by BJH · · Score: 5, Insightful

    I used to be a book editor (at a Japanese publishing company). Let me give you a rundown of the process we followed (I'm sure there are more efficient places than the one I worked at - O'Reilly is well known for their high level of automation).

    1. Get manuscript from author.
       This could be either handwritten or typed. If typed, it's likely to be in either plain text or Word format, but with a lot of errors.

    2. If the manuscript's handwritten, farm it out to a typist.
       We used to pay 0.5 yen a letter for English, 1 yen a character for Japanese.

    3. Once it's data, edit.
       I used to do my editing on a Mac with BBEdit, but this varies a lot between editors - some do it on (shudder) Word, where all the formatting gets in the way.

    4. Reformat it to pass it to the DTP firm.
       When I say 'reformat', I don't mean making things bold or italic - I mean cleaning it up so it's easy to do the next step, which is...

    5. Print out and insert format directions.
       The manuscript is printed out, and you go through it one line at a time adding things like "Line break here" and "Use larger font for this".

    6. Proofs arrive from the DTP firm.
       You go through the proofs, making corrections by hand (e.g., "Move this down one line").

    7. The DTP firm passes you back the formatted data.
       QuarkXPress is king here. You get the data in a finished form and pass it to the printers.

    8. The printer produces the final proofs.
       You can still make corrections, but these have to be done by the DTP firm, who then give you the updated data.

    9. Last-minute corrections are made.
       This depends on the printer, but quite often these are done by pasting the changes over the top of the printer film (i.e., they're not reflected in the data).

    10. The book is printed.
        Corrections after printing are usually done as described above (pasting changes over the film).

    The problem with this is that the text data held by the editor is now out-of-date in all sorts of ways:
    - It doesn't have the corrections made by the DTP firm.
    - It doesn't have the corrections made by the printer.
    - It doesn't have any formatting.

    QuarkXPress can output the data in other forms, but it's still missing the last-minute changes and after-printing changes. Quite frankly, once a book is on the market, most publishing companies aren't interested in reworking the data and keeping it as text for the next 90 years so that it can be released into the public domain.

  10. Re:WiReD by NerdSlayer · · Score: 2, Insightful

    Seriously. And there were a couple more earlier in the week, I believe. What's the deal? Slashdot has turned into Wired with trolls substituted for pictures and illustrations. Well, I guess there's the goatse guy...

  11. Stupid article. Project Gutenberg doing great. by ChaosDiscord · · Score: 5, Insightful

    "Thus Project Gutenberg has inched ahead at a snail's pace. In its 32nd year of existence, the collection has only 6,267 etexts."

    I prefer to phrase it, "Thus Project Gutenberg has raced ahead at an amazing rate. In its 32nd year of existence, the collection has 6,267 etexts, averaging almost 200 etexts per year. That works out to about one book every other day. This is more impressive given that in the first twenty years of the project's existence the Internet didn't exist in anything like the form we take for granted today. The popularization of the Internet has just accelerated the rate at which Project Gutenberg grows. With the help of Distributed Proofreaders, a project that allows average people to donate small amounts of time to proofread just one page at a time, Project Gutenberg can expect to add over 400 etexts per year. Clearly Project Gutenberg is thriving."
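
    The arithmetic behind that rephrasing is easy to check. Here is a rough Python sketch that uses only the two figures quoted above (6,267 etexts, 32 years); the function names are made up for illustration, and the 365.25-day year is an assumption.

```python
# Figures quoted in the comment above.
ETEXTS = 6267
YEARS = 32


def etexts_per_year(etexts, years):
    """Average number of etexts added per year."""
    return etexts / years


def days_per_etext(etexts, years, days_in_year=365.25):
    """Average number of days between new etexts (365.25 is an assumption)."""
    return years * days_in_year / etexts


if __name__ == "__main__":
    print(f"{etexts_per_year(ETEXTS, YEARS):.1f} etexts per year")
    print(f"one etext every {days_per_etext(ETEXTS, YEARS):.2f} days")
```

    Both results sit close to the "almost 200 per year" and "one book every other day" claims.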

  12. If one demands that the library be born. . . by kfg · · Score: 4, Insightful

    full grown, like Athena springing from the head of Zeus, this criticism is largely valid.

    Patience, however, is a virtue. Libraries of public domain works *grow.* Every work added remains. Although it may take many years, even generations, as did the construction of the Giza plateau, over time the pyramid grows toward its apex, another pyramid joins it, a temple is added to the side, and so on.

    That's part of the point of Project Gutenberg. Not just to provide an online library but to do so in an immutable manner that only grows over time.

    Adding only *one page* to the project is valuable, and that addition remains and is added to by others.

    Even brick and mortar libraries can take generations to build. A two hundred year plan only requires patience to complete.

    That said, I'm going to take an even more contrarian point of view to the Wired article. The amazing thing I find about Project Gutenberg is how much is already in there. It's already at the point that I think few people could manage to read one half of the texts available in their lifetime, and donating labor is complicated by the fact that the hardest part may not be performing the work, but simply finding a project that interests you that *hasn't already been done.*

    It's already a remarkable collection, and I've had to, on occasion, resort to it because my local library didn't have a lending copy of the work I wanted, but Project Gutenberg could give me free ownership of it.

    KFG

  13. Re:plain text -- WHY?? by johnwroach · · Score: 2, Insightful

    So it will be compatible with anything. Every computer can handle plain text (and darn near every program, too). The same isn't true with marked up source.

  14. Sure, who needs searchable text... by DeHar · · Score: 2, Insightful

    Scanned documents might be fine for readers, but what if you're looking for "oh, you know, that one line in the book, where the dude was talking about melons."

    A computer is NOT a glowing piece of paper with scrollbars.

  15. Re:The parent is "interesting"? by timeOday · · Score: 3, Insightful

    So what? Rowing across the ocean is faster than swimming. Most of us still fly.

    Sure, for the best scanning speed you have to cut the binding off and use a sheet feeder. But even scanning 2 pages at a time will be far faster than reading the whole thing out loud.

    So what is your point?

  16. Re:I don't like reading online! by Gholam · · Score: 2, Insightful

    The importance of having literature available in digital format extends far beyond just the ability to read it on your computer.

    For a start, digital copies are easier and cheaper to store than paper-based documents. For older documents, keeping a digital reproduction may be the only way to ensure the continuing existence of the work.

    The "plain ASCII" restriction on all the documents in PG is a boon for usability in areas other than screen-based reproduction. For instance, you can print the document in a variety of formats, or have it played to you as sound. Quoting and searching digital material is also significantly faster than with paper documents.

    Reading documents on your computer may be the most obvious, but it's certainly not the only benefit of digital literature.

    --
    -- Matt Ryall
  17. Re:Librarians? by Anonymous Coward · · Score: 1, Insightful

    Yeah, right. They don't have anything to do. Let's have people with advanced degrees and public service management jobs doing typing for free.

    Hey, this involves computers. Every person who has a computer should be expected to *volunteer* to type, word for word, an entire book every year as a side project, because, I mean, hey, you ain't got nuthin' better to do with your life, right?

    Actually a lot of librarians do contribute to the many, many digital library projects *you* take advantage of, without thinking of the amount of work invested, and without appreciation for their efforts.

    Before volunteering the efforts of others, volunteer your own.

  18. The Wired article misses the point by Anonymous Coward · · Score: 2, Insightful

    The author makes a good observation, but misses the point afterwards. The Web is curiously devoid of primary subject matter. There are book reviews, but few books; movie reviews, but not the movies; music commentary but little music. It's a web of opinion, not knowledge.

    But the problem isn't volunteers, it's litigation. Copyright law, DMCA, etc. The sources aren't there because the greedy owners won't allow them to be put there. The ebook-list over the last week has been publishing notes from various authors (real authors, not corporations like Disney) that read, "You'll get my copyrights when you pry them from my cold dead hands (and even then I'd like to leave them to my children!)."

    If Project Gutenberg could publish modern texts, there would be an explosion of interest and activity, and a more or less immediate on-line library. But since it can only digitize books written before 1923, more or less, there's mainly interest from historians, English majors, and True Believers.

  19. Downside to that method: by Anonymous Coward · · Score: 4, Insightful

    I, and probably many others here, like to read Project Gutenberg books on my Palm/Pocket PC. Whenever I have a little down time I can get that out and choose from a dozen "classic" books to read. Can't do that when the "book" is an 800x600 image, and your screen can only do 320x320 (Sony Clies, Palm Tungsten), 320x240 (PocketPCs, Handera), or 160x160 (almost all Palm and Handspring PDAs).

    Plain text, HTML, or XML are much more portable than compressed images. Which is at least partly why Gutenberg uses plain ASCII text; it's readable on literally anything with an alphanumeric display, and by all signs will be for decades, if not centuries or millennia. Good luck finding a GIF or BMP in 100 years, let alone formats nobody's even heard of. I have plenty of pictures I made only a few years ago on an Apple II that can't be read by anything, even when I get them off the 5.25" floppies. Yet I've read code and other things written on computers from the 70s and 80s. ASCII Just Doesn't Die.

  20. Re:And not going anywhere soon.... by dvdeug · · Score: 5, Insightful

    "Floppy disks get magnetized, hard drives crash, optical disks get scratched... A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media."

    One small book also takes up the space of a hard drive, and can't be redownloaded, or backed up. If my roof leaks, or I have a fire, it will cost me thousands of dollars to replace my books, and some will be hard to impossible to replace. If my hard drive crashes, I redownload the files from Gutenberg, and/or restore them from my backups.

  21. Re:other issues... by failedlogic · · Score: 2, Insightful

    If you can publish a Poe compilation in a run of 50,000+ copies, sell them all in a few years, then reprint and repeat, you make a lot of money: there are no royalty payments and you charge $20 a copy. So they'll keep making it.

    And this is where PG is having volunteer problems. The old books still make publishers money, so their pre-press electronic versions won't be made available for free. So the process of OCR'ing the text at no cost is painstaking and labour-intensive, and it serves a limited readership. What motivation!

  22. Comment removed by account_deleted · · Score: 3, Insightful

    Comment removed based on user account deletion

  23. Semi-official response from Project Gutenberg by gbnewby · · Score: 5, Insightful

    Michael Hart and I are working on a written response that we'll send to Wired and other media, but by then this /. article will be off the front page. So, allow me to make a few comments.
    • Projecting back to 1971, Project Gutenberg has tracked Moore's Law quite precisely. January 2003 will be our most productive month ever, and we are looking forward to continuing to double our rate of new eBooks every 18 months.
    • Project Gutenberg has received some big donations, and we're working on grants and other funding. However, when you do the math you realize that there's essentially no hope for paying for content -- it takes thousands and thousands of people. The hope for "someone" to do it is naive -- the only answer is to figure out ways for "everyone" to work on digitization.
    • While the author makes 6200 books sound like small potatoes, in fact it represents about 1/3 of all eBooks listed in places like the Internet Public Library. Not bad, and it certainly explains why some random book the author wants isn't part of the collection -- there just aren't that many projects working on digitizing literature.
    • Where did the author get the $750 million figure, and what is it for? Over 30 million printed books were registered for copyright in the last 100 years (this doesn't count magazines, recordings, etc.). The notion that $25/book could pay for digitization is not unreasonable. But where do you get the books, and what about copyright? If there's a plan, I'd like to hear it.
    • One more point, to keep this short: We have just under 7000 eBooks (up about 800 from whenever the author did his research!). We have over 1000 active volunteers. The books are in over 20 languages, dozens of formats and, if printed, would fill a small library. We're on track to reach #10,000 in 2003. Via Distributed Proofreading, as mentioned here and in a previous /. story, we can and frequently do complete digitizing a 300 page book in just a few hours. Mr. DeLong, I don't feel apologetic about these numbers at all.
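
    For anyone who wants to check the numbers in the points above, here is a small Python sketch. It takes the $750 million and 30 million book figures and the 18-month doubling period straight from the text; the function names and the 36-month example are assumptions made for illustration, not anything Project Gutenberg publishes.

```python
def cost_per_book(total_dollars, books):
    """Budget available per book if the total is spread evenly."""
    return total_dollars / books


def growth_factor(months, doubling_months=18):
    """How much a rate grows over `months` if it doubles every `doubling_months`."""
    return 2 ** (months / doubling_months)


if __name__ == "__main__":
    # $750 million across 30 million registered books.
    print(cost_per_book(750_000_000, 30_000_000))
    # Growth in the monthly rate of new eBooks over three years (hypothetical span).
    print(growth_factor(36))
```

    Under the doubling assumption, the rate of new eBooks quadruples every three years, which is why the early decades contribute so little to the total.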

    That's all for now. Thanks to all the supportive comments in this thread, and to all the constructive criticism. And remember, a page a day is all it takes to contribute!

    Greg Newby, Director and CEO
    The Project Gutenberg Literary Archive Foundation
    www.gutenberg.net

  24. You are correct on all points by kfg · · Score: 2, Insightful

    In fact ASCII text can even be human translated (although not really human read) if all you have is the *binary*.
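
    To make that concrete, here is a minimal Python sketch of turning a string of bits back into ASCII text, and the reverse. It assumes 8-bit groups separated by spaces; the helper names are hypothetical. A person with an ASCII table could do the same lookup by hand, just much more slowly.

```python
def bits_to_text(bits):
    """Decode space-separated 8-bit groups, e.g. '01001000 01101001', as ASCII."""
    return bytes(int(chunk, 2) for chunk in bits.split()).decode("ascii")


def text_to_bits(text):
    """Encode ASCII text as space-separated 8-bit groups."""
    return " ".join(format(byte, "08b") for byte in text.encode("ascii"))


if __name__ == "__main__":
    encoded = text_to_bits("Hi")
    print(encoded)
    print(bits_to_text(encoded))
```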

    The poster to whom you reply seems to have missed the essential point.

    I would give you one caveat though. English may well be the language of the internet (and I'll leave the argument as to whether that's a good or bad thing to the students), but it isn't the language of *literature.*

    It would certainly be a Good Thing to be able to store the Vedas and Sun-Tzu, in the original script, in the lowest-level human-readable electronic form possible.

    This, however, as you note, will apparently have to wait for some future time.

    KFG

  25. Scanned Images are not Accessible by joyjoy · · Score: 2, Insightful

    Another side benefit of good old ASCII - text to speech! Or braille displays! Heck, you can read it on any device, changing it to any resolution you want quickly and easily.

  26. Re:copyright sucks but... by kalidasa · · Score: 2, Insightful

    "...humanity wrote some ok books in its first 3000 years (-ish) of literacy. The Koran, the Bible, Shakespeare... yeah there's some ok books out there not covered by the stupid copyright situation we are now in."

    Unfortunately, Bevington's, Taylor's, Kermode's, and even Muir's texts of Shakespeare are still under copyright. (Compare an Arden Shakespeare to a facsimile of the First Folio some time: the printers of the First Folio were considered good in their day, but not in ours.) Too bad most English translations of the Bible (the KJV and the Tyndale are two obvious exceptions) are still under copyright. Too bad most of the good translations of the Koran still are.

    Yes, there's plenty of good lit before 1923, but sometimes you need to look at a more modern edition to see what the original author most likely really wrote.

  27. Re:Tax funded... by fgb · · Score: 2, Insightful

    I think it's better that the work is done by people who really care about the project rather than some poorly paid "schlubs" who couldn't care less. The transcriptions are going to be much more accurate.