Slashdot Mirror


Why Project Gutenberg Isn't There Yet

option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg, and similar efforts, are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly)."

34 of 334 comments

  1. Re:Tex? by jonman_d · · Score: 4, Informative

    That's not how Project Gutenberg works. Most everything that's on PG is public domain - that means the copyright has expired. Thus, most of the stuff is over 70 years old. They didn't exactly use LaTeX back in the 1930s.

    Besides, what I generally use PG for are the classics - Greek/Roman literature, etc... I don't think Plato used UNIX.

    It's all got to be somehow entered from dead-tree-format copy. Currently, that pretty much means typing up the entire book.

  2. Re:Cost of labor? by jonman_d · · Score: 5, Informative

    That's pretty much it - most of the books are in the public domain. AFAIK, the rest are all donated by their authors.

    From their FAQ:

    What books will I find in Project Gutenberg?

    We cannot publish any texts still in copyright. This generally means that our texts are taken from books published pre-1923. (It's more complicated than that, as our Copyright Page explains, but 1923 is a good first rule-of-thumb for the U.S.A.)

    So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of this century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.

    These books are chosen by our volunteers. Simply, a volunteer decides that a certain book should be in the archives, obtains the book and does the work necessary to turn it into an e-text. If you're interested in volunteering, click here.

  3. If you do want to help by Anonymous Coward · · Score: 5, Informative

    Distributed Proofreaders. Recently discussed on /. as well.

    1. Re:If you do want to help by adamjaskie · · Score: 3, Informative

      They give you an image of the scanned page, along with the OCR'd text. I just looked closer, and did a few pages as well. It's pretty easy. Took me about 5-10 minutes/page. I had to remove a few end-of-line hyphenations, fix an OCR-mangled word, and replace single hyphens with double hyphens for em dashes a few times.
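
For anyone curious what those mechanical fixes look like, here is a minimal sketch of two of them (rejoining end-of-line hyphenations and doubling lone hyphens). It is only an illustration, not the tooling Distributed Proofreaders uses; the function name `tidy_ocr_page` and the two patterns are assumptions, and a real compound such as "well-known" split across lines would be joined wrongly, which is why a human still checks each page.

```python
import re

def tidy_ocr_page(text: str) -> str:
    """Apply two mechanical proofreading fixes to one page of OCR'd text."""
    # Rejoin a word split by an end-of-line hyphen: "proof-\nreading" -> "proofreading".
    # Assumption: lowercase letter + hyphen + newline + lowercase letter is a split word.
    text = re.sub(r"([a-z])-\n([a-z])", r"\1\2", text)
    # A single hyphen with whitespace on both sides stands for an em dash,
    # so write it as a double hyphen.
    text = re.sub(r"(?<=\s)-(?=\s)", "--", text)
    return text

page = "the proof-\nreading was easy - really"
print(tidy_ocr_page(page))
```

The OCR-mangled words are the part a regular expression cannot fix, which is where the 5-10 minutes per page goes.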

      --
      /usr/games/fortune
  4. Re:Tex? by AndrewRUK · · Score: 2, Informative

    That requires LaTeX source to post, which means a modern edition, which means it's copyrighted, which means you can't copy it (unless you have the publisher's permission.*)
    Project Gutenberg only does texts which are in the public domain, which currently means pre-1923 editions (PG of Australia has newer books) and, obviously, pre-1923 means that the only sources are print copies.

    * pedantic point: it's the copyright holder's permission, which isn't necessarily the publisher, but usually is.

  5. copyright information by Anonymous Coward · · Score: 5, Informative

    Keep in mind the following copyright rules:

    1. Works first published before January 1, 1923 with proper copyright notice entered the public domain no later than 75 years from the date copyright was first secured. Hence, all works whose copyrights were secured before 1923 are now in the public domain.
    (This is the rule Project Gutenberg uses most often)
    Works published from 1923-1977 retain copyright for 95 years. No such works will enter the public domain until 2019.
    2. Works first created on or after January 1, 1978 enter the public domain 70 years after the death of the author if the author is a natural person.
    (Nothing will enter the public domain under this rule until at least January 1, 2049.)
    3. Works first created on or after January 1, 1978 which are created by a corporate author enter the public domain 95 years after publication or 120 years after creation whichever occurs first.
    (Nothing will enter the public domain under this rule until at least January 1, 2074.)
    4. Works created before January 1, 1978 but not published before that date are copyrighted under rules 2 and 3 above, except that in no case will the copyright on a work not published prior to January 1, 1978 expire before December 31, 2002. If the work is published before December 31, 2002, its copyright will not expire before December 31, 2047.
    (This rule copyrights a lot of manuscripts that we would otherwise think of as public domain because of their age.)
    5. If a substantial number of copies were printed and distributed in the U.S. prior to March 1, 1989 without a copyright notice, and the work is of entirely American authorship, or was first published in the United States, the work is in the public domain in the U.S.
    6. (This rule is complicated, and is seldom applied). Works published before 1964 needed to have their copyrights renewed in their 28th year, or they'd enter into the public domain. Some books originally published outside of the US by non-Americans are exempt from this requirement, under GATT. Works from before 1964 were automatically renewed if ALL of these apply:
       - At least one author was a citizen or resident of a foreign country (outside the US) that's a party to the applicable copyright agreements. (Almost all countries are parties to these agreements.)
       - The work was still under copyright in at least one author's "home country" at the time the GATT copyright agreement went into effect for that country (January 1, 1996 for most countries).
       - The work was first published abroad, and not published in the United States until at least 30 days after its first publication abroad.

    This means that we can't simply take electronic versions of modern texts and put them in the archive, because only out-of-copyright books are in there.
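
The date arithmetic in rules 1 to 3 (and the unnumbered 1923-1977 rule) can be sketched in a few lines. This is a simplified illustration of the rules as listed above, not legal advice: the function name and parameters are hypothetical, and rules 4 to 6 (unpublished manuscripts, missing notices, renewals) are deliberately not modelled.

```python
from typing import Optional

def first_public_domain_year(published: Optional[int] = None,
                             created: Optional[int] = None,
                             author_death: Optional[int] = None,
                             corporate: bool = False) -> int:
    """First calendar year a work is in the US public domain, per rules 1-3 above."""
    if published is not None and published < 1923:
        # Rule 1: at most 75 years of protection, so this is the latest possible year.
        return published + 75 + 1
    if published is not None and published <= 1977:
        # Published 1923-1977: 95 years of protection.
        return published + 95 + 1
    if created is None:
        created = published
    if created is None or created < 1978:
        raise ValueError("rules 4-6 are not modelled in this sketch")
    if corporate:
        # Rule 3: 95 years after publication or 120 after creation, whichever is first.
        terms = [created + 120]
        if published is not None:
            terms.append(published + 95)
        return min(terms) + 1
    if author_death is None:
        raise ValueError("author_death is needed for rule 2")
    # Rule 2: life of the author plus 70 years.
    return author_death + 70 + 1

print(first_public_domain_year(published=1923))                # 2019
print(first_public_domain_year(created=1978, author_death=1978))  # 2049
```

The outputs line up with the dates quoted in the rules: nothing from 1923 onward is free before 2019, and nothing under rule 2 before 2049.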

    1. Re:copyright information by ColaMan · · Score: 2, Informative
      Unless you visit some other, non-US version of Project Gutenberg, such as the Australian one, which I peruse through every now and then.

      From the .au front page:


      Works in the 'public domain' in Australia
      Under Australian copyright law, literary, dramatic, & musical work published, performed, communicated, or recorded and offered for sale in an author's lifetime are protected for the life of the author plus fifty years from the end of the year of the author's death. After this time they enter into the public domain. EBooks on this page may be still copyright in the US and are therefore not available from the US site.


      So, at present Australians can get up to the beginning of 1953. Seems a hell of a lot easier to follow than the mess of dates the parent posted.
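
The life-plus-fifty rule really is a one-liner. A small sketch, assuming the current year is 2003 (inferred from the dates in this thread, not stated on the .au page); the function names are made up for illustration.

```python
def au_first_public_domain_year(death_year: int) -> int:
    """Protection runs to the end of the 50th year after the author's death."""
    return death_year + 50 + 1

def latest_death_year_in_public_domain(current_year: int) -> int:
    """Latest year an author can have died for their work to be free now."""
    return current_year - 51

# Assumed current year: 2003.
print(latest_death_year_in_public_domain(2003))  # 1952, i.e. anyone who died before 1953
```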
      --

      You are in a twisty maze of processor lines, all alike.
      There is a lot of hype here.
  6. You can help by geyser · · Score: 3, Informative

    The volunteer page is the place to start:
    http://promo.net/pg/volunteer.html

  7. Re:The REAL Problem by IvyMike · · Score: 1, Informative

    No, the real REAL problem is that because of Disney, copyright lengths keep getting extended and extended. At the current rate, Mickey Mouse will never be public domain. This is actually unconstitutional, since Congress is enabled to grant exclusive rights for "limited times" only. But it's the way things are.

  8. Re:other issues... by MBCook · · Score: 2, Informative

    Most of the stuff on PG is public domain, IIRC. Only if Poe, Melville (I know it's wrong, so sue me), Shakespeare, and others all climb out of their graves and form some kind of union (RIAA - Recently-undeceased Inkers of Aged Albums etc.) will people complain that they're getting ripped off by these works being put on the web.

    --
    Comment forecast: Bits of genius surrounded by a sea of mediocrity.
  9. Re:Tex? by Matrix · · Score: 5, Informative

    While this comment has been addressed, I'd like to point out that you can get pretty decent output from the Gutenberg texts by importing them into LyX. With just a little bit of work (basically setting up the chapters), LyX will allow you to create good looking PDF, Postscript, HTML, etc, along with the LaTeX source. Combine this with rbmake and you can even read them, complete with hyperlinks, on your eBook (if you have one!)

  10. The parent is "interesting"? by thac0 · · Score: 5, Informative

    The article didn't say that OCR was faster than speech, it said that speech was faster than transcribing it.

    Come on mods, read more carefully.

    --
    poliglut.org: they're still alive and fighting the man
    1. Re:The parent is "interesting"? by nautical9 · · Score: 4, Informative
      Depending on the typist, I can't see reading a book out loud as being any faster than transcribing it - especially considering that the speech recognition software is unlikely to do the proper punctuation, paragraph breaks, people & place names, and general capitalization, so proofing the results would take a considerable amount of time.

      But as the GP said - a moot point since OCR'ing it and proofreading/fixing minor typos would be far quicker than either.

  11. What's more. . . by kfg · · Score: 5, Informative

    it is part of the philosophy of Project Gutenberg to publish all of their works in the lowest level standard format, thus ensuring continued cross platform, program independent readability, ad infinitum.

    That means *plain* ASCII. Plain ASCII means you could read it in edlin if you really had to.

    This is a Good Thing.

    This also means that if you wish to format any Project Gutenberg text, in HTML or TeX for publication, you start with a blank slate and can immediately start to work your own will upon the raw text.

    This is also a Good Thing.
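
If you want to check whether a given etext really is plain 7-bit ASCII, a rough sketch like the following will do. The function name is hypothetical and this is not a Project Gutenberg tool; it just reports every byte above 0x7F.

```python
def non_ascii_positions(data: bytes):
    """Return (line_number, column, byte_value) for each byte outside 7-bit ASCII."""
    hits = []
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        for col, byte in enumerate(line, start=1):
            if byte > 0x7F:
                hits.append((line_no, col, byte))
    return hits

print(non_ascii_positions(b"Call me Ishmael.\n"))       # no hits for pure ASCII
print(non_ascii_positions("caf\u00e9\n".encode("latin-1")))  # one hit on line 1
```

An empty list means the file would read the same in edlin as anywhere else.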

    KFG

    1. Re:What's more. . . by Sgs-Cruz · · Score: 3, Informative
      It keeps it quite good for almost all European languages, thank you. Wouldn't you consider it better than nothing? Or would you prefer that Project Gutenberg supported the Unicode standard that is mired in controversy because it doesn't support all 10 to the freaking 24th ancient Chinese ideographs.

      I'd prefer that the books be transcribed now and maybe later we can add some foreign-language books once we figure out a standard that can satisfy the world. Besides, English (European languages, anyway) are the real languages of the Internet.

      --

      Karma: pi (Mostly due to circular reasoning in posts).

    2. Re:What's more. . . by dvdeug · · Score: 2, Informative

      No, it's a bad thing, because it renders Gutenberg near useless for anything other than English,

      Have you ever taken an actual look at Project Gutenberg? It uses whatever character set is necessary for the language in question; Unicode, CP1251, and ISO-8859-1 have all been used.

      Of course, so has DOS CP850, which is darn near unreadable unless you're a CS geek, which is why PG prefers ASCII.
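
To see why a CP850 file looks like garbage to most readers, here is a small sketch that decodes the same bytes under several character sets. The sample phrase and the helper name `decode_with` are made up for illustration; the codec names are the ones Python's standard library ships.

```python
def decode_with(data: bytes, encodings):
    """Decode the same bytes under each encoding; None where decoding fails."""
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

sample = "d\u00e9j\u00e0 vu".encode("cp850")  # bytes as a DOS CP850 file would store them
views = decode_with(sample, ["cp850", "latin-1", "cp1251", "ascii"])
for name, text in views.items():
    print(name, repr(text))
```

Only the reader who guesses CP850 gets the accented letters back; a strict ASCII decoder refuses the file outright, and the others silently show the wrong characters.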

    3. Re:What's more. . . by Doug+Merritt · · Score: 4, Informative
      No, it's a bad thing, because it renders Gutenberg near useless for anything other than English, and it cripples it for creating PDFs, TeX files for printing, and the like

      Strangely enough, people have actually addressed this, notably with the GutenMark program to convert Gutenberg text into nicely formatted documents in a variety of markup formats (including PDF and TeX, using postprocessing filters).

      See GutenMark home

      It never ceases to amaze me that, when people see something that only addresses 90% of their own problem, they call it useless, rather than doing a web search to see whether someone has addressed the remaining 10% of their problem.

      Gutenberg is an amazingly important project; I urge everyone to support it.

      --
      Professional Wild-Eyed Visionary
  12. Online Interface.. by slashkitty · · Score: 2, Informative

    While I like the project, I think the biggest problem is the interface to use the books. They end up in this crappy .txt format. The searching and browsing is slow and painful. If they just spent a little time on the website, they might get more support!

    --
    -- these are only opinions and they might not be mine.
  13. #bookz --- bookwarez anyone? by Slashdotess · · Score: 1, Informative

    #bookz on irc.undernet is an excellent place for ebooks, of course, with a little illegality behind it. Many of these are the same ones that have been floating around on alt.binaries.ebooks since the stone ages, but I think this unrestricted database is probably the best library created.

  14. No. Boycott Dr. Seuss. by yerricde · · Score: 2, Informative

    Hah, try transcribing "Huckleberry Finn", or any Dr. Seuss

    No. Boycott Dr. Seuss. His estate submitted an amicus brief in favor of the Bono Act. Now that Project Gutenberg uses distributed proofreading, the Bono Act is the biggest barrier to the growth of PG.

    --
    Will I retire or break 10K?
  15. Re:Cost of labor? by GammaTau · · Score: 5, Informative

    Additionally translations might generate practical limitations. If a text was written in ancient Greece and translated to English or some other language in the 20th century, the translation might not be public domain even when the original work is. Of course you are free to read the original text or make a new translation. Anyway even if a piece of literature was public domain, the translation to your native language might not be.

  16. Re:plain text -- WHY?? by ChaosDiscord · · Score: 4, Informative
    I cannot believe that Project Gutenberg continues to use plain text as their source code! I can see why it would have been compelling in 1971, and it still may be true that there are systems out there that can only read 7-bit ASCII.

    That's exactly why. Since 1971 a wide variety of encodings and markup languages have existed. 32 years later the only system still trivial to read is plain old ASCII. Project Gutenberg is most interested in preserving the texts themselves. The texts are quite well preserved in ASCII. Sure, some formatting is missing, but it's relatively minor for the majority of books in question. And given the existence of this unformatted text it's a lot easier to create formatted text than from scratch, so you even get a benefit there.

    But that's absolutely no reason why the source shouldn't be marked up. Marked up source can always be converted to ASCII, but you cannot derive semantic markup from ASCII.

    I think you're a bit confused on semantic markup. By and large publishers aren't interested in the semantics of the document, just the formatting.

  17. That's part of what DP does by smiff · · Score: 5, Informative
    Why not modify that in such a way as to have available a scanned image of a single page of the book, along with an empty box to enter text?

    That's basically what Distributed Proofreaders does. Except they OCR the book first, so the proofreaders just need to fix the OCR errors. Every page goes through two passes. Then the entire book goes into post-processing where a single person puts all the pages together, and checks for problems that the proofers didn't know how to solve (marked with an asterisk). Once Distributed Proofreaders finishes the book, they pass it on to Project Gutenberg where somebody reviews the whole text again.

    Distributed Proofreaders currently has a problem. After the previous Slashdot announcement, they were overwhelmed with volunteers. The volunteers processed books so fast, they were running out of material to work on. Three or four people scan in most of the books. They have been slaving away trying to keep up with the proofers.

    Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
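
The post-processing step is easy to picture in code. A minimal sketch, assuming each proofed page is a plain string and that a proofer's "I couldn't solve this" mark is a literal `*` somewhere on the line; the function name and data layout are hypothetical, not the site's actual software.

```python
def assemble_book(pages):
    """Join proofed pages in order and list the lines proofers flagged with '*'."""
    flagged = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page.splitlines(), start=1):
            if "*" in line:
                # Remember where the post-processor needs to look.
                flagged.append((page_no, line_no, line.strip()))
    book = "\n".join(page.rstrip("\n") for page in pages)
    return book, flagged

pages = ["Call me Ishmael.\n", "Some years ago*\nnever mind\n"]
book, flagged = assemble_book(pages)
print(flagged)
```

The post-processor then works through the flagged list by hand before the book goes on to Project Gutenberg.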

    1. Re:That's part of what DP does by kalidasa · · Score: 3, Informative

      Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.

      Three Little Letters:

      T E I

      TEI is to literature as DocBook is to documentation.

  18. Re:Gutenberg by CharlieG · · Score: 4, Informative

    Gutenberg did NOT invent the printing press - He invented moveable type - a BIG difference

    Before Gutenberg, there were printing presses, BUT you had to carve the master (the plate) for each page, and it could NOT be changed. Other folks had the IDEA of movable type, but what Gutenberg did was figure out a way to make it work (what he did was figure out how to make all the type the same length, so that when you press down, all the type comes in contact with the paper)

    Movable type gives you one huge advantage - you can make up a bunch of sets of letters, and reuse them for many pages.

    The total irony of this is that movable type is almost never used anymore - we make up a plate for each page. Of course, we are doing it with electronic movable type, but that is neither here nor there. Movable type started to go away with the Linotype machine - which made up one LINE of type at a time.

    I think I still have an ingot of linotype metal around somewhere

    --
    -- 73 de KG2V For the Children - RKBA! "You are what you do when it counts" - the Masso
  19. Re:The REAL Problem by Thomas+M+Hughes · · Score: 3, Informative
    This is actually unconstitutional, since Congress is enabled to grant exclusive rights for "limited times" only.


    As much as I wish you were right, you're actually wrong on this. The Supreme Court ruled on the case, and found that what the Congress did was constitutional, and since the constitution grants the Supreme Court the right to interpret the Constitution, it is constitutional to do so. Unless the Supreme Court changes its ruling at a future date, or the Congress amends the Constitution to make it unconstitutional, this remains constitutional, as unfortunate as it is.
  20. All Hail The Text! by Jason+Scott · · Score: 2, Informative

    Well, until it's free, there's always textfiles.com.

    Actually, a while ago I copied a lot of the Project Gutenberg library, along with some others, and created etext.textfiles.com.

    In my experience, the reason a lot of people don't donate free time to transcription or other similar drudge work is because a lot of sites that encourage it steal it. Witness CDDB, and just wait to see how long before you pay for IMDB.

  21. Re:Tex? by nels_tomlinson · · Score: 2, Informative
    I've been marking up the Project Gutenberg etexts using LaTeX for several years now. I can typeset an Oz book, or one of the Tom Swift books, in about 15 minutes. I have put about a week into typesetting ``The Voyage of the Beagle'', and no end in sight. I was able to typeset a translation of the Bible in about one week, but it was sloppy work, and I wasn't satisfied.

    LyX is nice, but I don't think that it really speeds things up. I can't imagine that LyX could speed things up at all on a Tom Swift.
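
For the easy cases, the first pass is mostly mechanical. Here is a rough sketch of that pass, not my actual workflow or anyone's published tool: it assumes chapter headings are lines beginning with "CHAPTER" (a hypothetical convention; real etexts vary), escapes the characters LaTeX treats specially, and leaves blank lines alone so they become paragraph breaks.

```python
import re

# Characters LaTeX treats specially, mapped to safe replacements.
_SPECIALS = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}", "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}

# Assumed heading convention: a line such as "CHAPTER I" or "CHAPTER 12. The Cave".
CHAPTER = re.compile(r"^CHAPTER\s+\S+.*$")

def escape_latex(text: str) -> str:
    """Escape one character at a time so nothing gets escaped twice."""
    return "".join(_SPECIALS.get(ch, ch) for ch in text)

def etext_to_latex(text: str) -> str:
    """Wrap a plain-text etext in a minimal LaTeX book skeleton."""
    out = [r"\documentclass{book}", r"\begin{document}"]
    for line in text.splitlines():
        stripped = line.strip()
        if CHAPTER.match(stripped):
            out.append(r"\chapter*{" + escape_latex(stripped) + "}")
        else:
            out.append(escape_latex(line))
    out.append(r"\end{document}")
    return "\n".join(out)

print(etext_to_latex("CHAPTER I\n\nTom & Ned saved 50%.\n"))
```

Everything that makes the Beagle take a week (tables, footnotes, illustrations) is exactly what a pass like this does not handle.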

  22. Re:In Search of the Perfect Library by Allen+Varney · · Score: 2, Informative

    The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Antony and Cleopatra.

    Uh... what? For centuries people have blamed the burning of Alexandria's Great Library on the Romans, the Christians, or the Muslims, depending on which ones they disliked. But Mark Antony? Cleopatra? That's a new one. Maybe you're thinking of Julius Caesar, who gets the blame according to this fellow (a self-proclaimed Christian apologist).

  23. Distributed Proofreaders by Amata · · Score: 5, Informative

    I just found this site a few days ago. Essentially, volunteers can proofread one page at a time, so that huge time commitments of doing an entire book yourself are not required. Worth checking out.

    http://texts01.archive.org/dp/

  24. You can search DJVU files... by raytracer · · Score: 2, Informative
    Scanned documents might be fine for readers, but what if you're looking for "oh, you know, that one line in the book, where the dude was talking about melons."

    It might help to actually understand what you are talking about before you are so quick to dismiss it. DJVU does support searchable text, which can be inserted automatically via OCR. The advantage of this is that the OCR need not be 100% accurate to still be useful (vastly more useful and accurate than the indices in most books, for instance).
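
To illustrate why imperfect OCR text is still searchable, here is a small sketch of tolerant word matching over an OCR text layer. This is not how DjVu itself implements search; it is a stand-in using Python's standard `difflib`, and the function name, the sample text, and the 0.8 cutoff are all assumptions.

```python
import difflib

def fuzzy_find(query: str, ocr_text: str, cutoff: float = 0.8):
    """Return (line_number, word) pairs whose word is close to the query."""
    hits = []
    q = query.lower()
    for line_no, line in enumerate(ocr_text.splitlines(), start=1):
        for word in line.split():
            # Compare without surrounding punctuation, ignoring case.
            w = word.strip(".,;:!?\"'()").lower()
            if w and difflib.SequenceMatcher(None, q, w).ratio() >= cutoff:
                hits.append((line_no, word))
    return hits

ocr_text = "He spoke at length about me1ons.\nNothing else here."
print(fuzzy_find("melons", ocr_text))
```

Even with the OCR having read an "l" as a "1", the dude talking about melons is still found on the right line.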

  25. Re:Gutenberg by MrOrn · · Score: 3, Informative
    Actually, he didn't even invent moveable type. The Chinese did that with wooden blocks much earlier and there were existing printing presses that used moveable blocks.

    Also, there were prior claimants to the "invention" in Europe, such as Laurens Coster in Haarlem, Netherlands, and others in Bruges, Flanders (Belgium), Avignon (Waldvogel, who is recorded as having "steel alphabets" in 1444) and Bologna.

    BTW Gutenberg's "invention" was not the length of the type. It was to have cast the movable type in metal using a matrix. As he was a goldsmith and his father was the Master of the Episcopal Mint in Mainz, this was a great instance of lateral thinking, adapting technology he knew well and applying it to a new field. He would have seen coins being minted and twigged that you could print books like that.

    He also designed the press (adapted from existing wine presses) and came up with an ink that was suitable for the process of printing with this type of press (the ink had to be viscous, rather than the ink used for manuscripts).

    His combination of the three things meant that he could successfully exploit printing commercially. So Gutenberg was probably the first to exploit it commercially, although he wasn't very successful (5 years (1450-1455) isn't a long time to have a revolutionary business). This fact has ensured that he is credited with the invention of modern printing.

  26. Re:Stupid article. Project Gutenberg doing great. by Overt+Coward · · Score: 2, Informative

    I'll point out that at the end of 2000, there were only roughly 2000 etexts in the entire PG library (I copied them all to a single CD)... So if they're up well over 6,000, then they've made amazing progress in two years!

  27. Re:Project Gutenberg is good anyway... by cmpalmer · · Score: 2, Informative

    I recently bought a Franklin eBookman ($39.95 at CostCo!) and then, more recently, got an iPaq through work. The last five books I've read have been on one or the other PDAs (I had the Baen CD-ROM of Honor Harrington books and others). It still isn't quite as good as a paper book, but it is the best way I've found yet to read in bed.

    I've been using the Mobipocket reader on both devices and the autoscroll feature is really cool -- you can prop up the device, turn on the backlight, and adjust the autoscroll to your reading speed. Hands-free, no reading lamp, no cramps from trying to prop up and turn pages.

    One thing that strikes me is how much typography and formatting matter, which is, as others have pointed out, the problem with Gutenberg texts. I have read quite a few PG texts in the past (or at least used them for reference when I was looking for particular quotes or needed a big text file to test something :-), and the formatting leaves a lot to be desired. On the PDAs, weird page and line breaks or even bad justification or extreme ragged edges, are very disconcerting when reading.

    --
    -- stream of did I lock the front door consciousness