Slashdot Mirror


Just One Page a Day

Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple: we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image, making corrections as necessary; each page gets looked at twice. Finally, the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home, which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem), I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"
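
The workflow described in the submission can be sketched roughly as follows. This is only an illustration in Python, not DP's actual PHP code: the `Page` class, `ROUNDS_REQUIRED`, and `assemble` are hypothetical names, and joining pages with newlines merely stands in for the real post-processing into a PG e-text.

```python
from dataclasses import dataclass

# The submission says each page gets looked at twice.
ROUNDS_REQUIRED = 2


@dataclass
class Page:
    """One scanned page: the image it came from and its current text."""
    image_name: str
    text: str              # starts out as the raw OCR output
    rounds_done: int = 0

    def proofread(self, corrected_text):
        """Record one proofreading pass over this page."""
        if self.rounds_done >= ROUNDS_REQUIRED:
            raise ValueError("page has already had all its rounds")
        self.text = corrected_text
        self.rounds_done += 1

    @property
    def finished(self):
        return self.rounds_done >= ROUNDS_REQUIRED


def assemble(pages):
    """Join finished pages into one text (placeholder for post-processing)."""
    unfinished = [p.image_name for p in pages if not p.finished]
    if unfinished:
        raise ValueError(f"pages still need proofreading: {unfinished}")
    return "\n".join(p.text for p in pages)
```

A page only reaches `assemble` after two calls to `proofread`, which mirrors the "looked at twice" rule; everything else about the real site is left out.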

179 of 373 comments

  1. Stop reading this by XiC · · Score: 5, Insightful

    And start reading a page!
    After that come back and you may continue();

    1. Re:Stop reading this by H0ek · · Score: 3, Insightful

      In fact, I feel it would be a Good Thing(tm) for our friendly Slashdot host to stick the link to this project into their Quick Link section on the main page.

      Of course, I've already bookmarked the page, but that's on one machine. What happens six months down the line when I need to rebuild my bookmarks? Search for the article on Slashdot? Ick.

      --
      H0ek
      Think you're smart? Prove you've got brains!
  2. And you ask the /. community.. by Harald74 · · Score: 5, Funny

    ... which is renowned for it's spelling prowess? ;)

    --
    A)bort, R)etry or S)elf-destruct?
    1. Re:And you ask the /. community.. by Textbook+Error · · Score: 5, Funny

      for it's spelling

      Or grammer... :-)

      ("it's" == "it is", "its" == possessive form)

      --

      Nae bother
    2. Re:And you ask the /. community.. by tswinzig · · Score: 5, Funny

      ... which is renowned for it's spelling prowess? ;)

      Are you kidding? With the number of people bitching about grammar and spelling in the comments, you just know there's a pool of talent here!

      (BTW, there's no apostrophe in the possessive form of "its.")

      --

      "And like that ... he's gone."
    3. Re:And you ask the /. community.. by Skirwan · · Score: 4, Funny
      And you ask the /. community..
      ... which is renowned for it's spelling prowess? ;)
      Is anyone else somewhat dismayed by the fact that the post pointing out our collective poor grammatical skills has a spurious apostrophe?

      :)

      --
      It's past the blind leading the blind; this is the blind and deaf leading the stupid.
    4. Re:And you ask the /. community.. by jaymz666 · · Score: 2, Funny

      then let's not forget that grammar has no e

    5. Re:And you ask the /. community.. by orthogonal · · Score: 4, Funny

      ... which is renowned for it's [sic] spelling prowess? ;)

      Not to mention it's [sic] excellence at spotting grammatical errors.

    6. Re:And you ask the /. community.. by donutz · · Score: 2

      Not to mention the incomplete ellipsis on the subject line. Of course, maybe that's just a little too picky...

    7. Re:And you ask the /. community.. by tswinzig · · Score: 5, Funny

      for it's spelling

      Or grammer...


      Or spelling?

      --

      "And like that ... he's gone."
    8. Re:And you ask the /. community.. by Erasei · · Score: 3, Funny

      What's even scarier is that there are this many comments telling a person that he is wrong when he so isn't. I mean, come on guys, even the Flowers know the real way to use the apostrophe: http://angryflower.com/bobsqu.gif

      --
      visit my free wallpaper collection, wp.erasei.com
    9. Re:And you ask the /. community.. by Anonymous Coward · · Score: 5, Funny

      Or sense of humour?

    10. Re:And you ask the /. community.. by leuk_he · · Score: 2

      And then you wonder what goat.cx is doing in the Iliad?

      More seriously, how do they fight off the trolls?

    11. Re:And you ask the /. community.. by CaseyB · · Score: 5, Informative

      I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.

    12. Re:And you ask the /. community.. by Erasei · · Score: 2
      Come on people, have a sense of humor here. This case doesn't need an apostrophe because it's a possessive pronoun. If it were a noun, then yes, it would need an apostrophe. So "it's" in this case is incorrect because that rule doesn't apply to pronouns. If "it" in this case were a proper name (like Stephen King's book title, It), then it _would_ need an apostrophe to show possession.

      Reference: http://owl.english.purdue.edu/handouts/grammar/g_apost.html


      Make any sense? :)

      --
      visit my free wallpaper collection, wp.erasei.com
    13. Re:And you ask the /. community.. by Scaba · · Score: 2

      The best cure for bad writing is Strunk & White's The Elements of Style.

    14. Re:And you ask the /. community.. by JoeBuck · · Score: 4, Insightful

      Since Project Gutenberg can only publish books whose copyright has expired, it's quite likely that a spelling "error" may instead reflect language evolution, that is, a change in the way words are spelled over time.

    15. Re:And you ask the /. community.. by dvdeug · · Score: 2

      I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work

      Usually, Project Gutenberg volunteers correct spelling errors where they are obviously errors. The original work, as in what the author intended, is usually more interesting than that physical edition, which to reproduce we'd really need to keep page numbers and other junk.

      For one example, my current project is a cookbook published in the 1730's, and so far I've corrected Apricocr to Apricock and Lemon to Lemmon; in both cases the form I corrected it to was overwhelmingly used in the text.

    16. Re:And you ask the /. community.. by gTsiros · · Score: 2

      you mean spalling, of course!

      --
      Looking for people to chat about multicopters, coding, music. skype: gtsiros
    17. Re:And you ask the /. community.. by pmz · · Score: 2

      This project is about correcting OCR errors, not spelling / grammar.

      Yes. I remember reading that Tolkien had some trouble with editors who thought they could spell better than he did. IIRC, it was a real mess at first.

      So, to reiterate, don't second-guess the authors' intents.

    18. Re:And you ask the /. community.. by Greedo · · Score: 3, Insightful

      For one example, my current project is a cookbook published in the 1730's, and so far I've corrected Apricocr to Apricock and Lemon to Lemmon; in both cases the form I corrected it to was overwhelmingly used in the text.

      "Apricocr" I can see being a legitimate typo, but perhaps in converting "Lemon" to "Lemmon", you are eradicating one of the earliest uses (intentional or not) of the now-current spelling.

      My personal opinion -- and yes, everyone on /. did ask for it -- would be to leave the spelling and typos intact, if the goal is to preserve literary creations. You are potentially losing information by changing it.

      Ask anyone who has studied the First Folio of Shakespeare about the importance of spelling.

      (And just in case you don't have a Shakespeare scholar handy: since Shakespeare's plays were almost always written down after they were first performed (and written down by someone else), there are many clues to the original performance in how certain words are spelled, capitalized and how sentences are punctuated. Hamlet's "What a piece of worke is a man" is a good example of this.)

      --
      Tuus crepidae innexilis sunt.
    19. Re:And you ask the /. community.. by dvdeug · · Score: 2

      Ask anyone who has studied the First Folio of Shakespeare about the importance of spelling.

      Okay, now ask the high-school student who read Shakespeare in high-school if it would have been more fun to have got weird spelling in addition to weird vocabulary and grammar.

      I understand the importance of stuff like that to the linguist (the last thing I skimmed while scanning in was The Roman Pronunciation of Latin, where they dissect how Latin was spoken from the writings), but my primary audience is not the linguist. I'm sure there's information to be gained from the italics and hyphenation I'm not transcribing either. Fortunately for the linguist, it was released in a facsimile edition in '83 that shouldn't be too hard to get a hold of; alternately, Project Gutenberg has taken to storing the images, and these will get filed away with them for those interested.

    20. Re:And you ask the /. community.. by Wanker · · Score: 2
      This project is about correcting OCR errors, not spelling / grammar.

      I heartily agree with this. Any speling erros I find will be left in place. ;-)

      After running through a few pages, it seems that most of the problems are quotes and spacing, which are understandably difficult for OCR to sort out. In all honesty, the OCR they're using seems to be pretty good. It's ignoring the noise nicely and converting to quite readable text.

      The issues seem to be things like:

      "Bob, come here,"she said softly,"I want you over here."" Can't, honey,"he said,"I'm glued to the handrail."

      Clearly that needs some spaces added to clear it up. Although there seems to be some disagreement about whether to space after a comma or not, I've elected to add the space in my proofs:

      "Bob, come here," she said softly, "I want you over here." "Can't, honey," he said, "I'm glued to the handrail."

      Now I was taught that a new speaker should start a new paragraph, which would avoid lots of these issues, but the author didn't do that in the book I was proofing.
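
The quote-spacing problems described above are regular enough that a script could at least flag them for a proofreader. The following is a minimal sketch in Python, assuming plain straight double quotes as in the example; `flag_quote_spacing` and the two patterns are hypothetical helpers, not part of the DP site. It only flags suspicious spots, because deciding on which side of the quote the space belongs still needs a human looking at the page image.

```python
import re

# Punctuation, a quote, and a letter with no space anywhere, e.g. ,"she
# The missing space could belong before or after the quote.
MISSING_SPACE = re.compile(r'[,.!?]"[A-Za-z]')

# Two quotes run together, e.g. here."" Can't
DOUBLED_QUOTE = re.compile(r'""')


def flag_quote_spacing(text):
    """Return sorted (offset, label, matched_text) tuples for suspect spots."""
    flags = []
    for pattern, label in ((MISSING_SPACE, "missing space"),
                           (DOUBLED_QUOTE, "doubled quote")):
        for match in pattern.finditer(text):
            flags.append((match.start(), label, match.group()))
    return sorted(flags)


OCR_TEXT = ('"Bob, come here,"she said softly,"I want you over here."" '
            'Can\'t, honey,"he said,"I\'m glued to the handrail."')

if __name__ == "__main__":
    for offset, label, snippet in flag_quote_spacing(OCR_TEXT):
        print(offset, label, repr(snippet))
```

On the uncorrected example it reports four missing-space spots and one doubled quote; on the hand-corrected version it reports nothing.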
  3. Excellent by drhairston · · Score: 2, Flamebait

    After some consideration, I propose that this system should be applied to Slashdot stories! Each Slashdot story, after being submitted by an editor, should be reviewed by at least two readers before being posted in order to correct inadvertent spelling mistakes and story duplicity. Thank you sir, for inspiration!

    --
    Dr. Joseph Hairston
    Superintendent, CCBC
    1. Re:Excellent by Draoi · · Score: 3, Funny
      in order to correct inadvertent spelling mistakes and story duplicity

      Not to mention malapropisms!! :-)

      http://www.dictionary.com/search?q=duplicity&db=*

      I like the first definition better!

      --
      Alison

      "It is a miracle that curiosity survives formal education." - Albert Einstein

  4. Just one page a day? by Adam+Rightmann · · Score: 5, Funny

    Sounds like Gary Condit's plan for extramarital affairs.

    --
    A. Rightmann
    1. Re:Just one page a day? by indiigo · · Score: 4, Funny

      And Bill Clinton did contain himself, except it was one page every day!

      --
      fslg503-985-8686503-985-8686503-985-8686503-985-86 8650 3-985-fdsg8686503-985-8686503-985-8686503-9
  5. OCR Software by Zach+Garner · · Score: 4, Interesting

    Is there any worth-while open source OCR software? How about reasonably priced closed source OCR software for *BSD or Linux?

    1. Re:OCR Software by Anonymous Coward · · Score: 4, Informative
      Open-source OCR is generally not used at DP, which mostly uses ABBYY FineReader (www.abbyy.com), a commercial product.


      gocr (http://jocr.sourceforge.net/) is open-source, and includes interesting bits like deskewing.


      As a proofreader, I really appreciate the best ocr, and the free guys are not the best.

    2. Re:OCR Software by Anonymous Coward · · Score: 2, Insightful

      >Just get just about any scanner - it'll almost certainly come with free OCR software.

      Generally not nearly as good as the top two (Scansoft (http://www.scansoft.com/sdk/: seems to have engulfed the Xerox/Textbridge and Caere/Omnipage technologies), ABBYY).

      When you scan for public use, think about the time of *other people* you waste if your OCR is not optimal or your scans are off-register/ skewed etc.

  6. Obvious... by OrangeSpyderMan · · Score: 5, Funny

    I'm shure that buy askin teh Salshdot crowd (esp. the editturs) to help, yule improove jamatically teh kwality off you're output.

    :-)

    --
    Try NetBSD... safe,straightforward,useful.
    1. Re:Obvious... by Otter · · Score: 2

      See, unlike the other people making the same point, OrangeSpyderman had the good sense to intentionally misspell most of his words so any unintentional misspellings or grammatical errors will be lost in the noise and go unflamed.

      It's like that stegosaurus encryption.

      With all the nitpicking, isn't anyone going to bitch at Michael for leaving the "Thank you, Charles Franks" in the submission for no apparent reason?

    2. Re:Obvious... by OrangeSpyderMan · · Score: 2

      Intentionally, yeah that's right. :-)

      --
      Try NetBSD... safe,straightforward,useful.
    3. Re:Obvious... by Myco · · Score: 2

      Hang on... which words were misspelled?

  7. Re:Legal Implications by phil+reed · · Score: 2, Informative

    I can't decide if this is a joke or not.

    You do know about Project Gutenberg, right?

    --

    ...phil
    "For a list of the ways which technology has failed to improve our quality of life, press 3."
  8. Re:Legal Implications by Junta · · Score: 4, Informative

    The only works that go into PG are works in the public domain. While publishers sell dead-tree copies still, they have no copyright over the original text contained within. (Which is why these works are typically available through multiple publishers.)

    --
    XML is like violence. If it doesn't solve the problem, use more.
  9. Copyright is not an issue by ardmhacha · · Score: 5, Informative

    Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay but you won't find the latest Stephen King.

    1. Re:Copyright is not an issue by Twylite · · Score: 3, Informative

      Sadly, copyright is an issue in this sort of work. Just because Dickens' works are no longer under copyright doesn't mean you can go and pull a Dickens novel off the library/bookstore shelf and OCR it. Publishers tend to be careful to make slight alterations to the text here and there (formatting, spelling, some clarifications and corrections), which turns a copyright-expired work into a derived work over which they own the copyright. Shitty, isn't it?

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    2. Re:Copyright is not an issue by Anonymous Coward · · Score: 2, Insightful

      Well, actually only the alterations would be copyrighted, not the entire work. Only the original author can create a derivative work that is fully covered by copyright. Usually the publishers add a new foreword of absolutely no worth. If you take out that foreword and copy only the original text, it would be hard for them to prove otherwise. The only sticking point is translations of foreign work. You won't find a lot of Kafka in there (I found only Metamorphosis) because a lot of his stuff was translated only after WW II. The translations are basically new works and are copyrighted as of the date of translation.

    3. Re:Copyright is not an issue by Twylite · · Score: 2

      I'm afraid I can't find the original source where I was reading about this, but the problem extends to the text, mainly because it is not the original text. Shakespeare is a good example, because most modern publications are not true to the original works: oldde englishe wordes have been changed into modern equivalents, and phrases here and there have been updated to ones we can understand today.

      You are correct in saying that the publisher's copyright (in such cases) is over the modifications only; but it can be very difficult to determine which parts have or have not been modified. Typically, you need an old copy of the original work, which means you can't pick up a modern publication from your library or bookstore.

      --
      i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
    4. Re:Copyright is not an issue by Software · · Score: 2
      **sigh** I wish there was a mod category for "wrong" or "what the hell are you talking about" or "FUD", because I'd mod this instead of posting. I have never seen a publishing company make trivial changes to a work and claim copyright on it. I don't mean "The Wind Done Gone" or a major work like that. When I see classic books, I check the copyright page. They always say "Foreward (c) 2002 John Doe", but I have never seen that copyright was claimed over a whole work. I think a company that intentionally misrepresented an altered work as that of a famous author would be liable to fraud charges.

      Please provide specific examples of this, so that I can be proved wrong. Please give the ISBN and perhaps a link to an online bookseller.

    5. Re:Copyright is not an issue by p3d0 · · Score: 2

      Nope. See #6 on this list.

      --
      Patrick Doyle
      I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
    6. Re:Copyright is not an issue by j-beda · · Score: 2

      That just speaks to making derivative works of things that are copyrighted (such as fan fiction). It is certainly not clear to me how this affects derivative works of public domain material.

    7. Re:Copyright is not an issue by David+Jao · · Score: 2
      **sigh** I wish there was a mod category for "wrong" or "what the hell are you talking about"

      I wish there was too, but in this case you're the one who is wrong.

      You must not have a very large sampling of classic books. Almost all classic books in my collection have copyright asserted by the publisher.

      Please provide specific examples of this, so that I can be proved wrong. Please give the ISBN and perhaps a link to an online bookseller.

      Here's one: The Riverside Shakespeare, ISBN 0-395-04402-2, which says the following on the inside title page.

      Copyright © 1974 by Houghton Mifflin Company.

      All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage or retrieval system, without permission in writing from the publisher.

    8. Re:Copyright is not an issue by dvdeug · · Score: 2

      They always say "Foreward (c) 2002 John Doe", but I have never seen that copyright was claimed over a whole work

      That's weird. I almost never see that; usually I'll see a plain 'Copyright (c) 2002 John Doe' that obviously should only cover an introduction or something, but never mentions that. At an extreme, I've seen books that were photocopies of the originals, and nothing else, that claimed copyright.

      I think a company that intentionally misrepresented an altered work as that of a famous author would be liable to fraud charges.

      It is true that when you read Shakespeare, what you read doesn't look like what was originally printed. Modern editions of Shakespeare have updated the spelling and made it consistent, and typeset it in modern forms (without the long s, for example).

    9. Re:Copyright is not an issue by Junta · · Score: 2

      That does not necessarily mean that it is legally enforceable. I could, for example, say that I require one dollar payment from anyone who reads this comment. Just because I said so does not make it true. Even for a modern work, that short statement is not enforceable, as it implies no fair use allowances; so while by their statement an academic copy is forbidden, the law disagrees.

      There are a lot of cases where companies know very well what they can and cannot enforce, but will still at least do their best to make the customer *think* they have no rights. A prime example is the warnings on tapes/DVDs that say no copy may be made under any circumstances. If you were dragged into court for making a copy for backup purposes and can prove you have the original and did not distribute copies, you would be let off, even though the warning would have you believe the FBI will bust in with guns drawn should you ever think so. You know those trucks with the bumper sticker "not liable for windshield damage"? They are indeed liable; the sticker has as much meaning as writing 'not liable for property damage, personal injury, or death' on a gun and using it to kill people.

      The practice of taking companies' legal statements, disclaimers, EULAs, and warnings as absolutely truthful has caused a great deal of misinformation among the public. Consider, for example, the large percentage of the population that does not think they have a legal right to make personal copies of movies and music they own. If their word was true, how are other publishing companies publishing those works without deals with Houghton Mifflin?

      --
      XML is like violence. If it doesn't solve the problem, use more.
    10. Re:Copyright is not an issue by CoughDropAddict · · Score: 2

      Hot damn, if derived works aren't (c) their creator, why don't you set up a website offering Disney movies for download? See if Disney and the courts agree with you.

      Your link only addressed the fact that creating derived works from works still in copyright requires permission from the author of the existing work. It doesn't claim that the author of the derived work gets no copyright for the derived work.

  10. Re:Legal Implications by Chundra · · Score: 2

    Not when the authors have been dead for 300 years.

  11. Re:Legal Implications by seizer · · Score: 4, Informative

    It helps if you read the FAQ list.

    Due to copyright laws, it is only legal to do this with older books (copyrighted 75 or more years ago). As a result, Project Gutenberg is mostly comprised of the "Classics."

  12. Wow, what a scary thought by TheConfusedOne · · Score: 5, Funny

    Imagine the kids 200 years from now reading |-|uc||_3b3rry F1|\||\|.

    (That hurts my brain just trying to type it in...)

    --
    --- I wish I could hear the soundtrack to my life. That way I'd know when to duck.
    1. Re:Wow, what a scary thought by foistboinder · · Score: 2, Funny
      |-|uc||_3b3rry F1|\||\|.

      I must get out more - I was actually able to figure that out!

    2. Re:Wow, what a scary thought by Anonvmous+Coward · · Score: 2

      "|-|uc||_3b3rry F1|\||\|.

      I must get out more - I was actually able to figure that out!"


      Ouch. You just violated the DMCA!

    3. Re:Wow, what a scary thought by Jace+of+Fuse! · · Score: 2

      Anything more complicated than 733T H@x0rz and I get lost...

      Obviously, since the correct spelling is 1337 h4x0rz ...

      Oh wait...

      --

      "Everything you know is wrong. (And stupid.)"

      Moderation Totals: Wrong=2, Stupid=3, Total=5.
  13. A better use of time by Apreche · · Score: 5, Insightful

    I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN be better. If we just write one line of code a day each we'll have better OCR in no time.

    --
    The GeekNights podcast is going strong. Listen!
    1. Re:A better use of time by scottcain · · Score: 2, Informative

      Perhaps, but the page I just proofed was from a book published in the 1850's, so it was not the best image quality, and still the OCR did a great job. The most common mistake I corrected was converting I's to !'s. It got right things that I had to look at pretty closely to make sure it was right.

    2. Re:A better use of time by SteakJerky.com · · Score: 2, Insightful

      Even with fantastic OCR, there will be some small errors out there, so a human double check is a great idea. If Project Gutenberg isn't a great reason to buy a PDA, I don't know what is. It's a huge library of great books ready to be read in the lunch line, on the bus, in the john...

    3. Re:A better use of time by rixster · · Score: 2

      I'll do the first bit and last bit for you...

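      use strict;
      use warnings;

      # Takes raw scan data; returns the (so far hypothetical) perfect text.
      # No "()" prototype: it would forbid passing the argument shifted below.
      sub getPerfectOCR
      {
          my $raw_data = shift;

          my $completed_text;

          # 1. Process
          # 2. ???
          # 3. Profit

          return $completed_text;
      }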

      --
      Two wrongs may not make a right, but three ....
  14. Re:Book Pirating? by phil+reed · · Score: 2, Informative
    So are the books they are digitizing all in the public domain?


    Yup.
    It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net.

    How do you suppose they make it to the net? Most of the public domain books were written before word processors, so there's no electronic text around.

    Of course I could be wrong.

    Yeah. Go look at Project Gutenberg's site - think of it as your homework assignment for the weekend.

    --

    ...phil
    "For a list of the ways which technology has failed to improve our quality of life, press 3."
  15. Re:How do I get to plug my online website? by Anonymous Coward · · Score: 2, Insightful

    a wonderful resource for poor areas.

    And where do the poor get online? In libraries.
    D'oh!

  16. Re:copyrights? by A+Commentor · · Score: 2

    The 'Project Gutenberg' is about making old books that have (finally) fallen into public domain available to whoever wants it. Those are the books I'm sure that they want to have proofed.

    --

    Looking for any old 8-bit Heathkit/Zenith software/hardware - http://heathkit.garlanger.com

  17. server test under load by lovebyte · · Score: 2, Funny

    Instead of proofreading the books, I think this guy is asking for his new server setup to be tested!

    --

    I'll do it for cheesy poofs.

  18. Re:copyrights? by Jeremy+Erwin · · Score: 4, Informative

    Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been, under copyright.

  19. Re:Book Pirating? by raju1kabir · · Score: 4, Informative
    So are the books they are digitizing all in the public domain? It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net. Of course I could be wrong.

    And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire. After that, a work passes automatically into the public domain. That means there are hundreds of thousands of books available.

    In fact, if you've previously seen the classics online, they probably came from this project, which has been around for almost as long as I can remember.

    --
    "Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
  20. Distributed OCR? by edwilli · · Score: 4, Interesting

    Have each client do the OCR (if you can find GPL software). Or maybe there's a company willing to donate it. That way you could farm out most of the processing too.

  21. Re:Legal Implications by stinky+wizzleteats · · Score: 5, Interesting

    While publishers sell dead-tree copies still, they have no copyright over the original text contained within.

    What? You mean to suggest that you have an actual example of a publisher making money without tyranny over the content?

    Gasp!

  22. Re:Which books are getting converted? by teeker · · Score: 5, Informative

    The books that are being converted are whatever people feel like contributing.

    Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!

    Doing the hard work yourself is the best way to guarantee your interests are represented.

    --
    teeker
  23. Graphics by mallfouf · · Score: 4, Interesting

    Very good idea.
    Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?
    What about books published in other countries? Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries, or vice versa?

    1. Re:Graphics by dvdeug · · Score: 4, Informative

      Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?

      DP has had books in Dutch, French, Spanish and German. No Arabic - no one has mentioned being able to do it, for one thing.

      Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries?

      Project Gutenberg only worries about the US copyright. If it's not copyrighted in the US, they'll do it.

    2. Re:Graphics by imadork · · Score: 2

      PG Australia falls under Aussie copyright law. They have shorter copyright terms than the good ol' USA does.

    3. Re:Graphics by dvdeug · · Score: 2

      They have shorter copyright terms than the good ol' USA does.

      That's not exactly true. Australia is life+50 years, whereas the US is post-1923. Neither one is a subset of the other.

    4. Re:Graphics by dvdeug · · Score: 2

      US copyright is either life+50 with a 20-year extension that's coming under the SC right now, or 70 years +20 for copyrights held by an (immortal) corporation.

      Historically, the US has been on an X years rule for copyright, and the US copyright law has a lot of cruft relating to that. Anything published before 1923 has fallen into the public domain. Anything published between then and 1978, if it hasn't fallen into the public domain, has a flat 95 years, or 75 if the SC tosses out the extension. Life plus x years only kicks in if it was printed after 1978.

      It's not based on who holds the copyright, it's based on the creator's life span. So even if a corporation holds the copyright, it still expires. If it was done as a work for hire, it gets a straight 100 years (IIRC). So there's no big loophole for immortal corporations in there.

    5. Re:Graphics by dvdeug · · Score: 2

      No Arabic - no one has mentioned being able to do it, for one thing.

      And another, while I think of it--

      DP is set up to take OCRed texts. ABBYY, while an amazing multilingual OCR program (176 languages, using the scripts of Latin, Cyrillic, Greek, Armenian, and Georgian), doesn't handle Arabic. You'd have to get an Arabic OCR program to handle them, and considering non-English texts tend to take a long time to go through, it's not something they'll jump at buying.

  24. use proofreading meta-data to improve OCR! by tomlouie · · Score: 5, Interesting

    What if they kept track of every time the human reader finds an OCR error? Couldn't you then build a profile of what words/phrases/letters the OCR software has the most problems with?

    Then, couldn't you selectively have the humans review only the sections of a book most likely to contain errors, instead of every single word of every single page?

    What do you think?
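
One way such an error profile might be built is to diff each page's raw OCR text against its proofread text and count what got replaced by what. The sketch below uses Python's standard `difflib`; `tally_corrections` is a hypothetical helper, and nothing here reflects how DP or any OCR package actually stores its data.

```python
from collections import Counter
from difflib import SequenceMatcher


def tally_corrections(ocr_text, proofed_text, counts=None):
    """Count (ocr_fragment, corrected_fragment) pairs where the texts differ.

    Pass the same Counter in again to accumulate a profile over many pages.
    """
    counts = Counter() if counts is None else counts
    matcher = SequenceMatcher(None, ocr_text, proofed_text, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Insertions and deletions show up with one side empty.
            counts[(ocr_text[i1:i2], proofed_text[j1:j2])] += 1
    return counts


if __name__ == "__main__":
    profile = tally_corrections("! am here. ! see you.",
                                "I am here. I see you.")
    for (wrong, right), n in profile.most_common():
        print(repr(wrong), "->", repr(right), n)
```

Accumulated over many pages, the most common pairs would show which confusions (such as `!` for `I`) the OCR makes most often for a given book and font.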

    1. Re:use proofreading meta-data to improve OCR! by Big_Breaker · · Score: 4, Insightful

      Different book - different font - different problems.

      It might help a bit, but most OCR programs already tag letters that they are unsure about. The article doesn't mention whether the distributed system uses OCR ambiguity to prioritise proofreading.

      As an aside, why not just store the raw image for any ambiguous text within the documents in the PG archive (think of an HTML sort of thing)? As people read the document, just poll them as to what they think the letters in the bitmap are.

      I guess a lot of the strategy rests on how frequently the OCR software makes an error or finds ambiguity.

    2. Re:use proofreading meta-data to improve OCR! by dmoynihan · · Score: 3, Informative

      Actually, they're working on that.

      The program is Gutcheck, developed by PG's Jim Tinsley.

      Catches a lot!

  25. Re:Legal Implications by astrosmurf · · Score: 2, Insightful
    The only works that go into PG are works in the public domain. While publishers sell dead-tree copies still, they have no copyright over the original text contained within
    But the publishers still have copyright on their specific printing. Distributing scanned copies of pages probably still violates their copyright, even if distributing the OCR output does not.
  26. Mod Parent 'Twat' by henben · · Score: 3, Funny

    Nuff said.

  27. Re:copyrights? by Anonymous Coward · · Score: 2, Insightful

    Copyrights aren't perpetual In Theory. But aren't Disney and Microsoft (MS wrt printed works esp) working hard to ensure they're perpetual In Practice?

  28. Re:public domain books? by teeker · · Score: 2, Informative

    True, but Project Gutenberg is a repository for digital copies of literature that are public domain. To remain a legitimate entity, they can't publish copyrighted works (without the author's consent).

    So, the answer to your question is no. But that's what p2p is for ;-)

    --
    teeker
  29. Re:Which books are getting converted? by Chundra · · Score: 2

    I'm sure interrest could be affected if people could, say, vote on what would be converted. Or do I make any sense?

    I'm trying to make sense of this, please help me out. Are you saying that if people could vote on which books are converted (or "electronificated" as we sometimes call it in the industry), that more people might be interested in the project?

  30. Re:Legal Implications by Anonymous Coward · · Score: 2, Informative

    >But the publishers still have copyright on their specific printing.

    Nope. Copyright holders (not necessarily the publisher) would have copyright on editorial corrections and (for music: a weird case) some on appearance, but not on the original text.

    Publishers often claim copyright on the entire contents of 300 year old works, but they have no legal basis for this.

  31. Read? by uneek · · Score: 5, Funny

    Don't you mean run a compare tool in the background using CPU idle time?

    You don't actually want us to read a
    page of literature do you?

    1. Re:Read? by fobbman · · Score: 2

      Good point. I'll just go get the e-book.

  32. Re:How do I get to plug my online website? by Anonymous Coward · · Score: 2, Funny
    And where do the poor get online? In libraries.


    Hey, shut the fuck up. This site is about technology for technology's sake. We talk about humanitarian things just to justify it to our own conscience to relieve the guilt. Don't make us think logically!

    Remember, it's TECHNOLOGY = GOOD. WE ARE FUZZY BUNNIES THAT LOVE EVERYONE AND THINK WE'RE COOL 'CAUSE WE WRITE "HELLO, WORLD" IN C.
  33. Re:public domain books? by SamTheButcher · · Score: 2, Informative
    Also, if you read about the project, its goal is to put all of the works into XML to create a searchable repository, not just to have all of these .txt documents floating around. Well, that's the newest goal, anyway.

    $.02. Like it or leave it.

  34. A better way - have computers do more work. by lawpoop · · Score: 5, Interesting
    I was thinking -

    In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

    We use omnipro here at work, and I'm surprised at how well it works, even recreating page formats.

    Of course, it doesn't work 100%, but it sure does get about 95%. If you were to OCR a document 2-3 or more times, and most of it was identical, it would save a lot of time if you had humans going over only the parts that the different OCRs didn't agree on.
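
    Here is a minimal sketch of the "only proofread the disagreements" step, in Python. The two sample strings stand in for the output of two OCR passes and are invented for illustration, as is the `disagreements` helper. It compares the passes word by word and returns only the spans where they differ.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a, ocr_b):
    """Return (words_from_a, words_from_b) for every span where two OCR passes differ."""
    a, b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical output of two OCR passes over the same line.
pass_one = "It was the best of tirnes, it was the worst of times"
pass_two = "It was the best of times, it was the worst of times"

flagged = disagreements(pass_one, pass_two)
print(flagged)
```

    This only flags places where the passes disagree. If every pass makes the same mistake, nothing gets flagged, so the passes would need to differ in some real way (different software or different scans).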

    Steve Lefevre

    --
    Computers are useless. They can only give you answers.
    -- Pablo Picasso
    1. Re:A better way - have computers do more work. by hands · · Score: 2, Insightful
      In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

      This may eliminate some of the OCR errors, but it won't speed up the process because a good editor reads every word. You are asking for more errors when you ask your editors to become lazy and skip words.

      Most OCR will probably misread the same character the same way every time (read 'B' as '13', for example). That kind of error will not be flagged, and will be overlooked by editors who are used to only looking for flagged errors.

    2. Re:A better way - have computers do more work. by handorf · · Score: 2

      The "Three Monkeys" from Minority Report?

      Interesting idea... to be even better, you'd want to use 2 different scanners and 2 different technologies.

      --
      -- IANAEG - I am not an elder god.
    3. Re:A better way - have computers do more work. by schlach · · Score: 2

      There is a company called Paper of Record that is archiving old newspapers using OCR technology. They scan the newspaper pages, OCR them, and create a searchable database you can scan for keywords. You do a search, and can view the original scanned page or the OCR'd text.

      http://www.paperofrecord.com

      I bet their software/hardware combination would greatly help an effort such as this.


      Heh... Block-quoted for 2 free mod-points. =)

      Anyway, I just checked them out, and they have a really great idea. Except for the expensive membership part. They have searchable full-page images of a lot of *old* newspapers (like, early 1800s through present). The problem with using them for something like PG is that they want money. They're in the business of selling their work through subscriptions to their newspaper service, and selling their technology to media companies that want to put their newspaper online. Still, definitely worth checking out. Their parent company is Canadian, so they carry Canadian, US, and UK newspapers. Would be perfect without that "expensive as a regular newspaper that you don't pay for because you read it online"...

    4. Re:A better way - have computers do more work. by noodlez84 · · Score: 4, Informative

      Although your method of "proofreading" is actually useful for most documents, it is _not_ a good method for Project Gutenberg (as a contributor to DP, I can attest to this).

      The works put out by Project Gutenberg are going to be around for decades, if not centuries. 95% accuracy is shit for those purposes. An issue that comes up on the PG mailing list (gutvol-d) every once in a while is whether or not to correct spelling mistakes that appear in the real, dead-tree versions of the books. What if, for example, it's obvious to almost any reader that the author meant the word "by" instead of "bye"? Surprisingly (or not, depending on the way you look at it), the general response is *not* to correct those kinds of "mistakes". The rationale being that PG is -not- an editor, but simply a library (which is actually its legal status).

      So, in short, for works with millions of characters that are going to be around for many decades, 95% accuracy isn't good enough. The "bar" might be high, but when proofreading for DP, I strive for 100%.

    5. Re:A better way - have computers do more work. by Plutor · · Score: 2

      > In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

      Why not just have the Minority Reports discarded? Save you time, money, and bandwidth, and it's a flawless plan!

    6. Re:A better way - have computers do more work. by leuk_he · · Score: 3, Insightful

      > it doesn't work 100%, but it sure does get about 95%

      That is 2000/20 = 100 errors per page, assuming about 2000 characters on a page. (That is the way OCR works: even if it is 99% OK, that is still 20 errors per page.)
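
      The same arithmetic as a few lines of Python, as a sketch only. The 2000 characters per page is the round figure assumed above, not a measured number, and `errors_per_page` is a name made up for this example.

```python
def errors_per_page(chars_per_page, accuracy):
    """Expected number of wrong characters on a page at a given per-character accuracy."""
    return round(chars_per_page * (1 - accuracy))


CHARS_PER_PAGE = 2000  # assumed round figure

for accuracy in (0.95, 0.99, 0.999):
    print(f"{accuracy:.1%} accurate: about {errors_per_page(CHARS_PER_PAGE, accuracy)} errors per page")
```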

      And that doesn't include "strange" formatting like things scribbled in margins, headings above pages, italics and extra spaces.

      By the way, you are not supposed to correct spelling errors made in the original page, especially since this is often "old" English.

  35. Better make it quick by CatWrangler · · Score: 3, Funny

    The new congress might extend copyright protection to Shakespeare's great great great great great great great great great great great great great grandson's nephew's out of wedlock kid's son whose paternity is in question.

    --

    ---
    When you come to a fork in the road, take it! --Yogi Berra--

  36. And I shall call it... the wheel! by tiltowait · · Score: 3, Funny

    You mean a more communal approach than an oligarchy of "editros" that can't spot day-old duplicates? Great idea!

  37. will this work? by smeg168 · · Score: 2, Interesting

    I have a little problem with the logistics here. I can understand why every page is being sent to 2 people for proof reading in an effort to eliminate errors, but these aren't 2 computers doing simple computations. If both of these people have different versions of a corrected page, as I'm sure they will, what happens then? Who does the final proof reading? And if there is someone doing the final proof reading, that kinda eliminates the need for the distributed part. I could almost guarantee that any 2 people checking the same full page of data in their free time will find/create different errors. I hope I'm missing some large concept here, because I do love PG; they keep my Palm stacked with good reading for free.

    1. Re:will this work? by GiMP · · Score: 3, Insightful

      These are humans comparing identical books to text. If they have the IDENTICAL book they won't have this problem.

      Gutenberg has often published the same 'book' in different editions because of slight variations in the text.

    2. Re:will this work? by clonebarkins · · Score: 4, Informative
      who does the final proof reading, and if there is someone doing the final proof reading that kinda eliminates the need for the distributed part.

      charlz has a workflow diagram for the works that go through his site. As you see, each book has a project manager, who has final processing/proofing responsibilities.

      Also, I'm not sure you get the idea of two rounds of proofing. They don't see different versions of a corrected page -- the first one sees the straight OCR output (or, sometimes the project manager will do some automated corrections on it first) and then the first round proofer edits the text. Then, when all the pages have gone through the first round, the second round proofer reads the text as it was edited by the first round proofer. This helps because it builds off the edits of the first round proofer and allows the second round proofer to perhaps catch things not caught in the first round.

      When proofreading, you're never going to capture all the mistakes with one pair of eyes. A distributed proofreading effort is very beneficial to the goals and efforts of Project Gutenberg, and I applaud the efforts of all those who have proofed even one page.

      Having said that, I've done over 300 (under a different name).

      --

      "The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

  38. OCR errors mostly caused by poor scan quality by oob · · Score: 4, Informative

    I've just proofed four pages, a mix of modern English, quoted Cockney and religious babble (Jonah 4:13, 9 etc.)

    OK it's only four pages, but the errors I've corrected so far have been when the scan has been poor and the OCR software has had to make a guess.

    1. Re:OCR errors mostly caused by poor scan quality by j-beda · · Score: 2

      Some versions of the bible are online, but not all of them. Multiple editions of a single work can be at PG, the bible is probably the most common one with multiple versions.

  39. Re:OCR Software -- Clara, perhaps? by timothy · · Score: 5, Informative

    Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.

    Here's the web page: http://www.claraocr.org/index.html

    timothy

    --
    jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
  40. Re:copyrights? by msouth · · Score: 3, Insightful
    Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been under copyright.


    Well, copyrights weren't perpetual. Whether they will be or not remains to be seen.
    --
    Liberty uber alles.
  41. Re:Umm... by jandrese · · Score: 5, Interesting

    Someone needs to do a google search on "Public Domain". Public domain is there for a reason. Just as copyright is available to give the artist a means of supporting himself, it was never meant to last his entire life. The purpose is to give the artist an incentive to work. Current copyright law fails in this respect because an artist only needs to create one successful work and can immediately switch to being a leech on society for the rest of his (and his children's, and children's children's) life. Having the works pass into the public domain is a good idea for two reasons:
    1. It is for the greater good of society as other people build on earlier works.
    2. It keeps the artist busy as they were supposed to have to keep releasing work to feed themselves as their early work passed into the public domain, just like any other job.

    --

    I read the internet for the articles.
  42. Why he came to slashdot by cachapa · · Score: 2, Funny

    I think he was just watching all his volunteers working on one page a day and thought:
    "Imagine a beowulf cluster of these!"

  43. Re:Umm... by Big_Breaker · · Score: 2, Interesting

    Lots of books aren't copyrighted anymore as the copyright expired. You see back before Disney bought legislation from people like Sonny Bono copyrights would be allowed to expire after about 50 years or so.

    Beowulf, Moby Dick, Shakespeare's plays, etc. are all free as in speech and beer. Edited versions of the original text can be copyrighted. Examples of that are editions of Shakespeare's plays with "translations" next to the original text. You can buy his complete works, unedited, for very little $ these days. The only cost for the publisher is printing and typesetting.

  44. Books read to you while commuting by dudemaster · · Score: 3, Interesting

    How about this.... use an open source speech synthesis tool/API that can play these text books (especially as more get added) over a PDA, laptop, etc while cruising in on the way to work and home. Something like:

    http://www.cstr.ed.ac.uk/projects/festival/
    (no plug, just did a quick freshmeat search)

    would be pretty cool to get some good novels read to you w/o buying the tapes.

  45. Duplicity? by Andy+Social · · Score: 2

    Or duplication, maybe?

    --
    Illegitimi non carborundum
  46. Just one page a day, huh? by WIAKywbfatw · · Score: 5, Funny

    Sure, it starts as just one a day. But, before you know it, you're doing two, then five, then ten.

    You stop going out with friends or even returning their calls, personal hygiene takes a back seat and even Counter Strike and Warcraft III become unappealing. And, finally, after countless chapters and hundreds of pages you realise that your friends were right: you're an addict.

    Just one page a day, huh? Yeah, right.

    Opium. Pot. Cocaine. Now pages.

    It might not be your older brother's drug, or your Daddy's or your grandfather's, but, trust me, this stuff can be dangerous.

    Do what I do. Just say no.

    --

    "Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
    1. Re:Just one page a day, huh? by SDrifter · · Score: 2, Funny

      personal hygiene takes a back seat and even Counter Strike and Warcraft III become unappealing

      If Counter Strike and Warcraft are what you do for fun, somehow I doubt that personal hygiene is an issue. Or friends, for that matter.

      --
      --It burns! --It's loaded with wasabi.
    2. Re:Just one page a day, huh? by Spunk · · Score: 2

      Opium. Pot. Cocaine. Now pages.

      Funny, the text I'm proofing right now is about opium.

  47. Doing my part by cornjones · · Score: 2

    If we just write one line of code a day each we'll have better OCR in no time.

    #include

    Ok, there is my line of code, everybody else, finish it up.

    I can't wait to see this great new OCR.

    1. Re:Doing my part by Myco · · Score: 2

      Your contribution is a syntax error? Thanks so much. History is in your debt.

  48. What books need to be done? by Alethes · · Score: 3, Interesting

    Is there a list of books that are out of copyright and perhaps the status of those books on the Gutenberg Project website or anywhere else?

    1. Re:What books need to be done? by clonebarkins · · Score: 3, Informative

      Check out the following for a start:

      --

      "The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

    2. Re:What books need to be done? by stud9920 · · Score: 2

      We could start with Sonny Bono's lyrics. I heard they would be public domain in 2012^H^H24^H^H45^H^H98^H^H^H^H3012^H^H^H^Hnever

    3. Re:What books need to be done? by dvdeug · · Score: 2

      There are probably millions of books that are out of copyright. http://www.dprice48.freeserve.co.uk/GutIP.html has a list of ones that are in process or released, but it covers only a tiny fragment of the number of books out of copyright. Arguably, they all need to be done; personally, I put emphasis on the literature by famous authors (Millay, Tolstoy) and the non-fiction that everyone should have access to -- especially first-hand and soon-after-the-fact accounts of historical events.

  49. Possible Enhancements by Niles_Stonne · · Score: 5, Interesting

    This is a great project... But after doing my first page I found a couple of possible enhancements.

    Add a "Quality" stat for each person. Base it on the number of things that were missed (in other words, the number of things that the second-string proofer finds).

    Use more than just two proofers. Have one "First String" proofer, who could be anybody, but have two second string proofers (who both get the output of the first string proofer). If the second string proofers have any differences in their output(with the exception of white space), then another second string proofer should be used. Only proofers with a certain quality rating(slightly higher than what a newbie's would be) should be able to do the second string proofing.

    The "User rating" should be a combination of the number of pages done and the quality rating of those pages. Note that quality rating would only be increased by doing first string proofing. Page count would go up for any proofing.

    Quality could be a float, starting at 1.0 for newbies. Every page that is completed and has a second-string person check would then go into a calculation like:

    _new_quality_ = _old_quality_ + (0.01 - (_num_differences_between_their_proof_and_final_proof_ / 1000))

    Thus, for every page proofed that requires NO corrections by the second string the user's quality would go up by 0.01. ( 0.01 - 0/1000 = 0.01 )

    if there were more than ten errors in the proofing, their quality would go down ( 0.01 - 10/1000 = 0.00 ), (0.01 - 20/1000 = -0.01)

    Have a threshold of 1.10 or some such for second string proofers... That way it would require the user to do at least 10 perfect pages, or 20 pages with 5 errors, etc, before they could do the second string proofing.

    Obviously, make sure that the second string proofer can't see who the first string proofer is.

    The "User Rating" (mentioned above) could just be a multiplication of the Quality and Page Counts...
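
    For what it's worth, here is the calculation as a small Python sketch. It is only an illustration of the formula above, not anything the site runs. The function names are made up, and I store quality in thousandths as an integer (1000 means 1.000) so that ten perfect pages land exactly on the 1.10 threshold instead of drifting a hair below it in floating point.

```python
START_QUALITY = 1000            # 1.000, stored in thousandths
SECOND_STRING_THRESHOLD = 1100  # 1.100


def update_quality(quality, num_differences):
    """Apply one page: +0.010 for the page, -0.001 per difference found later."""
    return quality + 10 - num_differences


def after_pages(differences_per_page, quality=START_QUALITY):
    """Run the update over a sequence of per-page difference counts."""
    for diffs in differences_per_page:
        quality = update_quality(quality, diffs)
    return quality


def can_second_string(quality):
    return quality >= SECOND_STRING_THRESHOLD


perfect = after_pages([0] * 10)    # ten perfect pages
sloppy = after_pages([5] * 20)     # twenty pages with 5 errors each
careless = after_pages([20] * 3)   # three pages with 20 errors each
print(perfect / 1000, sloppy / 1000, careless / 1000)
```

    The first two cases reach the second-string threshold; the third drops below the starting value.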

    --
    Sticks and Stones may break my bones, but copyright will always protect me.
    1. Re:Possible Enhancements by Jerf · · Score: 2

      You should read this. It may not seem directly related at first, but it is.

      The root problem is that unless you can measure EXACTLY what you are trying to measure, people will optimize to improve their standing with the measurement, rather than real quality.

      Your proposed optimizations would cause someone to create two accounts, one that they use to completely trash a page, and another to "correct" it, boosting the second account's rating at the expense of the first. (You can't force people to do pages they don't want to do, or you'll drop participation through the floor.)

      I know you mean well, but it is often better just to leave these statistics out completely, and deal with the fact that you are only attracting serious people to the project who will do it without the carrot of being in "first place" over everybody else on the stats page.

    2. Re:Possible Enhancements by Niles_Stonne · · Score: 2

      First of all, you do not get to choose the page that you do, just the book - so you couldn't reference a particular page.

      Second of all, with my proposed quality rating you couldn't do that. Sure, the first string proofer could screw up the page, but once one first string proofer finishes it, only second-string proofers can work on it. In my proposal the only people that would get their quality level adjusted would be the first string proofers. In other words, sure you could use your second account to fix what you screwed up in the first, but your second account's quality wouldn't be increased, and your first account's quality would be decreased.

      Perhaps a cutoff to not allow proofing once a person is below Quality 0.80 or so would be in order.

      The link you posted (to fogcreek.com) does have some good statements about user metrics. Keep in mind that this is a community effort, so there is no HR department to worry about.

      I was attempting to give users something that they could boast about (I have a Quality rating of 5.06!) that would encourage higher quality work, not just faster work.

      --
      Sticks and Stones may break my bones, but copyright will always protect me.
    3. Re:Possible Enhancements by Niles_Stonne · · Score: 2

      The site already does a two tiered approach, so it would be 50X pages currently. I just wanted to provide a couple of extra checks, as well as a performance estimate for user statistics. The drop from 50X to 33X is not nearly as great as the drop from 100X to 33X, although it is still significant. ;)

      Having a setup like I proposed makes it very difficult for a purposefully mangled page to get through.

      --
      Sticks and Stones may break my bones, but copyright will always protect me.
    4. Re:Possible Enhancements by Niles_Stonne · · Score: 2

      Thank you.

      Good tagline :)

      I tend to come up with way too many ideas/enhancements whenever I do something... Feature Creep is my greatest issue when writing software ;)

      --
      Sticks and Stones may break my bones, but copyright will always protect me.
  50. ASCII Only? by vondo · · Score: 5, Insightful

    Reading the blurb at the page-a-day site, it says ASCII only where bold is converted to ALL CAPS, the English pound symbol is rendered as "L," etc. No preservation of figures, drawings, or photos.

    This seems very short sighted to me. Devices that can only display ASCII are becoming rarer and rarer. Why not, instead, store docs in some sort of SGML format to handle the special markup (which must be rare) and then down-convert to ASCII when needed?

    I've tried reading these things on my Palm. Very difficult. But if I could get a nice typeset PDF version, that would be a whole different story (no pun intended).

    1. Re:ASCII Only? by Robotech_Master · · Score: 5, Informative

      Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.

      --
      Editor Emeritus and Senior Writer, TeleRead.org
    2. Re:ASCII Only? by mattdm · · Score: 2

      This seems very short sighted to me. Devices that can only display ASCII are becoming rarer and rarer.

      And the key thing is: with a good markup language, converting to plain ASCII for those devices is trivial. Or *trivial*. It's a win-win proposition. In fact, the markup language doesn't even have to be that great -- HTML 4 would work fine.

    3. Re:ASCII Only? by Captain+Large+Face · · Score: 2

      What about DocBook, which features encoding for books in both SGML and XML? It was devised for computing books, but one imagines it would not be too hard to devise a standard to apply to all works of literature.

    4. Re:ASCII Only? by rusty0101 · · Score: 4, Informative

      When the project was started, SGML variants were not widely used, and the option of including images was a concern for storage space.

      Using things like BOLD and L for the British pound were workarounds to have a common way of presenting the data. I suspect that it would be trivial to build a formatting filter in perl, or another language that would convert BOLD to bold though it would require a bit more work to recognize that it really should be Bold or even that it should be BOLD.

      Converting monetary symbols would require a bit more work, but would also not be impossible.

      Re-inserting any diagrams, figures, illustrations or other graphics would require more work. If the original scanned pages are still available, as this part of the project suggests, even that would not be impossible.

      One variation is the free bookmobile project that is out there. They use scans of the original book to build a new book for kids. Preparation for printing involves downloading the book over the internet, via a DSL-speed satellite link. I am not sure, however, if the working material is suitable for e-book reading.

      -Rusty

      --
      You never know...
    5. Re:ASCII Only? by quinto2000 · · Score: 3, Informative

      From actually proofing a few pages, this depends entirely on the particular project and when it was started. Some of the newer ones allow special characters.

      --
      Ceci n'est pas un post
    6. Re:ASCII Only? by sagwalla · · Score: 2, Insightful

      The beauty of this is that it is in the public domain. If you want a PDF version, or an HTML version, feel free to make one. The Gutenberg standards put the material out in a lowest-common-denominator format so anyone has the same freedom.

    7. Re:ASCII Only? by dvdeug · · Score: 2

      markup is a Bad Thing and that ASCII is the One True Format and they aren't even going to think about switching to anything else, ever.

      ASCII is the One True Format. It's been constant since 196x, unlike the world of alternatives. Since PG has been around since the early '70s, they tend to stay with what works. It's annoying when I try to read a book in PG, and the volunteer preserved the French characters, in DOS, so it doesn't display right anymore on almost anyone's computer.

      Most books don't have a huge collection of markup - maybe a few italics. The underscore convention for italics, used by most people now, can be automatically converted. The uppercase can't, but it's not that hard to put in the elbow grease to fix it.
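
      To show what "automatically converted" could look like, here is a minimal Python sketch that turns _underscored_ spans into HTML italics. The regular expression and the function name are my own, not anything PG uses. It only handles italics that start and end on the same line, and it leaves underscores inside words alone.

```python
import re

# An underscore pair wrapping text that starts and ends with a non-space
# character, where neither underscore touches a word character on its outer side.
ITALIC = re.compile(r"(?<!\w)_(\S(?:.*?\S)?)_(?!\w)")


def underscores_to_html(text):
    """Replace _italic_ spans with <i>italic</i>, leaving other underscores as-is."""
    return ITALIC.sub(r"<i>\1</i>", text)


print(underscores_to_html("She read _Moby Dick_ and called it _long_."))
```

      The ALL CAPS case has no such rule, since caps can stand for bold, small caps, or plain emphasis, and that is where the manual work comes in.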

      They have copies of some stuff, that can't be handled well in ASCII, in other formats; HTML is popular, with TeX for those math works. But neither is universal; an ASCII version is still provided where feasible because it is universal.

  51. Re:public domain books? by RobotRunAmok · · Score: 2

    I know for a fact that there are a lot of digital copies of copyrighted works such as Frank Herbert's Dune series and The Lord of the Rings floating around the Net and I think the newsgroups as well.

    Of course, there are. And why shouldn't there be? Information (and Entertainment) Must Be Free!

    Just ask Harlan

  52. Distributed Proofreading has a "high score" table. by Lovepump · · Score: 3, Insightful

    How long before someone writes a script to hit "Save and get another Page" and they shoot to the top of the ladder claiming to have proofread 13,450,213 pages per day...

  53. Re:OCR Software -- Clara, perhaps? by Zach+Garner · · Score: 3, Informative

    I've used both clara and gOCR. Neither works well enough yet to actually use for scanning books.

  54. No, not really by Codex+The+Sloth · · Score: 4, Insightful

    OCR Engines are not email programs. You can't just add a line of code and all of a sudden it works better. Usually you have to spend time developing a complicated algorithm. Usually this is more than a line of code. Then you have to test it against known text (ground truth) to make sure it's a benefit, rather than a problem over a broad selection of pages. It's quite often the case that something that improves one page makes another worse.

    Actually, having people make verifications against the OCR results establishes the ground truth which someone could use to improve the OCR engine so by doing a Page a Day, you are helping to make future Open Source OCR engines better.
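
    As a sketch of what testing against ground truth can mean in practice, here is a character error rate in Python: the edit distance between the OCR output and the proofed text, divided by the length of the proofed text. The function names and the sample strings are mine, and real OCR evaluation tools are more involved than this.

```python
def edit_distance(a, b):
    """Levenshtein distance: fewest single-character inserts, deletes, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute, or match at no cost
        prev = cur
    return prev[-1]


def char_error_rate(ocr_text, ground_truth):
    """Edit distance from OCR output to the proofed text, per ground-truth character."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


# Hypothetical OCR output and its human-proofed version.
print(char_error_rate("Tbe cat", "The cat"))
```

    Run over a broad set of proofed pages before and after a change to the engine, a number like this is one way to see whether the change helped overall or just fixed one page at the expense of another.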

    --
    I am not a number! I am a man! And don't you ... oh wait, I'm #93427. Ha ha! In your face #93428!
  55. Re:I'm guessing... by JUSTONEMORELATTE · · Score: 2
    The odd thing is what Amazon chose to recommend to me when I view the page for The Road Ahead:
    Customers who shopped for this item also wear:
    • Clean Underwear from Amazon's Eddie Bauer Store
    • Ladybug Rain Boots from Amazon's Nordstrom Store
    • Suede Headwraps from Amazon's International Male Store
    • Cheetah Print Slippers from Amazon's Old Navy Store

    I used to be creative, now I'm merely observant.
    --
  56. Re:Legal Implications by Happy+Monkey · · Score: 2

    You seem to have skipped the second sentence of the post you replied to, even though the editorial corrections you refer to would undoubtedly appear on the scanned pages. One way around it might be that each page is covered under fair use, and they are not served to the proofreader in order, so you are never given more than a one-page excerpt.

    --
    __
    Do ya feel happy-go-lucky, punk?
  57. Distributed Everything by Mostly+Harmless · · Score: 2, Informative

    Something I posted on 10/24...

    Go here. Now. It's the most complete listing of distributed computing I've ever found. Has the usual, like folding and SETI, but also neat things like Distributed Proofreading and finding as-of-yet unknown comets.

    --
    "`Ford, you're turning into a penguin. Stop it.'" -Douglas Adams, THHGTTG
  58. Re:A better use of time (OK, here's mine) by gosand · · Score: 5, Funny
    If we just write one line of code a day each we'll have better OCR in no time.

    OK, here's mine:

    #include stdio.h

    next...

    --

    My beliefs do not require that you agree with them.

  59. OCRs aren't about to do context-sensitive thinking by Kjella · · Score: 2

    I just put in a few pages (15 if you care :), and while some were very consistent in quality, at least one book had some smears and spots. There's no way an OCR of any quality would be able to reverse engineer the half-printed letters and words back to readable English without a *good* dictionary/grammar machine, and even then it would be more dangerous to have it do a half-assed guess than to have a human there that will at once tell that this is a trouble spot and that the OCR dropped the ball. God, that last was an ugly sentence, guess I should stick to proofreading and not start writing myself...

    Kjella

    --
    Live today, because you never know what tomorrow brings
  60. Re:Legal Implications by Twylite · · Score: 2

    Sorry, but this isn't strictly true. See my earlier post. Publishers tweak the text ("corrections" mostly) which give them copyright over their particular publication.

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  61. Scanning without damaging the book? by mttlg · · Score: 3, Interesting

    I have a few books that are old enough to be well out of copyright (and obscure enough not to be found online already), and for a while I have been considering typing them in. OCR would be a lot easier, but getting a good image from a flatbed scanner would seriously damage most of these books. Even a handheld scanner would be impractical in some cases, and a digital camera seems even less likely to work. Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

    1. Re:Scanning without damaging the book? by jpetts · · Score: 5, Informative

      Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

      Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000 will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.

      --
      Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
    2. Re:Scanning without damaging the book? by Griim · · Score: 2

      The other poster's idea might be better here, I'm not familiar with those photocopiers, but I would think that one of the newer digital cameras (4 megapixels and up), accompanied with a macro lens would do the trick. You can get some pretty stunning detail out of the newer models, if you haven't seen them.

    3. Re:Scanning without damaging the book? by ChaosDiscord · · Score: 2, Informative
      ...but getting a good image from a flatbed scanner would seriously damage most of these books. ...a digital camera seems even less likely to work.

      Actually given a nice digital camera with a high resolution, you can generate perfectly fine images for OCRing. I've known a few people who have done exactly this to take images of rare books that they have access to but would never be allowed to put on a scanner.

    4. Re:Scanning without damaging the book? by mttlg · · Score: 2

      Actually given a nice digital camera with a high resolution, you can generate perfectly fine images for OCRing.

      I wasn't questioning the resolution of the camera, I was questioning the positioning of the book to get a good image. This would work easily if the book could be opened to lay flat, but otherwise it would require some apparatus to hold the book open, and even this won't work if the book can't be held open far enough with the page flat to get a good picture (as in the worst-case example I gave).

  62. Re:Umm... by Twylite · · Score: 4, Insightful

    Copyright law is supposed to give incentive to create, for the betterment of society, and allow the creator to derive direct benefits as a reward. An artist who has created a work so successful that (s)he can live on it indefinitely has arguably provided a suitable level of betterment to society.

    Saying that copyright law is an incentive to "work" is accepting mediocrity. Artists who produce works that society values more highly should (have the opportunity to) receive more benefits.

    On the other hand, I don't necessarily agree that copyright should last the lifetime of the creator (although there are strong arguments for this in the case of a natural person). But what is a "fair" limit?

    Is 5 years enough? Almost certainly not. Many authors only achieve popularity after 10 or more years, and then make a fair amount of money off increased sales of their older works. A good number accept this as a risk, and plan to use this phenomenon to their benefit - work up a good number of titles with varied content, and you'll pull more readers, who are then likely to try some of your other titles.

    Is 20 years enough? Maybe. But some of our best-loved authors were 15-20 years ahead of their time in terms of what readers wanted.

    Is life enough? Strangely, no. If an aging star has just completed his/her autobiography, concludes the publishing deal, and dies ... well, the family could well be screwed.

    Maybe the answer lies in a compromise, rather than an all-or-nothing approach. Copyright over a work lasts for the greater of 10 years or the creator's natural life (which gets very interesting when we get eternal life medications ...). But some rights fall away after the LESSER of those two times, such as exclusivity over derivative works (but not translations).

    This allows society to (culturally) enrich itself by building on a work after a shorter amount of time, while the creator (and/or family) can still derive value from the original work for a longer time.

    In the case of books this is easily understood: author writes book; 10 years later other people can write prequels and sequels, extend the world and characters, etc; 30 years later author dies and original book falls into public domain.
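
    The arithmetic of this compromise can be written out in a few lines. This is only a sketch of the proposal as stated above, in Python; the function name `proposed_terms` and its parameters are made up for illustration, and it does not describe any real copyright law:

```python
def proposed_terms(years_until_creator_dies, fixed_term=10):
    """Return (years until derivative works open up, years until public domain).

    Both values count from the year the work is created. Exclusivity over
    derivative works ends at the LESSER of the fixed term and the creator's
    remaining lifetime; full copyright ends at the GREATER of the two.
    """
    derivative_rights_end = min(fixed_term, years_until_creator_dies)
    full_copyright_ends = max(fixed_term, years_until_creator_dies)
    return derivative_rights_end, full_copyright_ends


# The book example: the author lives 30 more years after writing.
print(proposed_terms(30))  # (10, 30)
# An author who dies 2 years after publishing: the family keeps 10 years.
print(proposed_terms(2))   # (2, 10)
```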

    --
    i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
  63. impressive by Anonymous Coward · · Score: 2, Interesting

    The response from the Slashdot community is impressive. Already they have hit their mark for the day as far as 'pages processed'. They have over 1400 (at 10:13am CST) pages processed. When I visited their site at 8:45am CST they had only 615 pages. I predict that the project will hit the 3000 mark fairly quickly for today.

  64. Re:Legal Implications by j-beda · · Score: 2

    I am pretty sure that PG takes care to only use old copies of books that are in fact no longer copyrighted if that is in fact necessary. They seem very picky in making sure that they follow the rules.

  65. Are any of these resources distributed? by wls · · Score: 3, Insightful

    It seems like every few years I turn around and notice that some massive archive collection gets sued, goes out of business, has funding pulled, gets tangled in legal action, has a university board go into panic mode, etc. and suddenly it disappears without warning or notice to the frustration of many. I'm certain you also can name a number of services, collections, and resources that spontaneously vanished when hosted at friendly sites. History has proven that despite best intentions, nothing lasts forever unless we go out of our way to protect it.

    So that work isn't lost or destroyed, are any of the mega-sized projects replicated elsewhere in the event that an "it'll never happen" situation befalls this unsuspecting resource?

    1. Re:Are any of these resources distributed? by Anonymous Coward · · Score: 2, Informative

      DP submits to Project Gutenberg. This is a Gutenberg FAQ.

      1. Michael Hart (Gutenberg's leader) is very much in favor of massive replication. My favourite is when disk drive makers start putting the entire Gutenberg collection on their drives before selling them (to fill up space/differentiator)

      2. PG has been around 20 years, and never been shut down. Judges actually understand and defend the public domain, within limits that PG understands.

      3. Nothing goes through DP without copyright approval from MHart. And if he makes a mistake, it is likely to be fixed by withdrawing the offending *book*, as far as possible.

  66. blackmask.com by night_flyer · · Score: 2

    After finding Thea von Harbou's Metropolis at www.blackmask.com, I go there first when looking for an ebook, especially since they have them in e-silo format (Palm). If they don't have what I'm looking for I go by Project Gutenberg...

    --


    Thanks to file sharing, I purchase more CDs
    Thanks to the RIAA, I buy them used...
  67. Re:And you ask the /. community... by Binestar · · Score: 5, Funny

    MY GOD! A story where nitpicking grammar and spelling is *ON* topic.

    This'll be a fun one to read through.

    --
    Do you Gentoo!?
  68. works fine! by magwm · · Score: 2, Interesting

    I just proofread 2 pages of some Greek philosophy book. The system works really nicely! Quick database, not too large pages to read. Except I would like to have source and text next to each other, and not above each other.

  69. I'm impressed by schmiddy · · Score: 2, Interesting

    I signed up for an account, and did a bit of proofing. One page was a bibliography with lots of numbers -- the OCR software made a few errors here and there, sometimes confusing "1" with "!". Another page was in old German. Since many old German characters look so different than their modern-day counterparts, I was quite impressed when it translated them flawlessly into their proper ASCII counterparts. The OCR software even got the umlauts right. Only problem was it sometimes mistook an end of line "-" for a "=". One problem I did have was that most of the scans seemed to be pretty low resolution. This causes problems when comparing the scanned text to the original image, as it can create difficulties for the proofreader. The software also had trouble translating the low-res blocks.
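
    The two confusions described above (a "!" where a "1" belongs, a "=" where an end-of-line hyphen belongs) are regular enough to flag mechanically before a human looks at the page. A minimal Python sketch, assuming plain-text OCR output; the name `flag_suspects` and both patterns are illustrative guesses, not part of the Distributed Proofreaders code:

```python
import re

# Hypothetical heuristics for the two confusions mentioned above.
SUSPECT_PATTERNS = [
    # "!" touching a digit is probably a misread "1", e.g. "19!2".
    (re.compile(r"(?<=\d)!|!(?=\d)"), "'!' next to a digit, possibly '1'"),
    # "=" directly after a letter at the end of a line is probably a hyphen.
    (re.compile(r"(?<=[^\W\d_])=$"), "'=' at end of line, possibly '-'"),
]


def flag_suspects(text):
    """Return (line_number, column, reason) for each suspicious character."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        stripped = line.rstrip()
        for pattern, reason in SUSPECT_PATTERNS:
            for match in pattern.finditer(stripped):
                # Columns are 1-based so they match what an editor displays.
                hits.append((line_no, match.start() + 1, reason))
    return hits


page = "Vol. 2, 19!2, p. 34\nWieder=\nholung\nStop!"
for hit in flag_suspects(page):
    print(hit)
```

    A flag only points at a spot worth a second look; deciding what the character should be still takes a person comparing the text against the page image.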

    --
    http://cltracker.net -- powerful craigslist multi-city search
  70. Pubic Domain? by Cap'n+Canuck · · Score: 2, Funny

    I'll help out.

    One question - is Playboy public domain yet?

  71. Re:Umm... by Dirtside · · Score: 2

    My wife had a suggestion for limiting the life of copyright. Basically, tie it to the amount of income you get from the work. Once you reach a certain plateau, the work falls into the public domain (although you could argue for an additional minimum time requirement, e.g. 5 years for movies, so that a gigantic blockbuster won't enter the public domain after 6 months). Or instead of income, base it on profit. That way, you are guaranteed that you will make a certain amount of money before the work enters the public domain. Of course, works that never reach the plateau would enter the public domain after a suitable period -- e.g. life plus 10 years for natural persons, or something incredibly short for a corporation, like 20 years.

    Of course, there are practical problems with this method -- namely, accurately determining the amount of money a work takes in. It's all too easy to fudge financial data, as we've been too often reminded in the past year, and this idea may not be workable.

    --
    "Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
  72. And we shall call it kuro5hin! by drivers · · Score: 2

    (en tea)

  73. Can't get through? Try ibiblio by gbnewby · · Score: 3, Informative
    The main Gutenberg page is slashdotted right now, but you can get nearly the same access to the books via the main ibiblio page at ibiblio.org/gutenberg, which is the main distribution site for the collection.

    It looks like the texts01.archive.org/dp site is holding up fairly well! If you cannot get through today, though, please check back later. Slashdot effect aside, it's usually quite speedy and has a decent 'net connection. If you want to keep informed of current events, get on one of our mailing lists via (when it's not slashdotted) our subscriptions page.

    Dr. Gregory B. Newby
    Chief Executive and Director
    Project Gutenberg Literary Archive Foundation http://gutenberg.net
    A 501(c)(3) not-for-profit organization with EIN 64-6221541
    gbnewby@ils.unc.edu // 919-962-8064

  74. Re:Some PG books ARE copyrighted... by dpbsmith · · Score: 5, Informative

    ...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.

    Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc. etc.

    Not exactly "the latest Stephen King" but a lot newer than Dickens.

  75. I am programmer, let's automate this by LoRider · · Score: 2

    Do they want me to manually scan through a page of text compare it with an image and fix errors created by OCR? It goes against my very nature to do such a task. There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.

    I haven't finished my first cup of coffee yet so I am at a loss for a solution, but it sounds like something Perl would be good at.

    The motto of the open source community should be or is, "Progress not perfection."

    --
    LoRider
    1. Re:I am programmer, let's automate this by Sloppy · · Score: 3, Insightful
      There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.
      It's not human eyes that are needed, it's human brains. If it is possible to automate, then the OCR doesn't need checking; it just needs to be upgraded to include whatever algorithm that you're about to invent.
      --
      As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
    2. Re:I am programmer, let's automate this by dvdeug · · Score: 2

      There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.

      If there were, then the OCR program would have done it. It requires large amounts of context and pattern recognition, things humans are much better at than computers.

  76. Re:A better use of time (OK, here's mine) by jim3e8 · · Score: 2, Funny

    OK, I'll start at the other end and work my way toward you:

    }

  77. Reminds me of the OED by Anonymous Coward · · Score: 2, Interesting

    Their approach to solving this reminds me of how the Oxford English Dictionary was started -- by compiling submissions and references from thousands of volunteers. A really enjoyable recounting of this (and of one particular person who contributed thousands of words while in an insane asylum) is The Professor and the Madman.

  78. Re:Some PG books ARE copyrighted... by BlueGecko · · Score: 2

    I do believe you have linked to a copyright circumvention device (the .au domain) in violation of the DMCA. Please stand by while you and your belongings are liquidated.

  79. Thieving TOS Violator! by timeOday · · Score: 2
    the original site ran on a Pentium 200 over my 128kbps upstream cablemodem
    This is a chilling example of the dire consequences of granting upstream bandwidth to home users!!!

    Er, wait...

  80. Looking for proofreaders on slashdot !! by tadas · · Score: 5, Funny

    If they're looking for proofreaders here, the project is in deep trouble...

    --
    This page accidentally left blank
  81. Re:Legal Implications by dvdeug · · Score: 2

    But the publishers still have copyright on their specific printing.

    I've heard this in the context of German law, but never in the context of American law. American law requires significant creative effort for a work to be copyrighted, which dumping text to paper rarely counts as. (New footnotes and illustrations are a different matter, of course.)

  82. Snide remark about markup... by davidmccabe · · Score: 2, Funny

    Wait a minute! Isn't PHP like evil or something?

    Programming languages may come and go, but good old-fashioned machine code will last as long as literature, very much like good old-fashioned ASCII text and good old-fashioned zip files with no meaningful names.

  83. Re:Price on my head. by dvdeug · · Score: 2

    That would make an incentive for people to kill you so they can steal your work.

    People already have incentive to kill other people to take their work. It's robbery gone bad, and inheritance. I doubt there's enough value in any public domain work to make a death sentence worth facing.

  84. Blame it on Mickey Mouse by peter303 · · Score: 2

    Walt Disney wanted to extend the rights to his branded characters and got the lawmakers to do it. In some respects his old stuff is renewed every decade: new generations of kids and new media- film, theme park, video tape, DVD, IMAX ... Each reissue is a new pile of money.

  85. Re:Umm... by Dirtside · · Score: 2

    Interesting points. There is the fact that deliberately creating an artistic work that will reach a certain cash plateau is nearly impossible -- just look at how many creative endeavors never even get so far as to break even, and that's with authors trying really hard.

    Also, there's the fact that an over-successful work creates desire for an author's other works -- so writing something which will exceed its copyright profit cap would still create income for the author's other works.

    Additionally, if there's a minimum time limit set on the work (I'd say 15-20 years for books), then even if it is wildly successful, you could reap the profits for 20 years, even if you greatly exceeded the profit cap. Once that 20-year deadline hit, of course, the copyright would expire. Trying to calculate your work so that you only barely reach the profit cap *after* the minimum time would be utterly impossible, so I doubt that would have any effect on authors' efforts.

    All that said, yours is a simpler solution (and one that I would support) -- 20 year copyright, non-extendable, from the date of first publication, regardless of the author. Period. Copyrights would be transferable (i.e. I could sell my copyright to a new owner, and I would lose *all* rights to it). It's an acceptable solution, though it doesn't mean it's the best solution (or even realistic, politically speaking).

    --
    "Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
  86. Re:A better use of time (OK, here's mine) by gosand · · Score: 2
    1. It didn't say it had to be bug-free code. :p

    2. Do you know how long it has been since I wrote any C code? I was lucky I spelled stdio correctly.

    --

    My beliefs do not require that you agree with them.

  87. legal enforceability by David+Jao · · Score: 2
    In this case the text is (almost) legally enforceable. They really do own the copyright. They really do have the right to prohibit almost all copying of that book, except in limited fair use circumstances.

    For the reason why, I suggest you "learn up" on what public domain really means in the US. Public domain simply means that a particular work has no copyright restrictions. It does not mean that you are prohibited from adding further copyright restrictions of your own.

    In other words, a work which is public domain is free for all to copy in any way they wish, including copyrighting a copy for themselves. Note that placing your own copyright on the work does not mean that the original work is copyrighted. It just means that your copy is copyrighted. Anyone is still free to access the original copy, which is still in the public domain. But they can't use your copy if your copy has your copyright.

    You might ask "are there laws that prohibit you from lying about the authorship of a work?" The answer is yes. It's called fraud. It has nothing to do with copyright. Placing your own copyright on a work, and claiming authorship of a work, are two completely independent actions according to the legal system.

    You are totally right that the cover text is not enforceable with regard to "fair use" copying of the text, but the parts that say "Copyright 1974 Houghton Mifflin" and "All rights reserved" are definitely valid, enforceable, and meaningful.

    1. Re:legal enforceability by dvdeug · · Score: 2

      In other words, a work which is public domain is free for all to copy in any way they wish, including copyrighting a copy for themselves.

      That's not true. You can use it as a basis for your own copyrighted work, but you can't claim a copyright on something without adding significant creative value.

      From http://www.copyright.gov/circs/circ1.html

      Only the author or those deriving their rights through the author can rightfully claim copyright.

  88. Non-native proofers by Sangui5 · · Score: 3, Informative

    are actually the preferred way to proof text. A project to create "The Collected Works of Edmund Spenser" is headquartered here, and the English-types were looking for people to work on some software for them. The current most accurate way to create an electronic copy is to hire people without even a passing familiarity with the alphabet you are targeting, train them to identify the letters themselves (using the font you're targeting, which may be very much non-standard, esp. for work as old as Spenser's), and have them enter it in character by character. You then have another illiterate person do the same, and have 1 editor (English graduate student) check both copies. Then any differences have to be handled by another editor (English PhD), and the final copy signed off by yet another editor (PhD).

    A very very expensive way to do it.

    See, an illiterate person won't introduce any bias into the text. They will faithfully duplicate any spelling mistakes that they find. In the case of an English scholarly collection, the mistakes are among the most important parts, since they can identify different print runs, and how language shifts over time.

    As a side note, the software project is hopeless. The best that can be managed is to automate the administration of their current systems--no OCR will ever meet the level of accuracy that their current system provides.
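
    The double-keying workflow described above reduces to comparing two independent transcriptions and handing only the disagreements to an editor. A minimal Python sketch using the standard `difflib` module; the function name `disagreements` and the sample line are made up for illustration, and the Spenser project's real tooling is not described in this post:

```python
import difflib


def disagreements(first_keying, second_keying):
    """List the spans where two transcriptions of the same page differ.

    Each result is (position_in_first, text_in_first, text_in_second).
    Identical stretches are skipped, so agreement costs no editor time.
    """
    matcher = difflib.SequenceMatcher(
        None, first_keying, second_keying, autojunk=False
    )
    diffs = []
    for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes():
        if tag != "equal":
            diffs.append(
                (a_start, first_keying[a_start:a_end], second_keying[b_start:b_end])
            )
    return diffs


# Hypothetical sample: the second typist dropped the archaic final "e".
typist_one = "A gentle Knight was pricking on the plaine"
typist_two = "A gentle Knight was pricking on the plain"
print(disagreements(typist_one, typist_two))  # [(41, 'e', '')]
```

    The comparison is character by character on purpose: it has no notion of which spelling is "right", so an archaic spelling that one typist silently modernized shows up as a difference instead of being smoothed over.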

  89. Maybe not for long -- still good by the+grace+of+R'hllor · · Score: 2, Informative

    It's on Slashdot, so everyone does a few pages, finds out it's actually fairly tedious, and only a few will remain of the initial burst. They're at about 7000 for today right now, which is about 1000 more than what they've done so far this month. Don't build your site based on these estimates.

    Check back there in a few weeks to see how the site is doing. Hopefully quite well, since it is a splendid and worthwhile[1] effort.

    [1]: And only in the preview did I realize I sounded like that woman in the HHGTTG.

  90. Proofing FAQ by Wanker · · Score: 3, Informative
    Stop reading this
    And start reading a page!
    After that come back and you may continue();

    ...but first read the Proofing FAQ on the site and save yourself some confusion:

    http://texts01.archive.org/dp/faq/ProoferFAQ.html

    Especially read section 5 for some of their typesetting-to-ASCII conventions which would be non-obvious otherwise.

  91. Re:creative value by dvdeug · · Score: 2

    The already mentioned spelling modernization, for instance, is an example of a tangible modification to the Shakespeare texts over which Houghton-Mifflin can legitimately claim copyright.

    Sure. But an edition of Hemingway, where massive changes are neither needed nor expected, is slightly different.

  92. Re:Cantor, Hilbert, Gödel, Turing ... by dvdeug · · Score: 2

    Are these copyrighted?

    In the US, look at the dates of what they wrote. Most of Cantor and Hilbert are in the public domain, while Goedel and Turing are still under copyright. Unfortunately, math has always been penalty copy to typeset; the closest thing Project Gutenberg has to a real historical math text is Maxwell's On the Dynamics of a Top.

  93. Re:Cantor, Hilbert, Gödel, Turing ... by ninthwave · · Score: 2

    Amazon has Gödel in the UK at this link

    here it is

    That is if you want the original paper for its own sake; if you want it free, I don't know if it is out there.

    --
    I was thinking of the immortal words of Socrates, who said: "I drank what?" - Chris Knight (Val Kilmer)- Real Genius