Distributed Proofreaders Posts 5,000th E-book
bbc writes "Distributed Proofreaders has posted its 5,000th ebook to Project Gutenberg. The book, a Short Biographical Dictionary of English Literature, by John W. Cousin, was proofed for this special occasion by over 500 volunteers.
Distributed Proofreaders is a project that distributes the otherwise gargantuan task of correcting scanning and recognition errors in an OCR'ed text. The project has thousands of volunteers, of whom many hundreds are active on any given day. It is currently the main supplier of etexts for Project Gutenberg."
I am prowd to bee won off thows prewf reeders
....I guess the slashdot editors aren't members?
As I get older, reading texts on-screen gets easier. My vision is still 20/20, but I now require reading glasses, which are generally out of reach when I need them. Project Gutenberg has come in as a real lifesaver (well, sanity-saver) now that I'm turning into a geezer. That, and the price is perfect!
Neopets - the best free game on the Int
They should offer their services to authors and magazines, and raise some money from what they do. It wouldn't be enough to split between the involved proofreaders I guess, but the project itself could get some money to buy...well, whatever they might need. Perhaps they already do this; I'm too lazy to find out :-)
Martin
Wear can I apply? i have excellent grammer skills.
I am still on the 4986th book, this one isn't that good, but I have to finish it, oh, page 34, line 7 there is a mistake in the 4th word, I think you know it, yes.
Other than this I just found, the other 4985 are AOK so far.
Good work guys. Free the books. ook.
(re-reading Sourcery on the commute today... ook oook)
#hostfile 0.0.0.0 primidi.com 0.0.0.0 www.primidi.com 0.0.0.0 radio.weblogs.com
And since then they seem to have proofed another 52 books - that's not a bad rate considering...
"I think it would be a good idea" Gandhi, on Western Civilisation
Why must Slashdot get all excited when a number like 5000 pops up? I don't understand why everyone is so excited about numbers. I took my 500th shit this month, and you don't hear me calling the press, do you?
What about the 5001st book? Will that also yield a news item?
because Playboy hasn't lapsed into the public domain yet...
wait a few (read: lots of) years and you'll be seeing 'em tossed up there, editors dutifully rendering pictures into ASCII, etc.
"goodbye and hello, as always" ~Prince Corwin, from Zelazny's Amber series
The book, a Short Biographical Dictionary of English Literature, by John W. Cousin, was proofed for this special occasion by over 500 volunteers.
Hardly un-put-downable... I suppose the fact that it is a biography (shouldn't that be bibliography? *chuckle*) of English literature is kinda symbolic.
I guess this more than doubles the total number of people who have read this book though!
I like Gutenberg. I hope they start a system where you can download copyrighted books for a micropayment; I would pay good money for text ebooks.
Let's hope ebooks don't go the way of music: keep the costs low, with no DRM fluffing up the download. If you can click 3 times and start reading a new book, and it costs you euros, then you would prefer that to d/l gigs of warez.
Anyone who illegally downloads lots of books tends to be the person who doesn't read them much anyway (Someone boasted to me that they had 300 O'Reilly books, squirming under the desire to tell me that they were eBooks, off irc, oh lawks, what a riot, I wish I was your friend, go away)
...that a million net monkeys can fix the complete works of Shakespeare so that the language is spoken the correct way?
Instead of 'What light through yonder window breaks?' we get 'Who is that hot chick I can see through my binoculars?'
If I point out that you are incorrect, making me a foe does not make you any more correct.
I would have thought that if they were going to hold this up as a landmark that they would have picked a more notable book. I mean, I skimmed through that book and it seemed to have a lot of information and everything, but it's hardly something you would ever read all the way through.
IT is what you fail
I think it's really a shame that current copyright laws (and retroactive extensions) have limited Project Gutenberg to texts from a little after the turn of the century and before.
I just don't understand the point of retroactive copyright extensions. The idea behind copyrights, like patents, is to encourage innovation by allowing the creator an exclusive right for a limited time. If people believe copyright terms need to be extended to achieve this goal, fine. I disagree, but whatever. However, I think it's ludicrous that terms should be extended on works that have already been created, unless maybe they think that extending terms retroactively will lead to more works being produced in the past?
Still, I look forward to the day when someone starts digitizing the Mechanics' Institute Library in San Francisco. It's a beautiful private library one can join. The books are in excellent condition, and there are century-old original editions on the shelves.
But it's the magazine collection that's stunning. They have Popular Mechanics in bound volumes, all the way back to the beginning, when it was a serious scientific journal. All the major railroad magazines from the heyday of railroading. Every issue of Electric Railway Journal (the trade magazine of streetcars). Few other libraries kept that stuff.
Does anyone have suggestions for fiction titles on Gutenberg?
I need a good read, but I don't want to pay or find something good myself.
-judging another only defines yourself
All in all, I have to say that I think this project is better than nothing at all. I am sure that the proofreading is better than what was there before.
However, I am curious as to just how accurate the proofreading is. I think that they try to improve accuracy by having many different volunteers; accuracy in numbers and all that. However, just because many people think in a certain way does not mean that what they think is accurate. Just look at standardized tests. They are specifically designed to make use of common mistakes, so that the majority (the swell of the bell curve) all get the wrong answer together. Only a slim minority will get all the questions correct. Considering how many people (even educated people) get around average on even the verbal and English sections of such tests as the SAT, GRE, etc., I wonder if certain passages in books will be incorrectly edited on a mass scale. This would especially be true for older or more complex works.
Hello, Ano!
[n/t] :D
Just be aware that the Gutenberg archive is some 135GB, and much of it is gif, jpg and mp3 (spoken-word books). So I just used --include in rsync to download the .txt, .htm and .html files. It's a more manageable 10GB download.
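For anyone wanting to try the same thing, here is a rough sketch of how such a filter might be put together. It only builds the argument list; the source and destination are placeholders, since the post doesn't say which mirror was used, and the function name and the extra -av and --prune-empty-dirs flags are my own assumptions, not necessarily what the poster ran.

```python
import shlex


def build_rsync_args(source, dest, extensions=(".txt", ".htm", ".html")):
    """Build an rsync argument list that copies only files with the given extensions.

    source and dest are placeholders supplied by the caller; no real mirror
    address is assumed here.
    """
    args = ["rsync", "-av", "--prune-empty-dirs"]
    # Include every directory so rsync can still recurse into the tree.
    args.append("--include=*/")
    # Include each wanted file type.
    for ext in extensions:
        args.append(f"--include=*{ext}")
    # Exclude everything that no earlier include rule matched.
    args.append("--exclude=*")
    args += [source, dest]
    return args


if __name__ == "__main__":
    # Print the command instead of running it, so it can be checked first.
    print(shlex.join(build_rsync_args("SOURCE", "DEST")))
```

The order matters: rsync applies the first matching rule, so the includes have to come before the catch-all exclude.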
It's so Disney can keep milking Mickey Mouse.
Here's what I want to see:
You get automatic copyright for 25 years. After that, you must pay $1 per year to keep something in copyright. If you can't be bothered to keep track of your stuff and pay the $1, it lapses into the public domain.
Disney will pay the $1 for Mickey ($1 for Steamboat Willie, $1 for each other cartoon, $1 for each book, etc.). But forgotten gems, like ancient Apple ][ games, will become legal public domain items.
I'd actually like to see a hard limit of 50 years or so for copyright, but even if you can't get that, at least the above scheme makes a lot of stuff lapse into the public domain.
A cool feature: if the legal trail is tangled and murky, and no one knows who owns it anymore, no one will pay the $1 and it will fall into public domain. Let's say LSD Software wrote a fun game for the Commodore 64. Then ABC Games bought the game from LSD (who kept the rights to use the music in future games). Then ABC Games went under, but its assets were bought by PDQ Games, which later split into PDQ Software and Foo Bar Games. After that it gets REALLY complicated... anyway, after all that, who exactly owns that fun game? No one knows. It would take a court case to decide, but no one will bother so no one will ever know. Under the current system, you are technically a pirate if you keep the game, but there is no one you can pay a license fee to so that you legally have the game! Catch-22.
Heck, Disney should want this. They make big bucks by Disney-ifying public domain stuff, so they should make sure things will actually go into the public domain in the future.
...they don't use Microsoft Word
Even those who arrange and design shrubberies are under considerable economic stress at this period in history.
I think Project Gutenberg is a terrific idea!
My only complaint is with the formatting. Project Gutenberg uses hard formatting within the text. I think that's an extremely stupid idea.
There should be zero formatting within the text (other than paragraph breaks). Whatever client you're using should provide the formatting for you.
Let the client handle the presentation!!
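As a rough illustration of what "letting the client handle it" could look like, here is a minimal sketch that undoes hard line breaks and re-wraps each paragraph to whatever width the reader wants. The function name and the blank-line-means-paragraph rule are my assumptions; this would mangle verse, tables and anything else where the line breaks are meaningful.

```python
import re
import textwrap


def reflow(text, width=60):
    """Re-wrap hard-wrapped plain text to the given width.

    Assumes paragraphs are separated by one or more blank lines and that
    line breaks inside a paragraph carry no meaning.
    """
    paragraphs = re.split(r"\n\s*\n", text)
    reflowed = []
    for paragraph in paragraphs:
        # Collapse the hard line breaks and runs of spaces into single spaces.
        joined = " ".join(paragraph.split())
        if joined:
            reflowed.append(textwrap.fill(joined, width=width))
    return "\n\n".join(reflowed)


if __name__ == "__main__":
    sample = "It was the best of times,\nit was the worst of times.\n\nChapter two\nstarts here."
    print(reflow(sample, width=40))
```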
It seems to me that this project could have a large impact on OCR readers.
Think about it. You have thousands of volunteers poring over images, and then providing the corrected text (if necessary). Couldn't this also be used to "train" the OCR software to become better at identifying text?
If you log the image, the original OCR'd text, and the manually verified text you could use it in a test case for future OCR software.
I do this all the time when I write data validation/cleanup software. I run my input data through a program, capture the output, and manually verify that it is correct, making changes if necessary. I then use the two pieces of information in my test cases as a benchmark. If I introduce a bug in my code that causes something I already wrote to suddenly break, or output incorrect results, I know about it instantly. Works great with database correction code.
Maybe I'm simplifying this too much, but I sure hope someone is capturing all this great data. It could come in handy.
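To make the idea concrete, here is a minimal sketch of that kind of regression check. The clean function and the sample pairs are made up for illustration; in the OCR case the pairs would be the raw OCR text and the volunteer-verified text for the same page.

```python
def clean(text):
    """Hypothetical stand-in for the cleanup routine under test."""
    # Fix one common OCR confusion and normalise whitespace.
    return " ".join(text.replace("tbe", "the").split())


# (raw input, manually verified output) pairs captured from earlier runs.
GOLDEN_CASES = [
    ("tbe  quick brown fox", "the quick brown fox"),
    ("already clean", "already clean"),
]


def run_regression(func, cases):
    """Return the cases where func no longer reproduces the verified output."""
    failures = []
    for raw, expected in cases:
        actual = func(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


if __name__ == "__main__":
    for raw, expected, actual in run_regression(clean, GOLDEN_CASES):
        print(f"REGRESSION: {raw!r} gave {actual!r}, expected {expected!r}")
```

An empty failure list means the current code still agrees with everything that was verified by hand.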
I didn't realise this department existed at Slashdot.
But not as shocking as this
There are so many books there, how can you choose one to read?
A lot of books I download need to be proofread as well, usually books that are still copyrighted in my particular jurisdiction. I still haven't figured out what the best format is when I contribute my own books as well. HTML?
wow that is awesome community work indeed !
Chris ,
Php Programmers.
my brothers (Team-Lib) crossed that number many years back.
Keep up the good work anyway.
Everything's always been about money. There's just more money in it now.
Milo
Right now, we've got plenty of old math-intensive books ready to move through the DP system. Because of ASCII's terrible ability to handle equation formatting, we use TeX layout. The average DPer doesn't know TeX, and it has a rather steep learning curve to get started on. So, since Slashdot is full of self-professed geeks...all you TeX geeks should join up and help with the TeX formatted MATH texts. I've got plenty of books scanned and ready to go, so don't think you'll run us out of 'em any time soon!
JHutch
I would like to apologize to TPTB (The Powers That Be) at Distributed Proofreaders for messing up by posting this story to Slashdot.
The 5000th Posted celebrations were supposed to be internal. There is a discrepancy between works posted and books posted: sometimes a book gets split up. The big celebrations were intended for 5000 actual books posted.
I am afraid I got a little carried away, and hope Slashdot will still carry the real story of 5000 books posted to Project Gutenberg.
but it's good to finally get electronic versions of those books that are bought by the yard to fill the bookshelves in 'Bohemian' pubs and coffee shops. To round out the experience, download the text of these books and write a Perl script to 'pulp' them. One gripe - plenty of books by Abbott but none by Costello - call that a library?
--- Yx3 = Delilah ---
If the scanned images were made available after the books are "finished", then people would be able to make better scholarly use of the e-books. It is essential to have the "raw data."
Also, I often find what I think is an error, and it would be very convenient to check it against the scanned image, as going to the library or sending an email to someone else to have it checked is usually too much trouble.
One of the books I worked on was the "Anatomy of Melancholy" and I (conveniently) have a copy myself. There were often more differences between the scanned image of the page and my copy than between the scanned image and the proofread text.
Don't underestimate the amount of work people put into this, either - for "Anatomy of Melancholy" it often took 30 minutes to proof a single page because the page often had Latin and very small footnotes.
This previous story mentions a possible split with a company charging for all the books and taking the name. I see now that http://www.projectgutenberg.info/ doesn't seem to be selling books anymore, but www.worldebooklibrary.com is up. Did they give up the Project Gutenberg trademark?
Now, the really sad question is, how do you know either of these smells?
if not, give it a try:
1. Download a text (say, Alice's Adventures in Wonderland). The new site has a vastly improved interface, listing books in the available formats (always plain text, sometimes PDF, Palm Doc, TeX).
2. Have at it in your text reader of choice. If you are on the Mac, I highly recommend the free Tofu. It breaks the text into columns that are as high as the window. Navigate by shifting columns or pages of text. This simple change makes a huge difference when reading large amounts of text. It makes reading books on my laptop pleasant rather than an ordeal.
What about on other platforms? What are the best programs for reading etexts?
By the time the copyright got to 21 years it'd be over a million dollars to renew it, which would strongly encourage people to just let it go to the public domain. This way would also protect small time inventors/writers, since even at 7 years, it's only $64 to renew.
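The comment this replies to isn't shown here, but both figures fit a renewal fee that starts at $1 in the first year and doubles every year after. Taking that as an assumption, a quick sketch of the arithmetic (function names are mine):

```python
def renewal_fee(year, first_fee=1):
    """Fee in dollars to renew for the given year, assuming the fee doubles annually.

    Assumes year 1 costs first_fee dollars; this schedule is inferred from
    the $64 and million-dollar figures, not stated outright.
    """
    if year < 1:
        raise ValueError("year must be 1 or greater")
    return first_fee * 2 ** (year - 1)


def total_paid(year, first_fee=1):
    """Total paid to keep a work in copyright from year 1 through the given year."""
    return sum(renewal_fee(y, first_fee) for y in range(1, year + 1))


if __name__ == "__main__":
    for y in (7, 14, 21):
        print(f"year {y}: renew for ${renewal_fee(y):,}, total so far ${total_paid(y):,}")
```

Under that assumption year 7 costs $64 and year 21 costs $1,048,576, which lines up with the numbers above.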
The sending of this message pretty much inconveniences everyone involved.
Again, I think it's unlikely that two independent OCR processes (separate texts, separate scanners, separate OCR packages) would both make the same mistake, so the result is mostly trustworthy as long as they're in agreement.
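A minimal sketch of that cross-check, assuming you already have the two OCR outputs as plain strings: compare them word by word and flag only the places where they disagree, so a human only has to look at those. The function name is mine, and this says nothing about how either OCR package works.

```python
import difflib


def disagreements(ocr_a, ocr_b):
    """Return (text_a, text_b) pairs for every span where two OCR outputs differ."""
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Anything not tagged "equal" is a replace, insert or delete.
        if tag != "equal":
            diffs.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return diffs


if __name__ == "__main__":
    first = "It was the best of tirnes"
    second = "It was tbe best of times"
    for a, b in disagreements(first, second):
        print(f"check by hand: {a!r} vs {b!r}")
```

Where both outputs agree there is nothing to review; that only catches errors the two processes don't share, which is the assumption being made above.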
Computers are useless. They can only give you answers.
-- Pablo Picasso
You mean I've been grammar trolling slashdot, correcting the retarded writing of idiots on slashdot, when I could've been contributing to society in a meaningful way?
Crap!
How about a compromise. A format that uses plain text to store both data and metainformation.
You could also store documentation for this format and even source code (in a variety of languages) for a program that converts the metadocument into straight text. Then you won't have to worry about converting each of them painfully or worry about outdated formats.
And even if the worst happens and the format becomes outdated and unreadable, the text is still there, hidden in markup. It wouldn't be that hard for someone to reverse engineer it and write a small program to convert it to a more recent format.
With the markup data separated from the content, you can choose to only show certain portions of the content. It'd be much harder to filter plain text.
Plain text has no real advantage over a sensible format. HTML documents will be readable (maybe not pretty, but readable) just as long as plain text.
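For what it's worth, the "small program to convert it" really can be small. Here is a rough sketch using Python's standard HTML parser to throw away the tags and keep the text. The class and function names are mine, and it makes no attempt to skip script or style blocks or to preserve paragraph breaks.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def to_plain_text(markup):
    """Return the text content of an HTML fragment with all tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


if __name__ == "__main__":
    print(to_plain_text("<p>Call me <i>Ishmael</i>.</p>"))
```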
Of course, eventually, when ASCII dies and we all switch to a 32-bit character set (endorsed by our alien overlords), neither "plain text" nor any sensible format will be readable by anyone.
Heck, even switching to 16-bit Unicode would screw it up.
How about old scientific works, journals up to, say, the 1920s?
I know the more recent journal articles are copyrighted and therefore must have some lengthy protection on them, but what about classic old articles (like some of Einstein's work in the early 1900s)?
"Provided by the management for your protection."