Just One Page a Day

How do I get to plug my online website? by FortKnox · 2002-11-08 02:38 · Score: 1, Funny

Seriously? Just make a distributed system, put in PHP code, and make it all open source and free?

What's the criteria?

--
Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!

Re:How do I get to plug my online website? by Anonymous Coward · 2002-11-08 02:46 · Score: 2, Insightful

a wonderful resource for poor areas.

And where do the poor get online? In libraries.
D'oh!
Re:How do I get to plug my online website? by Anonymous Coward · 2002-11-08 02:59 · Score: 2, Funny

And where do the poor get online? In libraries.

Hey, shut the fuck up. This site is about technology for technology's sake. We talk about humanitarian things just to justify it to our own conscience to relieve the guilt. Don't make us think logically!

Remember, it's TECHNOLOGY = GOOD. WE ARE FUZZY BUNNIES THAT LOVE EVERYONE AND THINK WE'RE COOL 'CAUSE WE WRITE "HELLO, WORLD" IN C.

Stop reading this by XiC · 2002-11-08 02:38 · Score: 5, Insightful

And start reading a page!
After that come back and you may continue();

Re:Stop reading this by H0ek · 2002-11-08 07:01 · Score: 3, Insightful

In fact, I feel it would be a Good Thing(tm) for our friendly Slashdot host to stick the link to this project into their Quick Link section on the main page.

Of course, I've already bookmarked the page, but that's on one machine. What happens six months down the line when I need to rebuild my bookmarks? Search for the article on Slashdot? Ick.

--
H0ek
Think you're smart? Prove you've got brains!

And you ask the /. community.. by Harald74 · 2002-11-08 02:39 · Score: 5, Funny

... which is renowned for it's spelling prowess? ;)

--
A)bort, R)etry or S)elf-destruct?

Re:And you ask the /. community.. by Textbook+Error · 2002-11-08 02:46 · Score: 5, Funny

for it's spelling

Or grammer... :-)

("it's" == "it is", "its" == possessive form)

--

Nae bother
Re:And you ask the /. community.. by tswinzig · 2002-11-08 02:52 · Score: 5, Funny

... which is renowned for it's spelling prowess? ;)

Are you kidding? With the number of people bitching about grammar and spelling in the comments, you just know there's a pool of talent here!

(BTW, there's no apostrophe in the possessive form of "its.")

--

"And like that ... he's gone."
Re:And you ask the /. community.. by Harald74 · 2002-11-08 02:54 · Score: 1

(BTW, there's no apostrophe in the possessive form of "its.")

Yeah, I know. I was being, uh, ironic. Yeah, that's it. Ironic.

--
A)bort, R)etry or S)elf-destruct?
Re:And you ask the /. community.. by Skirwan · 2002-11-08 02:55 · Score: 4, Funny

And you ask the /. community..
... which is renowned for it's spelling prowess? ;)
Is anyone else somewhat dismayed by the fact that the post pointing out our collective poor grammatical skills has a spurious apostrophe?

:)

--
It's past the blind leading the blind; this is the blind and deaf leading the stupid.
Re:And you ask the /. community.. by jaymz666 · 2002-11-08 02:59 · Score: 2, Funny

then let's not forget that grammar has no e
Re:And you ask the /. community.. by orthogonal · 2002-11-08 03:07 · Score: 4, Funny

... which is renowned for it's [sic] spelling prowess? ;)

Not to mention it's [sic] excellence at spotting grammatical errors.

--
Opinions on the Twiddler2 hand-held keyboard?
Re:And you ask the /. community.. by donutz · 2002-11-08 03:08 · Score: 2

Not to mention the incomplete ellipsis on the subject line. Of course, maybe that's just a little too picky...
Re:And you ask the /. community.. by tswinzig · 2002-11-08 03:10 · Score: 5, Funny

for it's spelling

Or grammer...

Or spelling?

--

"And like that ... he's gone."
Re:And you ask the /. community.. by Erasei · 2002-11-08 03:13 · Score: 3, Funny

What's even scarier is that there are this many comments telling a person that he is wrong when he so isn't. I mean, come on guys, even the Flowers know the real way to use the apostrophe: http://angryflower.com/bobsqu.gif

--
visit my free wallpaper collection, wp.erasei.com
Re:And you ask the /. community.. by Anonymous Coward · 2002-11-08 03:16 · Score: 5, Funny

Or sense of humour?
Re:And you ask the /. community.. by Uma+Thurman · 2002-11-08 03:24 · Score: 1

then --> Then

--
This is America, damnit. Speak Spanish!
Re:And you ask the /. community.. by leuk_he · 2002-11-08 03:45 · Score: 2

ANd then you wonder what the goat.cx are doing in the ilias?

more serious how do they fight off the trolls?
Re:And you ask the /. community.. by Surak · 2002-11-08 03:47 · Score: 1, Redundant

*sigh*

You guys are PATHETIC.

The possessive of "it" is "its". So:

"The cat got out of its bag."

*NOT* "The cat got out of it's bag."

The only time you use an apostrophe is when you are doing a contraction of "it is", i.e.,

"It's the cat that got out of the bag." (as opposed to the dog ;)

I'm sending you *all* back to grammar school, including the twit that marked this post as "insightful". Go on, get going.

--
My journal has hot /. gossip.
Re:And you ask the /. community.. by CaseyB · 2002-11-08 04:03 · Score: 5, Informative

I know you're joking, but in reality it doesn't matter how good your spelling is. In fact, I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work. This project is about correcting OCR errors, not spelling / grammar.
Re:And you ask the /. community.. by jbrownc1 · 2002-11-08 04:08 · Score: 1

...or grammar...
Re:And you ask the /. community.. by Desult · 2002-11-08 04:18 · Score: 1

LUCKILY, proofreading in this case is mainly comparing the difference between Image and Interpretation of Image (Source and OCR Text). Grammatical and spelling errors are intended to be left in text, because in many cases they may have been intended by the authors (stylistic choice, or to imply dialect, etc).

EVERYONE knows Slashdot readers are the best people on the planet to tell the difference between reality and image! Hell, every time we're warned about it, it seems 300 people let us know that the warning of FUD is in fact FUD. And then another 50 to warn us that the warning of FUD warning of FUD is in fact FUD.

So, in summary, proofread on, oh ye paragons of insight and watchfulness.

-Greg

--
-Greg
Re:And you ask the /. community.. by Erasei · 2002-11-08 04:18 · Score: 2

Come on people, have a sense of humor here. This case doesn't need an apostrophe because it's a possessive pronoun. If it were a noun, then Yes, it would need an apostrophe. So it's in this case is incorrect because that rule doesn't apply to pronouns. If "it" in this case were a proper name (like Stephen King's book title, It), then it _would_ need an apostrophe, to show possession.

Reference: http://owl.english.purdue.edu/handouts/grammar/g_apost.html

Make any sense? :)

--
visit my free wallpaper collection, wp.erasei.com
Re:And you ask the /. community.. by InUse · 2002-11-08 04:25 · Score: 1

Or using "test if equal to", rather than "set equal to" ("==" != "=").
Re:And you ask the /. community.. by Scaba · 2002-11-08 04:30 · Score: 2

The best cure for bad writing is Strunk & White's The Elements of Style.
Re:And you ask the /. community.. by JoeBuck · 2002-11-08 05:42 · Score: 4, Insightful

Since Project Gutenberg can only publish books whose copyright has expired, it's quite likely that a spelling "error" may instead reflect language evolution, that is, a change in the way words are spelled over time.
Re:And you ask the /. community.. by tnak · 2002-11-08 05:50 · Score: 1

nope. they're right and you and the poster are wrong. the apostrophe is used to indicate possession except in the single (AFAIK) case of "it". It's is the contraction of "it is". Its is the possessive of it.
Re:And you ask the /. community.. by dvdeug · 2002-11-08 05:57 · Score: 2

I would imagine that any spelling errors found in the text should be reproduced intact, in the interest of accurately representing the original work

Usually, Project Gutenberg volunteers correct spelling errors where they are obviously errors. The original work, as in what the author intended, is usually more interesting than that physical edition, which to reproduce we'd really need to keep page numbers and other junk.

For one example, my current project is a cookbook published in the 1730's, and so far I've corrected Apricocr to Apricock and Lemon to Lemmon; in both cases the form I corrected it to was overwhelmingly used in the text.
Re:And you ask the /. community.. by gTsiros · 2002-11-08 06:23 · Score: 2

you mean spalling, of course!

--
Looking for people to chat about multicopters, coding, music. skype: gtsiros
Re:And you ask the /. community.. by pmz · 2002-11-08 06:38 · Score: 2

This project is about correcting OCR errors, not spelling / grammar.

Yes. I remember reading that Tolkien had some trouble with editors who thought they could spell better than he did. IIRC, it was a real mess at first.

So, to reiterate, don't second-guess the authors' intents.

--
Healthcare article at Kuro5hin
Re:And you ask the /. community.. by Greedo · 2002-11-08 06:49 · Score: 3, Insightful

For one example, my current project is a cookbook published in the 1730's, and so far I've corrected Apricocr to Apricock and Lemon to Lemmon; in both cases the form I corrected it to was overwhelmingly used in the text.

"Apricocr" I can see being a legitimate typo, but perhaps in converting "Lemon" to "Lemmon", you are eradicating one of the earliest uses (intentional or not) of the now-current spelling.

My personal opinion -- and yes, everyone on /. did ask for it -- would be to leave the spelling and typos intact, if the goal is to preserve literary creations. You are potentially losing information by changing it.

Ask anyone who has studied the First Folio of Shakespeare about the importance of spelling.

(And just in case you don't have a Shakespeare scholar handy: since Shakespeare's plays were almost always written down after they were first performed (and written down by someone else), there are many clues to the original performance in how certain words are spelled, capitalized and how sentences are punctuated. Hamlet's "What a piece of worke is a man" is a good example of this.)

--
Tuus crepidae innexilis sunt.
Re:And you ask the /. community.. by dvdeug · 2002-11-08 07:15 · Score: 2

Ask anyone who has studied the First Folio of Shakespeare about the importance of spelling.

Okay, now ask the high-school student who read Shakespeare in high-school if it would have been more fun to have got weird spelling in addition to weird vocabulary and grammar.

I understand the importance of stuff like that to the linguist (the last thing I skimmed while scanning in was The Roman Pronunciation of Latin, where they dissect how Latin was spoken by the writings), but my primary audience is not the linguist. I'm sure there's information to be gained from the italics and hyphenation I'm not transcribing either. Fortunately for the linguist, it was released in a facsimile edition in '83 that shouldn't be too hard to get a hold of; alternately, Project Gutenberg has taken to storing the images, and these will get filed away with them for those interested.
Re:And you ask the /. community.. by Hater's+Leaving,+The · 2002-11-08 09:39 · Score: 1

The _reasons_ you give are bogus. I'm sure you know where to put an apostrophe, but you either don't know the reasons or don't want to make them known to others.

Apostrophisation represents the elision of one or more letters. Period. The genitive form in OE used to have a vowel, almost always an "e", before the final "s" in the words that we now spell with "'s".

Bloke named John. Well it used to be "Johnes cloak". And just like "John is fat" becomes "John's fat", "Johnes cloak" becomes "John's cloak". Missing letters; apostrophisation.

Pronouns are somewhat of a red herring, alas.
I kind of agree with you about pronouns, but look at "ones". "Ones" never had an apostrophisation, but over the years (centuries), an apostrophe was put in, so people with no respect for logic now think the correct third person neutral singular personal pronoun's genitive form is "one's".

Following the same pattern of language change, there's no reason to not fear that in 200 years the /correct/ form of the possessive "its" will be "it's".

Sad but true.

THL.

--
Keeping /. cynic density high since the fscking Kwhores/trolls arrived.
Re:And you ask the /. community.. by Hater's+Leaving,+The · 2002-11-08 09:41 · Score: 1

his, hers, theirs, yours

its ain't so special.

--
Keeping /. cynic density high since the fscking Kwhores/trolls arrived.
Re:And you ask the /. community.. by Wanker · 2002-11-08 11:44 · Score: 2

This project is about correcting OCR errors, not spelling / grammar.

I heartily agree with this. Any speling erros I find will be left in place. ;-)

After running through a few pages, it seems that most of the problems are quotes and spacing, which are understandably difficult for OCR to sort out. In all honesty, the OCR they're using seems to be pretty good. It's ignoring the noise nicely and converting to quite readable text.

The issues seem to be things like:

"Bob, come here,"she said softly,"I want you over here."" Can't, honey,"he said,"I'm glued to the handrail."

Clearly that needs some spaces added to clear it up. Although there seems to be some disagreement about whether to space after a comma or not, I've elected to add the space in my proofs:

"Bob, come here," she said softly, "I want you over here." "Can't, honey," he said, "I'm glued to the handrail."

Now I was taught that a new speaker should start a new paragraph, which would avoid lots of these issues, but the author didn't do that in the book I was proofing.
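
A minimal sketch of the spacing fix described above, in Python. It is not the project's actual tooling; the function name `fix_quote_spacing` is made up for illustration. It assumes straight double quotes that strictly alternate between opening and closing, so an unbalanced quote on a page would throw off everything after it and a human still has to check the result against the scan.

```python
def fix_quote_spacing(text: str) -> str:
    """Normalise spacing around straight double quotes.

    Assumption: quotes strictly alternate open/close within `text`.
    """
    out = []
    inside = False  # True while we are between an opening and closing quote
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if ch != '"':
            out.append(ch)
            i += 1
            continue
        if not inside:
            # Opening quote: one space before it, none after it.
            if out and not out[-1].isspace():
                out.append(" ")
            out.append('"')
            i += 1
            while i < n and text[i] == " ":
                i += 1
            inside = True
        else:
            # Closing quote: no space before it, one space before the next word.
            while out and out[-1] == " ":
                out.pop()
            out.append('"')
            i += 1
            if i < n and (text[i].isalnum() or text[i] == '"'):
                out.append(" ")
            inside = False
    return "".join(out)


raw = '''"Bob, come here,"she said softly,"I want you over here."" Can't, honey,"he said,"I'm glued to the handrail."'''
print(fix_quote_spacing(raw))
```

Run on the OCR sample above, this should reproduce the hand-corrected version. Apostrophes are left alone because only the double-quote character is touched.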
Re:And you ask the /. community.. by psamuels · 2002-11-08 17:54 · Score: 1

Since Project Gutenberg can only publish books whose copyright has expired,

What??? Copyright can expire? I thought Congress did away with that sort of thing.

--
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README

Excellent by drhairston · 2002-11-08 02:41 · Score: 2, Flamebait

After some consideration, I propose that this system should be applied to Slashdot stories! Each Slashdot story, after being submitted by an editor, should be reviewed by at least two readers before being posted in order to correct inadvertent spelling mistakes and story duplicity. Thank you sir, for inspiration!

--
Dr. Joseph Hairston
Superintendent, CCBC

Re:Excellent by cyborch · 2002-11-08 02:52 · Score: 1

After some consideration, I propose that this system should be applied to Slashdot stories!

after some consideration?!?! how does that take consideration? with the state of our collective spelling skills we need to apply it immediately!
Re:Excellent by phil+reed · 2002-11-08 03:00 · Score: 1

http://www.kuro5hin.org

--

...phil
"For a list of the ways which technology has failed to improve our quality of life, press 3."
Re:Excellent by Draoi · 2002-11-08 03:11 · Score: 3, Funny

in order to correct inadvertent spelling mistakes and story duplicity
Not to mention malapropisms!! :-)
http://www.dictionary.com/search?q=duplicity&db=*
I like the first definition better!

--
Alison
"It is a miracle that curiosity survives formal education." - Albert Einstein
Re:Excellent by Anne_Nonymous · 2002-11-08 03:42 · Score: 1

Actually some of the stories around here could stand to be a little less duplicitous too.
Re:Excellent by LordKronos · 2002-11-08 03:44 · Score: 1

So are you suggesting that at least 2 people should read more than just a headline before anyone is allowed to post comments?
Re:Excellent by psamuels · 2002-11-08 17:59 · Score: 1

Not to mention malapropisms!! :-)

Boy, I wish I hadn't used up those mod points earlier tonight. That's the funniest post I've read in, ummm, at least two days.

--
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README

Just one page a day? by Adam+Rightmann · 2002-11-08 02:42 · Score: 5, Funny

Sounds like Gary Condit's plan for extramarital affairs.

--
A. Rightmann

Re:Just one page a day? by indiigo · 2002-11-08 03:05 · Score: 4, Funny

And Bill Clinton did contain himself, except it was one page every day!

--
fslg503-985-8686503-985-8686503-985-8686503-985-86 8650 3-985-fdsg8686503-985-8686503-985-8686503-9

OCR Software by Zach+Garner · 2002-11-08 02:42 · Score: 4, Interesting

Is there any worth-while open source OCR software? How about reasonably priced closed source OCR software for *BSD or Linux?

Re:OCR Software by Anonymous Coward · 2002-11-08 02:53 · Score: 4, Informative

Generally not used at dp. Mostly uses Abbyy Fine Reader (www.abbyy.com) which is commercial.

gocr (http://jocr.sourceforge.net/) is open-source, and includes interesting bits like deskewing.

As a proofreader, I really appreciate the best ocr, and the free guys are not the best.
Re:OCR Software by Anonymous Coward · 2002-11-08 03:11 · Score: 2, Insightful

>Just get just about any scanner - it'll almost certainly come with free OCR software.

Generally not nearly as good as the top two (Scansoft (http://www.scansoft.com/sdk/: seems to have engulfed the Xerox/Textbridge and Caere/Omnipage technologies), ABBYY).

When you scan for public use, think about the time of *other people* you waste if your OCR is not optimal or your scans are off-register/ skewed etc.
Re:OCR Software by Simonetta · 2002-11-08 09:12 · Score: 1

I love scanning books. I have done about four so far. Recently I got a UMAX Astra 3400 USB(1.1) scanner and Abbyy FineReader 6.0 (from Kazaa and the working krak from www.cracks4u.com) and have been impressed at the speed at which I can scan books. I recommend that everyone scan their favorite books and share them.
The results from the OCR are OK but not good. I have been learning VBA in order to assemble a series of Word Macros that will analyse and correct many of the scanning spelling and formatting errors with minimum effort on the user's part.
It is important to digitize books regardless of the current absurd copyright laws. This is our cultural heritage. No one has the right to tell us that we can't do this.
Thank you for your consideration of my perspective.

Obvious... by OrangeSpyderMan · 2002-11-08 02:42 · Score: 5, Funny

I'm shure that buy askin teh Salshdot crowd (esp. the editturs) to help, yule improove jamatically teh kwality off you're output.

:-)

--
Try NetBSD... safe,straightforward,useful.

Re:Obvious... by nogoodmonkey · 2002-11-08 03:01 · Score: 1

How did anybody understand what he said to mod him up? Maybe I should start posting in l33t to get better moderation on my posts.
Re:Obvious... by Otter · 2002-11-08 03:28 · Score: 2

See, unlike the other people making the same point, OrangeSpyderman had the good sense to intentionally misspell most of his words so any unintentional misspellings or grammatical errors will be lost in the noise and go unflamed.

It's like that stegosaurus encryption.

With all the nitpicking, isn't anyone going to bitch at Michael for leaving the "Thank you, Charles Franks" in the submission for no apparent reason?

--
What I'm listening to now on Pandora...
Re:Obvious... by carlos_benj · 2002-11-08 03:32 · Score: 1

That wasn't l33t! It was kind of a cross between the Anguish Languish and poor spelling....

--
--

As a matter of fact, I am a lawyer. But I play an actor on TV.
Re:Obvious... by OrangeSpyderMan · 2002-11-08 03:35 · Score: 2

Intentionally, yeah that's right. :-)

--
Try NetBSD... safe,straightforward,useful.
Re:Obvious... by nogoodmonkey · 2002-11-08 03:54 · Score: 1

No, I was just stating that if I type in a format that is hard for people to read, maybe I will get modded up. I wasn't saying that he was speaking in l33t.
Re:Obvious... by kiscica · 2002-11-08 04:49 · Score: 1

Funny you should mention that Anguish Languish site. Now there is an OCR'd text in need of some serious proofreading. Already in the introduction we read:

Egervescent further delerent saturations an witch way harem, wade hei[er haliver tam sang [...]

which should of course read:

Effervescent further deferent saturations an witch way harem, wade heifer haliver tam sang [...]

Proofreading that e-text might be a little bit of a challenge without the original book (which I happen to own, though I'm not volunteering for the job!), given the nature of the text...

Kiscica
Re:Obvious... by Myco · 2002-11-08 06:13 · Score: 2

Hang on... which words were misspelled?

--
My deviantArt site
Re:Obvious... by carlos_benj · 2002-11-12 02:18 · Score: 1

I was only familiar with 'Ladle Rat Rotten Hut' before stumbling onto the site. I think someone who 'gets it' could probably do it without the book. Think of it as a puzzle to be solved.

--
--

As a matter of fact, I am a lawyer. But I play an actor on TV.

Re:Legal Implications by phil+reed · 2002-11-08 02:42 · Score: 2, Informative

I can't decide if this is a joke or not.

You do know about Project Gutenberg, right?

--

...phil
"For a list of the ways which technology has failed to improve our quality of life, press 3."

Re:Legal Implications by Junta · 2002-11-08 02:42 · Score: 4, Informative

The only works that go into PG are works in the public domain. While publishers sell dead-tree copies still, they have no copyright over the original text contained within. (Which is why these works are typically available through multiple publishers.)

--
XML is like violence. If it doesn't solve the problem, use more.

Copyright is not an issue by ardmhacha · 2002-11-08 02:43 · Score: 5, Informative

Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay but you won't find the latest Stephen King.

Re:Copyright is not an issue by Twylite · 2002-11-08 03:41 · Score: 3, Informative

Sadly, copyright is an issue in this sort of work. Just because Dickens' works are no longer copyright, doesn't mean you can go and pull a Dickens novel off the library/bookstore shelf and OCR it. Publishers tend to be careful to make slight alterations to the text here and there (formatting, spelling, some clarifications and corrections) which turns a copyright-expired work into a derived work over which they own the copyright. Shitty, isn't it?

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
Re:Copyright is not an issue by Anonymous Coward · 2002-11-08 03:51 · Score: 2, Insightful

Well actually only the alterations would be copyrighted, not the entire work. Only the original author can create a derivative work that is fully covered by copyright. Usually the publishers add a new foreword of absolutely no worth. If you take out that foreword and copy only the original text it would be hard for them to prove otherwise. The only sticking point is translations of foreign work. You won't find a lot of Kafka in there (I found only Metamorphosis) because a lot of his stuff was translated only after WW II. The translations are basically new works and are copyrighted as of the date of translation.
Re:Copyright is not an issue by Twylite · 2002-11-08 04:15 · Score: 2

I'm afraid I can't find the original source where I was reading about this, but the problem extends to the text, mainly because it is not the original text. Shakespeare is good example, because most modern publications are not true to the original works: oldde englishe wordes have been changed into modern equivalents, and phrases here and there have been updated to ones we can understand today.

You are correct in saying that the publisher's copyright (in such cases) is over the modifications only; but it can be very difficult to determine which parts have or have not been modified. Typically, you need an old copy of the original work, which means you can't pick up a modern publication from your library or bookstore.

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
Re:Copyright is not an issue by Software · 2002-11-08 04:49 · Score: 2

**sigh** I wish there was a mod category for "wrong" or "what the hell are you talking about" or "FUD", because I'd mod this instead of posting. I have never seen a publishing company make trivial changes to a work and claim copyright on it. I don't mean "The Wind Done Gone" or a major work like that. When I see classic books, I check the copyright page. They always say "Foreword (c) 2002 John Doe", but I have never seen that copyright was claimed over a whole work. I think a company that intentionally misrepresented an altered work as that of a famous author would be liable to fraud charges.
Please provide specific examples of this, so that I can be proved wrong. Please give the ISBN and perhaps a link to an online bookseller.
Re:Copyright is not an issue by p3d0 · 2002-11-08 04:58 · Score: 2

Nope. See #6 on this list.

--
Patrick Doyle
I mod down every jackass who puts his moderation policy in his sig. Oh, wait a sec....
Re:Copyright is not an issue by j-beda · 2002-11-08 05:14 · Score: 2

That just speaks to making derivative works of things that are copyrighted (such as fan fiction). It is certainly not clear to me how this affects derivative works of public domain material.
Re:Copyright is not an issue by David+Jao · 2002-11-08 06:02 · Score: 2

**sigh** I wish there was a mod category for "wrong" or "what the hell are you talking about"
I wish there was too, but in this case you're the one who is wrong.
You must not have a very large sampling of classic books. Almost all classic books in my collection have copyright asserted by the publisher.
Please provide specific examples of this, so that I can be proved wrong. Please give the ISBN and perhaps a link to an online bookseller.
Here's one: The Riverside Shakespeare, ISBN 0-395-04402-2, which says the following on the inside title page.
Copyright © 1974 by Houghton Mifflin Company.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying and recording, or by any information storage or retrieval system, without permission in writing from the publisher.
Re:Copyright is not an issue by dvdeug · 2002-11-08 06:03 · Score: 2

They always say "Foreword (c) 2002 John Doe", but I have never seen that copyright was claimed over a whole work

That's weird. I almost never see that; usually I'll see a plain 'Copyright (c) 2002 John Doe' that obviously should only cover an introduction or something, but never mentions that. At an extreme, I've seen books that were photocopies of the originals, and nothing else, that claimed copyright.

I think a company that intentionally misrepresented an altered work as that of a famous author would be liable to fraud charges.

It is true that when you read Shakespeare, what you read doesn't look like what was originally printed. Modern editions of Shakespeare have updated the spelling and made it consistent, and typeset it in modern forms (without the long s, for example).
Re:Copyright is not an issue by Junta · 2002-11-08 07:02 · Score: 2

That does not necessarily mean that it is legally enforceable. I could, for example, say that I require one dollar payment from anyone who reads this comment. Just because I said so, does not make it true. Even if a modern work, that short statement is not enforceable, as it implies no fair use allowances, so while by their statement an academic copy is forbidden, law disagrees.

There are a lot of cases where companies know very well what they can and cannot enforce, but will still at least do their best to make the customer *think* they have no rights. A prime example are the warnings on tapes/dvds that say no copy may be made under any circumstances. If you were dragged into court for making a copy for backup purposes and can prove you have the original and did not distribute copies, you would be let off, even though the warning would have you believe the FBI will bust in with guns drawn should you ever think so. You know those trucks with the bumper sticker "not liable for windshield damage"? They are indeed liable, the sticker has as much meaning as writing 'not liable for property damage, personal injury, or death' on a gun and using it to kill people.

The practice of taking companies' legal statements, disclaimers, EULAs, and warnings as absolutely truthful has caused a great deal of misinformation among the public. The large percentage of the population that does not think they have a legal right to make personal copies of movies and music they own, for example. If their word was true, how are other publishing companies publishing those works without deals with Houghton Mifflin?

--
XML is like violence. If it doesn't solve the problem, use more.
Re:Copyright is not an issue by kippy · 2002-11-08 09:22 · Score: 1

Of course you won't find Stephen King there.

Didn't you hear on public radio that he died?

--
Blaze a trail to the New World
Re:Copyright is not an issue by CoughDropAddict · 2002-11-08 13:49 · Score: 2

Hot damn, if derived works aren't (c) their creator, why don't you set up a website offering Disney movies for download? See if Disney and the courts agree with you.

Your link only addressed the fact that creating derived works from works still in copyright requires permission from the author of the existing work. It doesn't claim that the author of the derived work gets no copyright for the derived work.

Re:Legal Implications by jallen02 · 2002-11-08 02:43 · Score: 1

Project Gutenberg specifically deals with texts that are not copyrighted. So it is all legit. :)

Jeremy

Re:Legal Implications by Chundra · 2002-11-08 02:43 · Score: 2

Not when the authors have been dead for 300 years.

Re:Legal Implications by seizer · 2002-11-08 02:43 · Score: 4, Informative

It helps if you read the FAQ list.

Due to copyright laws, it is only legal to do this with older books (copyrighted 75 or more years ago). As a result, Project Gutenberg is mostly comprised of the "Classics."

Wow, what a scary thought by TheConfusedOne · 2002-11-08 02:44 · Score: 5, Funny

Imagine the kids 200 years from now reading |-|uc||_3b3rry F1|\||\|.

(That hurts my brain just trying to type it in...)

--
--- I wish I could hear the soundtrack to my life. That way I'd know when to duck.

Re:Wow, what a scary thought by foistboinder · 2002-11-08 02:50 · Score: 2, Funny

|-|uc||_3b3rry F1|\||\|.
I must get out more - I was actually able to figure that out!

--
Yet Another Web Site
Re:Wow, what a scary thought by TheConfusedOne · 2002-11-08 05:26 · Score: 1

It was supposed to be Huckleberry Finn. I screwed up the K by not HTML encoding the less than sign. (Of course I said "Plain Old Text" and it still choked on it.) Trust me it took awhile for me to get that close to 31337 speak myself. :-}

--
--- I wish I could hear the soundtrack to my life. That way I'd know when to duck.
Re:Wow, what a scary thought by Anonvmous+Coward · 2002-11-08 05:43 · Score: 2

"|-|uc||_3b3rry F1|\||\|.

I must get out more - I was actually able to figure that out!"

Ouch. You just violated the DMCA!
Re:Wow, what a scary thought by Jace+of+Fuse! · 2002-11-08 12:30 · Score: 2

Anything more complicated than 733T H@x0rz and I get lost...

Obviously, since the correct spelling is 1337 h4x0rz ...

Oh wait...

--

"Everything you know is wrong. (And stupid.)"

Moderation Totals: Wrong=2, Stupid=3, Total=5.

I'm guessing... by actor_au · 2002-11-08 02:44 · Score: 1, Funny

"The Road Ahead" will not be included, at least in this round of distributed OCRing.

--
Read Errant Story.

Re:I'm guessing... by JUSTONEMORELATTE · 2002-11-08 03:50 · Score: 2
The odd thing is what Amazon chose to recommend to me when I view the page for The Road Ahead:
Customers who shopped for this item also wear:
- Clean Underwear from Amazon's Eddie Bauer Store
- Ladybug Rain Boots from Amazon's Nordstrom Store
- Suede Headwraps from Amazon's International Male Store
- Cheetah Print Slippers from Amazon's Old Navy Store
--
I used to be creative, now I'm merely observant.

A better use of time by Apreche · 2002-11-08 02:45 · Score: 5, Insightful

I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN get better. If we just write one line of code a day each we'll have better OCR in no time.

--
The GeekNights podcast is going strong. Listen!

Re:A better use of time by Anonymous Coward · 2002-11-08 03:02 · Score: 1, Informative

>I think a better use of time would be to have all these programmers here develop a better OCR.

Maybe. OCR has improved to the level that is better than re-typing. Still averages more than an error a page, 'though. And is a hard problem.

The most successful recent hacks on dp have been further exploiting the output of existing OCR (thanks Aldorando) to do things like handle end-of-line dashes (mostly) automatically.
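
To make the end-of-line dash handling concrete, here is a minimal sketch in Python. It is not the code used on dp; the function name `join_eol_hyphens` and the heuristic (a line ending in letter plus hyphen, followed by a line starting with a lowercase letter) are assumptions for illustration only.

```python
import re

def join_eol_hyphens(lines):
    """Rejoin words that were split by a hyphen at the end of a line.

    Assumed heuristic: if a line ends in letter + '-' and the next line
    starts with a lowercase letter, treat it as one split word. Genuine
    compounds broken at a line end would be joined wrongly, which is why
    this only works "mostly" and a human still checks the page.
    """
    out = []
    for line in lines:
        line = line.rstrip()
        if out and re.search(r"[A-Za-z]-$", out[-1]) and re.match(r"[a-z]", line):
            # Move the first word of this line up to complete the split word.
            head, _, rest = line.partition(" ")
            out[-1] = out[-1][:-1] + head
            line = rest
        out.append(line)
    return out

print(join_eol_hyphens(["the proof-", "reading went well"]))
```

A heuristic like this gets the common case right, which matches the "(mostly)" above.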
Re:A better use of time by Anonymous Coward · 2002-11-08 03:04 · Score: 1, Insightful

I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN get better. If we just write one line of code a day each we'll have better OCR in no time.

Great idea! Allow me to offer this line today:

$legible_book_copy = getPerfectOCR($famous_book);

Now someone just needs to implement the simple function, getPerfectOCR().
Re:A better use of time by scottcain · 2002-11-08 03:05 · Score: 2, Informative

Perhaps, but the page I just proofed was from a book published in the 1850's, so it was not the best image quality, and still the OCR did a great job. The most common mistake I corrected was converting I's to !'s. It got right things that I had to look at pretty closely to make sure they were right.
Re:A better use of time by SteakJerky.com · 2002-11-08 03:22 · Score: 2, Insightful

Even with fantastic OCR, there will be some small errors out there so a human double check is a great idea. If Project Gutenberg isn't a great reason to buy a PDA, I don't know what is. It's a huge library of great books ready to be read in the lunch line, on the bus, in the john...
Re:A better use of time by rixster · 2002-11-08 03:50 · Score: 2

I'll do the first bit and last bit for you...

sub getPerfectOCR
{
    # No empty prototype here: the sub takes the raw scan as its argument.
    my $raw_data = shift;

    my $completed_text;

    # 1. Process
    # 2. ???
    # 3. Profit

    return $completed_text;
}

--
Two wrongs may not make a right, but three ....
Re:A better use of time by carlos_benj · 2002-11-08 04:33 · Score: 1

Oooh. You shouldn't read in the john. It'll cause Hemmoroi...
Hammer...
Hemroi...

It'll make your anus hurt.

--
--

As a matter of fact, I am a lawyer. But I play an actor on TV.
Re:A better use of time by wkitchen · 2002-11-09 14:10 · Score: 1

I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN get better. If we just write one line of code a day each we'll have better OCR in no time.
Oh great! A wealth of classic literature and no one but our computers will read it!

Re:Book Pirating? by phil+reed · 2002-11-08 02:45 · Score: 2, Informative

So are the books they are digitizing all in the public domain?

Yup.
It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net.

How do you suppose they make it to the net? Most of the public domain books were written before word processors, so there's no electronic text around.

Of course I could be wrong.

Yeah. Go look at Project Gutenberg's site - think of it as your homework assignment for the weekend.

--

...phil
"For a list of the ways which technology has failed to improve our quality of life, press 3."

Prufe reed? by conduit4 · 2002-11-08 02:46 · Score: 1, Redundant

Y wood any 1 nede sum one too prufereed there buk. Eye du fyne bye myselph.

Re:copyrights? by A+Commentor · 2002-11-08 02:46 · Score: 2

The 'Project Gutenberg' is about making old books that have (finally) fallen into public domain available to whoever wants it. Those are the books I'm sure that they want to have proofed.

--

Looking for any old 8-bit Heathkit/Zenith software/hardware - http://heathkit.garlanger.com

server test under load by lovebyte · 2002-11-08 02:46 · Score: 2, Funny

Instead of proofreading the books, I think this guy is asking for his new server setup to be tested!

--

I'll do it for cheesy poofs.

Re:server test under load by Inkwina · 2002-11-08 06:13 · Score: 1

I believe you are completely wrong here!
Although surviving the /. effect will be a good baptism of fire for the server, I believe that most readers of /. and those working on PG share a common goal --- liberate as much IP as possible!

Any /.er who also loves literature (I don't mean the latest O'Reilly book) will want to help!

Re:copyrights? by Jeremy+Erwin · 2002-11-08 02:47 · Score: 4, Informative

Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been under copyright.

Re:Book Pirating? by raju1kabir · 2002-11-08 02:49 · Score: 4, Informative

So are the books they are digitizing all in the public domain? It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net. Of course I could be wrong.

And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire. After that, a work passes automatically into the public domain. That means there are hundreds of thousands of books available.

In fact, if you've previously seen the classics online, they probably came from this project, which has been around for almost as long as I can remember.

--
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS

Dirtributed OCR? by edwilli · 2002-11-08 02:49 · Score: 4, Interesting

Have each client do the OCR (if you can find GPL software). Or maybe there's a company willing to donate it. That way you could farm out most of the processing too.

Re:Dirtributed OCR? by pigpen_ · 2002-11-08 05:19 · Score: 1

Individuals already do all the OCR and contribute the scanned files and the OCRed text. Distributed OCR doesn't make any sense since you'd waste a huge amount of time and bandwidth distributing the scans.

--
Zambozay! My brain must've been eatin' a sandwich!

Re:Legal Implications by stinky+wizzleteats · 2002-11-08 02:50 · Score: 5, Interesting

While publishers sell dead-tree copies still, they have no copyright over the original text contained within.

What? You mean to suggest that you have an actual example of a publisher making money without tyranny over the content?

Gasp!

Re:Which books are getting converted? by teeker · 2002-11-08 02:50 · Score: 5, Informative

The books that are being converted are whatever people feel like contributing.

Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!

Doing the hard work yourself is the best way to guarantee your interests are represented.

--
teeker

Graphics by mallfouf · 2002-11-08 02:51 · Score: 4, Interesting

Very good idea.
Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?
What about books published in other countries? Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries? Or vice versa?

Re:Graphics by TulioSerpio · 2002-11-08 03:01 · Score: 1

I've just proofread a page in archaic Spanish from Voltaire.

Think it's public domain in any country

--
I'm from Argentina: Tango, Asado, Mate, Gaucho, Maradona, YPF
Re:Graphics by dvdeug · 2002-11-08 03:23 · Score: 4, Informative

Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?

DP has had books in Dutch, French, Spanish and German. No Arabic - no one has mentioned being able to do it, for one thing.

Would we be able to post those books if they're not copyrighted in the US but copyrighted in other countries?

Project Gutenberg only worries about the US copyright. If it's not copyrighted in the US, they'll do it.
Re:Graphics by imadork · 2002-11-08 04:31 · Score: 2

PG Australia falls under Aussie copyright law. They have shorter copyright terms than the good ol' USA does.
Re:Graphics by dvdeug · 2002-11-08 05:49 · Score: 2

They have shorter copyright terms than the good ol' USA does.

That's not exactly true. Australia is life+50 years, whereas the US is post-1923. Neither one is a subset of the other.
Re:Graphics by Planesdragon · 2002-11-08 06:06 · Score: 1

That's not exactly true. Australia is life+50 years, whereas the US is post-1923. Neither one is a subset of the other.

US copyright is either life+50 with a 20-year extension that's coming under the SC right now, or 70 years +20 for copyrights held by an (immortal) corporation.

The most intelligent guess I've heard about the SC is that if they don't toss out the 1998 twenty-year extension, they'll toss out the next one. Although it's intelligent at the moment to plan as if US copyright is "infinite for anything younger than Mickey Mouse", only a fool wouldn't have a backup for that supposition being found false.
Re:Graphics by dvdeug · 2002-11-08 06:30 · Score: 2

US copyright is either life+50 with a 20-year extension that's coming under the SC right now, or 70 years +20 for copyrights held by an (immortal) corporation.

Historically, the US has been on an X years rule for copyright, and the US copyright law has a lot of cruft relating to that. Anything published before 1923 has fallen into the public domain. Anything published between then and 1978, if it hasn't fallen into the public domain, has a flat 95 years, or 75 if the SC tosses out the extension. Life plus x years only kicks in if it was printed after 1978.

It's not based on who holds the copyright, it's based on the creator's life span. So even if a corporation holds the copyright, it still expires. If it was done as a work for hire, it gets a straight 100 years (IIRC). So there's no big loophole for immortal corporations in there.
Re:Graphics by dvdeug · 2002-11-08 06:50 · Score: 2

No Arabic - no one has mentioned being able to do it, for one thing.

And another, while I think of it--

DP is set up to take OCRed texts. ABBYY, while an amazing multilingual OCR program (176 languages, using the scripts of Latin, Cyrillic, Greek, Armenian, and Georgian), doesn't handle Arabic. You'd have to get an Arabic OCR program to handle them, and considering non-English texts tend to take a long time to go through, it's not something they'll jump at buying.
Re:Graphics by Planesdragon · 2002-11-08 10:05 · Score: 1

It's not based on who holds the copyright, it's based on the creator's life span.

How about "who created it?"

Saying "it's not based on who holds the copyright" implies a misunderstanding that you (obviously) don't have.

So there's no big loophole for immortal corporations in there.

Not unless the corporation manages to get one real person to be immortal...

use proofreading meta-data to improve OCR! by tomlouie · 2002-11-08 02:52 · Score: 5, Interesting

What if they kept track of every time the human reader finds an OCR error? Couldn't you then build a profile of what words/phrases/letters the OCR software has the most problems with?

Then, couldn't you just selectively have the humans review the most error-prone sections of a book, instead of every single word of every single page?

What do you think?
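
A rough sketch of what such a profile could look like, assuming you have pairs of (raw OCR text, proofread text) for each page. This is only an illustration in Python using the standard `difflib` module; the function name `confusion_counts` is made up, and nothing here is part of the actual DP site.

```python
from collections import Counter
from difflib import SequenceMatcher

def confusion_counts(pairs):
    """Count which OCR fragments proofreaders replaced, and with what.

    `pairs` is an iterable of (ocr_text, proofread_text) tuples. Only
    straight replacements are counted; insertions and deletions are ignored.
    """
    counts = Counter()
    for ocr_text, fixed_text in pairs:
        matcher = SequenceMatcher(None, ocr_text, fixed_text, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace":
                counts[(ocr_text[i1:i2], fixed_text[j1:j2])] += 1
    return counts

pages = [
    ("! am here", "I am here"),
    ("! was there", "I was there"),
    ("tbe cat", "the cat"),
]
# Most frequent confusions first: (ocr fragment, correction), count
for (wrong, right), n in confusion_counts(pages).most_common():
    print(repr(wrong), "->", repr(right), n)
```

The most common entries would tell you which substitutions to look for first, though a profile built from one book and font may not carry over to the next.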

Re:use proofreading meta-data to improve OCR! by lovebyte · 2002-11-08 03:07 · Score: 1

Then, couldn't you just selectively have the humans review the most error-prone sections of a book, instead of every single word of every single page?

This is stupid. How could you understand the plot then?

--
I'll do it for cheesy poofs.
Re:use proofreading meta-data to improve OCR! by Big_Breaker · 2002-11-08 03:09 · Score: 4, Insightful

Different book - different font - different problems.

It might help a bit, but most OCR programs already tag letters that they are unsure about. They don't mention in the article if the distributed system incorporates OCR ambiguity in prioritising proofreading.

As an aside, why not just store the raw image for any ambiguous text within the documents in the PG archive (think of an HTML sort of thing)? As people read the document, just poll them as to what they think the letters in the bitmap are.

I guess a lot of the strategy rests on how frequently the OCR software makes an error or finds ambiguity.
Re:use proofreading meta-data to improve OCR! by dmoynihan · 2002-11-08 05:31 · Score: 3, Informative

Actually, they're working on that.

The program is Gutcheck, developed by PG's Jim Tinsley.

Catches a lot!

/.ed by Midnight+Thunder · 2002-11-08 02:52 · Score: 1

Looks like for the first time in years project Gutenberg has been /.ed.

--
Jumpstart the tartan drive.

Re:Legal Implications by astrosmurf · 2002-11-08 02:52 · Score: 2, Insightful

The only works that go into PG are works in the public domain. While publishers sell dead-tree copies still, they have no copyright over the original text contained within

But the publishers still have copyright on their specific printing. Distributing scanned copies of pages probably still violates their copyright, even if distributing the OCR output does not.

Mod Parent 'Twat' by henben · 2002-11-08 02:53 · Score: 3, Funny

Nuff said.

Re:Legal Implications by Anonymous Coward · 2002-11-08 02:55 · Score: 1, Interesting

It's surprising that so many people are either trolling or are unaware of the concept of "public domain." I personally fear the latter more because it shows the ideological degradation of America. The Slashdot community is much more likely to be aware of copyright issues than most Americans. If so many of us are so naive then I genuinely fear for the survival of our country as a free nation. Perhaps that is the reason why the media corporations can encroach upon our rights by pushing inferior products and getting unanimous approval of the DMCA in the senate.

Re:copyrights? by Anonymous Coward · 2002-11-08 02:55 · Score: 2, Insightful

Copyrights aren't perpetual In Theory. But aren't Disney and Microsoft (MS wrt printed works esp) working hard to ensure they're perpetual In Practice?

Re:public domain books? by teeker · 2002-11-08 02:56 · Score: 2, Informative

True, but Project Gutenberg is a repository for digital copies of literature that are public domain. To remain a legitimate entity, they can't publish copyrighted works (without the author's consent).

So, the answer to your question is no. But that's what p2p is for ;-)

--
teeker

Re:Which books are getting converted? by Chundra · 2002-11-08 02:57 · Score: 2

I'm sure interest could be affected if people could, say, vote on what would be converted. Or do I make any sense?

I'm trying to make sense of this, please help me out. Are you saying that if people could vote on which books are converted (or "electronificated" as we sometimes call it in the industry), that more people might be interested in the project?

Re:Legal Implications by Anonymous Coward · 2002-11-08 02:57 · Score: 2, Informative

>But the publishers still have copyright on their specific printing.

Nope. Copyright holders (not necessarily the publisher) would have copyright on editorial corrections and (for music: a weird case) some on appearance, but not on the original text.

Publishers often claim copyright on the entire contents of 300 year old works, but they have no legal basis for this.

Read? by uneek · 2002-11-08 02:58 · Score: 5, Funny

Don't you mean run a compare tool in the background using CPU idle time?

You don't actually want us to read a page of literature, do you?

Re:Read? by (nil) · 2002-11-08 04:00 · Score: 1

i will likely never read another page (or pages) of literature again in my life.
perfect 10, fhm, and maxim excluded of course.
No need, buddy, no need... -(())
Re:Read? by fobbman · 2002-11-08 04:13 · Score: 2

Good point. I'll just go get the e-book.

Of course no Stephen King! by tswinzig · 2002-11-08 02:59 · Score: 1, Troll

You'll find that on Project GNUtenberg.

--

"And like that ... he's gone."

Re:public domain books? by SamTheButcher · 2002-11-08 03:01 · Score: 2, Informative

Also, if you read about the project, its goal is to put all of the works into XML to create a searchable repository, not just to have all of these .txt documents floating around. Well, that's the newest goal, anyway.

$.02. Like it or leave it.

A better way - have computers do more work. by lawpoop · 2002-11-08 03:02 · Score: 5, Interesting

I was thinking -

In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

We use omnipro here at work, and I'm surprised at how well it works, even recreating page formats.

Of course, it doesn't work 100%, but it sure does get about 95%. If you were to OCR a document 2-3 or more times, and most of it was identical, it would save a lot of time if you had humans going over only the parts that the different OCRs didn't agree on.
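
As a minimal sketch of that idea, assuming you already have the text output of two separate OCR runs of the same page: compare them word by word and hand only the mismatches to a human. This uses Python's standard `difflib`; the function name `disagreements` is hypothetical, and the sketch says nothing about how any particular OCR package behaves.

```python
from difflib import SequenceMatcher

def disagreements(ocr_a, ocr_b):
    """Return the word spans where two OCR outputs of one page differ.

    Each result is a (span_from_a, span_from_b) pair. Spans that both
    engines got wrong in the same way are NOT flagged, since the two
    outputs agree there.
    """
    words_a, words_b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return flagged

print(disagreements("It was the best of tirnes", "It was the hest of times"))
```

The obvious limitation is in the docstring: if both runs misread a word identically, nothing is flagged and nobody looks at it.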

Steve Lefevre

--
Computers are useless. They can only give you answers.
-- Pablo Picasso

Re:A better way - have computers do more work. by Anonymous Coward · 2002-11-08 03:16 · Score: 1, Insightful

> In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

Maybe worth a try, but could well die if they get the same words wrong. For dp, the extra time for scanning could well eat up the time saved by the proofreaders. Not to mention extra development to support this (with extra GUI/ more chances for confused newbies).
Re:A better way - have computers do more work. by hands · 2002-11-08 03:18 · Score: 2, Insightful

In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

This may eliminate some of the OCR errors, but it won't speed up the process because a good editor reads every word. You are asking for more errors when you ask your editors to become lazy and skip words.
Most OCR will probably misread the same character the same way every time (read 'B' as '13', for example). That kind of error will not be flagged, and will be overlooked by editors who are used to only looking for flagged errors.
Re:A better way - have computers do more work. by handorf · 2002-11-08 03:33 · Score: 2

The "Three Monkeys" from Minority report?

Interesting idea... to be even better, you'd want to use 2 different scanners and 2 different technologies.

--
-- IANAEG - I am not an elder god.
Re:A better way - have computers do more work. by schlach · 2002-11-08 04:05 · Score: 2

There is a company called Paper of Record that is archiving old newspapers using OCR technology. They scan the newspaper pages, OCR them, and create a searchable database you can scan for keywords. You do a search, and can view the original scanned page or the OCR'd text.

http://www.paperofrecord.com

I bet their software/hardware combination would greatly help an effort such as this.

Heh... Block-quoted for 2 free mod-points. =)

Anyway, I just checked them out, and they have a really great idea. Except for the expensive membership part. They have searchable full-page images of a lot of *old* newspapers (like, early 1800s through present). The problem with using them for something like PG is that they want money. They're in the business of selling their work through subscriptions to their newspaper service, and selling their technology to media companies that want to put their newspaper online. Still, definitely worth checking out. Their parent company is Canadian, so they carry Canadian, US, and UK newspapers. Would be perfect without that "expensive as a regular newspaper that you don't pay for because you read it online"...
Re:A better way - have computers do more work. by noodlez84 · 2002-11-08 04:13 · Score: 4, Informative

Although your method of "proofreading" is actually useful for most documents, it is _not_ a good method for Project Gutenberg (as a contributor to DP, I can attest to this).

The works put out by Project Gutenberg are going to be around for decades, if not centuries. 95% accuracy is shit for those purposes. An issue that comes up on the PG mailing list (gutvol-d) every once in a while is whether or not to correct spelling mistakes that appear in the real, dead-tree versions of the books. What if, for example, it's obvious to almost any reader that the author meant the word "by" instead of "bye"? Surprisingly (or not, depending on the way you look at it), the general response is *not* to correct those kinds of "mistakes". The rationale being that PG is -not- an editor, but simply a library (which is actually its legal status).

So, in short, for works with millions of characters that are going to be around for many decades, 95% accuracy is not good enough. The "bar" might be high, but when proofreading for DP, I strive for 100%.
Re:A better way - have computers do more work. by Plutor · 2002-11-08 04:15 · Score: 2

> In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.

Why not just have the Minority Reports discarded? Save you time, money, and bandwidth, and it's a flawless plan!
Re:A better way - have computers do more work. by leuk_he · 2002-11-08 04:41 · Score: 3, Insightful

it doesn't work 100%, but it sure does get about 95%

That is 2000/20 = 100 errors per page. (That is the way OCR works: even if it is 99% OK, that is still 20 errors per page.)

And that doesn't include "strange" formatting such as things scribbled in the margins, headings above pages, italics and extra spaces.

By the way, you are not supposed to correct spelling errors made in the original page, especially since this is often "old" English.
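
The arithmetic can be spelled out in a few lines of Python. The figure of 2000 characters per page is the assumption implied by the "2000/20" above, not a measured value, and the function name `errors_per_page` is just for illustration.

```python
def errors_per_page(accuracy, chars_per_page=2000):
    """Expected number of wrong characters on a page.

    `accuracy` is the fraction of characters the OCR gets right;
    `chars_per_page` defaults to the assumed 2000 characters.
    """
    return round((1 - accuracy) * chars_per_page)

print(errors_per_page(0.95))  # 95% accurate
print(errors_per_page(0.99))  # 99% accurate
```

So per-character accuracy figures that sound high still leave a proofreader plenty to do on every page.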
Re:A better way - have computers do more work. by Anonymous Coward · 2002-11-08 06:27 · Score: 1, Informative

Ya got two approaches to preserving old text.

1. Scan it.
Pros:
Automates well. Susceptible to massive implementation.
Cons:
Output is bulky/slow to view/not searchable/not editable (by comparison to ascii)

2. Make it text.
Pros:
OCR can (now) really save you time
Susceptible to massive implementation.
Small/quick to view/searchable/editable
Cons:
Not as automatable. Loses formatting.

Now you can mix these up. (Add TEI or Docbook tags to the text. Simulate columns with spaces. OCR the images and search on the OCR.)

Paperofrecord is OCRing the images, which has been a known successful method of allowing adequate searching for a decade or so (the OCR does not even have to be very good by modern standards).

Microfiche folk have been preserving images for decades now, so the economics and technology are well understood.

Gutenberg was new (20 years ago) in actually caring about the public domain.

DP is new since we can now do massive-scale 'clickworking', which allows for greater voluntarism.

Better make it quick by CatWrangler · 2002-11-08 03:05 · Score: 3, Funny

The new congress might extend copyright protection to Shakespeare's great great great great great great great great great great great great great grandson's nephew's out of wedlock kid's son whose paternity is in question.

--

---
When you come to a fork in the road, take it! --Yogi Berra--

And I shall call it... the wheel! by tiltowait · 2002-11-08 03:05 · Score: 3, Funny

You mean a more communal approach than an oligarchy of "editros" that can't spot day-old duplicates? Great idea!

Technology without morality is an atrocity by Anonymous Coward · 2002-11-08 03:06 · Score: 1, Insightful

This site is about technology for technology's sake.

Bollocks.

Technology is a human endeavour and as with all human work it is subject to ethical and moral considerations.

It's a disgrace that moral philosophy is not a required course in most tech. degree programs.

will this work? by smeg168 · 2002-11-08 03:06 · Score: 2, Interesting

I have a little problem with the logistics here. I can understand why every page is being sent to 2 people for proof reading in an effort to eliminate errors, but the problem is that these aren't 2 computers doing simple computations. If both of these people have different versions of a corrected page, as I'm sure they will, what happens then? Who does the final proof reading, and if there is someone doing the final proof reading, that kinda eliminates the need for the distributed part. I could almost guarantee that any 2 people checking the same full page of data in their free time will find/create different errors. I hope I'm missing some large concept here, because I do love PG; they keep my Palm stacked with good reading for free.

Re:will this work? by GiMP · 2002-11-08 03:11 · Score: 3, Insightful

These are humans comparing identical books to text. If they have the IDENTICAL book they won't have this problem.

Gutenberg has often published the same 'book' from different editions, due to slight variations in the text.
Re:will this work? by clonebarkins · 2002-11-08 03:32 · Score: 4, Informative

who does the final proof reading, and if there is someone doing the final proof reading that kinda eliminates the need for the distributed part.

charlz has a workflow diagram for the works that go through his site. As you see, each book has a project manager, who has final processing/proofing responsibilities.

Also, I'm not sure you get the idea of two rounds of proofing. They don't see different versions of a corrected page -- the first one sees the straight OCR output (or, sometimes the project manager will do some automated corrections on it first) and then the first round proofer edits the text. Then, when all the pages have gone through the first round, the second round proofer reads the text as it was edited by the first round proofer. This helps because it builds off the edits of the first round proofer and allows the second round proofer to perhaps catch things not caught in the first round.

When proofreading, you're never going to capture all the mistakes with one pair of eyes. A distributed proofreading effort is very beneficial to the goals and efforts of Project Gutenberg, and I applaud the efforts of all those who have proofed even one page.

Having said that, I've done over 300 (under a different name).

--

"The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand
Re:will this work? by sagwalla · 2002-11-08 05:56 · Score: 1

Another point is that if you are subsequently reading a Gutenberg text and you find obvious mistakes that even the proofreaders didn't catch, you can send them along and they will be incorporated as a later revision. I've had very fast responses when I have submitted (and documented) such corrections. What's important is that the texts are published.
Re:will this work? by smeg168 · 2002-11-08 11:58 · Score: 1

"the second round proofer reads the text as it was edited by the first round proofer" Ah, I see. I thought this was like SETI's concept of having multiple crunchers for reliability.
Re:will this work? by smeg168 · 2002-11-08 12:02 · Score: 1

Yeah, what I meant was that if, after proofing, both people had caught different mistakes and perhaps made some themselves, then in the end they would have different pages. And I didn't know they were actually comparing them to text; I thought they would just give them the page to proof, since most OCR mistakes are just incorrect letters.

Great, but I asked to help w/HWG a while back- by SamTheButcher · 2002-11-08 03:06 · Score: 1

-and still haven't heard back. They're converting the texts into XML/XHTML and I offered to do some Shakespeare. No answer.

But it looks like this is a more automated system, so that should help.

OCR errors mostly caused by poor scan quality by oob · 2002-11-08 03:11 · Score: 4, Informative

I've just proofed four pages, a mix of modern English, quoted Cockney and religious babble (Jonah 4:13, 9 etc.)

OK it's only four pages, but the errors I've corrected so far have been when the scan has been poor and the OCR software has had to make a guess.

Re:OCR errors mostly caused by poor scan quality by JM_the_Great · 2002-11-08 04:31 · Score: 1

er, why are they OCRing the bible? That's already online in many places.

--

--Justin Mitchell
"2nd Place is a fancy word for losing" --Bender (Futurama)
Re:OCR errors mostly caused by poor scan quality by j-beda · 2002-11-08 05:21 · Score: 2

Some versions of the bible are online, but not all of them. Multiple editions of a single work can be at PG, the bible is probably the most common one with multiple versions.

Sorry... by hpavc · 2002-11-08 03:12 · Score: 1

... way too busy scanning in all the Wizards of the Coast materials for a similar project.

I suggest getting an HP network scanner ... this thing rocks. Flawless sheet feeder, awesome quality, scans right to the network into PDF format, sends me an instant message when it's done.

I should have sprung for the duplexer.

--
members are seeing something, your seeing an ad

Re:Sorry... by hpavc · 2002-11-08 14:42 · Score: 1

This is it:

http://www.cdw.com/shop/products/default.asp?EDC=134586

--
members are seeing something, your seeing an ad

Cantor, Hilbert, Gödel, Turing ... by muyuubyou · 2002-11-08 03:12 · Score: 1, Offtopic

Are these copyrighted? Damn, I've read tons of papers about them and never actually read their original papers.

Re:Cantor, Hilbert, Gödel, Turing ... by dvdeug · 2002-11-09 08:33 · Score: 2

Are these copyrighted?

In the US, look at the dates of what they wrote. Most of Cantor and Hilbert are in the public domain, while Goedel and Turing are still under copyright. Unfortunately, math has always been penalty copy to typeset; the closest thing Project Gutenberg has to a real historical math text is Maxwell's On the Dynamics of a Top.
Re:Cantor, Hilbert, Gödel, Turing ... by ninthwave · 2002-11-14 05:00 · Score: 2

Amazon has Godel in the uk at this link

here it is

That's if you want the original paper for its own sake; if you want it free, I don't know if it is out there.

--
I was thinking of the immortal words of Socrates, who said: "I drank what?" - Chris Knight (Val Kilmer)- Real Genius

Re:OCR Software -- Clara, perhaps? by timothy · 2002-11-08 03:13 · Score: 5, Informative

Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.

Here's the web page: http://www.claraocr.org/index.html

timothy

--
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5

Re:copyrights? by msouth · 2002-11-08 03:14 · Score: 3, Insightful

Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been under copyright.

Well, copyrights weren't perpetual. Whether they will be or not remains to be seen.

--
Liberty uber alles.

Great! by b0bby · 2002-11-08 03:14 · Score: 1

This is a great project, I always try to correct texts on my Palm but it's much better to have them correctly proofed in the first place.
I just did a couple of pages - fun & easy!

Here's an interesting example of... by ed1park · 2002-11-08 03:16 · Score: 1

Distributed Computing.

Harnessing the proofreading computing power of human minds on the internet. very cool...

I wonder what other problems can be successfully tackled this way.

Re:Umm... by jandrese · 2002-11-08 03:17 · Score: 5, Interesting

Someone needs to do a Google search on "Public Domain". Public domain is there for a reason. Just as copyright is available to give the artist a means of supporting himself, it was never meant to last his entire life. The purpose is to give the artist an incentive to work; current copyright law fails in this respect because an artist only needs to create one successful work and can immediately switch to being a leech on society for the rest of his (and his children's, and children's children's) life. Having the works pass into the public domain is a good idea for two reasons:
1. It is for the greater good of society as other people build on earlier works.
2. It keeps the artist busy as they were supposed to have to keep releasing work to feed themselves as their early work passed into the public domain, just like any other job.

--

I read the internet for the articles.

Why he came to slashdot by cachapa · 2002-11-08 03:18 · Score: 2, Funny

I think he was just watching all his volunteers working on one page a day and thought:
"Imagine a beowulf cluster of these!"

Re:Umm... by Big_Breaker · 2002-11-08 03:18 · Score: 2, Interesting

Lots of books aren't copyrighted anymore as the copyright expired. You see, back before Disney bought legislation from people like Sonny Bono, copyrights would be allowed to expire after about 50 years or so.

Beowulf, Moby Dick, Shakespeare's plays, etc. are all free as in speech and beer. Edited versions of the original text can be copyrighted. Examples of that are editions of Shakespeare's plays with "translations" next to the original text. You can buy his complete works, unedited, for very little $ these days. The only cost for the publisher is printing and typesetting.

Books read to you while commuting by dudemaster · 2002-11-08 03:19 · Score: 3, Interesting

How about this.... use an open source speech synthesis tool/API that can play these text books (especially as more get added) over a PDA, laptop, etc while cruising in on the way to work and home. Something like:

http://www.cstr.ed.ac.uk/projects/festival/
(no plug, just did a quick freshmeat search)

would be pretty cool to get some good novels read to you w/o buying the tapes.

Duplicity? by Andy+Social · 2002-11-08 03:22 · Score: 2

Or duplication, maybe?

--
Illegitimi non carborundum

Just one page a day, huh? by WIAKywbfatw · 2002-11-08 03:23 · Score: 5, Funny

Sure, it starts as just one a day. But, before you know it, you're doing two, then five, then ten.

You stop going out with friends or even returning their calls, personal hygiene takes a back seat and even Counter Strike and Warcraft III become unappealing. And, finally, after countless chapters and hundreds of pages you realise that your friends were right: you're an addict.

Just one page a day, huh? Yeah, right.

Opium. Pot. Cocaine. Now pages.

It might not be your older brother's drug, or your Daddy's or your grandfather's, but, trust me, this stuff can be dangerous.

Do what I do. Just say no.

--

"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg

Re:Just one page a day, huh? by SDrifter · 2002-11-08 04:19 · Score: 2, Funny

personal hygiene takes a back seat and even Counter Strike and Warcraft III become unappealing

If Counter Strike and Warcraft are what you do for fun, somehow I doubt that personal hygiene is an issue. Or friends, for that matter.

--
--It burns! --It's loaded with wasabi.
Re:Just one page a day, huh? by leuk_he · 2002-11-08 04:45 · Score: 1

Opium. Pot. Cocaine. Now pages

You mean ... and before you know it you are a slashdot troll. Happens a lot around here.

Makes me wonder, how do they prevent distributed trolls? And people who just do it for the statistics (think about seti@home)?
Re:Just one page a day, huh? by Spunk · 2002-11-08 06:28 · Score: 2

Opium. Pot. Cocaine. Now pages.

Funny, the text I'm proofing right now is about opium.

Doing my part by cornjones · 2002-11-08 03:28 · Score: 2

If we just write one line of code a day each we'll have better OCR in no time.

#include

Ok, there is my line of code, everybody else, finish it up.

I can't wait to see this great new OCR.

Re:Doing my part by Myco · 2002-11-08 07:38 · Score: 2

Your contribution is a syntax error? Thanks so much. History is in your debt.

--
My deviantArt site

What books need to be done? by Alethes · 2002-11-08 03:29 · Score: 3, Interesting

Is there a list of books that are out of copyright and perhaps the status of those books on the Gutenberg Project website or anywhere else?

Re:What books need to be done? by clonebarkins · 2002-11-08 03:39 · Score: 3, Informative
Check out the following for a start:
- Books in Progress and Requested
- Steve Harris' PG To-do List
- David Price's In-Progress Page (some have been "in-progress" for quite a while now, so they are probably free to grab)
--

"The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand
Re:What books need to be done? by stud9920 · 2002-11-08 03:56 · Score: 2

We could start with Sonny Bono's lyrics. I heard they would be public domain in 2012^H^H24^H^H45^H^H98^H^H^H^H3012^H^H^H^Hnever
Re:What books need to be done? by dvdeug · 2002-11-08 06:10 · Score: 2

There are probably millions of books that are out of copyright. http://www.dprice48.freeserve.co.uk/GutIP.html has a list of ones that are in process or released, but it's nowhere near even a tiny fragment of the number of books out of copyright. Arguably, they all need to be done; personally, I put emphasis on the literature by famous authors (Millay, Tolstoy) and the non-fiction that everyone should have access to -- especially first-hand and soon after the fact accounts of historical events.

Second and closing fast by ardmhacha · 2002-11-08 03:29 · Score: 1

On the archive.org Gutenberg page they list the most popular downloads.

Number 2 is something called "New Hacker's Dictionary, The"

Every time I refresh the page the download count has increased.

A variation on the slashdot effect ?

Gutenberg page at archive.org

Possible Enhancements by Niles_Stonne · 2002-11-08 03:31 · Score: 5, Interesting

This is a great project... But after doing my first page I found a couple of possible enhancements.

Add a "Quality" stat for each person. Base it on the number of things that were missed (in other words, the number of things that the second-string proofer finds).

Use more than just two proofers. Have one "First String" proofer, who could be anybody, but have two second string proofers (who both get the output of the first string proofer). If the second string proofers have any differences in their output (with the exception of white space), then another second string proofer should be used. Only proofers with a certain quality rating (slightly higher than what a newbie's would be) should be able to do the second string proofing.
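The "differences with the exception of white space" check could be sketched roughly as below. This is only an illustration in Python; the function names are made up, and treating "ignore white space" as "collapse every run of whitespace to a single space" is an assumption, not something the site is known to do.

```python
def normalise_whitespace(text: str) -> str:
    """Collapse every run of whitespace to one space and trim both ends."""
    return " ".join(text.split())


def proofs_agree(proof_a: str, proof_b: str) -> bool:
    """True when two proofed pages differ, at most, in their whitespace."""
    return normalise_whitespace(proof_a) == normalise_whitespace(proof_b)


def needs_another_proofer(proof_a: str, proof_b: str) -> bool:
    """Escalate to a further second string proofer when the two disagree."""
    return not proofs_agree(proof_a, proof_b)
```

Two second string outputs that differ only in line breaks or doubled spaces would pass; a single changed character would send the page to another proofer.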

The "User rating" should be a combination of the number of pages done and the quality rating of those pages. Note that quality rating would only be increased by doing first string proofing. Page count would go up for any proofing.

Quality could be a float, starting at 1.0 for newbies. Every page that is completed and has a second-string person check would then go into a calculation like:

_new_quality_ = _old_quality_ + (0.01 - (_num_differences_between_their_proof_and_final_proof_ / 1000))

Thus, for every page proofed that requires NO corrections by the second string the user's quality would go up by 0.01. ( 0.01 - 0/1000 = 0.01 )

If there were more than ten errors in the proofing, their quality would go down: ten errors is the break-even point ( 0.01 - 10/1000 = 0.00 ), and twenty errors costs 0.01 ( 0.01 - 20/1000 = -0.01 ).

Have a threshold of 1.10 or some such for second string proofers... That way it would require the user to do at least 10 perfect pages, or 20 pages with 5 errors, etc, before they could do the second string proofing.

Obviously, make sure that the second string proofer can't see who the first string proofer is.

The "User Rating" (mentioned above) could just be a multiplication of the Quality and Page Counts...
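For anyone who wants to play with the numbers, here is a minimal sketch of the scoring in Python. The function and constant names are invented for illustration, and Decimal is used only so that ten perfect pages land exactly on the 1.10 threshold instead of a float that is a hair short of it.

```python
from decimal import Decimal

STARTING_QUALITY = Decimal("1.00")         # newbie value from the post
SECOND_STRING_THRESHOLD = Decimal("1.10")  # "1.10 or some such"


def update_quality(old_quality: Decimal, num_differences: int) -> Decimal:
    """new = old + (0.01 - differences / 1000), as in the formula above."""
    return old_quality + (Decimal("0.01") - Decimal(num_differences) / Decimal(1000))


def can_second_string(quality: Decimal) -> bool:
    """Only proofers at or above the threshold may do second string work."""
    return quality >= SECOND_STRING_THRESHOLD


def user_rating(quality: Decimal, page_count: int) -> Decimal:
    """The simple version: quality multiplied by pages done."""
    return quality * page_count


# Ten perfect first string pages take a newbie from 1.00 to the threshold.
quality = STARTING_QUALITY
for _ in range(10):
    quality = update_quality(quality, 0)
print(quality, can_second_string(quality))
```

Twenty pages with five differences each get to the same place, since each of those pages is worth 0.005.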

--
Sticks and Stones may break my bones, but copyright will always protect me.

Re:Possible Enhancements by BSemrad · 2002-11-08 04:38 · Score: 1

Another improvement, although difficult to implement, which would enhance the proofing speed and accuracy would be to highlight the current line on the scanned image and the text control. You would also want to make sure they scrolled together.
Re:Possible Enhancements by Jerf · 2002-11-08 04:47 · Score: 2

You should read this. It may not seem directly related at first, but it is.

The root problem is that unless you can measure EXACTLY what you are trying to measure, people will optimize to improve their standing with the measurement, rather than real quality.

Your proposed optimizations would cause someone to create two accounts, one that they use to completely trash a page, and another to "correct" it, boosting the second account's rating at the expense of the first. (You can't force people to do pages they don't want to do, or you'll drop participation through the floor.)

I know you mean well, but it is often better just to leave these statistics out completely, and deal with the fact that you are only attracting serious people to the project who will do it without the carrot of being in "first place" over everybody else on the stats page.
Re:Possible Enhancements by BSemrad · 2002-11-08 06:01 · Score: 1

I'm assuming that you were trying to control the scrolling via the relative position top to bottom in the text control, as finding the current line in the OCR document by locating the actual text from the current line in the text control would be very tough.
Since the windows are primarily the same on a line by line basis it is possible that you could scroll the two windows relatively accurately just based on the relative position as long as you centered each assumed current line in their respective window. This would probably be effective even without current line highlighting.
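A rough sketch of that relative-position idea, in Python. Everything here is hypothetical: the parameter names, the pixel units, and the assumption that text lines are spread evenly down the scanned image.

```python
def image_scroll_offset(current_line: int, total_lines: int,
                        image_height: float, viewport_height: float) -> float:
    """Scroll offset that roughly centres the current text line in the image pane.

    current_line is zero-based. The line's position in the image is guessed
    purely from its relative position in the text, not from matching content.
    """
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    # Relative position of the middle of the current line, between 0 and 1.
    fraction = (current_line + 0.5) / total_lines
    centre_y = fraction * image_height
    # Put that point in the middle of the viewport, then clamp to valid range.
    offset = centre_y - viewport_height / 2
    max_offset = max(0.0, image_height - viewport_height)
    return min(max(offset, 0.0), max_offset)
```

Near the top and bottom of the page the clamp wins, so the line is not centred there, but it is still on screen.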
Re:Possible Enhancements by Niles_Stonne · 2002-11-08 06:11 · Score: 2

First of all, you do not get to choose the page that you do, just the book - so you couldn't reference a particular page.

Second of all, with my proposed quality rating you couldn't do that. Sure, the first string proofer could screw up the page, but once one first string proofer finishes it, only second-string proofers can work on it. In my proposal the only people that would get their quality level adjusted would be the first string proofers. In other words, sure you could use your second account to fix what you screwed up in the first, but your second account's quality wouldn't be increased, and your first account's quality would be decreased.

Perhaps a cutoff to not allow proofing once a person is below Quality 0.80 or so would be in order.

The link you posted (to fogcreek.com) does have some good statements about user metrics. Keep in mind that this is a community effort, so there is no HR department to worry about.

I was attempting to give users something that they could boast about (I have a Quality rating of 5.06!) that would encourage higher quality work, not just faster work.

--
Sticks and Stones may break my bones, but copyright will always protect me.
Re:Possible Enhancements by Niles_Stonne · 2002-11-08 06:41 · Score: 2

The site already does a two tiered approach, so it would be 50X pages currently. I just wanted to provide a couple of extra checks, as well as a performance estimate for user statistics. The drop from 50X to 33X is not nearly as great as the drop from 100X to 33X, although it is still significant. ;)

Having a setup like I proposed makes it very difficult for a purposefully mangled page to get through.

--
Sticks and Stones may break my bones, but copyright will always protect me.
Re:Possible Enhancements by manaway · 2002-11-08 06:45 · Score: 1

There is considerable merit and thought in this suggestion. The appeal of improving the system is a strong temptation. However, sometimes a project shows more progress with more worker bees and fewer rules, structures, and PHBs.
(Hmmm, new tagline: More worker bees, fewer PHBs!)
Re:Possible Enhancements by Niles_Stonne · 2002-11-08 06:47 · Score: 2

Thank you.

Good tagline :)

I tend to come up with way too many ideas/enhancements whenever I do something... Feature creep is my greatest issue when writing software ;)

--
Sticks and Stones may break my bones, but copyright will always protect me.

ASCII Only? by vondo · 2002-11-08 03:34 · Score: 5, Insightful

Reading the blurb at the page-a-day site, it says ASCII only where bold is converted to ALL CAPS, the English pound symbol is rendered as "L," etc. No preservation of figures, drawings, or photos.

This seems very short-sighted to me. Devices that can only display ASCII are becoming rarer and rarer. Why not, instead, store docs in some sort of SGML format to handle the special markup (which must be rare) and then down-convert to ASCII when needed?

I've tried reading these things on my Palm. Very difficult. But if I could get a nice typeset PDF version, that would be a whole different story (no pun intended).

Re:ASCII Only? by Robotech_Master · 2002-11-08 03:42 · Score: 5, Informative

Check out Black Mask for a lot of nicely-formatted pubdom e-books, including many from Gutenberg but also some that Gutenberg doesn't have.

--
Editor Emeritus and Senior Writer, TeleRead.org
Re:ASCII Only? by mattdm · 2002-11-08 04:08 · Score: 2

This seems very short sighted to me. Devices that can only display ASCII are becoming rarer and rarer.

And the key thing is: with a good markup language, converting to plain ASCII for those devices is trivial. Or *trivial*. It's a win-win proposition. In fact, the markup language doesn't even have to be that great -- HTML 4 would work fine.
Re:ASCII Only? by Captain+Large+Face · 2002-11-08 04:50 · Score: 2

What about DocBook, which features encoding for books in both SGML and XML? It was devised for computing books, but one imagines it would not be too hard to devise a standard to apply to all works of literature.
Re:ASCII Only? by rusty0101 · 2002-11-08 04:53 · Score: 4, Informative

When the project was started, SGML variants were not widely used, and the option of including images was a concern for storage space.

Using things like BOLD and L for the British pound were workarounds to have a common way of presenting the data. I suspect that it would be trivial to build a formatting filter in Perl, or another language, that would convert BOLD to bold, though it would require a bit more work to recognize that it really should be Bold or even that it should be BOLD.

Converting monetary symbols would require a bit more work, but would also not be impossible.

Re-inserting any diagrams, figures, illustrations or other graphics would require more work. If the original scanned pages are still available, as this part of the project suggests, even that would not be impossible.

One variation is the free bookmobile project that is out there. They use scans of the original book to build a new book for kids. Preparation for printing involves downloading the book over the internet, via a DSL-speed satellite link. I am not sure, however, whether the working material is suitable for e-book reading.

-Rusty

--
You never know...
Re:ASCII Only? by J'raxis · 2002-11-08 05:00 · Score: 1

SGML? How about just straight-up UTF-8?

--
Liberty in your lifetime
Re:ASCII Only? by pigpen_ · 2002-11-08 05:03 · Score: 1

Typeset your own with LaTeX or pdfTeX. It's not that hard and with a little pstops magic you can have 4up 32page signatures for binding your own little books.

--
Zambozay! My brain must've been eatin' a sandwich!
Re:ASCII Only? by quinto2000 · 2002-11-08 05:07 · Score: 3, Informative

From actually proofing a few pages, this depends entirely on the particular project and when it was started. Some of the newer ones allow special characters.

--
Ceci n'est pas un post
Re:ASCII Only? by sagwalla · 2002-11-08 05:51 · Score: 2, Insightful

The beauty of this is that it is in the public domain. If you want a PDF version, or an HTML version, feel free to make one. The Gutenberg standards put the material out in a lowest-common-denominator format so anyone has the same freedom.
Re:ASCII Only? by demi · 2002-11-08 07:09 · Score: 1

I more or less concur--the vast majority of the data in question is best represented by ASCII text with no markup, and the few exceptions are hardly damaging to the text.

However, there are a few problems--one of which is that each project has slightly different rules for registering or not registering textual enhancement, like italic or bold, or special characters. One project, for example, asked me to render italics for emphasis using all caps, except when the word is "I", when it should be underscores, or when it is a title or vessel, when it should be left out (i.e. "I do not want to sail on the Titanic" becomes "_I_ do NOT want to sail on the Titanic"). Another wanted me to include HTML markup for italics (<I></I>).

Huh?

Some projects want you to flatten 8-bit characters (á [a with accent] becomes a), others want them preserved, but are ambiguous about what character set encoding is desirable. ISO-8859 Latin-1 would probably be a common choice, but one project asks you to preserve 8-bit characters using the "Windows Character Somethingorother" tool and I'm pretty sure it will encode using some Windows code page. And of course, the resulting PG text files are not MIME documents and will not carry their character set encodings with them.
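To make the two options concrete, here is a small Python sketch. The function name is made up, and using Unicode decomposition to do the "flattening" is my assumption about what a project might mean by it; the second half just shows why the character set has to travel with the file.

```python
import unicodedata


def flatten_to_ascii(text: str) -> str:
    """Drop accents so that an accented letter becomes its plain ASCII base letter.

    Characters with no ASCII base form come out as "?" rather than vanishing
    silently.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    without_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_marks.encode("ascii", errors="replace").decode("ascii")


# "Preserving" the character only helps if the reader knows the encoding:
# the same letter is one byte in Latin-1 and two different bytes in UTF-8.
latin1_bytes = "á".encode("latin-1")
utf8_bytes = "á".encode("utf-8")
```

A plain text file carries neither label, so a reader who guesses the wrong encoding sees garbage where the accented letters were.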

There are perfectly fine policy-based solutions to these problems, but there seems to be no master policy for these things, or at least the project instructions don't reflect one if there is.

--
demi
Re:ASCII Only? by luisdom · 2002-11-08 07:55 · Score: 1

And what about an old, standard, available and simple format?: rtf.
(and no, it is not read the f.. nothing)
Re:ASCII Only? by dachshund · 2002-11-08 09:04 · Score: 1

I suspect that it would be trivial to build a formatting filter in Perl, or another language, that would convert BOLD to bold, though it would require a bit more work to recognize that it really should be Bold or even that it should be BOLD.
It wouldn't necessarily be trivial to do what you describe (ASCII-to-markup). It would, however, be trivial to go the other way (markup-to-ASCII).
For instance, take your example of entering boldfaced text as BOLDFACED TEXT. You could write a script to turn those all-caps back into boldfaced text, but... what if my original text actually includes words in all-caps as well as boldfaced text? What if those Boldfaced WoRdS had embedded capitalization info? If I ran your proposed script on them, they would be translated into boldfaced text with no capitalization. Or alternatively, what if the original text used both the degrees symbol (that little "o") and the word "degrees" (as in "six degrees of separation")? How does my ASCII-to-markup script know when to use the symbol and when to use the word?
If you start with a markup text that contains all of the necessary info, and then convert to ASCII when necessary, you can reliably perform the translations without losing any of the original meaning of the text.
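A toy illustration of the asymmetry, in Python. The <b> and <i> tags here stand in for whatever markup a project might pick, and the function name is invented; the ASCII conventions (bold to ALL CAPS, italics to underscores, the degrees symbol spelled out) are the ones discussed in this thread.

```python
import re


def markup_to_ascii(text: str) -> str:
    """Down-convert toy markup to the plain ASCII conventions."""
    # Bold becomes ALL CAPS; any capitalization inside the bold run is lost.
    text = re.sub(r"<b>(.*?)</b>", lambda m: m.group(1).upper(), text, flags=re.DOTALL)
    # Italics become _underscores_.
    text = re.sub(r"<i>(.*?)</i>", r"_\1_", text, flags=re.DOTALL)
    # The degrees symbol becomes the word.
    return text.replace("°", " degrees")


# Two different originals collapse to the same ASCII, so no script can tell
# afterwards which one it started from.
from_bold = markup_to_ascii("those <b>Boldfaced WoRdS</b> again")
from_caps = markup_to_ascii("those BOLDFACED WORDS again")
```

Going down is a few substitutions. Going back up means guessing which of several originals produced the ASCII, which is the whole problem.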
Re:ASCII Only? by davidmccabe · 2002-11-08 19:44 · Score: 1

Exactly. For while stripping a document of well-known markup is trivial, adding markup to a loosely-formed collection of formatting kludges designed only so that humans can see them is hard.

And it would be okay if they just made this mistake in the past and their new books are marked up. But they seem to have (and maybe this isn't their intention; it's just how they sound to me) this attitude, almost elitist, that markup is a Bad Thing and that ASCII is the One True Format and they aren't even going to think about switching to anything else, ever.
Re:ASCII Only? by dvdeug · 2002-11-09 07:32 · Score: 2

markup is a Bad Thing and that ASCII is the One True Format and they aren't even going to think about switching to anything else, ever.

ASCII is the One True Format. It's been constant since 196x, unlike the world of alternatives. Since PG has been around since the early '70s, they tend to stay with what works. It's annoying when I try to read a book in PG, and the volunteer preserved the French characters, in DOS, so it doesn't display right anymore on almost anyone's computer.

Most books don't have a huge collection of markup - maybe a few italics. The underscore convention for italics, used by most people now, can be automatically converted. The uppercase can't, but it's not that hard to put in the elbow grease to fix it.
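As a sketch of what "automatically converted" could look like for the underscore convention (Python; the function name and the choice of HTML <i> tags as the target are just assumptions for the example):

```python
import re

# An underscore-delimited run that is not glued to a word on either side,
# so identifiers like snake_case_name are left alone.
_ITALIC = re.compile(r"(?<!\w)_([^_\n]+)_(?!\w)")


def underscores_to_html(text: str) -> str:
    """Turn _emphasised_ runs into <i>emphasised</i>."""
    return _ITALIC.sub(r"<i>\1</i>", text)
```

The ALL CAPS convention has no equivalent one-liner, because nothing in the text says whether a capitalized word was bold, italic, or simply printed in capitals.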

They have copies of some stuff, that can't be handled well in ASCII, in other formats; HTML is popular, with TeX for those math works. But neither is universal; an ASCII version is still provided where feasible because it is universal.

Re:public domain books? by RobotRunAmok · 2002-11-08 03:34 · Score: 2

I know for a fact that there are a lot of digital copies of copyrighted works such as Frank Herbert's Dune series and The Lord of the Rings floating around the Net and I think the newsgroups as well.

Of course, there are. And why shouldn't there be? Information (and Entertainment) Must Be Free!

Just ask Harlan

Distributed Proofreading has a "high score" table. by Lovepump · 2002-11-08 03:37 · Score: 3, Insightful

How long before someone writes a script to hit "Save and get another Page" and they shoot to the top of the ladder claiming to have proofread 13,450,213 pages per day...

Re:OCR Software -- Clara, perhaps? by Zach+Garner · 2002-11-08 03:39 · Score: 3, Informative

I've used both Clara and gOCR. Neither works well enough yet to actually use for scanning books.

I'll be reviewing my Bloom County books. by saider · 2002-11-08 03:40 · Score: 1

One page a day shouldn't be a problem.

--

Remember, You are unique...just like everyone else.

Re:Just one page a day by URSpider · 2002-11-08 03:44 · Score: 1

I see the distributed part, except our computers aren't doing the processes in the background, we're doing it in the foreground.

I beg to differ. Foreground and background are all relative -- everything your computer does is foreground to IT -- it's devoting 100% of its attention (if it's a single-processor machine) to one task at a time.

In this case, the term is relative to your boss -- foreground is that report that's due tomorrow, background is reading Slashdot, drinking coffee and doing distributed proofreading. Which is all fine as long as bosses don't have good human-load-management tools...

ssh bob's-head

uptime

10:40am up 2 days straight, 1 user, load average: 2.06, 2.08, 2.08

killall slashdot-reading

uptime

10:41am up 2 days 1 minute, 1 user, load average: 0.85, 0.83, 0.89

lo

No, not really by Codex+The+Sloth · 2002-11-08 03:46 · Score: 4, Insightful

OCR Engines are not email programs. You can't just add a line of code and all of a sudden it works better. Usually you have to spend time developing a complicated algorithm. Usually this is more than a line of code. Then you have to test it against known text (ground truth) to make sure it's a benefit, rather than a problem over a broad selection of pages. It's quite often the case that something that improves one page makes another worse.

Actually, having people make verifications against the OCR results establishes the ground truth which someone could use to improve the OCR engine so by doing a Page a Day, you are helping to make future Open Source OCR engines better.
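To show what "test it against ground truth" might look like in practice, here is a minimal Python sketch that scores OCR output by character error rate. Edit distance is one common way to do this, not necessarily what any particular OCR project uses, and all the names are made up.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def char_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edits needed per ground-truth character (lower is better)."""
    if not ground_truth:
        raise ValueError("ground truth page is empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


def mean_error_rate(pages: list[tuple[str, str]]) -> float:
    """Average error over (ocr_text, ground_truth) pairs from many pages."""
    return sum(char_error_rate(ocr, truth) for ocr, truth in pages) / len(pages)
```

A change to the engine would only count as a benefit if the mean over a broad set of proofed pages goes down, not just the score on the one page it was tuned for.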

--
I am not a number! I am a man! And don't you ... oh wait, I'm #93427. Ha ha! In your face #93428!

How many? by demigod · 2002-11-08 03:50 · Score: 1

How many books are we talking about? Those out of copyright and not in PG.

If the trend of copyright extension doesn't end soon that number may reach zero, but how soon is that?

--
"The last thing I want to do is deal with a bunch of people who want something."
Major Major

Re:How many? by clonebarkins · 2002-11-08 04:00 · Score: 1

How many books are we talking about? Those out of copyright and not in PG.

Well, under the current state of US Copyright Law (TM), anything before 1923 is public domain. In addition, some things after 1923 are public domain due to legal technicalities (e.g., forgetting to re-register the copyright, explicit gifts to the public domain, etc.). So, that number is very big and pretty much incalculable.

According to the latest Project Gutenberg Weekly Newsletter (which is sent to the Book People Mailing List, archived here), there are 6267 books in Project Gutenberg.

So, for a simple calculation:

how_many_books_we_are_talking_about = ${Number of books printed before 1923} - 6267

--

"The evil of the world is made possible by nothing but the sanction you give it." -- Ayn Rand

Re:Legal Implications by Happy+Monkey · 2002-11-08 03:51 · Score: 2

You seem to have skipped the second sentence of the post you replied to, even though the editorial corrections you refer to would undoubtedly appear on the scanned pages. One way around it might be that each page is covered under fair use, and they are not served to the proofreader in order, so you are never given more than a one-page excerpt.

--
__
Do ya feel happy-go-lucky, punk?

Distributed Everything by Mostly+Harmless · 2002-11-08 03:53 · Score: 2, Informative

Something I posted on 10/24...

Go here. Now. It's the most complete listing of distributed computing I've ever found. Has the usual, like folding and SETI, but also neat things like Distributed Proofreading and finding as-of-yet unknown comets.

--
"`Ford, you're turning into a penguin. Stop it.'" -Douglas Adams, THHGTTG

Great idea by brandonsr · 2002-11-08 03:54 · Score: 1

It helps the community and adds books to my shelf, at the same time. Amazing, I'm in.

Re:A better use of time (OK, here's mine) by gosand · 2002-11-08 03:58 · Score: 5, Funny

If we just write one line of code a day each we'll have better OCR in no time.

OK, here's mine:

#include stdio.h

next...

--

My beliefs do not require that you agree with them.

OCRs aren't about to do context-sensitive thinking by Kjella · 2002-11-08 03:58 · Score: 2

I just put in a few pages (15 if you care :), and while some were very consistent in quality, at least one book had some smears and spots. There's no way an OCR of any quality would be able to reverse engineer the half-printed letters and words back to readable English without a *good* dictionary/grammar machine, and even then it would be more dangerous to have it do a half-assed guess than to have a human there who will at once tell that this is a trouble spot and that the OCR dropped the ball. God, that last was an ugly sentence, guess I should stick to proofreading and not start writing myself...

Kjella

--
Live today, because you never know what tomorrow brings

Re:Legal Implications by Twylite · 2002-11-08 03:59 · Score: 2

Sorry, but this isn't strictly true. See my earlier post. Publishers tweak the text ("corrections" mostly), which gives them copyright over their particular publication.

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net

ENOUGH! by flyneye · 2002-11-08 04:00 · Score: 1

De man wit de gun sez we be spellin an readin good.now you just sit there wich yo hans where i can see em an READ!you gunna be uh literate poofreader when we git through.

--
*Repent!Quit Your Job!Slack Off!The World Ends Tomorrow and You May Die!

Scanning without damaging the book? by mttlg · 2002-11-08 04:01 · Score: 3, Interesting

I have a few books that are old enough to be well out of copyright (and obscure enough not to be found online already), and for a while I have been considering typing them in. OCR would be a lot easier, but getting a good image from a flatbed scanner would seriously damage most of these books. Even a handheld scanner would be impractical in some cases, and a digital camera seems even less likely to work. Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

Re:Scanning without damaging the book? by jpetts · 2002-11-08 04:19 · Score: 5, Informative

Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?

Yes indeed! *Any* decent academic library should have a photocopier which can do this. Older models tend to have a glass platen which extends right to the edge of the photocopier, and the side slopes away at around 60 degrees rather than dropping at a right angle. Newer models, such as the Minolta PS3000 will support the book in a cradle, face up, so that contact with the pages is minimised. They also tend to have a host of features, such as automagically erasing the gutter shadow that one gets with such a system.

--
Call me old fashioned, but I like a dump to be as memorable as it is devastating - Bender
Re:Scanning without damaging the book? by Griim · 2002-11-08 04:43 · Score: 2

The other poster's idea might be better here, as I'm not familiar with those photocopiers, but I would think that one of the newer digital cameras (4 megapixels and up), accompanied by a macro lens, would do the trick. You can get some pretty stunning detail out of the newer models, if you haven't seen them.
Re:Scanning without damaging the book? by ChaosDiscord · 2002-11-08 07:35 · Score: 2, Informative

...but getting a good image from a flatbed scanner would seriously damage most of these books. ...a digital camera seems even less likely to work.

Actually given a nice digital camera with a high resolution, you can generate perfectly fine images for OCRing. I've known a few people who have done exactly this to take images of rare books that they have access to but would never be allowed to put on a scanner.

--
Search 2010 Gen Con events
Re:Scanning without damaging the book? by mttlg · 2002-11-08 09:15 · Score: 2

Actually given a nice digital camera with a high resolution, you can generate perfectly fine images for OCRing.

I wasn't questioning the resolution of the camera, I was questioning the positioning of the book to get a good image. This would work easily if the book could be opened to lie flat, but otherwise it would require some apparatus to hold the book open, and even this won't work if the book can't be held open far enough with the page flat to get a good picture (as in the worst-case example I gave).
Re:Scanning without damaging the book? by Anonymous Coward · 2002-11-08 14:01 · Score: 1, Informative

Contact Project Gutenberg (http://promo.net/pg). They have use of an orbital scanner, I believe it's called, in San Francisco which can do non-destructive scanning of fragile bindings. (I'm not an AC, I'm just so technologically challenged that I don't see a reason to create an account here. I type in weird-font books on browned paper for PG. OCR distinctly has its limits in the face of the creativity of font designers.)

Re:Umm... by Twylite · 2002-11-08 04:04 · Score: 4, Insightful

Copyright law is supposed to give incentive to create, for the betterment of society, and allow the creator to derive direct benefits as a reward. An artist who has created a work so successful that (s)he can live on it indefinitely has arguably provided a suitable level of betterment to society.

Saying that copyright law is an incentive to "work" is accepting mediocrity. Artists who produce works that society values more highly should (have the opportunity to) receive more benefits.

On the other hand, I don't necessarily agree that copyright should last the lifetime of the creator (although there are strong arguments for this in the case of a natural person). But what is a "fair" limit?

Is 5 years enough? Almost certainly not. Many authors only achieve popularity after 10 or more years, and then make a fair amount of money off increased sales of their older works. A good number accept this as a risk, and plan to use this phenomenon to their benefit - work up a good number of titles with varied content, and you'll pull more readers, who are then likely to try some of your other titles.

Is 20 years enough? Maybe. But some of our best-loved authors were 15-20 years ahead of their time in terms of what readers wanted.

Is life enough? Strangely, no. If an aging star has just completed his/her autobiography, concludes the publishing deal, and dies ... well, the family could well be screwed.

Maybe the answer lies in a compromise, rather than an all-or-nothing approach. Copyright over a work lasts for the greater of 10 years or the creator's natural life (which gets very interesting when we get eternal life medications ...). But some rights fall away after the LESSER of those two times, such as exclusivity over derivative works (but not translations).

This allows society to (culturally) enrich itself by building on a work after a shorter amount of time, while the creator (and/or family) can still derive value from the original work for a longer time.

In the case of books this is easily understood: author writes book; 10 years later other people can write prequels and sequels, extend the world and characters, etc; 30 years later author dies and original book falls into public domain.
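
The "greater of / lesser of" rule proposed above is easy to misread, so here is a minimal sketch of the arithmetic. The function name, the parameter names and the years in the example are made up for illustration; only the 10-year floor and the two-stage expiry come from the comment.

```python
def copyright_milestones(published, author_death, floor_years=10):
    """Return (derivatives_open, public_domain) as calendar years.

    Assumed rule, following the comment: full copyright lasts for the
    GREATER of `floor_years` or the author's life, while exclusivity
    over derivative works ends at the LESSER of the two.
    """
    floor_end = published + floor_years
    derivatives_open = min(floor_end, author_death)
    public_domain = max(floor_end, author_death)
    return derivatives_open, public_domain


# Book written in 2000, author dies 30 years later:
# sequels by others from 2010, public domain in 2030.
print(copyright_milestones(2000, 2030))
```

If the author dies only three years after publication, the same function gives derivative rights lapsing at death and the family keeping the original work until the 10-year floor runs out.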

--
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net

DP, PG? by Jackazz · 2002-11-08 04:05 · Score: 1

...as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP)...

I don't think a DP could ever be PG!
hehe, pr0n...

impressive by Anonymous Coward · 2002-11-08 04:14 · Score: 2, Interesting

The response from the slashdot community is impressive. Already they have hit their mark for the day as far as 'pages processed'. They have over 1400 (at 10:13am CST) pages processed. When I visited their site at 8:45am CST they had only 615 pages. I predict that the project will hit the 3000 mark fairly quickly for today.

Re:impressive by cymandee · 2002-11-08 09:39 · Score: 1

And at 3PM PST they now have almost 7400 pages processed!
Must be the first time something valuable comes out from slashdotting a website...
Re:impressive by khayo · 2002-11-09 18:53 · Score: 1

It looks like the project got a huge boost from /. and others.

But there was a dark side -- a few cretins who went after the number of processed pages, and submitted them without any proofing (I came across at least 10 of these today).

The only answer to this is to drown these bozos in the Good Thing(tm). So I'm appealing to all other /.'ers to stay with this project and carefully do a page or two a day, even after the original thread and article fade from the front page. Thanks a lot.

Re:Legal Implications by j-beda · 2002-11-08 04:15 · Score: 2

I am pretty sure that PG takes care to only use old copies of books that are in fact no longer copyrighted if that is in fact necessary. They seem very picky in making sure that they follow the rules.

Are any of these resources distributed? by wls · 2002-11-08 04:21 · Score: 3, Insightful

It seems like every few years I turn around and notice that some massive archive collection gets sued, goes out of business, has funding pulled, gets tangled in legal action, has a university board go into panic mode, etc. and suddenly it disappears without warning or notice to the frustration of many. I'm certain you also can name a number of services, collections, and resources that spontaneously vanished when hosted at friendly sites. History has proven that despite best intentions, nothing lasts forever unless we go out of our way to protect it.

So that work isn't lost or destroyed, are any of the mega-sized projects replicated elsewhere in the event that a "it'll never happen" situation crops up to this unsuspecting resource?

Re:Are any of these resources distributed? by Anonymous Coward · 2002-11-08 05:38 · Score: 2, Informative

DP submits to Project Gutenberg. This is a Gutenberg FAQ.

1. Michael Hart (Gutenberg's leader) is very much in favor of massive replication. My favourite scenario is disk drive makers putting the entire Gutenberg collection on their drives before selling them (to fill up space, or as a differentiator).

2. PG has been around 20 years, and never been shut down. Judges actually understand and defend the public domain, within limits that PG understands.

3. Nothing goes through DP without copyright approval from MHart. And if he makes a mistake, it is likely to be fixed by withdrawing the offending *book*, as far as possible.

blackmask.com by night_flyer · 2002-11-08 04:22 · Score: 2

After finding Thea von Harbou's Metropolis at www.blackmask.com, I go there first when looking for an ebook, especially since they have them in e-silo format (Palm). If they don't have what I'm looking for, I go by Project Gutenberg...

--

Thanks to file sharing, I purchase more CDs
Thanks to the RIAA, I buy them used...

Re:public domain books? by Jack+Admiral · 2002-11-08 04:40 · Score: 1

OT: I can't believe I got modded down for this comment. Sheesh. Serves me right for posting at all. From now on I'll just keep to myself and not contribute anything at all. It's not even worth it to moderate others as well when I get moderator points.

Mod me down if you wish I don't care anymore.

Re:And you ask the /. community... by Binestar · 2002-11-08 04:45 · Score: 5, Funny

MY GOD! A story where nitpicking grammar and spelling is *ON* topic.

This'll be a fun one to read through.

--
Do you Gentoo!?

works fine! by magwm · 2002-11-08 04:46 · Score: 2, Interesting

I just proofread 2 pages of some Greek philosophy book. The system works really nicely! Quick database, and the pages are not too large to read. Except I would like to have source and text next to each other, and not above each other.

I'm impressed by schmiddy · 2002-11-08 04:47 · Score: 2, Interesting

I signed up for an account, and did a bit of proofing. One page was a bibliography with lots of numbers -- the OCR software made a few errors here and there, sometimes confusing "1" with "!". Another page was in old German. Since many old German characters look so different than their modern-day counterparts, I was quite impressed when it translated them flawlessly into their proper ASCII counterparts. The OCR software even got the umlauts right. Only problem was it sometimes mistook an end of line "-" for a "=". One problem I did have was that most of the scans seemed to be pretty low resolution. This causes problems when comparing the scanned text to the original image, as it can create difficulties for the proofreader. The software also had trouble translating the low-res blocks.
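
The two confusions mentioned above ("1" read as "!", and an end-of-line "-" read as "=") are regular enough to be flagged automatically before a human looks at the page. A rough sketch follows; the function name and both patterns are assumptions for illustration, not anything the DP site is known to run, and the code only flags lines rather than changing them.

```python
import re

# "!" directly next to a digit is probably a misread "1" (e.g. "19!2").
BANG_IN_NUMBER = re.compile(r"(?<=\d)!|!(?=\d)")
# "=" as the last character of a line is probably a misread hyphen.
TRAILING_EQUALS = re.compile(r"=\s*$")


def flag_suspects(text):
    """Return (line_number, reason) pairs for lines worth a second look."""
    flags = []
    for number, line in enumerate(text.splitlines(), start=1):
        if BANG_IN_NUMBER.search(line):
            flags.append((number, "'!' next to a digit, possibly '1'"))
        if TRAILING_EQUALS.search(line):
            flags.append((number, "line ends with '=', possibly '-'"))
    return flags


sample = "pp. 19!2\nWissen=\nschaft ist gut"
for number, reason in flag_suspects(sample):
    print(number, reason)
```

A flag is only a hint: the proofreader still decides against the page image, since a genuine "!" after a number or a genuine "=" can occur.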

--
http://cltracker.net -- powerful craigslist multi-city search

use spell and grammar checking by primus_sucks · 2002-11-08 04:57 · Score: 1

You could use spelling and grammar checking to improve the OCR.

The quick brwn fox jumped over lazy dog.

It would be easy to figure out that brwn should be brown. The OCR should see something between lazy and dog; using grammar rules, it could possibly figure out what the most likely word should be.
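
The dictionary half of this idea can be sketched with nothing but Python's standard library. The word list below is a tiny stand-in, the function names are invented for the example, and the 0.8 similarity cutoff is an arbitrary assumption; a real checker would need a full dictionary and would still get proper nouns and old spellings wrong.

```python
import difflib
import re

# Stand-in dictionary; a real one would hold many thousands of words.
WORDS = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]


def suggest(token, words=WORDS):
    """Return the token if it is a known word, else the closest known word.

    Falls back to the token itself when nothing is similar enough.
    """
    lower = token.lower()
    if lower in words:
        return token
    matches = difflib.get_close_matches(lower, words, n=1, cutoff=0.8)
    return matches[0] if matches else token


def correct_line(line):
    """Apply suggest() to every alphabetic run, leaving punctuation alone."""
    return re.sub(r"[A-Za-z]+", lambda m: suggest(m.group()), line)


print(correct_line("The quick brwn fox jumped over lazy dog."))
```

This repairs brwn but does nothing about the missing word between "over" and "lazy"; that part needs a grammar or language model, which this sketch does not attempt.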

Re:use spell and grammar checking by danielkdwalker · 2002-11-09 02:06 · Score: 1

The quick brwn fox jumped over lazy dog.
Of course, the obvious error is that it *should* be
the quick brown fox jumps over the lazy dog
otherwise there's no 's'.

Pubic Domain? by Cap'n+Canuck · 2002-11-08 04:59 · Score: 2, Funny

I'll help out.

One question - is Playboy public domain yet?

Re:Umm... by Dirtside · 2002-11-08 05:02 · Score: 2

My wife had a suggestion for limiting the life of copyright. Basically, tie it to the amount of income you get from the work. Once you reach a certain plateau, the work falls into the public domain (although you could argue for an additional minimum time requirement, e.g. 5 years for movies, so that a gigantic blockbuster won't enter the public domain after 6 months). Or instead of income, base it on profit. That way, you are guaranteed that you will make a certain amount of money before the work enters the public domain. Of course, works that never reach the plateau would enter the public domain after a suitable period (e.g. life plus 10 years for natural persons, or something incredibly short for a corporation, like 20 years).

Of course, there are practical problems with this method -- namely, accurately determining the amount of money a work takes in. It's all too easy to fudge financial data, as we've been too often reminded in the past year, and this idea may not be workable.

--
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased

And we shall call it kuro5hin! by drivers · 2002-11-08 05:03 · Score: 2

(en tea)

Can't get through? Try ibiblio by gbnewby · 2002-11-08 05:15 · Score: 3, Informative

The main Gutenberg page is slashdotted right now, but you can get nearly the same access to the books via the main ibiblio page at ibiblio.org/gutenberg, which is the main distribution site for the collection.

It looks like the texts01.archive.org/dp site is holding up fairly well! If you cannot get through today, though, please check back later. Slashdot effect aside, it's usually quite speedy and has a decent 'net connection. If you want to keep informed of current events, get on one of our mailing lists via (when it's not slashdotted) our subscriptions page.

Dr. Gregory B. Newby
Chief Executive and Director
Project Gutenberg Literary Archive Foundation http://gutenberg.net
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@ils.unc.edu // 919-962-8064

Re:Some PG books ARE copyrighted... by dpbsmith · 2002-11-08 05:17 · Score: 5, Informative

...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.

Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1944), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc. etc.

Not exactly "the latest Stephen King" but a lot newer than Dickens.

--

"How to Do Nothing," kids activities, back in print!

Work is getting in the way of my proofing... by fluppy88 · 2002-11-08 05:19 · Score: 1

Damn, work is getting in the way of my proof reading! Why can't I make my boss understand that Project Gutenberg is much more important than what I'm normally doing?

See /. effect in action! by macalmaclan · 2002-11-08 05:24 · Score: 1

From the website:
Pages completed today: 2633 as of 9:01 Pacific Time today
The average pages / day so far this month (before today) is less than 1100!

I am programmer, let's automate this by LoRider · 2002-11-08 05:24 · Score: 2

Do they want me to manually scan through a page of text, compare it with an image, and fix errors created by OCR? It goes against my very nature to do such a task. There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.

I haven't finished my first cup of coffee yet so I am at a loss for a solution, but it sounds like something Perl would be good at.

The motto of the open source community should be or is, "Progress not perfection."

--
LoRider

Re:I am programmer, let's automate this by Sloppy · 2002-11-08 05:52 · Score: 3, Insightful

There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.
It's not human eyes that are needed, it's human brains. If it is possible to automate, then the OCR doesn't need checking; it just needs to be upgraded to include whatever algorithm that you're about to invent.

--
As copyright owner of this comment, I authorize everyone to defeat any technological measure which limits access to it.
Re:I am programmer, let's automate this by dvdeug · 2002-11-08 06:14 · Score: 2

There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.

If there were, then the OCR program would have done it. It requires large amounts of context and pattern recognition, things humans are much better at than computers.

Way to go Slashdot! by specialized_sworks · 2002-11-08 05:25 · Score: 1

Looks like slashdot has caused the pages per day to go way up! Last couple of days had only accomplished between 1000 and 1100 pages per day. As of 12:30PM, the pages for today are already above 2600!

Wonder how long the increased rate will last?

-Dubya

Since when was /. an advertisement service? by dfj225 · 2002-11-08 05:32 · Score: 1

I wish I could get my webpage to showup on the main /. page!

--
SIGFAULT

Re:Umm... by RazzleFrog · 2002-11-08 05:40 · Score: 1

That sounds like a great way to encourage mediocrity. I won't write the best book possible. I will only write one good enough to sell enough copies to reach my maximum royalty. Your wife's method would reward the Danielle Steels of the world and punish the Joseph Hellers. I think the ideal way is 20 years, alive or dead. What other job pays you for 20 years for what took you maybe 1 or 2 years?

Re:A better use of time (OK, here's mine) by jim3e8 · 2002-11-08 05:42 · Score: 2, Funny

OK, I'll start at the other end and work my way toward you:

}

Reminds me of the OED by Anonymous Coward · 2002-11-08 05:46 · Score: 2, Interesting

Their approach to solving this reminds me of how the Oxford English Dictionary was started -- by compiling submissions and references from thousands of volunteers. A really enjoyable recounting of this (and of one particular person who contributed thousands of words while in an insane asylum) is The Professor and the Madman

Re:A better use of time (OK, here's mine) by Anonymous Coward · 2002-11-08 05:53 · Score: 1, Funny

#include stdio.h

YM,

#include <stdio.h>

One line, one bug. Yikes!

Re:Legal Implications by stinky+wizzleteats · 2002-11-08 06:05 · Score: 1

Well, yeah - it's easy to make money publishing a book if you don't have to pay the author anything and all the marketing has already been done. For new books, the copyright system is the best way to ensure a publisher can recoup these costs.

I'm confused. Is copyright protection supposed to protect the marketers or the artist?

Re:Some PG books ARE copyrighted... by BlueGecko · 2002-11-08 06:10 · Score: 2

I do believe you have linked to a copyright circumvention device (the .au domain) in violation of the DMCA. Please standby while you and your belongings are liquidated.

Thieving TOS Violator! by timeOday · 2002-11-08 06:23 · Score: 2

the original site ran on a Pentium 200 over my 128kbps upstream cablemodem

This is a chilling example of the dire consequences of granting upstream bandwidth to home users!!!

Er, wait...

Linux Client by kapheine · 2002-11-08 06:35 · Score: 1

A while ago I started to write a Linux client for the distributed proofreaders site. I got a fair amount of it done, but there were some messy parts, buggy parts, and parts left undone. If anyone would like to check it out, or even work on it, it is at http://kapheine.hypa.net. I haven't worked on it in a while, unfortunately, and I probably won't.

--
-- kapheine

Looking for proofreaders on slashdot !! by tadas · 2002-11-08 06:44 · Score: 5, Funny

If they're looking for proofreaders here, the project is in deep trouble...

--
This page accidentally left blank

Re:Legal Implications by dvdeug · 2002-11-08 06:47 · Score: 2

But the publishers still have copyright on their specific printing.

I've heard this in the context of German law, but never in the context of American law. American law requires significant creative effort for a work to be copyrighted, and dumping text to paper rarely counts. (New footnotes and illustrations are a different matter, of course.)

Price on my head. by Domo-Sun · 2002-11-08 06:51 · Score: 1

In the case of books this is easily understood: author writes book; ... 30 years later author dies and original book falls into public domain.

That would make an incentive for people to kill you so they can steal your work.

Re:Price on my head. by dvdeug · 2002-11-08 07:12 · Score: 2

That would make an incentive for people to kill you so they can steal your work.

People already have incentive to kill other people to take their work. It's robbery gone bad, and inheritance. I doubt there's enough value in any public domain work to make a death sentence worth facing.

Snide remark about markup... by davidmccabe · 2002-11-08 07:06 · Score: 2, Funny

Wait a minute! Isn't PHP like evil or something?

Programming languages may come and go, but good old-fashioned machine code will last as long as literature, very much like good old-fashioned ASCII text and good old-fashioned zip files with no meaningful names.

Look! Up in the sky! by charon_on_acheron · 2002-11-08 07:18 · Score: 1

It's absurd!
It's inane!
It's Malaprop Man . :^)

Blame it on Mickey Mouse by peter303 · 2002-11-08 07:27 · Score: 2

Walt Disney wanted to extend the rights to his branded characters and got the lawmakers to do it. In some respects his old stuff is renewed every decade: new generations of kids and new media- film, theme park, video tape, DVD, IMAX ... Each reissue is a new pile of money.

Re:Umm... by Dirtside · 2002-11-08 07:46 · Score: 2

Interesting points. There is the fact that deliberately creating an artistic work that will reach a certain cash plateau is nearly impossible -- just look at how many creative endeavors never even get so far as to break even, and that's with authors trying really hard.

Also, there's the fact that an over-successful work creates desire for an author's other works -- so writing something which will exceed its copyright profit cap would still create income for the author's other works.

Additionally, if there's a minimum time limit set on the work (I'd say 15-20 years for books), then even if it is wildly successful, you could reap the profits for 20 years, even if you greatly exceeded the profit cap. Once that 20-year deadline hit, of course, the copyright would expire. Trying to calculate your work so that you only barely reach the profit cap *after* the minimum time would be utterly impossible, so I doubt that would have any effect on authors' efforts.

All that said, yours is a simpler solution (and one that I would support) -- 20 year copyright, non-extendable, from the date of first publication, regardless of the author. Period. Copyrights would be transferable (i.e. I could sell my copyright to a new owner, and I would lose *all* rights to it). It's an acceptable solution, though it doesn't mean it's the best solution (or even realistic, politically speaking).

--
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased

Wow... by Myco · 2002-11-08 07:56 · Score: 1, Offtopic

This is great, but it's even more addictive than the Kill Everyone Project. Though arguably not as worthwhile.

--
My deviantArt site

Re:A better use of time (OK, here's mine) by jafuser · 2002-11-08 08:08 · Score: 1

set_bugs = 0;

--
Please consider making an automatic monthly recurring donation to the EFF

Re:A better use of time (OK, here's mine) by gosand · 2002-11-08 08:19 · Score: 2

1. It didn't say it had to be bug-free code. :p

2. Do you know how long it has been since I wrote any C code? I was lucky I spelled stdio correctly.

--

My beliefs do not require that you agree with them.

legal enforceability by David+Jao · 2002-11-08 08:27 · Score: 2

In this case the text is (almost) legally enforceable. They really do own the copyright. They really do have the right to prohibit almost all copying of that book, except in limited fair use circumstances.

For the reason why, I suggest you "learn up" on what public domain really means in the US. Public domain simply means that a particular work has no copyright restrictions. It does not mean that you are prohibited from adding further copyright restrictions of your own.

In other words, a work which is public domain is free for all to copy in any way they wish, including copyrighting a copy for themselves. Note that placing your own copyright on the work does not mean that the original work is copyrighted. It just means that your copy is copyrighted. Anyone is still free to access the original copy, which is still in the public domain. But they can't use your copy if your copy has your copyright.

You might ask "are there laws that prohibit you from lying about the authorship of a work?" The answer is yes. It's called fraud. It has nothing to do with copyright. Placing your own copyright on a work, and claiming authorship of a work, are two completely independent actions according to the legal system.

You are totally right that the cover text is not enforceable with regard to "fair use" copying of the text, but the parts that say "Copyright 1974 Houghton Mifflin" and "All rights reserved" are definitely valid, enforceable, and meaningful.

Re:legal enforceability by dvdeug · 2002-11-08 09:43 · Score: 2

In other words, a work which is public domain is free for all to copy in any way they wish, including copyrighting a copy for themselves.

That's not true. You can use it for a basis of your own copyrighted work, but you can't claim a copyright on something without adding significant creative value.

From http://www.copyright.gov/circs/circ1.html

Only the author or those deriving their rights through the author can rightfully claim copyright.

Non-native proofers by Sangui5 · 2002-11-08 09:08 · Score: 3, Informative

are actually the preferred way to proof text. A project to create "The Collected Works of Edmund Spenser" is headquartered here, and the English-types were looking for people to work on some software for them. The current most accurate way to create an electronic copy is to hire people without even a passing familiarity with the alphabet you are targeting, train them to identify the letters themselves (using the font you're targeting, which may be very much non-standard, esp. for work as old as Spenser's), and have them enter it in character by character. You then have another illiterate person do the same, and have 1 editor (English graduate student) check both copies. Then any differences have to be handled by another editor (English PhD), and the final copy signed off by yet another editor (PhD).

A very very expensive way to do it.

See, an illiterate person won't introduce any bias into the text. They will faithfully duplicate any spelling mistakes that they find. In the case of an English scholarly collection, the mistakes are among the most important parts, since they can identify different print runs, and how language shifts over time.

As a side note, the software project is hopeless. The best that can be managed is to automate the administration of their current systems--no OCR will ever meet the level of accuracy that their current system provides.
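
The comparison step in the double-keying workflow described above (two independent transcriptions, with only the disagreements passed to an editor) is the part that automates cleanly. A minimal sketch using Python's standard difflib follows; the function name and the sample strings are made up for illustration and say nothing about the software the Spenser project actually used.

```python
import difflib


def disagreements(copy_a, copy_b):
    """Compare two independent transcriptions character by character.

    Returns (position_in_a, text_in_a, text_in_b) for every span where
    the two typists differ; an empty list means the copies agree.
    """
    matcher = difflib.SequenceMatcher(None, copy_a, copy_b, autojunk=False)
    return [
        (a_start, copy_a[a_start:a_end], copy_b[b_start:b_end])
        for tag, a_start, a_end, b_start, b_end in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical example: the second typist "modernized" an old spelling.
typist_one = "it seemes, men doe call"
typist_two = "it seems, men do call"
for position, in_a, in_b in disagreements(typist_one, typist_two):
    print(position, repr(in_a), repr(in_b))
```

The code cannot tell which typist is right. It only narrows the editor's job to the spans where the copies differ, which is where the expensive PhD time gets spent.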

Re:A better use of time (OK, here's mine) by beta21 · 2002-11-08 09:48 · Score: 1

next line

#include "ocrLib.h"

Re:Umm... by hobit · 2002-11-08 09:53 · Score: 1

Nice points. How about having sole rights for 15 years, and then only commercial rights for something like 25 or 30 years? So PG could use the text at 15, but no one could sell the work, other than the author, for 25-30 years?

Workable?

--
As Nietsche famously said, "If you stare too long into the Abyss, 1d4 Tanar'ri of random type will attack you."

Re:A better use of time (OK, here's mine) by Hater's+Leaving,+The · 2002-11-08 10:05 · Score: 1

I don't want to get squeezed in the middle, so I'll work _downwards_ from you.

#else

Well, I'm assuming you want it to work in both Windoze and lunix. I just got the feeling that what you were writing wouldn't be portable.

THL.

--
Keeping /. cynic density high since the fscking Kwhores/trolls arrived.

Maybe not for long -- still good by the+grace+of+R'hllor · 2002-11-08 10:08 · Score: 2, Informative

It's on Slashdot, so everyone does a few pages, finds out it's actually fairly tedious, and only a few will remain of the initial burst. They're at about 7000 for today right now, which is about 1000 more than what they've done so far this month. Don't build your site based on these estimates.

Check back there in a few weeks to see how the site is doing. Hopefully quite well, since it is a splendid and worthwhile[1] effort.

[1]: And only in the preview did I realize I sounded like that woman in the HHGTTG.

Call the firemen by Pseudonymus+Bosch · 2002-11-08 12:20 · Score: 1

That would make an incentive for people to kill you so they can steal your work.

Do authors burn at 431 F?

--
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu

Re:public domain books? by SamTheButcher · 2002-11-08 12:28 · Score: 1

My guess, Jack, is that you got modded down because the modder thought it obvious that they can't transcribe copyrighted works. They're still under copyright and the PG site states - We cannot publish any texts still in copyright. This generally means that our texts are taken from books published pre-1923. (It's more complicated than that, as our Copyright Page explains, but 1923 is a good first rule-of-thumb for the U.S.A.)

So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of this century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.

The texts you mention are "illegal" replications/duplications. But please do read about the travesty of copyright laws on their site as well. And vote accordingly in 2004. And don't get discouraged - keep posting.

Proofing FAQ by Wanker · 2002-11-08 12:43 · Score: 3, Informative

Stop reading this
And start reading a page!
After that come back and you may continue();

...but first read the Proofing FAQ on the site and save yourself some confusion:

http://texts01.archive.org/dp/faq/ProoferFAQ.html

Especially read section 5 for some of their typesetting-to-ASCII conventions which would be non-obvious otherwise.

Re:Book Pirating? by Genyin · 2002-11-08 13:03 · Score: 1

And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire.

No it doesn't. (not counting copyright not being renewed, which I suppose counts, but...) At this point, no new works about which the author actually cares enough to renew copyright are going into the public domain; if no new laws are passed this will change in 2018...

creative value by David+Jao · 2002-11-08 17:27 · Score: 1

You can use it for a basis of your own copyrighted work, but you can't claim a copyright on something without adding significant creative value.

The standards for whether a particular addition constitutes significant creative value are remarkably low. The already mentioned spelling modernization, for instance, is an example of a tangible modification to the Shakespeare texts over which Houghton-Mifflin can legitimately claim copyright.

You could, in theory, copy their Shakespeare book, IF you somehow removed all of their spelling and editorial changes, line numberings, page numberings, annotations, commentary, illustrations, etc. from every page of the book. In practice, this is not so easy, because it is not easy to tell what was changed and what was not changed unless you have an original copy to compare it with. And if you have an original copy of Shakespeare, then you don't need the Houghton-Mifflin published version anyway.

Re:creative value by dvdeug · 2002-11-09 08:06 · Score: 2

The already mentioned spelling modernization, for instance, is an example of a tangible modification to the Shakespeare texts over which Houghton-Mifflin can legitimately claim copyright.

Sure. But an edition of Hemingway, where massive changes are neither needed nor expected, is slightly different.

The Slashdot effect put to good use by jfmiller · 2002-11-08 20:07 · Score: 1

You can now see the (beneficial) results of a good old-fashioned Slashdotting on the front page, with the graph for pages from Nov. 8 going way off the scale.

JFMILLER

--
Strive to make your client happy, not necessarly give them what they ask for

Re:A better use of time (OK, here's mine) by novakreo · 2002-11-09 00:15 · Score: 1

Here's mine:

return 0;

By the time this is working most of what I'd like to read should be public domain anyway....

--
O frabjous day! Callooh! Callay!

Re:public domain books? by psamuels · 2002-11-09 00:58 · Score: 1

OT: I can't believe I got modded down for this comment. Sheesh.

Next time I get mod points, I think I'm gonna just start modding down posts that complain about moderations done to their authors. Nothing personal, Jack, it's just something I've noticed a lot of recently on slashdot, and it's really frickin' lame. I mean, it's OK to complain that someone else got hammered by the mods, but to whinge about one's own fate - dude, it's only karma, get over it already.

(This is of course in addition to the policy I'm borrowing from someone's sig: "I moderate down any post that says 'I'll probably get moderated down for this'." Same principle.)

Of course, my new-found policy will probably be a big hit with the metamods, some of whom have taken on personal quests to rate all downmods as unfair. So logically I should hide behind "overrated". But I won't. That would be lame. (:

--
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README

Re:public domain books? by psamuels · 2002-11-09 01:02 · Score: 1

Oh, and before you get all pissed at me - yes, I agree that you were moderated unfairly. Yes, a certain amount of crack was most likely involved. I don't care - it's still tacky and tiresome to complain about it yourself.

--
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README

Re:Which books are getting converted? by Max+Romantschuk · 2002-11-10 17:59 · Score: 1

I meant to say that if you could affect which books get converted into electronic form you might be more interested.

Voting might not be the way to go, but I don't feel that I'd be very interested if there are no books I have any interest in personally.

--
.: Max Romantschuk :: http://max.romantschuk.fi/

Last Post! by alpg · 2002-11-22 05:58 · Score: 1

Earl Wiener, 55, a University of Miami professor of management science,
telling the Airline Pilots Association (in jest) about 21st century aircraft:

"The crew will consist of one pilot and a dog. The pilot will
nurture and feed the dog. The dog will be there to bite the
pilot if he touches anything."
-- Fortune, Sept. 26, 1988
[the *magazine*, silly!]

- this post brought to you by the Automated Last Post Generator...
