Just One Page a Day
Charles Franks writes "Two years ago I started building an online proofreading system as a way to help Project Gutenberg (PG) get more books online: Distributed Proofreaders (DP). The concept is simple, we scan books and load the image and OCR output for each page into the online system. Next, proofreaders compare the OCR text to the image making any corrections as necessary, each page gets looked at twice. Finally the output from the site is massaged into a PG e-text and submitted to PG for posting to the archive. Now, nearly 600 books and a lot of PHP code later, we have snuggled into our new home which is graciously provided by the Internet Archive and Project Gutenberg. Now that we have 'real' resources available to us (the original site ran on a Pentium 200 over my 128kbps upstream cablemodem) I would like to invite the online community at large to help us put even more books online. To this end I would like to ask everyone to do 'Just One Page a Day'. Thank you, Charles Franks"
Seriously? Just make a distributed system, put in PHP code, and make it all open source and free?
What's the criteria?
Good quote, too many chars. Seriously, the slashdot 120 char limit sucks!
And start reading a page!
After that come back and you may continue();
... which is renowned for it's spelling prowess? ;)
A)bort, R)etry or S)elf-destruct?
After some consideration, I propose that this system should be applied to Slashdot stories! Each Slashdot story, after being submitted by an editor, should be reviewed by at least two readers before being posted in order to correct inadvertent spelling mistakes and story duplicity. Thank you sir, for inspiration!
Dr. Joseph Hairston
Superintendent, CCBC
Sounds like Gary Condit's plan for extramarital affairs.
A. Rightmann
Is there any worth-while open source OCR software? How about reasonably priced closed source OCR software for *BSD or Linux?
I'm shure that buy askin teh Salshdot crowd (esp. the editturs) to help, yule improove jamatically teh kwality off you're output.
:-)
Try NetBSD... safe,straightforward,useful.
I can't decide if this is a joke or not.
You do know about Project Gutenberg, right?
...phil
"For a list of the ways which technology has failed to improve our quality of life, press 3."
The only works that go into PG are works in the public domain. While publishers still sell dead-tree copies, they have no copyright over the original text contained within. (Which is why these works are typically available through multiple publishers.)
XML is like violence. If it doesn't solve the problem, use more.
Project Gutenberg only publishes books that are out of copyright. That means Dickens is okay but you won't find the latest Stephen King.
Project Gutenberg specifically deals with texts that are not copyrighted. So it is all legit. :)
Jeremy
Not when the authors have been dead for 300 years.
It helps if you read the FAQ list.
Due to copyright laws, it is only legal to do this with older books (copyrighted 75 or more years ago). As a result, Project Gutenberg is mostly comprised of the "Classics."
Imagine the kids 200 years from now reading |-|uc||_3b3rry F1|\||\|.
(That hurts my brain just trying to type it in...)
--- I wish I could hear the soundtrack to my life. That way I'd know when to duck.
"The Road Ahead" will not be included, at least in this round of distributed OCRing.
Read Errant Story.
I think a better use of time would be to have all these programmers here develop a better OCR. Then you wouldn't need the proofreading and could just feed books into the scanner. I mean there are lots of things wrong with OCR and reasons why it can't be absolutely perfect, but it CAN be better. If we just write one line of code a day each we'll have better OCR in no time.
The GeekNights podcast is going strong. Listen!
Yup.
It doesn't seem like there would be that many books in the public domain that haven't already been made available on the net.
How do you suppose they make it to the net? Most of the public domain books were written before word processors, so there's no electronic text around.
Of course I could be wrong.
Yeah. Go look at Project Gutenberg's site - think of it as your homework assignment for the weekend.
...phil
"For a list of the ways which technology has failed to improve our quality of life, press 3."
Y wood any 1 nede sum one too prufereed there buk. Eye du fyne bye myselph.
The 'Project Gutenberg' is about making old books that have (finally) fallen into public domain available to whoever wants it. Those are the books I'm sure that they want to have proofed.
Looking for any old 8-bit Heathkit/Zenith software/hardware - http://heathkit.garlanger.com
Instead of proofreading the books, I think this guy is asking for his new server setup to be tested!
I'll do it for cheesy poofs.
Copyrights aren't perpetual. The Gutenberg project aims to publish books that are no longer, or have never been, under copyright.
And you probably are. The best efforts of our duly elected Congressional representatives notwithstanding, copyright still does expire. After that, a work passes automatically into the public domain. That means there are hundreds of thousands of books available.
In fact, if you've previously seen the classics online, they probably came from this project, which has been around for almost as long as I can remember.
"Patriotism is your conviction that this country is superior to all other countries because you were born in it." -- GBS
Have each client do the OCR (if you can find GPL software). Or maybe there's a company willing to donate it. That way you could farm out most of the processing too.
While publishers sell dead-tree copies still, they have no copyright over the original text contained within.
What? You mean to suggest that you have an actual example of a publisher making money without tyranny over the content?
Gasp!
The books that are being converted are whatever people feel like contributing.
Don't think your favorite authors are being represented? Can you demonstrate that the work is out of copyright? Make the conversion yourself!
Doing the hard work yourself is the best way to guarantee your interests are represented.
teeker
Very good idea.
Will there be any support for proofing in other languages (french, spanish, arabic, etc...)?
What about books published in other countries? Would we be able to post those books if they're not copyrighted in the US but are copyrighted in other countries, or vice versa?
What if they kept track of every time the human reader finds an OCR error? Couldn't you then build a profile of what words/phrases/letters the OCR software has the most problems with?
Then, couldn't you just selectively have the humans review the most error-prone sections of a book, instead of every single word of every single page?
What do you think?
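To make the "profile" idea a bit more concrete, here is a rough Python sketch. It assumes the site could hand you pairs of (OCR text, proofread text) for each page; the function name `confusion_counts` and the sample strings are made up for illustration and are not anything the DP site actually provides. It just diffs each pair character by character and counts which substitutions the proofreaders keep making.

```python
from collections import Counter
from difflib import SequenceMatcher


def confusion_counts(corrections):
    """Count (ocr_fragment, corrected_fragment) pairs across proofread pages.

    `corrections` is an iterable of (ocr_text, proofread_text) tuples.
    This input format is an assumption made for the sketch.
    """
    counts = Counter()
    for ocr_text, fixed_text in corrections:
        matcher = SequenceMatcher(None, ocr_text, fixed_text, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Record what the OCR produced and what the human changed it to.
                counts[(ocr_text[i1:i2], fixed_text[j1:j2])] += 1
    return counts


# Hypothetical sample data, invented for this example.
sample = [
    ("page !2", "page 12"),
    ("!9", "19"),
    ("co= operate", "co- operate"),
]
profile = confusion_counts(sample)
print(profile.most_common())
```

The most frequent entries in the counter would be the places to point human reviewers at first. Whether the real error distribution is skewed enough for that to save much work is an open question.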
Looks like for the first time in years project Gutenberg has been /.ed.
Jumpstart the tartan drive.
Nuff said.
It's surprising that so many people are either trolling or are unaware of the concept of "public domain." I personally fear the latter more because it shows the ideological degradation of America. The Slashdot community is much more likely to be aware of copyright issues than most Americans. If so many of us are so naive then I genuinely fear for the survival of our country as a free nation. Perhaps that is the reason why the media corporations can encroach upon our rights by pushing inferior products and getting unanimous approval of the DMCA in the senate.
Copyrights aren't perpetual In Theory. But aren't Disney and Microsoft (MS wrt printed works esp) working hard to ensure they're perpetual In Practice?
True, but Project Gutenberg is a repository for digital copies of literature that are public domain. To remain a legitimate entity, they can't publish copyrighted works (without the author's consent).
;-)
So, the answer to your question is no. But that's what p2p is for
teeker
I'm sure interest could be affected if people could, say, vote on what would be converted. Or do I make any sense?
I'm trying to make sense of this, please help me out. Are you saying that if people could vote on which books are converted (or "electronificated" as we sometimes call it in the industry), that more people might be interested in the project?
>But the publishers still have copyright on their specific printing.
Nope. Copyright holders (not necessarily the publisher) would have copyright on editorial corrections and (for music: a weird case) some on appearance, but not on the original text.
Publishers often claim copyright on the entire contents of 300 year old works, but they have no legal basis for this.
Don't you mean run a compare tool in the background using CPU idle time?
You don't actually want us to read a page of literature, do you?
You'll find that on Project GNUtenberg.
"And like that
$.02. Like it or leave it.
In order to make the proofing faster, maybe you could OCR a document 2 or 3 times, and then have only the disagreements proofread.
We use omnipro here at work, and I'm surprised at how well it works, even recreating page formats.
Of course, it doesn't work 100%, but it sure does get about 95%. If you were to OCR a document 2-3 or more times, and most of it was identical, it would save a lot of time if you had humans going over only the parts that the different OCRs didn't agree on.
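For what it's worth, the "only proofread the disagreements" step could be sketched like this in Python. This is a minimal illustration, assuming you already have two OCR passes of the same page as plain strings; the function name `disagreements` and the sample text are hypothetical. It compares the passes word by word and returns only the spots where they differ.

```python
from difflib import SequenceMatcher


def disagreements(ocr_a: str, ocr_b: str):
    """Return (words_from_a, words_from_b) for each place two OCR passes differ."""
    a, b = ocr_a.split(), ocr_b.split()
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Only these spans would need to go to a human proofreader.
            flagged.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return flagged


# Hypothetical outputs from two OCR runs of the same line.
pass_one = "the quick brown fox"
pass_two = "the quick brwn fox"
print(disagreements(pass_one, pass_two))
```

One caveat: if both passes make the same mistake, nothing gets flagged, so this only helps when the errors of the different runs are reasonably independent.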
Steve Lefevre
Computers are useless. They can only give you answers.
-- Pablo Picasso
The new congress might extend copyright protection to Shakespeare's great great great great great great great great great great great great great grandson's nephew's out of wedlock kid's son whose paternity is in question.
---
When you come to a fork in the road, take it! --Yogi Berra--
You mean a more communal approach than an oligarchy of "editros" that can't spot day-old duplicates? Great idea!
Bollocks.
Technology is a human endeavour and as with all human work it is subject to ethical and moral considerations.
It's a disgrace that moral philosophy is not a required course in most tech. degree programs.
I have a little problem with the logistics here. I can understand why every page is being sent to 2 people for proofreading in an effort to eliminate errors, but the problem is that these aren't 2 computers doing simple computations. If both of these people have different versions of a corrected page, as I'm sure they will, what happens then? Who does the final proofreading? And if there is someone doing the final proofreading, that kind of eliminates the need for the distributed part. I could almost guarantee that any 2 people checking the same full page of data in their free time will find/create different errors. I hope I'm missing some large concept here, because I do love PG; they keep my Palm stacked with good reading for free.
But it looks like this is a more automated system, so that should help.
I've just proofed four pages, a mix of modern English, quoted Cockney and religious babble (Jonah 4:13, 9 etc.)
OK it's only four pages, but the errors I've corrected so far have been when the scan has been poor and the OCR software has had to make a guess.
... way too busy scanning in all the Wizards of the Coast materials for a similar project.
... this thing rocks. Flawless sheet feeder, awesome quality, scans right to the network into PDF format, sends me an instant message when it's done.
I suggest getting an HP network scanner
I should have sprung for the duplexer.
members are seeing something, you're seeing an ad
Are these copyrighted? damn I've read tons of paper about them and never actually read their original papers.
Though the web page was last updated in July, I find several happy references (and some less happy) to "Clara," a GPL'd OCR program.
Here's the web page: http://www.claraocr.org/index.html
timothy
jrnl: http://tinyurl.com/c2l8yr / foes: http://tinyurl.com/ckjno5
Well, copyrights weren't perpetual. Whether they will be or not remains to be seen.
Liberty uber alles.
This is a great project, I always try to correct texts on my Palm but it's much better to have them correctly proofed in the first place.
I just did a couple of pages - fun & easy!
Distributed Computing.
Harnessing the proofreading computing power of human minds on the internet. very cool...
I wonder what other problems can be successfully tackled this way.
Someone needs to do a Google search on "Public Domain". Public domain is there for a reason. Copyright exists to give the artist a means of supporting himself, but it was never meant to last his entire life. The purpose is to give the artist an incentive to work. Current copyright law fails in this respect because an artist only needs to create one successful work and can immediately switch to being a leech on society for the rest of his (and his children's, and children's children's) life. Having the works pass into the public domain is a good idea for two reasons:
1. It is for the greater good of society as other people build on earlier works.
2. It keeps the artist busy as they were supposed to have to keep releasing work to feed themselves as their early work passed into the public domain, just like any other job.
I read the internet for the articles.
I think he was just watching all his volunteers working on one page a day and thought:
"Imagine a beowulf cluster of these!"
Lots of books aren't copyrighted anymore, as the copyright expired. You see, back before Disney bought legislation from people like Sonny Bono, copyrights would be allowed to expire after about 50 years or so.
Beowulf, Moby Dick, Shakespeare's plays, etc. are all free as in speech and beer. Edited versions of the original text can be copyrighted. Examples of that are editions of Shakespeare's plays with "translations" next to the original text. You can buy his complete works, unedited, for very little $ these days. The only cost for the publisher is printing and typesetting.
How about this.... use an open source speech synthesis tool/API that can play these text books (especially as more get added) over a PDA, laptop, etc while cruising in on the way to work and home. Something like:
http://www.cstr.ed.ac.uk/projects/festival/
(no plug, just did a quick freshmeat search)
would be pretty cool to get some good novels read to you w/o buying the tapes.
Or duplication, maybe?
Illegitimi non carborundum
Sure, it starts as just one a day. But, before you know it, you're doing two, then five, then ten.
You stop going out with friends or even returning their calls, personal hygiene takes a back seat and even Counter Strike and Warcraft III become unappealing. And, finally, after countless chapters and hundreds of pages you realise that your friends were right: you're an addict.
Just one page a day, huh? Yeah, right.
Opium. Pot. Cocaine. Now pages.
It might not be your older brother's drug, or your Daddy's or your grandfather's, but, trust me, this stuff can be dangerous.
Do what I do. Just say no.
"Accept that some days you are the pigeon, and some days you are the statue." - David Brent, Wernham Hogg
If we just write one line of code a day each we'll have better OCR in no time.
#include
Ok, there is my line of code, everybody else, finish it up.
I can't wait to see this great new OCR.
Is there a list of books that are out of copyright and perhaps the status of those books on the Gutenberg Project website or anywhere else?
On the archive.org Gutenberg page they list the most popular downloads.
Number 2 is something called "New Hacker's Dictionary, The"
Every time I refresh the page the download count has increased.
A variation on the slashdot effect ?
Gutenberg page at archive.org
This a great project... But after doing my first page I found a couple of possible enhancements.
Add a "Quality" stat for each person. Base it on the number of things that were missed (in other words, the number of things that the second-string proofer finds).
Use more than just two proofers. Have one "First String" proofer, who could be anybody, but have two second string proofers (who both get the output of the first string proofer). If the second string proofers have any differences in their output (with the exception of white space), then another second string proofer should be used. Only proofers with a certain quality rating (slightly higher than what a newbie's would be) should be able to do the second string proofing.
The "User rating" should be a combination of the number of pages done and the quality rating of those pages. Note that quality rating would only be increased by doing first string proofing. Page count would go up for any proofing.
Quality could be a float, starting at 1.0 for newbies. Every page that is completed and has a second-string person check would then go into a calculation like:
_new_quality_ = _old_quality_ + (0.01 - (_num_differences_between_their_proof_and_final_proof_ / 1000))
Thus, for every page proofed that requires NO corrections by the second string the user's quality would go up by 0.01. ( 0.01 - 0/1000 = 0.01 )
If there were exactly ten errors in the proofing, their quality would stay flat ( 0.01 - 10/1000 = 0.00 ); with more than ten it would go down ( 0.01 - 20/1000 = -0.01 ).
Have a threshold of 1.10 or some such for second string proofers... That way it would require the user to do at least 10 perfect pages, or 20 pages with 5 errors, etc, before they could do the second string proofing.
Obviously, make sure that the second string proofer can't see who the first string proofer is.
The "User Rating" (mentioned above) could just be a multiplication of the Quality and Page Counts...
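Here is a quick Python sketch of the scoring scheme above, just to check the arithmetic. The class and method names are invented for illustration. One liberty taken: quality is stored as an integer number of thousandths (1000 means 1.0) rather than a float, so that ten perfect pages land exactly on the 1.10 threshold instead of drifting by a rounding error.

```python
from dataclasses import dataclass

QUALITY_START = 1000      # 1.0 for newbies, stored in thousandths
PAGE_BONUS = 10           # the +0.01 per page, before subtracting differences
SECOND_STRING_MIN = 1100  # the suggested 1.10 threshold


@dataclass
class Proofer:
    quality: int = QUALITY_START
    pages: int = 0

    def record_first_string_page(self, num_differences: int) -> None:
        # new_quality = old_quality + (0.01 - num_differences / 1000)
        self.quality += PAGE_BONUS - num_differences
        self.pages += 1

    def record_second_string_page(self) -> None:
        # Second string work raises the page count but not the quality.
        self.pages += 1

    def can_second_string(self) -> bool:
        return self.quality >= SECOND_STRING_MIN

    def user_rating(self) -> float:
        # "User Rating" as a plain product of quality and page count.
        return self.quality * self.pages / 1000


perfect = Proofer()
for _ in range(10):
    perfect.record_first_string_page(0)

sloppy = Proofer()
for _ in range(20):
    sloppy.record_first_string_page(5)

print(perfect.can_second_string(), sloppy.can_second_string())
```

Both example proofers reach the threshold, which matches the "10 perfect pages, or 20 pages with 5 errors" figures in the proposal.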
Sticks and Stones may break my bones, but copyright will always protect me.
Reading the blurb at the page-a-day site, it says ASCII only where bold is converted to ALL CAPS, the English pound symbol is rendered as "L," etc. No preservation of figures, drawings, or photos.
This seems very short sighted to me. Devices that can only display ASCII are becoming rarer and rarer. Why not, instead, store docs in some sort of SGML format to handle the special markup (which must be rare) and then down convert to ASCII when needed?
I've tried reading these things on my Palm. Very difficult. But if I could get a nice typeset PDF version, that would be a whole different story (no pun intended).
I know for a fact that there are a lot of digital copies of copyrighted works such as Frank Herbert's Dune series and The Lord of the Rings floating around the Net and I think the newsgroups as well.
Of course, there are. And why shouldn't there be? Information (and Entertainment) Must Be Free!
Just ask Harlan
How long before someone writes a script to hit "Save and get another Page" and they shoot to the top of the ladder claiming to have proofread 13,450,213 pages per day...
I've used both clara and gOCR. Neither works well enough yet to actually use for scanning books.
One page a day shouldn't be a problem.
Remember, You are unique...just like everyone else.
I beg to differ. Foreground and background are all relative -- everything your computer does is foreground to IT -- it's devoting 100% of its attention (if it's a single-processor machine) to one task at a time.
In this case, the term is relative to your boss -- foreground is that report that's due tomorrow, background is reading Slashdot, drinking coffee and doing distributed proofreading. Which is all fine as long as bosses don't have good human-load-management tools...
ssh bob's-head
uptime
10:40am up 2 days straight, 1 user, load average: 2.06, 2.08, 2.08
killall slashdot-reading
uptime
10:41am up 2 days 1 minute, 1 user, load average: 0.85, 0.83, 0.89
lo
OCR Engines are not email programs. You can't just add a line of code and all of a sudden it works better. Usually you have to spend time developing a complicated algorithm. Usually this is more than a line of code. Then you have to test it against known text (ground truth) to make sure it's a benefit, rather than a problem over a broad selection of pages. It's quite often the case that something that improves one page makes another worse.
Actually, having people make verifications against the OCR results establishes the ground truth, which someone could use to improve the OCR engine. So by doing a Page a Day, you are helping to make future Open Source OCR engines better.
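As a small illustration of what "testing against ground truth" can mean in practice, here is a Python sketch of a character error rate: the edit distance between the OCR output and the proofread text, divided by the length of the proofread text. The function names are hypothetical, and this is only one common way to score an OCR run, not necessarily what any particular engine's developers use.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(ocr_text: str, ground_truth: str) -> float:
    """Edit distance normalised by the length of the ground truth."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    return edit_distance(ocr_text, ground_truth) / len(ground_truth)


print(char_error_rate("brwn", "brown"))
```

Running a candidate change to the engine over many proofread pages and comparing the average rate before and after is the kind of check described above: a tweak that helps one page can still make the overall number worse.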
I am not a number! I am a man! And don't you
How many books are we talking about? Those out of copyright and not in PG.
If the trend of copyright extension doesn't end soon that number may reach zero, but how soon is that?
"The last thing I want to do is deal with a bunch of people who want something."
Major Major
You seem to have skipped the second sentence of the post you replied to, even though the editorial corrections you refer to would undoubtedly appear on the scanned pages. One way around it might be that each page is covered under fair use, and they are not served to the proofreader in order, so you never are given more than a one-page excerpt.
__
Do ya feel happy-go-lucky, punk?
Something I posted on 10/24...
Go here. Now. It's the most complete listing of distributed computing I've ever found. Has the usual, like folding and SETI, but also neat things like Distributed Proofreading and finding as-of-yet unknown comets.
"`Ford, you're turning into a penguin. Stop it.'" -Douglas Adams, THHGTTG
It helps the community and adds books to my shelf, at the same time. Amazing, I'm in.
OK, here's mine:
#include stdio.h
next...
My beliefs do not require that you agree with them.
I just put in a few pages (15 if you care :), and while some were very consistent in quality, at least one book had some smears and spots. There's no way an OCR of any quality would be able to reverse engineer the half-printed letters and words back to readable English without a *good* dictionary/grammar machine, and even then it would be more dangerous to have it do a half-assed guess than to have a human there that will at once tell that this is a trouble spot and that the OCR dropped the ball. God, that last was an ugly sentence, guess I should stick to proofreading and not start writing myself...
Kjella
Live today, because you never know what tomorrow brings
Sorry, but this isn't strictly true. See my earlier post. Publishers tweak the text ("corrections" mostly) which give them copyright over their particular publication.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
De man wit de gun sez we be spellin an readin good.now you just sit there wich yo hans where i can see em an READ!you gunna be uh literate poofreader when we git through.
*Repent!Quit Your Job!Slack Off!The World Ends Tomorrow and You May Die!
I have a few books that are old enough to be well out of copyright (and obscure enough not to be found online already), and for a while I have been considering typing them in. OCR would be a lot easier, but getting a good image from a flatbed scanner would seriously damage most of these books. Even a handheld scanner would be impractical in some cases, and a digital camera seems even less likely to work. Is there any reasonable way to scan in pages from something like a 100+ year old 1.5" thick wire-bound paperback book that only opens about 60 degrees before putting up a fight?
Copyright law is supposed to give incentive to create, for the betterment of society, and allow the creator to derive direct benefits as a reward. An artist who has created a work so successful that (s)he can live on it indefinitely has arguably provided a suitable level of betterment to society.
Saying that copyright law is an incentive to "work" is accepting mediocrity. Artists who produce works that society values more highly should (have the opportunity to) receive more benefits.
On the other hand, I don't necessarily agree that copyright should last the lifetime of the creator (although there are strong arguments for this in the case of a natural person). But what is a "fair" limit?
Is 5 years enough? Almost certainly not. Many authors only achieve popularity after 10 or more years, and then make a fair amount of money off increased sales of their older works. A good number accept this as a risk, and plan to use this phenomenon to their benefit - work up a good number of titles with varied content, and you'll pull more readers, who are then likely to try some of your other titles.
Is 20 years enough? Maybe. But some of our best-loved authors were 15-20 years ahead of their time in terms of what readers wanted.
Is life enough? Strangely, no. If an aging star has just completed his/her autobiography, concludes the publishing deal, and dies ... well, the family could well be screwed.
Maybe the answer lies in a compromise, rather than an all-or-nothing approach. Copyright over a work lasts for the greater of 10 years or the creator's natural life (which gets very interesting when we get eternal life medications ...). But some rights fall away after the LESSER of those two times, such as exclusivity over derivative works (but not translations).
This allows society to (culturally) enrich itself by building on a work after a shorter amount of time, while the creator (and/or family) can still derive value from the original work for a longer time.
In the case of books this is easily understood: author writes book; 10 years later other people can write preludes and sequels, extend the world and characters, etc; 30 years later author dies and original book falls into public domain.
i-name =twylite [http://public.xdi.org/=twylite], see idcommons.net
hehe, pr0n...
The response from the slashdot community is impressive. Already they have hit their mark for the day as far as 'pages processed'. They have over 1400 (at 10:13am CST) pages processed. When I visited their site at 8:45am CST they had only 615 pages. I predict that the project will hit the 3000 mark fairly quickly for today.
I am pretty sure that PG takes care to only use old copies of books that are in fact no longer copyrighted if that is in fact necessary. They seem very picky in making sure that they follow the rules.
It seems like every few years I turn around and notice that some massive archive collection gets sued, goes out of business, has funding pulled, gets tangled in legal action, has a university board go into panic mode, etc. and suddenly it disappears without warning or notice to the frustration of many. I'm certain you also can name a number of services, collections, and resources that spontaneously vanished when hosted at friendly sites. History has proven that despite best intentions, nothing lasts forever unless we go out of our way to protect it.
So that work isn't lost or destroyed, are any of the mega-sized projects replicated elsewhere in the event that a "it'll never happen" situation crops up to this unsuspecting resource?
After finding Thea von Harbou's Metropolis at www.blackmask.com, I go there first when looking for an ebook, especially since they have them in e-silo format (Palm). If they don't have what I'm looking for I go by Project Gutenberg...
Thanks to file sharing, I purchase more CDs
Thanks to the RIAA, I buy them used...
OT: I can't believe I got modded down for this comment. Sheesh. Serves me right for posting at all. From now on I'll just keep to myself and not contribute anything at all. It's not even worth it to moderate others as well when I get moderator points.
Mod me down if you wish I don't care anymore.
MY GOD! A story where nitpicking grammar and spelling is *ON* topic.
This'll be a fun one to read through.
Do you Gentoo!?
I just proofread 2 pages of some Greek philosophy book. The system works really nicely! Quick database, pages not too large to read. Except I would like to have the source and text next to each other, and not one above the other.
I signed up for an account, and did a bit of proofing. One page was a bibliography with lots of numbers -- the OCR software made a few errors here and there, sometimes confusing "1" with "!". Another page was in old German. Since many old German characters look so different than their modern-day counterparts, I was quite impressed when it translated them flawlessly into their proper ASCII counterparts. The OCR software even got the umlauts right. Only problem was it sometimes mistook an end of line "-" for a "=". One problem I did have was that most of the scans seemed to be pretty low resolution. This causes problems when comparing the scanned text to the original image, as it can create difficulties for the proofreader. The software also had trouble translating the low-res blocks.
http://cltracker.net -- powerful craigslist multi-city search
You could use spelling and grammar checking to improve the ocr.
The quick brwn fox jumped over lazy dog.
It would be easy to figure out that brwn should be brown. The OCR should see something between lazy and dog; using grammar rules, it could possibly figure out what the most likely word should be.
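The spelling half of that is easy to sketch in Python with the standard library's `difflib.get_close_matches`. The helper names and the tiny word list here are made up for the example; a real setup would need a proper dictionary, and this does nothing about the missing word before "dog", which would take actual grammar rules.

```python
from difflib import get_close_matches


def correct_word(word: str, vocabulary, cutoff: float = 0.8) -> str:
    """Return the closest vocabulary word, or the word unchanged if none is close."""
    if word in vocabulary:
        return word
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else word


def correct_line(line: str, vocabulary) -> str:
    """Apply correct_word to every whitespace-separated token."""
    return " ".join(correct_word(w, vocabulary) for w in line.split())


# Toy word list, invented for this example.
vocab = ["the", "quick", "brown", "fox", "jumped", "over", "lazy", "dog"]
print(correct_line("the quick brwn fox jumped over lazy dog", vocab))
```

The obvious risk is that a checker like this will happily "fix" archaic spellings, dialect and proper names that were correct in the original, which is a real concern for old books.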
I'll help out.
One question - is Playboy public domain yet?
My wife had a suggestion for limiting the life of copyright. Basically, tie it to the amount of income you get from the work. Once you reach a certain plateau, the work falls into the public domain (although you could argue for an additional minimum time requirement, i.e. 5 years for movies, so that a gigantic blockbuster won't enter the public domain after 6 months). Or instead of income, base it on profit. That way, you are guaranteed that you will make a certain amount of money before the work enters the public domain. Of course, for works that never reach the plateau, they would enter the public domain after a suitable period -- e.g. life plus 10 years for natural persons, or something incredibly short for a corporation, like 20 years.
Of course, there are practical problems with this method -- namely, accurately determining the amount of money a work takes in. It's all too easy to fudge financial data, as we've been too often reminded in the past year, and this idea may not be workable.
"Destroy science and religion. Science would re-emerge exactly the same; but not religion." - Penn Jillette, paraphrased
(en tea)
It looks like the texts01.archive.org/dp site is holding up fairly well! If you cannot get through today, though, please check back later. Slashdot effect aside, it's usually quite speedy and has a decent 'net connection. If you want to keep informed of current events, get on one of our mailing lists via (when it's not slashdotted) our subscriptions page.
Dr. Gregory B. Newby // 919-962-8064
Chief Executive and Director
Project Gutenberg Literary Archive Foundation http://gutenberg.net
A 501(c)(3) not-for-profit organization with EIN 64-6221541
gbnewby@ils.unc.edu
...Not many, but there are some Project Gutenberg books that are copyrighted and distributed with the author's permission.
Also, Project Gutenberg of Australia publishes a number of works that are out of copyright in Australia, but still under copyright in the U.S. It is a copyright infringement for readers in the U.S. to download these works, which include, among others, Hervey Allen's _Anthony Adverse_ (1933), F. Scott Fitzgerald's _The Great Gatsby_ (1925), Khalil Gibran's _The Prophet_ (1923), D. H. Lawrence's _Lady Chatterley's Lover_ (1928), all of George Orwell's novels, most of Virginia Woolf's, etc. etc.
Not exactly "the latest Stephen King" but a lot newer than Dickens.
"How to Do Nothing," kids activities, back in print!
Damn, work is getting in the way of my proof reading! Why can't I make my boss understand that Project Gutenberg is much more important than what I'm normally doing?
From the website:
Pages completed today: 2633 as of 9:01 Pacific Time today
The average pages / day so far this month (before today) is less than 1100!
Do they want me to manually scan through a page of text compare it with an image and fix errors created by OCR? It goes against my very nature to do such a task. There has to be a better way, a programming way, to get this done without having to look at all of the files with human eyes.
I haven't finished my first cup of coffee yet so I am at a loss for a solution, but it sounds like something Perl would be good at.
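For what it's worth, here is a rough sketch of the kind of automated pre-check this might mean, in Python rather than Perl. It only flags tokens that mix letters and digits (a common OCR slip such as "qu1ck" for "quick") so that a human knows where to look first. The function name and the heuristic are assumptions made for illustration; nothing here is taken from the Distributed Proofreaders code, and it would not replace comparing the text against the page image.

```python
import re

# Hypothetical heuristic: a "word" containing both a letter and a digit
# is more likely an OCR misread than a real word. Pure numbers such as
# years are left alone because they contain no letter.
MIXED = re.compile(r"\b(?=\w*[A-Za-z])(?=\w*\d)\w+")


def flag_suspects(page_text):
    """Return (line_number, token) pairs for tokens mixing letters and digits."""
    suspects = []
    for line_no, line in enumerate(page_text.splitlines(), start=1):
        for match in MIXED.finditer(line):
            suspects.append((line_no, match.group()))
    return suspects


if __name__ == "__main__":
    sample = "The qu1ck brown fox\njumped 0ver the lazy dog in 1933."
    for line_no, token in flag_suspects(sample):
        print(f"line {line_no}: check '{token}'")
```

A check like this will also flag legitimate tokens such as "1st", so it can only narrow down where to look, not decide what the page actually says.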
The motto of the open source community should be, or is, "Progress not perfection."
LoRider
Looks like slashdot has caused the pages per day to go way up! Last couple of days had only accomplished between 1000 and 1100 pages per day. As of 12:30PM, the pages for today are already above 2600!
Wonder how long the increased rate will last?
-Dubya
I wish I could get my webpage to show up on the main /. page!
SIGFAULT
That sounds like a great way to encourage mediocrity. I won't write the best book possible. I will only write one good enough to sell enough copies to reach my maximum royalty. Your wife's method would reward the Danielle Steels of the world and punish the Joseph Hellers. I think the ideal way is 20 years, alive or dead. What other job pays you for 20 years for what took you maybe 1 or 2 years?
OK, I'll start at the other end and work my way toward you:
}
Their approach to solving this reminds me of how the Oxford English Dictionary was started -- by compiling submissions and references from thousands of volunteers. A really enjoyable recounting of this (and of one particular person who contributed thousands of words while in an insane asylum) is The Professor and the Madman.
YM,
#include <stdio.h>
One line, one bug. Yikes!
Well, yeah - it's easy to make money publishing a book if you don't have to pay the author anything and all the marketing has already been done. For new books, the copyright system is the best way to ensure a publisher can recoup these costs.
I'm confused. Is copyright protection supposed to protect the marketers or the artist?
I do believe you have linked to a copyright circumvention device (the .au domain) in violation of the DMCA. Please stand by while you and your belongings are liquidated.
Er, wait...
A while ago I started to write a Linux client for the distributed proofreaders site. I got a fair amount of it done, but there were some messy parts, buggy parts, and parts left undone. If anyone would like to check it out, or even work on it, it is at http://kapheine.hypa.net. I haven't worked on it in a while, unfortunately, and I probably won't.
-- kapheine
If they're looking for proofreaders here, the project is in deep trouble...
This page accidentally left blank
But the publishers still have copyright on their specific printing.
I've heard this in the context of German law, but never in the context of American law. American law requires significant creative effort for a work to be copyrighted, and dumping text to paper rarely counts. (New footnotes and illustrations are a different matter, of course.)
In the case of books this is easily understood: author writes book; ... 30 years later author dies and original book falls into public domain.
That would make an incentive for people to kill you so they can steal your work.
Wait a minute! Isn't PHP like evil or something?
Programming languages may come and go, but good old-fashioned machine code will last as long as literature, very much like good old-fashioned ASCII text and good old-fashioned zip files with no meaningful names.
It's absurd! :^)
It's inane!
It's Malaprop Man.
Walt Disney wanted to extend the rights to his branded characters and got the lawmakers to do it. In some respects his old stuff is renewed every decade: new generations of kids and new media: film, theme park, video tape, DVD, IMAX ...
Each reissue is a new pile of money.
Interesting points. There is the fact that deliberately creating an artistic work that will reach a certain cash plateau is nearly impossible -- just look at how many creative endeavors never even get so far as to break even, and that's with authors trying really hard.
Also, there's the fact that an over-successful work creates desire for an author's other works -- so writing something which will exceed its copyright profit cap would still create income for the author's other works.
Additionally, if there's a minimum time limit set on the work (I'd say 15-20 years for books), then even if it is wildly successful, you could reap the profits for 20 years, even if you greatly exceeded the profit cap. Once that 20-year deadline hit, of course, the copyright would expire. Trying to calculate your work so that you only barely reach the profit cap *after* the minimum time would be utterly impossible, so I doubt that would have any effect on authors' efforts.
All that said, yours is a simpler solution (and one that I would support) -- 20 year copyright, non-extendable, from the date of first publication, regardless of the author. Period. Copyrights would be transferable (i.e. I could sell my copyright to a new owner, and I would lose *all* rights to it). It's an acceptable solution, though it doesn't mean it's the best solution (or even realistic, politically speaking).
This is great, but it's even more addictive than the Kill Everyone Project. Though arguably not as worthwhile.
My deviantArt site
set_bugs = 0;
Please consider making an automatic monthly recurring donation to the EFF
2. Do you know how long it has been since I wrote any C code? I was lucky I spelled stdio correctly.
My beliefs do not require that you agree with them.
For the reason why, I suggest you "learn up" on what public domain really means in the US. Public domain simply means that a particular work has no copyright restrictions. It does not mean that you are prohibited from adding further copyright restrictions of your own.
In other words, a work which is public domain is free for all to copy in any way they wish, including copyrighting a copy for themselves. Note that placing your own copyright on the work does not mean that the original work is copyrighted. It just means that your copy is copyrighted. Anyone is still free to access the original copy, which is still in the public domain. But they can't use your copy if your copy has your copyright.
You might ask "are there laws that prohibit you from lying about the authorship of a work?" The answer is yes. It's called fraud. It has nothing to do with copyright. Placing your own copyright on a work, and claiming authorship of a work, are two completely independent actions according to the legal system.
You are totally right that the cover text is not enforceable with regard to "fair use" copying of the text, but the parts that say "Copyright 1974 Houghton Mifflin" and "All rights reserved" are definitely valid, enforceable, and meaningful.
are actually the preferred way to proof text. A project to create "The Collected Works of Edmund Spenser" is headquartered here, and the English-types were looking for people to work on some software for them. The current most accurate way to create an electronic copy is to hire people without even a passing familiarity with the alphabet you are targeting, train them to identify the letters themselves (using the font you're targeting, which may be very much non-standard, esp. for work as old as Spenser's), and have them enter it in character by character. You then have another illiterate person do the same, and have 1 editor (English graduate student) check both copies. Then any differences have to be handled by another editor (English PhD), and the final copy signed off by yet another editor (PhD).
A very very expensive way to do it.
See, an illiterate person won't introduce any bias into the text. They will faithfully duplicate any spelling mistakes that they find. In the case of an English scholarly collection, the mistakes are among the most important parts, since they can identify different print runs, and how language shifts over time.
As a side note, the software project is hopeless. The best that can be managed is to automate the administration of their current systems--no OCR will ever meet the level of accuracy that their current system provides.
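To make the "automate the administration" part concrete, here is a minimal sketch of the comparison step in the double-keying workflow described above: two independently typed copies are compared character by character, and only the disagreements are handed to an editor. It uses Python's standard `difflib`. The function name and the sample strings are made up for illustration, and this is not the Spenser project's actual software.

```python
import difflib


def keying_disagreements(copy_a, copy_b):
    """Return the spans where two independent transcriptions differ.

    Each result is (kind, text_in_a, text_in_b), where kind is one of
    'replace', 'delete' or 'insert' as reported by difflib.
    """
    # autojunk=False so that frequently repeated characters (spaces,
    # common letters) are never discarded when aligning long pages.
    matcher = difflib.SequenceMatcher(None, copy_a, copy_b, autojunk=False)
    return [
        (tag, copy_a[i1:i2], copy_b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Hypothetical sample: one typist kept the old spelling, one "fixed" it.
    first_typist = "vertue"
    second_typist = "virtue"
    for kind, a_text, b_text in keying_disagreements(first_typist, second_typist):
        print(f"{kind}: {a_text!r} vs {b_text!r}")
```

The tool deliberately makes no guess about which copy is right; deciding that stays with the editor, which is the whole point of keying the text twice.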
next line
#include "ocrLib.h"
Workable?
As Nietzsche famously said, "If you stare too long into the Abyss, 1d4 Tanar'ri of random type will attack you."
I don't want to get squeezed in the middle, so I'll work _downwards_ from you.
#else
Well, I'm assuming you want it to work in both Windoze and lunix. I just got the feeling that what you were writing wouldn't be portable.
THL.
Keeping
It's on Slashdot, so everyone does a few pages, finds out it's actually fairly tedious, and only a few will remain of the initial burst. They're at about 7000 for today right now, which is about 1000 more than what they've done so far this month. Don't build your site based on these estimates.
Check back there in a few weeks to see how the site is doing. Hopefully quite well, since it is a splendid and worthwhile[1] effort.
[1]: And only in the preview did I realize I sounded like that woman in the HHGTTG.
That would make an incentive for people to kill you so they can steal your work.
Do authors burn at 431 F?
__
Men with no respect for life must never be allowed to control the ultimate instruments of death.
GW Bu
So you won't find the latest bestsellers or modern computer books here. You will find the classic books from the start of this century and previous centuries, from authors like Shakespeare, Poe, Dante, as well as well-loved favorites like the Sherlock Holmes stories by Sir Arthur Conan Doyle, the Tarzan and Mars books of Edgar Rice Burroughs, Alice's adventures in Wonderland as told by Lewis Carroll, and thousands of others.
The texts you mention are "illegal" replications/duplications. But please do read about the travesty of copyright laws on their site as well. And vote accordingly in 2004. And don't get discouraged - keep posting.
...but first read the Proofing FAQ on the site and save yourself some confusion:
http://texts01.archive.org/dp/faq/ProoferFAQ.html
Especially read section 5 for some of their typesetting-to-ASCII conventions, which would be non-obvious otherwise.
No it doesn't. (not counting copyright not being renewed, which I suppose counts, but...) At this point, no new works about which the author actually cares enough to renew copyright are going into the public domain; if no new laws are passed this will change in 2018...
The standards for whether a particular addition constitutes significant creative value are remarkably low. The already mentioned spelling modernization, for instance, is an example of a tangible modification to the Shakespeare texts over which Houghton-Mifflin can legitimately claim copyright.
You could, in theory, copy their Shakespeare book, IF you somehow removed all of their spelling and editorial changes, line numberings, page numberings, annotations, commentary, illustrations, etc. from every page of the book. In practice, this is not so easy, because it is not easy to tell what was changed and what was not changed unless you have an original copy to compare it with. And if you have an original copy of Shakespeare, then you don't need the Houghton-Mifflin published version anyway.
You can now see the (beneficial) results of a good old-fashioned Slashdotting on the front page, with the graph for pages from Nov. 8 going way off the scale.
JFMILLER
Strive to make your client happy, not necessarily give them what they ask for
Here's mine:
return 0;
By the time this is working most of what I'd like to read should be public domain anyway....
O frabjous day! Callooh! Callay!
Next time I get mod points, I think I'm gonna just start modding down posts that complain about moderations done to their authors. Nothing personal, Jack, it's just something I've noticed a lot of recently on slashdot, and it's really frickin' lame. I mean, it's OK to complain that someone else got hammered by the mods, but to whinge about one's own fate - dude, it's only karma, get over it already.
(This is of course in addition to the policy I'm borrowing from someone's sig: "I moderate down any post that says 'I'll probably get moderated down for this'." Same principle.)
Of course, my new-found policy will probably be a big hit with the metamods, some of whom have taken on personal quests to rate all downmods as unfair. So logically I should hide behind "overrated". But I won't. That would be lame. (:
"How can you claim that you are anti-crack, while still writing a window manager?" — Metacity README
Oh, and before you get all pissed at me - yes, I agree that you were moderated unfairly. Yes, a certain amount of crack was most likely involved. I don't care - it's still tacky and tiresome to complain about it yourself.
I meant to say that if you could affect which books get converted into electronic form you might be more interested.
Voting might not be the way to go, but I don't feel that I'd be very interested if there are no books I have any interest in personally.
.: Max Romantschuk
Earl Wiener, 55, a University of Miami professor of management science,
telling the Airline Pilots Association (in jest) about 21st century aircraft:
"The crew will consist of one pilot and a dog. The pilot will
nurture and feed the dog. The dog will be there to bite the
pilot if he touches anything."
-- Fortune, Sept. 26, 1988
[the *magazine*, silly!]
- this post brought to you by the Automated Last Post Generator...