Why Project Gutenberg Isn't There Yet
option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg, and similar efforts, are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly)."
That's crazy. OCR will always be faster than speech, even if speech recognition ever works, which it currently does not.
Aren't many books typeset using Latex? Just post the source.
word.
I really like project gutenberg. I have many of their texts....
I'd really like to see them succeed.
What about the cost of the books? Unless the only books you have in this "universal library" are old enough to be without copyrights, won't there be a problem in finding funding to buy current day books?
I'm not too informed about this topic; feel free to correct me.
If the goal is a universal library, and there is a need for a work force, why not start a program at the library level to use librarians as a volunteer work force, perhaps as a side project they might be interested in helping along? I think of it as SETI in the library world.. *shrug*
So many of the things that people want to read are copyrighted, and won't be available until long after we're dead.
There are more serious technical issues for an online library. A real library has only a limited number of books, and they can only be lent to a limited number of people at once. An electronic library would have to follow the same principles, with time limits and user limits for each document. Otherwise it would be giving out works for free, and most are not intended to be distributed that way.
Stanley Feinbaum, professional journalist and master debater! God bless the USA!
It would make life so much easier if I could search an online library rather than searching the library index. Just think how much space we could save as well, rather than having shelves full of books that are basically dead weight 95% of the time.
I think copyright has got to be the biggest hurdle. Publishing houses aren't easily going to be persuaded to put, oh, say, the next Harry Potter book online for free and risk losing millions.
Distributed Proofreaders. Recently discussed on /. as well.
The mechanics of a universal library are simple. The tricky part: hairdressing the free labor.
Karma: Barber
#3 pencils and quadrille pads.
So the point of this post is: why not ask publishers for the material? If it's already public domain, it's not like they'll lose profits, and maybe Project Gutenberg could let them put a little
kind of thing at the top of each book they donate. Plus, maybe it's a tax write-off. I don't know. That said, I'd think it'd be much easier to just type things in than OCR it or use speech-to-text.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
I'd just like to point out that this is the third story from Wired to show up on slashdot today. And it's not even that bad of a story. I think this must mean Wired is cool again.
Keep in mind the following copyright rules:
1. Works first published before January 1, 1923 with proper copyright notice entered the public domain no later than 75 years from the date copyright was first secured. Hence, all works whose copyrights were secured before 1923 are now in the public domain.
(This is the rule Project Gutenberg uses most often)
Works published from 1923-1977 retain copyright for 95 years. No such works will enter the public domain until 2019.
2. Works first created on or after January 1, 1978 enter the public domain 70 years after the death of the author if the author is a natural person.
(Nothing will enter the public domain under this rule until at least January 1, 2049.)
3. Works first created on or after January 1, 1978 which are created by a corporate author enter the public domain 95 years after publication or 120 years after creation whichever occurs first.
(Nothing will enter the public domain under this rule until at least January 1, 2074.)
4. Works created before January 1, 1978 but not published before that date are copyrighted under rules 2 and 3 above, except that in no case will the copyright on a work not published prior to January 1, 1978 expire before December 31, 2002. If the work is published before December 31, 2002, its copyright will not expire before December 31, 2047.
(This rule copyrights a lot of manuscripts that we would otherwise think of as public domain because of their age.)
5. If a substantial number of copies were printed and distributed in the U.S. prior to March 1, 1989 without a copyright notice, and the work is of entirely American authorship, or was first published in the United States, the work is in the public domain in the U.S.
6. (This rule is complicated, and is seldom applied). Works published before 1964 needed to have their copyrights renewed in their 28th year, or they'd enter into the public domain. Some books originally published outside of the US by non-Americans are exempt from this requirement, under GATT. Works from before 1964 were automatically renewed if ALL of these apply:
At least one author was a citizen or resident of a foreign country (outside the US) that's a party to the applicable copyright agreements. (Almost all countries are parties to these agreements.)
The work was still under copyright in at least one author's "home country" at the time the GATT copyright agreement went into effect for that country (January 1, 1996 for most countries).
The work was first published abroad, and not published in the United States until at least 30 days after its first publication abroad.
This means that we can't simply take electronic versions of modern texts and put them in the archive, because only out-of-copyright books are in there.
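For anyone who wants to play with the dates, here is a rough Python sketch of rules 1 through 3 (plus the 1923-1977 case) exactly as summarized above. It is only an illustration of that summary, not legal advice: the function name and parameters are made up for this example, rules 4 through 6 (unpublished manuscripts, missing notices, renewals and GATT) are not modeled at all, and "enters the public domain" is assumed to mean January 1 of the year after the term runs out.

```python
def public_domain_year(published=None, created=None, author_death=None, corporate=False):
    """First year a work is public domain in the US, per the summary above.

    Hypothetical helper: covers only rules 1-3 and the 1923-1977 case.
    """
    # Rule 1: published before 1923. The summary says "no later than
    # 75 years", so this value is an upper bound, not an exact date.
    if published is not None and published < 1923:
        return published + 75

    # Published 1923-1977: 95-year term, free on Jan 1 of the next year.
    if published is not None and published <= 1977:
        return published + 96

    # Rule 3: corporate author, 95 years after publication or
    # 120 years after creation, whichever comes first.
    if corporate:
        terms = []
        if published is not None:
            terms.append(published + 95)
        if created is not None:
            terms.append(created + 120)
        if not terms:
            raise ValueError("need a publication or creation year")
        return min(terms) + 1

    # Rule 2: natural person, 70 years after the author's death.
    if author_death is None:
        raise ValueError("need the author's death year")
    return author_death + 71


print(public_domain_year(published=1923))
print(public_domain_year(created=1978, author_death=1978))
print(public_domain_year(published=1978, created=1978, corporate=True))
```

Under those assumptions the three calls reproduce the dates quoted in the list: 2019, 2049 and 2074.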
and post them on Freenet or something. These IP 'laws' are just ridiculous.
In any case, the real obstacle to a useful electronic library isn't labor. It's copyright.
which will be ready first, Project Gutenberg or Duke Nukem?
Apparently the author of the article missed Distributed Proofreaders. They seem to have survived their Slashdotting and actually retained a good fraction of their new users. This month they've proofed 116,827 pages! (Cut that in half for unique pages, I think) They have completed in their 2(?) years of existence 918 books, and have another 317 being assembled. It really seems like they are only limited by what they can get their hands on in the public domain.
Ah, good old Gutenberg, the German man who invented the printing press. I believe he was made Man of the Millennium in 2000. Not bad for a guy who's been dead for a few hundred years. The Library of Congress has a Gutenberg Bible on display (the Bible being, of course, the first book made with a printing press.)
And while we're discussing the speech recognition for books, it wouldn't make sense for poetry, which uses alternate spellings sometimes. It also wouldn't make sense for at least one work that I can think of - Through the Looking-Glass by Lewis Carroll, which is already up there. When Alice first looks at the poem Jabberwocky, it's backwards. Try saying that backwards faster than you can type it!
Anonymous Coward: (n.) 1. nerd at school or library. 2. karmawhore in training. 3. embarrassed prep.
The volunteer page is the place to start:
http://promo.net/pg/volunteer.html
All digital versions of books that publishers have should be requested and maintained in a safe place till their respective copyrights expire so that they can be easily integrated into the public domain.... especially if OCR or speech recognition doesn't get any better any time soon.
---- The geek shall inherit the Earth.
I cannot believe that Project Gutenberg continues to use plain text as their source code! I can see why it would have been compelling in 1971, and it still may be true that there are systems out there that can only read 7-bit ASCII.
But that's absolutely no reason why the source shouldn't be marked up. Marked up source can always be converted to ASCII, but you cannot derive semantic markup from ASCII.
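To make the one-way nature of that conversion concrete, here is a small sketch using only Python's standard library. It assumes the marked-up source is HTML-like, which is my assumption and not something Project Gutenberg does; the class and function names are invented for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects only the character data and ignores every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def markup_to_plain(markup):
    """Strip the tags and force the result into 7-bit ASCII."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    text = "".join(extractor.parts)
    # Characters outside ASCII are replaced with '?'.
    return text.encode("ascii", "replace").decode("ascii")


emphasized = "<p>It was a <em>dark</em> night.</p>"
strong = "<p>It was a <strong>dark</strong> night.</p>"

print(markup_to_plain(emphasized))
print(markup_to_plain(emphasized) == markup_to_plain(strong))
```

Both inputs collapse to the same plain sentence, so nothing in the ASCII output tells you which markup it came from. Going in the other direction means a human has to put the semantics back by hand.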
Project Gutenberg just doesn't come across as something interesting or the first thing you think of when you think "Free electronic library". Even "WikiLibrary" would be better (although not a wiki).
Analytic & algebraic topology of locally Euclidean meterization of infinitely differentiable Riemmanian manifold
The best and cheapest way to get existing books on the web is to scan them and compress the images. Compression technology for text images is so good (see DjVu), and storage so cheap nowadays that you are better off just distributing high resolution scans.
This is a much more efficient way to make books available on the web, much more efficient than having volunteers painstakingly transcribe the text or correcting OCR mistakes.
OCR can be used for indexing scanned documents, but there is no need to do manual correction. DjVu can compress 300dpi black and white pages of text to 5-25KB. That's less than most HTML pages, and the images look just like the original book.
The Million Book Project at the Internet Archive uses DjVu (as well as other formats).
The open source implementation of DjVu is available on SourceForge.
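As a back-of-the-envelope check on those numbers, here is a tiny Python sketch. The 5-25KB per page range is the figure quoted above; the 300-page book is just an assumed example, and the function name is made up.

```python
def scanned_book_size_kb(pages, kb_per_page_low=5, kb_per_page_high=25):
    """Return the (low, high) size estimate in KB for a scanned book."""
    return pages * kb_per_page_low, pages * kb_per_page_high


low, high = scanned_book_size_kb(300)  # assumed 300-page book
print(f"{low} KB to {high} KB")
```

So a hypothetical 300-page book would land somewhere between roughly 1.5MB and 7.5MB at the quoted compression rates.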
I might be wrong, or maybe some books are more '1337' than others, but I got the impression that there definitively are enough people willing to get the texts to digital format.
--
Stay tuned for some shock and awe coming right up after these messages!
As someone pointed out, the real problem is the copyright issue. Most works are copyrighted and copyrights last for way too long. The Constitution states that copyright should be limited, but when it's lifetime plus 70 years, it may as well be unlimited since we'll all be dead before they expire. There needs to be a grassroots movement to inspire a repeal of some seriously damaging legislation. I feel confident that most Slashdot readers agree about what needs to be done, but we seem too apathetic to actually do something about it. Sometimes I wish someone would post a link that says 'click here to vote for freedom'. If only it were that easy.
I think an interesting project would be public domain textbooks. Textbooks are grossly overpriced and contain information that is largely available for free. If a community of developers can create an OS like Linux, then the educational community should be able to come up with open textbooks.
The article didn't say that OCR was faster than speech; it said that speech was faster than transcribing it.
Come on, mods, read more carefully.
poliglut.org: they're still alive and fighting the man
Huh? I can type a good bit faster than I can speak.
TODO: Something witty here...
It is part of the philosophy of Project Gutenberg to publish all of their works in the lowest-level standard format, thus ensuring continued cross-platform, program-independent readability, ad infinitum.
That means *plain* ASCII. Plain ASCII means you could read it in edlin if you really had to.
This is a Good Thing.
This also means that if you wish to format any Project Gutenberg text, in HTML or TeX for publication, you start with a blank slate and can immediately start to work your own will upon the raw text.
This is also a Good Thing.
KFG
No 'project' is going to get Steve Guttenberg back in Police Academy.
It is time to move on...
WARNING: This sig does not contain a joke
---- El diablo esta en mis pantalones! Mire, mire!
It seems paradoxical, but there it is. I spend a huge amount of time glued to the screen, reading articles, blogs, forums, FAQs, HOW-TOs, etc. But I don't like it, in fact I find it aggravating.
I am lured and lulled by the vast amount of easy information suitably tailored to my interests, all with an easy-to-use, intuitive, associational (read: hypertextual) interface. But it is tiring, staring at a flickering, glaring screen for hours. My eyes get dry, and I strain and get tired picking out fuzzy objects when I try to focus at a distance. It's nasty and annoying.
Here is my point about this project. Nobody wants to read books on their computers. Well maybe some do, but I think the vast majority don't. Paper books are easily available and cheap. If you can't find the one you want in a local library or bookstore there are a multitude of ways of ordering them. You don't get tired looking at them, they are actually enjoyable. So why should there be a desire amongst the majority for e-books?
Don't get me wrong, I think it's a good idea, but not one that I, nor I think the majority, will go in for until a better way of presenting them is developed. LCDs are an improvement, but they are still shabby. I don't think a project like this is going to see much public interest until some better presentation medium is found. E-paper will be needed before the e-book becomes a reality for most people: some kind of little book-sized unit that you can hold and which will display on a matte, non-glaring, non-luminous surface.
There are a thousand forms of subversion, but few can equal the convenience and immediacy of a cream pie -Noel Godin
While I like the project, I think the biggest problem is the interface for using the books. They end up in this crappy .txt format. The searching and browsing is slow and painful. If they just spent a little time on the website, they might get more support!
-- these are only opinions and they might not be mine.
for short command sets. Mac OS X has excellent speech recognition for example. What we are lacking is a way to differentiate a larger vocabulary.
I can see PG's next release now:
Welcome to the audiotape version of.......
You can't judge a book by the way it wears its hair.
There seems to be an interesting recurring theme in human history - we constantly strive to build libraries but we have never yet built one that is quite "good enough".
The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Anthony and Cleopatra. I was amused to note that the local University recently received funding approval to rebuild it - grants committees move slowly.
In mediaeval times, monks were the guardians of knowledge and the various monasteries dotted around Europe were oases of learning and knowledge in those times. Knowledge was restricted to the few.
The original Gutenberg made it possible to create huge volumes (literally) of knowledge and disseminate it on a wide scale. Ever since, people in power have sought to control this technology - either through censorship, copyright, or even education (you have to be able to read before a book is of greatest use to you.)
In Victorian England, the mark of a scholarly gentleman was in the breadth of works he maintained in his private library.
Perhaps a new initiative might be Gutenberg@Home whereby any reader made an electronic copy of physical works by some convenient, nondestructive means. By keeping such a personal library private, one would not have to worry about copyright laws, even as currently framed.
How much of what is holding us back from building the perfect library is simply our insistence on monetary-related restrictions? How long will it take us to realize that lengthy (in time) and complex or intensive (in resources consumed) PHYSICAL processes are the only ones to which we need to attach a value, that whatever happens in the electronic world should be free, and that the collation, assembly, verification, dissemination and application of the sum of human knowledge is one of the most important things that we could achieve?
STF
Storing books online is one thing. Gutenberg also needs readers to be successful. How many readers are willing to read .txt or .pdf files instead of printed material?
Several times I downloaded Gutenberg books, with the intention of reading them from a laptop or screen later on. Turns out this is too inconvenient when compared to paper print. ...
If only electronic paper were 1c a page
Look in the back of a good book: the credits for the font, the markup design, the basic look. Formatting an ASCII book into something that is a pleasure to read is a lot of work. A book is more than just words. But Go Gut! We love ya.
Yup. Except that the vast majority of publishers won't give out their digital masters, even if the work in question is public domain. The formatting and page layout cost them money, and they (rightly or wrongly) feel that such a release would undercut their sales.
And even if you could get hold of the digital representation, it'd very likely be copyrighted as a "derivative work" (due to the layout info, page numbers, and even spelling corrections).
#bookz on irc.undernet is an excellent place for ebooks, of course, with a little illegality behind it. Many of these are the same ones that have been floating around on alt.binaries.ebooks since the stone ages, but I think this unrestricted database is probably the best library created.
Floppy disks get magnetized, hard drives crash, optical disks get scratched...A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media.
I think this take on media longevity also prevents progress WRT Project Gutenberg. Too many people don't see the point, when they can have the Library of Congress backed up on disk one day but be looking at a screen full of garbage characters the next because someone accidentally yanked the power supply on the server or whathaveyou.
A single $5 paperback book can be propagated more reliably than tens of thousands of dollars worth of networks and storage, although the latter system can admittedly hold a whole library's worth of that single book. But think about the infrastructure required to maintain the latter system. Until we have better media, the costs aren't justifiable, IMHO. It's an idea whose time has not yet come.
Introducing the new Bio-Optic Organized Knowledge device - BOOK.
BOOK is a revolutionary breakthrough in technology; no wires, no electric circuits, no batteries, nothing to be connected or switched on. It's easy to use. Even a child can operate it.
Compact and portable, it can be used anywhere - even sitting in an armchair by the fire - yet is powerful enough to hold as much as a CD-ROM.
[...]
BOOK never crashes nor requires rebooting. The 'browse' function allows instant movement to any sheet, forward or back, as one wishes. Many come with an 'index' feature, which pinpoints the exact location of any selected information for instant retrieval.
Portable, durable and affordable, BOOK is being hailed as a precursor of a new entertainment wave. BOOK's appeal seems so certain that thousands of content creators have committed to the platform and investors are reportedly flocking to the medium.
There are a thousand forms of subversion, but few can equal the convenience and immediacy of a cream pie -Noel Godin
I prefer to phrase it, "Thus Project Gutenberg has raced ahead at an amazing rate. In its 32nd year in existence, the collection has 6,267 etexts, averaging almost 200 etexts per year. That works out to about one book every other day. This is more impressive given that in the first twenty years of the project's existence the Internet didn't exist in anything like the form we take for granted today. The popularization of the Internet has just accelerated the rate at which Project Gutenberg grows. With the help of Distributed Proofreaders, a project that allows average people to donate small amounts of time to proofread just one page at a time, Project Gutenberg can expect to add over 400 etexts per year. Clearly Project Gutenberg is thriving."
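The arithmetic behind that phrasing is easy to check. A quick Python sketch, using only the figures quoted above (6,267 etexts, 32 years, a projected 400 per year) and an assumed 365-day year; the function names are invented for the example:

```python
def etexts_per_year(total_etexts, years):
    """Average number of etexts added per year."""
    return total_etexts / years


def days_per_etext(per_year, days_in_year=365):
    """Average gap in days between new etexts at a given yearly rate."""
    return days_in_year / per_year


historical = etexts_per_year(6267, 32)
print(round(historical, 1))                  # average etexts per year so far
print(round(days_per_etext(historical), 2))  # days between etexts so far
print(round(days_per_etext(400), 2))         # days between etexts at 400/year
```

The historical average comes out just under 196 per year, or one etext roughly every 1.9 days, which matches "about one book every other day". At 400 per year that becomes slightly more than one a day.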
Hah, try transcribing "Huckleberry Finn", or any Dr. Seuss
No. Boycott Dr. Seuss. His estate submitted an amicus brief in favor of the Bono Act. Now that Project Gutenberg uses distributed proofreading, the Bono Act is the biggest barrier to the growth of PG.
Will I retire or break 10K?
full grown, like Athena springing from the head of Zeus, this criticism is largely valid.
Patience, however, is a virtue. Libraries of public domain works *grow.* Every work added remains. Although it may take many years, even generations, as did the construction of the Giza plaza, over time the pyramid grows toward its apex, another pyramid joins it, a temple is added to the side, and so on.
That's part of the point of Project Gutenberg. Not just to provide an online library but to do so in an immutable manner that only grows over time.
Adding only *one page* to the project is valuable, and that addition remains and is added to by others.
Even brick and mortar libraries can take generations to build. A two hundred year plan only requires patience to complete.
That said, I'm going to take an even more contrarian point of view to the Wired article. The amazing thing I find about Project Gutenberg is how much is already in there. It's already at the point that I think few people could manage to read one half of the texts available in their lifetime, and finding a project to donate is complicated by the fact that the hardest part may not be performing the labor, but simply finding a project that interests you that *hasn't already been done.*
It's already a remarkable collection, and I've had to, on occasion, resort to it because my local library didn't have a lending copy of the work I wanted, but Project Gutenberg could give me free ownership of it.
KFG
Wired is just looking for content. Gutenberg is a lot better than nothing, and having 2627 texts in one easy-to-download place is good by me. Books like "A Young Girl's Diary" are fascinating, and very likely it is one of many texts in their collection which would be very difficult to find otherwise, or which would be completely forgotten.
Scanned documents might be fine for readers, but what if you're looking for "oh, you know, that one line in the book, where the dude was talking about melons."
A computer is NOT a glowing piece of paper with scrollbars.
That's basically what Distributed Proofreaders does. Except they OCR the book first, so the proofreaders just need to fix the OCR errors. Every page goes through two passes. Then the entire book goes into post-processing, where a single person puts all the pages together and checks for problems that the proofreaders didn't know how to solve (marked with an asterisk). Once Distributed Proofreaders finishes the book, they pass it on to Project Gutenberg, where somebody reviews the whole text again.
Distributed Proofreaders currently has a problem. After the previous Slashdot announcement, they were overwhelmed with volunteers. The volunteers processed books so fast that they were running out of material to work on. Three or four people scan in most of the books. They have been slaving away trying to keep up with the proofreaders.
Distributed Proofreaders is also working on a standard to mark up the books to better preserve tables, illustrations, bold text, math, etc. I suspect that effort is being slowed due to the priority of keeping material on the site.
The author makes a good observation, but misses the point afterwards. The Web is curiously devoid of primary subject matter. There are book reviews, but few books; movie reviews, but not the movies; music commentary but little music. It's a web of opinion, not knowledge.
But the problem isn't volunteers, it's litigation. Copyright law, DMCA, etc. The sources aren't there because the greedy owners won't allow them to be put there. The ebook-list over the last week has been publishing notes from various authors (real authors, not corporations like Disney) that read, "You'll get my copyrights when you pry them from my cold dead hands (and even then I'd like to leave them to my children!)."
If Project Gutenberg could publish modern texts, there would be an explosion of interest and activity, and a more or less immediate on-line library. But since it can only digitize books written before 1923, more or less, there's mainly interest from historians, English majors, and True Believers.
I and probably many others here, like to read Project Gutenberg books on my Palm/Pocket PC. Whenever I have a little down time I can get that out and choose from a dozen "classic" books to read. Can't do that when the "book" is a 800x600 image, and your screen can only do 320x320 (Sony Clies, Palm Tungsten), 320x240 (PocketPCs, Handera), or 160x160 (almost all Palm and Handspring PDAs).
Plain text, HTML, or XML are much more portable than compressed images. Which is at least partly why Gutenberg uses plain ASCII text; it's readable on literally anything with an alphanumeric display, and by all signs will be for decades, if not centuries or millennia. Good luck finding a GIF or BMP in 100 years, let alone formats nobody's even heard of. I have plenty of pictures I made only a few years ago on an Apple II that can't be read by anything, even when I get them off the 5.25" floppies. Yet I've read code and other things written on computers from the 70s and 80s. ASCII Just Doesn't Die.
.. when the font used is different from fonts it is programmed to recognise. I tried scanning a 40-year-old book - a drama script written in Indonesian - and the combination of unusual font *and* unrecognised language was enough to make the OCR software's output 50% rubbish.
Hmm, imagine scanning a 500-year-old book hand-written in Cyrillic... forgetting for one second the damage that scanning might do to the book in the first place.
Michel
Fedora Project Contribut
You can get a subscription for $12.
PLEASE, someone find something more interesting to submit!
Well, until it's free, there's always textfiles.com.
Actually, a while ago I copied a lot of the Project Gutenberg library, along with some others, and created etext.textfiles.com.
In my experience, the reason a lot of people don't donate free time to transcription or other similar drudge work is because a lot of sites that encourage it steal it. Witness CDDB, and just wait to see how long before you pay for IMDB.
Have you ever looked at the amount of material in Gutenberg's archives? When it comes to books and material written in English that are in the public domain, I have to say that Gutenberg offers almost everything of interest already.
The reason the Gutenberg project isn't hugely successful is not the lack of text. Part of it might be the lack of formatting. Nobody wants to read 600 pages of a classic work on a computer screen in ASCII. Some may be masochistic enough to do it if it was in HTML. Personally, I still prefer it in book form.
But even if it was properly formatted in several formats (including .pdfs in several sizes), it is still a lot of work to print it out and find a decent way to keep it together (no, ring binders aren't very appropriate for something you are going to read).
The main reason Gutenberg isn't successful is that it is not what people want. People don't want to read or print out old literature in the public domain. They either want a nice edition that looks good on the shelf, or a cheap paperback to carry around with them. And most likely they aren't particularly into really old books (with a possible few exceptions which the Gutenberg project has long since covered).
It's not like the work the Gutenberg people are doing isn't important, or isn't of good enough quality or anything else. The simple reason it's not hugely successful is that very few people are really interested. I'm sure much of the work the Gutenberg people have done will become important as soon as on-demand printing is more common and affordable.
I am.
I have read many of the PG texts on my Palm III using the Weasel reader. Easy to carry. Something to do anywhere. Waiting for an appointment or at an airport. I have gained enormously from their work.
I may have never read anything by Zane Grey without them.
Burroughs on the go.
There's no money in it.
If there was, someone would do it.
But there isn't, so hardly anyone tries.
Get it?
Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly).
Was thinking about voice recognition today while lamenting that I haven't done more to type in my copy of The Queen's Necklace by Alexandre Dumas, copyright 1910.
Here are two problems that came to mind for why I probably won't be able to use voice recog soon:
1.) Works that have been lucky enough to actually have their copyright lapse are often pretty old works. Their English (let's use English just b/c it's the lang I'm using) isn't exactly today's English, and sometimes even spellings, etc., change. Try reading anything from the 1800s and before.
2.) Names (so any protracted dialog) and other tough-to-translate stuff is going to be a pain to proofread. My book in particular has quite a bit of French in it (lots of "Parbleu" and French names with crazy accents all over the place).
I'd like to say voice recog could produce a "new version" with "updated spellings", but I just don't think that'd fly.
So once voice recog is commonplace for, say, office use (still quite a ways off) and affordable (not sure there, but I haven't heard of a friend using it yet, even just to play) we'll still have a ways to go before we can get true literature into PG simply by reading.
As an aside, at the same time I've been thinking about simply taping me reading the book and donating *that* via mp3 (or Ogg or whatever the heck). For the time being anyone who wants can listen in the car, and as soon as voice recog is up to snuff, voila. Just run it on my recording, proofread (easier said than done), and you're ready to go!
It's all 0s and 1s. Or it's not.
Sorry to hear the project's in trouble. Man, it sucks that big companies keep enforcing these frivolous patents.
If it helps in proving prior art, some guy invented something similar about 500 years ago, but I can't remember his name...
Actually I've found the most value from the project is downloading and reading classics. I've downloaded works by people such as: Adam Smith, Nietzsche, Aristotle, Plato, Karl Marx, Oscar Wilde, Thomas More, and various other classic writers. I've found this resource indispensable. It provides high quality texts for free. I probably wouldn't read many works by these authors if I had to purchase them. I unfortunately, don't have the money to spend on many small works such as these (they're short, but sometimes cost $10-15). I also don't have easy access to a library and I like keeping a copy for my own personal use.
So I find that Project Gutenberg is a very useful resource.
neurostar
I was about to say it but you said it better.. MOD IT UP FOLKS
and yes, the article sucks...
I love Project Gutenberg, and I've used and supported it since the pre-web days. However, I don't think they go far enough.
There are plenty of places on the net where one can find and download copyrighted works. Web sites, mail servers, IRC networks, and so on. I've used them extensively, myself. Many of the books I've downloaded, I own, and I got the electronic format for searching, reading on pocket devices, and so on. I think that this is fair - I've paid for the information once, and my sense of Fair Use tells me that it's okay to get these bits in this way.
I've also downloaded many, many books that I do not, nor will ever, own. (Some of these, I will probably never read.) Is this a copyright violation? Almost definitely. Is it ethically wrong? I don't think so. I would probably never buy a new copy of these works. If I hadn't downloaded them, I would have borrowed them from a friend, or a library, or bought a used copy, and sold it back later. None of these legal methods would have earned the author or publishers a cent. So, how are they different from downloading an electronic version? In my eyes, they are not.
I buy plenty of books - hundreds of dollars worth every year. I love to read. I support local authors, and independent publishers. I do not think my actions are criminal. If someone disagrees, tough. You won't stop me, or the legions of other electronic book traders. Ever. Sorry. If it helps, think of us as the "books" in Fahrenheit 451, keeping a distributed library available for public use, in the event that something terrible should happen someday. Eventually, one way or the other, copyright will go away, and the words will be truly free again.
(And anyway, I was just joking. I'd never knowingly violate copyright law. What am I, stupid?)
is going great and my thanks to those involved.
Well as I pointed out DjVu is a good format, and it is indexable. I recommend looking at the examples posted on the site. There are browser plugins for all the platforms. I've even recommended it to my local library.
Even though the project did not take off as expected, one has to realise that copyright problems are prevalent. The laws are different in different countries and the enforcement is equally varied. So in this setting, I think that Gutenberg has done a decent job. If only copyright laws were less strict, maybe more people would be interested in converting texts.
Rescuing Steve Gutenberg's career requires more than just planning. Hell, a generous donation from Bill Gates probably couldn't even do it.
Here's a prediction. My next issue of WIRED will be filled with interesting articles. I'll read the whole thing, then two weeks later, half the stories in the magazine will be submitted as /. stories.
Some month, I'm gonna go through story by story and submit the whole damn magazine the day I get it.
anything i tell you will cloud your opinion.
On the other hand, if you need to input a 500-year old work in Cyrillic, it just might be worth doing it by hand, or hiring a Russian typist to input it if you have a bunch of hot dates that week or something. After all, this hypothetical Cyrillic book must be pretty important, huh?
Freedom: "I won't!"
Whoops my mistake - that link is to the wrong bookcity - doh! I meant bookcity in Toronto, Canada.
There are a thousand forms of subversion, but few can equal the convenience and immediacy of a cream pie -Noel Godin
This distributed proofreading group looks like they might have the answer for helping PG get closer to being 'there'. Having people proofread one page at a time comparing the ocr'd text to the original scan is an excellent idea for speeding up the proofreading process as well as improving the quality.
My wife heard about Project Gutenberg a couple of years ago and thought of OCRing and editing an English translation of Machiavelli's 1518 Italian play La Mandragola. She briefly corresponded with PG Executive Director Michael Hart, who was extremely kind and helpful. Had that been all there was to getting involved, she certainly would have put in the weekend or less of work the project required. But to avoid copyright issues with a translation that might not be public domain, Hart asked my wife to snail-mail a photocopy of the title page or copyright page of her chosen translation, so that PG could legally verify the work's availability.
Fair enough. But we were flakes, the library was waaay downtown, her work deadlines loomed.... She let the idea fade. I wonder how many other volunteers lose interest in the same way? By the way, Gutenberg still doesn't show a text of Mandragola.
Why isn't a project like this tax funded? It would be trivial for Congress to put aside a million or two to pay some schlubs to sit around doing data entry all day. Heck, create a department to do it. Almost all brick 'n mortar libraries are tax funded, so why shouldn't a public electronic library be tax funded? You could (theoretically) crank up production of the conversions to save even more rare works, on top of the fact that ideally the project could work directly with major libraries around the USA, or even the world. Of course, realistically such a project would turn into some bureaucracy that gets barely more done than the volunteer version, but it would at least look like someone cares.
Really, information is the most important thing humanity has, and the people literally "Saving" the world are doing it on their free time.
I just found this site a few days ago. Essentially, volunteers can proofread one page at a time, so that huge time commitments of doing an entire book yourself are not required. Worth checking out.
http://texts01.archive.org/dp/
They obviously publish articles written by people with their head up their asses.
Honestly, just what is Mr. J. Bradford DeLong thinking? To characterize Project Gutenberg as a failure is just imbecilic. From PG's own pages, 203 ebooks were released in October 2002. 1975 new books in 2002 (1240 in 2001). It's a lot of work to produce even one book, and PG is churning them out at a pretty good clip for an entirely volunteer effort.
Even as it is, I've found PG to be pretty damned useful. It's kind of nice to be able to grep the collected works of Shakespeare. Or Darwin. Or Conan Doyle. Or H. G. Wells. Or Jules Verne. Or Charles Dickens. Or L. Frank Baum.
Despite advances in technology, scanning, OCRing and proofreading books remains a very labor intensive process, and it is a boring, often thankless process as well. The Million Book project wants to take a somewhat different approach to providing digital books: they actually scan the books and store them in DJVU format (a very nice format similar to PDF). They can do OCR on it to provide searchable text, but such text doesn't have to be 100% accurate to be effective. Most of the time you print and read the original scans. After all, some publisher went to the trouble of carefully typesetting the book and proofreading it once, why bother to do it all again?
I first became aware of this project and technology when I met Brewster Kahle as he drove the Internet Bookmobile around the U.S., going to libraries and schools trying to drum up interest in Eldred vs. Ashcroft. A compressed version of Alice in Wonderland in DJVU format is about 5 megabytes (the same as a single MP3) including the illustrations and fancy typesetting. He could print and bind a copy of it for about $2 in materials, on demand using an HP laser printer out of the back of the mobile. The binding isn't amazing, but consider the possibility of having literally any book in any small town library in any place in the world. It's an exciting idea, and one that technology is only making easier and cheaper. You can get a decent scanner for $100 (even one small enough to hook to a laptop and take to a library). You can scan a book in an evening. And after you do, the file can be converted to a simple, easy to use format that everyone can use. Forever. One evening. One person. One book.
Despite the setback of Eldred v. Ashcroft, more and more books are going to be made available by the true philanthropists of the world: the volunteers who give something of their own time to make the world a better place. I wonder what Mr. DeLong has done to make the world a better place...
There is much pleasure to be gained in useless knowledge.
That's all for now. Thanks to all the supportive comments in this thread, and to all the constructive criticism. And remember, a page a day is all it takes to contribute!
Greg Newby, Director and CEO
The Project Gutenberg Literary Archive Foundation
www.gutenberg.net
Anyone with a quick scanner and a bit of good software could make book pages into formatted text at the rate of 10ppm or more. The question is, are there many good programs out there for doing that?
Repeal the DMCA!
In fact ASCII text can even be human translated (although not really human read) if all you have is the *binary*.
The poster to whom you reply seems to have missed the essential point.
I would give you one caveat though. English may well be the language of the internet (and I'll leave the argument as to whether that's a good or bad thing to the students), but it isn't the language of *literature.*
It would certainly be a Good Thing to be able to store the Vedas and Sun-Tzu, in the original script, at the lowest possible human readable electronic form.
This, however, as you note, will apparently have to wait for some future time.
KFG
full grown, like Athena springing from the head of Zeus
What about "full grown"?
Subject does not mean "first half of your first sentence." I usually skim over the subjects because they're mixed in with meaningless stuff (poster, date, etc). Keep that in mind, be nice to people when you want them to read what you've written. (and if you don't, why post?)
That's about six articles from the most recent Wired that have been covered on /. Hey - if we're that interested we can always go buy a copy. Is /. that hard up for articles....?
Rich people are eccentric. Poor people are strange. Me, I'd be happy with odd.
"it cripples it for creating PDFs, TeX files for printing"
You've seemed to go completely doofey here.
Wanting to produce printable documents from an ASCII terminal was kinda the reason Knuth invented TeX. N'est-ce pas?
If I wanted to use TeX to print Walden the very first step to take would be . . . what?
Firing up vim. That's right.
Now here I'll quote from Adobe's pdf page:
Adobe® Portable Document Format (PDF) is the open de facto standard for electronic document distribution worldwide. Adobe PDF is a universal file format that preserves all the fonts, formatting, graphics, and color of any source document, *regardless of the application and platform used to create it.*
Yes, I added the emphasis myself.
I'll refrain from mentioning how weird it would be to produce a pdf document from ASCII text though, since ASCII already perfectly duplicates ASCII, and in a nonproprietary format and a smaller file size.
Instead I'll simply point out that to convert an ASCII file into pdf one would *first* format it into the finished product of your choice and then convert *that* to pdf.
Why one would want to distribute Walden as a pdf file I'll leave as an exercise for the student (mostly because I'd be interested to see the answer myself. It beats the hell out of me. Maybe you're a font Nazi and don't believe in letting the reader use a font that *they* find pleasant to read?).
Your SGML comment is doofey beyond comprehension. SGML was developed at IBM as an ASCII markup language. HTML and XML are both interpretations of the SGML standard. The *point* of SGML is to take the plain text and create a document. I do it all the frikkin' time. So do millions of others. In vim.
You can find the author's recollection of its development here:
http://www.sgmlsource.com/history/roots.htm
I also use ASCII to write in more than one language. It's true that I don't write Chinese ideograms in it, but how one would go about it is trivial and obvious, although one *would* need an interpreting display layer, such as SGML/HTML/XML, where the trivial and obvious work has already been done, although not to everyone's satisfaction, to make it conveniently human readable.
Forgive me if this post seems a bit bluff, but I'm truly baffled by your post.
KFG
Everyone's into the technical difficulties, but, hey, think big: Bringing most classical works online would be one of the greatest achievements of the internet. It totally dwarfs technical difficulties and another 50 megabucks on the expense account of the world community. Makes me wonder if we're sometimes too hung up on the medium to remember the message...
Perhaps I could have written "Fire!" in the subject box. That would have been attention getting, although false. "My comment on the story" would have been factual, but pointless.
Sometimes "Subject" means "Write something here to provide the reader a reason for proceeding".
My approach seems to have worked.
If, in future, you wish to skip my posts, go ahead. I won't be offended. To each his own.
KFG
It might help to actually understand what you are talking about before you are so quick to dismiss it. DJVU does support searchable text, which can be inserted automatically via OCR. The advantage of this is that the OCR need not be 100% accurate to still be useful (vastly more useful and accurate than the indices in most books, for instance).
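To illustrate why imperfect OCR can still be searched, here is a rough sketch that matches a query word against OCR'd words by similarity rather than exact equality. It uses Python's standard difflib module. This is not a description of how DjVu viewers actually search; the function name `fuzzy_find`, the 0.8 cutoff, and the sample text (with "m" misread as "rn") are all made up for the example.

```python
import difflib
import re


def fuzzy_find(query: str, ocr_text: str, cutoff: float = 0.8) -> list[str]:
    """Return OCR'd words that are close to the query, best match first."""
    # Lowercase and split into words so case and punctuation do not matter.
    words = sorted(set(re.findall(r"[a-z0-9]+", ocr_text.lower())))
    # cutoff is a similarity ratio between 0 and 1 (an assumed value here).
    return difflib.get_close_matches(query.lower(), words, n=5, cutoff=cutoff)


# Hypothetical OCR output where "m" was misread as "rn".
sample = "The Anatomy of Rnelancholy is a bear to proof"
print(fuzzy_find("melancholy", sample))
```

An exact-match search for "melancholy" would miss this page, while the similarity search still finds it. The price is the occasional false hit, which is usually acceptable for a search index.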
There is much pleasure to be gained in useless knowledge.
...humanity wrote some ok books in its first 3000 years (-ish) of literacy. The Koran, the Bible, Shakespeare... yeah there's some ok books out there not covered by the stupid copyright situation we are now in. Hopefully Gutenberg can bring some pressure on the ridiculous copyright fiasco, but in the meanwhile, there's a whole store of amazing works of learning and literature out there.
Why project Gutenberg won't succeed?
Two reasons: (one) their "universally readable" text format sucks mud, and (two) the US Government, eh, I mean Disney decided to extend copyright duration beyond any reasonable length, so no recent texts are available.
Harnessing free labor is easy enough: just stop by in alt.binaries.e-book on usenet.
Realize that for many people this is /not/ a warez group: a lot of the regulars there just want e-versions of books they already own so they can read them on their PDA's or computers, and would be willing to pay for them if they were DRM-free and the price was decent. "Decent" means not like the Star Trek e-books for example, which cost more than the hardcover edition (which is probably the primary reason why they're DRMed).
I think Gutenberg is very much there... Have you ever looked at the amount of material in Gutenberg's archives? When it comes to books and material written in English that are in the public domain, I have to say that Gutenberg offers almost everything of interest already.
The 'vision' that the author of the Wired article had was somewhat different: To be able to access all texts electronically. Something that everybody who had to hunt down old magazine articles has dreamt of (I still have nightmares from that one dark and dusty university library cellar, *shudder*). While Gutenberg is a great project, to come closer to full availability of all texts via electronic media, there will have to be initiative from governmental organizations as well as commercial entities. Obviously, not all texts will be available for free. But even a somewhat unified way of searching and finding these texts will be a huge task.
There is CiteSeer for articles on computer science, there is IEEExplore if you happen to be looking for something from IEEE. But you have to know these places. Even with better search engines like Google it's still quite a task to get your hands on a text, even if you have some time to do the search and are willing to spend money.
A large database of text references (maybe including abstracts) would also be nice to just see what's available while you are still doing research.
The reason the Gutenberg project isn't hugely successful is not the lack of text. Part of it might be the lack of formatting. Nobody wants to read 600 pages of a classic work on a computer screen in ASCII.
GutenMark does that (almost) automatically. Uses LaTeX.
I'm not going to hunt this down, but I will point out that standard, easy-to-understand speech is about 150-200 wpm. That *handily* outstrips all but the most blindingly fast of typists.
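To put rough numbers on that, here is a back-of-the-envelope sketch. Only the 150-200 wpm speech range comes from the comment above; the 100,000-word book length and the 60 wpm typist are assumptions picked for illustration, and correction time is ignored on both sides.

```python
def hours_to_input(word_count: int, words_per_minute: float) -> float:
    """Hours needed to enter word_count words at a steady rate."""
    return word_count / words_per_minute / 60


BOOK_WORDS = 100_000     # assumed book length
TYPIST_WPM = 60          # assumed typing speed
SPEECH_WPM = (150, 200)  # range quoted in the comment above

typing_hours = hours_to_input(BOOK_WORDS, TYPIST_WPM)
speech_hours = [hours_to_input(BOOK_WORDS, wpm) for wpm in SPEECH_WPM]

print(f"typing: {typing_hours:.1f} h")
print(f"speech: {speech_hours[1]:.1f} to {speech_hours[0]:.1f} h")
```

Under these assumptions dictation takes roughly a third of the time, before counting whatever cleanup the speech recognizer's mistakes would need.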
blog
Another side benefit of good old ASCII - text to speech! Or braille displays! Heck, you can read it on any device, changing it to any resolution you want quickly and easily.
PG doesn't restrict itself to the written word and its works include MIDI and MP3 files.
Perhaps I should have been more exact and explicit in my original statements. PG *tries* to provide whatever coding method results in the lowest level human understandable output, preferably in a nonproprietary format.
Obviously for Kanji or Vedic Sanskrit (or recorded music) this is not plain ASCII (by which I actually mean extended ASCII, not "teletype" ASCII).
KFG
He knocked up a program to convert PG etexts into LaTeX. It's not difficult to do and get something that looks quite good. I'm sure I could write something similar in a day or two, if there isn't something on freshmeat already.
Any sufficiently advanced technology is indistinguishable from a rigged demo
--Andy Finkel (J. Klass?)
Just going out on a limb here... Do you guys think it is possible to scan the books and store them somewhere so that an open source client such as Seti@Home's can work on the pages?
I guess this would tend to deal with the most expensive part of the process IMO, the typing. Of course, storage and scanning of the pages is still an issue.
Just my 2 cents...
Bullshit,
Don't you remember in English class having to go around the room reading? Do you remember some of those retards in your class?
Ro ro ro ro Romeo, Romeo, wa wa wa were ffffor art thou Ro ro ro Romeo.
Oh yeah, that ought to be mu mu much faster
WTF? Over?
Mostly relatively painless anyway - I've spent some time working on the "Anatomy of Melancholy" which is a bear to do. Many English texts I've proofed here I can proof at the rate of a few minutes per page. "Melancholy" is more like a half hour per page.
Most of the works are nowhere near that bad though and this is a good way to make all that cool (or not so cool) stuff available and usable electronically.
This is one of the first apps I loaded on my Zaurus.
Gutenberg rocks! I have to disagree with this post. There are many great works available for download. Moby Dick was taken down.. I don't know why. But other than that, it's definitely worth a compile and try! Reference books are available too!
How do people format the ASCII texts? That is, if you don't just open the .txt file in an editor, but instead mark it up as HTML (or whatever) to improve its readability, how do you do this? Got any scripts or filters to share? (If so, maybe the PG folks can post them to their web site.)
After all, the ASCII should just be a starting point -- you take that, add a little layout, and have yourself as pretty a book as you'd like to read onscreen or print out.
I have tried this in the past, adding sparse HTML tags to, say, a Willa Cather book, but it was too distracting to read while I marked up, and just too dull to mark up an entire novel. That's why I think borrowing a script or filter would be cool.
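For what it's worth, a filter like that can be quite small. The following is a minimal sketch, not anything Project Gutenberg distributes: `etext_to_html` is a made-up name, and it assumes paragraphs are separated by blank lines, `_word_` marks emphasis, and a one-line block starting with "CHAPTER" is a heading. Real etexts vary, so treat these heuristics as a starting point.

```python
import html
import re


def etext_to_html(text: str, title: str = "Untitled") -> str:
    """Wrap a plain-text etext in sparse HTML (headings, paragraphs, emphasis)."""
    body = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [line.strip() for line in block.splitlines()]
        # Rejoin hard-wrapped lines and escape &, < and > for HTML.
        para = html.escape(" ".join(lines))
        # Assumption: _word_ marks emphasis in the source text.
        para = re.sub(r"_([^_]+)_", r"<em>\1</em>", para)
        # Heuristic: a single line starting with "CHAPTER" is a heading.
        if len(lines) == 1 and lines[0].upper().startswith("CHAPTER"):
            body.append(f"<h2>{para}</h2>")
        else:
            body.append(f"<p>{para}</p>")
    return (
        "<html><head><title>" + html.escape(title) + "</title></head>\n<body>\n"
        + "\n".join(body)
        + "\n</body></html>"
    )


sample = "CHAPTER I\n\nIt was a _very_ long\nwinter on the prairie.\n\nTom & Jim waited."
print(etext_to_html(sample, "Sample"))
```

Because the markup is generated rather than typed, you can read the book first and tweak the heuristics afterwards instead of tagging a whole novel by hand.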
P.G. is a worldwide project, not only North American. I'm living in France at the moment, and have a pile of old books that contain ancient French, Latin and ancient Greek. Transcribing the characters to ASCII would be absurd. It would be a huge loss of information.
Secondly, you seem to say that basic formatting, like that described in the document guidelines, is good enough. That brings up two problems:
Because this is digital media, you do not want to use formatting, you want to use semantic markup: the reader could be blind, or deaf, or using a PDA etc. Formatting is static, semantic markup can be reinterpreted again and again to best suit the reader. This is where the W3C is going.
The idea is not to make some alphabet soup that can be used to create formatted text. The idea is to provide books that are directly usable and readable. The document guidelines only specify one level of chapter heading, for example. Why?
I was getting all excited about contributing to this project, but the current guidelines are just too weak. Using Unicode-encoded XML documents with a specific DTD (or Schema) seems to be a good solution, but I'm no specialist.
It's not the books, it's not the format... it's the fact that 99% of the people don't have a good way to read it. PCs suck. Laptops suck. PDAs & tablets maybe, certainly not the old ones I've used.
However, having Project Gutenberg there is still good, even if no one actually reads it from there. Why? Because when you want to buy a reprint (yes I still like dead-tree version), there's no room for extravagant mark-up. If it's too big it's very simple for a publisher to take the Gutenberg text, format it to a book and sell it for less.
For any of you having studied economics, it acts like an undifferentiated Bertrand duopoly - price is pushed down to cost (in theory). It's a solution every business man hates, all the surplus value is given to the consumers, nothing to the company.
That's why I'm proofreading on DP (top 1000, but not exactly devoting my life to it), even though I never have and probably never will read a book directly from PG. Unless there's a miracle cheap electronic paper break-through or something at least...
Kjella
Live today, because you never know what tomorrow brings
So, at present Australians can get up to the beginning of 1953. Seems a hell of a lot easier to follow than the mess of dates the parent posted.
Not quite.
Up to 50 years after the end of the year of the author's death
i.e - they can get stuff up to the end of 1952, assuming the author also died that year.
I wonder though. What if they wrote something in 1951, died in 1952, but it was only discovered (and published) in 1973. What applies?
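The basic rule as quoted above is simple enough to write down. This sketch covers only that quoted rule (protection until the end of the 50th year after the year of death); the function names are made up, and it deliberately does not try to answer the posthumous-publication question, which the rule as stated doesn't settle.

```python
def public_domain_from(death_year: int, term_years: int = 50) -> int:
    """First calendar year in which the author's works are out of copyright."""
    # Protection lasts until the end of the year (death_year + term_years),
    # so the works become free on 1 January of the following year.
    return death_year + term_years + 1


def is_public_domain(death_year: int, current_year: int, term_years: int = 50) -> bool:
    """True if, in current_year, the term counted from death_year has expired."""
    return current_year >= public_domain_from(death_year, term_years)


print(public_domain_from(1952))      # author died in 1952
print(is_public_domain(1952, 2003))  # checked during 2003
print(is_public_domain(1953, 2003))  # died one year later
```

So under this rule what matters is the year the author died, not the year a given work was written.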
I'll point out that at the end of 2000, there were only roughly 2000 etexts in the entire PG library (I copied them all to a single CD)... So if they're up well over 6,000, then they've made amazing progress in two years!
...and they made their PDF format completely inaccessible to many types of disabled people. Since it's bad business to have sites offer a PDF version and then an alternate 'accessible' version of things, they're correcting the situation. Similarly, a scanned image is impossible for a screen reader to comprehend or for a text editor to search.
How many times have you read something you've never seen before to someone else with 100% accuracy? With no "ums" or "uhs"? With no corrections at all? You'd still have to go back and correct the transcription, because it's not going to be 100% accurate to what you said anyway.
This is an excellent set of "copyright guidelines" to forward to your local congressman. Highlight the dates! 2074? Most of your congressmen will be long dead by this time, so really why are they being corporate whores to the point of making copyrights practically indefinite? Tell your congressman why it's good to put works in the public domain. Tell him why he should support shortening copyright periods to something reasonable! Tell him to stop whoring to the companies and start representing the people of his district! Corporations are just in it for the money: corporations are some fake "entity" which is basically just human greed embodied. Help people, not greed!
If you want anyone to see this, you should get an account and log in. I would wager that most people don't read stuff at moderation level 0, where your posts are because they are anonymous.
Liberty uber alles.
..I think posting like that is fun, and reading a well done one is fun. Although I do like it when they put the continuation ... at the front so I know I was supposed to read the subject.
Liberty uber alles.
The Constitution does no such thing. Read the damned thing before you comment. The Copyright Clause requires copyright terms to have "limited times". A term that can be repeatedly extended (retroactively) at the whim of Congress is not limited.
The issue here is: does the Supreme Court have the right to enforce this sort of Constitutional limitation on Congress, or does that responsibility fall to Congress itself? The traditional answer is "Congress can do whatever it wants and the Court must restrain itself". In 1995, Justices Rehnquist, Scalia, Thomas, O'Connor and Kennedy turned this understanding on its head and declared that they have the power to enforce the restrictions on enumerated powers (the right to grant copyrights is one of these).
The plaintiffs in Eldred asked the Court to apply this plain-stated logic with respect to the copyright clause, and the Court's response was... well, nothing. They didn't even bother to explain why they would do it in some circumstances and not in this one.
Eldred & Co had a strong Constitutional argument for limiting Congressional power if the Court was willing to obey its own precedent. This Court just didn't feel the need to do so, or even to explain why this case was different.
About 10 years ago, I volunteered to do some work for Project Gutenberg. The way it worked then (and I'm pretty sure it'd be the way it still works) is that they would OCR a particular edition of a particular book. Then, they would get volunteers (more than one per book) to read through the OCR-ed version alongside the actual printed edition from which it was OCR-ed to validate that the OCR didn't make any mistakes.
This is a very dull, volunteer-intensive task for even interesting books.
Why are we wasting our time bitching about this?
At the very least, PG is better than nothing at all. Free books is free books.
"Project Gutenberg is in the cross hairs of J. Bradford DeLong, a Berkeley professor and Wired Magazine contributor, who accuses PG of failing to 'achieve any form of critical mass.' I'll get to Gutenberg in time. But first a few words on the DeLong column and then plenty more on his former employer, the Clinton Administration..."
Read the rest at http://www.teleread.org/blog/index.html.
The provisions in the Constitution were written to address the problems the colonies were having with trade secrets. The colonies, and then what became the USA, were forced to go back to Britain for all sorts of machines and goods because no one could easily produce them here.
New England became a manufacturing center for hundreds of years because its residents became good at "reinventing the wheel", or "reverse engineering", the kinds of things that the USA accuses the Far East of doing.
To foster the process of duplicating Britain's and Europe's manufacturing technology here, the patent system was created. For a short period of time you got exclusive rights AS LONG AS YOU TOLD EVERYONE YOUR SECRET AS SOON AS YOU THOUGHT IT UP.
Copyright was given similar status as part of a program to ensure that ideas flowed freely - the best way to protect your ideas was to publish them, rather than just "speak" them.
At no time was it intended to create a welfare program where you could work for a week or year and then live off that for the rest of your life. The idea was that you could think up something and tell the world and benefit from that just as much as working cutting down a tree and selling it or raising and butchering a cow.
In one regard, your exclusive rights to an idea should be no longer than it took you to produce it. If you watched an event and wrote up an article on it, then your exclusive copyright should be a day. But a couple hundred years ago, what was required was observing, writing, setting the type, running the press, shipping the copies around the world, and so on, so the process of making money from a new event might take a year. Books took even more time.
As the issues were complex, Congress was given the job of figuring out the tradeoffs.
The situation today is the opposite of what it was several hundred years ago. The intellectual giants and the influential politicians are saying "secrecy is good", "exclusivity is good", "restriction in the flow of ideas is good", "prevention of reverse engineering is good", "unfettered innovation and creativity is EVIL".
I'd like just one explanation of how extending patents or copyrights will make you more creative than you are?
I'd like one explanation of why you think that you should have the right to restrict the free flow of ideas for an indefinite period of time?
If you can convince me that it takes so much work to write something that it will take you decades to make the money back, then you have to explain how it is that people write things with absolutely no expectation of getting any monetary value from it, and why you are so unique that you deserve special treatment so that we might be blessed with your special ideas?
--Get them to volunteer for PG, and give them added credit for their English classes.
== WolfriderV6 == I'm willing to admit that *I just might* be wrong... Are you??