Why Project Gutenberg Isn't There Yet
option8 writes "This Wired article ('Any Text. Anytime. Anywhere. (Any Volunteers?)') goes into good detail on why Project Gutenberg, and similar efforts, are far from creating a complete, free electronic library. A quote: 'The mechanics of a universal library are simple. The tricky part: harnessing the free labor.' Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly)."
That's crazy. OCR will always be faster than speech, even if speech recognition ever works, which it currently does not.
I'm not too informed about this topic; feel free to correct me.
If the goal is a universal library, and there is a need for a work force, wouldn't a program initiated at the library level to utilize librarians as a volunteer work force make sense, perhaps as a side project they might be interested in helping along? I think of it as SETI in the library world.. *shrug*
It would make life so much easier if I could search an online library rather than searching the library index. Just think how much space we could save as well, rather than having shelves full of books that are basically dead weight 95% of the time.
I think copyright has got to be the biggest hurdle; publishing houses aren't easily going to be persuaded to put, oh say, the next Harry Potter book online for free and risk losing millions.
So the point of this post is: why not ask publishers for the material? If it's already public domain, it's not like they'll lose profits, and maybe Project Gutenberg could let them put a little
kind of thing at the top of each book they donate. Plus, maybe it's a tax write-off. I don't know. That said, I'd think it'd be much easier to just type things in than OCR it or use Speech-To-Text.
Comment forecast: Bits of genius surrounded by a sea of mediocrity.
I'd just like to point out that this is the third story from Wired to show up on Slashdot today. And it's not even that bad of a story. I think this must mean Wired is cool again.
In any case, the real obstacle to a useful electronic library isn't labor. It's copyright.
Apparently the author of the article missed Distributed Proofreaders. They seem to have survived their Slashdotting and actually retained a good fraction of their new users. This month they've proofed 116,827 pages! (Cut that in half for unique pages, I think) They have completed in their 2(?) years of existence 918 books, and have another 317 being assembled. It really seems like they are only limited by what they can get their hands on in the public domain.
I cannot believe that Project Gutenberg continues to use plain text as their source code! I can see why it would have been compelling in 1971, and it still may be true that there are systems out there that can only read 7-bit ASCII.
But that's absolutely no reason why the source shouldn't be marked up. Marked up source can always be converted to ASCII, but you cannot derive semantic markup from ASCII.
The best and cheapest way to get existing books on the web is to scan them and compress the images. Compression technology for text images is so good (see DjVu), and storage so cheap nowadays that you are better off just distributing high resolution scans.
This is a much more efficient way to make books available on the web, much more efficient than having volunteers painstakingly transcribe the text or correcting OCR mistakes.
OCR can be used for indexing scanned documents, but there is no need to do manual correction. DjVu can compress 300dpi black and white pages of text to 5-25KB. That's less than most HTML pages, and the images look just like the original book.
The Million Book Project at the Internet Archive uses DjVu (as well as other formats).
The open source implementation of DjVu is available on SourceForge.
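To put the 5-25KB per page figure in perspective, here is a quick back-of-the-envelope sketch. The 300-page book length is an assumption picked for illustration, not a number from the DjVu documentation.

```python
# Rough storage estimate for a scanned book, using the 5-25 KB per page
# range quoted above. The page count is an assumed example value.

def book_size_kb(pages, kb_per_page):
    """Return the total size in KB for a book of `pages` scanned pages."""
    return pages * kb_per_page

PAGES = 300  # assumption: length of a typical book

low = book_size_kb(PAGES, 5)
high = book_size_kb(PAGES, 25)
print(f"{PAGES} pages: {low / 1000:.1f} to {high / 1000:.1f} MB")
```

Under that assumption a whole book of plain text pages lands somewhere in the low single-digit megabytes.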
It seems paradoxical, but there it is. I spend a huge amount of time glued to the screen, reading articles, blogs, forums, FAQs, HOW-TOs, etc. But I don't like it, in fact I find it aggravating.
I am lured and lulled by the vast amount of easy information suitably tailored to my interests, all with an easy-to-use, intuitive, associational (read: hypertextual) interface. But it is tiring staring at a flickering, glaring screen for hours; my eyes get dry, and I strain and get tired picking out fuzzy objects when I try to focus at a distance. It's nasty and annoying.
Here is my point about this project. Nobody wants to read books on their computers. Well maybe some do, but I think the vast majority don't. Paper books are easily available and cheap. If you can't find the one you want in a local library or bookstore there are a multitude of ways of ordering them. You don't get tired looking at them, they are actually enjoyable. So why should there be a desire amongst the majority for e-books?
Don't get me wrong, I think it's a good idea, but not one that I, nor I think the majority, will go in for until a better way of presenting them is developed. LCDs are an improvement, but they are still shabby. I don't think a project like this is going to see much public interest until some better presentation medium is found. E-paper will be needed before the e-book becomes a reality for most people: some kind of little book-sized unit that you can hold and which will display on a matte, non-glaring, non-luminous surface.
There are a thousand forms of subversion, but few can equal the convenience and immediacy of a cream pie -Noel Godin
There seems to be an interesting recurring theme in human history - we constantly strive to build libraries but we have never yet built one that is quite "good enough".
The Great Library in Alexandria was a wonder of the ancient world until it got burned down as part of a domestic dispute between Mark Antony and Cleopatra. I was amused to note that the local University recently received funding approval to rebuild it - grants committees move slowly.
In mediaeval times, monks were the guardians of knowledge and the various monasteries dotted around Europe were oases of learning and knowledge in those times. Knowledge was restricted to the few.
The original Gutenberg made it possible to create huge volumes (literally) of knowledge and disseminate it on a wide scale. Ever since, people in power have sought to control this technology - either through censorship, copyright, or even education (you have to be able to read before a book is of greatest use to you.)
In Victorian England, the mark of a scholarly gentleman was in the breadth of works he maintained in his private library.
Perhaps a new initiative might be Gutenberg@Home whereby any reader made an electronic copy of physical works by some convenient, nondestructive means. By keeping such a personal library private, one would not have to worry about copyright laws, even as currently framed.
How much of what is holding us back from building the perfect library is simply our insistence on monetary-related restrictions? How long will it take us to realize that lengthy (in time) and complex or intensive (in resources consumed) PHYSICAL processes are the only ones to which we need to attach a value, that whatever happens in the electronic world should be free, and that the collation, assembly, verification, dissemination and application of the sum of human knowledge is one of the most important things that we could achieve?
STF
Storing books online is one thing. Gutenberg also needs readers to be successful. How many readers are willing to read .txt or .pdf files instead of printed material?
Several times I downloaded Gutenberg books with the intention of reading them on a laptop or screen later on. It turns out this is too inconvenient compared to paper print. ...
If only electronic paper would be at 1c a page
Why not modify that in such a way as to have available a scanned image of a single page of the book, along with an empty box to enter text? That way, people could work on ONE page at a time, while others work on other pages. A single book could be typed in by 547 different people, each typing up one page.
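A minimal sketch of that page-at-a-time idea, just to show the bookkeeping involved. The function names and the round-robin hand-out are my own assumptions for illustration; this is not how any existing project is known to do it.

```python
def assign_pages(num_pages, volunteers):
    """Hand out page numbers 1..num_pages to volunteers in round-robin order."""
    assignments = {name: [] for name in volunteers}
    for page in range(1, num_pages + 1):
        name = volunteers[(page - 1) % len(volunteers)]
        assignments[name].append(page)
    return assignments


def assemble(transcriptions):
    """Join typed pages back into one text, ordered by page number.

    `transcriptions` maps page number -> the text a volunteer typed for it.
    """
    return "\n".join(transcriptions[page] for page in sorted(transcriptions))


# Hypothetical volunteers; pages can come back in any order.
work = assign_pages(5, ["alice", "bob"])
typed = {2: "page two", 1: "page one", 3: "page three"}
print(work)
print(assemble(typed))
```

With as many volunteers as pages, each person gets exactly one page, which is the 547-typists case described above.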
/usr/games/fortune
Floppy disks get magnetized, hard drives crash, optical disks get scratched...A book can take a beating, man. All the OCR and voice rec in the world won't change this until we can get widespread, cheap cartridged optical media.
I think this take on media longevity also prevents progress WRT Project Gutenberg. Too many people don't see the point, when they can have the Library of Congress backed up on disk one day but be looking at a screen full of garbage characters the next because someone accidentally yanked the power supply on the server or whathaveyou.
A single $5 paperback book can be propagated more reliably than tens of thousands of dollars worth of networks and storage, although the latter system can admittedly hold a whole library's worth of that single book. But think about the infrastructure required to maintain the latter system. Until we have better media, the costs aren't justifiable, IMHO. It's an idea whose time has not yet come.
No you can't, unless you're impaired in some way.
Average speaking rate (in English) is 100-180 wpm. The world's fastest typist hit 212 wpm on a Dvorak keyboard. See also this
I took a quickie online typing test, one pass, 60 seconds, and here's my score. I'm a decent typist (better when coding). What's your score?
Percentage Accuracy : 100%
Percentage Inaccuracy : 0.8333333333333334%
Characters per minute : 360 cpm
Characters per second : 6 cps
Words per minute : 67 wpm
Words per second : 1 wps
Total Speed status : Too Good
Overall Accuracy : Absolutely Spot on
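For a rough feel of what those rates mean for a whole book, here is a small sketch. The 100,000-word book length is an assumption for illustration, and the calculation ignores pauses, proofreading and recognition errors entirely.

```python
def hours_to_transcribe(word_count, words_per_minute):
    """Return the hours needed to get through `word_count` words at a steady rate."""
    return word_count / words_per_minute / 60


BOOK_WORDS = 100_000  # assumption: length of a full-size novel

# Rates taken from the figures quoted above.
rates = {
    "slow speech": 100,
    "fast speech": 180,
    "record typist": 212,
    "typing test above": 67,
}

for label, wpm in rates.items():
    print(f"{label:>18}: {hours_to_transcribe(BOOK_WORDS, wpm):.1f} h")
```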
.. when the font used is different from fonts it is programmed to recognise. I tried scanning a 40-year-old book - a drama script written in Indonesian - and the combination of unusual font *and* unrecognised language was enough to make the OCR software's output 50% rubbish.
Hmm, imagine scanning a 500-year-old book hand-written in Cyrillic... forgetting for one second the damage that scanning might do to the book in the first place.
Michel
Fedora Project Contribut
Have you ever looked at the amount of material in Gutenberg's archives? When it comes to books and material written in English that are in the public domain, I have to say that Gutenberg offers almost everything of interest already.
The reason the Gutenberg project isn't hugely successful is not the lack of text. Part of it might be the lack of formatting. Nobody wants to read 600 pages of a classic work on a computer screen in ASCII. Some may be masochistic enough to do it if it were in HTML. Personally, I still prefer it in book form.
But even if it were properly formatted in several formats (including PDFs in several sizes), it is still a lot of work to print it out and find a decent way to keep it together (no, ring binders aren't very appropriate for something you are going to read).
The main reason Gutenberg isn't successful is that it is not what people want. People don't want to read or print out old literature in the public domain. They either want a nice edition that looks good on the shelf, or a cheap paperback to carry around with them. And most likely they aren't particularly into really old books (with a possible few exceptions, which the Gutenberg project has long since covered).
It's not that the work the Gutenberg people are doing isn't important, or isn't of good enough quality, or anything else. The simple reason it's not hugely successful is that very few people are really interested. I'm sure much of the work the Gutenberg people have done will become important as soon as on-demand printing is more common and affordable.
Though it doesn't go into technology much, I expect there's a lot of potential in mass OCR tech and good speech recognition (faster to read a book aloud than to transcribe it correctly).
Was thinking about voice recognition today while lamenting that I haven't done more to type in my copy of The Queen's Necklace by Alexandre Dumas, copyright 1910.
Here are two problems that came to mind as to why I probably won't be able to use voice recog soon:
1.) Works that have been lucky enough to actually have their copyright lapse are often pretty old works. Their English (let's use English just b/c it's the lang I'm using) isn't exactly today's English, and sometimes even spellings, etc, change. Try reading anything from the 1800s and before.
2.) Names (so any protracted dialog) and other tough-to-translate stuff is going to be a pain to proofread. My book in particular has quite a bit of French in it (lots of "Parbleu" and French names with crazy accents all over the place).
I'd like to say voice recog could produce a "new version" with "updated spellings", but I just don't think that'd fly.
So once voice recog is commonplace for, say, office use (still quite a ways off) and affordable (not sure there, but I haven't heard of a friend using it yet, even just to play) we'll still have a ways to go before we can get true literature into PG simply by reading.
As an aside, at the same time I've been thinking about simply taping me reading the book and donating *that* via mp3 (or Ogg or whatever the heck). For the time being anyone who wants can listen in the car, and as soon as voice recog is up to snuff, voila. Just run it on my recording, proofread (easier said than done), and you're ready to go!
It's all 0s and 1s. Or it's not.
Actually I've found the most value from the project is downloading and reading classics. I've downloaded works by people such as: Adam Smith, Nietzsche, Aristotle, Plato, Karl Marx, Oscar Wilde, Thomas More, and various other classic writers. I've found this resource indispensable. It provides high quality texts for free. I probably wouldn't read many works by these authors if I had to purchase them. I unfortunately, don't have the money to spend on many small works such as these (they're short, but sometimes cost $10-15). I also don't have easy access to a library and I like keeping a copy for my own personal use.
So I find that Project Gutenberg is a very useful resource.
neurostar
I love Project Gutenberg, and I've used and supported it since the pre-web days. However, I don't think they go far enough.
There are plenty of places on the net where one can find and download copyrighted works. Web sites, mail servers, IRC networks, and so on. I've used them extensively, myself. Many of the books I've downloaded, I own, and I got the electronic format for searching, reading on pocket devices, and so on. I think that this is fair - I've paid for the information once, and my sense of Fair Use tells me that it's okay to get these bits in this way.
I've also downloaded many, many books that I do not, nor will ever, own. (Some of these, I will probably never read.) Is this a copyright violation? Almost definitely. Is it ethically wrong? I don't think so. I would probably never buy a new copy of these works. If I hadn't downloaded them, I would have borrowed them from a friend, or a library, or bought a used copy, and sold it back later. None of these legal methods would have earned the author or publishers a cent. So, how are they different from downloading an electronic version? In my eyes, they are not.
I buy plenty of books - hundreds of dollars worth every year. I love to read. I support local authors, and independent publishers. I do not think my actions are criminal. If someone disagrees, tough. You won't stop me, or the legions of other electronic book traders. Ever. Sorry. If it helps, think of us as the "books" in Fahrenheit 451, keeping a distributed library available for public use, in the event that something terrible should happen someday. Eventually, one way or the other, copyright will go away, and the words will be truly free again.
(And anyway, I was just joking. I'd never knowingly violate copyright law. What am I, stupid?)
Why isn't a project like this tax funded? It would be trivial for Congress to put aside a million or two to pay some schlubs to sit around doing data entry all day. Heck, create a department to do it. Almost all brick 'n mortar libraries are tax funded, so why shouldn't a public electronic library be tax funded? You could (theoretically) crank up production of the conversions to save even more rare works, on top of the fact that ideally the project could work directly with major libraries around the USA, or even the world. Of course, realistically such a project would turn into some bureaucracy that gets barely more done than the volunteer version, but it would at least look like someone cares.
Really, information is the most important thing humanity has, and the people literally "Saving" the world are doing it in their free time.
They obviously publish articles written by people with their head up their asses.
Honestly, just what is Mr. J. Bradford DeLong thinking? To characterize Project Gutenberg as a failure is just imbecilic. From PG's own pages, 203 ebooks were released in October 2002. 1975 new books in 2002 (1240 in 2001). It's a lot of work to produce even one book, and PG is churning them out at a pretty good clip for an entirely volunteer effort.
Even as it is, I've found PG to be pretty damned useful. It's kind of nice to be able to grep the collected works of Shakespeare. Or Darwin. Or Conan Doyle. Or H. G. Wells. Or Jules Verne. Or Charles Dickens. Or L. Frank Baum.
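For anyone without grep handy, roughly the same thing can be sketched in a few lines of Python. This assumes the texts have already been downloaded as plain .txt files into one local directory; the function name and the directory layout are my own, not anything PG provides.

```python
import re
from pathlib import Path


def grep_texts(pattern, directory):
    """Yield (file name, line number, line) for each line matching `pattern`.

    Searches every .txt file directly inside `directory`, case-insensitively.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(directory).glob("*.txt")):
        # errors="replace" keeps going if a file has odd byte sequences
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if regex.search(line):
                    yield path.name, number, line.rstrip("\n")
```

Something like `for hit in grep_texts(r"not to be", "shakespeare"): print(hit)` would then list every matching line, where "shakespeare" is a hypothetical folder name.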
Despite advances in technology, scanning, OCRing and proofreading books remains a very labor intensive process, and it is a boring, often thankless process as well. The Million Book project wants to take a somewhat different approach to providing digital books: they actually scan the books and store them in DJVU format (a very nice format similar to PDF). They can do OCR on it to provide searchable text, but such text doesn't have to be 100% accurate to be effective. Most of the time you print and read the original scans. After all, some publisher went to the trouble of carefully typesetting the book and proofreading it once, why bother to do it all again?
I first became aware of this project and technology when I met Brewster Kahle as he drove the Internet Bookmobile around the U.S., going to libraries and schools trying to drum up interest in Eldred vs. Ashcroft. A compressed version of Alice in Wonderland in DJVU format is about 5 megabytes (the same as a single MP3) including the illustrations and fancy typesetting. He could print and bind a copy of it for about $2 in materials, on demand using an HP laser printer out of the back of the mobile. The binding isn't amazing, but consider the possibility of having literally any book in any small town library in any place in the world. It's an exciting idea, and one that technology is only making easier and cheaper. You can get a decent scanner for $100 (even one small enough to hook to a laptop and take to a library). You can scan a book in an evening. And after you do, the file can be converted to a simple, easy to use format that everyone can use. Forever. One evening. One person. One book.
Despite the setback of Eldred v. Ashcroft, more and more books are going to be made available by the true philanthropists of the world: the volunteers who give something of their own time to make the world a better place. I wonder what Mr. DeLong has done to make the world a better place...
There is much pleasure to be gained in useless knowledge.
In addition, very few people can read a book aloud at the speed a trained typist can type it without making numerous mistakes. A skilled typist can transcribe a document they have never seen before in excess of 120 words/minute. One of my room-mates, during the mid-eighties, typed John Stuart Mill's "On Liberty" in a couple of hours. When I asked him about the speed, he said "Oh, that's nothing. This book is well edited. Normally when I do this I have to correct spelling, punctuation, and grammar." Of course he was a professional editor. BTW, the reason he was doing this was a project to distribute public domain texts on floppy along with a pretty cool Turbo Pascal software package for reading, searching and indexing. Remember, this was the mid-eighties!
Control of one's hands and the processing of visual information are closely linked in the brain, more closely than the eyes and the mouth.
Why Project Gutenberg won't succeed?
Two reasons: (one) their "universally readable" text format sucks mud, and (two) the US Government, eh, I mean Disney decided to extend copyright duration beyond any reasonable length, so no recent texts are available.
Harnessing free labor is easy enough: just stop by in alt.binaries.e-book on Usenet.
Realize that for many people this is /not/ a warez group: a lot of the regulars there just want e-versions of books they already own so they can read them on their PDAs or computers, and would be willing to pay for them if they were DRM-free and the price was decent. "Decent" means not like the Star Trek e-books for example, which cost more than the hardcover edition (which is probably the primary reason why they're DRMed).
Additionally translations might generate practical limitations. If a text was written in ancient Greece and translated to English or some other language in the 20th century, the translation might not be public domain even when the original work is. Of course you are free to read the original text or make a new translation. Anyway even if a piece of literature was public domain, the translation to your native language might not be.
Exactly. What's worse, modern texts of an ancient work are not usually considered to be in the public domain, because the work to try to clean up the errors that inevitably creep into the manuscript tradition is commonly accepted to be a copyrightable contribution. However, I don't think this is something that has ever been tested in a court (IANAL).
And many texts from before 1923 aren't very good by modern standards (too many errors).
So the solution is to try to get those who have the necessary philological skills to make translations to agree to donate their services - something that has proven an uphill struggle so far, as some translations have scored big time as bestsellers (Ciardi's Dante, Fitzgerald's Homer, Pevear's translations from the Russian, Mitchell's translations - mediocre though they are - of Biblical/spiritual "classics") and a lot of translators secretly nourish the hopes to be the next Arthur Waley.
And scholarly texts take years to produce. Again, the editors tend to nourish hopes they might supplement their income (slightly, here; we're not talking about Stephen King, or David Pogue, or even Simson Garfinkel type numbers; I doubt that editors of ancient texts even make 1/100 from their books what David Pogue makes) from the royalties from the 2,000 copies sold to libraries, or the 10,000+ copies sold to students.
Actually, he didn't even invent moveable type. The Chinese did that with wooden blocks much earlier and there were existing printing presses that used moveable blocks.
Are you sure about this? My understanding was that early (pre-Gutenberg) Chinese presses didn't have sorts, because with the sheer size of the Chinese writing system, they wouldn't have been efficient with the level of technology to produce wooden blocks. But I'm willing to be corrected (with a reference, preferably).
Anyway, see Elizabeth Eisenstein's The Printing Press as an Agent of Change for a lot of the information that Mr. Orn describes. A more accessible book, The Nature of the Book by Adrian Johns, discusses some of this in the earlier chapters.
So, at present Australians can get up to the beginning of 1953. Seems a hell of a lot easier to follow than the mess of dates the parent posted.
Not quite.
Up to 50 years after the end of the year of the author's death
i.e - they can get stuff up to the end of 1952, assuming the author also died that year.
I wonder though. What if they wrote something in 1951, died in 1952, but it was only discovered (and published) in 1973. What applies?
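The life-plus-50 rule as quoted above boils down to a bit of arithmetic, sketched below. This is only the simple rule as stated in this thread; it does not attempt to answer the posthumous publication question, and the function names are made up for illustration.

```python
def public_domain_year(death_year, term_years=50):
    """First calendar year in which the author's works are free to use.

    Protection runs to the end of the year `term_years` after the year of
    death, so the works open up on 1 January of the year after that.
    """
    return death_year + term_years + 1


def is_public_domain(death_year, current_year, term_years=50):
    """True if `current_year` is at or past the first free year."""
    return current_year >= public_domain_year(death_year, term_years)


# An author who died in 1952 is covered through the end of 2002.
print(public_domain_year(1952))
print(is_public_domain(1952, 2003), is_public_domain(1953, 2003))
```

That matches the correction above: in 2003 the cutoff is authors who died by the end of 1952, not the beginning of 1953.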