Domain: pgdp.net
Stories and comments across the archive that link to pgdp.net.
Comments · 147
-
Re:Editing?
Hi, pedant, there is a perfect project for you where they scan a bunch of books in and need pedants to correct any errors. It's a good cause, and it'll let the rest of us just read our nerd news.
-
Leave the conversion to those skilled at it
I think the potential of new Google-backed OCR software is pretty high but I'm not certain that your average library would have the manpower and technical know-how to manage a book-to-ebook conversion, Google OCR software or not.
If libraries are interested in getting their out-of-copyright assets into digital form, they really only need to contact someone at Distributed Proofreaders to get the ball rolling. DPers would take care of the scanning, proofing, formatting, and post-processing of the book on behalf of the library, requiring nothing but a temporary loan of the book or manuscript (something the libraries already excel at :) -
Re:Wonderful!
This'll be a much needed boost for us Linux users who want to help out Project Gutenberg.
Join the Distributed Proofreaders
and do any or all of:
1) do some proofreading or formatting of a PG text
or
2) Smooth read a near-finished text looking for overlooked oddities
or
3) help improve DP's processing software. Lots of extra features wanted...
or
4) Get copyright clearance, scan a book and upload to DP's OCR pool
or
5) Run your Windows OCR under WINE like I do...
More details here -
Obvious, I declare
Those skilled in the art will recognize this as the obvious solution to a problem. I really need to come to grips with the fact that every single application of any software construct is patentable, no matter what. I keep forgetting that.
Those skilled in the art will also recognize this "invention" as being very similar to the Gutenberg Distributed Proofreaders Project, which is notable for a) rocking and b) having gone online before the patent was filed.
It's to the point of insanity now—it's difficult to imagine any software you write doesn't violate a patent. Everyone just ignores them and hopes they don't get sued, with big companies relying on mutually assured destruction to keep the lawyers at bay. Is this really the type of patent system we want?
-
Re:Please do a better job, not just a bigger job
In my scanning for Distributed Proofreaders http://www.pgdp.net/, I would have to say that over 90% of the books do not have illustrations or footnotes that require scanning at higher resolutions than 300 DPI. Even the ones with illustrations are probably fine with 300-600 DPI scans. For the most part, the black and white images of text pages in the Google PDF files are adequate, although the illustrations are low resolution garbage. The real problem I see with Google's work is that it's substandard, with missing illustrations, missing pages, and poorly scanned pages. They do rescan books when errors are reported, but even then, it takes a few months and the rescans are subject to error. Unless there are multiple projects independently scanning and OCRing these works, I fear that relying on a single source will leave us with an incomplete and somewhat unusable digital copy of these Google partner libraries.
The second issue I have is that the full image display at both Google and the OCA/live.com (and PDF downloads of full images) is not particularly useful on low resolution displays, like PDAs, mobile phones, tablets, and dedicated ebook readers. Perhaps future generations of ebook readers will have the form factor of a paperback book with high enough resolution to view the scans made available by google and the OCA, but I don't see it happening for years.
The last complaint I have about Google is that with their proprietary database, it's not easy to create searches based on criteria not in their search parameters (for example, based on number of pages).
Despite these complaints, I'm still pleased that Google decided to try to digitize the works in these libraries, if for no other reason than it got several digitization projects funded. -
Re:Scanning a book is easy...
Now you've got a real problem. How does one know if a page of a book is OCRed correctly?
Now that's simple. Distributed proofreading. Just the sort of thing Google is good at.
No, the sort of thing the distributed proofreaders are good at. And they're already raiding Google's scans, as well as many other page image collections.... -
Re:false positives
I worked on image recognition software on my last job. The project was a failure, for many reasons. Our stuff was worse than average for such software, but the biggest problem is that facial recognition is very hard. The best stuff out there can maybe hit 90% accuracy. With lots of time consuming help from people, the accuracy rate can be pushed up to 98% or 99%. For 90% accuracy to even be possible, the subjects have to be photographed in a very controlled environment. Everyone's face must be in exactly the same orientation and position, everyone must have the same facial expression, the lighting must never vary, and the background should be blank. Superficial changes like adding or removing glasses, facial hair, makeup, pimples, and so on add to the difficulty. And there's changes from aging. Whittling 10000 pictures down to 10 would take, what, 99.9% accuracy? Just not there, so your plan won't work.
Computers can kick human butt at number crunching, chess, and such like. But in recognition, animals (including people) are still the masters. There's this vague idea floating around that computers will just naturally be better at facial recognition when scientists figure out how to properly apply their vast processing powers to the problem. Add to that the horror stories of eyewitnesses getting it wrong, of bad disguises actually working especially in the movies (Clark Kent for instance), and one could easily be led to think we all suck at recognition. Not true-- we're great at it. The above characterizations greatly misrepresent the supposed powers of machine and man. It's the sort of thinking, underestimation, and oversimplification that has led AI down so many blind alleys, with many fantastically wrong predictions over the years of what computers would soon be able to do. Beat the best human chess players? Just about there, at last, 40 years after the optimists of the 1950s thought it would happen, and the method is a disappointing brute force cruncher that does not employ "real" thinking involving making plans and such like reasoning. Translate languages? Still working on that one, with Babelfish good enough to be occasionally useful. OCR is decent but still not a match for people, or we wouldn't have this. Surely OCR is a much simpler problem of the same sort as facial recognition, and look how hard that still is for computers.
Law enforcement organizations don't understand or appreciate the difficulties and limitations, and aren't being helped by the many businesses that shade the truth and obscure the finer points for the sake of potential big sales. And if all the problems I mentioned above aren't enough, scaling up to millions presents another hard problem. Researchers aren't giving scale much attention at the moment because without something that works at 99% plus, there's no point. That leaves the ignorant dreaming that soon we'll have facial recognition software that can be applied to millions. They will be disappointed.
-
Distributed
This sounds like Project Gutenberg Distributed Proofreading.
Every now and then, I log into Project Gutenberg Distributed Proofreading. There, I proofread a couple of scanned pages and then leave it at that for a few weeks. It's not much but that's OK; it's the power of numbers that kicks in. -
Re:An Unintended Consequence
The original images Google made available were really low quality, about 75 DPI black and white, and the Open Content Alliance's images will be about 400 DPI color (Yahoo! and MSN are contributing to OCA). Volunteers at Distributed Proofreaders http://www.pgdp.net/ have been scraping images from Google since they started the Google Books (Library) project. Now that Google makes PDF files with higher resolution black and white images available, the images work better with the OCR programs used by the DP volunteers, but it's still not as high a resolution as OCA's. Google has made over 100,000 public domain books available in full view mode, but quality control has been rather spotty. Most books with illustrated plates are missing the illustrations and/or pages around the illustrations.
Any scraping done by a competitor would need to have people doing some portion of the scraping, as Google either requires CAPTCHA responses or blocks all access from IP addresses/ranges where they detect signs of scraping. -
Re:version version everywhere
Optical Character Recognition.
Open source OCR is crap. (If you know of one that's not, point me to it!) ABBYY FineReader is pretty good, and it's pretty much the standard used by the Distributed Proofreaders, but that's Windows only for all practical purposes--there's a corporate Unix version starting at a huge corporate price....
So I run FR under WINE, and it works well enough. But I hanker after a native version. -
Re:Wrong... They are using all types of books
Great! This means I'm not necessarily contributing to Spam 2.0 when I proofread at Distributed Proofreaders http://www.pgdp.net/!
-
Nothing
I am reading ebooks right now.
Having said that, there are a few reasons why I am reading ebooks, and there are still plenty of things I would like to see improved.
First of all, I use my Palm to read the classics with. The advantage isn't even so much that of price, but of availability. Project Gutenberg has scanned pretty much all of the classics in the English language. (And where it hasn't, that's your fault for not warning us.)
Ah, there's the second reason: I am a Distributed Proofreaders volunteer, and my Palm helps me read the books I helped produce. For instance, currently I am reading H.G. Wells' Certain Personal Matters (not published since 1901, and a lovely collection of satirical essays!).
As for the things that need improving: devices. I am someone who carries books with him, and so I have this wishlist that devices currently do not live up to. Weight, price, size, power consumption, ports, software, none of the devices currently available get all of these right.
And of course it speaks for itself that I own the device, not the publishers. That is why I will never buy Sony. -
Other Points
Here are a few other points I haven't seen mentioned.
Not all e-books are DRM'd up the wazoo, manybooks.net has 12,537 free un-DRM'd texts available in a variety of formats for each text. They are produced by Project Gutenberg and if you like you can help out too !
I have a palm tungsten C which has ended up being used mainly as an e-reader. I currently have around 70 full length books on it, and one of the best aspects of it is, if I get bored of one book and feel like a change, when I go back to the same book later on, it's still at the same page (dead handy for tomes like War and Peace !). Add to that the inline dictionary, bookmarks and notepad and it's a cool tool.
I have had to play with the background and font colours a bit to get it just right for my eyes, but now I can read comfortably at the same distance as I would a normal book, either while in the dark in bed, or in daylight.
Also, as the Tungsten C is wireless, I can dload the ebooks to my home server, and if I need to add a new selection to the palm, I just dload it over the LAN, and it goes right into the library. No need to fire up the crappy "Documents To Go" software on my XP laptop anymore.
Battery life is fine. I can read a whole book with only one recharge, as long as I turn off the wifi.
I have yet to see any real life "e-paper" and so I'll reserve judgement, but it can't be much easier to use than my tungsten c.
-
Re:How can I help?
How can I help? I'm willing to give a couple of hours a week, I don't have a scanner, but I'm willing to type...if this is truly "open", I will be more than willing to contribute my time.
As a few others have mentioned, jump in to Distributed Proofreaders. We take the raw images (either scanned specifically for DP or taken from scanning projects like this) and produce checked, corrected text, which then goes to Project Gutenberg. A few hours a week can help a lot.
-
Re:Contributing to Gutenberg
The scans won't be added to Project Gutenberg, but it's very likely that the scans will be used by Project Gutenberg's Distributed Proofreading project, which I'm involved in. We're already 'harvesting' images from quite a few sites, as well as all the images our volunteers scan. Now that there are several large and relatively well funded scanning operations getting off the ground, I imagine that over time an ever increasing proportion of the works that go through DP will be based on harvested images.
I maintain several lists that show the DP harvesting status of several image collections, including The Internet Archive's Canadian Libraries collection, Google Print, and Early Canadiana Online. As you can see, we will not be running short of material to work on for a very long time, even without any of these recently announced initiatives. That said, it's always great to see more material be made freely available, rather than locked up behind expensive subscription services like Jstor and EEBO. -
Re:Why not join the Gutenberg Project
So, as the summary states:
make them available for Web searching
does not mean that there will be a complete text index available (that is full text search,) but instead you can only search for specific works?
That probably means that the search index will be uncorrected OCR, which leads to some inaccurate searches. The problem with using raw OCR is scannos, words that may be recognised as a different word that "looks" the same, for example modem and modern, or an i might be recognised as a slash.
I do that every once in a while on their German counterpart: GaGa
Your time might be better spent at the real Distributed Proofreaders, or DP-Europe, since Projekt Gutenberg-DE is not an official branch of PG, and actually copyrights its output (unlike the real PG).
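The scanno problem described above (modem versus modern, a slash where an i was printed) can be sketched in a few lines. This is only an illustration, not the tool any proofreading site actually uses: the `CONFUSIONS` table and the function names are hypothetical, and only the m/rn and slash/i pairs come from the comment itself.

```python
import re

# Hypothetical confusion pairs: (what OCR emitted, what may have been printed).
# The m/rn and slash/i pairs come from the comment; the cl/d pair is an assumption.
CONFUSIONS = [("m", "rn"), ("rn", "m"), ("/", "i"), ("cl", "d"), ("d", "cl")]

def scanno_candidates(word, lexicon):
    """Return other lexicon words reachable from `word` by one confusion swap."""
    found = set()
    for seen, printed in CONFUSIONS:
        start = word.find(seen)
        while start != -1:
            candidate = word[:start] + printed + word[start + len(seen):]
            if candidate != word and candidate in lexicon:
                found.add(candidate)
            start = word.find(seen, start + 1)
    return sorted(found)

def flag_scannos(text, lexicon):
    """Map each suspicious token in `text` to the words it might really be."""
    flagged = {}
    for word in re.findall(r"[a-z/]+", text.lower()):
        alternatives = scanno_candidates(word, lexicon)
        if alternatives:
            flagged[word] = alternatives
    return flagged
```

A word like "modem" is a valid dictionary word, so a plain spell check never catches it; the sketch only flags it for a human to look at, because software alone cannot know which of the two words the page really shows.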
-
Re:Why not join the Gutenberg Project
Project Gutenberg and the Open Content Alliance are working on two slightly different things:
The OCA is making available the images of scanned pages. That's fine for reading an entire book, but you can't search it, nor copy a section of text into a document of your own.
Project Gutenberg makes available plain text, usually illustrated HTML, and occasionally other versions, of public domain books, which can be used by anyone for no cost.
If you'd like to help prepare public domain ebooks, visit Distributed Proofreaders and proofread a page a day (or more!).
-
Sorta.
Project Gutenberg frequently makes use of the page scans for source material. What PG does is to run the images through OCR, proofread and post-process it. It's more useful than a stack of page images, but considerably more work.
If you look at the current books on Distributed Proofreaders, you'll see that some of them credit the Million Books Project for the page scans. -
Re:Yahoo does it better via Open Content Alliance
Here is why Projet Gutenberh sucks:
Hmm, both words of the service misspelled. This ought to be good.
no good search engine
Does the catalog not count? Ok, I'll admit that the search engine is more of a card catalog, but using "site:http://www.gutenberg.org/ search_terms" on Google has always worked fine for me.
no publishing house participation
The idea is to make the *full* text of books available. How much of the already in-copyright texts do you think you'll be seeing on Yahoo? So little it won't matter. Sure, you'll be able to search it, but you can already do that w/ Google's project. Yahoo simply seems to be mimicking stuff already happening, instead of trying to team up. In my book, it wastes time and makes them look suckup-ish.
no standards or accuracy
You are correct, there are not *official* standards, but there /are/ an _unwritten set of rules_ about how text should be formatted that I've seen in 95% of the texts there. As for the accuracy, it's surprisingly high. The *worst* accuracy I've ever seen was six typos in a ~200 page book. The thing is, it's like Wikipedia. Stuff gets edited and updated to make it more accurate. However, this is just for the independently produced submissions.
I'm guessing well over 90% of the books for PG are submitted by Distributed Proofreaders. They have a ton of standards, and the quality of the texts they produce is pretty amazing.
Don't compare one guy's hobby to a serious effort to make content available...
One guy's hobby?! In the last 24 hours alone, there have been 470 people logged in and working at Distributed Proofreaders. When you consider that all these people are volunteers, that's pretty amazing. This figure probably does not include those who were scanning in more texts for processing.
...AND searchable.
Google "site:http://www.gutenberg.org/ search_terms" if you will. Or, since you seem to be a shill for Yahoo, try "site:gutenberg.org search_terms". (The URL seems to have to be dumbed down for Yahoo, for some reason.)
-
It's not quite that bad.
At least in Canada, they have Life+50 copyright, so that they celebrate Public Domain Day every January 1. (This year: Albert Einstein! Next year: A. A. Milne! And so forth.) There's talk of setting up a Project Gutenberg in Canada, so at least old works from that era will be preserved, if not made legally available in most places. (Australia also has Life+50, but I think that's changing, alas.)
In any case, the set of books copyrighted by January 1, 1923 (not 1922) is indeed finite, but you might be interested to know (see Free Culture, p. 147) that the average copyright term in 1973 in the US was 32.2 years, because most (more than 85% of) works were not renewed. Due to retroactive extensions and associated bullshit, after 1978, works created 1964 or later were automatically renewed. But Project Gutenberg has a Rule 6 to deal with that. Consider (I think you may have to sign in to see this) Plague Ship, by Andre Norton, published in 1956, currently being post-processed.
'Course, the fact that folks are working hard to drag works into the public domain where they would be in a sane legal system at this point doesn't invalidate your original point. But Project Gutenberg isn't about to run out of material, not when they have a big chunk of the 20th century to deal with. (They just don't have anything particularly popular from that period.)
Oh, and PG doesn't really have 16,000 books. Some works were released in little bitty pieces. Consider an example. But there are still, I think I've seen estimated, around 10,000 real, individual titles in there. (Of course, any measure that counts the encyclopedia-sized "Modern Machine-Shop Practice" and the Declaration of Independence equally can't really be that accurate, now can it?) -
Re:Brick and Mortar??
While it is labor intensive, I can check a book out from the library, scan it in an hour or two, run it through OCR, and voila! my own personal ebook. The latest Harry Potter was available as an ebook within a day of release.
Right now, I only do this for books in the public domain for Distributed Proofreaders http://www.pgdp.net/, and depending on the binding, I can do 3-12 pages per minute.
Personally, I do not see Google getting into making hardcopies for two reasons: First, I believe their OCR is good enough for approximate search, and not a true digital copy suitable for on-demand printing, and second, they're not interested in trying to figure out what books published after 1922 are really public domain because they were not renewed.
IANAL, but as long as they severely limit the search results, I think it will be hard to argue that the search results return more than what is acceptable with fair use. However, I think the copyright holders might be able to argue that borrowing a book from a library to make a permanent copy of their own is a copyright violation. I personally think that unless these publishers are mostly printing PD texts (with or without new copyrighted material like prefaces by some scholar), including their work in Google Library is a win-win situation for Google and the publisher. -
Epson for speed
I bought a bunch of scanners for my Project Gutenberg work. I found a lot of really slow ones that will drive you crazy when trying to scan a whole book. I recommend the Epson 2400; it was cheap and is really fast. It might not still be available, but Epson is a good start in general. Check the SANE List if you care about Linux compatibility.
Also see this wiki for a discussion of various opinions on scanners.
-
Distributed Government Document Browsing
This new glut of public information gave me an idea. Something similar to Distributed Proofreaders but for scrutinising government documents. Volunteer readers would look at a few scanned pages, marking the ones that would be of broad interest, and then the most interesting get compiled into a list.
If only there were 25 hours in a day.
-
Re:Quick Script + Gutenberg?
Why do you have to warn someone about Gutenberg due to being an English Major?
With this attitude you shouldn't read any old books; in fact, I would recommend not reading a lot of today's newspapers, which are very poorly edited.
The books on Gutenberg are not edited, but preserved as close as possible to their original content. I find it very interesting and educational to read those books despite any mistakes or odd spellings authors and printers may have made. I often find a lot more wisdom in those "badly edited" books than in the very shallow books you find today (which, by the way, are often no better edited, due to fast turn-arounds).
But then... that's only my 2 cents' worth... -
Distributed Proofreading
Hey, instead of warning off people from something you clearly do not understand, why not use your super English-major skills and help Gutenberg as a distributed proofreader?
Go on. Join.
-
Re:Electronic Dog Poo...
If only there existed somewhere a company... with the gumption to take on a project as large as OCRing such an immense Public Domain repository.
Or, indeed, a volunteer organisation. -
Re:Digitalisation
perhaps consideration could be made to include digitalisation of all Literary works which have fallen into the public domain.
Excellent idea, which is why quite a few of us are doing just that. If you'd like to help Project Gutenberg's effort to digitise the public domain, then join our Distributed Proofreading site and get to work! Over 6000 books turned into electronic text form so far. -
There is digitalization, and "digitalization"...
Mod me down if you wish, but I have to say that I found Google Print nice, but not too useful. Sure, it's a nice thing that you can search through paper books, but in most cases you can't actually read them; you have to buy them, and this even goes for classics such as "20,000 leagues under the sea" which are already digitized by Project Gutenberg or similar organizations: Google digitizes newer, copyrighted editions even when there are older, public domain editions available. Thus, in my eyes Google Print is little more than a marketing door for on-line bookstores.
On the other hand, French digitalization project Gallica, though sometimes mocked on Slashdot, not only digitizes books, but gives the scans away freely (as in speech), so everyone can read the books in entirety or use them as they please. Both Distributed Proofreaders and Distributed Proofreaders Europe already use Gallica scans to produce completely digitized and free e-books which you can search, read, datamine, or do with them anything that suits you. If Slashdot readers are supporters of free software, this too is something they should revere.
I hope that Europeans will not compete with Google. I hope that they will make bigger, better, and more diverse Gallica.
-
Re:What I see
The result? a bunch of low res image in locked PDF (can't select and copy) of some two hundred years books
And it's far more important to culture to scan the latest Photoshop books? The reason you can't select and copy is because there's no OCR.
which have been stupidly spent for a totaly useless project!
The Gallica archive at BNF has been a wonderful source for Distributed Proofreaders and from them, Project Gutenberg, with BNF's blessing. Their scans have been far from useless. -
Consider the technology
Consider that Google is showing you the original scanned page image, and yet they are highlighting the search terms. Howdeydodat?!?
Only way I can figure is, they have OCR'd the images and indexed each OCR'd word by its X/Y location on its image. So then, when you search for "frummage" they know where every instance of "frummage" starts (and its height and width) in the page image, so they can display the page image with the yellow hilite over the word.
At first this seems like a peculiar approach. Having OCR'd and indexed every word, why would they not simply store the words and discard those bulky bitmaps? I can think of two reasons.
One is probably copyright: they have a deal with that publisher to index that edition, and when words are indexed to specific page locations, Google is forever confined to displaying those page images and no others.
Second is that while it is fairly easy to extract words from page images, it is darn hard to reconstruct those words in pleasing HT- or XML that properly conveys the look and feel of the original book. I've worked on doing just this for Gutenberg Distributed Proofreaders and it is very demanding.
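The guess above, that each OCR'd word is indexed by its X/Y location on the page image, could look roughly like the following. This is a minimal sketch of the idea only; nothing here is known about Google's actual implementation, and `Box`, `build_index`, and `highlight_boxes` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple

class Box(NamedTuple):
    """Where one OCR'd word sits on a page image (pixel units assumed)."""
    page: int
    x: int
    y: int
    width: int
    height: int

def build_index(ocr_words):
    """Build word -> list of boxes from (word, Box) pairs reported by an OCR pass."""
    index = defaultdict(list)
    for word, box in ocr_words:
        index[word.lower()].append(box)
    return index

def highlight_boxes(index, term, page):
    """Return the rectangles to paint yellow when `term` is searched on `page`."""
    return [box for box in index.get(term.lower(), []) if box.page == page]
```

With an index like this the page image itself never has to be re-read at search time: the viewer shows the stored bitmap and overlays one rectangle per hit.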
-
Quit moaning and help
A lot of people here seem to be complaining about Google's implementation, and their treatment of old books as copyrighted when they clearly are not.
If everyone who posted to slashdot proofed a few pages every now and again at Project Gutenberg's Distributed Proofreading project, things would be a lot better. Go on, you'll be making a much more important contribution to the world!
P.S., I'm not a hypocrite. I've done 4 pages today.
-
Re:Nice!
Actually we do save the images. Many of the initial projects' images are saved on CDs, but anything from the last few years will make its way to the 'Open Library System', which is an image archive of the DP page scans. You can find a pre-alpha version at: http://www.pgdp.org/ols There are images for about 1,000 projects there, with many more pending me getting around to importing them. Lots of work to be done, developers welcome.
Charles Franks
Founder, Distributed Proofreaders
-
Re:Nice!
Well, there's the Distributed Proofreaders project for Project Gutenberg... but PG isn't a "we must be the source" attitude from what I've seen. As far as PG is concerned, the more eBooks, the better.
DP probably isn't threatened either - they just shift focus to books that are not in the Harvard collection to avoid duplication of effort. -
Re:Firefox still has one major issue
Let me first say that I've been using Gecko-based browsers since _1999_ - the first release of the almost UI-less Gecko HTML rendering component - and have persisted with bug reports and whatnot since then. I perversely keep each new Firebird instance running until it crashes.
I assume this is a Linux-only issue because there'd be more fuss if it affected the win32 users. For me at least, 1.0 is a stability disaster. I'm getting three or four crashes a day on average. Talkback kicks off each time but more often than not I end up with multiple queued incidents and 'network connect failed' errors - could this be caused by millions of other crashing Firebirds saturating the pipe to crash-data? I'd love to report a bug but don't see any commonality except that it seems to happen when I have Project Gutenberg Distributed Proofreaders' proofing interface open, which means I have two windows - one with PGDP and one with multiple tabs. I only usually see a few other sites, mostly Slashdot, the Reg, and the BBC's text-only pages (no broadband out here in the west of the UK :( )
It's got so bad I'm now switching to Konqueror for normal browsing. Every Firebird crash is losing me data, in fact *work* in the form of half-proofed Gutenberg pages. It's especially heartbreaking when I've been with the project so long, thru' all the various M-xx milestones, the mozilla 1.0 'gold' back in June 2002 - when I picked up a souvenir CD - all in all, very very sad.
-
Re:Supporting the Environment & China
Please help me out...
-
Re:It does matter
Please read this plea for help!
-
Re:You call that a deal?
There's nothing odd about public domain works being sold. On the contrary, it is the nature of the public domain that anything can be used in any way you like.
That, however, has little to do with format conversion. AC, what development work does it take to dump the ASCII file from Gutenberg onto the Palm device? That is something I do on a regular basis, and all it takes is the minute I need to start and run Pyrite Publisher.
Perhaps you are talking about adding value to the text (prettifying, adding commentary, adding in links, et cetera), but a trained person can do that in a few hours.
Case in point: Distributed Proofreaders, main supplier of Project Gutenberg, now submits over 70% of its etexts both in ASCII and HTML format. Having the HTML file available cuts down tremendously on the work of making nice-looking PDFs or other formats, and allows us to retain much of the original lay-out of a work. This is all done by volunteers, many of whom had no experience with HTML whatsoever before they started working with us. The learning curve seems steep, but after a few books most people seem to be comfortable producing great HTML editions. (And for every one of the few volunteers who just do not get the hang of it, there are several volunteers who are glad to do nothing but HTML editions.) -
Re:Reading Is Life
no money, but proofreading (and for a good cause).
-
Re:Reading Is Life
How do you get hired to proofread? I love to read and am pretty sure that I can catch a lot of errors. Do they send you a test manuscript to see how many errors you can detect? Thanks for indulging my curiosity.
No money, but if you enjoy proofreading, try Distributed Proofreaders.
-
Re:Reading Is Life
You don't get paid, but if you really want to proofread, consider Project Gutenberg Distributed Proofreaders: http://www.pgdp.net/
-
Re:National Geographic online
I was thinking about digging up old National Geographics, scanning the text and photos, and posting that online. It would make for a great distributed project.
If you're willing to scan them, Distributed Proofreaders (http://www.pgdp.net/) is willing to correct the OCR and even have people assemble them into HTML, provided you're willing to let Project Gutenberg (http://www.gutenberg.org/) post them. -
Re:Typeface ?
Yef, we could recalibrate the OCR for the early fontf, but the text ftartf to look ftrange.
That's not hard. It would be easy to get the OCR to recognize the long-s (which does in fact look different from the f); even if you don't, post-processing (dictionary lookups to see if f or s is valid at a point) can clear up many cases, and for those it doesn't, well, you're going to have to check and fix the OCR anyway.
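The dictionary-lookup idea can be sketched roughly as follows. This is a minimal illustration and not Distributed Proofreaders' actual tool: the function names and the tiny word list are hypothetical, and a real run would load a full dictionary. A word that is already valid is left alone, a word with exactly one valid f-to-s variant is replaced, and everything else is flagged for a human.

```python
import re
from itertools import product


def long_s_candidates(word):
    """Yield every variant of `word` with at least one f replaced by s."""
    positions = [i for i, ch in enumerate(word) if ch in "fF"]
    for choice in product((False, True), repeat=len(positions)):
        if not any(choice):
            continue  # skip the unchanged word
        chars = list(word)
        for pos, swap in zip(positions, choice):
            if swap:
                chars[pos] = "S" if chars[pos] == "F" else "s"
        yield "".join(chars)


def fix_long_s(text, dictionary):
    """Return (fixed_text, unresolved_words).

    `dictionary` is a set of lowercase words (hypothetical input).
    """
    unresolved = []

    def repl(match):
        word = match.group(0)
        # Leave words without an f, or words that are already valid.
        if "f" not in word.lower() or word.lower() in dictionary:
            return word
        hits = {c for c in long_s_candidates(word) if c.lower() in dictionary}
        if len(hits) == 1:
            return hits.pop()
        # No valid variant, or more than one: a human has to decide.
        unresolved.append(word)
        return word

    return re.sub(r"[A-Za-z]+", repl, text), unresolved
```

Note the limitation: a misread that is itself a valid word ("fame" for "same") passes through untouched, which is one of the cases that still has to be caught by proofreading.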
(This is not theory; Distributed Proofreaders (http://www.pgdp.net/) has and uses such a post-processor.) -
Re:Why not pass it through project Gutenburg?
Technically, it's us folks at Distributed Proofreaders that do the dirty work of fixing OCR problems.
I've done over a thousand pages since it started... It's gotten really easy for me to pump out pages, and I've been turned on to a lot of different information that I'd normally not expose myself to... It's quite enriching -- so you should try it if you've got time! ;-) -
Re:It Isn't a "Threat"
Note that PG and PGA, while related, are distinct entities. When PGEU and PGCanada get going (both are in the planning stages), then we'll have a group of projects, all with the same aim, but tailored to their particular geographical areas. PGEU, in particular, will concentrate on the large amount of *non-English* public domain material out there -- you can help proofread some of it by joining the European version of the US-based Distributed Proofreaders.
It's a nonsense to say that the only things PG should publish should be public domain in *all* countries -- indeed, the major difference between copyright laws in the US and those in the *entire rest of the world* is the main reason to want to branch out and create regional 'editions' of PG. Due to corporate interests, no new material will enter the public domain in the US for at least the next 14 years -- in the rest of the world, new material is added to the public domain on January 1st each year. By 2018, when material published in 1923 becomes public domain in the US, every work published by authors who died before 1948 (for the EU), 1958 (for India), or even 1968 (for Canada) will be public domain in those areas.
The US is currently trying to push life+50 countries to become life+70. When it succeeds in this, it will start pushing for life+70 countries to become life+90. The trend for ever-increasing copyright terms has to be resisted. One of the key ways to do this is to build people's understanding of the need for, and benefits of, the public domain. PG is a key part of this. -
Distributed Proofreaders
I'm not convinced that OCR quality is good enough today to store the books as ASCII text. You're going to be doing a lot of work making the scan
A lot of work by Distributed Proofreaders?
-
Re:Request for MATH experts
Only one MATH book is ever in the first round at any one time. Hilbert's book is that one right now.
The logic behind this is simple. Most of our volunteers avoid these books like the plague and if we kept releasing new ones, pretty soon the entire first round would be only MATH books.
To see what's waiting in the queue for English language math books, see here. For Languages Other Than English (LOTE) math books, see here.
-
Re:law of averages?
"However, I am curious as to just how accurate the proofreading is."
That's very hard to tell, as there is no gold standard for accuracy. We have two sometimes conflicting goals with regard to accuracy: one is to preserve the author's intent, the other to preserve the actual printed text. At some points these two conflict, for instance when we would like to normalize spelling to increase readability.
There is currently some discussion on the DP forums about which system would best eliminate the common errors that everybody tends to overlook.
We already have several systems in place to help us with these. For instance, we use a specially modified font that helps to highlight differences between letters. It's dog ugly, but that's intentional; because it grates, you see errors much more quickly.
Also, once common errors are identified as such, we write software that can help us find such errors.
Finally, we use these new-found methods to look at books we posted to Project Gutenberg in the past, to measure the increase in accuracy.