Digital Cameras vs Scanners for OCR?
ttennebkram asks: "With 6 and 8 Megapixel cameras on the market, some now with Wifi built in, it might be more convenient to shoot pictures of your bills and papers with a camera than fussing with the scanner. By the numbers, it would seem feasible. 300dpi for an 8.5"x11" sheet of paper works out to about 8 megapixels; 300 dpi is usually what OCR vendors suggest. I imagine for high volume good results you'd want to maybe mount the camera on a tripod arm over your desk. Heck, I was thinking of a glass desk and maybe one camera below and one above, and maybe a foot pedal to trigger the cameras (and I suppose a flash and high F-stop would help as well). If I could quickly 'snap' all the junk paper I have and electronically file it, maybe OCR the images at night in batch while I'm asleep, and then maybe get rid of all that paper once and for all. Using a traditional cheap scanner just takes too long. So has anybody tried this? I realize that camera optics are different than scanner optics, so maybe it's not just a question of raw pixel counts. Any thoughts?"
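The "by the numbers" claim in the question can be checked with a few lines of arithmetic. This is a rough sketch only: the function names are made up for illustration, and the 4:3 and 3:2 sensor shapes are assumptions (the question does not say which camera is involved). It shows that a letter page at 300 dpi needs about 8.4 million pixels, and that a sensor whose shape does not match the page has to be somewhat larger to cover it.

```python
import math

def pixels_needed(width_in, height_in, dpi):
    """Return (width_px, height_px, megapixels) for a page captured at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px / 1_000_000

def sensor_megapixels(page_w_px, page_h_px, aspect_long, aspect_short):
    """Smallest sensor (in MP) of the given aspect ratio that covers the page.

    Assumes the camera is held in portrait orientation, so the sensor's
    long side runs along the page height. Hypothetical helper.
    """
    # The short side must cover the page width, and the long side
    # (short * aspect_long / aspect_short) must cover the page height.
    short = max(page_w_px, math.ceil(page_h_px * aspect_short / aspect_long))
    long_side = math.ceil(short * aspect_long / aspect_short)
    return short * long_side / 1_000_000

w, h, mp = pixels_needed(8.5, 11, 300)
print(f"{w} x {h} px = {mp:.1f} MP")                               # 2550 x 3300 px = 8.4 MP
print(f"4:3 sensor: {sensor_megapixels(w, h, 4, 3):.1f} MP")       # 8.7 MP
print(f"3:2 sensor: {sensor_megapixels(w, h, 3, 2):.1f} MP")       # 9.8 MP
```

This counts pixels only. It says nothing about lens distortion, focus, or lighting, which is where the optics question raised at the end of the post comes in.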
...the aspect ratio and even lighting are your enemies. It's almost impossible to shoot a bill or a check stub dead on, at close range, without fish-eye'ing, and without getting in your own shadow. Sure, you might have a little white linen box that you use to take your eBay photos, but, seriously, this is a job for a scanner.
If it's just for keeping a record of bills and other junk, why even bother with OCR? As long as you can read the results, just snap away.
I was conned by an old man in a cloak. It turns out those *were* the droids I was looking for.
Get a scanner
If a job's not worth doing, it's not worth doing right.
What you want is a scanner with a sheet feeder and a GOOD one at that.
Absolutely.
I tried this, myself, a few years ago. I guarantee that, using a camera, you'll get through, maybe, 100 pages. I got a decent scanner (HP something or other) with a sheet feeder. It does about 12ppm and that turned out to be too slow. I got tired of it in a day or two.
I tried a bunch of different solutions, but I finally had to take it all to work. We had a Fujitsu M4097D and an enormous Ricoh Copier/Scanner/Fax machine. Both did 60ppm, both sides (120 images a minute). I actually made some headway with that setup, but I still didn't finish.
As far as OCR is concerned, don't bother. Even today, it's nowhere near accurate enough. In my experience, the best software out there gets an average of one error per page on a really good scan. Trust me: it will take a lot more of your time than you think to fix that. Assuming you're doing mostly black and white text, G4 compression will compress a 300dpi, 8.5x11 image down to about 100k. At that rate, you can store close to 7000 pages on one CD.
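The storage figures above can be sanity-checked with a quick calculation. This is a back-of-the-envelope sketch, not a measurement: the 100 KB per page is the estimate from the comment, while the 700 MB CD capacity and the decimal units (1 KB = 1000 bytes) are assumptions, and the function name is hypothetical.

```python
def g4_storage_estimate(page_kb=100, cd_mb=700, width_in=8.5, height_in=11, dpi=300):
    """Return (raw_kb, compression_ratio, pages_per_cd) for 1-bit page scans.

    page_kb is the assumed size of one G4-compressed page; cd_mb is the
    assumed CD capacity. Both use decimal units (1 KB = 1000 bytes).
    """
    # A black-and-white scan stores one bit per pixel before compression.
    raw_bits = round(width_in * dpi) * round(height_in * dpi)
    raw_kb = raw_bits / 8 / 1000
    ratio = raw_kb / page_kb
    pages_per_cd = cd_mb * 1000 // page_kb
    return raw_kb, ratio, pages_per_cd

raw_kb, ratio, pages = g4_storage_estimate()
print(f"uncompressed: {raw_kb:.0f} KB, ratio about {ratio:.1f}:1, pages per CD: {pages}")
```

Under those assumptions an uncompressed 1-bit page is a little over 1 MB, so 100 KB per page corresponds to roughly 10:1 compression, and 7000 pages per CD follows directly.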
Sit, Ubuntu, sit. Good dog.
So get yourself an A-size scanner and just scan each page in two parts? Or if there aren't too many grayscales that you'd trash, just run it all through a photocopier to shrink to 8.5x11 and scan that?
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
So to clarify... You want to trade the hassle of:
1) lift a lid
2) stick a paper in a well-defined corner
3) press a button
for the hassle of:
1) align a camera on a tripod, including angle as well as position
2) align a paper with no guide
3) adjust the lighting so that you get an even tone
4) make sure you didn't accidentally move the camera, the tripod, or bump the desk
5) step on a foot pedal that you jury-rigged to take a picture
OR
5) Push a button on a camera that you can't afford to move even a hair.
6) Use image software to continue adjusting the photo so that the OCR will read it properly
7) Hope you did everything right the first time.
I think I'd pick door number 1.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Once you've OCR'd, is there any (preferably Free) software that can parse the text against a grammar and word list and hopefully fix some of these errors? Surely "if there's a digit in the middle of a word, it's probably really the letter with the similar shape," "if an unknown word is a character or two different from a known word, it's probably the known word," etc. aren't difficult heuristics, right?
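The two heuristics in the comment above can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not an existing tool: the function names, the tiny digit-to-letter table, and the 0.8 similarity cutoff are all made up for the example, the word list must be supplied by the caller, and no grammar checking is attempted. The "character or two different" rule is approximated with `difflib.get_close_matches`, which ranks by a similarity ratio rather than a strict edit distance.

```python
import difflib
import re

# Assumed look-alike table; a real one would be larger and context-aware
# (for example, "1" can stand for either "l" or "i").
DIGIT_LOOKALIKES = str.maketrans({"0": "o", "1": "l", "5": "s"})

def fix_word(word, vocabulary):
    """Return a corrected word, or the original if no heuristic applies."""
    lower = word.lower()
    # Known words and pure numbers (amounts, dates) are left alone.
    if lower in vocabulary or lower.isdigit():
        return word
    # Heuristic 1: a digit inside a word is probably a similar-looking letter.
    if re.search(r"[a-z]", lower) and re.search(r"\d", lower):
        lower = lower.translate(DIGIT_LOOKALIKES)
        if lower in vocabulary:
            return lower
    # Heuristic 2: an unknown word close to a known word is probably that word.
    matches = difflib.get_close_matches(lower, vocabulary, n=1, cutoff=0.8)
    # Note: corrected words come back lowercase; original casing is not restored.
    return matches[0] if matches else word

def fix_text(text, vocabulary):
    """Apply fix_word to every alphanumeric token, keeping punctuation as-is."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: fix_word(m.group(), vocabulary), text)

vocab = {"the", "scanner", "read", "bill"}
print(fix_text("the scanmer read bi11 300", vocab))  # the scanner read bill 300
```

With a realistic dictionary the cutoff would need tuning, since a loose cutoff silently "corrects" names and account numbers into ordinary words, which is worse than leaving the OCR error in place.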
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Well, I'm trying to max quality with modest equipment, but the basics are always the same. You still need some sort of support like a camera stand, lighting, and something like glass to hold down the documents. Lighting and reflections will always be a problem. I've done this for real quickie jobs using a camera on a tripod, and the results sucked. A flatbed scanner is still a much quicker, cheaper, and better way to do the job.
BTW, I have privately circulated a few of my PDFs among some online punk communities, and they went nuts over them. The old-school punks love them for the nostalgia, but to the new punks who weren't even born in the 1970s it might as well be Elizabethan English; they don't get it at all. Ha! Some of these magazines are still around, and even have major websites, but none of this old material is available through the official sites. It's a shame, since they presumably have high-quality reproductions in their archives, while I just have 30-year-old mouldering newsprint. They could probably never re-release this material: it all depends on context, half the fun is the advertisements next to the articles, and they could probably never get all the rights and sort out all the royalties to reproduce the trademarks in the ads. But I could probably get away with circulating my scans openly; I don't think a British court could touch me here in the US. And some of these magazines don't exist anymore and no company has any financial interest in the content, so there's nobody left to file a lawsuit.