Digital Cameras vs Scanners for OCR?
ttennebkram asks: "With 6 and 8 Megapixel cameras on the market, some now with Wifi built in, it might be more convenient to shoot pictures of your bills and papers with a camera than fussing with the scanner. By the numbers, it would seem feasible. 300dpi for an 8.5"x11" sheet of paper works out to about 8 megapixels; 300 dpi is usually what OCR vendors suggest. I imagine for high volume good results you'd want to maybe mount the camera on a tripod arm over your desk. Heck, I was thinking of a glass desk and maybe one camera below and one above, and maybe a foot pedal to trigger the cameras (and I suppose a flash and high F-stop would help as well). If I could quickly 'snap' all the junk paper I have and electronically file it, maybe OCR the images at night in batch while I'm asleep, and then maybe get rid of all that paper once and for all. Using a traditional cheap scanner just takes too long. So has anybody tried this? I realize that camera optics are different than scanner optics, so maybe it's not just a question of raw pixel counts. Any thoughts?"
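The pixel arithmetic in the question is easy to check. The sketch below is only a rough sanity check, not a statement about any particular camera: the helper names are made up for illustration, and the 3264x2448 sensor size is an assumption (a common 4:3 layout for cameras sold as "8 megapixel"). It shows the pixel count a letter page needs at 300 dpi, and the resolution actually left on the page once a 4:3 frame is fitted around a page that is not 4:3.

```python
def pixels_needed(width_in, height_in, dpi):
    """Pixel dimensions and megapixels for a page captured at the given dpi."""
    w = round(width_in * dpi)
    h = round(height_in * dpi)
    return w, h, w * h / 1_000_000


def effective_dpi(sensor_long_px, sensor_short_px, page_short_in, page_long_in):
    """Resolution on the page when the whole page just fits inside the frame.

    The page's long edge runs along the sensor's long edge; the axis with
    fewer pixels per inch is the limiting one.
    """
    return min(sensor_long_px / page_long_in, sensor_short_px / page_short_in)


w, h, mp = pixels_needed(8.5, 11, 300)
print(w, h, mp)  # 2550 3300 8.415

# Assumed sensor: 3264 x 2448 pixels (hypothetical "8 MP" 4:3 camera).
print(effective_dpi(3264, 2448, 8.5, 11))  # 288.0
```

So even with perfect framing, the assumed sensor lands a little under 300 dpi, because some of its pixels fall outside the page. Lens distortion and uneven lighting are separate problems that the numbers do not capture.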
...the aspect ratio and even lighting are your enemies. It's almost impossible to shoot a bill or a check stub dead on, at close range, without fish-eye'ing, and without getting in your own shadow. Sure, you might have a little white linen box that you use to take your eBay photos, but, seriously, this is a job for a scanner.
If it's just for keeping a record of bills and other junk, why even bother with OCR? As long as you can read the results, just snap away.
I was conned by an old man in a cloak. It turns out those *were* the droids I was looking for.
Get a scanner
If a job's not worth doing, it's not worth doing right.
What you want is a scanner with a sheet feeder and a GOOD one at that.
Absolutely.
I tried this, myself, a few years ago. I guarantee that, using a camera, you'll get through, maybe, 100 pages. I got a decent scanner (HP something or other) with a sheet feeder. It does about 12ppm and that turned out to be too slow. I got tired of it in a day or two.
I tried a bunch of different solutions, but I finally had to take it all to work. We had a Fujitsu M4097D and an enormous Ricoh Copier/Scanner/Fax machine. Both did 60ppm, both sides (120 images a minute). I actually made some headway with that setup, but I still didn't finish.
As far as OCR is concerned, don't bother. Even today, it's nowhere near accurate enough. In my experience, the best software out there gets an average of one error per page on a really good scan. Trust me: it will take a lot more of your time than you think to fix that. Assuming you're doing mostly black and white text, G4 compression will compress a 300dpi, 8.5x11 image down to about 100k. At that rate, you can store close to 7000 pages on one CD.
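The storage estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch only: the 700 MB disc capacity is an assumption (the post does not say which CD size is meant), the 100 KB page size is the figure quoted above, and the function name is made up for illustration.

```python
def pages_per_disc(disc_bytes, page_bytes):
    """Whole pages that fit on a disc, ignoring filesystem overhead."""
    return disc_bytes // page_bytes


CD_BYTES = 700 * 1000 * 1000   # assumption: a 700 MB CD-R
PAGE_BYTES = 100 * 1000        # "about 100k" per G4-compressed page, from the post

# Uncompressed size of a 1-bit (black and white) letter page at 300 dpi:
# 2550 x 3300 pixels, 8 pixels per byte.
RAW_BYTES = 2550 * 3300 // 8

print(pages_per_disc(CD_BYTES, PAGE_BYTES))  # 7000
print(RAW_BYTES)                             # 1051875
print(RAW_BYTES / PAGE_BYTES)                # implied compression ratio, about 10.5
```

With a 650 MB disc the same sum gives 6500 pages, so "close to 7000" holds for either common CD size.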
Sit, Ubuntu, sit. Good dog.
It really sounds like you haven't actually tried it. I've got a Powershot 450 (I imagine they've upped the versions past that by now). Anyway, it's about 5 MP, and I used it all the time in the library to take pictures of books so I could quote them accurately. It's much, much easier to look at the picture later and re-type the text than to copy long quotes word for word on the spot.
My suggestion is to not take things so far digital with your process. Paper doesn't take up so much space that you can't hold onto digital and physical copies:
1) Gather your big stack of "documents I probably don't need but if I do need, I will really really need"
2) Take digital images of them all, either with a scanner or a digital camera. Choose a method, try it a few times, check the output to make sure it works and is legible, and then go with it.
3) File the digital image on your computer, and file the paper in a system with the same setup and title. Even if you put all the papers in an office-style cardboard box and stick them all in the back of the closet, the attic, whatever, do it in the same system as best you can.
The point here is that you use the computer files if you ever need the information, and if they fail, you've still got the paper in the back of the closet in your basement. If you keep the papers organized, you can store them in such a way that after a few years, you'll know that an entire box is safe to shred and recycle.
The key to any system like this is being able to trust that you have what you need when you need it. You can't ensure digital or physical documents 100%, but with both, you can feel pretty safe (store a copy of the digitals at a relative's house, somewhere significantly off-site, or on an FTP server somewhere, of course). But having the files isn't the only part of a reliable system: you also have to be able to find them.
so get yourself an A-size scanner and just scan each page in two parts?
Or if there aren't too many grayscales that you'd trash,
just run it all through a photocopier to shrink to 8.5x11 and scan that?
Bill Stewart
New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
So to clarify... You want to trade the hassle of:
1) lift a lid
2) stick a paper in a well-defined corner
3) press a button
for the hassle of:
1) align a camera on a tripod, including angle as well as position
2) align a paper with no guide
3) adjust the lighting so that you get an even tone
4) make sure you didn't accidentally move the camera, the tripod, or bump the desk
5) step on a foot pedal that you jury-rigged to take a picture
OR
5) Push a button on a camera that you can't afford to move even a hair.
6) Use image software to continue adjusting the photo so that the OCR will read it properly
7) Hope you did everything right the first time.
I think I'd pick door number 1.
"If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
Once you've OCRd, is there any (preferably Free) software that can parse the text against a grammar and word list and hopefully fix some of these errors? Surely "if there's a digit in the middle of a word, it's probably really the letter with the similar shape," "if an unknown word is a character or two different from a known word, it's probably the known word," etc. aren't difficult heuristics, right?
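The two heuristics described above are simple enough to sketch with the Python standard library. This is an illustration under stated assumptions, not an existing tool: the lookalike table, the tiny word list, the 0.8 similarity cutoff, and the function names are all made up for the example, and `difflib.get_close_matches` stands in for a proper edit-distance check. Corrected tokens come back lowercase, and there is no grammar parsing at all.

```python
import difflib
import re

# Assumed digit -> letter lookalikes; a real table would need tuning
# (for instance, "1" can also be a misread "i").
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "5": "s", "8": "b"})

# Tiny stand-in word list; a real run would load a full dictionary.
WORDS = {"the", "bill", "is", "due", "on", "monday",
         "please", "remit", "payment", "total"}


def fix_token(token, words=WORDS):
    """Apply both heuristics to one token; return it unchanged if unsure."""
    lower = token.lower()
    # Leave known words and pure numbers (amounts, dates) alone.
    if lower in words or not any(c.isalpha() for c in lower):
        return token
    # Heuristic 1: a digit inside a word is probably a similar-looking letter.
    swapped = lower.translate(LOOKALIKES)
    if swapped in words:
        return swapped
    # Heuristic 2: an unknown word very close to a known word is probably it.
    close = difflib.get_close_matches(swapped, sorted(words), n=1, cutoff=0.8)
    return close[0] if close else token


def fix_text(text, words=WORDS):
    """Run fix_token over every alphanumeric run, keeping punctuation as-is."""
    return re.sub(r"[A-Za-z0-9]+", lambda m: fix_token(m.group(0), words), text)


print(fix_text("tota1 due m0nday, please remit paymemt of 300"))
```

The catch is that the second heuristic will happily "correct" names, account numbers with letters in them, and any real word missing from the list, which is why the cutoff is kept strict and unknown tokens are otherwise left untouched.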
"[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz
Well, I'm trying to max quality with modest equipment, but the basics are always the same. You still need some sort of support like a camera stand, lighting, and something like glass to hold down the documents. Lighting and reflections will always be a problem. I've done this for real quickie jobs using a camera on a tripod, and the results sucked. A flatbed scanner is still a much quicker, cheaper, and better way to do the job.
BTW, I have privately circulated a few of my PDFs amongst some online punk communities, and they went nuts over them. The old school punks love them for the nostalgia, but to the new punks who weren't even born in the 1970s it might as well be Elizabethan English, they don't get it at all. Ha! Some of these magazines are still around, and even have major online websites, but none of this old material is available through the official sites. It's a shame, since they presumably have high quality reproductions in their archives, I just have 30 year old mouldering newsprint. They could probably never re-release this material, it all depends on context, half the fun is the advertisements next to the articles, and they could probably never get all the rights and sort out all the royalties to reproduce all the trademarks in the ads. But I could probably get away with circulating my scans openly, I don't think a British court could touch me here in the US. And some of these magazines don't exist anymore and no company has any financial interests in the content, so there's nobody left to file a lawsuit.