Slashdot Mirror


Digital Cameras vs Scanners for OCR?

ttennebkram asks: "With 6 and 8 megapixel cameras on the market, some now with Wi-Fi built in, it might be more convenient to shoot pictures of your bills and papers with a camera than to fuss with a scanner. By the numbers, it would seem feasible. 300 dpi for an 8.5"x11" sheet of paper works out to about 8 megapixels; 300 dpi is usually what OCR vendors suggest. I imagine that for good results at high volume you'd want to mount the camera on a tripod arm over your desk. Heck, I was thinking of a glass desk with one camera below and one above, and maybe a foot pedal to trigger the cameras (and I suppose a flash and a high f-stop would help as well). If I could quickly 'snap' all the junk paper I have and file it electronically, I could OCR the images in batch at night while I'm asleep, and then maybe get rid of all that paper once and for all. Using a traditional cheap scanner just takes too long. So has anybody tried this? I realize that camera optics are different than scanner optics, so maybe it's not just a question of raw pixel counts. Any thoughts?"
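
The pixel arithmetic in the question can be checked with a short sketch. The function names are made up for illustration, and the 3264x2448 sensor is an assumed example of a typical 4:3 "8 megapixel" resolution, not a figure from the question.

```python
def pixels_needed(width_in, height_in, dpi):
    """Return (width_px, height_px, megapixels) for a page captured at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px / 1_000_000


def effective_dpi(sensor_long_px, sensor_short_px, page_long_in, page_short_in):
    """Resolution actually achieved when the whole page must fit in the frame.

    The page and the sensor have different aspect ratios, so the tighter of
    the two dimensions limits the result.
    """
    return min(sensor_long_px / page_long_in, sensor_short_px / page_short_in)


w, h, mp = pixels_needed(8.5, 11, 300)
print(f"{w} x {h} px = {mp:.1f} MP")          # 2550 x 3300 px = 8.4 MP

# Assumed 4:3 sensor, 3264 x 2448 pixels (just under 8 MP).
print(effective_dpi(3264, 2448, 11, 8.5))     # 288.0
```

Because a letter page is not 4:3, some of the sensor is spent on margin, so an 8 megapixel frame of this shape lands a little under 300 dpi even before lens and lighting effects are considered.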

13 of 95 comments

  1. Aspect Ratio and Even Lighting by mythosaz · · Score: 5, Insightful

    ...the aspect ratio and even lighting are your enemies. It's almost impossible to shoot a bill or a check stub dead on, at close range, without fish-eye'ing, and without getting in your own shadow. Sure, you might have a little white linen box that you use to take your eBay photos, but, seriously, this is a job for a scanner.

    1. Re:Aspect Ratio and Even Lighting by TheWanderingHermit · · Score: 4, Insightful

      Considering that, and I'm speaking not just as a former student, but as a former teacher, there is a delicate balance in all professors between ego and laziness, most of what is taught in college is in the text books. As for handouts, I found it pretty easy to file them as well. As for notes -- you mean someone who is scribbling notes in a hurry actually takes them in good enough handwriting that OCR would be able to read them without a lot of prompting? I should have mentioned that a lot of similar material like that is included in my 4 drawers. You have to think to file them in folders, and the same thought is needed to figure out which directory to put them in, but a lot more is needed to photograph papers so they are legible. If it's that important, a sheet-feed scanner would be more practical, but there's the difference between theory and practice: it's not as easy to batch convert as it sounds.

      I've also found that there is a lot more of value to learn from practical experience than from pedants.

      Unless one is a geek gone wild.

  2. Why bother with OCR? by Badfysh · · Score: 2, Insightful

    If it's just for keeping a record of bills and other junk, why even bother with OCR? As long as you can read the results, just snap away.

    --

    I was conned by an old man in a cloak. It turns out those *were* the droids I was looking for.

  3. scanners are FOR documents by Dun+Malg · · Score: 4, Insightful
    Digital Cameras vs Scanners for OCR?
    What, are you kidding? You can use a joystick in place of a mouse, but why? Cameras are for capturing a 2D image of a 3D scene. Like you noted, the optics are designed specifically for it. Scanners are for capturing a digital version of a 2D paper image. Musing over whether today's new, heavier wrenches might be stout enough to drive nails is silly, as what you really need is a hammer.

    Get a scanner
    --
    If a job's not worth doing, it's not worth doing right.
    1. Re:scanners are FOR documents by hords · · Score: 2, Insightful

      Besides that, scanners are usually cheaper than digital cameras anyway and can do much more than 300 DPI if you need it for another task. The scanner gets the lighting even and doesn't have to be focused. Maybe the reasoning is that it's faster to take a picture than to scan a document.

    2. Re:scanners are FOR documents by flewp · · Score: 2, Insightful

      Unless he has a proper lighting setup, it may actually take longer to clean up the photos of the documents than to simply scan them. Also, if the bills/receipts/etc. are of different sizes, he would have to zoom to fit each one in frame and crop the images, among other considerations.

      --
      WWJD.... for a Klondike bar?
  4. Re:Sheetfeeder by Dadoo · · Score: 3, Insightful

    What you want is a scanner with a sheet feeder and a GOOD one at that.

    Absolutely.

    I tried this, myself, a few years ago. I guarantee that, using a camera, you'll get through, maybe, 100 pages. I got a decent scanner (HP something or other) with a sheet feeder. It does about 12ppm and that turned out to be too slow. I got tired of it in a day or two.

    I tried a bunch of different solutions, but I finally had to take it all to work. We had a Fujitsu M4097D and an enormous Ricoh Copier/Scanner/Fax machine. Both did 60ppm, both sides (120 images a minute). I actually made some headway with that setup, but I still didn't finish.

    As far as OCR is concerned, don't bother. Even today, it's nowhere near accurate enough. In my experience, the best software out there gets an average of one error per page on a really good scan. Trust me: it will take a lot more of your time than you think to fix that. Assuming you're doing mostly black and white text, G4 compression will compress a 300dpi, 8.5x11 image down to about 100k. At that rate, you can store close to 7000 pages on one CD.
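
    The storage estimate above works out as follows. This is only a back-of-the-envelope sketch: the 700 MB disc capacity is an assumption (the comment doesn't say which CD size), and the 100k page size is the rough figure quoted above.

```python
def pages_per_disc(disc_bytes, page_bytes):
    """Whole pages that fit on a disc, ignoring filesystem overhead."""
    return disc_bytes // page_bytes


def raw_bilevel_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel (black and white) page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8


CD_BYTES = 700 * 1000 * 1000    # assumption: nominal 700 MB CD-R
PAGE_BYTES = 100 * 1000         # ~100k per G4-compressed page, per the comment

print(pages_per_disc(CD_BYTES, PAGE_BYTES))             # 7000

raw = raw_bilevel_bytes(8.5, 11, 300)
print(raw)                                              # 1051875
print(f"compression ratio ~{raw / PAGE_BYTES:.1f}:1")   # ~10.5:1
```

    With a 650 MB disc instead, the same arithmetic gives 6500 pages, which is still in line with "close to 7000".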

    --
    Sit, Ubuntu, sit. Good dog.
  5. Must have reliable files -and- reliable system by Anonymous Coward · · Score: 1, Insightful

    It really sounds like you haven't actually tried it. I've got a Powershot 450; I think they've upped the versions since then, so they're past that now. Anyway, it's about 5 MP, and I used it all the time in the library to take pictures of books so I could quote them accurately. That's much, much easier than copying long quotes word for word: you just look at the picture later and retype the text.

    My suggestion is to not take things so far digital with your process. Paper doesn't take up so much space that you can't hold onto digital and physical copies:

    1) Gather your big stack of "documents I probably don't need but if I do need, I will really really need"
    2) Take digital images of them all, either with a scanner or a digital camera. Choose a method, try it a few times, check the output to make sure it works and is legible, and then go with it.
    3) File the digital image on your computer, and file the paper in a system with the same setup and title. Even if you put all the papers in an office-style cardboard box and stick them all in the back of the closet, the attic, whatever, do it in the same system as best you can.

    The point here is that you use the computer files if you ever need the information, and if they fail, you've still got the paper in the back of the closet in your basement. If you keep the papers organized, you can store them in such a way that after a few years, you'll know that an entire box is safe to shred and recycle.

    The key to any system like this is being able to trust that you have what you need when you need it. You can't ensure digital or physical documents 100%, but with both, you can feel pretty safe (store a copy of the digital files at a relative's house or somewhere else significantly off-site, or on an FTP server somewhere, of course). But having the files isn't the only part of a reliable system: you also have to be able to find them.

  6. $100 8.5x11 scanner, and scan half-pages? by billstewart · · Score: 2, Insightful
    It sounds like you've got to handle each page by hand anyway -

    so get yourself an A-size scanner and just scan each page in two parts?


    Or if there aren't too many grayscales that you'd trash,
    just run it all through a photocopier to shrink to 8.5x11 and scan that?

    --

    Bill Stewart
    New Fast-Compression-only CPR http://preview.tinyurl.com/dy575ks
  7. To Clarify... by Aladrin · · Score: 4, Insightful

    So to clarify... You want to trade the hassle of:

    1) lift a lid
    2) stick a paper in a well-defined corner
    3) press a button

      for the hassle of:

    1) align a camera on a tripod, including angle as well as position
    2) align a paper with no guide
    3) adjust the lighting so that you get an even tone
    4) make sure you didn't accidentally move the camera, the tripod, or bump the desk
    5) step on a foot pedal that you jury-rigged to make it take a picture
    OR
    5) Push a button on a camera that you can't afford to move even a hair.
    6) Use image software to continue adjusting the photo so that the OCR will read it properly
    7) Hope you did everything right the first time.

    I think I'd pick door number 1.

    --
    "If you make people think they're thinking, they'll love you; But if you really make them think, they'll hate you." - DM
    1. Re:To Clarify... by vanyel · · Score: 2, Insightful

      I was thinking about this recently, and what I want is:

      1. stick the paper in the slot, it feeds, scans and files in "New Docs"
      2. drag the thumbnail to a register entry in gnucash; it optionally (sometime in the distant future) OCRs it and tries to find the total and the vendor, as well as matching the last 4 digits to one of your cards to verify it's going into the right account, then gives you a chance to correct its mistakes. The scanned image is included in the financial db, attached to the register entry.
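
      The "find the total and match the last 4" step might look something like the sketch below. It assumes the OCR text is already available as a string; the regular expressions, the `parse_receipt` name, and the sample receipt are all hypothetical, and nothing here is a gnucash feature or API.

```python
import re
from decimal import Decimal

# Hypothetical patterns; real receipts vary far more than this.
# \btotal\b avoids matching the "total" inside "Subtotal".
TOTAL_RE = re.compile(r"\btotal\b[^\d\n]*(\d+\.\d{2})", re.IGNORECASE)
# Masked card number such as "************1234" or "XXXX1234".
LAST4_RE = re.compile(r"(?:\*{2,}|x{2,})\s*(\d{4})\b", re.IGNORECASE)


def parse_receipt(ocr_text, known_cards):
    """Pull a total and a card's last four digits out of OCR'd receipt text.

    Anything that can't be found comes back as None, so the user still gets
    the chance to correct the result by hand.
    """
    total = TOTAL_RE.search(ocr_text)
    last4 = LAST4_RE.search(ocr_text)
    card = last4.group(1) if last4 else None
    return {
        "total": Decimal(total.group(1)) if total else None,
        "card": card,
        "card_known": card in known_cards,
    }


sample = (
    "ACME HARDWARE\n"
    "Subtotal 18.50\n"
    "Tax 1.48\n"
    "TOTAL $19.98\n"
    "VISA ************1234\n"
)
print(parse_receipt(sample, known_cards={"1234", "9876"}))
```

      Given the OCR error rates mentioned elsewhere in this thread, the manual correction step in the wish list is doing a lot of the work; a sketch like this only pre-fills the form.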

      Unfortunately, few of the sheet feed scanners seem to get very good marks in reviews...

  8. Re:Bulk indexing by mrchaotica · · Score: 2, Insightful

    Oh it is. You will need to add "variants" to your searches. E.g. if you are looking for Microsoft you would search for "M[i1]cr[o0]s[o0]ft". Some search engines can do this for you, others can say "max of two errors".

    Once you've OCRd, is there any (preferably Free) software that can parse the text against a grammar and word list and hopefully fix some of these errors? Surely "if there's a digit in the middle of a word, it's probably really the letter with the similar shape," "if an unknown word is a character or two different from a known word, it's probably the known word," etc. aren't difficult heuristics, right?
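
    Both ideas, the search "variants" and the two correction heuristics, can be sketched with the Python standard library. This is an illustration under assumptions rather than a pointer to an existing tool: the look-alike table is deliberately tiny, the word list is a stand-in for a real dictionary, and the function names are invented.

```python
import difflib
import re

# Letters that OCR tends to confuse with digits (illustrative, not complete).
LOOKALIKE = {"i": "1", "l": "1", "o": "0"}
# Reverse direction for repairs; "1" is ambiguous (i or l), so this is a guess.
DIGIT_FIX = str.maketrans({"0": "o", "1": "i"})


def variant_pattern(word):
    """Build a regex that also matches common OCR misreadings of `word`."""
    parts = []
    for ch in word:
        digit = LOOKALIKE.get(ch.lower())
        parts.append(f"[{ch}{digit}]" if digit else re.escape(ch))
    return "".join(parts)


def fix_token(token, vocabulary, cutoff=0.75):
    """Try to repair one OCR'd token against a list of known words.

    Returns the lowercased known word if one is found, else the token as-is.
    """
    lowered = token.lower()
    if lowered in vocabulary:
        return lowered
    # Heuristic 1: a digit inside a word is probably a look-alike letter.
    if any(c.isdigit() for c in lowered) and any(c.isalpha() for c in lowered):
        lowered = lowered.translate(DIGIT_FIX)
        if lowered in vocabulary:
            return lowered
    # Heuristic 2: an unknown word close to a known word is probably that word.
    matches = difflib.get_close_matches(lowered, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else token


VOCAB = ["invoice", "microsoft", "scanner"]   # stand-in word list

print(variant_pattern("Microsoft"))           # M[i1]cr[o0]s[o0]ft
print(fix_token("M1cr0soft", VOCAB))          # microsoft
print(fix_token("scaner", VOCAB))             # scanner
print(fix_token("12345", VOCAB))              # 12345 (left alone)
```

    So the heuristics themselves aren't difficult; the hard part is the word list and the false corrections. Account numbers, product codes, and names that aren't in the dictionary all look like errors to a checker like this.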

    --

    "[Regarding the 'cloud,'] ownership was what made America different than Russia." -- Woz

  9. Re:you are making it too hard. by sakusha · · Score: 2, Insightful

    Well, I'm trying to max quality with modest equipment, but the basics are always the same. You still need some sort of support like a camera stand, lighting, and something like glass to hold down the documents. Lighting and reflections will always be a problem. I've done this for real quickie jobs using a camera on a tripod, and the results sucked. A flatbed scanner is still a much quicker, cheaper, and better way to do the job.

    BTW, I have privately circulated a few of my PDFs amongst some online punk communities, and they went nuts over them. The old school punks love them for the nostalgia, but to the new punks who weren't even born in the 1970s it might as well be Elizabethan English, they don't get it at all. Ha! Some of these magazines are still around, and even have major online websites, but none of this old material is available through the official sites. It's a shame, since they presumably have high quality reproductions in their archives, I just have 30 year old mouldering newsprint. They could probably never re-release this material, it all depends on context, half the fun is the advertisements next to the articles, and they could probably never get all the rights and sort out all the royalties to reproduce all the trademarks in the ads. But I could probably get away with circulating my scans openly, I don't think a British court could touch me here in the US. And some of these magazines don't exist anymore and no company has any financial interests in the content, so there's nobody left to file a lawsuit.