Slashdot Mirror


Digital Cameras vs Scanners for OCR?

ttennebkram asks: "With 6 and 8 Megapixel cameras on the market, some now with Wifi built in, it might be more convenient to shoot pictures of your bills and papers with a camera than fussing with the scanner. By the numbers, it would seem feasible. 300dpi for an 8.5"x11" sheet of paper works out to about 8 megapixels; 300 dpi is usually what OCR vendors suggest. I imagine for high volume good results you'd want to maybe mount the camera on a tripod arm over your desk. Heck, I was thinking of a glass desk and maybe one camera below and one above, and maybe a foot pedal to trigger the cameras (and I suppose a flash and high F-stop would help as well). If I could quickly 'snap' all the junk paper I have and electronically file it, maybe OCR the images at night in batch while I'm asleep, and then maybe get rid of all that paper once and for all. Using a traditional cheap scanner just takes too long. So has anybody tried this? I realize that camera optics are different than scanner optics, so maybe it's not just a question of raw pixel counts. Any thoughts?"
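The "by the numbers" estimate in the question can be checked with a few lines of arithmetic. This is only a sketch: the function name is made up for illustration, and it assumes the page fills the frame exactly, which a real 4:3 or 3:2 sensor will not quite allow for a letter-size sheet.

```python
def required_megapixels(width_in, height_in, dpi):
    """Return (width_px, height_px, megapixels) needed to cover a page at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px, height_px, width_px * height_px / 1_000_000


# US letter at the 300 dpi that OCR vendors usually suggest.
w, h, mp = required_megapixels(8.5, 11, 300)
print(f"{w} x {h} px = {mp:.1f} MP")
```

That comes to 2550 x 3300 pixels, a little over 8.4 megapixels, so the poster's figure holds as a lower bound before lens quality and wasted frame area are taken into account.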

9 of 95 comments

  1. Sheetfeeder by cerberusss · · Score: 2, Informative

    What you want is a scanner with a sheet feeder and a GOOD one at that. They're not that expensive anymore, since there are lots of cheap machines which have a feeder anyway due to them having a fax function. This alone will go faster than manually swapping the papers and shooting with a camera.

    --
    8 of 13 people found this answer helpful. Did you?
  2. Re:Aspect Ratio and Even Lighting by TheWanderingHermit · · Score: 5, Informative

    That's about it. I used to transfer photos to video professionally. We had a nice rig with lighting and a video camera mounted on a stand. We had to do a lot of adjusting of focus because of different types of photos and other issues. More often than not it was not just put down and click, then move on to the next one. If you're dealing with letters, and you're not scanning, you'll have problems with some fonts and other oddities that mean many shots won't turn out as perfect as you'd think.

    I have my own business. I keep all my bills, receipts of deductible expenses, home records, and so on. I keep personal records 7 years except in special cases. I just take the bill when I get it (and most bills are e-mail now!) and put it in the envelope for the biller for that year. At the end of the year I spend less than 30 minutes writing up labels for the next year and when I get time, I burn the stuff that is past 7 years old. For "all those bills" I've never needed more than 4 filing drawers, which can be stacked as one cabinet that doesn't take up much space, or I use the two cabinets (2 drawers each) as legs for one of my desks.

    I thought about keeping things electronically, but then I realized I'd have to take time to scan them and file them and that would take a lot more time, overall, than just dropping them in folders. If you want, you can spend all that time scanning. I prefer not to, but then again, I have a life and would rather be cycling or rock climbing than scanning bills.

    This, to me, sounds like a geek gone wild, overthinking the solution and trying to come up with a hi-tech answer to a low-tech problem that really doesn't need an answer if one uses a little common sense and simple organization.

  3. Not as easy as you think.. by sakusha · · Score: 5, Informative

    I have some experience doing what you're trying to do. I've even done this type of work in professional labs with serious pro equipment (it was my job). It's a huge pain in the butt.

    I'm currently digitizing my collection of old tabloid punk magazines from the 1970s. I had to use a digital camera because flatbed scanners that do 11x17 or larger are extremely expensive, they're like $3000 or more. So I did some experiments with my consumer-grade 5 megapixel digital camera. The results were adequate, barely (and I have an art degree in Photography, this stuff is easy for me, YMMV). I've currently suspended my project until I can afford a higher rez digital camera, mostly because 5Mp is barely enough to capture the little 6 point type that is used in large sections of the magazines. But let me tell you more generally what I've learned.
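To put a rough number on why 5 megapixels struggles with 6 point type on a tabloid page, here is a small sketch. The 2592 x 1944 sensor size is an assumption (a common 4:3 layout for 5 MP cameras, not something stated in the comment), the function names are illustrative, and the page is assumed to fill the frame as tightly as its shape allows.

```python
def effective_dpi(sensor_px, page_in):
    """Resolution achieved when the page fills as much of the frame as it can.

    Both arguments are (long_side, short_side) tuples. The axis with the
    fewest pixels per inch limits the usable resolution.
    """
    return min(sensor_px[0] / page_in[0], sensor_px[1] / page_in[1])


def type_height_px(point_size, dpi):
    """Nominal height of a font's em box in pixels (1 point = 1/72 inch)."""
    return point_size / 72 * dpi


SENSOR_5MP = (2592, 1944)  # assumed sensor dimensions, not from the comment
TABLOID = (17, 11)         # page size in inches

dpi = effective_dpi(SENSOR_5MP, TABLOID)
print(f"{dpi:.0f} dpi, 6 pt em box is about {type_height_px(6, dpi):.1f} px tall")
```

Under those assumptions the page is captured at roughly 152 dpi, about half the 300 dpi usually suggested for OCR, and a 6 point em box is only about a dozen pixels tall.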

    First off, you'll need a copy stand. This is a fairly standard photo accessory, but a good copy stand is fairly expensive. You need something that is easily adjustable, so you can raise and lower the camera to get the document to fill the frame, without using too much zooming. The copy stand keeps the camera parallel to the target at all distances, which avoids keystone distortion. It is important to have quick adjustability in height, rather than zooming. You'd be much better off using a "prime lens" rather than a zoom, as zooms tend to have more barrel and pincushion distortion.

    Secondly, you need lights. If you only want to copy written documents (or B&W magazines like me) you can use cheap spotlights. If you want to do color, you need much better lighting, something with a fixed color temperature, or a flash system. Spotlights are really hot, and when I work in my small office, it gets intolerably hot when I spend about an hour photographing. For better, more repeatable results, you'd be better off getting a flash system. BUT...

    Here is the sticking point. You need something to keep the documents flat. That means placing them under a sheet of glass. So you are going to get reflections from the lights, and flash is high intensity lighting which makes it even more difficult to control reflections. The usual method is to put polarizing filters over the lights and the lens, to cancel out the reflections. This is a rather complex method, and a LOW END professional copy stand with polarized lighting will set you back about $2500.

    OK, so what I did is I adapted my old disused photo enlarger. It was a huge monster for 4x5 negatives, I took off the enlarger head, and used a Bogen photo clamp with a ball-head joint attached to the motorized arm that goes up and down. It does a fairly good job as an improvised copy stand, but it is pretty cramped, the baseboard is only designed to make max 20x24 prints. Also it is a HUGE pain in the ass getting the camera leveled with the baseboard, I use a bubble level. Then I attached a cheap set of tungsten photofloods to the wings of the enlarger, so the light hits the baseboard at a 45 degree angle, to reduce glare. Note that it is best to point each light at the far side of the document, so the light paths cross each other. This gives the light a little distance to fan out and eliminate hot spots. I don't put my documents under glass, they're newspaper pages, so I flatten them for several weeks (!!!) under weights, then if there's a little curl, I use weights (like heavy metal rulers) at the edges, or hold the edges down with post-it notes. That eliminates the need for a glass plate to hold them down, and I don't have to deal with reflections. However, it takes a LOT of time and effort to get the documents positioned and flattened correctly, it is not a quick process.

    I use a Canon camera, so I use the Canon Camera Remote to my laptop to preview and take the shot. Even with the lights and some fill flash, I can end up with exposures of 1 or 2 seconds so that I can use a small aperture (high f-number). This shouldn't be necessary for a flat object, which requires no depth of field, but I find that the lens is sharper stopped down. It takes quite a bit of fiddling to get the optimal

  4. Re:Cameras aren't all that easy to use by otherniceman · · Score: 3, Informative

    Google used dedicated book scanners called Planetary or Orbital Scanners, see http://www.dlsg.net/bookeye.htm and http://en.wikipedia.org/wiki/Planetary_scanner. They are a lot better than a digital camera on a tripod.

  5. The solution by C4st13v4n14 · · Score: 1, Informative

    I have a four-year-old Canon Powershot G2 that has been indispensable in the digitising of my documents. Given adequate lighting, all you need is to line up the document in the view-finder and take the photo. Autofocus is usually adequate, but if you just can't seem to get a clear shot (certain things will prove problematic), manual focus is your next best feature to utilise. If you're doing James Bond-type work and are in a hurry, then you'll often end up with blurry images that won't be useful in OCR. Given that I already own a digital camera, I will never invest money in a scanner. If anything, I'll buy a better camera when I can find one with all the features I want. Hope this helps in some strange way :)

  6. Re:Bulk indexing by mlk · · Score: 2, Informative

    I work in the Media Monitoring industry. What we do is scan in newspapers (we have some 4000 publications), OCR them, throw 'em in a search engine and do some bloody complicated searches on that dataset before sending out hits.

    roughly OCR a document and store keywords or snippets of text in metadata or an index?

    Lots.
    You could be OK with GOCR and Apache Lucene if you do not require zoning (working out blocks of text and columns).

    OCR is not good enough

    Oh it is. You will need to add "variants" to your searches. E.g. if you are looking for Microsoft you would search for "M[i1]cr[o0]s[o0]ft". Some search engines can do this for you, others can say "max of two errors".
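Both ideas in that reply, character "variants" and a cap on the number of errors, can be sketched in a few lines. This is only an illustration: the confusion table is hypothetical (a real one would be tuned to the OCR engine), and the error check is a plain Levenshtein edit distance rather than whatever a particular search engine implements.

```python
import re

# Hypothetical confusion table: characters that OCR output commonly mixes up.
CONFUSIONS = {"i": "i1l", "l": "l1i", "o": "o0", "s": "s5", "b": "b8"}


def variant_pattern(word):
    """Build a case-insensitive regex that tolerates the confusions above."""
    parts = []
    for ch in word:
        alternatives = CONFUSIONS.get(ch.lower())
        if alternatives:
            parts.append("[" + alternatives + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)


def within_errors(a, b, max_errors=2):
    """True if the Levenshtein distance between a and b is at most max_errors."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1] <= max_errors


pattern = variant_pattern("Microsoft")
print(bool(pattern.search("shares of M1cr0s0ft rose")))
print(within_errors("Microsoft", "Mlcrosofl"))
```

The regex approach only catches the substitutions you list in advance, while the edit-distance check also tolerates dropped or extra characters at the cost of more false hits on short words.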

    What formats allow an easy mix of image and text data (without formatting)?

    XML (hehe). PDF can. Most systems would have the image as a file somewhere on your file store, and the text in a database.
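The "image on the file store, text in a database" layout might look something like the sketch below, using Python's built-in sqlite3 module. The table name, column names and file paths are all made up for illustration, and the OCR text is assumed to have been produced already by a separate step.

```python
import sqlite3


def open_index(path=":memory:"):
    """Open (or create) a tiny index mapping image files to their OCR text."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "image_path TEXT PRIMARY KEY, ocr_text TEXT NOT NULL)"
    )
    return db


def add_page(db, image_path, ocr_text):
    """Record where the image lives and what the OCR pass read from it."""
    db.execute("INSERT OR REPLACE INTO pages VALUES (?, ?)", (image_path, ocr_text))
    db.commit()


def find_pages(db, term):
    """Return image paths whose OCR text contains `term` (substring match)."""
    rows = db.execute(
        "SELECT image_path FROM pages WHERE ocr_text LIKE ? ORDER BY image_path",
        (f"%{term}%",),
    )
    return [row[0] for row in rows]


db = open_index()
add_page(db, "scans/bill-001.png", "Electric bill, total due 42.17")  # hypothetical paths
add_page(db, "scans/bill-002.png", "Water and trash, total due 18.00")
print(find_pages(db, "electric"))
```

SQLite's LIKE is case-insensitive for ASCII by default, and `%` or `_` inside the search term act as wildcards, so a real setup would want escaping or a proper full-text index.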
    --
    Wow, I should not post when knackered.
  7. No scanning required by coinreturn · · Score: 2, Informative

    I find that most bill providers have an option to deliver your bills electronically, either keeping them in their "safe" (i.e., website) or sending them by e-mail. This is true for credit cards, banks, major utilities; the main exception being the city-run water and trash company.

  8. Re:Desktop duplex scanners by Anonymous Coward · · Score: 1, Informative

    I recently bought a used HP Network Scanjet 5 for $50 on ebay and upgraded it, following instructions at http://www.madole.net/scanjet/. In addition to installing BSD, I upgraded to a bigger hard drive, so now it's both my scanner and my document repository. The scanner does a great job of 300dpi black and white scans, and I use NFS to mount the scanner's drive and organize the scanned documents, so I can easily access my files from any computer. It doesn't have an automatic duplex feed, so for two-sided originals you have to pick up the stack and turn it over when prompted, but you only have to do that once per scan job. I'm really impressed with how easy it was to upgrade the scanner and how well it works, now.

  9. Real camera solution by nuggz · · Score: 2, Informative

    1 Don't use a tripod, use a document photo stand.
    Think of an overhead projector with the camera where the mirror is for vertical adjustment.
    2 Have a guide for the paper, not that hard.
    3 Lighting is an important one, but as long as it's even the type of light doesn't really matter if you set your white balance correctly.
    4 If it is a rigid setup, it doesn't really matter.
    5 Use the camera control software on the computer; you don't really need to handle the camera.
    6 Save the file and run the OCR software.

    I use a similar setup to take photos of test parts at work, works nicely.