Simple Document Imaging for Unix?
andylievertz asks: "I have developed a logical system of directories for storing my digital documents (i.e. *.doc, *.mp3, *.gif, etc.), and can usually find any obscure document with relative speed. These 'must-keep' hardcopies include everything from bills and shipping invoices to brochures and Chinese-food menus. I've tried applying my electronic filing techniques to an actual, real-world filing cabinet, complete with folders and labels, but such a system requires a great deal of effort to maintain relative to the electronic system, especially considering the frequent influx of new hardcopy material, and it doesn't address the greater issue of reducing the sheer paper bulk, organized or not. What solutions have you, the Slashdot Reader, employed to solve this situation for yourself? Are there viable Unix-based Document Imaging packages, similar in function to the Microsoft Document Imaging utility packaged with Office? Do you use a Unix-based Document Imaging solution personally or professionally? If so, what package, and why does it work for you?"
"So, step one is to find ways to reduce the influx of hardcopy (i.e. electronic billing, etc.), but for me, the second step is to find and utilize a [Unix-based!] system that will allow me to scan and file hardcopies electronically so they may be indexed, searched, re-organized, shared, and retrieved as easily as their electronic counterparts. Naturally, any such system would need tolerances for multi-paged documents, and would need to store its output in a non-proprietary file format."
If you want to track money, having the paper is not nearly as useful as entering the data into a financial program. Try GnuCash or something of that ilk.
Delivery menus are a different story. I keep them under a magnet on the fridge. If you get a nice rare earth magnet that can hold a half inch stack of menus, that problem is easy to solve (get at least the half inch cubes).
Any solution that requires every document to be scanned is not going to work for you if you can't even file the documents. What are the chances you are going to get around to that stack of stuff to be scanned?
Invest in a magnet, a big box, and a good paper shredder.
I've wondered about this myself.
Seems to me that the solution would involve a scanner, a database, and a mechanical system for retrieving the documents.
1) Scan the document.
2) Slide the document into a doc protector with an ID tag (UPC codes might work, but really it could just be sequential).
3) Create a DB entry for the ID, a BLOB of the scanned image (or perhaps a foreign key to keep the images out of the query, but realistically most DBs optimize this for you), and most importantly, metadata about the document.
The more I think about it, the more I realize a number system of 1,2,3,4...would work fine. The automated retrieval, which would be nice, is not really vital. The match between the doc ID and the scanned version is enough, so long as the document always goes back into the same folder.
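The three steps above can be sketched in a few lines. This is only a minimal illustration, assuming SQLite as the database and a table layout I made up (the `documents` table, its columns, and the helper names `open_archive`, `file_document`, and `find_documents` are all hypothetical, not something from an existing package). The auto-incremented row ID doubles as the 1,2,3,4... number you would write on the folder.

```python
import sqlite3


def open_archive(path=":memory:"):
    """Open (or create) the archive database. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS documents (
            id         INTEGER PRIMARY KEY AUTOINCREMENT,  -- number on the folder
            title      TEXT NOT NULL,
            keywords   TEXT NOT NULL DEFAULT '',           -- free-form metadata
            scanned_on TEXT NOT NULL DEFAULT (date('now')),
            image      BLOB NOT NULL                       -- the scanned page(s)
        )
        """
    )
    return conn


def file_document(conn, title, keywords, image_bytes):
    """Store one scan and return the ID to write on its physical folder."""
    cur = conn.execute(
        "INSERT INTO documents (title, keywords, image) VALUES (?, ?, ?)",
        (title, keywords, image_bytes),
    )
    conn.commit()
    return cur.lastrowid


def find_documents(conn, term):
    """Return (id, title) pairs whose title or keywords contain the term."""
    pattern = f"%{term}%"
    rows = conn.execute(
        "SELECT id, title FROM documents "
        "WHERE title LIKE ? OR keywords LIKE ? ORDER BY id",
        (pattern, pattern),
    )
    return rows.fetchall()
```

Filing a scan hands back the folder number, and a search hands back the numbers of the folders to pull. The image column is left out of the search query on purpose, which is the same idea as keeping the BLOB behind a foreign key.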
Insertion O(1)
Search O(log(n))
Deletion O(log(n))
Note that garbage collection (compaction) is not really an option, which means that reclaiming discarded IDs (reusing folders) would crank insertion back up to O(log(n)).
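One way to read the O(log(n)) remark, sketched below under my own assumptions: new folders get the next sequential number in constant time, but if discarded numbers are kept in a min-heap so that the lowest free folder is reused first, handing out a reused ID costs a heap pop. The `FolderIds` class and its method names are hypothetical.

```python
import heapq


class FolderIds:
    """Hands out folder numbers 1, 2, 3, ... and optionally reuses freed ones."""

    def __init__(self):
        self._next = 1   # next never-used folder number
        self._free = []  # min-heap of discarded folder numbers

    def allocate(self):
        if self._free:
            # Reusing the lowest discarded folder: a heap pop, O(log n).
            return heapq.heappop(self._free)
        # Fresh folder at the end of the cabinet: O(1).
        new_id = self._next
        self._next += 1
        return new_id

    def release(self, folder_id):
        """Mark a folder as empty after its document is shredded."""
        heapq.heappush(self._free, folder_id)
```

If you do not care which empty folder gets reused, a plain stack of freed IDs would keep insertion at O(1). The heap is only there to fill the lowest-numbered gap first.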
The question is whether the scanning process would be worth the time.
I have found that a digital camera does a very good job of quickly capturing usable images of paper documents. A 5 megapixel camera provides over 200 ppi for 8.5 x 11 hardcopy and grabs the image faster than most flatbed scanners do. Given the scarcity of drivers for Unix, the only trick is finding a memory card reader that is compatible with your system.
;)
A good digital camera may seem like overkill for scanning in bills, but then the camera also doubles as a camera too.
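The "over 200 ppi" figure is easy to check. The sketch below assumes a typical 5 megapixel sensor of 2592 x 1944 pixels (my assumption, the actual camera is not specified) and assumes the page exactly fills the frame, which is optimistic; `effective_ppi` is a hypothetical helper name.

```python
def effective_ppi(width_px, height_px, page_w_in=8.5, page_h_in=11.0):
    """Pixels per inch when a page fills the frame, limited by the tighter axis."""
    long_px, short_px = max(width_px, height_px), min(width_px, height_px)
    long_in, short_in = max(page_w_in, page_h_in), min(page_w_in, page_h_in)
    # The long side of the sensor is lined up with the long side of the page.
    return min(long_px / long_in, short_px / short_in)


# Assumed 5 MP sensor resolution; works out to a bit under 229 ppi.
print(round(effective_ppi(2592, 1944), 1))
```

In practice you leave a margin around the page, so the real number will be somewhat lower, but it stays in the neighborhood of 200 ppi.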
Two wrongs don't make a right, but three lefts do.
QuiteInsane.
It's insanely good. I use it to scan in all my important documents. It has useful multipage modes for... well, multipage documents.
Try it. It's actually been considerably revamped since I installed it, so I will have to try a more recent version.
Oh, it comes in a nice Debian package via apt-get.
Yours Sincerely, Michael.
and htDIG to solve all my document storage problems.
The Digital Sender is a wonderful toy. Stick a stack of paper in the bin. Enter an email address. Press the big green button. And a PDF shows up in my mailbox in a few minutes. It even does double-sided. Very simple device and it does most of what I need.
It doesn't do OCR. The Digital Sender outputs a bit-mapped PDF that looks very good. I usually use the full version of Adobe Acrobat to do optical character recognition and store the results in the background. That way I still see the good scan on the screen and when I print. But I can copy and search the text as I would normally.
I use htDig (http://www.htdig.org/) to index my archive. I store content in file folders that make sense (2002 taxes, pitch perception papers, etc). But I still find htdig useful. It indexes both HTML (my lab notebook) and PDF files. All is good.
PDF is a well-documented file format. I wish there were a good free OCR package, but sometimes you have to pay for good performance. htDig and PDF work great on Windows and Linux.
In three years I have accumulated just over 1Gbyte of content. That represents all my lab notes (in HTML format) and all the papers I've read (in PDF). It's wonderful having my entire paper life with me on my laptop. (I also back it up to three different machines.)