Ask Slashdot: Open Source For Bill and Document Management?

Posted by timothy on Sunday April 7, 2013 @07:40AM from the seasonally-appropriate dept.

Rinisari writes "Since striking out on my own nearly a decade ago, I've been collecting bills and important documents in a briefcase and a small filing box. Since I bought a house more than a year ago, the amount of paper that I receive and need to keep has become a deluge and is overflowing the space I want to dedicate to it. I would like to scan everything and retain paper only for things that require the original copies. I'd archive the scans on my heavily backed-up NAS. What free and/or open source software is out there that can handle this task of document management? Being able to scan to PDF and associate a date and a series of labels with a document would be great, as well as some other metadata such as bill amount. My target OS is OS X, but Linux and Windows would be OK."

I just thought of something by roman_mir · 2013-04-07 07:45 · Score: 5, Funny

Send them to a dedicated Gmail account. You'll be able to find all of your documents (you can label them, whatever), they provide an online office suite of some sort, and if you forget what you have there you can always just go to Google search and push the "I'm Feeling Lucky" button.

--
You can't handle the truth.
1. Re:I just thought of something by Anonymous Coward · 2013-04-07 07:56 · Score: 4, Insightful
  
  Providing quick and easy access to the government (and who knows who else) to all of your important documents.
This again? by turkeyfeathers · 2013-04-07 07:53 · Score: 5, Funny

Similar questions to yours appear here regularly. The consensus is that it's best just to throw the bills and documents out and spend more time watching porn.
My Workflow by Orphaze · 2013-04-07 07:57 · Score: 5, Interesting

1) Receive document.
2) Scan with a Fujitsu ScanSnap S1500 in about 10 seconds. It was $380 on sale, but so far it has been so much better than cheap all-in-one scanners that it's not even funny. Seriously, don't even bother going paperless unless you get a real document scanner.
3) Save the PDF to a simple software RAID-1 mirror of two 2TB drives. (Takes about 5 seconds to set up from Disk Management in Windows.) This should protect against a sudden drive failure taking everything.
4) Back up nightly to an external drive that is swapped off-site every other month. This should protect against accidental deletions, fires, etc. Bonus points if the backup drive is the ioSafe fireproof variety.
5) Throw away the original. The only exception is official documents like titles, marriage certificates, etc. Yes, I even throw away W2s and the like. My taxes are 100 percent digital nowadays.
6) Check and test restore from those backups on a semi-regular basis, and you're done!
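One way to make the "test restore" step in 6) routine is to compare checksums between the live folder and a restored copy. The sketch below is only an illustration under assumptions: the function names `sha256_of` and `verify_restore` are made up for this example, and it assumes the backup has been restored to an ordinary directory that mirrors the layout of the original.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source: Path, restored: Path) -> list:
    """Compare every file under `source` with its counterpart under `restored`.

    Returns a list of problems ("missing: ..." or "differs: ...").
    An empty list means every source file was found with identical content.
    """
    problems = []
    for src in sorted(p for p in source.rglob("*") if p.is_file()):
        rel = src.relative_to(source)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

Calling `verify_restore(Path("Documents"), Path("restore-test"))` (both paths hypothetical) after restoring to a scratch folder would list anything the backup lost or corrupted. It does not flag extra files that exist only in the restored copy.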
You don't need a CMS by Anonymous Coward · 2013-04-07 07:58 · Score: 5, Interesting

So, I've been doing this pretty consistently for the past few years and sent this advice to some relatives asking basically the same question. (That's also why it's a little dumbed down.)
I haven't found a case where any sort of CMS makes more sense than the file system. This is after doing this for about 10 years, and I've got records going back to '01.
I'm using a Fujitsu ScanSnap and a Fellowes Powershred, and running Mac OS X. OS X has decent indexing and a good file manager (you really can't beat column view), and the Preview app will let you reassemble PDFs, which is occasionally very handy.
1. The enemy is copies. I strongly recommend "scan and shred", or you'll wind up scanning the same thing over and over.
1.1. Don't bother with any scanner that doesn't do double-sided scans.
1.2. Use a shredder. You can take things out of a trash can.
1.3. The scanner should come with OCR software. Choose "Searchable PDFs".
2. Do scanning in small batches.
2.1. Create two folders, "Scanned" and "Unfiled".
2.2. The scanned files go immediately into Scanned, and the paper immediately goes into the shredder.
2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.
3. If it takes any work to scan it, just shove it in a filing cabinet or, better yet, just shred it.
3.1. If you're having to use a flatbed, it's too complicated to scan and you should file or shred it.
3.2. You can often get manuals and pamphlets and stuff online by googling part of the text or the product name.
4. Don't scan anything you can get electronically.
4.1. Most companies would much rather let you download bills and statements and such.
4.2. Most of them will also delete those statements after a few months, so get in the habit of immediately downloading the statement.
5. It's *very* helpful to put a date on everything. I generally do YYMMDD, trying to guess from dates I find in the document.
5.1. If it's a document covering a period of time, like a bill for the month of November, I use the ending date.
5.2. For tax documents I'll put TT-YYMMDD, where TT is the tax year, since the actual transactions occur that year, but filing and IRS stuff happens the year after.
6. I've found that even with full text search, you still need folders.
6.1. They just don't need to be extremely complicated; usually two levels seem to be fine. I'll put prior years into separate folders, too.
6.2. Your system will evolve as you work; just get it in there, and then be mindful of what you are commonly looking for.
6.3. Keep books and reference manuals in a folder that doesn't get indexed. (Spotlight has an option for this.) They tend to create a lot of spurious hits.
7. Keep your inbox clean: if an email wants you to download a statement, get it right away and put it in Unfiled.
7.1. Likewise, keep your desktop clean: scan and shred stuff as soon as it comes in.
7.2. Have a periodic to-do item to tidy your files, but don't spend more than half an hour (tops!) on it at any given time.
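The naming rules in 5 through 5.2 can be sketched as a small helper. This is a hypothetical illustration, not the poster's actual tooling: the function name `scan_filename`, putting the date stamp in front of the description, and the `.pdf` suffix are all assumptions. For a document covering a period, pass the ending date, as in 5.1.

```python
from datetime import date
from typing import Optional


def scan_filename(description: str, doc_date: date,
                  tax_year: Optional[int] = None) -> str:
    """Build a file name with a YYMMDD stamp, or TT-YYMMDD for tax documents.

    `tax_year` is the year the transactions belong to, which can differ
    from the year printed on the document itself.
    """
    stamp = doc_date.strftime("%y%m%d")
    if tax_year is not None:
        # Prefix the two-digit tax year (the "TT" in rule 5.2).
        stamp = f"{tax_year % 100:02d}-{stamp}"
    return f"{stamp} {description}.pdf"
```

For example, `scan_filename("Electric bill", date(2012, 11, 30))` gives `121130 Electric bill.pdf`, and `scan_filename("W-2", date(2013, 1, 31), tax_year=2012)` gives `12-130131 W-2.pdf`.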
1. Re:You don't need a CMS by overlordofmu · 2013-04-07 09:17 · Score: 5, Insightful
  
  Disclaimer: I know this will seem pedantic but I am trying to get people to think about problems in the long term (solutions that work for thousands of years, not hundreds).
  
  If we use the format YYYY-MM-DD for dates (for instance 2013-04-07), they sort both alphabetically and numerically, and they are easy for human eyes/minds to parse at a glance (my apologies to the vision impaired). There also won't be a reason to change the format for approximately 7,987 years (but who is counting, really).
  
  Please see ISO 8601: http://en.wikipedia.org/wiki/ISO_8601
  
  Obligatory XKCD: http://xkcd.com/1179/
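The sorting claim is easy to check. The sketch below uses a made-up helper, `sorts_chronologically`, and three sample dates chosen for illustration; it tests whether sorting the text labels produced by a given format puts the dates in chronological order.

```python
from datetime import date

# Sample dates chosen to span a century boundary (illustrative only).
SAMPLE = [date(2013, 4, 7), date(1999, 12, 31), date(2001, 1, 15)]


def sorts_chronologically(dates, fmt: str) -> bool:
    """Return True if sorting the formatted labels as plain text
    orders the dates chronologically."""
    labels = [d.strftime(fmt) for d in dates]
    by_label = [d for _, d in sorted(zip(labels, dates))]
    return by_label == sorted(dates)
```

With these samples, `"%Y-%m-%d"` passes, while both the two-digit-year `"%y%m%d"` and the US-style `"%m/%d/%Y"` fail, because 1999 sorts after 2001 and 2013 in the first and the month dominates in the second.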
2. Re:You don't need a CMS by Anonymous Coward · 2013-04-07 10:01 · Score: 4, Insightful
  
  4.1. Most companies would much rather let you download bills and statements and such.
  And this is exactly why I HATE all of the "e-bill" solutions that every company has dreamed up at the moment.
  They turn the problem from "the company remembers to SEND you a bill/invoice/paper" to "you have to go get the bill/invoice/paper FROM the company".
  With paper bills/invoices/etc. sent through the US mail, they "remember" to do something, and I get an automatic reminder when the envelope appears in my mailbox.
  With the e-bill solution, the most I get is an email reminding me to go log in and download the bill/invoice/paper. Now, notice what is wrong here. They just sent me a communication (hint: it's the reminder email) that could have functioned identically to the US Mail envelope by carrying the bill/invoice/paper along with it right to my inbox, so that when I receive the email, I ALSO receive the bill/invoice/paper itself (i.e., attach the bill/invoice/paper as a .pdf to the email).
  Now, most companies will balk at that because "email is not secure" or "email is not private". Well, why don't you let me F****** upload a gpg public key to your system? Then your system could encrypt my bill/invoice/paper using my gpg public key and attach it to the "reminder" email, and now we have an electronic system that functions identically to the old paper bill in the old paper envelope sent through the post office.
  They remember it is time to send me my bill, they create the .pdf (electronic equivalent of printing the bill on paper), they encrypt the .pdf (electronic equivalent of sealing the bill in a mailing envelope), and they email me the item (electronic equivalent of giving the sealed envelope to the postal service).
  But does any company implement this system? No, not one.
  And so they will continue to mail me paper, and they can continue to hound me to switch to "e-bills" all they like. But until their e-bills are done properly (as above), they won't get any buy-in here.
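The scheme described above (create the PDF, encrypt it to the customer's uploaded public key, attach it to the reminder email) could look roughly like this on the sender's side. It is a sketch under assumptions, not any company's real system: the function names, subject line, and body text are invented, and the encryption step is shown only as a `gpg` command line that a billing job would run, using the standard `--recipient`, `--output`, and `--encrypt` options.

```python
from email.message import EmailMessage


def gpg_encrypt_command(pdf_path: str, recipient_key: str) -> list:
    """Build the gpg argv that encrypts a PDF to the customer's public key.

    `recipient_key` is a key ID or email address already imported into
    the sender's keyring (an assumption of this sketch).
    """
    return [
        "gpg", "--batch", "--yes",
        "--recipient", recipient_key,
        "--output", pdf_path + ".gpg",
        "--encrypt", pdf_path,
    ]


def build_bill_email(sender: str, customer: str,
                     encrypted_pdf: bytes, filename: str) -> EmailMessage:
    """Wrap the already-encrypted bill in an email, like sealing the envelope."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = customer
    msg["Subject"] = "Your bill is attached"
    msg.set_content("Your bill is attached, encrypted to your public key.")
    # Opaque binary attachment: only the private-key holder can read it.
    msg.add_attachment(encrypted_pdf, maintype="application",
                       subtype="octet-stream", filename=filename)
    return msg
```

A billing job would run the command with `subprocess.run(..., check=True)`, read the resulting `.gpg` file as bytes, and hand the message to its mail server; the customer then decrypts the attachment with their private key.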
Re:I was in the same boat by tomtomtom · 2013-04-07 12:11 · Score: 4, Informative

I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.
gscan2pdf is OK, but if you want to do this seriously then you're probably going to want a reasonably fast sheet-fed scanner (I got a Fujitsu ScanSnap S1500, which is supported by SANE and can scan at 18-20 pages/36-40 sides per minute) with a button so that you can go through a whole stack of paper quickly with minimal keyboard/mouse interaction to slow you down. This led me to setting up scanbuttond (which just gained official support for the ScanSnap but there was a patch floating around somewhere for a while before that) with a custom script.
Make sure you OCR your documents to make them searchable, then run an indexer (I like recoll, but KDE and GNOME both have their own desktop search solutions as well). I've found the best OCR engine on Linux seems to be tesseract, but there are a couple of others you can try. The process took me a while to get right and is a bit painful. The script that scanbuttond triggers runs scanadf, which writes one image file per side into a processing directory. Once I'm done with a pile of papers, I start a second batch-processing script and go and get a cup of tea. That script runs unpaper and then tesseract on each image, then hocr2pdf to convert each page individually into a searchable PDF, and finally pdftk to concatenate all the pages into one scanned document. I split the process into two parts because the OCR step can take some time, and this way I get maximum throughput on the scanner itself without waiting for the rest to catch up. If I could be bothered, I could make the scanning script start my batch script once and have it pick up new files as they are dropped into the directory, but running it by hand is not much of an effort really.
I then sort my PDFs into a hierarchical directory structure once they've been OCRd (and at this point they get indexed as well for searching).
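The batch half of that process (unpaper, tesseract, hocr2pdf, then pdftk) might be sketched as follows. This is not the poster's script. The function names, the file naming, and the assumption that tesseract writes `<base>.hocr` are all illustrative (older tesseract releases used a `.html` suffix for hOCR output), and the exact options each tool needs may differ on your system.

```python
import subprocess
from pathlib import Path


def page_steps(image: Path, workdir: Path):
    """Return (commands, hocr_path, pdf_path) for one scanned side."""
    cleaned = workdir / (image.stem + "-clean.pnm")
    ocr_base = workdir / image.stem            # tesseract appends the suffix
    hocr = workdir / (image.stem + ".hocr")    # assumed hOCR output name
    pdf = workdir / (image.stem + ".pdf")
    commands = [
        ["unpaper", str(image), str(cleaned)],                # clean up the scan
        ["tesseract", str(cleaned), str(ocr_base), "hocr"],   # OCR to hOCR
        ["hocr2pdf", "-i", str(cleaned), "-o", str(pdf)],     # hOCR comes via stdin
    ]
    return commands, hocr, pdf


def merge_command(page_pdfs, output: Path) -> list:
    """Build the pdftk argv that concatenates the per-page PDFs in order."""
    return ["pdftk", *[str(p) for p in page_pdfs], "cat", "output", str(output)]


def process_batch(images, workdir: Path, output: Path) -> None:
    """Run the whole chain for a stack of page images, then merge the pages."""
    page_pdfs = []
    for image in sorted(images):
        (unpaper, tesseract, hocr2pdf), hocr, pdf = page_steps(image, workdir)
        subprocess.run(unpaper, check=True)
        subprocess.run(tesseract, check=True)
        with hocr.open("rb") as hocr_file:     # feed the hOCR text on stdin
            subprocess.run(hocr2pdf, stdin=hocr_file, check=True)
        page_pdfs.append(pdf)
    subprocess.run(merge_command(page_pdfs, output), check=True)
```

Keeping the command construction separate from `process_batch` makes it possible to print and inspect the commands before anything is run, which helps when tuning the options for a particular scanner.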
If you're on Windows/Mac then the software that comes with the ScanSnap will pretty much do all this for you, although it's better to scan with OCR disabled and then use Acrobat to batch-OCR the PDFs later, for the same reason. Add a decent desktop search solution like an old version of Copernic (or possibly Windows Search) and all is good.