Slashdot Mirror


Ask Slashdot: Open Source For Bill and Document Management?

Rinisari writes "Since striking out on my own nearly a decade ago, I've been collecting bills and important documents in a briefcase and small filing box. Since buying a house more than a year ago, the amount of paper that I receive and need to keep has increased to deluge amounts and is overflowing what space I want to dedicate. I would like to scan everything, and only retain the papers for things that don't require the original copies. I'd archive the scans in my heavily backed up NAS. What free and/or open source software is out there that can handle this task of document management? Being able to scan to PDF and associate a date and series of labels to a document would be great, as well as some other metadata such as bill amount. My target OS is OS X, but Linux and Windows would be OK."

36 of 187 comments

  1. I just thought of something by roman_mir · · Score: 5, Funny

    Send them to a dedicated gmail account. You'll be able to find all of your documents (you can label them, whatever) and they provide online office of some sort and if you forget what you have there you can always just go to Google search and push "I feel lucky" button.

    1. Re:I just thought of something by Anonymous Coward · · Score: 4, Insightful

      Providing quick and easy access to the government (and who knows who else) to all of your important documents.

    2. Re:I just thought of something by Rinisari · · Score: 2

      I'm concerned about the privacy of backing up to Gmail, even if its labeling is exactly what I'm looking for. I suppose I could encrypt everything I send and base its subject on something I can read and label, but that's a lot of rigmarole for something that I really would rather keep locally or on my own backed-up network.

    3. Re:I just thought of something by fustakrakich · · Score: 3, Insightful

      Google is pretty fickle with its applications. We'll never know how long gmail will remain online, until they decide to shut it down.

      Oh, like the other replies said, 'privacy'... You will have none if it is online in any form.

      --
      “He’s not deformed, he’s just drunk!”
    4. Re:I just thought of something by roman_mir · · Score: 3, Interesting

      Absolutely, no question about it. Some documents are not that important, but the important ones shouldn't go there.

  2. I was in the same boat by mkro · · Score: 2

    I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.

    --
    I shall go and tell the indestructible man that someone plans to murder him.
    1. Re:I was in the same boat by AvitarX · · Score: 2

      Hasn't KDE finally gotten its act together on functioning tags?

      --
      Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
    2. Re:I was in the same boat by tomtomtom · · Score: 4, Informative

      I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.

      gscan2pdf is OK, but if you want to do this seriously then you're probably going to want a reasonably fast sheet-fed scanner (I got a Fujitsu ScanSnap S1500, which is supported by SANE and can scan at 18-20 pages/36-40 sides per minute) with a button so that you can go through a whole stack of paper quickly with minimal keyboard/mouse interaction to slow you down. This led me to setting up scanbuttond (which just gained official support for the ScanSnap but there was a patch floating around somewhere for a while before that) with a custom script.

      Make sure you OCR your documents to make them searchable, then run an indexer (I like recoll, but KDE and GNOME both have their own desktop search solutions as well). I've found the best OCR engine on Linux seems to be tesseract, but there are a couple of others you can try. The process took me a while to get right and is a bit painful. The script that scanbuttond triggers runs scanadf, which scans each side to its own image file and puts the files in a processing directory. Once I'm done with a pile of papers, I run a second, batch-processing script while I go and get a cup of tea: it runs unpaper and then tesseract on each image, uses hocr2pdf to convert each page individually into a searchable PDF, and finally uses pdftk to concatenate the pages into a single scanned document. I split the process into two parts because the OCR step can take some time, and this way I get maximum throughput on the scanner itself without waiting for the rest to catch up. If I could be bothered, I could have the scanning script start the batch script once and have it pick up new files as they are dropped into the directory, but running it by hand isn't much of an effort really.

      I then sort my PDFs into a hierarchical directory structure once they've been OCRd (and at this point they get indexed as well for searching).

      If you're on Windows/Mac then the software that comes with the ScanSnap will pretty much do all this for you, although it's better to scan with OCR disabled and then use Acrobat to batch-OCR the PDFs later, for the same reason. Add a decent desktop search solution like an old version of Copernic (or possibly Windows Search) and all is good.
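
      The Linux batch step described above (unpaper, tesseract, hocr2pdf, pdftk) could be sketched roughly as follows. This is not the poster's script, only an illustration. The function and class names are made up, the command-line forms are the commonly documented ones and should be checked against the man pages for your versions, and the assumption that tesseract writes a ".hocr" file is version-dependent (older releases wrote ".html").

```python
import subprocess
from pathlib import Path
from typing import NamedTuple, Optional


class Step(NamedTuple):
    argv: list               # command and its arguments
    stdin: Optional[Path]    # file fed to the command's standard input, if any


def plan_batch(images, workdir, output_pdf):
    """Build the list of commands for one scanned document (hypothetical helper)."""
    workdir = Path(workdir)
    steps, page_pdfs = [], []
    for image in sorted(Path(p) for p in images):
        cleaned = workdir / f"{image.stem}.clean{image.suffix}"
        ocr_base = workdir / image.stem            # tesseract adds the extension itself
        hocr = workdir / f"{image.stem}.hocr"      # assumed output name, see lead-in
        page_pdf = workdir / f"{image.stem}.pdf"
        # 1. Clean up the raw scan.
        steps.append(Step(["unpaper", str(image), str(cleaned)], None))
        # 2. OCR the cleaned page to hOCR.
        steps.append(Step(["tesseract", str(cleaned), str(ocr_base), "hocr"], None))
        # 3. Combine image and hOCR (read from stdin) into a searchable one-page PDF.
        steps.append(Step(["hocr2pdf", "-i", str(cleaned), "-o", str(page_pdf)], hocr))
        page_pdfs.append(str(page_pdf))
    # 4. Concatenate the pages into the final document.
    steps.append(Step(["pdftk", *page_pdfs, "cat", "output", str(output_pdf)], None))
    return steps


def run(steps):
    """Execute the planned steps in order, stopping at the first failure."""
    for step in steps:
        if step.stdin is None:
            subprocess.run(step.argv, check=True)
        else:
            with open(step.stdin, "rb") as handle:
                subprocess.run(step.argv, stdin=handle, check=True)
```

      Planning and running are kept separate so the command list can be printed and inspected before anything touches the scans, e.g. run(plan_batch(pages, "work", "statement.pdf")).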

  3. OpenKM by Anonymous Coward · · Score: 3, Informative

    OpenKM (http://www.openkm.com/en/) is what I use to manage my documents; its tagging and document preview features are what I appreciate most. It runs as a web service, FYI.

  4. muddle headed post by Anonymous Coward · · Score: 2, Interesting

    by definition, "important" = keep original (I mean seriously, are you that short of basement space ??)
    Electronics are ephemeral; You can, today, read stuff on papyrus, as long as you know the language..do you really want to trust stuff that is important to ephemeral electronics ?
    (i mean, how many times has /. gone over this - is this the editors idea of a yearly question ?)

    tagging is an inherently stupid idea; it may be the best that you can do with current technology, but a google-like full text search is much much better (tell me - if you want to pull out a piece of information you know is on your hard drive in a pdf, do you look for the pdf, or just google it ?)

    it is possible, after 5 or ten years, you might know what tags you want....
    tagging is hard work that you have to do manually and consistently; better to have 3 or 4 folders organized by client/project then tag

    1. Re:muddle headed post by Rinisari · · Score: 2

      You are correct. I meant to keep only the things I need originals of: birth certificate, car titles, etc.

      As for physical space, I have better things than documents to store in my available basement space: wine, beer, computers, etc.

    2. Re:muddle headed post by techno-vampire · · Score: 2

      I don't even have the originals of my birth certificate, discharge papers or DD 214, and haven't in decades. However, my father registered my birth certificate at the Hall of Records, and I did the same with my discharge papers and DD 214 after I got out of the Navy so I don't have to worry. In fact, in Los Angeles, where they're registered, any veteran can get two copies of his service papers for free, any time they're needed, so why keep the originals? And, once when I was down there to request copies, I ran across my father's, although I've never had a reason to request them. Still, it's nice to know how long they hang on to things like that.

      --
      Good, inexpensive web hosting
    3. Re:muddle headed post by ShanghaiBill · · Score: 2

      Electronics are ephemeral; You can, today, read stuff on papyrus, as long as you know the language..do you really want to trust stuff that is important to ephemeral electronics ?

      This is just completely backwards. Electronic documents are the least likely to get lost or destroyed. I have no receipts or papers from 25 years ago. But I have all my email from those days. With e-docs, you can make multiple copies, store copies off-site, etc. Every email I have ever sent, every non-spam email I have ever received, all the source code I have ever written, over 10,000 family photos, copies of my marriage license, deeds, insurance forms, etc. etc. will ALL fit on a single XD card smaller than my fingernail, and the XD card will fit in a keychain fob that I carry in my pocket. Other copies of all these docs are on my laptop, on my desktop, at my parent's house, on a server outside the USA, on an SD card in a ziploc bag taped to the bottom of my will, etc.

      (i mean, how many times has /. gone over this - is this the editors idea of a yearly question ?)

      Apparently not enough. Every time it comes up, the general consensus is the opposite of what you recollect.

    4. Re:muddle headed post by Genda · · Score: 2

      M-Disk now finally lets you make good, high-density archival storage (on DVD). Combine this with a good document management tool (like Docmoto on Mac) and you can pretty much be assured of managing all your paper-to-electronic needs elegantly. Additionally, a lot of HEAVY DUTY document management applications (e.g. Docfinity) include sophisticated business process management tools to control process flow for those documents. One cool feature of these tools is the ability to parse metatags from file names, or to import files (often from a CSV listing). You could store your documents in an intelligently organized directory tree and keep a central spreadsheet with each file's location, name, and the metadata you want to maintain. At some time in the future you could export your spreadsheet and use it both to import those documents and to add the necessary tags to them.

      There are elegant solutions available, but I haven't seen any great open source ones yet; this whole process is still surprisingly new. Part of the problem is that it's still labor intensive and expensive, and the problem space remains poorly defined.

    5. Re: muddle headed post by Genda · · Score: 2

      Take two stone tablets and call me in the morning...

    6. Re:muddle headed post by Miamicanes · · Score: 2

      One small detail to add... M-Disk is the best there is *if* you need or care about DVD-ROM compatibility, but for roughly the same price per disc, you can get non-LTH BD-R discs with roughly 5-6x the capacity. M-Disk is basically a non-LTH BD-R disc with the track geometry of a DVD-ROM. Either way, non-LTH BD-R and M-Disk are the way to go if you want long-term passive archivability (i.e., the ability to write a disc, throw it in a box, forget about it for 25 years, and still be able to read it). While there aren't any guarantees that DVD or Blu-Ray will be mainstream 25 years from now, I'd feel pretty safe betting that someone will sell drives capable of reading them without drama, even if doing anything useful from that point requires a bit more work.

      Non-LTH BD-R discs rock. They're by far the best long-term media we've ever had (well, with the possible exception of Magneto-Optical discs from ~10 years ago, which is basically what non-LTH BD-R discs *are*). LTH discs, though, are pure shit. They exist solely to enable factories to crank out BD-density media using the same unreliable organic dyes that we've been suffering with for ~15 years. They've gotten better, of course, since the first CD-Rs came out 15 years ago, but they aren't anywhere NEAR the same league as magneto-optical technology when it comes to archival stability (MO works by using the laser to melt & liquefy a substrate, then using a magnet to quickly orient reflective particles floating in the melted substrate before it re-solidifies for all eternity. Organic dyes start out light, then darken when burned by the laser... or sunlight... or possibly even slow chemical oxidation over time).

  5. This again? by turkeyfeathers · · Score: 5, Funny

    Similar questions to yours appear here regularly. The consensus is that it's best just to throw the bills and documents out and spend more time watching porn.

  6. Try Alfresco by Anonymous Coward · · Score: 2, Interesting

    You can try Alfresco DMS.
    It requires a web server, so it might be too much for a single user.

  7. iDocument by Idimmu Xul · · Score: 2

    http://www.icyblaze.com/idocument/

    iDocument for the Mac is like iTunes but for documents. It lets you import documents (pretty much any type), tag them, and store them in virtual or real folders. It sounds like it's exactly what you're after.

    --
    The problem with slashdot is that most of its users were bullied and stuffed into lockers as kids!
  8. My Workflow by Orphaze · · Score: 5, Interesting

    1) Receive document.
    2) Scan with Fujitsu ScanSnap S1500 in about 10 seconds. $380 on sale, but it has been so worth it compared with cheap all-in-one scanners that it's not even funny. Seriously, don't even bother going paperless unless you get a real document scanner.
    3) Save PDF to simple software RAID-1 mirror of two 2TB drives. (Takes about 5 seconds to set up from disk management in Windows.) This should protect against sudden drive failure taking everything.
    4) Back up nightly to external drive swapped off-site every other month. This should protect from accidental deletions, fires, etc. Bonus points if backup drive is ioSafe fireproof variety.
    5) Throw away original. Only exception is official documents like titles, marriage certificate, etc. Yes, I even throw away W2s and the like. My taxes are 100 percent digital nowadays.
    6) Check and test restore from those backups on a semi-regular basis, and you're done!
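
    For step 6, one way to check a test restore is to compare checksums of the live archive against the restored copy. The sketch below is only an illustration under that assumption; the function names are made up, and it reads each file fully into memory, which is fine for scanned PDFs but not for very large files.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def compare_restore(original_root, restored_root):
    """Return (missing, changed): files absent from, or different in, the restore."""
    original = tree_digests(original_root)
    restored = tree_digests(restored_root)
    missing = sorted(set(original) - set(restored))
    changed = sorted(
        name for name in original.keys() & restored.keys()
        if original[name] != restored[name]
    )
    return missing, changed
```

    Two empty lists mean every file in the archive came back from the backup byte for byte.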

    1. Re:My Workflow by spire3661 · · Score: 2

      I liked it up until you had Windows managing a RAID. Get a RAID NAS running Linux. It seems odd to RAID up a couple of drives just to let Windows mess them up. I suggest a Synology DS212. If you are really serious, build a ZFS rig with snapshots.

      --
      Good-bye
  9. You don't need a CMS by Anonymous Coward · · Score: 5, Interesting

    So, I've been doing this pretty consistently for the past few years and sent this advice to some relatives asking basically the same question. (That's also why it's a little dumbed down.)

    I haven't found a case where any sort of CMS makes more sense than the file system. This is after doing this for about 10 years, and I've got records going back to '01.

    I'm using a Fujitsu ScanSnap and a Fellowes Powershred, and running Mac OS X. OS X has decent indexing and a good file system manager (really can't beat column view), and the Preview app will let you reassemble PDFs, which is occasionally very handy.

    1. The enemy is copies. I strongly recommend "scan and shred", or you'll wind up scanning the same thing over and over.

    1.1. Don't bother with any scanner that doesn't do double-sided scans.

    1.2. Use a shredder. You can take things out of a trash can.

    1.3. The scanner should come with OCR software. Choose "Searchable PDFs".

    2. Do scanning in small batches.

    2.1. Create two folders, "Scanned" and "Unfiled".

    2.2. The scanned files go immediately into Scanned, and the paper immediately goes into the shredder.

    2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.

    3. If it takes any work to scan it, just shove it in a filing cabinet or, better yet, just shred it.

    3.1. If you're having to use a flatbed, it's too complicated to scan and you should file or shred it.

    3.2. You can often get manuals and pamphlets and stuff online by googling part of the text or the product name.

    4. Don't scan anything you can get electronically.

    4.1. Most companies would much rather let you download bills and statements and such.

    4.2. Most of them will also delete those statements after a few months, so get in the habit of immediately downloading the statement.

    5. It's *very* helpful to put a date on everything. I generally do YYMMDD, trying to guess from dates I find in the document.

    5.1. If it's a document covering a period of time, like a bill for the month of November, I use the ending date.

    5.2. For tax documents I'll put TT-YYMMDD, where TT is the tax year, since the actual transactions occur that year, but filing and IRS stuff happens the year after.

    6. I've found that even with full text search, you still need folders.

    6.1. They just don't need to be extremely complicated; usually two levels seems to be fine. I'll put prior years into separate folders, too.

    6.2. Your system will evolve as you work; just get it in there, and then be mindful of what you are commonly looking for.

    6.3. Keep books and reference manuals in a folder that doesn't get indexed. (Spotlight has an option for this.) They tend to create a lot of spurious hits.

    7. Keep your inbox clean. If an email wants you to download a statement, get it right away and put it in Unfiled.

    7.1. Likewise, keep your desktop clean. Scan and shred stuff as soon as it comes in.

    7.2. Have a periodic to-do item to tidy your files, but don't spend more than half an hour (tops!) on it at any given time.
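
    The date prefixes from points 5 to 5.2 are easy to generate consistently with a small helper. This is a sketch only; the function name and its parameters are hypothetical, and for a document covering a period you would pass the ending date, as in 5.1.

```python
from datetime import date
from typing import Optional


def date_prefix(document_date: date, tax_year: Optional[int] = None) -> str:
    """Build a YYMMDD prefix, or TT-YYMMDD when a tax year is given."""
    stamp = document_date.strftime("%y%m%d")
    if tax_year is None:
        return stamp
    # TT is the two-digit tax year the document belongs to.
    return f"{tax_year % 100:02d}-{stamp}"


# A November bill is dated by the end of its period.
november_bill = date_prefix(date(2012, 11, 30))
# An IRS letter received in April 2013 about tax year 2012.
irs_letter = date_prefix(date(2013, 4, 15), tax_year=2012)
```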

    1. Re:You don't need a CMS by sribe · · Score: 2

      2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.

      god, no! Give it a sensible name and put it where it belongs to begin with; don't deal with the same document multiple times.

    2. Re:You don't need a CMS by overlordofmu · · Score: 5, Insightful

      Disclaimer: I know this will seem pedantic but I am trying to get people to think about problems in the long term (solutions that work for thousands of years, not hundreds).

      If we use the format YYYY-MM-DD for dates (for instance 2013-04-07), they sort both alphabetically and numerically, they are easy for human eyes/minds to parse at a glance (my apologies to the vision impaired) and there won't be a reason to change the format for approximately 7,895 years (but who is counting, really).

      Please see ISO 8601: http://en.wikipedia.org/wiki/ISO_8601

      Obligatory XKCD: http://xkcd.com/1179/
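
      The sorting point is easy to check. The sample dates below are made up for illustration; the sketch just compares a plain string sort of ISO 8601 names with a plain string sort of two-digit-year names once the records span a century boundary.

```python
from datetime import date

# Three documents in chronological order, spanning the year 2000.
dates = [date(1999, 12, 31), date(2001, 3, 5), date(2013, 4, 7)]

iso_names = [d.isoformat() for d in dates]            # YYYY-MM-DD
short_names = [d.strftime("%y%m%d") for d in dates]   # YYMMDD

# A plain string sort is what a file manager or `ls` gives you.
iso_sorted = sorted(iso_names)
short_sorted = sorted(short_names)

iso_is_chronological = iso_sorted == iso_names
short_is_chronological = short_sorted == short_names
```

      With four-digit years the string order matches the calendar order; with two-digit years the 1999 document sorts after the 2013 one.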

    3. Re:You don't need a CMS by Anonymous Coward · · Score: 4, Insightful

      4.1. Most companies would much rather let you download bills and statements and such.

      And this is exactly why I HATE all of the "e-bill" solutions that every company has dreamed up at the moment.

      They turn the problem from "the company remembers to SEND you a bill/invoice/paper" to "you have to go get the bill/invoice/paper FROM the company".

      With paper bills/invoices/etc. sent through the US mail, they "remember" to do something, and I get an automatic reminder when the envelope appears in my mailbox.

      With the e-bill solution, the most I get is an email reminding me to go log in and download the bill/invoice/paper. Now, notice what is wrong here. They just sent me a communication (hint, it's the reminder email) that could have functioned identically to the US Mail envelope, carrying the bill/invoice/paper along with it right to my inbox, so when I receive the email, I ALSO receive the bill/invoice/paper itself (i.e., attach the bill/invoice/paper as a .pdf to the email).

      Now, most companies will balk at that because "email is not secure" or "email is not private". Well, why don't you let me F****** upload a gpg public key to your system, and then your system could encrypt my bill/invoice/paper using my gpg public key, then attach it to the "reminder" email, and now we have an electronic system that functions identically to the old paper bill in the old paper envelope sent through the postal office.

      They remember it is time to send me my bill, they create the .pdf (electronic equivalent of printing the bill on paper), they encrypt the .pdf (electronic equivalent of sealing the bill in a mailing envelope), and they email me the item (electronic equivalent of giving the sealed envelope to the postal service).

      But does any company implement this system? No, not one.

      And so they will continue to mail me paper, and can continue to hound me to switch to "e-bills" all they like. But until their e-bills are done properly (as above) they won't get any buy in here.

    4. Re:You don't need a CMS by melikamp · · Score: 2

      Yes, yes, yes. The submitter sounds like he wants to digitize a bunch of files, so I would recommend a good file system. Any stable filesystem will do, like ext4 for instance.

      Avoid metadata within a file for as long as possible. It will bury you. If date and bill amount is all you need, then just stick them into the file name.

      YYYY-MM-DD.amount.unit.short-description.pdf

      2013-04-07.-3975.us-cents.how-much-this-advice-will-cost-you.pdf

      Now you can pile your files into, say, ~/my-files/ in any way whatever. You can create a category tree, for example, to allow you to find files in a file manager in 3 clicks. For more complex tasks you can just use bash, find, and the rest of the userland. It does not get simpler or more portable than that. In particular, it is trivial to convert this structure into a CSV, which you can suck into a spreadsheet or a database of your choice.
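
      As a rough illustration of that last point, here is a sketch that turns names in the YYYY-MM-DD.amount.unit.short-description.pdf pattern into CSV text. The function names and column names are made up, and it assumes the amount is a whole number, as in the example above.

```python
import csv
import io
from pathlib import Path

FIELDS = ["date", "amount", "unit", "description", "file"]


def parse_name(filename):
    """Split 'YYYY-MM-DD.amount.unit.short-description.pdf' into its fields."""
    name = Path(filename).name
    stem = Path(name).stem                      # drop the trailing '.pdf'
    day, amount, unit, description = stem.split(".", 3)
    return {
        "date": day,
        "amount": int(amount),
        "unit": unit,
        "description": description,
        "file": name,
    }


def to_csv(filenames):
    """Return CSV text with one row per file, sorted by name (and so by date)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for filename in sorted(filenames):
        writer.writerow(parse_name(filename))
    return buffer.getvalue()
```

      Feeding it the names from something like Path("~/my-files").expanduser().rglob("*.pdf") gives a file that a spreadsheet can open directly.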

  10. Scan, OCR, and use your file system (and symlinks) by magic maverick · · Score: 2

    My suggestion would be to just scan and OCR your files, and then store them in your file system.
    Hierarchy might be something like: ~/scans/year/project/sorted

    Within each sorted subdir, you'd have three folders. Date, organizationThatGeneratedTheDoc and TypeOfDoc.
    So in the folder ~/scans/year/project/sorted/org
    The file names would be something like: organizationThatGeneratedTheDoc-yyyy-mm-dd-TypeOfDoc.pdf
    In the folder ~/scans/year/project/sorted/TypeOfDoc
    The file names would be like: TypeOfDoc-yyyy-mm-dd-organizationThatGeneratedTheDoc.pdf
    Etc.

    You'd use links (symlinks or hard links) to make sure that each document is accessible in more than one place. (You can also use links to put documents in more than one project folder.)

    Types of documents would be things like invoices, receipts, legal threats, court orders etc. In the event that a document has more than one type, or more than one organization, you simply have more links. So invoice-2013-04-07-webdevteamawesome.pdf and legalthreat-2013-04-07-webdevteamawesome.pdf are the same document, because the first page is an invoice, and the second a threat to take you to court if you don't pay. (This then exists six times, three times for each type, but with the magic of hard links only takes up the space of 1.001 documents.)

    With the OCRed text being saved with the PDF scan, you can also run text searches within your files to find specific information (such as bill amount; seriously, how often would you use that information?)

    This allows you maximum flexibility, and prevents you from being locked into a particular piece of software (as you can do everything manually). Moreover, once you've got it set up, it's easy to run with each new document.
    Steps would be:
    1) Scan and OCR doc, saving the PDF into the staging area folder.
    2) Run your script, which asks for the date, project, org name, doc type.
    3) The script then saves the document in the appropriate folders, generating links as required.
    4) Profit!
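
    Steps 2 and 3 might look something like the sketch below. It is only an illustration: the function name and arguments are made up, the values are passed in rather than asked for interactively, the name pattern for the date folder is a guess (only the org and type patterns are spelled out above), and hard links require the staging area and the archive to be on the same file system.

```python
import os
from pathlib import Path


def file_document(staged_pdf, scans_root, project, day, org, doc_type):
    """Hard-link one staged PDF into the date, org and TypeOfDoc folders.

    `day` is a 'yyyy-mm-dd' string; its first four characters pick the year folder.
    """
    sorted_dir = Path(scans_root) / day[:4] / project / "sorted"
    targets = [
        sorted_dir / "date" / f"{day}-{org}-{doc_type}.pdf",       # assumed pattern
        sorted_dir / "org" / f"{org}-{day}-{doc_type}.pdf",
        sorted_dir / "TypeOfDoc" / f"{doc_type}-{day}-{org}.pdf",
    ]
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        os.link(staged_pdf, target)    # another name for the same data on disk
    os.remove(staged_pdf)              # the staging name is no longer needed
    return targets
```

    For a document with more than one type or organization, calling os.link again with the extra names adds the additional entries without copying the data.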

    --
    HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!
  11. Mayan EDMS by Rob the Roadie · · Score: 2

    I've played with this a few times, never used it in anger though.

    http://www.mayan-edms.com/

    I might take up your challenge on going paperless too and give Mayan a go.

  12. Tossing hat into the ring for DJVU format. by Areyoukiddingme · · Score: 3, Interesting

    PDF is big and bulky. DJVU format makes for tiny document scans. And there are open source libraries for creating it, available even in Debian. Wavelet compression did finally make it into the wild. It's just nobody has ever heard of it, for some reason.

    Doesn't help for organization, but it should be a reasonable option for storage.

    It even embeds the OCR text in the document along with the image version, so it doesn't proliferate multiple copies of the same data.

    1. Re:Tossing hat into the ring for DJVU format. by Inda · · Score: 2

      PDF only wraps around the PNG, JPEG, BMP, generic_image_format. The extra bloat is only a couple of kb.

      If the bloat is more, the PDF has been generated incorrectly.

      --
      This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
  13. Owncloud? by bazorg · · Score: 2

    Maybe that Owncloud thing will work well to handle the storage and access. Does anyone know if its search function is any good?

  14. Alfresco by Balr0g · · Score: 3, Informative

    I use the community edition of Alfresco for that task. You can tag all documents, add custom fields and have full text search and versioning out of the box. Documents can be accessed via web interface, smb, ftp and even imap.

  15. OPAC by buss_error · · Score: 2

    Any open source library management software that does ebooks should help you out. Here's a list:

    http://sourceforge.net/directory/home-education/library/opac/os:windows/freshness:recently-updated/

    --
    Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
  16. Skip scanning, download PDFs directly. by yeah I can fix that · · Score: 2

    (Longtime listener; first-time caller. If I'm doing it wrong, please be kind.)

    I've been going through the same issue and have painstakingly scanned/filed a metric crapton of old documents, putting them in a hierarchical directory structure where I can find them if I need them.

    But this sucks for a number of obvious reasons. The ones that bother me the most are:

    1) A scanned document is larger (*and* less useful) than a downloaded PDF.

    2) It's a manual process! I'd rather spend ten hours automating something than five hours over the next 5 years trying to remember the filename convention for storing the scanned document.

    Anyway, cutting to the chase, I'm now using Ruby/Watir scripts to automate the business of downloading my most common phone/utility bills from the websites and stashing them directly. I used to use Perl and WWW::Mechanize, but all the websites are now so contaminated with unnecessary javascript that only something which manipulates a browser directly allows automation without pulling my hair out. Ruby/Watir works pretty well. Recommend. Automated download; priceless. Without automated download, I'd rather return to scanning paper documents mailed to me; otherwise you quickly find how unreliable your service provider is for retaining your statements.

    If anybody wants some pre-alpha scripts for grabbing their pg&e, comcast, cigna, at&t, schwab, nvenergy statements, let me know.

  17. OCR - Re: I was in the same boat by WebCowboy · · Score: 2

    gscan2pdf can do OCR and embed the results as annotations within the PDF. Perhaps that would help with searchability. It works well enough with a lot of my documents; though it is far from perfect, it is good enough for those purposes, especially for bills, as they are not handwritten. Best results are on scans set to line art/b&w rather than greyscale or colour.

  18. Re:Receive electronic statements? by Miamicanes · · Score: 2

    The problem with most businesses is that they want to have their cake & eat it too... they want to get you to opt into paperless statements, but they don't want to allow you to fetch your statements via automated means. They just want to spam you monthly (or more), then make you go to their site, log in, and generally set things up to make it as hard to automate those logins as possible. If companies like CapitalOne and Chase would let you just give them your public key, encrypt your statements with it, and email them directly to you (or allow you to fetch them in some standard manner via a web service), I'd happily let them off the hook and go all-electronic. But I'll be damned if I'm going to settle for statements I have to go out of my way to obtain. At least printed statements can be tossed into a box and ignored for years unless I care enough to look at them, as opposed to ephemeral online statements that go bye-bye after 12 months.