Ask Slashdot: Open Source For Bill and Document Management?

I just thought of something by roman_mir · 2013-04-07 07:45 · Score: 5, Funny

Send them to a dedicated gmail account. You'll be able to find all of your documents (you can label them, whatever) and they provide online office of some sort, and if you forget what you have there you can always just go to Google search and push the "I'm Feeling Lucky" button.

--
You can't handle the truth.

Re:I just thought of something by Anonymous Coward · 2013-04-07 07:56 · Score: 4, Insightful

Providing quick and easy access to the government (and who knows who else) to all of your important documents.
Re:I just thought of something by Rinisari · 2013-04-07 07:57 · Score: 2

I'm concerned with privacy of backing up to Gmail, even if its labeling is completely what I'm looking for. I suppose I could encrypt everything I send and base its subject on something I can read and label, but that's a lot of rigmarole for something that I really would rather keep locally or on my own backed-up network.

--
Colin Dean Go a year without DRM
Re:I just thought of something by fustakrakich · 2013-04-07 08:19 · Score: 3, Insightful

Google is pretty fickle with its applications. We'll never know how long gmail will remain online, until they decide to shut it down.
Oh, like the other replies said, 'privacy'... You will have none if it is online in any form.

--
“He’s not deformed, he’s just drunk!”
Re:I just thought of something by roman_mir · 2013-04-07 08:44 · Score: 3, Interesting

Absolutely, no question about it. Some documents are not that important, but the important ones shouldn't go there.

--
You can't handle the truth.
Re:I just thought of something by Genda · 2013-04-07 12:14 · Score: 1

Not necessarily so... Google (or any cloud storage resource) is an awesome place to store encrypted and compressed documents. You just want to make certain that you back everything off every once in a while so if Google (or other resource) decides to pull the plug, you won't find yourself trying to slurp 5 GB of data down in a week through a limited resource being crushed by a hundred million other users doing the same.
Re:I just thought of something by DeBaas · 2013-04-07 18:26 · Score: 1

To overcome this we worked for a while on Tagnlock.
It is supposed to do the following:
- allow you to easily label/tag documents
- provide a GUI way of defining tags, whether they are free-form or not, compulsory, etc.
- encrypt and then store via method of choice (email included)
I am afraid that it is currently abandonware...
You can also:
Find some way to tag and use duplicity to make encrypted incremental backups to a cloud service. That's what I do now. I simply use duplicity to duplicate (and encrypt) to the same drive and use the Dropbox daemon to sync the encrypted copy to my Dropbox.

--
---

I was in the same boat by mkro · 2013-04-07 07:47 · Score: 2

I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.

--
I shall go and tell the indestructible man that someone plans to murder him.

Re:I was in the same boat by AvitarX · 2013-04-07 07:50 · Score: 2

Hasn't kde finally gotten their shot together for functioning tags?

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:I was in the same boat by fustakrakich · 2013-04-07 08:23 · Score: 1

Hasn't kde finally gotten their shot together...?
Their aim is true...

--
“He’s not deformed, he’s just drunk!”
Re:I was in the same boat by tomtomtom · 2013-04-07 12:11 · Score: 4, Informative

I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.
gscan2pdf is OK, but if you want to do this seriously then you're probably going to want a reasonably fast sheet-fed scanner (I got a Fujitsu ScanSnap S1500, which is supported by SANE and can scan at 18-20 pages/36-40 sides per minute) with a button so that you can go through a whole stack of paper quickly with minimal keyboard/mouse interaction to slow you down. This led me to setting up scanbuttond (which just gained official support for the ScanSnap but there was a patch floating around somewhere for a while before that) with a custom script.
Make sure you OCR your documents to make them searchable then run an indexer (I like recoll but KDE and GNOME both have their own desktop search solutions as well). I've found the best OCR engine on Linux seems to be tesseract, but there are a couple of others you can try. The process took me a while to get right and is a bit painful - the script which scanbuttond runs runs scanadf to scan to a string of image files per side and puts them in a processing directory. I then have another batch-processing script I run once I'm done with a pile of papers while I go and get a cup of tea which runs unpaper then tesseract on them, then hocr2pdf to convert each page individually into a searchable PDF file then finally pdftk to concatenate all the pages together into a scanned document. I split the two parts of the process out because the OCR bit can take some time and this way I can get maximum throughput on the scanner itself without needing to wait for the rest to catch up. If I could be bothered then I could make the scanning script run my de-batching script once only and have it pick up new files as they are dropped in the directory but it's not that much of an effort really.
I then sort my PDFs into a hierarchical directory structure once they've been OCRd (and at this point they get indexed as well for searching).
If you're on Windows/Mac then the software that comes with the ScanSnap will pretty much do all this for you; although it's better to scan with OCR disabled then use Acrobat to batch-OCR the PDFs later for the same reason. Add a decent desktop search solution like an old version of Copernic (or possibly Windows Search) and all is good.
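For anyone wanting to try the Linux batch step, here is a rough Python sketch of the unpaper, tesseract, hocr2pdf, pdftk chain described above. It is not the actual script from this post: the function names, the per-page file layout and the `.hocr` output extension (older tesseract builds wrote `.html`) are assumptions, and all four tools are assumed to be installed and on the PATH.

```python
import subprocess
from pathlib import Path


def build_commands(pages, workdir, output_pdf):
    """Plan the commands that turn ordered page images into one searchable PDF.

    Returns a list of (argv, stdin_path) pairs; stdin_path is None unless the
    tool reads its input from standard input. Nothing is executed here.
    """
    workdir = Path(workdir)
    commands, page_pdfs = [], []
    for page in map(Path, pages):
        cleaned = workdir / f"{page.stem}.clean{page.suffix}"
        ocr_base = workdir / page.stem           # tesseract appends the extension
        hocr_file = f"{ocr_base}.hocr"           # assumption: newer tesseract naming
        page_pdf = workdir / f"{page.stem}.pdf"
        # 1. clean up the raw scan (deskew, remove borders and noise)
        commands.append((["unpaper", str(page), str(cleaned)], None))
        # 2. OCR to hOCR, which keeps word positions
        commands.append((["tesseract", str(cleaned), str(ocr_base), "hocr"], None))
        # 3. image + hOCR (on stdin) -> single-page searchable PDF
        commands.append((["hocr2pdf", "-i", str(cleaned), "-o", str(page_pdf)], hocr_file))
        page_pdfs.append(str(page_pdf))
    # 4. concatenate the per-page PDFs into one document
    commands.append((["pdftk", *page_pdfs, "cat", "output", str(output_pdf)], None))
    return commands


def run_commands(commands):
    """Execute the planned commands in order, stopping at the first failure."""
    for argv, stdin_path in commands:
        if stdin_path is None:
            subprocess.run(argv, check=True)
        else:
            with open(stdin_path, "rb") as stdin:
                subprocess.run(argv, stdin=stdin, check=True)
```

Keeping the planning separate from the running means you can print the command list and eyeball it before letting it loose on a pile of scans.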
Re:I was in the same boat by bill_mcgonigle · 2013-04-07 14:20 · Score: 1

Hey, man, upload that script somewhere and somebody might add the watching logic to it!
How do you keep your ScanSnap from feeding multiple pages simultaneously? Mine does it frequently, making it mostly collect dust.

--
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Re:I was in the same boat by ozmanjusri · 2013-04-07 14:32 · Score: 1

yeah, tags would be nice
SVN and a good client (Rabbit VCS, Tortoise SVN etc) then?

--
"I've got more toys than Teruhisa Kitahara."
Re:I was in the same boat by socceroos · 2013-04-07 15:14 · Score: 1

They have. If you wanted to do something like this and make it semantic and awesome you could create a new Nepomuk Ontology for your document and have some associated specific metadata - then you can semantically link it with people, bank accounts, etc, give it tags and more. Search and relationships are enumerated with nepomuk-indexer. On top of that you can either use a file manager like Dolphin and just use it for search, or build a little interface over the top to perform custom queries on the data.
Re:I was in the same boat by dannys42 · 2013-04-07 19:37 · Score: 1

I have a ScanSnap as well, but just use their Mac software. What type of paper are your documents and how many pages do you do at once? I've found for really thin paper or for many pages it helps to simply fan them out a bit. But if you're doing many pages (like over ~20 or so) you might need to feed them in batches... Unfortunately it's a bit of baby-sitting, waiting for it to reach near the end of one batch, then putting the next batch in. But I manage to avoid the multi-page problem most of the time this way.
Re:I was in the same boat by AvitarX · 2013-04-07 23:43 · Score: 1

As you appear to know these things, I've been curious, but not enough to really check.
does kde use extended attributes to store the meta data, so that plain old backups don't lose any information, or does it use some other method that requires a separate backup of the database?

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:I was in the same boat by socceroos · 2013-04-08 00:59 · Score: 1

It uses plain Nepomuk for everything with one small exception: PIM data from the Kontact suite which uses Akonadi (KMail, KOrganiser, KAddressbook, KPilot, etc). How can I say that it's only a small exception when PIM is a pretty big umbrella of important data? Well, Akonadi passes all its metadata and information along to Nepomuk for indexing and actually uses Nepomuk for search in PIM. There may be a few corner case exceptions to single items of metadata in PIM that are not reflected in the relevant Nepomuk ontologies, so are therefore not pushed to Nepomuk for indexing. These however are few, far between and mostly non-important.

For a better idea of their roles, and the reason to have both Nepomuk and Akonadi then have a look at this link. It's very useful info: http://cmollekopf.wordpress.com/2013/02/13/kontact-nepomuk-integration-why-data-from-akonadi-is-indexed-in-nepomuk/

Heh, so in answer to your question (I get sidetracked easily), it is a simple matter of backing up Nepomuk to keep all your semantic information. You'll find that in the settings section for search in KDE there is a button there to automatically or manually perform Nepomuk backups.
Re:I was in the same boat by socceroos · 2013-04-08 01:01 · Score: 1

As a note, I think the backup button for Nepomuk is only in the newer versions of KDE (4.8+).
Re:I was in the same boat by AvitarX · 2013-04-08 04:29 · Score: 1

I still think it'd make sense if all user generated meta data was stored in extended attributes. Then I could back up a file with a program like back in time, and restore it later, or to a different system, and when it got indexed, all my tags etc would be associated with it still.
I imagine it'd make display of tags slightly quicker too, to have them organized in an item -> data system (the words from your link).

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:I was in the same boat by socceroos · 2013-04-08 09:24 · Score: 1

Having the metadata in the file would have the benefit of the data following the file, but you couldn't realistically store all ontology based metadata in the file itself. It would lead to a fair amount of file bloat and also would slow down search hugely as all metadata is distributed across your storage and filesystem.
Re:I was in the same boat by AvitarX · 2013-04-08 16:47 · Score: 1

I don't think the meta data should be searched off the filesystem, I simply think it should be saved with the file.
Full text search isn't searched with the file, but it's still there. I can't think of any logical reason that a file's tags shouldn't be included as extended attributes. I should be able to back up a folder using rsync, and trust the integrity of my backup.
When I copy the file to a new system, the program that indexes should be able to read the extended attributes, and add those to the index too. How much is truly added to these files relative to an inode? I can't imagine it adding much bloat at all, the largest thing indexed I assume is the full text, and that obviously wouldn't need to be indexed, as it obviously already moves with the file.
things like tags, labels, and icons should move with the file, and hopefully various DEs can arrange to agree on certain things. But even if they didn't, tags would be readable outside of KDE (even if they weren't fast searchable).
The way I see it, the searching/finding database should be redundant of things already available, just faster, and with extended attributes I don't see why that can't be the case.
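To make the idea concrete, here is a minimal Python sketch of tags kept in an extended attribute. It is only an illustration: the attribute name `user.tags` and the comma-separated encoding are made up for this example (not a KDE or Nepomuk convention), tags containing commas would break it, and `os.setxattr`/`os.getxattr` only exist on Linux.

```python
import os

TAG_ATTR = "user.tags"  # hypothetical attribute name chosen for this sketch


def encode_tags(tags):
    """Store tags as a sorted, de-duplicated, comma-separated UTF-8 value."""
    return ",".join(sorted(set(tags))).encode("utf-8")


def decode_tags(raw):
    """Inverse of encode_tags; an empty value means no tags."""
    text = raw.decode("utf-8")
    return text.split(",") if text else []


def write_tags(path, tags):
    """Attach the tags to the file itself (needs a filesystem with user xattrs)."""
    os.setxattr(path, TAG_ATTR, encode_tags(tags))


def read_tags(path):
    """Read the tags back; a missing attribute is treated as 'no tags'."""
    try:
        return decode_tags(os.getxattr(path, TAG_ATTR))
    except OSError:
        return []
```

An indexer on a new system could call something like `read_tags()` while crawling and rebuild its database from that. One caveat: rsync only carries extended attributes along if you pass `-X` (`--xattrs`).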

--
Wow, sent an e-mail as suggested when clicking on "use classic" banner, and got a fast response that addressed my msg
Re:I was in the same boat by socceroos · 2013-04-09 11:18 · Score: 1

Tags would still be readable outside KDE. Nepomuk is an independent system developed and funded officially by the EU. It cost over 17 million euros to make.

Gnome's Zeitgeist uses Nepomuk Ontologies, as does Ubuntu's Unity.

http://en.wikipedia.org/wiki/NEPOMUK_(framework)

OpenKM by Anonymous Coward · 2013-04-07 07:50 · Score: 3, Informative

OpenKM (http://www.openkm.com/en/) is what I use to manage my documents, its tagging and document preview features are what I appreciate most. It runs as a web-service, FYI.

Re:OpenKM by alhirzel · 2013-04-08 01:50 · Score: 1

OpenKM looks like a great piece of software. It even integrates with OCR.

muddle headed post by Anonymous Coward · 2013-04-07 07:52 · Score: 2, Interesting

by definition, "important" = keep original (I mean seriously, are u that short of basement space ??)
Electronics are ephemeral; You can, today, read stuff on papyrus, as long as you know the language..do you really want to trust stuff that is important to ephemera electronics ?
(i mean, how many times has /. gone over this - is this the editors idea of a yearly question ?)

tagging is an inherently stupid idea; it may be the best that you can do with current technology, but a google-like full text search is much much better (tell me - if you want to pull out a piece of information you know is on your hard drive in a pdf, do you look for the pdf, or just google it ?)

it is possible, after 5 or ten years, you might know what tags you want....
tagging is hard work, that you have to do manually consistently; better to have 3 or 4 folders organized by client/project then tag

Re:muddle headed post by techno-vampire · 2013-04-07 08:11 · Score: 1

Basements aren't as common as you think they are. I've always lived in Southern California, and I've never lived in a house with a basement. At most, there's been a crawl space under the house, but that's not exactly a good place to store things. And, I suspect there's a typo in TFS that the editors didn't catch: it says, "I would like to scan everything, and only retain the papers for things that don't require the original copies." and I think it should read, "...that do require..." because as written, it makes no sense at all.

--
Good, inexpensive web hosting
Re:muddle headed post by Rinisari · 2013-04-07 09:07 · Score: 2

You are correct. I meant to keep only the things I need originals of: birth certificate, car titles, etc.
As for physical space, I have better things than documents to store in my available basement space: wine, beer, computers, etc.

--
Colin Dean Go a year without DRM
Re:muddle headed post by techno-vampire · 2013-04-07 09:18 · Score: 2

I don't even have the originals of my birth certificate, discharge papers or DD 214, and haven't in decades. However, my father registered my birth certificate at the Hall of Records, and I did the same with my discharge papers and DD 214 after I got out of the Navy so I don't have to worry. In fact, in Los Angeles, where they're registered, any veteran can get two copies of his service papers for free, any time they're needed, so why keep the originals? And, once when I was down there to request copies, I ran across my father's, although I've never had a reason to request them. Still, it's nice to know how long they hang on to things like that.

--
Good, inexpensive web hosting
Re:muddle headed post by ShanghaiBill · 2013-04-07 09:42 · Score: 2

Electronics are ephemeral; You can, today, read stuff on papyrus, as long as you know the language..do you really want to trust stuff that is important to ephemera electronics ?
This is just completely backwards. Electronic documents are the least likely to get lost or destroyed. I have no receipts or papers from 25 years ago. But I have all my email from those days. With e-docs, you can make multiple copies, store copies off-site, etc. Every email I have ever sent, every non-spam email I have ever received, all the source code I have ever written, over 10,000 family photos, copies of my marriage license, deeds, insurance forms, etc. etc. will ALL fit on a single XD card smaller than my fingernail, and the XD card will fit in a keychain fob that I carry in my pocket. Other copies of all these docs are on my laptop, on my desktop, at my parent's house, on a server outside the USA, on an SD card in a ziploc bag taped to the bottom of my will, etc.

(i mean, how many times has /. gone over this - is this the editors idea of a yearly question ?)
Apparently not enough. Every time it comes up, the general consensus is the opposite of what you recollect.
Re: muddle headed post by DigiShaman · 2013-04-07 10:14 · Score: 1

I think the parent poster was referring to a theoretical future digital dark age. Societal collapse, EMP, etc. Honestly though, if any of the aforementioned were to occur, you have bigger problems to worry about. So ya, I wouldn't sweat it.

--
Life is not for the lazy.
Re:muddle headed post by Genda · 2013-04-07 12:26 · Score: 2

M-Disk now finally allows you to make good archival high-density storage (DVD). Combine this with a good document management tool (like Docmoto on Mac) and you can pretty much be assured of managing all your paper-to-electronic needs elegantly. Additionally, a lot of HEAVY DUTY Document Management Applications (e.g. Docfinity) have sophisticated Business Process Management tools included to control process flow for those documents. One cool feature of these tools is the ability to parse metatags from file names, or to import files (often in CSV format). You could store your documents in an intelligently organized directory tree, and keep a central spreadsheet with file location and name and the metadata you want to maintain for those documents. At some time in the future you could export your spreadsheet and use it both as the information needed to import those documents and add the necessary tags to those documents.
There are elegant solutions available, but I haven't seen any great open source ones yet; this whole process is still surprisingly new. Part of the problem is that it's still labor intensive and expensive, and the problem space remains poorly defined.
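The central-spreadsheet idea could be sketched in Python roughly like this. The column names (`path`, `date`, `tags`) and the semicolon-separated tag field are assumptions picked for the example, not the import format of any particular document management product.

```python
import csv
import io

FIELDS = ["path", "date", "tags"]  # hypothetical columns for this sketch


def write_index(rows, stream):
    """Write one CSV row per document; tags are joined with ';'."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "path": row["path"],
            "date": row["date"],
            "tags": ";".join(row["tags"]),
        })


def read_index(stream):
    """Read the CSV back into dicts, splitting the tag field again."""
    return [
        {
            "path": record["path"],
            "date": record["date"],
            "tags": record["tags"].split(";") if record["tags"] else [],
        }
        for record in csv.DictReader(stream)
    ]
```

Since it is plain CSV, the same file opens in any spreadsheet and can be remapped later to whatever columns an importer expects.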
Re:muddle headed post by darkfeline · 2013-04-07 12:27 · Score: 1

Tags work just fine. Organizing files in folders is just one-dimensional tagging. Lots of people in lots of different areas use tags (Firefox bookmarks, Gmail labels, pictures/music). Tagging can also be automated (You can think of google search just as a really complex, automatic, learning tagging system). Of course, by all means keep the originals for important documents, but I'll not stand someone bad-mouthing tagging as a concept.
Re: muddle headed post by Genda · 2013-04-07 12:28 · Score: 2

Take two stone tablets and call me in the morning...
Re:muddle headed post by Miamicanes · 2013-04-07 20:14 · Score: 2

One small detail to add... M-Disk is the best there is *if* you need or care about DVD-ROM compatibility, but for roughly the same price per disc, you can get non-LTH BD-R discs with roughly 5-6x the capacity. M-Disk is basically a non-LTH BD-R disc with the track geometry of a DVD-ROM. Either way, non-LTH BD-R and M-Disk are the way to go if you want long-term passive archivability (ie, the ability to write a disc, throw it in a box, forget about it for 25 years, and still be able to read it). While there aren't any guarantees that DVD or Blu-Ray will be mainstream 25 years from now, I'd feel pretty safe betting that someone will sell drives capable of reading them without drama, even if doing anything useful from that point requires a bit more work.
Non-LTH BD-R discs rock. They're by far the best long-term media we've ever had (well, with the possible exception of Magneto-Optical discs from ~10 years ago, which is basically what non-LTH BD-R discs *are*). LTH discs, though, are pure shit. They exist solely to enable factories to crank out BD-density media using the same unreliable organic dyes that we've been suffering with for ~15 years. They've gotten better, of course, since the first CD-Rs came out 15 years ago, but they aren't anywhere NEAR the same league as magneto-optical technology when it comes to archival stability (MO works by using the laser to melt & liquefy a substrate, then using a magnet to quickly orient reflective particles floating in the melted substrate before it re-solidifies for all eternity. Organic dyes start out light, then darken when burned by the laser... or sunlight... or possibly even slow chemical oxidation over time).
Re:muddle headed post by david_thornley · 2013-04-08 06:43 · Score: 1

Most of what was written on papyrus was lost. Anybody interested in BCE history is going to have a list of documents they wish had been preserved. Many ancient documents were preserved by being copied and re-copied, which works just as well for electronic documents.
Paper documents (coming up to the present) take volume, have weight, are hard to copy in bulk, and can be destroyed by many different things (fire, flood, rodents, as examples). Electronic documents are trivial to copy in large quantities and easily stored in many locations since they are light and take little space. They're also far easier to search, and you can produce paper copies whenever you want.
Paper is somewhat more likely to last through the fall of civilization, but I don't really have many personal documents that will be useful after that.

--
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes

simpler = better by Anonymous Coward · 2013-04-07 07:52 · Score: 1

I do this on Windows using the cheapest HP all in one with ADF with its bundled scan to PDF with OCR. I use an encrypted TC volume for storage. 512MB is plenty for several years worth at 300dpi b/w. The less typing you have to do the better. Just use one folder for each major category. House, Taxes, utilities, etc. Don't make yourself work too hard entering each item or you will never get around to scanning.

This again? by turkeyfeathers · 2013-04-07 07:53 · Score: 5, Funny

Similar questions to yours appear here regularly. The consensus is that it's best just to throw the bills and documents out and spend more time watching porn.

Re:This again? by AmiMoJo · 2013-04-07 23:36 · Score: 1

In all seriousness most of them can just be thrown in a box. They are automatically sorted by date and chances are 99% of them will never be looked at again. That one time you need one you can just sift back through the pile, and you still saved a huge amount of time and money compared to having a proper electronic document system.

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

Try Alfresco by Anonymous Coward · 2013-04-07 07:56 · Score: 2, Interesting

You can try Alfresco DMS.
It requires a webserver so it might be too-much for a single user.

iDocument by Idimmu+Xul · 2013-04-07 07:56 · Score: 2

http://www.icyblaze.com/idocument/

iDocument for the mac is like iTunes but for documents. It lets you import documents (pretty much any type) and tag them and store them in virtual or real folders, it sounds like it's exactly what you're after.

--
The problem with slashdot is that most of its users were bullied and stuffed into lockers as kids!

Re:iDocument by Rinisari · 2013-04-07 09:10 · Score: 1

Thanks for this. It's pretty damned close to what I want, the sole exception being that it's not open source and not cross platform. I might go in on it anyway if I can't find something better.

--
Colin Dean Go a year without DRM
Re:iDocument by ananamouse · 2013-04-08 08:28 · Score: 1

I have converted over to Neat. I have both their sheet feed scanner and also it imports from the Fujitsu. I hate that it is closed source trap but it works very well. I put the scanned documents in a plastic box, when it is full I put the end date on the box, put in some moth balls, tape it shut, and bury it in the back pasture (I suffer from the hoarding gene and this lets me cope.)

My Workflow by Orphaze · 2013-04-07 07:57 · Score: 5, Interesting

1) Receive document.
2) Scan with Fujitsu ScanSnap S1500 in about 10 seconds. $380 on sale, but so far it's so much better than cheap all-in-one scanners it's not even funny. Seriously, don't even bother going paperless unless you get a real document scanner.
3) Save PDF to simple software RAID-1 mirror of two 2TB drives. (Takes about 5 seconds to set up from Disk Management in Windows.) This should protect against sudden drive failure taking everything.
4) Back up nightly to external drive swapped off-site every other month. This should protect from accidental deletions, fires, etc. Bonus points if the backup drive is the ioSafe fireproof variety.
5) Throw away original. Only exception is official documents like titles, marriage certificate, etc.. Yes, I even throw away W2s and the like. My taxes are 100 percent digital nowadays.
6) Check and test restore from those backups on a semi-regular basis, and you're done!

Re:My Workflow by spire3661 · 2013-04-07 08:30 · Score: 2

I liked it up until you have windows managing a RAID. Get a RAID NAS running Linux. It seems odd to RAID up a couple of drives just to let windows mess them up. I suggest a Synology ds212. If you are really serious build a ZFS rig with snapshots.

--
Good-bye
Re:My Workflow by rastos1 · 2013-04-07 08:35 · Score: 1

I've been thinking about the same problem as the TFA recently. There is more to the task than simply scan and store.
What about assigning tags to the documents? Fast previews? Searching based on time/time range/tags/fulltext? Grouping related documents? Annotations? OCR?
I'm considering writing my own solution, but if there is something useful out there, I'd like to have a look.
Re:My Workflow by rastos1 · 2013-04-07 08:38 · Score: 1

One more thing I forgot: electronic signatures.
Re:My Workflow by sribe · 2013-04-07 08:44 · Score: 1

One more thing I forgot: electronic signatures.
What about them? For scanning and archiving, they're irrelevant.
Re:My Workflow by Rinisari · 2013-04-07 09:03 · Score: 1

OP here.
These features you list are examples of what I desire in a package that manages documents. I'm not as concerned with OCR, but that'd be a nice feature to have for the lengthier letters and such.

--
Colin Dean Go a year without DRM
Re:My Workflow by Rinisari · 2013-04-07 09:05 · Score: 1

That's actually a good feature I'd not considered. As a document is added to the system, sign it using PGP and store the signature. That way, I have reasonable certainty that the document has not been modified since initial ingestion, or at least a warning that it may have been compromised if the signature doesn't check out.
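A rough Python sketch of the ingestion-time check I have in mind. Note this only records SHA-256 digests, which is not a PGP signature by itself; the assumption is that the resulting manifest gets saved to a file and signed separately (for example with `gpg --detach-sign`), with the private key kept somewhere other than the archive. The function names are made up for this example.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record(manifest, path):
    """Remember the digest of a document at ingestion time."""
    manifest[str(path)] = file_digest(path)


def changed_files(manifest):
    """List documents whose current digest no longer matches the manifest."""
    return [path for path, digest in manifest.items() if file_digest(path) != digest]
```

If `changed_files()` ever returns something, that is the warning that a document may have been modified since ingestion.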

--
Colin Dean Go a year without DRM
Re:My Workflow by rastos1 · 2013-04-07 09:06 · Score: 1

First, I want to be able after years to verify that the scan was not modified. Second: There are countries that do recognize electronically signed documents as legal documents (if signed with a certificate issued by state-run CA). I did not actually check with a lawyer if this fulfills the requirements, but ... why not to have the option?
Re:My Workflow by sribe · 2013-04-07 09:24 · Score: 1

First, I want to be able after years to verify that the scan was not modified. Second: There are countries that do recognize electronically signed documents as legal documents (if signed with a certificate issued by state-run CA). I did not actually check with a lawyer if this fulfills the requirements, but ... why not to have the option?
For your own verification, OK. But no, no state-run authority is going to give any weight whatsoever to an image from your own archive that you signed yourself.
Re:My Workflow by rastos1 · 2013-04-07 09:37 · Score: 1

I can get a certificate from state-run CA. That means I can authenticate myself to a state-run service. Why not have a state-run service that produces a signature for a document (or encrypted document or document digest) that I upload? That would be a great service. It would even take some workload off the notaries (which would certainly make them very "happy") ... I should patent that.
Re:My Workflow by sribe · 2013-04-07 10:47 · Score: 1

Why not have a state-run service that produces a signature for a document (or encrypted document or document digest) that I upload?
Because it says absolutely nothing about the authenticity of the document which you provide.
You're talking about scanned documents--documents from other sources which you allegedly scan, allegedly without modifying them, before signing them. The only authentication anybody else would be interested in would be authentication by the document producer, not by you, because you could perform any amount of modification/forgery before signing the document.
This is all very different from documents which you produce yourself, where authentication by you does have value.
Re:My Workflow by Rinisari · 2013-04-07 12:20 · Score: 1

You clearly don't do CACert assurance :-p
If your house were to burn down this evening, your bank accounts emptied, and someone hacked the IRS, state, and local government records to show that you have not paid your taxes, how would you prove otherwise?

--
Colin Dean Go a year without DRM
Re:My Workflow by fast+turtle · 2013-04-07 13:18 · Score: 1

So you lost power, causing a corrupted FS. Why in hell didn't you have even a cheap battery backup? All of the computers here have even a simple 300w unit that gives 5 minutes of run time (long enough to safely shutdown)

--
Mod me up/Mod me down: I wont frown as I've no crown
Re:My Workflow by Bert64 · 2013-04-07 19:19 · Score: 1

If you store the signature in the same place, then anyone in a position to modify the document can simply generate a new signature too.

--
http://spamdecoy.net - free throwaway anonymous email - avoid spam!
Re:My Workflow by Miamicanes · 2013-04-07 20:50 · Score: 1

Has anybody ever reverse-engineered the Scansnap protocol, or at least come up with some way for people like us to either launch their .exe (or maybe use one of their .dlls directly) to tell the scanner, "Scan everything in the feeder as a 300dpi double-sided grayscale .jpg with filenames that follow ${some-template}, then let me know when you're done and how many pages you actually scanned?"
I hate... Hate... HATE the software that came with my ScanSnap FI-5110EOX2, because it makes it a total pain to do anything that resembles bulk scanning. I have crates of old stuff I'd love to scan and toss, but they make it so fsck'ing cumbersome, with so much realtime babysitting required, I've never gotten around to ever making a dent in the mountain. There's no fast & easy way to verify if a page mis-scanned, or two pages stuck together, or whatever, without actually stopping to load the pdf it just created and look at every page. I've wished countless times I could just count the pages, stick them in the feeder, tell it how many pages it should find in the feeder, and tell it to go... and notify me immediately if the number it finds doesn't match the number I said were supposed to be there.
I could swear I remember a program from ~10 years ago (I think its name was something like MagicDesk or something) that could actually do intelligent human-guided document recognition. Basically, you'd scan a document. If it had no idea what it was, you could tell it, "This is a Capital One bill... look at (select rectangular area for logo) to recognize it as such, and look (select another rectangular area with account #) here to figure out which account it is, and look (select a third rectangular area where the date is)." From what I remember, the program's fatal flaw was that it saved data in some proprietary format that couldn't be exported into anything useful, and (from what I remember) it didn't work on newer versions of Windows. The weird thing is, I don't think any program that ever came out since that time ever did anything comparable... and ${deity} knows, I've looked for one. When you have literally two dozen bankers' boxes of old stuff going back 10 years, scanning to one pdf document at a time just isn't going to cut it, nor is any workflow that requires major menu-navigation, dialog-swatting, and staring at a "busy, please wait" screen for a minute or more per document.
There *HAS* to be some better bulk-scanning workflow that lets you split up the scanning and identification/naming into two different parts... say, one part that focuses on scanning everything, with safeguards to avoid missed pages that basically just scans everything to a directory full of color or grayscale jpeg images, then a second step that makes it easy to assemble that directory full of sequentially-named images into coherent documents that you can then walk away from and let it chew on the files after giving it hints and identifying them.
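A minimal sketch of the page-count safeguard described above, in Python. It assumes the scanner has already dumped one image per page into a directory; the function name check_batch and the list of image suffixes are made up for illustration and are not part of any ScanSnap software.

```python
from pathlib import Path

# Assumed set of image types the scanning pass writes out.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def check_batch(scan_dir, expected_pages):
    """Compare the number of page images in scan_dir with the count typed in by hand."""
    pages = sorted(
        p for p in Path(scan_dir).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
    found = len(pages)
    if found == expected_pages:
        return True, f"OK: {found} pages scanned"
    diff = found - expected_pages
    problem = "extra" if diff > 0 else "missing"
    return False, f"MISMATCH: expected {expected_pages}, found {found} ({abs(diff)} {problem})"
```

Run right after each stack goes through the feeder, a False result would be the cue to re-scan before the paper gets shredded. It only catches miscounts (two pages stuck together, a page fed twice), not a badly scanned page.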
Re:My Workflow by tlhIngan · 2013-04-08 03:36 · Score: 1

If you store the signature in the same place, then anyone in a position to modify the document can simply generate a new signature too.
Not without the key they can't.
You're confusing a signature with a hash. The latter can be easily regenerated. A signature cannot, because producing one requires signing the hash with the private key. Change the hash, and the signature cannot be verified anymore.
It's why people are interested in breaking the hash algorithms - break SHA2-224/256/384/512 and anything signed with those algorithms is vulnerable.
You can sign it with your private key, and thus let anyone with the public key verify its authenticity (this is how stuff like DRM usually works - including signed bootloaders and such). Or you can sign it with the public key and only you know if it's authentic or not.
In this case, you want to sign the document with your private key because anyone can re-sign with a public key. The only way to alter the document would be to obtain the private key. If they signed the altered document with your public key, you can't use the public key to verify it.
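The hash-versus-signature distinction can be sketched in Python. One loud assumption: the standard library has no public-key signing, so this uses HMAC with a secret key as a stand-in for the private-key signature. That is a swapped-in technique, not what the post describes; with a real signature scheme (e.g. via gpg), verification would use the public key rather than the same secret. The function names are illustrative only.

```python
import hashlib
import hmac


def plain_hash(document: bytes) -> str:
    """Anyone holding the document can recompute this, including an attacker."""
    return hashlib.sha256(document).hexdigest()


def keyed_tag(document: bytes, secret_key: bytes) -> str:
    """Stand-in for a signature: cannot be produced without the key."""
    return hmac.new(secret_key, document, hashlib.sha256).hexdigest()


def verify(document: bytes, tag: str, secret_key: bytes) -> bool:
    """Constant-time check that the tag matches this document under this key."""
    return hmac.compare_digest(keyed_tag(document, secret_key), tag)


key = b"kept-somewhere-the-attacker-cannot-read"   # hypothetical key
original = b"Invoice total: $100"
tampered = b"Invoice total: $900"

stored_tag = keyed_tag(original, key)

# An attacker who edits the file can regenerate a matching plain hash...
forged_hash = plain_hash(tampered)
# ...but a tag made without the real key does not verify.
forged_tag = keyed_tag(tampered, b"attacker-guess")
```

Here verify(original, stored_tag, key) holds, while verify(tampered, stored_tag, key) and verify(tampered, forged_tag, key) do not, which is the point being argued: storing the signature next to the document is fine as long as the key lives elsewhere.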

You don't need a CMS by Anonymous Coward · 2013-04-07 07:58 · Score: 5, Interesting

So, I've been doing this pretty consistently for the past few years and sent this advice to some relatives asking basically the same question. (That's also why it's a little dumbed down.)

I haven't found a case where any sort of CMS makes more sense than the file system. This is after doing this for about 10 years, and I've got records going back to '01.

I'm using a Fujitsu ScanSnap and a Fellowes Powershred, and running Mac OS X. OS X has decent indexing, a good file system manager (really can't beat column view) and the Preview app will let you reassemble PDFs, which is occasionally very handy.

1. The enemy is copies. I strongly recommend "scan and shred", or you'll wind up scanning the same thing over and over.

1.1. Don't bother with any scanner that doesn't do double-sided scans.

1.2. Use a shredder. You can take things out of a trash can.

1.3. The scanner should come with OCR software. Choose "Searchable PDFs".

2. Do scanning in small batches.

2.1. Create two folders, "Scanned" and "Unfiled".

2.2. The scanned files go immediately into Scanned, and the paper immediately goes into the shredder.

2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.

3. If it takes any work to scan it, just shove it in a filing cabinet, or, better yet, just shred it.

3.1. If you're having to use a flatbed, it's too complicated to scan and you should file or shred it.

3.2. You can often get manuals and pamphlets and stuff online by googling part of the text or the product name.

4. Don't scan anything you can get electronically.

4.1. Most companies would much rather let you download bills and statements and such.

4.2. Most of them will also delete those statements after a few months, so get in the habit of immediately downloading the statement.

5. It's *very* helpful to put a date on everything. I generally do YYMMDD, trying to guess from dates I find in the document.

5.1. If it's a document covering a period of time, like a bill for the month of November, I use the ending date.

5.2. For tax documents I'll put TT-YYMMDD, where TT is the tax year, since the actual transactions occur that year, but filing and IRS stuff happens the year after.

6. I've found that even with full text search, you still need folders.

6.1. They just don't need to be extremely complicated; usually two levels seems to be fine. I'll put prior years into separate folders, too.

6.2. Your system will evolve as you work; just get it in there, and then be mindful of what you are commonly looking for.

6.3. Keep books and reference manuals in a folder that doesn't get indexed. (Spotlight has an option for this.) They tend to create a lot of spurious hits.

7. Keep your inbox clean: if an email wants you to download a statement, get it right away and put it in Unfiled.

7.1. Likewise, keep your desktop clean: scan and shred stuff as soon as it comes in.

7.2. Have a periodic to-do item to tidy your files, but don't spend more than half an hour (tops!) at any given time.

Re:You don't need a CMS by sribe · 2013-04-07 08:46 · Score: 2

2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.
god, no! Give it a sensible name and put it where it belongs to begin with; don't deal with the same document multiple times.
Re:You don't need a CMS by Rinisari · 2013-04-07 09:13 · Score: 1

Thanks for this. This is definitely a workflow I need to model.

--
Colin Dean Go a year without DRM
Re:You don't need a CMS by overlordofmu · 2013-04-07 09:17 · Score: 5, Insightful

Disclaimer: I know this will seem pedantic but I am trying to get people to think about problems in the long term (solutions that work for thousands of years, not hundreds).

If we use the format YYYY-MM-DD for dates (for instance 2013-04-07), they sort both alphabetically and numerically, they are easy for human eyes/minds to parse at a glance (my apologies to the vision impaired) and there won't be a reason to change the format for approximately 7,895 years (but who is counting, really).

Please see ISO 8601: http://en.wikipedia.org/wiki/ISO_8601

Obligatory XKCD: http://xkcd.com/1179/
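The sorting claim is easy to check. A small Python sketch follows; the helper name sorts_chronologically and the sample dates are invented for the demonstration.

```python
from datetime import date


def sorts_chronologically(dates, fmt):
    """True if sorting the formatted strings gives the same order as sorting the dates."""
    by_name = sorted(dates, key=lambda d: d.strftime(fmt))
    return by_name == sorted(dates)


# Sample dates chosen to straddle a year boundary and a century boundary.
recent = [date(2013, 4, 7), date(2012, 12, 31), date(2013, 11, 2)]
across_century = [date(1999, 12, 31), date(2001, 1, 1)]

iso_ok = sorts_chronologically(recent, "%Y-%m-%d")          # ISO 8601 style
us_ok = sorts_chronologically(recent, "%m-%d-%Y")           # month first
yy_ok = sorts_chronologically(across_century, "%y%m%d")     # two-digit year
```

With these samples iso_ok is True while us_ok and yy_ok are False: month-first names interleave years, and two-digit years put 2001 ahead of 1999. YYMMDD only works while every document falls in the same century.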
Re:You don't need a CMS by Anonymous Coward · 2013-04-07 10:01 · Score: 4, Insightful

4.1. Most companies would much rather let you download bills and statements and such.
And this is exactly why I HATE all of the "e-bill" solutions that every company has dreamed up at the moment.
They turn the problem from "the company remembers to SEND you a bill/invoice/paper" to "you have to go get the bill/invoice/paper FROM the company".
With paper bills/invoices/etc. sent through the US mail, they "remember" to do something, and I get an automatic reminder when the envelope appears in my mailbox.
With the e-bill solution, the most I get is an email reminding me to go log in and download the bill/invoice/paper. Now, notice what is wrong here. They just sent me a communication (hint, it's the reminder email) that could have functioned identically to the USMail envelope of carrying the bill/invoice/paper along with it right to my inbox, so when I receive the email, I ALSO receive the bill/invoice/paper itself (i.e., attach the bill/invoice/paper as a .pdf to the email).
Now, most companies will balk at that because "email is not secure" or "email is not private". Well, why don't you let me F****** upload a gpg public key to your system, and then your system could encrypt my bill/invoice/paper using my gpg public key, then attach it to the "reminder" email, and now we have an electronic system that functions identically to the old paper bill in the old paper envelope sent through the postal office.
They remember it is time to send me my bill, they create the .pdf (electronic equivalent to printing the bill on paper), they encrypt the pdf (electronic equivalent to sealing the bill in a mailing envelope), and they email me the item (electronic equivalent of giving the sealed envelope to the postal service).
But does any company implement this system? No, not one.
And so they will continue to mail me paper, and can continue to hound me to switch to "e-bills" all they like. But until their e-bills are done properly (as above) they won't get any buy in here.
Re:You don't need a CMS by reboot246 · 2013-04-07 11:44 · Score: 1

Amen, brother! I prefer to get my bills in the mail. Real, honest to goodness paper.

On the outside of the envelope I write the date I received the bill, how much it is, and when it is due. It goes into my mail sorter in the first bin. When I pay it online I write the confirmation number from the bank on the envelope and then put the envelope in the second bin. I keep the envelope until the next bill from that company comes in. That way I can see if the previous bill was actually paid on time or if there were any mistakes on their part or mine. If everything is okay, I shred the old bill, envelope and all. Then the cycle starts over again.

Important items like house payments or car payments are kept until they're paid off. I want a physical record I can use to prove I've made all the payments and that they were made on time. No, I don't trust banks or mortgage companies any further than I can throw them. Do you?
Re:You don't need a CMS by melikamp · 2013-04-07 13:10 · Score: 2

Yes, yes, yes. The submitter sounds like he wants to digitize a bunch of files, so I would recommend a good file system. Any stable filesystem will do, like ext4 for instance.
Avoid metadata within a file for as long as possible. It will bury you. If date and bill amount is all you need, then just stick them into the file name.
YYYY-MM-DD.amount.unit.short-description.pdf
2013-04-07.-3975.us-cents.how-much-this-advice-will-cost-you.pdf
Now you can pile your files into, say, ~/my-files/ in any way whatever. You can create a category tree, for example, to allow you to find files in a file manager in 3 clicks. For more complex tasks you can just use bash, find, and the rest of the userland. It does not get simpler or more portable than that. In particular, it is trivial to convert this structure into a CVS, which you can suck into a spreadsheet or a database of your choice.
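As a rough illustration of how little code the filename scheme needs, here is a Python sketch that turns names of the form YYYY-MM-DD.amount.unit.short-description.pdf into CSV text. The function names and the choice to silently skip non-matching names are assumptions made for the example, not part of the post.

```python
import csv
import io
from pathlib import Path

FIELDS = ["date", "amount", "unit", "description"]


def parse_name(filename):
    """Split 'YYYY-MM-DD.amount.unit.short-description.pdf' into its four parts."""
    stem = Path(filename).stem                       # drop the trailing '.pdf'
    date, amount, unit, description = stem.split(".", 3)
    return {"date": date, "amount": int(amount), "unit": unit, "description": description}


def names_to_csv(filenames):
    """Return CSV text for every filename that follows the convention."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for name in sorted(filenames):
        try:
            row = parse_name(name)
        except ValueError:
            continue                                 # name does not follow the scheme
        writer.writerow(row)
    return out.getvalue()
```

Feeding it the names under ~/my-files/ (for instance from Path.rglob("*.pdf")) gives something a spreadsheet can open directly. Storing the amount as an integer number of cents, as in the example filename, avoids a decimal point that would collide with the dot separators.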
Re:You don't need a CMS by jellyfoo · 2013-04-07 14:37 · Score: 1

Very good idea. I've been using the YYYYMMDD format and although it's close to the ISO, I have to admit that the hyphens definitely improve readability while retaining the ability to sort properly, so I think I'll change things now.
Re:You don't need a CMS by melikamp · 2013-04-07 14:58 · Score: 1

sed s/CVS/CSV/
Re:You don't need a CMS by LoRdTAW · 2013-04-08 02:45 · Score: 1

I don't understand the mindset either. To me Email is much more secure than snail mail. With email someone needs my password to gain access to my mail. With snail mail, all someone needs to do is open my mail box and walk away with my mail. You can make it a bit more secure by using a mail slot or lock box but a majority of people have the classic unsecure mailbox that is out in the open.
Re:You don't need a CMS by synaptik · 2013-04-12 14:35 · Score: 1

Maybe what is needed is some software that can remember to download for you? And alert you on those occasions where it fails to do so? I'm not saying that is superior to your suggestion... just, more likely to actually happen.
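The "alert you when it fails" half of that idea can be sketched without touching any company's website. The Python below assumes statements are saved with a YYYY-MM-DD prefix in the filename (a convention borrowed from other posts in this thread) and reports which months have no file; missing_months is a made-up name.

```python
from datetime import date


def missing_months(filenames, start, end):
    """Return 'YYYY-MM' strings from start to end (inclusive) that have no statement file."""
    seen = {name[:7] for name in filenames}          # 'YYYY-MM' prefix of each name
    missing = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        key = f"{year:04d}-{month:02d}"
        if key not in seen:
            missing.append(key)
        month += 1
        if month > 12:                               # roll over into the next year
            year, month = year + 1, 1
    return missing
```

A cron job could run this over each account's folder once a month and send an email whenever the returned list is non-empty.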

--
HSJ$$*&#^!#+++ATH0
NO CARRIER
Re:You don't need a CMS by jif · 2013-04-14 01:26 · Score: 1

Maybe what is needed is some software that can remember to download for you? And alert you on those occasions where it fails to do so? I'm not saying that is superior to your suggestion... just, more likely to actually happen.
Here's a service that claims to do just that: https://filethis.com/fetch/

Scan, OCR, and use your file system (and symlinks) by magic+maverick+ · 2013-04-07 07:59 · Score: 2

My suggestion would be to just scan and OCR your files, and then store them in your file system.
Hierarchy might be something like: ~/scans/year/project/sorted

Within each sorted subdir, you'd have three folders. Date, organizationThatGeneratedTheDoc and TypeOfDoc.
So in the folder ~/scans/year/project/sorted/org
The file names would be something like: organizationThatGeneratedTheDoc-yyyy-mm-dd-TypeOfDoc.pdf
In the folder ~/scans/year/project/sorted/TypeOfDoc
The file names would be like: TypeOfDoc-yyyy-mm-dd-organizationThatGeneratedTheDoc.pdf
Etc.

You'd use links (symlinks or hard links) to make sure that each document is accessible in more than one place. (You can also use links to put documents in more than one project folder.)

Types of documents would be things like invoices, receipts, legal threats, court orders etc. In the event that a document has more than one type, or more than one organization, you simply have more links. So invoice-2013-04-07-webdevteamawesome.pdf and legalthreat-2013-04-07-webdevteamawesome.pdf are the same document, because the first page is an invoice, and the second a threat to take you to court if you don't pay. (This then exists six times, three times for each type, but with the magic of hard links only takes up the space of 1.001 documents.)

With the OCRed text being saved with the PDF scan, you can also run text searches within your files to find specific information (such as bill amount, seriously, how often would you use that information?)

This allows you maximum flexibility, and prevents you from being locked into a particular piece of software (as you can do everything manually). Moreover, once you've got it set up, it's easy to run with each new document.
Steps would be:
1) Scan and OCR doc, saving the PDF into the staging area folder.
2) Run your script, which asks for the date, project, org name, doc type.
3) The script then saves the document in the appropriate folders, generating links as required.
4) Profit!
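Steps 2 and 3 could look roughly like the Python sketch below. It leaves out the prompting (input() would do) and just files one staged PDF under the three views using hard links. The lowercase "date" folder name and the date-first filename are guesses, since the post only spells out the org and TypeOfDoc patterns; file_document is a hypothetical name.

```python
import os
import shutil
from pathlib import Path


def file_document(staged_pdf, scans_root, year, project, date, org, doc_type):
    """Move a staged PDF into sorted/date and hard-link it into sorted/org and sorted/TypeOfDoc."""
    sorted_dir = Path(scans_root) / str(year) / project / "sorted"
    targets = [
        sorted_dir / "date" / f"{date}-{org}-{doc_type}.pdf",        # assumed pattern
        sorted_dir / "org" / f"{org}-{date}-{doc_type}.pdf",
        sorted_dir / "TypeOfDoc" / f"{doc_type}-{date}-{org}.pdf",
    ]
    first = targets[0]
    first.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(staged_pdf), str(first))
    for link in targets[1:]:
        link.parent.mkdir(parents=True, exist_ok=True)
        os.link(first, link)          # hard link: same data, another name
    return targets
```

Hard links only work within one filesystem, so the whole ~/scans tree has to live on a single volume; os.symlink would lift that restriction at the cost of dangling links if the first copy is ever moved.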

--
HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!

SnapScan. by Anonymous Coward · 2013-04-07 08:09 · Score: 1

http://www.amazon.com/ScanSnap-S510M-Instant-Sheet-Fed-Scanner/dp/B000WJCX18/ref=sr_1_31?s=pc&ie=UTF8&qid=1365365308&sr=1-31&keywords=archive+scanner

The above comes highly recommended as an all-in-one solution.

Re:SnapScan. by thomasw_lrd · 2013-04-07 09:24 · Score: 1

I've used owl document repository, but it needs a webserver to run. (http://www.doxbox.ca/) It's pretty nice, and it can do full text search (sometimes).

--
21st Century Renaissance Man

Re:Scan, OCR, and use your file system (and symlin by magic+maverick+ · 2013-04-07 08:10 · Score: 1

I should note, you need to be careful to make sure you use the same spelling and wording for each org and doc type. You don't want to end up with Murphies, Murphy's, Murphy's Inc., Murphpy's Beer Company Inc. etc., each with invoice, inv., invoise and envoice.
It would be better if your script forced you to pick a doc type, and showed a list of already existing companies.
This applies no matter what solution you end up running with.
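One way a filing script could nudge you toward existing spellings is fuzzy matching against the names already in use. The sketch below uses Python's difflib; the function name suggest and the 0.6 cutoff are arbitrary choices for the example.

```python
import difflib


def suggest(typed, existing, cutoff=0.6):
    """Return existing names that look like the one just typed, best match first."""
    lowered = {name.lower(): name for name in existing}   # compare case-insensitively
    hits = difflib.get_close_matches(typed.lower(), list(lowered), n=3, cutoff=cutoff)
    return [lowered[h] for h in hits]
```

The script would call suggest() with the org or doc type just entered and the folder names already on disk, then ask "did you mean ...?" before creating anything new. An empty list means the name really is new.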

Also, for documents that cover a period, you have multiple options. The first is to give 00 as the day and month (e.g. 2012-12-00), and the second 01 as the start (e.g. 2012-12-01). Another is to have two dates (2012-12-01-to-2013-01-01) in place of the yyyy-mm-dd suggested in my first post. Also, don't even think of having the dates in any other order than year, month, day.

Some places have a working year (e.g. a tax year) that crosses two calendar years. In that case, you should be careful about where you put documents. Because if you put them in the first year, and then go "OK, it's been 7 years, and I no longer need any docs from 2005", you'll be burnt. A solution is to hardlink them into both years.

Do post back when you have a solution!

--
HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!

Mayan EDMS by Rob+the+Roadie · 2013-04-07 08:16 · Score: 2

I've played with this a few times, never used it in anger though.

http://www.mayan-edms.com/

I might take up your challenge on going paperless too and give Mayan a go.

Tossing hat into the ring for DJVU format. by Areyoukiddingme · 2013-04-07 08:19 · Score: 3, Interesting

PDF is big and bulky. DJVU format makes for tiny document scans. And there are open source libraries for creating it, available even in Debian. Wavelet compression did finally make it into the wild. It's just nobody has ever heard of it, for some reason.

Doesn't help for organization, but it should be a reasonable option for storage.

It even embeds the OCR text in the document along with the image version, so it doesn't proliferate multiple copies of the same data.

Re:Tossing hat into the ring for DJVU format. by magic+maverick+ · 2013-04-07 08:28 · Score: 1

You do realize that PDF can store the OCRed text alongside (or above? it's another layer) the original scanned text?
Also, because the reference software of DJVU is GPLed, it's never going to see widespread commercial use (as all the big software companies only want to take and take).
A standard subset of PDF (e.g. PDF/A) is a much better option, and if you're worried about the amount of file space taken up, you can always use GZIP or ZIP.

--
HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!
Re:Tossing hat into the ring for DJVU format. by spire3661 · 2013-04-07 08:35 · Score: 1

And as counterpoint; I bought a 32 GB, class 10 MicroSD card with adapter for $22.19 USD yesterday.

--
Good-bye
Re:Tossing hat into the ring for DJVU format. by Inda · 2013-04-07 21:21 · Score: 2

PDF only wraps around the PNG, JPEG, BMP, generic_image_format. The extra bloat is only a couple of kb.

If the bloat is more, the PDF has been generated incorrectly.

--
This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
Re:Tossing hat into the ring for DJVU format. by Miamicanes · 2013-04-08 01:38 · Score: 1

Flash media, by SanDisk's own reckoning in one of their white papers, has a realistic lifespan of about 10 years: roughly 5 years until the first unrecoverable read error for MLC media and roughly 15 years for SLC media, with both unlikely to be directly readable by any consumer operating system after ~20 years due to accumulated errors, at which point they require professional data recovery. (SLC takes longer to get to its first hard error, but once the avalanche begins, it accelerates until SLC becomes as unreadable as MLC.) Remember, flash media is like a leaky bucket that starts to drip the day you format it. Flash is absolutely NOT even REMOTELY close to being a passive long-term archival medium. The only thing we have RIGHT NOW that can be remotely considered suitable for long-term passive archival storage in the consumer realm is non-LTH BD-R discs.
Re:Tossing hat into the ring for DJVU format. by fritsd · 2013-04-08 03:58 · Score: 1

That's what the Internet Archive uses, isn't it?

--
To be, or not to be: isn't that quite logical, Slashdot Beta?

Re:Scan, OCR, and use your file system (and symlin by whoever57 · 2013-04-07 08:24 · Score: 1

Your suggestion is over-complicated IMHO. I use Xsane and scan as multi-page documents. Xsane allows me to add pages to the scan set and reproduce a new PDF file. There are some downsides to my method: I need to have an approximate idea of the date of the document that I am looking for.

I generally file by //.pdf, although I may vary the hierachy if appropriate, for example: TAXES//.pdf

Perhaps more important, though, is to extract the data into some form of record keeping (even if it is only a spreadsheet) at the time that it is saved. Then, unless I am being audited, I really don't need the scans.

--
The real "Libtards" are the Libertarians!

Hosting on a NAS by ericdano · 2013-04-07 08:27 · Score: 1

I have been trying to do this for a while. I have a ScanSnap S1500M and have been hosting all the PDFs on my Synology NAS. However, programs like iDocument don't support network drives and text searching PDFs. They rely on Spotlight's database, and Spotlight doesn't work on a NAS (though it supposedly does work on an Apple server).

I'd LOVE some sort of text searchable solution that is better. I do use iDocument, but that has a LOT of limitations, like it will not handle ePUBs. I'm hoping at some point Synology will create an App for its line of units similar to something like Evernote. They already have two great Apps that allow you to stream Audio and Video from your Synology unit to an iOS or Android phone and computer. And they also have a Dropbox-like App. The last piece they really need is some sort of document management thing that works with their stuff. That would be a perfect solution for someone who has a lot of documents or a small business which doesn't want to have its data in the hands of Google or other companies.

--
It's either on the beat or off the beat, it's that easy.
I moderate therefore I rule!
--

Owncloud? by bazorg · 2013-04-07 08:37 · Score: 2

Maybe that Owncloud thing will work well to handle the storage and access. Does anyone know if its search function is any good?

Alfresco by Balr0g · 2013-04-07 08:39 · Score: 3, Informative

I use the community edition of Alfresco for that task. You can tag all documents, add custom fields and have full text search and versioning out of the box. Documents can be accessed via web interface, smb, ftp and even imap.

I wrote a script to do exactly what you are saying by Ogi_UnixNut · 2013-04-07 08:44 · Score: 1

My situation is the same, except that I move often, and have to keep legal documents for a few years (typically 5). I also have paper copies of invoices and bills (loads). I didn't want to have to lug boxes and boxes of paper, so I developed a script to do the following:

1) Scan the document page by page, and save as tiff (300dpi)
2) Run open source OCR on it, and save the resulting text to the tiff "comment" field on the metadata
3) Save it in my file server.
4) Index it with a desktop search program (here is a list: http://en.wikipedia.org/wiki/Desktop_search). This has the nice facility of scanning the metadata and allowing you to search it. This way I can search documents by text, despite the fact that OCR is not 100% correct (it is usually correct enough for me to find the document I want), while having the original page in photocopy quality as a TIFF (this is very important for legal documents, as OCR'd versions are not acceptable replacements).

I have been wondering whether it would be worth open sourcing the script (for the moment it is a bit hacky, but it has been serving me well for years now). If the TIFFs take up too much space for your liking, substitute PNG/JPEG/etc.

So far it has served me well, I've been collecting hundreds of documents this way. The only manual step is the script requesting a filename (not a big deal for me, as I have to manually put each page into the scanner anyway).

If you are interested let me know, and I can post the script.

Re:Scan, OCR, and use your file system (and symlin by whoever57 · 2013-04-07 08:46 · Score: 1

Arrgh... /. ate my filenames, even though I posted it as Plain Old Text:

I generally file by //.pdf, although I may vary the hierachy if appropriate, for example: TAXES//.pdf

Should be: I generally file by <TOPIC>/<YEAR>/<MONTH#>.pdf or perhaps <TOPIC>/<YEAR>/<MONTH#>/scans.pdf. I use other variations to the hierarchy if appropriate, for example: TAXES/<YEAR>/<Type_of_Form>.pdf. So all W2s for a particular tax year would be in the same PDF file.

All scanned invoices for a particular year/month would be in the same PDF file and in the same directory as any downloaded invoices.

It's not important that I use the same hierarchy everywhere. I use the hierarchy that will make it easiest to find the document in the future, and that varies according to what I am filing.

--
The real "Libtards" are the Libertarians!

Mac OS X Finder by supercrisp · 2013-04-07 08:47 · Score: 1

Thinking about this question, I checked the folder in which I keep research and notes for my primary area of study. It's 2GB and just under 2,000 separate files. Many of these are OCRed PDFs, some mp3, some .doc, .rtf. Mac OS X's indexing lets me do adequately quick find-by-content searches, and a relatively simple organizational schema for subfolders lets me consult categories of data swiftly. I also use a reference manager program that probably has close to 100 keyword tags, and Finder lets me get to stuff as quickly, so I'm assuming creating some sort of metadata beyond filename, date, and filetype is really unnecessary. I'd say just relax and throw the stuff in a folder in Finder, and back that up somewhere while also using something like SpiderOak. My work requires frequent and specific searches over this fairly large data set, so if this system works for me, it would probably work for you, unless you plan on getting OCD with your OCR and scanning every Wally World receipt. Anyway, my advice is to keep it simple. Life is too short to diddle around with stuff like this.

Re:Neat? by evilviper · 2013-04-07 08:50 · Score: 1

Sadly there is still nothing like the Neat scanner system for Linux. Something that, preferably, OCRs and indexes your documents for easy searching and retrieval. At the least something that indexes, even if you have to manually populate the fields. Nothing at all after years of hoping.

There are NUMEROUS document/content management systems for Linux (and have been for years), any of which will do VASTLY more than the dumbed-down "Neat" system.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

A couple questions by The+Optimizer · 2013-04-07 08:52 · Score: 1

Just a couple questions come to mind:

First: What is the purpose of keeping the information? If it's just to have a record for your own sake of what and when and how much, do you even need to scan the statement or receipt or keep the original? or can having all the info imported into a money manager be enough?

I've been using Quicken for over a decade (still using Quicken 2000 actually as later versions are bloaty) to keep all my financial history in detail. For answering questions like "When did I buy that Belkin KVM switch so I can see if the warranty period has expired", searching the register is good enough as I add enough info to the memos. In this example (real one from just a week ago), finding the information easily was enough, and it's to my advantage to have all the individual statements and detail items combined into larger account histories rather than parse an archive tree full of pdf/ocr files. (FWIW: even this old version of Quicken lets me attach scans of receipts to entries.)

Second Question: In what cases is the Original Paper required as opposed to a scan? If you need to show an original statement, receipt or other document to prove some thing or get something approved, do you know when an electronic copy or reproduction is as acceptable as the original? I don't think this is an area with consistent clear cut answers yet because of its newness.

Let's take an admittedly unlikely example. You have a house but have moved to take a job out of state, and you're trying to sell the house. Some scumbag squatter moves in and tries submitting false documents to claim ownership. All the documents relating to purchase and any mortgages have been scanned and shredded. Will the courts, police, banks, city and county offices etc. give you any trouble because they are not signed originals? What if the scumbag claims you fabricated the documents (like he did) and his are the originals? What if some entities accept a scan and others don't?

I've implemented a hybrid system where different documents get scanned / destroyed at different times. I have a single card-file cabinet (filing cabinet with half-height drawers). Paper copies of everything from the current year and previous year are kept in a drawer. At the end of each year, I take all the documents from year-1, shred most of them (assuming any need for them has passed), and put the ones I deem most critical in a small box to archive.

Re:A couple questions by Rinisari · 2013-04-07 09:28 · Score: 1

The initial purpose of keeping the information is completion. I sheepishly admit to digital hoarding, and this may be feeding that desire. To me, it's easier to scan a document and tag it, rather than importing its information.
I need to keep things like receipts for large purchases for insurance, expense, and warranty purposes, bills and account statements, tax documents, and even things like the rare paper letter I get (e.g. my former tax preparer died last year. If I were to be audited, I'd need some evidence that she's dead. I have a letter from her next of kin and coworkers saying that she died.)
I need original paper for SOME receipts, things with raised seals such as birth certificates or car titles, and other unique items that the originality of the paper would increase its authenticity in a court of law.
What you do seems very similar to what I want to do, perhaps with the exception that I'm a metadata nut and want to be able to search things a little easier, should the need ever arise.

--
Colin Dean Go a year without DRM
Re:A couple questions by david_thornley · 2013-04-08 08:36 · Score: 1

There's a certain amount to be said for hoarding. There are quite a few documents that I'm unlikely to ever want again, but I may have need of. My current practice with the paper ones is to throw them into the appropriate drawer, since they're nicely compact in such a pile, and it'll likely be faster to do an exhaustive search a couple of times than to maintain a filing system.

--
"When you have eliminated the unacceptable, whatever is left, however improbable, must be the truthiness" - Holmes

gscan2pdf by markdavis · 2013-04-07 09:02 · Score: 1

I use gscan2pdf http://gscan2pdf.sourceforge.net/ with my multifunction "printer" and then save the bills and documents in properly named and organized directories as pdf files. Simple as pie. (Why is pie simple?)

Dropbox or google drive by alen · 2013-04-07 09:07 · Score: 1

I use my iPhone to scan, convert to PDF, and upload to my Dropbox. The app cost me $6.

Dropbox will always be there and is backed up.

Depends on your personality. by Richy_T · 2013-04-07 09:10 · Score: 1

I used to periodically sort things into hanging folders then dispose of anything nonessential after 3-4 years. A few years back I decided to switch to scanning. So I started collecting a pile of stuff to scan. In the intervening years, that pile has grown and grown and now the scanning would be such a big chore that I don't even like to contemplate it anymore.

The simple fact is that most documents are not something you will ever need again so deserve the minimal effort you can put towards temporary mid-term storage and worth 0 effort for archiving. Others may disagree but I suspect I already keep too much for too long. To be honest, there's not really been much that would have been an issue if I didn't shred immediately after reading.

Zotero by Fpdx · 2013-04-07 09:14 · Score: 1

Zotero (a Firefox extension, also stand-alone I believe) works well for me to archive lots of PDFs. It has tags and directories, meta information, search, notes etc. Once you've got your PDFs, Zotero is a good organizer.

Re:I wrote a script to do exactly what you are say by Rinisari · 2013-04-07 09:18 · Score: 1

Please do post the script. Throw it up on pastebin, or, better yet, https://gist.github.com/.

--
Colin Dean Go a year without DRM

Re:Smartphone by Rinisari · 2013-04-07 09:22 · Score: 1

Using Camscanner or its ilk is something that a few friends have suggested, but I find the quality of the scans to be less than I really want for long-term archival. This may suffice for many documents that I'm likely never to look at again, such as bills, but things like letters or tax documents I think may require a little higher quality. Also, if a document is more than one page, camera scanning quickly gets unwieldy. I scanned a 30 page document on the go using Camscanner and it was a painful experience.

--
Colin Dean Go a year without DRM

OPAC by buss_error · 2013-04-07 10:35 · Score: 2

Any open source library management software that does ebooks should help you out. Here's a list:

http://sourceforge.net/directory/home-education/library/opac/os:windows/freshness:recently-updated/

--
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.

Skip scanning, download PDFs directly. by yeah+I+can+fix+that · 2013-04-07 11:02 · Score: 2

(Longtime listener; first-time caller. If I'm doing it wrong, please be kind.)

I've been going through the same issue and have painstakingly scanned/filed a metric crapton of old documents, putting them in a hierarchical directory structure where I can find them if I need them.

But this sucks for a number of obvious reasons. The ones that bother me the most are:

1) A scanned document is larger (*and* less useful) than a downloaded PDF.

2) It's a manual process! I'd rather spend ten hours automating something than five hours over the next 5 years trying to remember the filename convention for storing the scanned document.

Anyway, cutting to the chase, I'm now using Ruby/Watir scripts to automate the business of downloading my most common phone/utility bills from the websites and stashing them directly. I used to use Perl and WWW::Mechanize, but all the websites are now so contaminated with unnecessary JavaScript that only something which manipulates a browser directly allows automation without pulling my hair out. Ruby/Watir works pretty well. Recommend. Automated download: priceless. Without automated download, I'd rather return to scanning paper documents mailed to me; otherwise you quickly find out how unreliable your service provider is at retaining your statements.

If anybody wants some pre-alpha scripts for grabbing their pg&e, comcast, cigna, at&t, schwab, nvenergy statements, let me know.

Super simple Linux based document scanner. by beachdog · 2013-04-07 11:27 · Score: 1

Super simple scanning system using Linux.
Make directory called scans, make another called taxes
Have a text file of scanning hints with an easy to remember name.
in a terminal, print the scanning hints file and use the Linux mouse copy feature to construct a scan instruction
The scanimage application requires sudo or you can find a tweak using google search to alter the scanner's USB files and make it run from an unprivileged user.
cd scans
cat filewitheasycommandstocopy.txt

Typical contents of my hint file:
sudo scanimage -l 0mm -x 90mm -y 66mm --resolution 400 | pnmtojpeg >cprcard.jpg
# make files non-overwritable
# chmod -w ~/scans/*.jpg

Verify each scan with eog viewer.
Organize scans like this:
Make long filenames with agencynames, recipientnames, and documentnames all in lower case.
use the mouse to copy an old file name for re-use.
this groups similar documents together.
use ls -ltr to show most recently scanned items (newest last).
use ls -lr *keyword*.jpg to show selected classes of scanned items.
use locate in the distant future to find those oddball items like certificates or letters of recommendation.

locate certificate | grep rabies
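
The naming and lookup convention above could be sketched in Python roughly as follows. This is only an illustration, not beachdog's actual method: the helper names scan_name and find_scans, the underscore separator, and the example words are all assumptions.

```python
import re
from pathlib import Path


def scan_name(agency, recipient, document, ext="jpg"):
    """Build one long lower-case file name from the three parts (hypothetical helper)."""
    parts = []
    for part in (agency, recipient, document):
        # Lower-case, then squeeze anything that is not a letter or digit into "-".
        cleaned = re.sub(r"[^a-z0-9]+", "-", part.lower()).strip("-")
        parts.append(cleaned)
    return "_".join(parts) + "." + ext


def find_scans(directory, keyword):
    """Return scans whose name contains the keyword, like `ls *keyword*.jpg`."""
    return sorted(Path(directory).glob(f"*{keyword.lower()}*.jpg"))
```

Because the agency comes first in the name, a plain sorted listing groups similar documents together, which is the point of reusing old file names.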

Re:I wrote a script to do exactly what you are say by beachdog · 2013-04-07 11:51 · Score: 1

Here are some pieces of a scan to ocr script I am developing.
First I am scanning a multicolumn document and to preserve the sense of the document text, I scan even pages twice and odd pages twice.
Second, the scanned images must be rotated. Pieces of the "convert" command appear in the perl fragments here.
Third, I am using the open source tesseract OCR program. Some of my documents have grayed areas that contain text. So I am running tesseract twice on the source files and picking the output file with the most text characters.
Fourth, the basic program is just a big loop with a menu where I input file names or page numbers.

Here goes:
# my $scanprog = "/usr/bin/scanimage --resolution 400 >";# print "$scanprog \n";
# Scanner settings for pages top of book at left of scanner StylusScan 2500
my $scanoddleft = "/usr/bin/scanimage -l 30mm -x 190mm -y 235mm --resolution 400 >";#for odd pages
my $scanoddright = "/usr/bin/scanimage -l 0mm -x 190mm -y 235mm --resolution 400 >";#for odd pages
my $scanevenleft = "/usr/bin/scanimage -l 30mm -x 190mm -y 235mm --resolution 400 >";#for even pages
my $scanevenright = "/usr/bin/scanimage -l 0mm -x 190mm -y 235mm --resolution 400 >";#for even pages
# OCR commands and parameters
#tesseract test1.tif test1 -l eng;
#scanimage -l 26mm -x 166mm -t 10mm -y 125mm --brightness 3 --resolution 400 | pnmtotiff>test1.tif;eog test1.tif;convert -rotate 90 test1.tif test1.tif; eog test1.tif; tesseract test1.tif test1 -l eng
my $tesseract = " tesseract ";
my $language = " -l eng ";
my $brightness2 = " --brightness 2 ";
my $brightness3 = " --brightness 3 ";
my $convert90 = " convert -rotate 90 ";
my $eog = " eog " ;
my $charcount = " wc -c " ;
my $scanpage = 1; # Range is 1 to 183
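
The "run tesseract twice and keep the output with the most characters" step could look something like this in Python. It is a sketch under assumptions, not part of the Perl script above: it assumes the two OCR runs have already written their text files, and the function names are made up for illustration.

```python
from pathlib import Path


def char_count(path):
    """Count the bytes in one OCR output file, roughly what `wc -c` reports."""
    return len(Path(path).read_bytes())


def pick_best(candidates):
    """Return the candidate text file with the most characters."""
    return max(candidates, key=char_count)
```

Something like pick_best(["page1_b2.txt", "page1_b3.txt"]) (hypothetical file names for the two brightness settings) would then return whichever run recovered more text. Note that more characters is only a rough proxy for better OCR; a noisy run can also produce more garbage.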

OCR - Re: I was in the same boat by WebCowboy · 2013-04-07 16:01 · Score: 2

gscan2pdf can do OCR and embed the results as annotations within the PDF. Perhaps that would help with searchability. It is far from perfect, but it works well enough with a lot of my documents, especially bills, as they are not handwritten. Best results are on scans set to line art/b&w rather than grey scale or colour.

My workflow by dcavens · 2013-04-07 17:46 · Score: 1

I've been doing this for a while now. Like others here, I have a Fujitsu Scansnap 1500- it's one of the best investments I've made for cleaning up my office/workflow.

When something comes in, I immediately scan it to the filesystem. My structure is:

2013/Banking/BankName/2013-01-31-14h32.pdf (or something like that- it's the default Scansnap filename.)

I then place the original in a filebox- keeping one filebox for each year. No sorting, organizing, just keeping originals.

At the end of each year, the filebox goes to the crawlspace, and I start a new one. After 7 years, the intention is to get the box securely shredded (costs about $10/box around here).

I back the filesystem up nightly to two separate local NASs, and upload the whole filesystem (as a series of encrypted files) to Amazon Glacier (this is a recent addition to my workflow- has stopped me worrying about a fire etc. wiping out both NASs).

All of my documents go in there- it's really easy to find stuff (depending on how good your folder organization is- you can add depth for those kinds of documents that need it, while other ones that aren't likely to be needed can be put in a less descriptive folder hierarchy.)
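
For anyone scripting a layout like the one above, a minimal Python sketch of building the year/category/source/timestamp path might be as follows. The function name archive_path and its parameters are assumptions for illustration; the ScanSnap software produces its own file names.

```python
from datetime import datetime
from pathlib import PurePosixPath


def archive_path(category, source, scanned_at):
    """Return year/category/source/timestamp.pdf for one scan (hypothetical helper)."""
    # Mirrors the 2013-01-31-14h32 style of timestamp in the example path.
    stamp = scanned_at.strftime("%Y-%m-%d-%Hh%M")
    return PurePosixPath(str(scanned_at.year), category, source, stamp + ".pdf")
```

Calling archive_path("Banking", "BankName", datetime(2013, 1, 31, 14, 32)) gives 2013/Banking/BankName/2013-01-31-14h32.pdf, and extra depth can be added for the kinds of documents that need it.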

HP officejet pro 8600 by medoc · 2013-04-07 18:59 · Score: 1

For what it may be worth, I have an all-in-one HP printer/scanner (model in subject). It's reasonably cheap, it is a good printer, and it has a double-sided scanner with auto-feeder which works really well.

I've scanned thousands of sheets with it recently (for archiving before shredding), and I would never have even tried it without the automatic scanning.

Disclaimer: not an HP employee, have no HP stock...

Re:I wrote a script to do exactly what you are say by kermidge · 2013-04-07 19:33 · Score: 1

"I have been wondering whether it would be worth open sourcing the script...."

Please do. Unless you deem it worthwhile to spiffy it up and try to make some moola, I think it'd be great to share your script. It could be useful to some, could be instructive to those wanting to learn, to see how someone else has done something; any possible embarrassment you might feel about it being 'a bit hacky' you might could toss off to 'having character'. Heck, after your description I'd like to see it, even tho I haven't done any real coding in years.

Seems to me putting the OCR text in the comment field is a fine and good thing. An obvious thing to some, perhaps, but an elegant usage to me.

@Rinisari, below - the link you gave throws a cert warning in Opera, could just be my settings.

filesystem or free document repository by unixhero · 2013-04-07 20:03 · Score: 1

This is so easy, I've been doing it since primary school. Mark your files with the date: YYYY-MM-DD-name. Put them in folders 1993, 1999, 2005, 2010, 2013 (e.g.). Profit! In modern times I have used a Lexmark X560 all-in-one office machine to scan everything to a designated network drive. Works like a charm every time. I apply the backup policy of equal drives: I buy a 3TB drive, and I buy another for off-site backups. Once every two months or so, I freshen the backup and verify it. If you need to do it a couple of notches more professionally, apply the same backup policy, but use a document retention system, and store everything in PDF, a global industry standard. There are many good document repositories that are free (as in beer).
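
The comment does not say how the backup gets verified. One possible approach, sketched here in Python purely as an assumption, is to compare SHA-256 digests of every file on the main drive against the copy on the backup drive. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path relative to root to the SHA-256 of its contents."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            digests[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def verify_backup(source, backup):
    """Return relative paths that are missing from the backup or differ from the source."""
    src, dst = tree_digests(source), tree_digests(backup)
    return sorted(name for name, digest in src.items() if dst.get(name) != digest)
```

An empty list from verify_backup means every source file has an identical copy on the backup. This version reads each file fully into memory, so very large files would be better hashed in chunks.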

Re:Receive electronic statements? by Miamicanes · 2013-04-07 20:20 · Score: 2

The problem with most businesses is that they want to have their cake & eat it too... they want to get you to opt into paperless statements, but they don't want to allow you to fetch your statements via automated means. They just want to spam you monthly (or more), then make you go to their site, log in, and generally set things up to make it as hard to automate those logins as possible. If companies like CapitalOne and Chase would let you just give them your public key, encrypt your statements with it, and email them directly to you (or allow you to fetch them in some standard manner via a web service), I'd happily let them off the hook and go all-electronic. But I'll be damned if I'm going to settle for statements I have to go out of my way to obtain. At least printed statements can be tossed into a box and ignored for years unless I care enough to look at them, as opposed to ephemeral online statements that go bye-bye after 12 months.

Y10E4 by Compaqt · 2013-04-07 20:24 · Score: 1

>there won't be a reason to change to format for approximately 7,895 years (but who is counting, really).

I'm kicking myself for not having caught this earlier!

Thanks (no sarc) for alerting us to the Y10E4 problem.

Question: Is Linux Y10E4 ready?

--
I'm not a lawyer, but I play one on the Internet. Blog

Re:I wrote a script to do exactly what you are say by Ogi_UnixNut · 2013-04-07 21:05 · Score: 1

And done :)

https://github.com/ZivaVatra/SDAT

Figured I would take the opportunity to try out GIT (have not bothered so far).

Also, seems that I have recently made it actually save to tagged PNG's rather than TIFFs. Forgot about that :)

Hope it turns out to be useful to you. Let me know if you want commit privs for any fixes you do. Happy Hacking!

Re:I wrote a script to do exactly what you are say by Ogi_UnixNut · 2013-04-07 21:13 · Score: 1

As requested, I have put it on github now:

https://github.com/ZivaVatra/SDAT

Hope it is useful to you, or at least interesting :)

I don't think it is worth selling the script, it isn't that fancy, not to mention that then I would be on the hook for supporting it.
Since the recession hit I've had to work 2 jobs, and I really don't have much time to devote to personal nerdy pursuits. Barely have time to sleep as it is :(

All I can do is publish and hope it helps others. That little script has made my life a lot easier and less cluttered. With any luck others with more time will improve on it and we all benefit :)

And it isn't your settings. I also get an invalid certificate on Firefox and Chromium. Something is up with the link, so I just went ahead and used the actual github.com site to host it.

Version control for documents by mekberg · 2013-04-07 21:29 · Score: 1

For maintaining a high-integrity archive of documents, try Boar. It can even version control huge documents, like movies and photos. http://www.boarvcs.org/

A real document management system? by rolfc · 2013-04-07 21:30 · Score: 1

Try out Alfresco. It is a nice document management system if you are familiar with IT systems.

Really that much of an issue by tehcyder · 2013-04-07 23:01 · Score: 1

I've been a houseowner and the rest for a while, and the amount of stuff you actually need to keep is not that great. I mean, who keeps old bank statements, credit card bills, invoices and receipts for more than a couple of months, unless they're business- or tax-related?

I know it's probably different here in the UK where most people don't even need to do a tax return, but basically the really important stuff like house deeds (and wills) are in the hands of a solicitor anyway, and I simply don't need to keep copies of 2 year old bank statements or 3 year old electricity bills on the off chance I might need to refer to them.

If you have a business, fair enough, you legally need to keep financial stuff for 6 years, but then off-site archiving is just an insignificant business cost.

--
To have a right to do a thing is not at all the same as to be right in doing it

Re:Scan, OCR, and use your file system (and symlin by AmiMoJo · 2013-04-07 23:49 · Score: 1

You would think that in this day and age someone could invent an OCR system that has a basic understanding of the documents being scanned. It would automatically name files "Bank of Elbonia Current Account Statement 03/2013" and allow you to do things like search all statements for transactions over £250.

In fact the bank could just include a big QR code on the back with all the data in it, but I suppose that is asking too much.
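
Short of a system that truly understands statements, a crude version of the "transactions over a threshold" search can be run over plain OCR text. The Python sketch below is only an illustration under assumptions: it assumes one transaction per line with a pound amount on it, and the regular expression and function name are made up for this example.

```python
import re
from decimal import Decimal

# A pound sign followed by digits, optional thousands commas, optional pence.
AMOUNT = re.compile(r"£\s*(\d[\d,]*(?:\.\d{2})?)")


def transactions_over(ocr_text, threshold):
    """Return (line, amount) pairs for lines with a pound amount above threshold."""
    hits = []
    for line in ocr_text.splitlines():
        for match in AMOUNT.finditer(line):
            amount = Decimal(match.group(1).replace(",", ""))
            if amount > threshold:
                hits.append((line.strip(), amount))
    return hits
```

Real OCR output would be much messier than this assumes (misread digits, amounts split across columns), which is rather the point of the complaint above.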

--
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC

Re:I wrote a script to do exactly what you are say by kermidge · 2013-04-07 23:55 · Score: 1

Thank you. I'm enjoying looking at the script, am wondering where I can get a scanner I can afford, and had a look at your site as well. And thanks for the readme also. Now get some sleep. (grin) I've had two jobs at times, when I was younger, and it's decidedly not only no fun, but done too long a recipe for ungood juju. Good luck to you.

doo for OS X by Anonymous Coward · 2013-04-08 00:37 · Score: 1

There is a very good OS X application called doo which will do exactly what you want, and it's completely free. Check it out at http://doo.net/ or find it on the Mac App Store.

Come work for Laserfiche! by neminem · 2013-04-08 02:46 · Score: 1

This is not actually a real solution, I'm just amused every time I see document management show up in a slashdot thread - being it's not a very *exciting* field. I work for a company that provides document management solutions for much larger organizations, with costs (I think) starting in the tens of thousands of dollars, and going up to way more than that. Of course, since I work here, I have my own personal test repositories for testing things, and when I was buying a house and started getting crazy amounts of paper documents I had to sign, scan and send back, I was like, why not scan them with QuickFields and keep them in a Laserfiche repository? So I did. :D

That solution doesn't work for most people, though. (Also it's totally not open source. Source is only open to those who are developers at this company, which I am, so I suppose in a twisted way...)

Question about tesseract by fritsd · 2013-04-08 04:42 · Score: 1

Can anyone tell if you can train tesseract to be a bit better at recognizing a specific font?

I'm using the Debian version but if you have a 300 dpi scan the OCR is often gobbledygook.

(Yes, "use the source Luke" is also a valid answer in this case...)

--
To be, or not to be: isn't that quite logical, Slashdot Beta?

doo for all your docs by Ben-p-williams · 2013-04-08 04:50 · Score: 1

Hi Rinisari,
I work at doo (http://doo.net) and immediately thought of our app when I saw your problem. All of the filing systems listed so far really make sense, and I personally learned a few things from some of the workflows suggested. But using doo would greatly simplify your entire document management.

When you set up doo you select the documents and folders that you wish to connect: not just those imported from connected scanners, but Dropbox, GDrive, the local HD, email, etc. Once documents are connected, the app indexes them and runs OCR automatically. Then it allows you to back everything up and sync it to the cloud for backup and access on other devices.

And it’s all in one spot. Handling the “task of document management” is precisely what we do.

Some details ...
If your document(s) is already digital – whether it’s a scan on your HD, a Google Drive document, an email attachment, in Dropbox, etc – you can connect the source or just the individual folder/document to doo.

If your document is still paper, you can scan it directly into doo using the app’s interface if the scanner has a TWAIN driver (http://is.gd/15O26Q). If it doesn’t have a TWAIN, we have a guide for quickly setting up top brands, like Fujitsu ScanSnap, Canon and Doxie (http://docs.doo.net/scanguide.pdf). I also just wrote a blog post about it if you want to take a look: https://blog.doo.net/2013/04/04/how-to-scan-with-doo.html.

The coolest thing about doo is that it does the tedious, difficult part of document management for you: the aforementioned automatic indexing and OCR occur right when you connect the document or folder in question, which means document search-and-retrieval is a cinch. After that, should you wish to add a degree of personalization, you can alter existing tags or add individual labels in the app’s intuitive UI.

doo is currently available for OS X, Windows 8 and will be coming very soon for Android and iOS (https://doo.net/en/download.html). Hope that helps! Send us a ticket at support@doo.net if you have any questions.

KISS by GrantRobertson · 2013-04-08 15:36 · Score: 1

Keep It Simple Software (engineer) or whatever...

I went with the simplest possible solution. One that also allows me to recover even if a "database" becomes corrupted or obsolete, because all the "real" data is contained in the documents themselves.

I just scan to PDF and add tags in the Keywords field of the PDF metadata. For the keywords, I use unique words that aren't going to show up in an actual document. (Just tacking on a prefix or surrounding each keyword in brackets is good enough.) I also organize the files in a decent (but not too detailed) directory structure. (You can use any high-tech storage system you like. I just use a regular hard drive.) Then I installed the PDF iFilter so the Windows Indexing service could index the files, including that metadata (There are many. Google is your friend.) So, now, if I want to find all the tax files, say, that are related to my farm, for instance (totally made up example), I would just navigate to the directory that holds all my tax documents, then do a basic Windows search for [farm] and there are all my documents. No database to manage or learn how to use. Just the files and their metadata.

There are utilities that allow you to easily select a group of .PDF files and tag them all with the same keywords. I'm sure you can find one for any OS. And the beauty is: Once the file is tagged with the keyword, it doesn't matter if you just throw away the program you used to set that keyword, because the keyword is just a normal part of that .PDF file.

Because the keywords are standard PDF metadata, any OS should be able to read and index on them. If not, then you could find some program that would, I am sure. Again, the beauty of this system is: if you lose access to that indexing system, or move your files to a different platform, all you gotta do is reindex the metadata that is right there in the files. As long as you have your files, you have your keywords.
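
The bracketed-keyword convention described above can be sketched in Python as plain string handling. This only covers formatting and matching the Keywords text; actually writing it into a PDF's metadata needs a PDF tool or library, which is not shown here. The function names are assumptions for illustration.

```python
import re


def to_keywords(tags):
    """Wrap each tag in brackets so it is unlikely to match ordinary document text."""
    return " ".join(f"[{tag.strip().lower()}]" for tag in tags)


def has_tag(keywords_field, tag):
    """True if the bracketed tag appears in a Keywords string."""
    return f"[{tag.strip().lower()}]" in keywords_field.lower()


def tags_in(keywords_field):
    """List the bracketed tags found in a Keywords string."""
    return re.findall(r"\[([^\[\]]+)\]", keywords_field)
```

Matching on the brackets as well as the word is what keeps a search for [farm] from also hitting every document that merely mentions a farm.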

Yes, please post us that script for getting PDFs by KWTm · 2013-04-09 00:01 · Score: 1

>If anybody wants some pre-alpha scripts for grabbing their pg&e, comcast, cigna, at&t, schwab, nvenergy statements, let me know.

In a similar vein to http://ask.slashdot.org/comments.pl?sid=3623835&cid=43389299, I would say: yes, please post to pastebin or github or something (maybe even your own Slashdot journal); if you GPL it, someone might even do some fine-tuning for you.

--
404555974007725459910684486621289147856453481154 in hex is "You sank my Battleship?"
[GPG key in journal]

Re:Neat? by evilviper · 2013-04-09 04:53 · Score: 1

I really didn't want to get into specifics, and waste a bunch of time on minutia in this thread, discussing the pros and cons of each, and details like one not having some edge feature Neat or some other does.

http://lmgtfy.com/?q=linux+document+management+system

There are oh so many out there, and lots and lots of others have endlessly discussed the benefits of each. There's even new ones every day, because, as the OP said, it's just a matter of a tiny bit of programming to fit all the existing pieces together.

--
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant

Re:I wrote a script to do exactly what you are say by Ogi_UnixNut · 2013-04-10 05:02 · Score: 1

Glad you like it! As for scanners, I bought my first one secondhand on amazon for $45. It served me throughout Uni for the next 5 years, until I got a new one (it came as a 2-for-1 deal with a new printer). If that is beyond your budget I have seen them be thrown/given away (especially old parallel port ones), and a lot of them work well with Linux.

If Linux support is a must, have a look at: http://www.sane-project.org/sane-supported-devices.html

Yeah.. My site, like the rest of my non work life, is out of date and broken (the counter no longer increments, and it doesn't load properly, breaking the site). I am in the process of rewriting the backend, but time is short.

Thanks a lot, hopefully times will get better soon, and then I can devote some proper time to personal projects again :)

Good luck with your attempts to find a scanner as well!

Re:I wrote a script to do exactly what you are say by kermidge · 2013-04-10 16:50 · Score: 1

Thanks for the tips. I'll be glad when I'm able to stand for more than a few minutes without passing out - then I can take a bus to the re-sale shops (our city fathers in a burst of concern for bargain hunters - many of them very low income and living in and around down town - moved all those shops to out-lying areas, to better serve the citizens' needs), else they'd all be within crutching distance.

I've never tried the extent of external connectivity, but if an XP vm can talk to my printer, it should maybe talk to a scanner as well, so either way it oughta be OK.

Well, even as is, I enjoyed your site, found some interesting things to read. Yeah, we just do what we can, as spirit moves and wallet enables. And the meat bag cooperates, of course.

Google's stuff would work by JBJblaze · 2013-04-12 13:27 · Score: 1

If you use one of Google's products, you should be good. NEVER go Apple though. If you want something open-source, avoid anything and everything Apple as much as you can. Your thoughtful friend, JBJblaze
