Ask Slashdot: Open Source For Bill and Document Management?
Rinisari writes "Since striking out on my own nearly a decade ago, I've been collecting bills and important documents in a briefcase and small filing box. Since buying a house more than a year ago, the amount of paper that I receive and need to keep has increased to deluge amounts and is overflowing what space I want to dedicate. I would like to scan everything, and only retain the papers for things that don't require the original copies. I'd archive the scans in my heavily backed up NAS. What free and/or open source software is out there that can handle this task of document management? Being able to scan to PDF and associate a date and series of labels to a document would be great, as well as some other metadata such as bill amount. My target OS is OS X, but Linux and Windows would be OK."
Send them to a dedicated Gmail account. You'll be able to find all of your documents (you can label them, whatever), and Google provides an online office suite of some sort. If you forget what you have there, you can always just go to Google search and push the "I'm Feeling Lucky" button.
You can't handle the truth.
I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.
I shall go and tell the indestructible man that someone plans to murder him.
OpenKM (http://www.openkm.com/en/) is what I use to manage my documents; its tagging and document preview features are what I appreciate most. It runs as a web service, FYI.
By definition, "important" = keep the original (I mean, seriously, are you that short of basement space?). /. has gone over this before; is this the editors' idea of a yearly question?
Electronics are ephemeral. You can, today, read stuff on papyrus, as long as you know the language. Do you really want to trust stuff that is important to ephemeral electronics?
(i mean, how many times has
Tagging is an inherently stupid idea. It may be the best that you can do with current technology, but a Google-like full-text search is much, much better. (Tell me: if you want to pull out a piece of information you know is in a PDF on your hard drive, do you look for the PDF, or just google it?)
It is possible that, after 5 or 10 years, you might know what tags you want...
Tagging is hard work that you have to do manually and consistently; better to have 3 or 4 folders organized by client/project, then tag.
I do this on Windows using the cheapest HP all in one with ADF with its bundled scan to PDF with OCR. I use an encrypted TC volume for storage. 512MB is plenty for several years worth at 300dpi b/w. The less typing you have to do the better. Just use one folder for each major category. House, Taxes, utilities, etc. Don't make yourself work too hard entering each item or you will never get around to scanning.
Similar questions to yours appear here regularly. The consensus is that it's best just to throw the bills and documents out and spend more time watching porn.
You can try Alfresco DMS.
It requires a web server, so it might be too much for a single user.
http://www.icyblaze.com/idocument/
iDocument for the Mac is like iTunes but for documents. It lets you import documents (pretty much any type), tag them and store them in virtual or real folders. It sounds like it's exactly what you're after.
The problem with slashdot is that most of its users were bullied and stuffed into lockers as kids!
1) Receive document.
2) Scan with Fujitsu Scansnap S1500 in about 10 seconds. $380 on sale, but so far it's so much better than cheap all-in-one scanners it's not even funny. Seriously, don't even bother going paperless unless you get a real document scanner.
3) Save PDF to simple software RAID-1 mirror of two 2TB drives. (Takes about 5 seconds to set up from disk management in Windows.) This should protect against sudden drive failure taking everything.
4) Backup nightly to external drive swapped off-site every other month. This should protect from accidental deletions, fires, etc. Bonus points if backup drive is ioSafe fire proof variety.
5) Throw away original. Only exception is official documents like titles, marriage certificate, etc.. Yes, I even throw away W2s and the like. My taxes are 100 percent digital nowadays.
6) Check and test restore from those backups on a semi-regular basis, and you're done!
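Step 6 is the one that tends to get skipped, so here is a minimal sketch of a restore check in Python. It assumes the backup is a plain file-for-file copy of the archive folder (not a compressed or encrypted image), and `verify_backup` is a made-up name, not part of any backup tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large scans do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(archive: Path, backup: Path) -> list[str]:
    """Return a list of files that are missing from the backup or differ."""
    problems = []
    for source in sorted(archive.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(archive)
        copy = backup / relative
        if not copy.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(copy) != sha256_of(source):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

An empty list means every archived file has an identical copy on the backup drive. It does not prove the drive will still be readable next year, so it is worth rerunning whenever the drive is swapped.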
So, I've been doing this pretty consistently for the past few years and sent this advice to some relatives asking basically the same question. (That's also why it's a little dumbed down.)
I haven't found a case where any sort of CMS makes more sense than the file system. This is after doing this for about 10 years, and I've got records going back to '01.
I'm using a Fujitsu ScanSnap and a Fellowes Powershred, and running Mac OS X. OS X has decent indexing, a good file system manager (really can't beat column view) and the Preview app will let you reassemble PDFs, which is occasionally very handy.
1. The enemy is copies. I strongly recommend "scan and shred", or you'll wind up scanning the same thing over and over.
1.1. Don't bother with any scanner that doesn't do double-sided scans.
1.2. Use a shredder. You can take things out of a trash can.
1.3. The scanner should come with OCR software. Choose "Searchable PDFs".
2. Do scanning in small batches.
2.1. Create two folders, "Scanned" and "Unfiled".
2.2. The scanned files go immediately into Scanned, and the paper immediately goes into the shredder.
2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.
3. If it takes any work to scan it just shove it in a filing cabinet, or, better yet, just shred it.
3.1. If you're having to use a flatbed, it's too complicated to scan and you should file or shred it.
3.2. You can often get manuals and pamphlets and stuff online by googling part of the text or the product name.
4. Don't scan anything you can get electronically.
4.1. Most companies would much rather let you download bills and statements and such.
4.2. Most of them will also delete those statements after a few months, so get in the habit of immediately downloading the statement.
5. It's *very* helpful to put a date on everything. I generally do YYMMDD, trying to guess from dates I find in the document.
5.1. If it's a document covering a period of time like a bill for the month of November, I use the ending date.
5.2. For tax documents I'll put TT-YYMMDD, where TT is the tax year, since the actual transactions occur that year, but filing and IRS stuff happens the year after.
6. I've found that even with full text search, you still need folders.
6.1. They just don't need to be extremely complicated; usually two levels seems to be fine. I'll put prior years into separate folders, too.
6.2. Your system will evolve as you work; just get it in there, and then be mindful of what you are commonly looking for.
6.3. Keep books and reference manuals in a folder that doesn't get indexed. (Spotlight has an option for this.) They tend to create a lot of spurious hits.
7. Keep your inbox clean. If an email wants you to download a statement, get it right away and put it in Unfiled.
7.1. Likewise, keep your desktop clean. Scan and shred stuff as soon as it comes in.
7.2. Have a periodic to-do item to tidy your files, but don't spend more than half an hour (tops!) at any given time.
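The date stamps from point 5 are easy to get wrong by hand, so here is a small Python sketch of them. The stamp formats (YYMMDD and TT-YYMMDD) are from the list above; putting the stamp at the front of the name, the space separator and the function names are assumptions for illustration.

```python
from datetime import date
from typing import Optional


def date_stamp(doc_date: date, tax_year: Optional[int] = None) -> str:
    """YYMMDD, or TT-YYMMDD when the document belongs to a tax year."""
    stamp = doc_date.strftime("%y%m%d")
    if tax_year is not None:
        # TT is the two-digit tax year the transactions belong to.
        return f"{tax_year % 100:02d}-{stamp}"
    return stamp


def scan_name(doc_date: date, description: str,
              tax_year: Optional[int] = None) -> str:
    """Build a file name with the stamp first (assumed layout)."""
    return f"{date_stamp(doc_date, tax_year)} {description}.pdf"
```

For a November bill you would pass the ending date, e.g. `scan_name(date(2012, 11, 30), "electric bill")`. A 2012 return filed in April 2013 would use `tax_year=2012` with the 2013 filing date.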
My suggestion would be to just scan and OCR your files, and then store them in your file system.
Hierarchy might be something like: ~/scans/year/project/sorted
Within each sorted subdir, you'd have three folders. Date, organizationThatGeneratedTheDoc and TypeOfDoc.
So in the folder ~/scans/year/project/sorted/org
The file names would be something like: organizationThatGeneratedTheDoc-yyyy-mm-dd-TypeOfDoc.pdf
In the folder ~/scans/year/project/sorted/TypeOfDoc
The file names would be like: TypeOfDoc-yyyy-mm-dd-organizationThatGeneratedTheDoc.pdf
Etc.
You'd use links (symlinks or hard links) to make sure that each document is accessible in more than one place. (You can also use links to put documents in more than one project folder.)
Types of documents would be things like invoices, receipts, legal threats, court orders etc. In the event that a document has more than one type, or more than one organization, you simply have more links. So invoice-2013-04-07-webdevteamawesome.pdf and legalthreat-2013-04-07-webdevteamawesome.pdf are the same document, because the first page is an invoice, and the second a threat to take you to court if you don't pay. (This then exists six times, three times for each type, but with the magic of hard links only takes up the space of 1.001 documents.)
With the OCRed text being saved with the PDF scan, you can also run text searches within your files to find specific information (such as bill amount; seriously, how often would you use that information?).
This allows you maximum flexibility, and prevents you from being locked into a particular piece of software (as you can do everything manually). Moreover, once you've got it set up, it's easy to run with each new document.
Steps would be:
1) Scan and OCR doc, saving the PDF into the staging area folder.
2) Run your script, which asks for the date, project, org name, doc type.
3) The script then saves the document in the appropriate folders, generating links as required.
4) Profit!
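Steps 2 and 3 could be sketched roughly like this in Python. It is only one reading of the scheme: the folder names `date`, `org` and `type`, the date-first name used in the date folder, and the function name `file_document` are assumptions, since the post only spells out the org and type file names. Hard links also need all three folders to live on the same filesystem.

```python
import os
import shutil
from datetime import date
from pathlib import Path


def file_document(staged: Path, root: Path, project: str,
                  org: str, doc_type: str, doc_date: date) -> list[Path]:
    """Move a staged scan into the tree and hard-link it under three names."""
    stamp = doc_date.isoformat()  # yyyy-mm-dd
    sorted_dir = root / str(doc_date.year) / project / "sorted"
    targets = [
        sorted_dir / "date" / f"{stamp}-{org}-{doc_type}.pdf",
        sorted_dir / "org" / f"{org}-{stamp}-{doc_type}.pdf",
        sorted_dir / "type" / f"{doc_type}-{stamp}-{org}.pdf",
    ]
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.exists():
            # Refuse to overwrite something that was already filed.
            raise FileExistsError(target)
    # The first name receives the file itself; the others are hard links to it.
    shutil.move(str(staged), str(targets[0]))
    for target in targets[1:]:
        os.link(targets[0], target)
    return targets
```

A document with two types (the invoice that is also a legal threat) would just be filed once and then linked again under the second type name.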
HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!
http://www.amazon.com/ScanSnap-S510M-Instant-Sheet-Fed-Scanner/dp/B000WJCX18/ref=sr_1_31?s=pc&ie=UTF8&qid=1365365308&sr=1-31&keywords=archive+scanner
The above come highly recommended as an all-in-one solution.
I should note, you need to be careful to make sure you use the same spelling and wording for each org and doc type. You don't want to end up with Murphies, Murphy's, Murphy's Inc., Murphpy's Beer Company Inc. etc., each with invoice, inv., invoise and envoice.
It would be better if your script forced you to pick a doc type, and showed a list of already existing companies.
This applies no matter what solution you end up running with.
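One way to get the "show me the existing companies" behaviour is to read the names back out of the files already filed and offer near matches. A minimal Python sketch, assuming files are named org-yyyy-mm-dd-type.pdf and org names are stored in lower case; `known_orgs` and `suggest_orgs` are hypothetical helper names.

```python
import difflib
import re
from pathlib import Path

# The org is everything before the first -yyyy-mm-dd- date part.
ORG_FILE = re.compile(r"^(?P<org>.+?)-\d{4}-\d{2}-\d{2}-.+\.pdf$")


def known_orgs(org_dir: Path) -> list[str]:
    """Collect the distinct org names already used in a folder."""
    orgs = set()
    for path in org_dir.glob("*.pdf"):
        match = ORG_FILE.match(path.name)
        if match:
            orgs.add(match.group("org"))
    return sorted(orgs)


def suggest_orgs(typed: str, known: list[str]) -> list[str]:
    """Existing names that look like what was typed, best match first."""
    return difflib.get_close_matches(typed.strip().lower(), known,
                                     n=5, cutoff=0.6)
```

If `suggest_orgs` comes back with a hit, the script can ask "did you mean ...?" before it creates yet another spelling. The same idea works for doc types.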
Also, for documents that cover a period, you have multiple options. The first is to give 00 as the day (e.g. 2012-12-00), and the second is to give 01 as the start (e.g. 2012-12-01). Another is to have two dates (2012-12-01-to-2013-01-01) in place of the yyyy-mm-dd suggested in my first post. Also, don't even think of having the dates in any order other than year, month, day.
Some places have a working year (e.g. a tax year) that crosses two calendar years. In that case, you should be careful about where you put documents. Because if you put them in the first year, and then go "OK, it's been 7 years, and I no longer need any docs from 2005", you'll be burnt. A solution is to hardlink them into both years.
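A tiny Python sketch of the two-date option and the "link it into both years" fix. The `-to-` format is the one from the post; the function names are made up, and the example dates are only an illustration of a working year that crosses two calendar years.

```python
from datetime import date


def period_stamp(start: date, end: date) -> str:
    """Two-date stamp, e.g. 2012-12-01-to-2013-01-01."""
    return f"{start.isoformat()}-to-{end.isoformat()}"


def years_covered(start: date, end: date) -> list[int]:
    """Every calendar year the period touches; link the file into each."""
    return list(range(start.year, end.year + 1))
```

Linking the document into every year returned by `years_covered` means that clearing out an old year folder later cannot take the only copy with it.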
Do post back when you have a solution!
I've played with this a few times, never used it in anger though.
http://www.mayan-edms.com/
I might take up your challenge on going paperless too and give Mayan a go.
PDF is big and bulky. The DjVu format makes for tiny document scans. And there are open source libraries for creating it, available even in Debian. Wavelet compression did finally make it into the wild. It's just that nobody has ever heard of it, for some reason.
Doesn't help for organization, but it should be a reasonable option for storage.
It even embeds the OCR text in the document along with the image version, so it doesn't proliferate multiple copies of the same data.
Your suggestion is over-complicated IMHO. I use Xsane and scan as multi-page documents. Xsane allows me to add pages to the scan set and reproduce a new PDF file. There are some downsides to my method: I need to have an approximate idea of the date of the document that I am looking for.
I generally file by //.pdf, although I may vary the hierarchy if appropriate, for example: TAXES//.pdf
Perhaps more important, though, is to extract the data into some form of record keeping (even if it is only a spreadsheet) at the time that it is saved. Then, unless I am being audited, I really don't need the scans.
The real "Libtards" are the Libertarians!
I have been trying to do this for a while. I have a ScanSnap S1500M and have been hosting all the PDFs on my Synology NAS. However, programs like iDocument don't support network drives and text searching PDFs. They rely on Spotlight's database, and Spotlight doesn't work on a NAS (though it supposedly does work on an Apple server).
I'd LOVE some sort of text searchable solution that is better. I do use iDocument, but that has a LOT of limitations, like it will not handle ePUBs. I'm hoping at some point Synology will create an app for its line of units similar to something like Evernote. They already have two great apps that allow you to stream audio and video from your Synology unit to an iOS or Android phone and computer. And they also have a Dropbox-like app. The last piece they really need is some sort of document management thing that works with their stuff. That would be a perfect solution for someone who has a lot of documents or a small business which doesn't want to have its data in the hands of Google or other companies.
It's either on the beat or off the beat, it's that easy.
I moderate therefore I rule!
Maybe that ownCloud thing will work well to handle the storage and access. Does anyone know if its search function is any good?
I use the community edition of Alfresco for that task. You can tag all documents, add custom fields and have full text search and versioning out of the box. Documents can be accessed via web interface, smb, ftp and even imap.
My situation is the same, except that I move often, and have to keep legal documents for a few years (typically 5). I also have paper copies of invoices and bills (loads). I didn't want to have to lug boxes and boxes of paper, so I developed a script to do the following:
1) Scan the document page by page, and save as tiff (300dpi)
2) Run open source OCR on it, and save the resulting text to the tiff "comment" field on the metadata
3) Save it in my file server.
4) Index it with a desktop search program (here is a list: http://en.wikipedia.org/wiki/Desktop_search). This has the nice facility of scanning the metadata and allowing you to search it. This way I can search documents by text, ignoring the fact OCR is not 100% correct (it is usually correct enough for me to find the document I want), while having the pure text in photocopy quality as a TIFF (this is very important for legal documents, as OCR'd versions are not acceptable replacements).
I have been wondering whether it would be worth open sourcing the script (for the moment it is a bit hacky, but it has been serving me well for years now). If the TIFFs take up too much space for your liking, substitute with PNG/JPEG/etc...
So far it has served me well, I've been collecting hundreds of documents this way. The only manual step is the script requesting a filename (not a big deal for me, as I have to manually put each page into the scanner anyway).
If you are interested let me know, and I can post the script.
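To make step 2 concrete, here is a stand-alone Python sketch of the same idea using PNG instead of TIFF, because a PNG text chunk can be written with nothing but the standard library. It is not the poster's script: the function names are made up, the OCR text is assumed to already exist as a string, and the `tEXt` chunk type only holds Latin-1 text, so other characters are replaced with "?".

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Length, type, data, then a CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def iter_chunks(png: bytes):
    """Yield (offset, type, data) for every chunk in a PNG byte string."""
    if not png.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG file")
    pos = len(PNG_SIGNATURE)
    while pos < len(png):
        (length,) = struct.unpack(">I", png[pos:pos + 4])
        yield pos, png[pos + 4:pos + 8], png[pos + 8:pos + 8 + length]
        pos += 12 + length  # 4 length + 4 type + data + 4 CRC


def add_text(png: bytes, keyword: str, text: str) -> bytes:
    """Insert a tEXt chunk (keyword, NUL, text) just before IEND."""
    payload = keyword.encode("latin-1") + b"\x00" + text.encode("latin-1", "replace")
    chunk = make_chunk(b"tEXt", payload)
    for pos, chunk_type, _ in iter_chunks(png):
        if chunk_type == b"IEND":
            return png[:pos] + chunk + png[pos:]
    raise ValueError("no IEND chunk found")


def read_text(png: bytes) -> dict[str, str]:
    """Read back all tEXt chunks as a keyword -> text mapping."""
    found = {}
    for _, chunk_type, data in iter_chunks(png):
        if chunk_type == b"tEXt":
            key, _, value = data.partition(b"\x00")
            found[key.decode("latin-1")] = value.decode("latin-1")
    return found
```

Whether a given desktop search program indexes PNG text chunks is something to check before relying on it.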
Should be: I generally file by <TOPIC>/<YEAR>/<MONTH#>.pdf or perhaps <TOPIC>/<YEAR>/<MONTH#>/scans.pdf. I use other variations to the hierarchy if appropriate, for example: TAXES/<YEAR>/<Type_of_Form>.pdf. So all W2s for a particular tax year would be in the same PDF file.
All scanned invoices for a particular year/month would be in the same PDF file and in the same directory as any downloaded invoices.
It's not important that I use the same hierarchy everywhere. I use the hierarchy that will make it easiest to find the document in the future, and that varies according to what I am filing.
Thinking about this question, I checked the folder in which I keep research and notes for my primary area of study. It's 2GB and just under 2,000 separate files. Many of these are OCRed PDFs, some mp3, some .doc, .rtf. Mac OS X's indexing lets me do adequately quick find-by-content searches, and a relatively simple organizational schema for subfolders let me consult categories of data swiftly. I also use a reference manager program that probably has close to a 100 keyword tags, and Finder lets me get to stuff as quickly, so I'm assuming creating some sort of metadata beyond filename, date, and filetype is really unnecessary. I'd say just relax and throw the stuff in a folder in Finder, and back that up somewhere while also using something like SpiderOak. My work requires frequent and specific searches over this fairly large data set, so if this system works for me, it would probably work for you, unless you plan on getting OCD with your OCR and scanning every Wally World receipt. Anyway, my advice is to keep it simple. Life is too short to diddle around with stuff like this.
There are NUMEROUS document/content management systems for Linux (and have been for years), any of which will do VASTLY more than the dumbed-down "Neat" system.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Just a couple questions come to mind:
First: What is the purpose of keeping the information? If it's just to have a record for your own sake of what and when and how much, do you even need to scan the statement or receipt or keep the original? or can having all the info imported into a money manager be enough?
I've been using Quicken for over a decade (still using Quicken 2000 actually as later versions are bloaty) to keep all my financial history in detail. For answering questions like "When did I buy that Belkin KVM switch so I can see if the warranty period has expired" searching the register is good enough as I add enough info the memos. In this example (real one from just a week ago), finding the information easily was enough, and it's to my advantage to have all the individual statements and detail items combined into larger account histories rather than parse an archive tree full of pdf/ocr files (FWIW: even this old version of quicken lets me attach scans of receipts to entries)
Second Question: In what cases is the Original Paper required as opposed to a scan? If you need to show an original statement, receipt or other document to prove some thing or get something approved, do you know when an electronic copy or reproduction is as acceptable as the original? I don't think this is an area with consistent clear cut answers yet because of its newness.
Let's take an admittedly unlikely example. You have a house but have moved to take a job out of state, and you're trying to sell the house. Some scumbag squatter moves in and tries submitting false documents to claim ownership. All the documents relating to purchase and any mortgages have been scanned and shredded. Will the courts, police, banks, city and county offices etc. give you any trouble because they are not signed originals? What if the scumbag claims you fabricated the documents (like he did) and his are the originals? What if some entities accept a scan and others don't?
I've implemented a hybrid system where different documents get scanned / destroyed at different times. I have a single card-file cabinet (filing cabinet with half-height drawers). Paper copies of everything from the current year and previous year are kept in a drawer. At the end of each year, I take all the documents from year-1, shred most of them (assuming any need for them has passed), and put the ones I deem most critical in a small box to archive.
I use gscan2pdf http://gscan2pdf.sourceforge.net/ with my multifunction "printer" and then save the bills and documents in properly named and organized directories as pdf files. Simple as pie. (Why is pie simple?)
I use my iPhone to scan, convert to PDF and upload to my Dropbox. The app cost me $6.
Dropbox will always be there and is backed up.
I used to periodically sort things into hanging folders then dispose of anything nonessential after 3-4 years. A few years back I decided to switch to scanning. So I started collecting a pile of stuff to scan. In the intervening years, that pile has grown and grown and now the scanning would be such a big chore that I don't even like to contemplate it anymore.
The simple fact is that most documents are not something you will ever need again so deserve the minimal effort you can put towards temporary mid-term storage and worth 0 effort for archiving. Others may disagree but I suspect I already keep too much for too long. To be honest, there's not really been much that would have been an issue if I didn't shred immediately after reading.
Zotero (an extension for Firefox, also stand-alone I believe) works well for me to archive lots of PDFs. It has tags and directories, meta information, search, notes etc. Once you've got your PDFs, Zotero is a good organizer.
Please do post the script. Throw it up on pastebin, or, better yet, https://gist.github.com./
Colin Dean Go a year without DRM
Using Camscanner or its ilk is something that a few friends have suggested, but I find the quality of the scans to be less than I really want for long-term archival. This may suffice for many documents that I'm likely never to look at again, such as bills, but things like letters or tax documents I think may require a little higher quality. Also, if a document is more than one page, camera scanning quickly gets unwieldy. I scanned a 30 page document on the go using Camscanner and it was a painful experience.
Any open source library management software that does ebooks should help you out. Here's a list:
http://sourceforge.net/directory/home-education/library/opac/os:windows/freshness:recently-updated/
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
(Longtime listener; first-time caller. If I'm doing it wrong, please be kind.)
I've been going through the same issue and have painstakingly scanned/filed a metric crapton of old documents, putting them in a hierarchical directory structure where I can find them if I need them.
But this sucks for a number of obvious reasons. The ones that bother me the most are:
1) A scanned document is larger (*and* less useful than a downloaded PDF).
2) It's a manual process! I'd rather spend ten hours automating something than five hours over the next 5 years trying to remember the filename convention for storing the scanned document.
Anyway, cutting to the chase, I'm now using Ruby/Watir scripts to automate the business of downloading my most common phone/utility bills from the websites and stashing them directly. I used to use Perl and WWW::Mechanize, but all the websites are now so contaminated with unnecessary JavaScript that only something which manipulates a browser directly allows automation without pulling my hair out. Ruby/Watir works pretty well. Recommend. Automated download: priceless. Without automated download, I'd rather return to scanning paper documents mailed to me; otherwise you quickly find out how unreliable your service provider is for retaining your statements.
If anybody wants some pre-alpha scripts for grabbing their pg&e, comcast, cigna, at&t, schwab, nvenergy statements, let me know.
Super simple scanning system using Linux.
Make directory called scans, make another called taxes
Have a text file of scanning hints with an easy to remember name.
in a terminal, print the scanning hints file and use the Linux mouse copy feature to construct a scan instruction
The scanimage application requires sudo or you can find a tweak using google search to alter the scanner's USB files and make it run from an unprivileged user.
cd scans
cat filewitheasycommandstocopy.txt
Typical contents of my hint file:
sudo scanimage -l 0mm -x 90mm -y 66mm --resolution 400 | pnmtojpeg >cprcard.jpg
# make files non-overwritable
# chmod -w ~/scans/*.jpg
Verify each scan with eog viewer.
Organize scans like this:
Make long filenames with agencynames, recipientnames, and documentnames all in lower case.
use the mouse to copy an old file name for re-use.
this groups similar documents together.
use ls -lt to show most recently scanned items first.
use ls -lt *keyword*.jpg to show selected classes of scanned items.
use locate in the distant future to find those oddball items like certificates or letters of recommendation.
locate certificate | grep rabies
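The "most recent first" and "by keyword" lookups can also be done from a script. A minimal Python sketch, assuming the scans are .jpg files in one folder as described; `recent_scans` is a made-up name.

```python
from pathlib import Path


def recent_scans(folder: Path, keyword: str = "") -> list[Path]:
    """Scans whose name contains keyword, most recently modified first."""
    matches = [
        path for path in folder.glob("*.jpg")
        if keyword.lower() in path.name.lower()
    ]
    return sorted(matches, key=lambda path: path.stat().st_mtime, reverse=True)
```

Because the long file names carry the agency, recipient and document names, a single keyword such as "rabies" or "cprcard" is usually enough to narrow things down.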
Here are some pieces of a scan to ocr script I am developing.
First I am scanning a multicolumn document and to preserve the sense of the document text, I scan even pages twice and odd pages twice.
Second, the scanned images must be rotated. Pieces of the "convert" command appear in the perl fragments here.
Third, I am using the open source tesseract OCR program. Some of my documents have grayed areas that contain text. So I am running tesseract twice on the source files and picking the output file with the most text characters.
Fourth, the basic program is just a big loop with a menu where I input file names or page numbers.
Here goes:
# my $scanprog = "/usr/bin/scanimage --resolution 400 >";# print "$scanprog \n";
# Scanner settings for pages top of book at left of scanner StylusScan 2500
my $scanoddleft = "/usr/bin/scanimage -l 30mm -x 190mm -y 235mm --resolution 400 >";#for odd pages
my $scanoddright = "/usr/bin/scanimage -l 0mm -x 190mm -y 235mm --resolution 400 >";#for odd pages
my $scanevenleft = "/usr/bin/scanimage -l 30mm -x 190mm -y 235mm --resolution 400 >";#for even pages
my $scanevenright = "/usr/bin/scanimage -l 0mm -x 190mm -y 235mm --resolution 400 >";#for even pages
# OCR commands and parameters
#tesseract test1.tif test1 -l eng;
#scanimage -l 26mm -x 166mm -t 10mm -y 125mm --brightness 3 --resolution 400 | pnmtotiff>test1.tif;eog test1.tif;convert -rotate 90 test1.tif test1.tif; eog test1.tif; tesseract test1.tif test1 -l eng
my $tesseract = " tesseract ";
my $language = " -l eng ";
my $brightness2 = " --brightness 2 ";
my $brightness3 = " --brightness 3 ";
my $convert90 = " convert -rotate 90 ";
my $eog = " eog " ;
my $charcount = " wc -c " ;
my $scanpage = 1; # Range is 1 to 183
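The "run the OCR twice and keep the output with the most text characters" step from the third point can be sketched on its own. This is in Python rather than the Perl of the fragments above and is not the poster's code; it assumes the OCR runs have already written their text files, and it compares file sizes in bytes, which is what `wc -c` counts. `best_ocr_output` is a hypothetical name.

```python
from pathlib import Path


def best_ocr_output(candidates: list[Path]) -> Path:
    """Pick the text file with the most bytes, like comparing wc -c counts."""
    if not candidates:
        raise ValueError("no OCR outputs to compare")
    return max(candidates, key=lambda path: path.stat().st_size)
```

More bytes is only a rough stand-in for "better OCR", since a noisy pass can also produce a lot of garbage characters, so it is worth eyeballing a few pages.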
gscan2pdf can do OCR and embed the results as annotations within the PDF. Perhaps that would help with searchability. It works well enough with a lot of my documents. It is far from perfect, but it is good enough for those purposes, especially for bills, as they are not handwritten. Best results are on scans set to line art/b&w rather than grey scale or colour.
I've been doing this for a while now. Like others here, I have a Fujitsu Scansnap 1500- it's one of the best investments I've made for cleaning up my office/workflow.
When something comes in, I immediately scan it to the filesystem. My structure is:
2013/Banking/BankName/2013-01-31-14h32.pdf (or something like that- it's the default Scansnap filename.)
I then place the original in a filebox- keeping one filebox for each year. No sorting, organizing, just keeping originals.
At the end of each year, the filebox goes to the crawlspace, and I start a new one. After 7 years, the intention is to get the box securely shredded (costs about $10/box around here).
I back the filesystem up nightly to two separate local NASs, and upload the whole filesystem (as a series of encrypted files) to Amazon Glacier (this is a recent addition to my workflow- has stopped me worrying about a fire etc. wiping out both NASs).
All of my documents go in there- it's really easy to find stuff (depending on how good your folder organization is- you can add depth for those kinds of documents that need it, while other ones that aren't likely to be needed can be put in a less descriptive folder hierarchy.)
For what it may be worth, I have an all-in-one HP printer/scanner (model in subject). It's reasonably cheap, it is a good printer, and it has a double-sided scanner with auto-feeder which works really well.
I've scanned thousands of sheets with it recently (for archiving before shredding), and I would never have even tried it without the automatic scanning.
Disclaimer: not an HP employee, have no HP stock...
"I have been wondering whether it would be worth open sourcing the script...."
Please do. Unless you deem it worthwhile to spiffy it up and try to make some moola, I think it'd be great to share your script. It could be useful to some, and could be instructive to those wanting to learn, to see how someone else has done something. Any possible embarrassment you might feel about it being 'a bit hacky' you might could toss off to 'having character'. Heck, after your description I'd like to see it, even tho I haven't done any real coding in years.
Seems to me putting the OCR text in the comment field is a fine and good thing. An obvious thing to some, perhaps, but an elegant usage to me.
@Rinisari, below - the link you gave throws a cert warning in Opera, could just be my settings.
This is so easy, I've been doing it since primary school. Mark your files with date YYYY-MM-DD-name. Put them in folders by year: 1993, 1999, 2005, 2010, 2013 (e.g.). Profit! In modern times I have used a Lexmark X560 all-in-one office machine to scan everything to a designated network drive. Works like a charm every time. I apply the backup policy of equal drives: I buy a 3TB drive, and I buy another for off-site backups. Once every two months or so, I freshen the backup and verify it. If you need to do it a couple of notches more professionally, apply the same backup policy, but use a document retention system, and store everything in PDF, which is a global industry standard. There are many good document repositories that are free (as in beer).
The problem with most businesses is that they want to have their cake & eat it too... they want to get you to opt into paperless statements, but they don't want to allow you to fetch your statements via automated means. They just want to spam you monthly (or more), then make you go to their site, log in, and generally set things up to make it as hard to automate those logins as possible. If companies like CapitalOne and Chase would let you just give them your public key, encrypt your statements with it, and email them directly to you (or allow you to fetch them in some standard manner via a web service), I'd happily let them off the hook and go all-electronic. But I'll be damned if I'm going to settle for statements I have to go out of my way to obtain. At least printed statements can be tossed into a box and ignored for years unless I care enough to look at them, as opposed to ephemeral online statements that go bye-bye after 12 months.
>there won't be a reason to change to format for approximately 7,895 years (but who is counting, really).
I'm kicking myself for not having caught this earlier!
Thanks (no sarc) for alerting us to the Y10E4 problem.
Question: Is Linux Y10E4 ready?
I'm not a lawyer, but I play one on the Internet. Blog
And done :)
https://github.com/ZivaVatra/SDAT
Figured I would take the opportunity to try out Git (have not bothered so far).
Also, it seems that I recently made it save to tagged PNGs rather than TIFFs. Forgot about that :)
Hope it turns out to be useful to you. Let me know if you want commit privs for any fixes you do. Happy Hacking!
As requested, I have put it on github now:
https://github.com/ZivaVatra/SDAT
Hope it is useful to you, or at least interesting :)
I don't think it is worth selling the script, it isn't that fancy, not to mention that then I would be on the hook for supporting it. :(
Since the recession hit I've had to work 2 jobs, and I really don't have much time to devote to personal nerdy pursuits. Barely have time to sleep as it is
All I can do is publish and hope it helps others. That little script has made my life a lot easier and less cluttered. With any luck others with more time will improve on it and we all benefit :)
And it isn't your settings. I also get an invalid certificate on Firefox and Chromium. Something is up with the link, so I just went ahead and used the actual github.com site to host it.
For maintaining a high-integrity archive of documents, try Boar. It can even version control huge documents, like movies and photos. http://www.boarvcs.org/
Try out Alfresco. It is a nice document management system if you are familiar with IT systems.
I know it's probably different here in the UK where most people don't even need to do a tax return, but basically the really important stuff like house deeds (and wills) is in the hands of a solicitor anyway, and I simply don't need to keep copies of 2 year old bank statements or 3 year old electricity bills on the off chance I might need to refer to them.
If you have a business, fair enough, you legally need to keep financial stuff for 6 years, but then off-site archiving is just an insignificant business cost.
To have a right to do a thing is not at all the same as to be right in doing it
You would think that in this day and age someone could invent an OCR system that has a basic understanding of the documents being scanned. It would automatically name files "Bank of Elbonia Current Account Statement 03/2013" and allow you to do things like search all statements for transactions over £250.
In fact the bank could just include a big QR code on the back with all the data in it, but I suppose that is asking too much.
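A rough sketch of the "search all statements for transactions over £250" idea, assuming the OCR step has already turned a statement into plain text with one transaction per line. The function names, the amount pattern, and the filename layout are all made up for illustration; real statements will need a pattern tuned to each bank's layout.

```python
import re
from decimal import Decimal

# Matches pound amounts such as "£250", "£42.10" or "£1,200.00" in OCR text.
AMOUNT_RE = re.compile(r"£\s?(\d[\d,]*(?:\.\d{2})?)")


def amounts_in(line):
    """Return every pound amount found in one line of OCR text."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(line)]


def transactions_over(text, threshold):
    """Return the lines that mention at least one amount above threshold."""
    limit = Decimal(str(threshold))
    return [
        line
        for line in text.splitlines()
        if any(amount > limit for amount in amounts_in(line))
    ]


def statement_name(bank, account, month, year):
    """Build a sortable filename (hypothetical layout).

    The date is written as YYYY-MM because "/" is a path separator
    and cannot appear in a filename.
    """
    return f"{bank} {account} Statement {year:04d}-{month:02d}.pdf"


if __name__ == "__main__":
    sample = (
        "01/03 Tesco £42.10\n"
        "05/03 Rent £1,200.00\n"
        "09/03 Garage £250.00\n"
        "12/03 Laptop £899.99"
    )
    for line in transactions_over(sample, 250):
        print(line)
    print(statement_name("Bank of Elbonia", "Current Account", 3, 2013))
```

The hard part is still the OCR quality and recognising which bank and account a page belongs to; this only covers what happens once the text is readable.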
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Thank you. I'm enjoying looking at the script, am wondering where I can get a scanner I can afford, and had a look at your site as well. And thanks for the readme also. Now get some sleep. (grin) I've had two jobs at times, when I was younger, and it's decidedly not only no fun, but done too long a recipe for ungood juju. Good luck to you.
There is a very good OS X application called doo which will do exactly what you want and it's completely free. Check it out at http://doo.net/ and you have it on Mac AppStore.
This is not actually a real solution, I'm just amused every time I see document management show up in a slashdot thread - being it's not a very *exciting* field. I work for a company that provides document management solutions for much larger organizations, with costs (I think) starting in the tens of thousands of dollars, and going up to way more than that. Of course, since I work here, I have my own personal test repositories for testing things, and when I was buying a house and started getting crazy amounts of paper documents I had to sign, scan and send back, I was like, why not scan them with QuickFields and keep them in a Laserfiche repository? So I did. :D
That solution doesn't work for most people, though. (Also it's totally not open source. Source is only open to those who are in development at this company, which I am, so I suppose in a twisted way...)
Can anyone tell if you can train tesseract to be a bit better at recognizing a specific font?
I'm using the Debian version but if you have a 300 dpi scan the OCR is often gobbledygook.
(Yes, "use the source Luke" is also a valid answer in this case...)
To be, or not to be: isn't that quite logical, Slashdot Beta?
Hi Rinisari,
I work at doo (http://doo.net) and immediately thought of our app when I saw your problem. All of the filing systems listed so far really make sense, and I personally learned a few things from some of the workflows suggested. But using doo would greatly simplify your entire document management.
When you set up doo you select the documents and folders that you wish to connect: not just those imported from connected scanners, but Dropbox, GDrive, the local HD, email, etc. Once documents are connected, the app indexes them and runs OCR automatically. Then it allows you to back everything up and sync it to the cloud for backup and access on other devices.
And it’s all in one spot. Handling the “task of document management” is precisely what we do.
Some details ...
If your document(s) is already digital – whether it’s a scan on your HD, a Google Drive document, an email attachment, in Dropbox, etc – you can connect the source or just the individual folder/document to doo.
If your document is still paper, you can scan it directly into doo using the app’s interface if the scanner has a TWAIN driver (http://is.gd/15O26Q). If it doesn’t have a TWAIN driver, we have a guide for quickly setting up top brands, like Fujitsu ScanSnap, Canon and Doxie (http://docs.doo.net/scanguide.pdf). I also just wrote a blog post about it if you want to take a look: https://blog.doo.net/2013/04/04/how-to-scan-with-doo.html.
The coolest thing about doo is that it does the tedious, difficult part of document management for you: the aforementioned automatic indexing and OCR occur right when you connect the document or folder in question, which means document search-and-retrieval is a cinch. After that, should you wish to add a degree of personalization, you can alter existing tags or add individual labels in the app’s intuitive UI.
doo is currently available for OS X and Windows 8, and will be coming very soon for Android and iOS (https://doo.net/en/download.html). Hope that helps! Send us a ticket at support@doo.net if you have any questions.
Keep It Simple Software (engineer) or whatever...
I went with the simplest possible solution. One that also allows me to recover even if a "database" becomes corrupted or obsolete, because all the "real" data is contained in the documents themselves.
I just scan to PDF and add tags in the Keywords field of the PDF metadata. For the keywords, I use unique words that aren't going to show up in an actual document. (Just tacking on a prefix or surrounding each keyword in brackets is good enough.) I also organize the files in a decent (but not too detailed) directory structure. (You can use any high-tech storage system you like. I just use a regular hard drive.) Then I installed a PDF iFilter (there are many; Google is your friend) so the Windows Indexing service could index the files, including that metadata. So now, if I want to find all the tax files related to my farm, say (totally made-up example), I just navigate to the directory that holds all my tax documents, then do a basic Windows search for [farm] and there are all my documents. No database to manage or learn how to use. Just the files and their metadata.
There are utilities that allow you to easily select a group of .PDF files and tag them all with the same keywords. I'm sure you can find one for any OS. And the beauty is: Once the file is tagged with the keyword, it doesn't matter if you just throw away the program you used to set that keyword, because the keyword is just a normal part of that .PDF file.
Because the keywords are standard PDF metadata, any OS should be able to read and index on them. If not, then you could find some program that would, I am sure. Again, the beauty of this system is: if you lose access to that indexing system, or move your files to a different platform, all you gotta do is reindex the metadata that is right there in the files. As long as you have your files, you have your keywords.
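For what it's worth, the bracket convention is easy to keep consistent with a few lines of script. This is only a sketch of the string handling, with made-up function names. Actually writing the string into the PDF's Keywords field is left to whatever tagging utility you already use.

```python
import re


def format_keywords(tags):
    """Wrap each tag in brackets so it cannot collide with ordinary words."""
    return " ".join(f"[{t.strip().lower()}]" for t in tags if t.strip())


def parse_keywords(field):
    """Recover the tag list from a Keywords string such as "[farm] [tax]"."""
    return re.findall(r"\[([^\[\]]+)\]", field)


def matches(field, tag):
    """True if the Keywords string carries exactly the given tag."""
    return tag.strip().lower() in parse_keywords(field)


if __name__ == "__main__":
    field = format_keywords(["Farm", "tax"])
    print(field)
    print(matches(field, "farm"))
```

Matching on the whole bracketed tag means a search for farm does not also pull in something tagged [farmhouse], which is the point of making the keywords unique in the first place.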
In a similar vein to http://ask.slashdot.org/comments.pl?sid=3623835&cid=43389299, I would say: yes, please post to pastebin or github or something (maybe even your own Slashdot journal); if you GPL it, someone might even do some fine-tuning for you.
404555974007725459910684486621289147856453481154 in hex is "You sank my Battleship?"
[GPG key in journal]
I really didn't want to get into specifics and waste a bunch of time on minutiae in this thread, discussing the pros and cons of each, and details like one not having some edge feature that Neat or some other does.
http://lmgtfy.com/?q=linux+document+management+system
There are oh so many out there, and lots and lots of others have endlessly discussed the benefits of each. There's even new ones every day, because, as the OP said, it's just a matter of a tiny bit of programming to fit all the existing pieces together.
Slashdot gets worse every day... Pipedot: News for nerds, without the corporate slant
Glad you like it! As for scanners, I bought my first one secondhand on amazon for $45. It served me throughout Uni for the next 5 years, until I got a new one (it came as a 2-for-1 deal with a new printer). If that is beyond your budget I have seen them be thrown/given away (especially old parallel port ones), and a lot of them work well with Linux.
If Linux support is a must, have a look at: http://www.sane-project.org/sane-supported-devices.html
Yeah.. My site, like the rest of my non work life, is out of date and broken (the counter no longer increments, and it doesn't load properly, breaking the site). I am in the process of rewriting the backend, but time is short.
Thanks a lot, hopefully times will get better soon, and then I can devote some proper time to personal projects again :)
Good luck with your attempts to find a scanner as well!
Thanks for the tips. I'll be glad when I'm able to stand for more than a few minutes without passing out - then I can take a bus to the re-sale shops (our city fathers in a burst of concern for bargain hunters - many of them very low income and living in and around down town - moved all those shops to out-lying areas, to better serve the citizens' needs), else they'd all be within crutching distance.
I've never tried the extent of external connectivity, but if an XP vm can talk to my printer, it should maybe talk to a scanner as well, so either way it oughta be OK.
Well, even as is, I enjoyed your site, found some interesting things to read. Yeah, we just do what we can, as spirit moves and wallet enables. And the meat bag cooperates, of course.
If you use one of Google's products, you should be good. NEVER go Apple though. If you want something open-source, avoid anything and everything Apple as much as you can. Your thoughtful friend, JBJblaze