Ask Slashdot: Open Source For Bill and Document Management?
Rinisari writes "Since striking out on my own nearly a decade ago, I've been collecting bills and important documents in a briefcase and small filing box. Since buying a house more than a year ago, the amount of paper that I receive and need to keep has increased to deluge amounts and is overflowing what space I want to dedicate. I would like to scan everything, and only retain the papers for things that don't require the original copies. I'd archive the scans in my heavily backed up NAS. What free and/or open source software is out there that can handle this task of document management? Being able to scan to PDF and associate a date and series of labels to a document would be great, as well as some other metadata such as bill amount. My target OS is OS X, but Linux and Windows would be OK."
Send them to a dedicated Gmail account. You'll be able to find all of your documents (you can label them, whatever), and they provide an online office suite of some sort. And if you forget what you have there, you can always just go to Google search and push the "I'm Feeling Lucky" button.
You can't handle the truth.
I ended up with gscan2pdf and a rigid directory and filename structure. It works, but yeah, tags would be nice.
I shall go and tell the indestructible man that someone plans to murder him.
Subject says it all
May I suggest reading David Sparks' Paperless book?
It has a whole chapter on how to tag documents on OS X, which sounds like what you're looking for.
OpenKM (http://www.openkm.com/en/) is what I use to manage my documents; its tagging and document preview features are what I appreciate most. It runs as a web service, FYI.
I built my own document storage system years ago using SANE, PostgreSQL, and I think it was Perl for the web interface, though there are much better ways to do the web part.
By definition, "important" = keep the original. (I mean seriously, are you that short of basement space? Slashdot has gone over this before; is this the editors' idea of a yearly question?)
Electronics are ephemeral. You can, today, read stuff on papyrus, as long as you know the language. Do you really want to trust stuff that is important to ephemeral electronics?
(i mean, how many times has
Tagging is an inherently stupid idea. It may be the best that you can do with current technology, but a Google-like full-text search is much, much better. (Tell me: if you want to pull out a piece of information you know is on your hard drive in a PDF, do you look for the PDF, or just google it?)
It is possible that, after 5 or 10 years, you might know what tags you want...
Tagging is hard work that you have to do manually and consistently; better to have 3 or 4 folders organized by client/project then tag.
I do this on Windows using the cheapest HP all in one with ADF with its bundled scan to PDF with OCR. I use an encrypted TC volume for storage. 512MB is plenty for several years worth at 300dpi b/w. The less typing you have to do the better. Just use one folder for each major category. House, Taxes, utilities, etc. Don't make yourself work too hard entering each item or you will never get around to scanning.
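As a rough sanity check on the 512MB figure, here is a back-of-the-envelope sketch in Python. The US Letter page size and the 20:1 compression ratio are assumptions for illustration, not numbers from the post; real compression of black-and-white scans varies a lot with page content.

```python
# Rough capacity estimate for 300 dpi black-and-white scans.
# ASSUMPTIONS (not from the post): US Letter pages and a 20:1
# compression ratio for bilevel images; real ratios vary widely.

def raw_page_bytes(dpi=300, width_in=8.5, height_in=11.0):
    """Uncompressed size of a 1-bit-per-pixel page, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8

def pages_that_fit(volume_bytes, compression_ratio=20.0, dpi=300):
    """How many compressed pages fit in a volume of the given size."""
    compressed = raw_page_bytes(dpi) / compression_ratio
    return int(volume_bytes // compressed)

volume = 512 * 1024 * 1024  # the 512MB volume mentioned above
print(raw_page_bytes())        # bytes per uncompressed page
print(pages_that_fit(volume))  # estimated page count under the assumptions
```

Under those assumptions the volume holds on the order of ten thousand pages, which is consistent with "plenty for several years" of household paper.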
In the business this is often part of ERP software. ERP stands for enterprise resource planning.
An open example would be ERPAL
http://drupal.org/project/erpal
Similar questions to yours appear here regularly. The consensus is that it's best just to throw the bills and documents out and spend more time watching porn.
You can try Alfresco DMS.
It requires a web server, so it might be too much for a single user.
http://www.icyblaze.com/idocument/
iDocument for the mac is like iTunes but for documents. It lets you import documents (pretty much any type) and tag them and store them in virtual or real folders, it sounds like it's exactly what you're after.
The problem with slashdot is that most of its users were bullied and stuffed into lockers as kids!
1) Receive document.
2) Scan with Fujitsu ScanSnap S1500 in about 10 seconds. $380 on sale, but so far it's been so much better than cheap all-in-one scanners that it's not even funny. Seriously, don't even bother going paperless unless you get a real document scanner.
3) Save PDF to simple software RAID-1 mirror of two 2TB drives. (Takes about 5 seconds to set up from Disk Management in Windows.) This should protect against sudden drive failure taking everything.
4) Backup nightly to external drive swapped off-site every other month. This should protect from accidental deletions, fires, etc. Bonus points if backup drive is ioSafe fire proof variety.
5) Throw away original. Only exception is official documents like titles, marriage certificate, etc.. Yes, I even throw away W2s and the like. My taxes are 100 percent digital nowadays.
6) Check and test restore from those backups on a semi-regular basis, and you're done!
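Step 6 is the one people skip, so here is a minimal sketch of what a restore check could look like: restore the backup to a scratch folder, then compare checksums against the live copy. The function names and the idea of restoring to a separate directory are assumptions for illustration; the post does not say how the restore test is done.

```python
import hashlib
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_restore(original_dir, restored_dir):
    """Compare every file under original_dir with its restored copy.

    Returns a sorted list of relative paths that are missing or differ.
    """
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in original_dir.rglob("*"):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        copy = restored_dir / relative
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(str(relative))
    return sorted(problems)
```

An empty list means every file in the live folder has an identical copy in the restored folder. Files added since the last backup will show up as missing, which is expected.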
So, I've been doing this pretty consistently for the past few years and sent this advice to some relatives asking basically the same question. (That's also why it's a little dumbed down.)
I haven't found a case where any sort of CMS makes more sense than the file system. This is after doing this for about 10 years, and I've got records going back to '01.
I'm using a Fujitsu ScanSnap and a Fellowes Powershred, and running Mac OS X. OS X has decent indexing, a good file system manager (really can't beat column view) and the Preview app will let you reassemble PDFs, which is occasionally very handy.
1. The enemy is copies. I strongly recommend "scan and shred", or you'll wind up scanning the same thing over and over.
1.1. Don't bother with any scanner that doesn't do double-sided scans.
1.2. Use a shredder. You can take things out of a trash can.
1.3. The scanner should come with OCR software. Choose "Searchable PDFs".
2. Do scanning in small batches.
2.1. Create a folder "Scanned", and "Unfiled".
2.2. The scanned files go immediately into Scanned, and the paper immediately goes into the shredder.
2.3. After you've got a batch of stuff scanned, you move it into Unfiled and correct the names, or split the documents up as you need to.
3. If it takes any work to scan it just shove it in a filing cabinet, or, better yet, just shred it.
3.1. If you're having to use a flatbed, it's too complicated to scan and you should file or shred it.
3.2. You can often get manuals and pamphlets and stuff online by googling part of the text or the product name.
4. Don't scan anything you can get electronically.
4.1. Most companies would much rather let you download bills and statements and such.
4.2. Most of them will also delete those statements after a few months, so get in the habit of immediately downloading the statement.
5. It's *very* helpful to put a date on everything. I generally do YYMMDD, trying to guess from dates I find in the document.
5.1. If it's a document covering a period of time like a bill for the month of November, I use the ending date.
5.2. For tax documents I'll put TT-YYMMDD, where TT is the tax year, since the actual transactions occur that year, but filing and IRS stuff happens the year after.
6. I've found that even with full text search, you still need folders.
6.1. They just don't need to be extremely complicated; usually two levels seems to be fine. I'll put prior years into separate folders, too.
6.2. Your system will evolve as you work; just get it in there, and then be mindful of what you are commonly looking for.
6.3. Keep books and reference manuals in a folder that doesn't get indexed. (Spotlight has an option for this.) They tend to create a lot of spurious hits.
7. Keep your inbox clean: if an email wants you to download a statement, get it right away and put it in Unfiled.
7.1. Likewise, keep your desktop clean, scan and shred stuff as soon as it comes in.
7.2. Have a periodic to-do item to tidy your files, don't spend more than half an hour (tops!) at any given time.
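The date conventions in point 5 are easy to get wrong by hand, so here is a small sketch of a helper that builds the names. Putting the date stamp first, followed by a space and a description, is an assumption; the post only says to put a date on everything.

```python
from datetime import date

def dated_name(description, doc_date, tax_year=None):
    """Build a file name carrying a YYMMDD date stamp.

    doc_date is the document date (for a bill covering a period, pass the
    ending date). tax_year, if given, adds the two-digit TT- prefix.
    NOTE: the "stamp description.pdf" layout is a hypothetical choice.
    """
    stamp = doc_date.strftime("%y%m%d")
    if tax_year is not None:
        stamp = f"{tax_year % 100:02d}-{stamp}"
    return f"{stamp} {description}.pdf"
```

For example, `dated_name("electric bill", date(2012, 11, 30))` gives `121130 electric bill.pdf`, and a W-2 received in January 2013 for tax year 2012 would be `dated_name("W2", date(2013, 1, 31), tax_year=2012)`, giving `12-130131 W2.pdf`.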
My suggestion would be to just scan and OCR your files, and then store them in your file system.
Hierarchy might be something like: ~/scans/year/project/sorted
Within each sorted subdir, you'd have three folders. Date, organizationThatGeneratedTheDoc and TypeOfDoc.
So in the folder ~/scans/year/project/sorted/org
The file names would be something like: organizationThatGeneratedTheDoc-yyyy-mm-dd-TypeOfDoc.pdf
In the folder ~/scans/year/project/sorted/TypeOfDoc
The file names would be like: TypeOfDoc-yyyy-mm-dd-organizationThatGeneratedTheDoc.pdf
Etc.
You'd use links (symlinks or hard links) to make sure that each document is accessible in more than one place. (You can also use links to put documents in more than one project folder.)
Types of documents would be things like invoices, receipts, legal threats, court orders etc. In the event that a document has more than one type, or more than one organization, you simply have more links. So invoice-2013-04-07-webdevteamawesome.pdf and legalthreat-2013-04-07-webdevteamawesome.pdf are the same document, because the first page is an invoice, and the second a threat to take you to court if you don't pay. (This then exists six times, three times for each type, but with the magic of hard links only takes up the space of 1.001 documents.)
With the OCRed text being saved with the PDF scan, you can also run text searches within your files to find specific information (such as bill amount; seriously, how often would you use that information?)
This allows you maximum flexibility, and prevents you from being locked into a particular piece of software (as you can do everything manually). Moreover, once you've got it set up, it's easy to run with each new document.
Steps would be:
1) Scan and OCR doc, saving the PDF into the staging area folder.
2) Run your script, which asks for the date, project, org name, doc type.
3) The script then saves the document in the appropriate folders, generating links as required.
4) Profit!
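The steps above can be sketched as a short Python function. This is only one possible reading of the layout: the view folder names `date`, `org` and `type`, and the file name used in the date view, are assumptions, and the interactive prompting from step 2 is left out (the four values are passed in as arguments).

```python
import os
import shutil
from pathlib import Path

def file_document(staged_pdf, root, doc_date, project, org, doc_type):
    """Move a scanned PDF out of staging and hard-link it under three views.

    doc_date is a 'yyyy-mm-dd' string. Returns the three resulting paths.
    The view folder names ("date", "org", "type") are hypothetical.
    """
    year = doc_date[:4]
    sorted_dir = Path(root) / year / project / "sorted"
    names = {
        "date": f"{doc_date}-{org}-{doc_type}.pdf",
        "org": f"{org}-{doc_date}-{doc_type}.pdf",
        "type": f"{doc_type}-{doc_date}-{org}.pdf",
    }
    targets = [sorted_dir / view / name for view, name in names.items()]
    for target in targets:
        target.parent.mkdir(parents=True, exist_ok=True)
        if target.exists():
            raise FileExistsError(target)
    # The first path receives the actual file; the others are hard links
    # to the same data, so the document is stored only once.
    shutil.move(str(staged_pdf), str(targets[0]))
    for target in targets[1:]:
        os.link(targets[0], target)
    return targets
```

Hard links only work within a single file system, and some network shares may not support them, so it is worth trying this on the NAS with a throwaway file first. Symlinks (`os.symlink`) are the fallback.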
HELP MY ACCOUNT HAS BEEN HACKED BY AN ILLIBERAL ART STUDENT SET TO DESTROY THE INTERWEBZ!
Alfresco is open source, works for Windows, Linux and Mac, and allows you to define your own metadata for your documents.
http://wiki.alfresco.com/wiki/Download_and_Install_Alfresco
http://www.amazon.com/ScanSnap-S510M-Instant-Sheet-Fed-Scanner/dp/B000WJCX18/ref=sr_1_31?s=pc&ie=UTF8&qid=1365365308&sr=1-31&keywords=archive+scanner
The above come highly recommended as an all-in-one solution.
I should note, you need to be careful to make sure you use the same spelling and wording for each org and doc type. You don't want to end up with Murphies, Murphy's, Murphy's Inc., Murphpy's Beer Company Inc. etc., each with invoice, inv., invoise and envoice.
It would be better if your script forced you to pick a doc type, and showed a list of already existing companies.
This applies no matter what solution you end up running with.
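One way a script could steer you toward an existing spelling is fuzzy matching against the names already in use. The sketch below uses Python's standard `difflib`; the function name and the 0.6 similarity cutoff are assumptions for illustration, and a real script would still show the suggestion and let you confirm it.

```python
import difflib

def canonical_name(entered, known_names, cutoff=0.6):
    """Return the existing name closest to `entered`, or None if nothing is close.

    Matching is case-insensitive; the cutoff (0..1) is an arbitrary choice.
    """
    lowered = {name.lower(): name for name in known_names}
    matches = difflib.get_close_matches(
        entered.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None
```

With known organizations `["Murphy's", "Acme Water"]`, typing `Murphies` would suggest `Murphy's`, while a genuinely new name returns `None` and can be added to the list.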
Also, for documents that cover a period, you have multiple options. The first is to give 00 as the day and month (e.g. 2012-12-00), and the second 01 as the start (e.g. 2012-12-01). Another is to have two dates (2012-12-01-to-2013-01-01) in place of the yyyy-mm-dd suggested in my first post. Also, don't even think of having the dates in any other order than year, month, day.
Some places have a working year (e.g. a tax year) that crosses two calendar years. In that case, you should be careful about where you put documents. Because if you put them in the first year, and then go "OK, it's been 7 years, and I no longer need any docs from 2005", you'll be burnt. A solution is to hardlink them into both years.
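A sketch of the period handling described above: one helper produces the two-date stamp, the other lists every calendar year a period touches, so the document can be hard-linked into each year's folder. The function names are hypothetical.

```python
from datetime import date

def years_covered(period_start, period_end):
    """Calendar years touched by a period, e.g. a tax year spanning two."""
    if period_end < period_start:
        raise ValueError("period_end is before period_start")
    return list(range(period_start.year, period_end.year + 1))

def period_stamp(period_start, period_end):
    """The two-date form suggested above: yyyy-mm-dd-to-yyyy-mm-dd."""
    return f"{period_start.isoformat()}-to-{period_end.isoformat()}"
```

A working year running from 1 July 2005 to 30 June 2006 yields `[2005, 2006]`, so the document gets a link in both year folders and survives when the 2005 folder is purged.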
Do post back when you have a solution!
1) Receive document.
Most places I do business with offer electronic documents: billing, statements, etc ....
No need for a scanner.
Now the old stuff, well, of course you need a scanner.
I've played with this a few times, never used it in anger though.
http://www.mayan-edms.com/
I might take up your challenge on going paperless too and give Mayan a go.
PDF is big and bulky. DJVU format makes for tiny document scans. And there are open source libraries for creating it, available even in Debian. Wavelet compression did finally make it into the wild. It's just nobody has ever heard of it, for some reason.
Doesn't help for organization, but it should be a reasonable option for storage.
It even embeds the OCR text in the document along with the image version, so it doesn't proliferate multiple copies of the same data.
Your suggestion is over-complicated IMHO. I use Xsane and scan as multi-page documents. Xsane allows me to add pages to the scan set and reproduce a new PDF file. There are some downsides to my method: I need to have an approximate idea of the date of the document that I am looking for.
I generally file by //.pdf, although I may vary the hierarchy if appropriate, for example: TAXES//.pdf
Perhaps more important, though, is to extract the data into some form of record keeping (even if it is only a spreadsheet) at the time that it is saved. Then, unless I am being audited, I really don't need the scans.
The real "Libtards" are the Libertarians!
So, you were watching TV and you saw the commercial for the Neat Scanner. You thought to yourself, "That's a great idea. I wish I had that, but I run Linux. I wish there was something like this for Linux. Certainly someone has come up with such software for Linux."
I know how you feel. That's exactly what happened to me over 5 years ago. Yet today, we still have this lame list of "excuses" for solutions. PDFs in rigid directory structures, blah, blah, blah.
Sadly there is still nothing like the Neat scanner system for Linux. Something that, preferably, OCRs and indexes your documents for easy searching and retrieval. At the least something that indexes, even if you have to manually populate the fields. Nothing at all after years of hoping.
Cue the replies stating it would be trivial to make your own using MySQL/MariaDB and a PHP frontend. It always amuses me that it's so trivial, yet no one has done it yet, except on Windows.
I have been trying to do this for a while. I have a ScanSnap S1500M and have been hosting all the PDFs on my Synology NAS. However, programs like iDocument don't support network drives and text searching PDFs. They rely on Spotlight's database, and Spotlight doesn't work on a NAS (though it supposedly does work on an Apple server).
I'd LOVE some sort of text searchable solution that is better. I do use iDocument, but that has a LOT of limitations; for example, it will not handle ePUBs. I'm hoping at some point Synology will create an app for its line of units similar to something like Evernote. They already have two great apps that allow you to stream audio and video from your Synology unit to an iOS or Android phone and computer. And they also have a Dropbox-like app. The last piece they really need is some sort of document management thing that works with their stuff. That would be a perfect solution for someone who has a lot of documents or a small business which doesn't want to have its data in the hands of Google or other companies.
It's either on the beat or off the beat, it's that easy.
I moderate therefore I rule!
boxcryptor + your choice of google drive, dropbox, etc. Keep a notes.txt file and put each set of scans in a dated folder. Keep a local copy and use copernic etc to make it easily searchable. done and done.
Maybe that ownCloud thing will work well to handle the storage and access. Does anyone know if its search function is any good?
I use the community edition of Alfresco for that task. You can tag all documents, add custom fields and have full text search and versioning out of the box. Documents can be accessed via web interface, smb, ftp and even imap.
I have been setting up the Alfresco CMS at work for our company's document management and it's pretty robust. The install process on Linux is a bit of a bitch, because you have to get it just right. I am not sure about the other OSes but I am assuming it doesn't require too much on the Windows and OS X side of things. http://www.alfresco.com/node/2296?utm_expid=11184972-4
My situation is the same, except that I move often, and have to keep legal documents for a few years (typically 5). I also have paper copies of invoices and bills (loads). I didn't want to have to lug boxes and boxes of paper, so I developed a script to do the following:
1) Scan the document page by page, and save as tiff (300dpi)
2) Run open source OCR on it, and save the resulting text to the tiff "comment" field on the metadata
3) Save it in my file server.
4) Index it with a desktop search program (here is a list: http://en.wikipedia.org/wiki/Desktop_search). This has the nice facility of scanning the metadata and allowing you to search it. This way I can search documents by text, ignoring the fact OCR is not 100% correct (it is usually correct enough for me to find the document I want), while having the pure text in photocopy quality as a TIFF (this is very important for legal documents, as OCR'd versions are not acceptable replacements).
I have been wondering whether it would be worth open sourcing the script (for the moment it is a bit hacky, but it has been serving me well for years now). If the TIFFs take up too much space for your liking, substitute PNG/JPEG/etc.
So far it has served me well, I've been collecting hundreds of documents this way. The only manual step is the script requesting a filename (not a big deal for me, as I have to manually put each page into the scanner anyway).
If you are interested let me know, and I can post the script.
Should be: I generally file by <TOPIC>/<YEAR>/<MONTH#>.pdf or perhaps <TOPIC>/<YEAR>/<MONTH#>/scans.pdf. I use other variations to the hierarchy if appropriate, for example: TAXES/<YEAR>/<Type_of_Form>.pdf. So all W2s for a particular tax year would be in the same PDF file.
All scanned invoices for a particular year/month would be in the same PDF file and in the same directory as any downloaded invoices.
It's not important that I use the same hierarchy everywhere. I use the hierarchy that will make it easiest to find the document in the future, and that varies according to what I am filing.
Thinking about this question, I checked the folder in which I keep research and notes for my primary area of study. It's 2GB and just under 2,000 separate files. Many of these are OCRed PDFs, some mp3, some .doc, .rtf. Mac OS X's indexing lets me do adequately quick find-by-content searches, and a relatively simple organizational schema for subfolders let me consult categories of data swiftly. I also use a reference manager program that probably has close to a 100 keyword tags, and Finder lets me get to stuff as quickly, so I'm assuming creating some sort of metadata beyond filename, date, and filetype is really unnecessary. I'd say just relax and throw the stuff in a folder in Finder, and back that up somewhere while also using something like SpiderOak. My work requires frequent and specific searches over this fairly large data set, so if this system works for me, it would probably work for you, unless you plan on getting OCD with your OCR and scanning every Wally World receipt. Anyway, my advice is to keep it simple. Life is too short to diddle around with stuff like this.
Just a couple questions come to mind:
First: What is the purpose of keeping the information? If it's just to have a record for your own sake of what and when and how much, do you even need to scan the statement or receipt or keep the original? or can having all the info imported into a money manager be enough?
I've been using Quicken for over a decade (still using Quicken 2000 actually, as later versions are bloaty) to keep all my financial history in detail. For answering questions like "When did I buy that Belkin KVM switch so I can see if the warranty period has expired?" searching the register is good enough, as I add enough info to the memos. In this example (a real one from just a week ago), finding the information easily was enough, and it's to my advantage to have all the individual statements and detail items combined into larger account histories rather than parse an archive tree full of PDF/OCR files. (FWIW: even this old version of Quicken lets me attach scans of receipts to entries.)
Second Question: In what cases is the Original Paper required as opposed to a scan? If you need to show an original statement, receipt or other document to prove some thing or get something approved, do you know when an electronic copy or reproduction is as acceptable as the original? I don't think this is an area with consistent clear cut answers yet because of its newness.
Let's take an admittedly unlikely example. You have a house but have moved to take a job out of state, and you're trying to sell the house. Some scumbag squatter moves in and tries submitting false documents to claim ownership. All the documents relating to purchase and any mortgages have been scanned and shredded. Will the courts, police, banks, city and county offices etc. give you any trouble because they are not signed originals? What if the scumbag claims you fabricated the documents (like he did) and his are the originals? What if some entities accept a scan and others don't?
I've implemented a hybrid system where different documents get scanned / destroyed at different times. I have a single card-file cabinet (filing cabinet with half-height drawers). Paper copies of everything from the current year and previous year are kept in a drawer. At the end of each year, I take all the documents from year-1, shred most of them (assuming any need for them has passed), and put the ones I deem most critical in a small box to archive.
Forget scanners. Use a smartphone (iPhone in your case, I suspect) to scan, OCR and convert to PDF.
Use a cardboard box with the sides cut open as a smartphone support.
Then transfer the files to your NAS server and organize them in a way that makes sense to you.
Any phone camera 5Mpixels+ is ok for this kind of job.
I use gscan2pdf http://gscan2pdf.sourceforge.net/ with my multifunction "printer" and then save the bills and documents in properly named and organized directories as pdf files. Simple as pie. (Why is pie simple?)
If your target OS is OS X, you may want to drop the "open source" part of your requirement list and take a good long look at DevonThink Office Pro. When used in conjunction with the Fujitsu Scansnap (though one isn't required per-se) it's a seriously beautiful thing. You can set up a blindingly fast workflow in just minutes.
The hierarchy it creates is easily exportable as nested directories of bagged-and-tagged PDFs, should you ever need to jump ship to another system.
I use my iPhone to scan, convert to PDF and upload to my Dropbox. The app cost me $6.
Dropbox will always be there and is backed up
I used to periodically sort things into hanging folders then dispose of anything nonessential after 3-4 years. A few years back I decided to switch to scanning. So I started collecting a pile of stuff to scan. In the intervening years, that pile has grown and grown and now the scanning would be such a big chore that I don't even like to contemplate it anymore.
The simple fact is that most documents are not something you will ever need again so deserve the minimal effort you can put towards temporary mid-term storage and worth 0 effort for archiving. Others may disagree but I suspect I already keep too much for too long. To be honest, there's not really been much that would have been an issue if I didn't shred immediately after reading.
I have never used them, but reading their descriptions, they seem to fit your needs: https://github.com/jflesch/paperwork and http://code.google.com/p/malodos/
Zotero (an extension of firefox, also stand alone I believe) works well for me to archive lots of PDFs. It has tags and directories, meta information, search, notes etc.. Once you got your pdfs Zotero is a good organizer.
Please do post the script. Throw it up on pastebin, or, better yet, https://gist.github.com./
Colin Dean Go a year without DRM
"only retain the papers for things that don't require the original copies"
That's a disaster waiting to happen; I think it would be a lot safer to retain the papers that do require originals instead.
Until..you need a niche product like you are describing. Then spend the $200 bucks for a dedicated solution that actually saves time and requires no modification.
Buy NeatReceipts and be done with it.
Onsite and offsite backups, labeling, searching, categorizing and OCR.
The amount of time I save with a dedicated solution is worth the up front cost.
just an anonymous coward's 2 cents
I never got a chance to play with this, but the features sound nice - and they claim to work with Apple. http://silkwoodsoftware.com/
I use the community edition of Alfresco for that task.
You can tag all documents, add custom fields and have full text search and versioning out of the box. Documents can be accessed via web interface, smb, ftp and even imap.
Alfresco is great and all, but it seems a bit heavy for a home use scenario and it doesn't handle or automate the scanning tagging aspect of things either. He's looking for a lot more than a DMS.
I use Growly Notes which is a note-taking application that's freeware on Mac OS X. It isn't Open Source - the developer is an ex-Microsoft Word developer that has written a number of free Mac OS X applications. It is essentially a Microsoft One-Note knockoff that runs on Mac OS X.
Documents (photos, PDFs, video clips, text, audio clips, etc) are organized by Notebook at the highest level and you can have many notebooks with various topics. These go across the top of the page as tabs. There's a left sidebar with a section and then notes. Notes go under sections. So in my bills notebook, I have sections with Verizon, Electric, Gas, Visa, etc. bills. In the notes, I put in a year for the bills and I just drag photos of bills or PDFs into the note. This system works quite well for me. It's local and I back it up to an off-site drive but nothing on the network.
The downside of doing this is that the project is dependent on the one developer and on Mac OS X, which some may not like. It is relatively easy to export notebooks, sections or notes to PDF format, so I could get everything out at some point in the future in a convenient format, but it would take a little work to do that. My alternative is to use Microsoft OneNote, which is something that I'd have to pay for (I use Growly Notes for my work projects too, which would mean business licenses for OneNote on multiple machines would be needed along with the prices for updates). That's my backup plan if Growly Notes goes south in the future.
I haven't been able to find anything in the Open Source world comparable to Growly Notes and OneNote. I've seen many projects started but nothing that I would consider Commercial Quality. Evernote does this sort of thing but it stores your data in the cloud. They had a security breach recently, which confirms that I don't want to put business work product or confidential information on the cloud.
Any open source library management software that does ebooks should help you out. Here's a list:
http://sourceforge.net/directory/home-education/library/opac/os:windows/freshness:recently-updated/
Necessity is the plea for every infringement of human freedom. It is the argument of tyrants; it is the creed of slaves.
(Longtime listener; first-time caller. If I'm doing it wrong, please be kind.)
I've been going through the same issue and have painstakingly scanned/filed a metric crapton of old documents, putting them in a hierarchical directory structure where I can find them if I need them.
But this sucks for a number of obvious reasons. The ones that bother me the most are:
1) A scanned document is larger (*and* less useful) than a downloaded PDF.
2) It's a manual process! I'd rather spend ten hours automating something than five hours over the next 5 years trying to remember the filename convention for storing the scanned document.
Anyway, cutting to the chase, I'm now using Ruby/Watir scripts to automate the business of downloading my most common phone/utility bills from the websites and stashing them directly. I used to use Perl and WWW::Mechanize, but all the websites are now so contaminated with unnecessary JavaScript that only something which manipulates a browser directly allows automation without pulling my hair out. Ruby/Watir works pretty well. Recommend. Automated download: priceless. Without automated download, I'd rather return to scanning paper documents mailed to me; otherwise you quickly find how unreliable your service provider is for retaining your statements.
If anybody wants some pre-alpha scripts for grabbing their pg&e, comcast, cigna, at&t, schwab, nvenergy statements, let me know.
Super simple scanning system using Linux.
Make directory called scans, make another called taxes
Have a text file of scanning hints with an easy to remember name.
in a terminal, print the scanning hints file and use the Linux mouse copy feature to construct a scan instruction
The scanimage application requires sudo or you can find a tweak using google search to alter the scanner's USB files and make it run from an unprivileged user.
cd scans
cat filewitheasycommandstocopy.txt
Typical contents of my hint file:
sudo scanimage -l 0mm -x 90mm -y 66mm --resolution 400 | pnmtojpeg >cprcard.jpg
# make files non-overwritable
# chmod -w ~/scans/*.jpg
Verify each scan with eog viewer.
Organize scans like this:
Make long filenames with agencynames, recipientnames, and documentnames all in lower case.
use the mouse to copy an old file name for re-use.
this groups similar documents together.
use ls -ltr to list by modification time, so the most recently scanned items appear last.
use ls -lr *keyword*.jpg to show selected classes of scanned items.
use locate in the distant future to find those oddball items like certificates or letters of recommendation.
locate certificate | grep rabies
The Pro version because it does OCR, and a few other things, including archive and index your email. Plus, it interfaces VERY well with Fujitsu SCANsnap scanners.
I've been using it a few years, and it works a treat.
Here are some pieces of a scan to ocr script I am developing.
First I am scanning a multicolumn document and to preserve the sense of the document text, I scan even pages twice and odd pages twice.
Second, the scanned images must be rotated. Pieces of the "convert" command appear in the perl fragments here.
Third, I am using the open source tesseract OCR program. Some of my documents have grayed areas that contain text. So I am running tesseract twice on the source files and picking the output file with the most text characters.
Fourth, the basic program is just a big loop with a menu where I input file names or page numbers.
Here goes:
# my $scanprog = "/usr/bin/scanimage --resolution 400 >";# print "$scanprog \n";
# Scanner settings for pages top of book at left of scanner StylusScan 2500
my $scanoddleft = "/usr/bin/scanimage -l 30mm -x 190mm -y 235mm --resolution 400 >";#for odd pages
my $scanoddright = "/usr/bin/scanimage -l 0mm -x 190mm -y 235mm --resolution 400 >";#for odd pages
my $scanevenleft = "/usr/bin/scanimage -l 30mm -x 190mm -y 235mm --resolution 400 >";#for even pages
my $scanevenright = "/usr/bin/scanimage -l 0mm -x 190mm -y 235mm --resolution 400 >";#for even pages
# OCR commands and parameters
#tesseract test1.tif test1 -l eng;
#scanimage -l 26mm -x 166mm -t 10mm -y 125mm --brightness 3 --resolution 400 | pnmtotiff>test1.tif;eog test1.tif;convert -rotate 90 test1.tif test1.tif; eog test1.tif; tesseract test1.tif test1 -l eng
my $tesseract = " tesseract ";
my $language = " -l eng ";
my $brightness2 = " --brightness 2 ";
my $brightness3 = " --brightness 3 ";
my $convert90 = " convert -rotate 90 ";
my $eog = " eog " ;
my $charcount = " wc -c " ;
my $scanpage = 1; # Range is 1 to 183
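The "run tesseract twice and keep the output with the most characters" step described above is not shown in the fragments, so here is a minimal sketch of that selection logic. It is written in Python rather than the poster's Perl, the function name is hypothetical, and it assumes both OCR passes have already written their text files; it compares byte counts, which is what `wc -c` reports.

```python
from pathlib import Path

def best_ocr_output(text_files):
    """Return the path of the largest file in bytes, like comparing `wc -c`.

    Assumes each path is the text output of one OCR pass over the same page.
    On a tie, the first path given wins.
    """
    candidates = [Path(p) for p in text_files]
    if not candidates:
        raise ValueError("no OCR output files given")
    return max(candidates, key=lambda p: p.stat().st_size)
```

More characters is only a rough proxy for a better read, since a noisy pass can also produce extra garbage, but it matches the heuristic the poster describes.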
Isn't this just about the classic use case for Evernote? About the only criterion it doesn't hit is saving locally to a NAS (although I admit that might be an important one for this specific user).
Evernote's secure enough for most purposes; it does a particularly good job of being able to search text that's been scanned; it operates on just about any OS out there; with certain makes/models of scanners, you can scan direct to your Evernote account; if you're caught short, you can "scan" from your phone or tablet and send to Evernote; you can tag your stuff; you can reprint your stuff; setup has to be easier than any homebuilt system; you can access your stuff from just about anywhere; cost is free to minimal; lots of people are using it so it's presumably pretty robust and reliable (I've had no problems in this respect)...
No, I'm not an Evernote employee/shareholder/fanboi, just a very satisfied user... OK, maybe I'm a fanboi
I started developing XODA (xoda.org, sf.net/projects/xoda) some years ago for the same needs. Now it does the job for me even though some more features were implemented as users started requesting them. You may want to give it a try. :)
Disclaimer: as already mentioned, I am the developer.
xoda.org
GScanToPDF can do OCR and embed the results as annotations within the PDF. Perhaps that would help with searchability. It works well enough with a lot of my documents; though it is far from perfect, it is good enough for those purposes, especially for bills, as they are not handwritten. Best results are on scans set to line art/b&w rather than grey scale or colour.
I've been doing this for a while now. Like others here, I have a Fujitsu Scansnap 1500- it's one of the best investments I've made for cleaning up my office/workflow.
When something comes in, I immediately scan it to the filesystem. My structure is:
2013/Banking/BankName/2013-01-31-14h32.pdf (or something like that- it's the default Scansnap filename.)
I then place the original in a filebox- keeping one filebox for each year. No sorting, organizing, just keeping originals.
At the end of each year, the filebox goes to the crawlspace, and I start a new one. After 7 years, the intention is to get the box securely shredded (costs about $10/box around here.)
I back the filesystem up nightly to two separate local NASs, and upload the whole filesystem (as a series of encrypted files) to Amazon Glacier (this is a recent addition to my workflow- has stopped me worrying about a fire etc. wiping out both NASs).
All of my documents go in there- it's really easy to find stuff (depending on how good your folder organization is- you can add depth for those kinds of documents that need it, while other ones that aren't likely to be needed can be put in a less descriptive folder hierarchy.)
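For anyone whose scanner doesn't name files this way by default, a filing layout like the one above (year/category/source/timestamp.pdf) can be generated with a few lines of Python. This is only a sketch: the function name and parameters are my own, and the timestamp format is copied from the example filename in the post.

```python
from datetime import datetime
from pathlib import PurePosixPath


def archive_path(category, source, scanned_at, root=""):
    """Build <root>/<year>/<category>/<source>/<YYYY-MM-DD-HHhMM>.pdf."""
    filename = scanned_at.strftime("%Y-%m-%d-%Hh%M") + ".pdf"
    return PurePosixPath(root, str(scanned_at.year), category, source, filename)


# Example with the values from the post above
print(archive_path("Banking", "BankName", datetime(2013, 1, 31, 14, 32)))
```

Passing a `root` such as the NAS mount point (whatever yours is) prefixes the whole path; leaving it empty gives a relative path.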
I just scanned everything to PDF and then just use nested folders.
bills/bank/
bills/credit/
bills/utilities/
expenses/abc inc/
expenses/xyz inc/
house/123 mayberry street/purchase documents/
house/123 mayberry street/rental expenses/
taxes/2011/
taxes/2012/
For incoming mail, I use Earth Class Mail. They scan the mail for me and turn it into a PDF. It's not particularly cheap (although it used to be), but then again I never have to touch the scanner again, and I have access to my snail mail from anywhere.
VirtualPostMail.com will scan all your mail for you. Don't bother scanning your current documents, but put them in a file cabinet. In 5 years, they will be old enough to throw out anyway. I mean, is it really important for you to spend your time scanning your electricity bills from 2 years ago?
For what it may be worth, I have an all-in-one HP printer/scanner (model in subject). It's reasonably cheap, it is a good printer, and it has a double-sided scanner with auto-feeder which works really well.
I've scanned thousands of sheets with it recently (for archiving before shredding), and I would never have even tried it without the automatic scanning.
Disclaimer: not an HP employee, have no HP stock...
"I have been wondering whether it would be worth open sourcing the script...."
Please do. Unless you deem it worthwhile to spiffy it up and try to make some moola, I think it'd be great to share your script. It could be useful to some, could be instructive to those wanting to learn, to see how someone else has done something; any possible embarrassment you might feel about it being 'a bit hacky' you might could toss off to 'having character'. Heck, after your description I'd like to see it, even tho I haven't done any real coding in years.
Seems to me putting the OCR text in the comment field is a fine and good thing. An obvious thing to some, perhaps, but an elegant usage to me.
@Rinisari, below - the link you gave throws a cert warning in Opera, could just be my settings.
http://blog.evernote.com/blog/2012/08/14/the-30-day-paperless-challenge-with-ambassador-jamie-todd-rubin/
I'm a big advocate of Evernote. It's cross-platform, with OCR, which means you can search your scans. There are even a few scanners on the market that scan directly into Evernote. Then you just add tags, labels, etc. Couldn't be simpler!
This is so easy, I've been doing it since primary school. Mark your files with date YYYY-MM-DD-name. Put them in folders 1993, 1999, 2005, 2010, 2013 (e.g.). Profit! In modern times I have used a Lexmark X560 all-in-one office machine to scan everything to a designated network drive. Works like a charm every time. I apply the backup policy of equal drives. So I buy a 3TB drive, and I buy another for off-site backups. Once every two months or so, I freshen the backup and verify it. If you need to do it a couple of notches more professionally, apply the same backup policy, but use a document retention system, and store it in a global, industry-standard format: PDF. There are many good document repositories that are free (as in beer).
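The "freshen the backup and verify it" step can be as simple as comparing checksums between the working drive and the backup drive. A minimal sketch in Python, assuming both drives are mounted as ordinary directories (the function names are my own, not from any particular backup tool):

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large scans don't fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(source_dir, backup_dir):
    """Return relative paths that are missing from, or differ in, the backup."""
    source_dir, backup_dir = Path(source_dir), Path(backup_dir)
    problems = []
    for src in sorted(source_dir.rglob("*")):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        dst = backup_dir / rel
        if not dst.is_file() or file_digest(src) != file_digest(dst):
            problems.append(rel.as_posix())
    return problems
```

An empty list means every file on the source drive has an identical copy on the backup. It only checks in one direction, so files that exist only on the backup are not reported.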
>there won't be a reason to change to format for approximately 7,895 years (but who is counting, really).
I'm kicking myself for not having caught this earlier!
Thanks (no sarc) for alerting us to the Y10E4 problem.
Question: Is Linux Y10E4 ready?
I'm not a lawyer, but I play one on the Internet. Blog
Something like this site:
http://www.managebills.net/bills/web/user/athome
DISCLAIMER: I created this project, but have not yet launched it.
You can track your bills/documents online, either individually or as a group, with reports and reminders, etc.
And you can always download them anytime.
user: test
password: test
And done :)
https://github.com/ZivaVatra/SDAT
Figured I would take the opportunity to try out GIT (have not bothered so far).
Also, it seems that I have recently made it actually save to tagged PNGs rather than TIFFs. Forgot about that :)
Hope it turns out to be useful to you. Let me know if you want commit privs for any fixes you do. Happy Hacking!
As requested, I have put it on github now:
https://github.com/ZivaVatra/SDAT
Hope it is useful to you, or at least interesting :)
I don't think it is worth selling the script, it isn't that fancy, not to mention that then I would be on the hook for supporting it. :(
Since the recession hit I've had to work 2 jobs, and I really don't have much time to devote to personal nerdy pursuits. Barely have time to sleep as it is
All I can do is publish and hope it helps others. That little script has made my life a lot easier and less cluttered. With any luck others with more time will improve on it and we all benefit :)
And it isn't your settings. I also get an invalid certificate on Firefox and Chromium. Something is up with the link, so I just went ahead and used the actual github.com site to host it.
For maintaining a high-integrity archive of documents, try Boar. It can even version control huge documents, like movies and photos. http://www.boarvcs.org/
I've used software called Paperless from Mariner Software (https://www.marinersoftware.com/products/paperless/). I only use it for personal record storage, but it might fit the bill for a small business too. It supports document acquisition from several sources, e.g. scanner, drag-and-drop from the file system, print to PDF, etc. I haven't used it, but it supposedly does some OCR. It supports categories and sub-categories as well as tags. At a fundamental level, I think it is just a database to capture the metadata you create and a front end to individual files stored in the file system.
Try out Alfresco. It is a nice document management system if you are familiar with IT systems.
I know it's probably different here in the UK where most people don't even need to do a tax return, but basically the really important stuff like house deeds (and wills) are in the hands of a solicitor anyway, and I simply don't need to keep copies of 2 year old bank statements or 3 year old electricity bills on the off chance I might need to refer to them.
If you have a business, fair enough, you legally need to keep financial stuff for 6 years, but then off-site archiving is just an insignificant business cost.
To have a right to do a thing is not at all the same as to be right in doing it
You would think that in this day and age someone could invent an OCR system that has a basic understanding of the documents being scanned. It would automatically name files "Bank of Elbonia Current Account Statement 03/2013" and allow you to do things like search all statements for transactions over £250.
In fact the bank could just include a big QR code on the back with all the data in it, but I suppose that is asking too much.
const int one = 65536; (Silvermoon, Texture.cs)
SJW, n: "Someone I don't like, and by the way I'm a fuckwit" - AC
Thank you. I'm enjoying looking at the script, am wondering where I can get a scanner I can afford, and had a look at your site as well. And thanks for the readme also. Now get some sleep. (grin) I've had two jobs at times, when I was younger, and it's decidedly not only no fun, but done too long a recipe for ungood juju. Good luck to you.
There is a very good OS X application called doo which will do exactly what you want and it's completely free. Check it out at http://doo.net/ and you have it on Mac AppStore.
There is ideology, and then there's get it done.
Open source is built on the idea of standing on the shoulders of other people's good code, building it up from good to better.
Free is just a nice to have feature in the end. If you place Free above everything else you might as well not bother.
After failing many times to "Cut the PaperMill cord"
I stumbled upon Filecenter by Lucion, it taught me two things real fast:
1. TWAIN may be an old stick-in-the-mud protocol, and it costs more to find a scanner that supports it, but this is what Opensource really should be: Open Protocol.. it lets you use software from other sources to do things in a nice, predictable, reliable way.
2. PDF is a good archival format. With "Archival" PDF and its support for many other features like full text indexing within the document, it's way above and ahead of any other archival format.
Which leads me to FileCenter for Windows - I wish it were for Mac OSX or Linux, but nope. The paradigm is there, it could be, but it relies on a good PDF engine from another source.. aka "Libraries" - and best of all it knows how to treat a PDF like a stack of pages. You can search across documents and build indexes on "File Cabinets" which are merely Symlinks within the program for different existing File System Folders. And it doesn't "lock in" documents into a repository. You can go behind the scenes and still move your documents around.. then reindex if you want from within the program.
Its simplicity and usefulness are unparalleled to me.
Yeah it has the old-fashioned Windows 2010 Ribbon Menu.. but really that's become more intuitive to me over the years since it's pervasive.
It's got a "stop fighting the way of the world and make the best of it" theme that runs throughout the software and really works.
One thing is there are two versions, Standard and Pro. Standard only nets you most of the scanning and navigating features. Pro nets Standard plus the treat your PDF documents like stacks of paper features.. which are more important than you might think.
They have like a 30 day trial.. I'm a customer not a sales person.. so I merely suggest you try it. Or watch the videos.. it's a paperless approach that works.
Opensource tends to "borrow" ideas from commercial software (putting it nicely) "Reinventing the wheel so to speak" only problem is "no support, few updates and comes with feature 'attitude-itis'.." which means "He who codes shall rule with Absolute Power!!" - sure you can go anywhere else.. but really.. a market is a market.. if the product doesn't suit.. do go elsewhere.
ScanSnap is considered a cheap scanner precisely because it does not support TWAIN.
All of the Fujitsu fi series support TWAIN, which has the device driver.
ScanSnap will never have a TWAIN driver because it is designed as a direct to consumer minimal feature product.
You're better off buying a used fi-series from eBay or an ADF only version from NewEgg or B&H Video.
Expect it to cost anywhere from a few hundred on eBay to $900 new.. you can find good deals.
ScanSnap is just too cheap: fair hardware, but the driver development is what costs money, and a driver is what it doesn't have; they save you money by not developing the driver you need.
I mentioned FileCenter elsewhere in this thread.. by Lucion.. it can work with a pre-scanned by ScanSnap "pile of documents".. you might look at that for your workflow. FileCenter is a really nice tool. But ultimately you want a good scanner with a good driver to go along with good software. I'll admit ScanSnap had me going for a while.. until it got down to brass tacks.. and its lack of real driver support was a deal breaker.. I never went down the ScanSnap path.. and feel I dodged a bullet. Look on YouTube, it's compared to lots of other scanners.. good hardware.. terrible driver.. and that limits the software you can use with it.
The image "recognition" feature you recall is "Separators", which are special "Cover sheets" you insert into a stack of things to be scanned.
They are printed from the software ahead of time and used to automatically "route" the scanned documents that follow to a particular destination or mark them for special treatment. After you print them they are reusable. FileCenter does have that. It also has bookmarking, OCR and a slew of other features. http://www.lucion.com/filecenter-features.html
One of the best things I like though is the ability to "browse" a PDF or set of PDFs like a stack of papers, tear them apart and re-order them or make them into separate piles of documents. It's all the benefits of a desktop paradigm without the paper. The concept reminds me of the "Bumptop" desktop OS from a long time ago. FileCenter also lets you select which PDF renderer you use to look at things inside the program, so it has one built in, can use Adobe reader if you have that installed, or the operating system default. And it does the same with OCR engines: use the one built in, or use one that's third party installed (if it lets you, some lock out external programs from using their OCR engine) or use the one that comes with the operating system.
This is not actually a real solution, I'm just amused every time I see document management show up in a slashdot thread - being it's not a very *exciting* field. I work for a company that provides document management solutions for much larger organizations, with costs (I think) starting in the tens of thousands of dollars, and going up to way more than that. Of course, since I work here, I have my own personal test repositories for testing things, and when I was buying a house and started getting crazy amounts of paper documents I had to sign, scan and send back, I was like, why not scan them with QuickFields and keep them in a Laserfiche repository? So I did. :D
That solution doesn't work for most people, though. (Also it's totally not open source. Source is only open to those who are developers at this company, which I am, so I suppose in a twisted way...)
Use Plone CMS. Scan your docs, upload w/ WebDAV. Tag via categories. Now that I've thought of this, I think I may go that route myself.. ;)
OpenERP allows attachments to accounting records and is simpler to figure out than Adempiere.
OpenERP seems to have a more vibrant development community.
(and Alfresco is another option - although it is also heavyweight.)
Can anyone tell if you can train tesseract to be a bit better at recognizing a specific font?
I'm using the Debian version but if you have a 300 dpi scan the OCR is often gobbledygook.
(Yes, "use the source Luke" is also a valid answer in this case...)
To be, or not to be: isn't that quite logical, Slashdot Beta?
Hi Rinisari,
I work at doo (http://doo.net) and immediately thought of our app when I saw your problem. All of the filing systems listed so far really make sense, and I personally learned a few things from some of the workflows suggested. But using doo would greatly simplify your entire document management.
When you set up doo you select the documents and folders that you wish to connect: not just those imported from connected scanners, but Dropbox, GDrive, the local HD, email, etc. Once documents are connected, the app indexes them and runs OCR automatically. Then it allows you to back everything up and sync it to the cloud for backup and access on other devices.
And it’s all in one spot. Handling the “task of document management” is precisely what we do.
Some details ...
If your document(s) is already digital – whether it’s a scan on your HD, a Google Drive document, an email attachment, in Dropbox, etc – you can connect the source or just the individual folder/document to doo.
If your document is still paper, you can scan it directly into doo using the app’s interface if the scanner has a TWAIN driver (http://is.gd/15O26Q). If it doesn’t have a TWAIN, we have a guide for quickly setting up top brands, like Fujitsu ScanSnap, Canon and Doxie (http://docs.doo.net/scanguide.pdf). I also just wrote a blog post about it if you want to take a look: https://blog.doo.net/2013/04/04/how-to-scan-with-doo.html.
The coolest thing about doo is that it does the tedious, difficult part of document management for you: the aforementioned automatic indexing and OCR occur right when you connect the document or folder in question, which means document search-and-retrieval is a cinch. After that, should you wish to add a degree of personalization, you can alter existing tags or add individual labels in the app’s intuitive UI.
doo is currently available for OS X, Windows 8 and will be coming very soon for Android and iOS (https://doo.net/en/download.html). Hope that helps! Send us a ticket at support@doo.net if you have any questions.
I've been doing document management for my personal and financial stuff for over 30 years. It's quite simple and I found out my dad
has done it the same way though neither of us shared much in this area over the years. If you have taken accounting or seen business
processes then you are already there.
1. File for the current year, in folders by expense, bank/income statements any other pertinent data.
2. End of year when doing taxes summarize, toss, and clean. Keep most documents grouped in a yearly box/organizer for tax records.
I keep a summary also in odt and xls spreadsheet format similar storage organized by year in electronic format with backup.
3. After five years or tax requirement expired keep only summary of year and burn/shred physical documents.
4. Keep a separate folder for the most important stuff: deeds, titles, others. I used to use a safety deposit box. Just make sure that stuff
can be quickly obtained in case of emergency and is safe.
Only one difference in the last few years. Do step 1. by entering into a database via my own MyJSQLView program. Makes Step 2. simple
at the end of the year for the summary, and I can search and sort the data anytime. The database is two tables, expenses and income. Scanning everything
seems like a pain in the butt since most of it is going to get tossed. What you really want is summary data like what have I been spending
on insurance the last ten years?
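A two-table setup like that is easy to try out. Here is a rough sketch using Python's built-in sqlite3 as a stand-in for the poster's own database and MyJSQLView front end; the column names and the helper functions are guesses for illustration, not the actual schema.

```python
import sqlite3


def open_ledger(path=":memory:"):
    """Open (or create) a ledger with the two tables: expenses and income."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS expenses (
            paid_on  TEXT NOT NULL,   -- ISO date, e.g. 2013-04-05
            category TEXT NOT NULL,   -- e.g. insurance, utilities
            payee    TEXT,
            amount   REAL NOT NULL
        );
        CREATE TABLE IF NOT EXISTS income (
            received_on TEXT NOT NULL,
            source      TEXT NOT NULL,
            amount      REAL NOT NULL
        );
        """
    )
    return conn


def yearly_totals(conn, category):
    """Total spent per year in one category, oldest year first."""
    rows = conn.execute(
        "SELECT substr(paid_on, 1, 4) AS year, SUM(amount) "
        "FROM expenses WHERE category = ? GROUP BY year ORDER BY year",
        (category,),
    )
    return [(year, total) for year, total in rows]
```

With bills entered as they arrive, `yearly_totals(conn, "insurance")` answers the "what have I been spending on insurance" question directly, one row per year.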
Keep It Simple Software (engineer) or whatever...
I went with the simplest possible solution. One that also allows me to recover even if a "database" becomes corrupted or obsolete, because all the "real" data is contained in the documents themselves.
I just scan to PDF and add tags in the Keywords field of the PDF metadata. For the keywords, I use unique words that aren't going to show up in an actual document. (Just tacking on a prefix or surrounding each keyword in brackets is good enough.) I also organize the files in a decent (but not too detailed) directory structure. (You can use any high-tech storage system you like. I just use a regular hard drive.) Then I installed the PDF iFilter so the Windows Indexing service could index the files, including that metadata (There are many. Google is your friend.) So, now, if I want to find all the tax files, say, that are related to my farm, for instance (totally made up example), I would just navigate to the directory that holds all my tax documents, then do a basic Windows search for [farm] and there are all my documents. No database to manage or learn how to use. Just the files and their metadata.
There are utilities that allow you to easily select a group of .PDF files and tag them all with the same keywords. I'm sure you can find one for any OS. And the beauty is: Once the file is tagged with the keyword, it doesn't matter if you just throw away the program you used to set that keyword, because the keyword is just a normal part of that .PDF file.
Because the keywords are standard PDF metadata, any OS should be able to read and index on them. If not, then you could find some program that would, I am sure. Again, the beauty of this system is: if you lose access to that indexing system, or move your files to a different platform, all you gotta do is reindex the metadata that is right there in the files. As long as you have your files, you have your keywords.
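If you would rather script the tagging than hunt for a utility, the bracketed-keyword idea can be sketched in Python. The post doesn't name a tool, so everything here is an assumption: the function names are made up, and `tag_pdf` relies on the third-party pypdf package being installed. Note that this sketch writes a fresh copy with only the Keywords field set, so other metadata (title, author) is not carried over.

```python
def format_keywords(tags):
    """Turn ['Farm', 'taxes'] into '[farm] [taxes]' for the Keywords field."""
    return " ".join(f"[{tag.strip().lower()}]" for tag in tags if tag.strip())


def has_keyword(keywords_field, tag):
    """True if the bracketed tag appears in a Keywords string."""
    return f"[{tag.strip().lower()}]" in (keywords_field or "").lower()


def tag_pdf(src_path, dst_path, tags):
    """Copy a PDF's pages to dst_path and set its Keywords metadata.

    Assumes the third-party pypdf package (not mentioned in the post).
    """
    from pypdf import PdfReader, PdfWriter  # imported lazily: optional dependency

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata({"/Keywords": format_keywords(tags)})
    with open(dst_path, "wb") as out:
        writer.write(out)
```

The brackets are what keep a search for `[farm]` from matching every document that merely mentions a farm, which is the point of using "unique words" as described above.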
In a similar vein to http://ask.slashdot.org/comments.pl?sid=3623835&cid=43389299, I would say: yes, please post to pastebin or github or something (maybe even your own Slashdot journal); if you GPL it, someone might even do some fine-tuning for you.
404555974007725459910684486621289147856453481154 in hex is "You sank my Battleship?"
[GPG key in journal]
Here's an online service that does the same thing:
https://filethis.com/fetch/
The first 6 accounts/month are free. You can download to a local folder, Dropbox, or even Evernote
Dropbox will always be there
Oh, my sweet summer child...
Glad you like it! As for scanners, I bought my first one secondhand on amazon for $45. It served me throughout Uni for the next 5 years, until I got a new one (it came as a 2-for-1 deal with a new printer). If that is beyond your budget I have seen them be thrown/given away (especially old parallel port ones), and a lot of them work well with Linux.
If Linux support is a must, have a look at: http://www.sane-project.org/sane-supported-devices.html
Yeah.. My site, like the rest of my non work life, is out of date and broken (the counter no longer increments, and it doesn't load properly, breaking the site). I am in the process of rewriting the backend, but time is short.
Thanks a lot, hopefully times will get better soon, and then I can devote some proper time to personal projects again :)
Good luck with your attempts to find a scanner as well!
OpenDocMan is what I use at work. If you can set up LAMP or MAMP or WAMP, the rest is pretty easy. It runs as a web service so OS shouldn't be a problem.
http://www.opendocman.com/
Thanks for the tips. I'll be glad when I'm able to stand for more than a few minutes without passing out - then I can take a bus to the re-sale shops (our city fathers in a burst of concern for bargain hunters - many of them very low income and living in and around down town - moved all those shops to out-lying areas, to better serve the citizens' needs), else they'd all be within crutching distance.
I've never tried the extent of external connectivity, but if an XP vm can talk to my printer, it should maybe talk to a scanner as well, so either way it oughta be OK.
Well, even as is, I enjoyed your site, found some interesting things to read. Yeah, we just do what we can, as spirit moves and wallet enables. And the meat bag cooperates, of course.
If you use one of Google's products, you should be good. NEVER go Apple though. If you want something open-source, avoid anything and everything Apple as much as you can. Your thoughtful friend, JBJblaze