How To Manage Hundreds of Thousands of Documents?
ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the net use command to map the remote Samba shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"
I happen to have written one:
http://sourceforge.net/projects/docdb-v/
could be what you are looking for. Of course, it'll take effort to catalog the documents.
I worked at a place that used FileNet, which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it, in that whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and even then their support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage doing it for UNIX/Oracle because it screamed "poorly ported" when we used it for Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.
This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end and enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."
Forgive my spelling from time to time. I'm often posting during short breaks.
Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.
No less proprietary than other similar systems. Getting files in/out of Sharepoint is a fairly trivial process, and the API is open enough to craft your own migration plan if you ever decide to move away from it, given that everything else is as proprietary as Sharepoint, or even more so.
MS Office might be proprietary, but is so widespread that it's a 'standard' in its own right -- Sharepoint integrates excellently with Office, and keeps your users happy.
I'm typically not one to advocate the use of Microsoft products. However, Sharepoint worked just fine when I was using it, and is definitely a huge step up from any of the competing products at the same price-level.
-- If you try to fail and succeed, which have you done? - Uli's moose
Very true. I'd take a look at DSpace or Open Library for examples of software designed to handle gigantic numbers of documents and maintain sensible indexes for them.
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Now I actually LOL'd on that one!
Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?
I won't be holding my breath for either of those two things to happen...
You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out of the box, just make each document a story and suggest better names in the comments.)
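For the renaming itself, here is a rough sketch of what a mechanical first pass could look like. The convention it applies (lowercase, underscores, "&" spelled out as "and") is only an assumption for illustration, not something anyone in this thread proposed, and the function names normalize_name and propose_renames are made up. It only proposes new names; it never touches the disk.

```python
import os
import re


def normalize_name(name: str) -> str:
    """Return a lowercase, underscore-separated version of a file name.

    Assumed convention (hypothetical): "&" becomes "and", every run of
    characters other than ASCII letters and digits becomes one underscore,
    and the extension is lowercased. Non-ASCII letters are dropped too.
    """
    stem, ext = os.path.splitext(name)
    stem = stem.replace("&", " and ")
    stem = re.sub(r"[^A-Za-z0-9]+", "_", stem).strip("_").lower()
    # Fall back to a placeholder if nothing usable is left of the stem.
    return (stem or "unnamed") + ext.lower()


def propose_renames(root: str):
    """Yield (old_path, new_path) pairs for files whose name would change.

    This is a dry run: nothing is renamed, and name collisions are not
    detected, so a human should review the list first.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            new_name = normalize_name(filename)
            if new_name != filename:
                yield (os.path.join(dirpath, filename),
                       os.path.join(dirpath, new_name))
```

Something like normalize_name("Wing-Spar & RIB Drawing.PDF") would come back as "wing_spar_and_rib_drawing.pdf". The output of propose_renames could feed the suggested names that people then vote on or correct, rather than being applied blindly.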
Nothing for 6-digit uids?