Slashdot Mirror


How To Manage Hundreds of Thousands of Documents?

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"

8 of 438 comments

  1. Use a cataloging system by vondo · · Score: 4, Interesting

    I happen to have written one:

    http://sourceforge.net/projects/docdb-v/

    could be what you are looking for. Of course, it'll take effort to catalog the documents.

  2. FileNet by Ohio+Calvinist · · Score: 4, Interesting

    I worked at a place that used FileNet, which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it, in that whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and at that, their support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage doing it for UNIX/Oracle because it screamed "poorly ported" when we used it for Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.

    This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end and enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."

    --
    Forgive my spelling from time to time. I'm often posting during short breaks.
  3. Re:Alfresco or SharePoint by flydpnkrtn · · Score: 3, Interesting

    ...and I found an article backing up Alfresco pretty well:

    "You can now stand up an Alfresco Labs server next to a SharePoint Server, and Office will not be able to tell the difference between the two," said John Newton, CTO of Alfresco. "But we are offering considerably more scale than SharePoint can deliver," he said.

  4. Re:SharePoint? by moosesocks · · Score: 4, Interesting

    Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

    No less proprietary than other similar systems. Getting files in/out of Sharepoint is a fairly trivial process, and the API is open enough to craft your own migration plan if you ever decide to move away from it, given that everything else is equally (or even more) proprietary than Sharepoint.

    MS Office might be proprietary, but is so widespread that it's a 'standard' in its own right -- Sharepoint integrates excellently with Office, and keeps your users happy.

    I'm typically not one to advocate the use of Microsoft products. However, Sharepoint worked just fine when I was using it, and is definitely a huge step up from any of the competing products at the same price-level.

    --
    -- If you try to fail and succeed, which have you done? - Uli's moose
  5. WIKI by unum15 · · Score: 3, Interesting

    Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.

  6. Re:SharePoint? by pete-classic · · Score: 3, Interesting

    How does Sharepoint address his problem? It uses the exact same folder/file paradigm that is failing in his existing solution.

    -Peter

  7. Re:it's all about the index by jd · · Score: 4, Interesting

    Very true. I'd take a look at DSpace or Open Library for examples of software designed to handle gigantic numbers of documents and maintain sensible indexes for them.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  8. Re:Google to the rescue? by vrmlguy · · Score: 4, Interesting

    Now I actually LOL'd on that one!

    Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?

    I won't be holding my breath for either of those two things to happen...

    You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out of the box, just make each document a story and suggest better names in the comments.)
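
The winner rule above could be sketched roughly like this in Python. Everything here is an assumption for illustration: the function name `pick_weekly_winner`, the idea that each accepted rename is logged as one user name, and treating "once a month" as a 28-day cooldown. Meta-moderation is left out entirely.

```python
from collections import Counter
from datetime import date, timedelta


def pick_weekly_winner(renames, past_winners, week_end, cooldown_days=28):
    """Pick the top renamer of the week who has not won recently.

    renames: iterable of user names, one entry per accepted rename this week.
    past_winners: dict mapping user name -> date of that user's most recent win.
    Returns (user, rename_count), or None if nobody is eligible.
    """
    cutoff = week_end - timedelta(days=cooldown_days)
    # most_common() lists users from most to fewest renames.
    for user, count in Counter(renames).most_common():
        last_win = past_winners.get(user)
        # Eligible if they never won, or their last win is at least the cooldown ago.
        if last_win is None or last_win <= cutoff:
            return user, count
    return None
```

Called with this week's rename log and a record of earlier winners, it skips anyone still inside the cooldown and falls through to the next-highest renamer.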

    --
    Nothing for 6-digit uids?