Slashdot Mirror


How To Manage Hundreds of Thousands of Documents?

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"

11 of 438 comments

  1. Use a cataloging system by vondo · · Score: 4, Interesting

    I happen to have written one:

    http://sourceforge.net/projects/docdb-v/

    could be what you are looking for. Of course, it'll take effort to catalog the documents.

  2. Documentum by trondwn · · Score: 2, Interesting

    Use EMC's document solution, where all documents are kept in a central database with metadata that can describe the content. They can be accessed through a cached server from different sites.
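
    A minimal sketch of the "central database with metadata" idea, using Python's built-in sqlite3. This is not Documentum's API; the table layout, column names and helper functions are all hypothetical and only illustrate how documents can be found by what they are rather than by where they sit.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) a small metadata catalog. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS documents (
               id INTEGER PRIMARY KEY,
               location TEXT NOT NULL UNIQUE,  -- where the file actually lives
               title TEXT NOT NULL,
               project TEXT,
               doc_type TEXT
           )"""
    )
    return conn


def add_document(conn, location, title, project=None, doc_type=None):
    """Record one document and return its catalog id."""
    cur = conn.execute(
        "INSERT INTO documents (location, title, project, doc_type) "
        "VALUES (?, ?, ?, ?)",
        (location, title, project, doc_type),
    )
    conn.commit()
    return cur.lastrowid


def find_documents(conn, project=None, doc_type=None, title_contains=None):
    """Return (location, title) rows matching every filter that is given."""
    clauses, params = [], []
    if project is not None:
        clauses.append("project = ?")
        params.append(project)
    if doc_type is not None:
        clauses.append("doc_type = ?")
        params.append(doc_type)
    if title_contains is not None:
        # SQLite's LIKE is case-insensitive for ASCII text by default.
        clauses.append("title LIKE ?")
        params.append(f"%{title_contains}%")
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    sql = "SELECT location, title FROM documents" + where + " ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

    With something like this, the file name and directory stop mattering much: a query on project and document type returns the location regardless of how the file was named.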

  3. FileNet by Ohio+Calvinist · · Score: 4, Interesting

    I worked at a place that used FileNet, which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it, in that whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and even then, their support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage doing it for UNIX/Oracle because it screamed "poorly ported" when we used it for Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.

    This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end and enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."

    --
    Forgive my spelling from time to time. I'm often posting during short breaks.
  4. Re:Alfresco or SharePoint by flydpnkrtn · · Score: 3, Interesting

    ...and I found an article backing up Alfresco pretty well:

    "You can now stand up an Alfresco Labs server next to a SharePoint Server, and Office will not be able to tell the difference between the two," said John Newton, CTO of Alfresco. "But we are offering considerably more scale than SharePoint can deliver," he said.

  5. Alfresco of course! by thule · · Score: 2, Interesting

    It can scale extremely well. It is the backend to Adobe's acrobat.com website! So you know it can handle millions of documents if you need it to. Sharepoint requires MS SQL Server for searching documents. With Alfresco, that feature is built in.

    Sharepoint is teaming software and not really designed for large document repositories. Alfresco has a teaming interface (Alfresco Share) and a more generic document repository interface.

    Alfresco can expose the repository via FTP, SMB, WebDAV, and a web client interface.

  6. Re:SharePoint? by moosesocks · · Score: 4, Interesting

    Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

    No less proprietary than other similar systems. Getting files in/out of Sharepoint is a fairly trivial process, and the API is open enough to craft your own migration plan if you ever decide to move away from it, given that everything else is equally (or even more) proprietary than Sharepoint.

    MS Office might be proprietary, but is so widespread that it's a 'standard' in its own right -- Sharepoint integrates excellently with Office, and keeps your users happy.

    I'm typically not one to advocate the use of Microsoft products. However, Sharepoint worked just fine when I was using it, and is definitely a huge step up from any of the competing products at the same price-level.

    --
    -- If you try to fail and succeed, which have you done? - Uli's moose
  7. WIKI by unum15 · · Score: 3, Interesting

    Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.

  8. Re:SharePoint? by pete-classic · · Score: 3, Interesting

    How does Sharepoint address his problem? It uses the exact same folder/file paradigm that is failing in his existing solution.

    -Peter

  9. Re:it's all about the index by jd · · Score: 4, Interesting

    Very true. I'd take a look at DSpace or Open Library for examples of software designed to handle gigantic numbers of documents and maintain sensible indexes for them.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  10. Re:Google to the rescue? by vrmlguy · · Score: 4, Interesting

    Now I actually LOL'd on that one!

    Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?

    I won't be holding my breath for either of those two things to happen...

    You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out of the box, just make each document a story and suggest better names in the comments.)
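
    The leader board rule above (most renames wins, but nobody wins twice in a month) can be sketched in a few lines of Python. The function names, the input format and the alphabetical tie-break are assumptions made for illustration; the meta-moderation step is left out.

```python
from collections import Counter


def leaderboard(renames):
    """Rank users by accepted renames this week, highest first.

    `renames` holds one user name per accepted rename (hypothetical format).
    """
    return sorted(Counter(renames).items(), key=lambda kv: (-kv[1], kv[0]))


def weekly_winner(renames, recent_winners):
    """Pick this week's winner, skipping anyone who won within the last month.

    `recent_winners` lists the winners of the preceding weeks that still fall
    inside the one-month cooldown.
    """
    excluded = set(recent_winners)
    for user, _count in leaderboard(renames):
        if user not in excluded:
            return user  # first eligible user in ranking order
    return None  # nobody eligible this week
```

    Because the ranking sorts ties by name, the result is deterministic, which makes the board easier to audit when someone complains.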

    --
    Nothing for 6-digit uids?
  11. Re:Google to the rescue? by SlashWombat · · Score: 2, Interesting

    I agree - although you might want to eventually implement a systematic method of naming/storing your documents.

    While this seems like a good idea on the surface, it never seems to work very well. Even verbose file names seem to fail miserably, as the first 100 or so letters are always the same (e.g. "Project Tiger Sausage rocket module assembly - Ion injector hardware part 1 ...").

    Then there is the problem of getting all the employees to fully understand directory structures. Just look at your workmates' screens to see how many people save everything on their desktop. (Yes, really a Windows problem ... but so what.)

    I used to get a local WAN search engine, and let it index the entire site. Much more useful as it would find documents most people thought had disappeared years ago.
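
    To show why an index sidesteps the naming mess, here is a toy inverted index in Python. It is a sketch only, not how any particular search engine works: it assumes the text of each document has already been extracted into a dict of path to text, and the function names are made up.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric words."""
    return TOKEN.findall(text.lower())


def build_index(documents):
    """Map every word found in a path or its text to the set of paths.

    `documents` is a mapping of path -> already-extracted text (assumed input).
    """
    index = defaultdict(set)
    for path, text in documents.items():
        for word in tokenize(path) + tokenize(text):
            index[word].add(path)
    return index


def search(index, query):
    """Return the sorted paths that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

    Mixed casing, dashes and ampersands in file names disappear at the tokenizing step, so "PROJ/Tiger_Spec.DOC" and "misc/old-budget.xls" are both found by a search for "tiger" if either the name or the contents mention it.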

    Another approach would be to have a database that assigned file names for various projects and/or functions and mandate that this be the only way files are named for storage on the WAN. This, however, does not get around the thousands of files already stored in weird places using weird names! (Which is why an already indexed search engine works so well ... not only does it extract the file names, it also picks up searches on random (but significant) phrases within the scanned documents.) (I used to use "MAMMA", it worked a treat!) http://www.mamma.com/
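
    The "database that assigns file names" approach could look roughly like the Python sketch below, again with sqlite3. The name format (project-function-number.extension), the table layout and the function names are all assumptions for illustration, and it does nothing about the files that already exist.

```python
import sqlite3


def open_registry(path=":memory:"):
    """Open (or create) the name registry. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS counters (
               project TEXT NOT NULL,
               function TEXT NOT NULL,
               last_number INTEGER NOT NULL,
               PRIMARY KEY (project, function)
           )"""
    )
    return conn


def assign_name(conn, project, function, extension):
    """Hand out the next file name for a project/function pair."""
    project = project.strip().lower()
    function = function.strip().lower()
    extension = extension.lstrip(".").lower()
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO counters (project, function, last_number) "
            "VALUES (?, ?, 0)",
            (project, function),
        )
        conn.execute(
            "UPDATE counters SET last_number = last_number + 1 "
            "WHERE project = ? AND function = ?",
            (project, function),
        )
        (number,) = conn.execute(
            "SELECT last_number FROM counters WHERE project = ? AND function = ?",
            (project, function),
        ).fetchone()
    return f"{project}-{function}-{number:05d}.{extension}"
```

    Forcing everything to lower case and a fixed pattern removes the mixed-case and punctuation problem for new files, at the cost of names that mean little without the registry.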