Slashdot Mirror


How To Manage Hundreds of Thousands of Documents?

ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"

41 of 438 comments

  1. Google to the rescue? by Shatrat · · Score: 4, Insightful

    Isn't this the sort of thing that a google search appliance would be helpful for? Then you don't need to know the exact filename, just some specific information that can identify the file. This certainly solved my problem with having thousands of emails.
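
    As a rough illustration of why search removes the need to know the exact filename, here is a minimal inverted-index sketch. It is not how a Google appliance works internally; it assumes the text has already been extracted from each document, and the function names are made up for this example.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document paths that contain it.

    `documents` is a dict of {path: extracted text}; pulling text out of
    Word or PDF files is assumed to have happened already.
    """
    index = defaultdict(set)
    for path, text in documents.items():
        for word in tokenize(text):
            index[word].add(path)
    return index

def search(index, query):
    """Return the paths that contain every word of the query (AND search)."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

    Calling search(index, "wing load") returns every path whose text contains both words, however badly the file itself is named.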

    --
    09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
    1. Re:Google to the rescue? by liquidsin · · Score: 5, Insightful

      use your users, if you can. i'm just talking out my ass here, but i'd think it a not-too-difficult matter to add some sort of user input form along the lines of "hey, now that you've found the document you need, does the name fit the new naming scheme? if not, why not rename it so it fits!". this is assuming you can trust your userbase not to be asshats and to be able to follow the naming protocol.

      --
      do not read this line twice.
    2. Re:Google to the rescue? by CozmicCharlie · · Score: 4, Insightful

      Now I actually LOL'd on that one! Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!? I won't be holding my breath for either of those two things to happen...

    3. Re:Google to the rescue? by shri · · Score: 3, Informative

      May I also suggest Yahoo/IBM's OmniFind as a free as beer alternative?

    4. Re:Google to the rescue? by vrmlguy · · Score: 4, Interesting

      Now I actually LOL'd on that one!

      Getting our userbase to actually give a flying fart about a naming protocol and then getting them to follow it!?

      I won't be holding my breath for either of those two things to happen...

      You obviously don't know how to motivate people. Tell your boss you can get everything renamed for $100/week. Then post a leader board showing who has renamed the most documents each week, and give each week's winner a gift certificate to a local restaurant. Don't let anyone win more than once a month, to prevent too much disruption of normal job duties, and set up some sort of meta-moderation to prevent gaming the system. (You could probably use slashcode out of the box, just make each document a story and suggest better names in the comments.)

      --
      Nothing for 6-digit uids?
  2. Google Appliance by TornCityVenz · · Score: 4, Informative
    --
    I Need someone to rebuild a Digitech Digital Delay pedal for me....for me...for me...for me.
  3. How not to do it by Daimanta · · Score: 3, Funny

    Store it on a single FAT32 partition and hope for the best. Only meant for people with guts or really really nice bosses.

    --
    Knowledge is power. Knowledge shared is power lost.
    1. Re:How not to do it by CarpetShark · · Score: 4, Funny

      Pfft. This is a serious job. 320k floppies are what you want.

      Or... you know... you could try managing those documents with a document management system.

    2. Re:How not to do it by selven · · Score: 4, Funny

      Two of those should be enough for everyone!

  4. Answered your own question by Sir_Lewk · · Score: 5, Insightful

    and there is really no naming or numbering convention in place for the files and directories.

    I think you already know the answer.

    --
    "linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
    1. Re:Answered your own question by peektwice · · Score: 4, Informative

      Absolutely correct. However, I would take it a step further and say that you need a document management system that manages security, meta-data, retention, disposition, etc. Examples are Documentum, IBM FileNet P8, Alfresco, etc. Here's a place to start reading: http://www.cmswire.com/.

      --
      Other than this text, there is no discernible information contained in this sig.
    2. Re:Answered your own question by CorporateSuit · · Score: 5, Funny

      No kidding, men are practically born with this instinct.

      The most basic is dividing the images up according to hair color or the number of girls appearing in each photo. Then you usually divide them up between hardcore and softcore, type of performance, fetish, etc. For your favorites, you can keep a folder in the home directory, of course. I know this guy works for an aerospace company, but keeping track of 500,000+ files isn't rocket science! We've all been able to do that since the advent of the 200GB harddrive.

      --
      I am the richest astronaut ever to win the superbowl.
  5. Re:Hummingbird Document management by HikingStick · · Score: 3, Insightful

    If they're going to consider Hummingbird, they need to be ready to cough up the dollars to get an *EXPERIENCED* Hummingbird administrator. If not, the product will be set up, but basic search functionality will be hosed because of some of the same issues in the original problem description (arising from differences in how the document's properties sheets are populated). If done well, it can be fantastic. If not, users will hate it and do everything possible to avoid it (including installing their own NAS devices).

    --
    I use irony whenever I can, but my shirts are still wrinkled...
  6. Alfresco or SharePoint by flydpnkrtn · · Score: 3, Insightful

    Or some other corporate content management system

    1. Re:Alfresco or SharePoint by flydpnkrtn · · Score: 3, Interesting

      ...and I found an article backing up Alfresco pretty well:

      "You can now stand up an Alfresco Labs server next to a SharePoint Server, and Office will not be able to tell the difference between the two," said John Newton, CTO of Alfresco. "But we are offering considerably more scale than SharePoint can deliver," he said.

    2. Re:Alfresco or SharePoint by Kadin2048 · · Score: 3, Informative

      I have a personal bias, but I think IBM's FileNet would solve this quite neatly. I've done implementations of it that are pretty much exactly what the OP describes.

      Customer has a share that's gotten totally out of control, just stuffed full of files. They want to make them available across multiple offices, generally without getting into complex VPN crap, and also want to simplify management, add more security / compartmentalization, or integrate it with corporate SSO. All doable. Runs on your choice of platforms, too. (Linux, Unix/AIX, Windows all OK as servers.)

      There are even tools that basically take a share drive and walk the directory structure, importing documents at extremely high volume and using the folder structure to categorize and tag the documents within FileNet. It's quite slick and can either be used as a one-shot migration from a traditional fileserver to FileNet, or as an ongoing thing (take all files in a particular directory or set of directories and commit them).
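
      To make the folder-structure-as-tags idea concrete, here is a minimal sketch. It is not FileNet's import tool or API; the function names are invented for illustration, and a real import would hand each (path, tags) pair to whatever system you choose.

```python
import os

def tags_from_path(root, file_path):
    """Turn the folders between `root` and the file into lowercase tags."""
    rel_dir = os.path.dirname(os.path.relpath(file_path, root))
    if not rel_dir:
        return []  # the file sits directly in the share root
    return [part.lower() for part in rel_dir.split(os.sep)]

def walk_share(root):
    """Yield (path, tags) for every file found under `root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            yield path, tags_from_path(root, path)
```

      Lowercasing the tags is one simple way to flatten the mixed casing the OP complains about, so that "WING Specs" and "wing specs" end up in the same category.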

      Once you have the documents into FileNet you can access them over a web interface or via various desktop clients, and there is a nice API for integrating it with custom in-house applications if that's a requirement. Also, IBM makes some add-ons for Word and Excel (and maybe PowerPoint) that allow you to work directly with items stored in a FileNet repository. Plus, if down the road you want to get into "workflow" (basically building your document management system around your business process), that can be easily bolted on.

      Email is in profile if you want specific case studies or whitepapers, or if you want me to put you in touch with people who do these sorts of things regularly.

      --
      "Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
  7. Use a cataloging system by vondo · · Score: 4, Interesting

    I happen to have written one:

    http://sourceforge.net/projects/docdb-v/

    could be what you are looking for. Of course, it'll take effort to catalog the documents.

  8. SharePoint? by tekiegreg · · Score: 4, Informative

    I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...

    --
    ...in bed
    1. Re:SharePoint? by goffster · · Score: 4, Insightful

      Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

    2. Re:SharePoint? by moosesocks · · Score: 4, Interesting

      Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

      No less proprietary than other similar systems. Getting files in/out of Sharepoint is a fairly trivial process, and the API is open enough to craft your own migration plan if you ever decide to move away from it, given that everything else is equally (or even more) proprietary than Sharepoint.

      MS Office might be proprietary, but is so widespread that it's a 'standard' in its own right -- Sharepoint integrates excellently with Office, and keeps your users happy.

      I'm typically not one to advocate the use of Microsoft products. However, Sharepoint worked just fine when I was using it, and is definitely a huge step up from any of the competing products at the same price-level.

      --
      -- If you try to fail and succeed, which have you done? - Uli's moose
    3. Re:SharePoint? by pete-classic · · Score: 3, Interesting

      How does Sharepoint address his problem? It uses the exact same folder/file paradigm that is failing in his existing solution.

      -Peter

    4. Re:SharePoint? by Anonymous Coward · · Score: 3, Insightful

      What you say:

      Why should you give sharepoint a chance? Even if it works well, it is proprietary and you are locked in.

      What you mean:

      Regardless of how perfect a solution might be for you, if it doesn't conform to MY personal ideological viewpoint, it shouldn't be given a chance.

      God I hate people like you.

      --AC

    5. Re:SharePoint? by preystalker · · Score: 4, Informative

      I would recommend using Alfresco. Correctly configured and deployed, you could access files via Windows Explorer, WebDAV, web interface, etc. and data is stored in a SQL database. Alfresco uses open standards and should be considered instead of SharePoint.

  9. Document management software by Wrexs0ul · · Score: 4, Insightful

    Most print companies like Xerox have their own proprietary Document management tools you can buy, and a bunch of CRM and ERP solutions (like OpenERP - it's free AND Open Source) provide some good simple document searching and indexing tools.

    Really it comes down to how complex you want searching to be? Are there specific keys in the document you could index by? Do you require the full-text search capabilities of a Google search appliance?

    A really good solution I've come across for some clients in Edmonton is called MetalTrace by Trace Applications. Don't let the name fool you about the specificity, software like this can scan, index, and even read barcodes on all sorts of documents, then let people search for them via the web. Their "killer-app" has multiple user-defined document types with multiple search fields; that, combined with some back-filing (digital and scanning), really saved the day.

    Do your research though on "Document management" and see what product best fits your needs. It's a really well established field so reinventing the wheel is a little masochistic... not that there's anything wrong with that. ;)

    -Matt

    --
    --- Need web hosting?
    1. Re:Document management software by MyDixieWrecked · · Score: 3, Insightful

      Most print companies like Xerox have their own proprietary Document management tools you can buy

      Document management software is great, but when you have enormous numbers of documents (100s of thousands like in the summary), it becomes necessary to have a content management system in place. Something that's intelligent enough to break the documents up into pieces and allow searches, but something more robust than full-text search.

      We've been using this software called MarkLogic Server (http://marklogic.com). It's an XML database and has a content processing framework for document ingestion. So, basically, assuming that documents are structured similarly, they can be converted into XML so they can be queried with custom weights being applied to content in different portions of the document. The software has built-in Word support so it'll automatically convert .doc files with proper formatting as well as the ability to add custom handlers for other formats including plaintext.
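
      For anyone wondering what "custom weights on different portions of the document" means in practice, here is a minimal sketch in Python. It is not MarkLogic's API and not XQuery; the element names (title, summary, body) and the weights are hypothetical, and the tokenizing is a plain whitespace split.

```python
import xml.etree.ElementTree as ET

# Hypothetical weights: a hit in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 5.0, "summary": 2.0, "body": 1.0}

def score(xml_text, term, weights=FIELD_WEIGHTS):
    """Sum weight * occurrences of `term` over each weighted element."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    total = 0.0
    for tag, weight in weights.items():
        for element in root.iter(tag):
            # itertext() also picks up text nested in child elements.
            text = "".join(element.itertext()).lower()
            total += weight * text.split().count(term)
    return total
```

      Ranking documents by this score puts one that mentions the term in its title ahead of one that only mentions it in passing, which plain full-text search cannot do.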

      We're currently managing a couple million documents and generating dynamic documents on the fly for some processes. Since on-the-fly documents may take time to generate, we have a system in place that saves the result in the database which can also be queried at a later date. It's all really cool.

      Of course, there's a bit of a learning curve to writing your own software for it since it uses XQuery, but it's not much harder to learn than SQL, and so far, it seems to be far more powerful.

      Disclaimer: I'm not a shill nor am I being paid in any way by MarkLogic... I'm just seriously blown away by what their technology has enabled us to do.

      --
      ...spike
      Ewwwwww, coconut...
  10. Knowledge Tree by crackervoodoo · · Score: 3, Informative

    http://www.knowledgetree.com/ If you're looking for a no-cost (read as no license fee) option then Knowledge Tree Community Edition is a decent Document Management tool. We've been using it for a couple of years.

  11. Re:Hummingbird Document management by kiwimate · · Score: 3, Informative

    Yes, but it's not that hard to find someone. Buy Hummingbird (now owned by Open Text) or any other Document Management System. You've got a bunch of documents. You need to manage them. Ergo, a document management system.

    Parent makes an excellent point, however: the single most critical component of a successful implementation is to get a skilled* consultant who can work with you to properly define the taxonomy. Everything else flows from there.

    * If you go with Hummingbird DM, "skilled" means "not one of their overpriced professional services people". They're dreadful.

  12. Switch to Apple... by Tibor+the+Hun · · Score: 3, Informative

    I only partly jest, I know such a thing is damn near impossible to actually do, but in our Mac shop, such things are trivial. With one click of the mouse we enable Spotlight searching on our Leopard AFP server and bam... all the clients have almost instantaneous search access to their docs.

    --
    If you don't know what AltaVista is (was), get off my lawn.
  13. WebDav by SplashMyBandit · · Score: 4, Informative
    There are a few options:
    • For relatively unstructured data without versioning you could serve them over HTTP with WebDAV (Apache) and use your existing HTTP security mechanisms. You wouldn't believe how relieved I've often been when I can get my (secured) resources from home-base while located at a client's site.
    • My outfit uses KnowledgeTree for versioned stuff (http://www.knowledgetree.com/)
    • Or you could embrace your dark-side and use Microsoft SharePoint (plus, with all the Microsoft bugs you'd have a job for life until your employer goes bust). If you are a friend to your company you won't do this, plus your outfit has engineers and the good ones can spot trash solutions.
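
    For the WebDAV option, listing a remote folder is a single PROPFIND request. The sketch below only builds the request and parses a Multi-Status reply; the host name is a placeholder, and sending it (with whatever authentication your server uses) is left out.

```python
import urllib.request
import xml.etree.ElementTree as ET

DAV_NS = {"d": "DAV:"}

def propfind_request(url):
    """Build (but do not send) a PROPFIND request for one folder level."""
    # "Depth: 1" asks for the folder itself plus its immediate children.
    return urllib.request.Request(url, method="PROPFIND", headers={"Depth": "1"})

def list_hrefs(multistatus_xml):
    """Extract the resource paths from a 207 Multi-Status response body."""
    root = ET.fromstring(multistatus_xml)
    return [el.text for el in root.findall("d:response/d:href", DAV_NS)]

# Placeholder URL; pass the request to urllib.request.urlopen() to send it.
request = propfind_request("https://files.example.com/docs/")
```

    Because this is plain HTTP, the satellite offices only need HTTPS access to the server rather than a mapped share over the VPN.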

    If your users are naming their files with strange characters in them (assuming it's not due to Samba) then they will just have to live with it, you won't have time to sort out all the weird names that (mostly MS-Word) users give to their filenames. The primary objective should be to give your users access to the files. Making the directory listing pretty ought to be a secondary concern.

  14. FileNet by Ohio+Calvinist · · Score: 4, Interesting

    I worked at a place that used FileNet, which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it, in that whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and at that, their support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage doing it for UNIX/Oracle because it screamed "poorly ported" when we used it for Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.

    This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end and enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."

    --
    Forgive my spelling from time to time. I'm often posting during short breaks.
  15. Technical issues aside by Vroom_Vroom · · Score: 3, Insightful

    Hire a document manager / clerk person who will create order. Your engineers won't.

    --
    Boing boing boing....
  16. SharePoint by PIPBoy3000 · · Score: 3, Informative

    NASA is a big user of SharePoint, strangely enough. My coworkers run into their folks at conferences from time to time.

    I personally am ambivalent about SharePoint. Its roots are in document management, so it seems to do that relatively well. The publishing features are fairly nice as well. I don't think it's the best system for making web sites, but it may some day get there. Currently it feels like a 2.0 product (the magic rule is to never buy anything from Microsoft before 3.0).

    There are gotchas. SharePoint is tightly coupled with your clients. If everyone accessing the documents is using the latest version of Office, you'll be okay. If not, you'll run into problems. You may also need to throw a lot of hardware into SharePoint, as storing files inside of SQL has some built-in inefficiencies.

    Still, some of our users seem to love SharePoint, so it might be a good option for you.

  17. WIKI by unum15 · · Score: 3, Interesting

    Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.

  18. There is a right way. by mrmeval · · Score: 5, Informative

    http://en.wikipedia.org/wiki/Document_management_system

    For that level of documentation you need to have a staff and get it properly indexed. You need a high level librarian. This would be someone with a master's degree at minimum in library science and at least a bachelor's in information technology. They will not come cheap and they are a long term investment. The software is available, it is not trivial. Hiring a large number of people to recategorize and tag all the documents for the length of time that takes is also an expense but worth it. Once it's all in place maintaining it gets much easier.

    I've seen a system developed for Raytheon. They took all the old compartmentalized data Hughes had and put every scrap of paper through a scanner. It was exceptionally well done. This would display electronic files and would have the location of hard copy. Classified documents were in some cases indexed but were hard copy only afaik. There were some documents that were hard copy only, those were usually ones with an NDA or other restriction on making electronic copies. It had everything mentioned wrt versioning and such. Documents spanned decades with hundreds of revisions and you could pull up and view any revision. Depending on how recent and what type of document you could view a change log. Older scanned ones did not have that unless they'd been important enough to reenter as modern documents which meant OCR or manually transcribed. Some schematics were reentered into the system in a modern format. The effort was worth it. Having that data is the only way some devices or parts could be made or repaired.

    --
    I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
  19. Re:Google wave by EQ · · Score: 4, Funny

    "[O]rganizations managing hundreds of thousands of documents since the Roman Era,"

    You mean The Vatican? I doubt that "small aerospace company" could afford to staff up on monks and monasteries.

    --
    Buffalo buffalo Buffalo buffalo buffalo buffalo Buffalo buffalo! http://goo.gl/J9bkO
  20. Mac OS X Server - Spotlight Server by Gary+W.+Longsine · · Score: 4, Insightful

    Since your organization probably has Windows clients, you can only long for something as nice as Mac OS X Spotlight Server.

    Google Search Appliance is definitely what you want.

    If you have a mid-sized company you definitely don't have a surplus of highly talented systems administrators lying about to run one of the document management systems that others here are likely to suggest. Be very careful going down the document management server path. It's far, far more work than you think it will be, and than the vendor will tell you it is. Not simply more work for you, but for your IT staff and your users, too.

    The Google Search Appliance, by contrast, is "fire and forget". Plug it in. Turn it on. Patch it when Google suggests you do so. That's about it.

    --
    If you mod me down, I shall become more powerful than you could possibly imagine.
  21. Re:Google wave by theNetImp · · Score: 5, Funny

    monks work for free, they just need food and enlightenment, and if you get lucky they fast and then only need the enlightenment aspect.

  22. Re:Hummingbird Document management by dimeglio · · Score: 3, Insightful

    Skilled consultants are great but without training employees you'll keep on paying big $ for consultants whenever there's a change to make. Let the consultant show how and let the employees do the work. BTW: We have 3000+ users (all happy) on their system and no consultant.

    --
    Views expressed do not necessarily reflect those of the author.
  23. Garbage In Garbage Out by sexconker · · Score: 4, Informative

    It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for.

    Slow. Upgrade your network and VPN. You know that VPN layer is just killing your performance.

    No naming or numbering convention. Get one.

    Mixed casing. Learn How to Properly Case Folders (and documents).

    Dashes and ampersands. Are they a problem? Aesthetically unpleasant? I personally restrict punctuation in a filesystem to dashes, periods, and parentheses (unless the punctuation is a replicable part of the name of the file/folder).

    Examples:
    01 - The First Track (vocal)
    02 - $lashhvertisements Attack!
    03 - Where Have All the A.C.'s Gone

    Develop your own method that works and be obsessed about it to the point where you would reburn a disc if one of the filenames was "01-Name" instead of "01 - Name".
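
    If you want a script to do the obsessing for you, here is a minimal sketch of that one convention. The function names are invented; it fixes the number separator automatically but only reports other punctuation, since that may be a genuine part of the name.

```python
import re

# Punctuation the convention allows; anything else is reported, not removed.
ALLOWED_PUNCTUATION = "-.()"

def fix_separator(name):
    """Rewrite a leading 'NN-Name' or 'NN -Name' as 'NN - Name'."""
    return re.sub(r"^(\d+)\s*-\s*", r"\1 - ", name)

def flag_punctuation(name):
    """Return the punctuation characters that fall outside the allowed set."""
    return sorted({ch for ch in name
                   if not ch.isalnum() and not ch.isspace()
                   and ch not in ALLOWED_PUNCTUATION})
```

    Run flag_punctuation over a directory listing to get a review list for a human, rather than renaming blindly.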

    Hundreds of directories.
    Each file should have its own folder.
    "That's insane!" you say. Start out with this mentality. If there is no reason at all to separate two files (they are part of the same thing) then place them in one folder, and make sure the folder is named all-encompassingly. Repeat for all files. If you get into an AB, BC, but not ABC situation, the solution is to have A and B and C, with A and C linking to B with your choice of shortcut/link/symlink/etc.
    Do this until all files are in folders. Then repeat with folders.
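
    The A, B, C arrangement can be sketched with symlinks as below. This assumes a filesystem and account that allow symlinks, and the function name is made up; whether clients on a network share follow such links depends on how the server is configured.

```python
import os

def link_shared_folder(root, shared, users):
    """Create `shared` once under `root` and link it into each folder in `users`."""
    shared_path = os.path.join(root, shared)
    os.makedirs(shared_path, exist_ok=True)
    for user in users:
        user_path = os.path.join(root, user)
        os.makedirs(user_path, exist_ok=True)
        link_path = os.path.join(user_path, shared)
        if not os.path.lexists(link_path):
            # A relative target keeps the link valid if `root` is moved.
            os.symlink(os.path.join(os.pardir, shared), link_path,
                       target_is_directory=True)
```

    After link_shared_folder(root, "B", ["A", "C"]), the files in B exist once on disk but show up under both A/B and C/B.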

    There is NO substitute for organization and getting people on the same page. Develop some conventions. Task people to fix as they go. Check up to make sure people accessing documents are fixing as they go, and doing so according to convention. Once people are used to the convention, and once things are relatively organized, they won't ever need to search again. They'll instantly know where 99% of things are, and will be able to dig around and find anything else within seconds.

    The main problem you face is getting organized after already being unorganized. It isn't easy, but at least you're not dealing with millions of paper documents.

  24. Re:it's all about the index by jd · · Score: 4, Interesting

    Very true. I'd take a look at DSpace or Open Library for examples of software designed to handle gigantic numbers of documents and maintain sensible indexes for them.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  25. Re:Google wave by Anarchduke · · Score: 5, Insightful

    There is a whole profession dedicated to this, and there is a major in college specifically designed to assist in organizing documents into meaningful collections.

    I suggest your company look at hiring a library sciences major, since this is what they do.

    --
    who prays for Satan? Who in 18 centuries has had the humanity to pray for the 1 sinner that needed it most? ~Mark Twain