How To Manage Hundreds of Thousands of Documents?
ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"
I think it's in beta though.
Isn't this the sort of thing that a google search appliance would be helpful for? Then you don't need to know the exact filename, just some specific information that can identify the file. This certainly solved my problem with having thousands of emails.
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
http://en.wikipedia.org/wiki/Hummingbird_Ltd
and
http://connectivity.hummingbird.com/home/connectivity.html?cks=y
Google them? http://www.google.com/enterprise/search/gsa.html
I Need someone to rebuild a Digitech Digital Delay pedal for me....for me...for me...for me.
Store it on a single FAT32 partition and hope for the best. Only meant for people with guts or really really nice bosses.
Knowledge is power. Knowledge shared is power lost.
and there is really no naming or numbering convention in place for the files and directories.
I think you already know the answer.
"linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
I don't think this is one of those times, though.
The lack of a naming convention for the filenames and directories is neither here nor there. What matters is how well it's indexed.
Now I use naming conventions for my files (photos, mp3s, etc.). Am I contradicting myself? No: I just don't have enough of them to need a separate index.
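To make the indexing point concrete, here is a minimal sketch of an inverted index over file names. It only illustrates why inconsistent casing and punctuation stop mattering once an index exists. The function names and the sample paths are made up for the example, and a real document management system would index contents and metadata too.

```python
import re
from collections import defaultdict


def tokenize(name):
    """Split a name into lowercase alphanumeric tokens, ignoring punctuation."""
    return [t for t in re.split(r"[^0-9a-z]+", name.lower()) if t]


def build_index(paths):
    """Map each token to the set of paths whose name contains it."""
    index = defaultdict(set)
    for path in paths:
        for token in tokenize(path):
            index[token].add(path)
    return index


def search(index, query):
    """Return the paths that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    results = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        results &= index.get(token, set())
    return results


# Hypothetical, messily named files like the ones described in the question.
paths = [
    "ENG/Wing-Spar&Rib_REV2.DOC",
    "eng/wing spar notes.txt",
    "HR/Holiday Schedule.doc",
]
index = build_index(paths)
print(sorted(search(index, "wing spar")))
```

The query matches both wing spar files even though one name is upper case with dashes and an ampersand and the other is lower case with spaces.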
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
OpenDocMan has helped a lot with our Graphics and Engineering department issues, similar to yours. LDAP access to storage helped sort out who could put what where. The implementation took a bit of time to get the original files into the right locations, but it's easier to manage now.
Darwin Enforcement Agent
http://www.google.com/enterprise/gsa/
Or some other corporate content management system
Here's to the crazy ones
Then if you can be bothered, you can start going through older files and updating the naming conventions or entering them into the document management system of your choice...
Laters Sol "Have you found the secrets of the universe? Asked Zebade "I'm sure I left them here somewhere"
I happen to have written one:
http://sourceforge.net/projects/docdb-v/
could be what you are looking for. Of course, it'll take effort to catalog the documents.
I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...
...in bed
Google Search Appliance
Cygnet ECM might work for you.
Use an EMC document solution, where you have all documents in a central database with metadata that can describe the content, and where they can be accessed through a cached server from different sites.
If you need to use just plain documents, store them in one big directory and update the meta information.
Let people move links onto their system and organize the links how they like, but don't let them move the documents.
Think iTunes for documents. I loathe that example, since I set up this sort of thing long before iTunes came around.
If you need collaborative use of your documents, get something like this:
Jive.com
The Kruger Dunning explains most post on
Sounds like you need a real document management system.
Depending on your requirements, you could go with something open source like Alfresco or one of the big boys like EMC Documentum or IBM/FileNet P8. Either way, you will end up with an indexed repository of documents that makes it easy to find old documents, add new ones, etc. (assuming you and/or your integrator do the project correctly). It will also provide a web front-end so you don't have as much killer WAN traffic as you do now.
With a good document management system in place, you are also on your way to having workflow and other benefits as well, e.g. when Bob submits a document with XYZ as an index value, automatically tell Joe that it is in and ask Joe to approve it. When Joe approves it, tag it "Approved", and let Jim know.
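As a rough illustration of that Bob/Joe/Jim flow, here is a toy sketch. The `Document` and `Workflow` classes are invented for this example and do not correspond to the API of Alfresco, Documentum, FileNet, or any other product. Notifications are simply recorded in a list rather than sent anywhere.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    name: str
    index_value: str
    tags: set = field(default_factory=set)


class Workflow:
    """Toy approval workflow; 'notifications' are (recipient, message) pairs."""

    def __init__(self, approver_by_index, notify_on_approval):
        self.approver_by_index = approver_by_index    # e.g. {"XYZ": "Joe"}
        self.notify_on_approval = notify_on_approval  # e.g. "Jim"
        self.outbox = []

    def submit(self, doc, author):
        """Route a new document to the approver registered for its index value."""
        approver = self.approver_by_index.get(doc.index_value)
        if approver is not None:
            self.outbox.append(
                (approver, f"{author} submitted {doc.name}; please approve")
            )
        return approver

    def approve(self, doc, approver):
        """Tag the document as approved and tell the downstream person."""
        doc.tags.add("Approved")
        self.outbox.append(
            (self.notify_on_approval, f"{approver} approved {doc.name}")
        )


wf = Workflow({"XYZ": "Joe"}, "Jim")
doc = Document("spec.doc", "XYZ")
wf.submit(doc, "Bob")
wf.approve(doc, "Joe")
for recipient, message in wf.outbox:
    print(recipient, "<-", message)
```

A real system would persist this state and check that the approver is actually allowed to approve. The sketch only shows the routing idea.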
Depending on your requirements for document retention, archiving, e-discovery, etc. the document management system can help you fulfill all of those automatically.
Hire human beings to sift through it and label each file with a numbering/labeling system devised by your engineers. The human mind is a relatively inexpensive and already well designed piece of machinery. A few dozen of them given enough time can work through those hundreds of thousands of documents and get them sorted correctly. The problem you have is that you have unsorted, improperly labeled material. It is cheaper to hire sufficiently (or even insufficiently) evolved groups of people than to invent a machine capable of doing so. And, with the economy the way it is, you'll be doing everyone a favor by giving them years of employment. When the Manhattan project needed to create a large excess of fissile material for the war with Japan, and with all the men away at war, they hired dozens of women to sit at machines: turning knobs, checking meter levels, verifying output. The scientists themselves did not even need to be there; they designed a process and the women were trained in it and followed it.
Most print companies like Xerox have their own proprietary Document management tools you can buy, and a bunch of CRM and ERP solutions (like OpenERP - it's free AND Open Source) provide some good simple document searching and indexing tools.
Really it comes down to how complex you want searching to be? Are there specific keys in the document you could index by? Do you require the full-text search capabilities of a Google search appliance?
A really good solution I've come across for some clients in Edmonton is called MetalTrace by Trace Applications. Don't let the name fool you about the specificity; software like this can scan, index, and even read barcodes on all sorts of documents, then let people search for them via the web. Their "killer-app" has multiple user-defined document types with multiple search fields, which, combined with some back-filing (digital and scanning), really saved the day.
Do your research though on "document management" and see what product best fits your needs. It's a really well established field so reinventing the wheel is a little masochistic... not that there's anything wrong with that. ;)
-Matt
--- Need web hosting?
http://www.knowledgetree.com/ If you're looking for a no-cost (read as no license fee) option then Knowledge Tree Community Edition is a decent Document Management tool. We've been using it for a couple of years.
JamWiki.org, for instance, has search capabilities built in. It also has security built in and is easily manageable. You can upload the documents and even migrate them to wiki format later. Keeping the documents in a near-text open format will help you migrate them again sometime in the future.
A man without religion is like a fish without a bicycle. -- Ron "Doc" Ferrell
I used an old version a while ago and it was pretty good then. Does versioning and other things.
http://www.knowledgetree.com/
While this may be an odd suggestion, here's two things:
1) Get yourself a damn good document or content management system. Get it set up on the baddest machines you can afford. Overshoot the capability you need, so that you have room to grow.
2) Get a librarian to look at the kinds of documents you create, and develop a system to catalog documents while maintaining reasonable standards for file names. As the super simplest system, maybe document names that indicate (at a minimum) what project or what overhead department they belong to, a broad category of subject matter, and if it's versioned, a version number.
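For what it's worth, a file name standard like that is easy to check mechanically. Below is a small sketch that assumes a purely hypothetical convention of `<project>_<category>_<title>_v<NN>.<ext>`. The pattern and the sample names are illustrative only, and whatever a librarian actually devises would replace them.

```python
import re

# Hypothetical convention: <project>_<category>_<title>[_vNN].<ext>, all lowercase.
NAME_PATTERN = re.compile(
    r"^(?P<project>[a-z0-9]+)_"
    r"(?P<category>[a-z0-9]+)_"
    r"(?P<title>[a-z0-9-]+)"
    r"(?:_v(?P<version>\d{2}))?"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_name(filename):
    """Return the fields encoded in a conforming file name, or None if it does not conform."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    fields = match.groupdict()
    if fields["version"] is not None:
        fields["version"] = int(fields["version"])
    return fields


for name in ["falcon_structures_wing-spar-analysis_v03.doc", "Wing Spar & Rib FINAL.DOC"]:
    print(name, "->", parse_name(name))
```

Run over an existing share, a check like this gives you a list of non-conforming names to hand to whoever does the cleanup.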
I tried to bludgeon a small company I worked for (around 40 engineers, one overworked Q&A person, and one system administrator) into moving towards a storage system for word documents that was not "Create a new folder for each version of the document set, place them all in the right folder, and if you don't Ray will eat your head." We wound up using (of all things) Perforce SCM to house fifty thousand word documents, and were starting on putting actual code revisions for automated test sets into the system when our avionics testing focus became a serious liability, and overhead workers were drastically cut. (Why have one Q&A guy and one system admin guy? We can get an intern to do BOTH!)
Any of the many document management systems. They allow the extraction of metadata, which is in turn used to 'find' the document you are looking for. Nearly all contain some security settings and a viewer for many types of files. One thing to note: this magic doesn't happen by itself. If you get stuck doing this, be prepared for two things: a. No one really knows how they want to do this; they all just want to wonder if one of the many docs has their answer and have the correct doc located and opened for them. b. You are about to become a stranger to all those who know you outside of work.
If you don't like the idea of sending your information to Google to have it indexed, you can look into some server side applications (with associated client apps) that do the indexing and searching for you. I'm not familiar with Windows ones (although I'm sure there are some) but there are quite a few for Linux and primarily Spotlight for the Mac. The option to have the actual indexing done server side would save on your bandwidth tremendously. You may also want to consider using a different filesystem, one that has indexing capabilities built in.
Science will save us. The question is, will it destroy us first?
Sure, with any number of ECM solutions. At the simplest end many of them simply enforce naming conventions; at the more robust end, they support many different file types for viewing, indexing, etc. and can also provide rich metadata on a document-by-document basis. Some of them have been named in the comments, including but certainly not limited to SharePoint 2007, Cygnet, Documentum, Open Text, FileNet, etc. Any system worth looking at has a web-based interface, at least for searching, and many of them offer for more meaningful interaction as well. Alfresco, Hyland, and SpringCM all have web-based ECM solutions and more comprehensive web-based offerings are available all the time. Oh - and if you're aerospace there are a number of regulatory requirements for information management you'll need to comply with, which does complicate the situation but spending the ducats for software and/or consulting help is probably cheaper than whatever your litigation and regulatory audit support processes cost today. Hope this helps, Jesse Wilkins ECM and other stuff consultant jwilkins13 at gmail dot com
I think step one is to pick a storage/naming convention and stick with it. Also, depending on your needs, a document management system could help. The other thing I would do is figure out where the bottleneck is for your speed issue: is it the VPN connection, the network not being able to keep up, or the computer running Samba? Once you know more about where the slowdown is, work on that spot.
I only partly jest, I know such a thing is damn near impossible to actually do, but in our Mac shop, such things are trivial. With one click of the mouse we enable spotlight searching on our Leopard AFP server and bam... all the clients have almost instantaneous search access to their docs.
If you don't know what AltaVista is (was), get off my lawn.
I'm gonna say nothing beats a proper folder structure and naming convention. I'd also recommend using svn. Also spend some time to develop some macros to assist in the creation/saving/retrieval of said documents from the repository. Maybe create some standard templates too... just my 2cents!
If your users are naming their files with strange characters in them (assuming it's not due to Samba) then they will just have to live with it; you won't have time to sort out all the weird names that (mostly MS-Word) users give to their files. The primary objective should be to give your users access to the files. Making the directory listing pretty ought to be a secondary concern.
..something like FileNet or SAP. Sounds like you have big corporation needs, so get a big corporation solution.
Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
Mindoka (http://www.mindoka.com) has a document management product that is designed to solve the problem that you have.
Put Steelhead mobile on all the clients. Document transfer over the VPN will GREATLY improve. Since it's mostly text/pictures, there will be so much duplicate data that doesn't need to be transferred over the wire multiple times, the round trip time will decrease so much they'll forget they're on a VPN.
I worked at a place that used FileNet, which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it, in that whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and even then the support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage doing it for UNIX/Oracle because it screamed "poorly ported" when we used it for Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.
This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end and enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."
Forgive my spelling from time to time. I'm often posting during short breaks.
Hire a document manager / clerk person who will create order. Your engineers won't.
Boing boing boing....
I think the right option for you would have to be ordering the documents in a database and serving them up through a website. I think that would be helpful for your satellite offices, since mapping shares through Samba over VPN is sometimes unstable and always nontrivial. Besides, the system doesn't seem to be working for you. You really don't have to be that proficient with functional webpages to make something like this, especially if you use Ruby on Rails. A Ruby on Rails guy would probably need only a couple of hours to make such an application. Then you could have functionality like searching and sorting by author, department, type and so on.
I forgot to mention Alfresco as well, although I've never personally tried it.
http://www.alfresco.com/index-b2.html
Can't really suggest a good document management program but I can tell you one to avoid. We use Livelink at my place of work and its indexing and search capabilities are horrible (some would say non-existent). For example every document added to Livelink gets a document number assigned to it. One would expect to be able to retrieve that document by using the same document number but if you enter it into the search bar Livelink returns no results found. Huh? Not to mention some odd UI behaviours like when you add a folder to the favourites box the original folder disappears from the standard file listing (meaning there is no single canonical listing of files and directories, you need to always look in 2 places).
What kind of documents are they? If they're mostly text and you want versioning, the only drawback to subversion is getting people to learn the tools, but that might be too much.
If they're archival/static documents, an institutional repository could work. Something like DSpace isn't that hard to deploy and will provide basic archival and search features.
The middle ground between those two solutions is probably what you want, though. Everyone I work with uses SharePoint for that, and I hate recommending proprietary lock-in.
Laserfiche (or LF) is just what this is for. It is DOD, DOJ certified and crap, and is used by all branches of the military and several other areas of the government as their document management system. With several different software offerings, just about any situation can be taken care of. Its features include the ability to search based on document name, template information, or OCR'd text (which the software also takes care of). With add-on features such as Quick Fields, it may be able to automatically sort, add template information, OCR, name and then store the documents. It really is a nice way to go. Satellite offices can access it and be either full or read-only users. It has the ability and modules to connect to just about any other type of data/information system (GIS, financial software, etc.) and is very scalable.
I was a tech for 5 years with a LF VAR. I'm not there anymore. We were constantly cleaning up messes left by other document management systems. Take your time with this thing and really plan your naming convention, folder hierarchy and user setup. It's easier to get it right (or as close to it as possible) than to go back and have to fix it later. A good LF VAR should help you with this. Definitely check references of competing companies. Some VARs are A LOT better than others.
Digital Asset Management
http://www.lmgtfy.com/?q=digital+asset+management
We have extensive documentation and tracking needs. We use two sets of software for records and also keep a hard copy for long-term storage. For tracking parts on/off and hours in service, TSO, TSI, etc., we use TRAX Evo2. We scan all written paperwork into a database, which we interface with via Alchemy. This allows us to view the current status of all of our aircraft and their parts and to track the paperwork for each action taken. Alchemy has a browser interface and we use IE to access it. This allows a person to access the documentation from any of our stations and/or offices internally on the network. Both Alchemy and TRAX are acceptable to our local FSDO. The hardware setup for this is not something I can shed light on, as I do not get to play with computers that are ground bound. Hope that helps, maric
As may have been pointed out, organizing the files is really the best way. Develop a strict schema for naming conventions as well as a hierarchical directory structure for maintaining and organizing. Something like:
/projectname/projectpart/data (contains the final draft of any document)
/projectname/projectpart/working (contains files that people are modifying so that they can be merged/checked in to the data dir)
/projectname/projectpart/misc (contains misc. notes or files that need to be filed with the project)
The "projectpart" dirs are really just logical groupings of data/files for the project. Say you are designing a plane, well, break it up into relevant systems, like electronics, power plant, structure, etc., and each of those are the "projectpart" directories. The "projectname" is simply the overall project itself, be it the name of the plane, maybe the name of the contract, etc.
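If you go this route, it helps to create the skeleton with a script so nobody has to remember the layout. A minimal sketch follows, assuming the three subdirectories named above. The function name and the example project and part names are placeholders.

```python
from pathlib import Path

# The three per-part subdirectories from the layout described above.
SUBDIRS = ("data", "working", "misc")


def create_project_tree(root, project, parts):
    """Create <root>/<project>/<part>/{data,working,misc} for every part.

    Existing directories are left untouched, so the script is safe to re-run.
    Returns the list of directory paths that now exist.
    """
    created = []
    for part in parts:
        for sub in SUBDIRS:
            path = Path(root) / project / part / sub
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


if __name__ == "__main__":
    # Placeholder names; substitute the real share root and project parts.
    for path in create_project_tree("shares", "plane", ["electronics", "powerplant", "structure"]):
        print(path)
```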
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
The OP did not mention exactly how many remote branches or computers need to access the documents at once, however, windows Terminal Server licenses aren't too expensive and the remote desktop experience is silky smooth. Also the documents would all reside on a central server raid array or NAS device and never need to travel over the internet to remote sites. This would also free up massive amounts of bandwidth over the VPN, considering TS just needs an internet connection and uses SSL encryption. (although I don't know what you would even need a VPN for after making this conversion)
Who else read this and thought... working in a satellite office for an aerospace company would involve a lot of cool travel perks?
-- Terry
Odd that the next story has a great idea for document management right in the summary...
Hadoop!
Support FSF: Stop thinking with your wallet, and think with your imagination. (cc/non-commercial)
...seems like a natural solution for your connectivity issues, or perhaps whatever the open source variety of Sharepoint is. You really do need to tackle the naming convention question though. You can have all the file indexing you want, but sometimes a nice, logical, clean file name will get you what you're after much faster than any kind of searching.
It's going to be horrible, painful, thankless work that will put you on the shit list of just about every department manager and administrative assistant ("You want me to rename how many files?"), but it has to be done.
What worries me more than anything else is that you claim to be a mid-sized aerospace company. If you are having problems finding documents, what happened to your traceability processes necessary for your QMS and how do you guarantee that employees use up-to-date documents? How did you handle the process in the past??? And, what does your QMS stipulate for records and traceability?
IBM OmniFind should do the trick. It indexes your files and then you can search the index very quickly. It also does caching of documents and other nifty stuff. It is based on Apache Lucene and there is a free (as in beer) version, IBM OmniFind Yahoo Edition. The free version will work with up to 500 000 documents. I used it for searching a number of networked drives with circa 50 000 files on them, which it did very well.
NASA is a big user of SharePoint, strangely enough. My coworkers run into their folks at conferences from time to time.
I personally am ambivalent about SharePoint. Its roots are in document management, so it seems to do that relatively well. The publishing features are fairly nice as well. I don't think it's the best system for making web sites, but it may some day get there. Currently it feels like a 2.0 product (the magic rule is to never buy anything from Microsoft before 3.0).
There are gotchas. SharePoint is tightly coupled with your clients. If everyone accessing the documents is using the latest version of Office, you'll be okay. If not, you'll run into problems. You may also need to throw a lot of hardware at SharePoint, as storing files inside of SQL has some built-in inefficiencies.
Still, some of our users seem to love SharePoint, so it might be a good option for you.
When I worked for the state Attorney General's office as I.T. Director a request came into I.T. that immediately gave me an upset stomach. The request was for all documents on the server that contained the word "lead" as in the chemical element Pb. The issue was that the word lead and the element share the same spelling.
I kicked in and wrote an app that generated a web list on the fly and had clickable links so the documents could be examined and then marked as part of discovery.
I also brought in three Xerox 490's. Those were the hardware part of the document management system. I don't know if they ever got the servers for it but at least they had the gear. In the meantime I suggested using meta-data in filenames.
Hire a real librarian, it's what they do.
On the plus side, you also get to hire a librarian. nudge, nudge, wink, wink, say no more.
I'm guessing that wasn't on their radar screen...
We run an EDMS for our local council here. The filename doesn't matter; what matters is how it is all indexed. Too many people here are thinking that you need to rename EVERY document. I don't have any experience with Hummingbird, but what about HP's TRIM software? Yes, $$$$, but it also has a web GUI. Just a thought.
It can scale extremely well. It is the backend to Adobe's acrobat.com website! So you know it can handle millions of documents if you need it to. Sharepoint requires MS SQL Server for searching documents. With Alfresco, that feature is built in.
Sharepoint is teaming software and not really designed for large document repositories. Alfresco has a teaming interface (Alfresco Share) and a more generic document repository interface.
Alfresco can expose the repository via FTP, SMB, WebDAV, and a web client interface.
Your solution:
http://xkcd.com/208/
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.
http://en.wikipedia.org/wiki/Document_management_system
For that level of documentation you need to have a staff and get it properly indexed. You need a high level librarian. This would be someone with a masters degree at minimum in library science and at least a bachelors in information technology. They will not come cheap and they are a long term investment. The software is available, it is not trivial. Hiring a large number of people to recategorize and tag all the documents for the length of time that takes is also an expense but worth it. Once it's all in place maintaining it gets much easier.
I've seen a system developed for Raytheon. They took all the old compartmentalized data Hughes had and put every scrap of paper through a scanner. It was exceptionally well done. This would display electronic files and would have the location of hard copy. Classified documents were in some cases indexed but were hard copy only afaik. There were some documents that were hard copy only, those were usually ones with an NDA or other restriction on making electronic copies. It had every thing mentioned wrt versioning and such. Documents spanned decades with hundreds of revisions and you could pull up and view any revision. Depending on how recent and what type of document you could view a change log. Older scanned ones did not have that unless they'd been important enough to reenter as modern documents which meant OCR or manually transcribed. Some schematics were reentered into the system in a modern format. The effort was worth it. Having that data is the only way some devices or parts could be made or repaired.
http://en.wikipedia.org/wiki/Document_management_system
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
It's called an index or a bibliography. There exists a profession known as 'librarian' specifically trained in the creation of such and in the management of large numbers of documents.
We went through this for both document management and a web front end for access. We looked at SharePoint, Alfresco, Oracle UCM, RedDot and a few others. We dropped most due to cost, functionality, and ease of use for non-developers to do page work. SharePoint was dropped due to cost in an internet setting (CALs), no non-developer front end for page layout (they couldn't use HTML), and because it stores everything in the database. From prior experience this made backup/restore difficult, as it keeps the IP of the web site in the database when you back up. If you restore to a different machine it gets confused. It was between Oracle and Alfresco. You cannot go wrong with either. Both are extensible, and both either have what you need built in or let it be added easily. Both are good for non-developers to use. Support is very good with either. We went with Oracle. While it did cost more, it matched our existing infrastructure.
This is built for the exact situation you described:
http://www.opentext.com/2/global/sol-products/sol-pro-docmgmt-collaboration.htm
You can either import the files into the system, or leave them in place, index them and use the search engines to locate the needles in your haystacks...
About Open Text:
http://en.wikipedia.org/wiki/Open_Text
Hummingbird is a subsidiary of Open Text, the solution mentioned above...
Full Disclosure:
I am an Open Text employee.
Google?
http://www.google.com.au/enterprise/mini/index.html
Seriously, if you can't be bothered collecting/maintaining the metadata that more structured solutions require, then just let Google index the lot. It'll work just as well (or not) as it does on the Internet. Although it's not free, it seems reasonably priced. It could be a quick answer to your problem.
I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...
Just avoid the wiki functionality like the plague. It completely sucks.
Since your organization probably has Windows clients, you can only long for something as nice as Mac OS X Spotlight Server.
Google Search Appliance is definitely what you want.
If you have a mid-sized company, you definitely don't have a surplus of highly talented systems administrators lying about to run one of the document management systems that others here are likely to suggest. Be very careful going down the document management server path. It's far, far more work than you think it will be, and more than the vendor will tell you it is. Not simply more work for you, but for your IT staff and your users, too.
The Google Search Appliance, by contrast, is "fire and forget". Plug it in. Turn it on. Patch it when Google suggests you do so. That's about it.
If you mod me down, I shall become more powerful than you could possibly imagine.
We use a Bentley product called ProjectWise. It is a document management system with file attribution among other things. It is primarily useful for Bentley's line of products, but we have also used it as an archival system and for working documents that are not Bentley-specific. No... I do not work for Bentley, but my job heavily uses their products.
A good example is Cisco WAAS, a cool video showing how it works is here: http://www.cisco.com/cdc_content_elements/flash/ans/index.html
See here for data sheets and specs: http://www.cisco.com/en/US/products/ps5680/Products_Sub_Category_Home.html
Cisco's solution is inexpensive and you can use your existing router investment to do all the heavy lifting.
Pat
Unsurprisingly, the answer to managing many documents is to use a document management system. There are several commercial and free products available, both linked here and on the Wikipedia page for Document Management Systems.
I've worked next to the team who administered Bentley ProjectWise in a previous engineering job, which is expensive but definitely suited to your task. There may be other good options out there.
DMS -- http://en.wikipedia.org/wiki/Document_management_system
-- botsex is {grep;touch;strip;unzip;head;mount}
We're using a Win3.1 app called LaserFiche on XP with > 250,000 documents and it's lightning fast, works with TIFF files and PDF and probably more. Includes file and folder permissions.
moox. for a new generation.
Step 1: Print out all 100 thousand docs and draw different little smiley faces on each of them. Step 2: scan all your docs back in as jpegs. Step 3: import all those jpegs into iPhoto and use "Faces" to magically organize them - just like on the television commercial.
Check out Thunderstone. It's what they do, and they do it very well.
Documentum, DocuShare, Livelink, SharePoint. I've heard of Documentum installs with 100m+ docs. It's quite good, but expensive.
Take a look at NetDocuments. It's a SaaS (Software as a Service) document management system. It handles millions of documents, can be accessed from anywhere, and is relatively inexpensive compared to maintaining your own servers.
It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for.
Slow. Upgrade your network and VPN. You know that VPN layer is just killing your performance.
No naming or numbering convention. Get one.
Mixed casing. Learn How to Properly Case Folders (and documents).
Dashes and ampersands. Are they a problem? Aesthetically unpleasant? I personally restrict punctuation in a filesystem to dashes, periods, and parentheses (unless the punctuation is a replicable part of the name of the file/folder).
Examples:
01 - The First Track (vocal)
02 - $lashhvertisements Attack!
03 - Where Have All the A.C.'s Gone
Develop your own method that works and be obsessed about it to the point where you would reburn a disc if one of the filenames was "01-Name" instead of "01 - Name".
Hundreds of directories.
Each file should have its own folder.
"That's insane!" you say. Start out with this mentality. If there is no reason at all to separate two files (they are part of the same thing) then place them in one folder, and make sure the folder is named all-encompassingly. Repeat for all files. If you get into an AB, BC, but not ABC situation, the solution is to have A and B and C, with A and C linking to B with your choice of shortcut/link/symlink/etc.
Do this until all files are in folders. Then repeat with folders.
There is NO substitute for organization and getting people on the same page. Develop some conventions. Task people to fix as they go. Check up to make sure people accessing documents are fixing as they go, and doing so according to convention. Once people are used to the convention, and once things are relatively organized, they won't ever need to search again. They'll instantly know where 99% of things are, and will be able to dig around and find anything else within seconds.
The main problem you face is getting organized after already being unorganized. It isn't easy, but at least you're not dealing with millions of paper documents.
I use the 'job' system, which I learned from working at Digital Domain (the Visual Effects Company) and then passed it on to the Aerospace company where I now work.
Effects companies deal with enormous amounts of data, and many different versions of a shot as well as all the elements that make up that shot, along with other data such as project settings files from software used in the making of that shot. They had a very specific file naming system to keep that all organized, and it was referred to as the job system, because first and foremost everything was logically separated by project.
How that has translated for me into the Aerospace field is at the root of the main drive share, there are two primary folders, job and departments. Departments contains generic documents for each department such as forms, standards, etc.
The 'job' folder contains several categories of jobs or projects, such as vehicles, engines, pumps, etc.
Inside those are folders with the project name. Inside each project folder is a series of folders for different data types, such as solidworks, reports, proposals, documentation images, etc.
File naming:
File naming should be consistent, and I always start my own files with the date with year first, because I do not trust meta-data one single iota. I have had dates wiped out when a backup system kept a backup, but did not preserve the file creation / modify date on copy.
After that it is the thing, then the version.
So 09-06-10_widget_v01.sldprt
version two should be exactly the same, with the number iterated up. There should never be a document named something_FINAL because you always end up with FINAL_FINAL_FINAL etc. :)
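A minimal Python sketch of that scheme, assuming the pattern is exactly YY-MM-DD_thing_vNN.ext as in the example above. The function names make_name and next_version are made up for illustration; nothing like this is described as existing at either company.

```python
import re
from datetime import date

# Assumed pattern: YY-MM-DD_thing_vNN.ext (as in 09-06-10_widget_v01.sldprt)
NAME_RE = re.compile(r"^(\d{2}-\d{2}-\d{2})_(.+)_v(\d+)(\.[^.]+)$")


def make_name(thing, version, ext, day):
    """Build a file name: date (year first), then the thing, then the version."""
    return f"{day:%y-%m-%d}_{thing}_v{version:02d}.{ext}"


def next_version(filename):
    """Return the same name with only the version number iterated up."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not in date_thing_vNN.ext form: {filename}")
    stamp, thing, version, ext = match.groups()
    width = len(version)  # keep the zero padding the file already uses
    return f"{stamp}_{thing}_v{int(version) + 1:0{width}d}{ext}"
```

Anything named something_FINAL fails the pattern and raises, which is a cheap way to catch it early.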
Now, as you probably know, the difficulty is enforcing a uniform standard when people are busy doing actual work. Things get sloppy, things get messy. You have to keep up after people, and policing stuff like this is not fun. At Digital Domain it was an urgent necessity for everyone to use the standard, and there was automated software that relied upon it. At the aerospace company, I gave up years ago trying to enforce a perfect policy. Now, people generally follow the example I set to a point where you can easily find things. When I first got to this company, when it was really small, all files were (seriously) piled into nearly a single folder. Even at that size it was already a disaster and it was impossible to find anything. People were used to working on their own computer and did not have a concept of a shared file server, at least not in a modern sense.
Now you can just swatch down the left pane in windows explorer and get what you want very quickly.
This system is designed to use the left pane (lots of folders for organization) and people who were used to the Windows 3.1 way of double clicking through folders without the left pane had to change their (awful) habits. That was the biggest concession among the old school users.
The trick is also not to over-do the nested folders. Just enough to keep it nice and tidy.
Every once in a long while you run into a file that really wants to belong to several folders, and that's what shortcuts are for. Even if the shortcut gets broken you can look at the shortcut file to see what it originally pointed to, and you can probably find it that way.
At home I use the same methodology to archive 30,000 photographs. I can find anything in an instant by expanding folder icons. When that fails, plain old windows search is able to turn up what I am looking for, in those rare instances.
I have always been against anything that 'collects' your files into meta data, such as iTunes, or various photo editing programs. It's a big mess because one day that software won't be around and your files will be a mess.
Even my MP3s are organized by genre/album/1.song.MP3. I just drag album folders or songs into Winamp and I am off and running as my own DJ. I don't use a media organize
Oh no, not another CMS.
I've never seen a CMS that was anywhere near up to date.
The only way to index more than a few dozen documents is to use Enterprise search.
For the really cheap, you can install Google Desktop on the PC that holds the Enormous Shared Drive, and then let people log in via Remote Desktop or VNC and look stuff up. (Is there a Google Desktop API?)
You eventually could have a lot of people making personal indexes of the Enormous Shared Drive with Google Desktop, which is going to cause problems that will motivate you to obtain a real enterprise search package.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Obviously throw them on the desktop. Once it fills, throw them into a New Folder. Once your desktop fills with Folders, throw those in My Documents. Repeat until your computer crashes.
Ginga no Rekshiya Mata Each page.
Basic unix tools can do the trick. find (atime,ctime,etc) mixed with egrep, or just egrep with -R... all sorts of solutions, right at your command line.
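For anyone who would rather script it, here is a rough Python equivalent of find plus egrep -R. It is only a sketch: grep_tree and its parameters are names I made up, it filters on modification time only, and it reads files as plain text, so binary formats such as Word documents would still need a real indexer.

```python
import os
import re
import time


def grep_tree(root, pattern, newer_than_days=None):
    """Yield (path, line_number, line) for lines matching pattern under root.

    If newer_than_days is given, skip files last modified longer ago than that.
    """
    regex = re.compile(pattern)
    cutoff = None
    if newer_than_days is not None:
        cutoff = time.time() - newer_than_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if cutoff is not None and os.path.getmtime(path) < cutoff:
                    continue
                with open(path, "r", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file: skip it and carry on
```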
20th century Marxism is not progress...
I'm pretty sure there are databases that can store and serve up documents based on criteria. Couldn't you set up a centralized web server with an SQL backend that hosts those files for you? You would be able to then keep track of who is using which document and when, and regulate who can do what with different documents as well. As a bonus you should be able to ditch SMB while you're at it and move to a more robust OS for your critical files. Centralizing those documents would also make it dramatically easier to back them up at regular intervals.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
This is not going to help you with your 'finding the right document' problem, but it is essential for your remote offices to be able to open (and save) those documents in a reasonable time. It will also have the added benefit of dramatically reducing your WAN traffic (think 50% reduction). When I initially trialled these, Riverbed was miles ahead of Cisco. That was 2 years ago, but they are still the only one with a remote client and a few other tricks. Well worth the investigation & money.
Yes, I know it's been mentioned before. Yes, I know it's Microsoft. But SharePoint is an excellent document management system. It supports clustering natively, load balancing, search, information rights management, web editing for most Office formats, InfoPath web-integration. Users can also save natively to SP via WEBDAV through Office apps directly, or through Explorer. There's a whole crapload more that you may want to check out at the SP site.
To get yourself organized and imported, there are .NET libraries available for you to natively access SP and manipulate the whole system via scripts. Importing and exporting files is a cinch using these APIs. There are also web services exposed via SOAP that let you do the same thing. And, in the end, there's the actual SQL backend, which is very straightforward, so if you don't want to use the SOAP or SP .NET libraries, you can manipulate the database directly.
So no, you are not locked in. And, the licensing cost is the most reasonable out of all the document management software out there.
Real men use an old TI-99/4A machine with a casette recorder, and files sent via RS-232 connections.
Law firms are experts at managing millions of documents using document management software. If you want state-of-the-art document management, then the software that law firms use is what you're looking for.
What's up with this box everyone has to think inside of or outside of? Why does there have to be a box?
I'm on the IT Applications side of things, not operations so my experience with this has been more as a user than as an admin (though I've helped that group on a few things)...
...but we implemented Documentum and have found it to be slow, difficult to deal with and I've heard no end of horror stories about how hard it was to implement.
In all honesty we had a properly set up sharepoint (tsk!) solution at another company and it pretty much ran itself and did the job we needed it to do. YMMV.
Kneel before Sig!
Simple... use CVS. Documentation is centralized and de-centralized. You have versioning, logs, comments, and on top of all this... it's free.
we use onbase, I like it. AND, one day when a tech was onsite for training, the entire home office was having a day off at Cedar Point. mmmm, rollercoasters.
I personally dealt with an issue like this at the Australian arm of a large international mining equipment manufacturer. I wrote the software solutions mentioned and went on to do my engineering honors project in the area. My first recommendation is: stay away from document management systems. They are bulky, inefficient and tend to lock you into "their way" of doing things. As soon as you want something different, you will find yourself stuck. This is a simple problem; don't make it too hard for yourself.
My solution was multi-layered:
1) Place exactly 1 person in charge.
2) Enforce a naming convention. - Our CAD Drafters and Engineers (of which I did both) were notoriously bad at naming their documents correctly. Most of this was ignorance. Document your naming convention and make it well known.
3) Write or come up with a standardized way of generating document numbers. In my current job as a software engineer I would recommend a simple, incremental numbered approach. Every document, every revision, simply gets a new number. Our engineers did not like this. So we went for a middle ground. Something like XXX-YYY-ZZ.eee Where XXX is the equipment type, YYY is the sub type, ZZ is the revision no, eee is the extension/file type.
4) Standardize the way you store your documents. For instance, make a folder structure such as C:\xxx\yyy\XXX-YYY-ZZ.eee
5) Register ALL documents in a database with location, comments, purpose, revision, author name etc etc.
6) Take the Draftsperson or the Engineer out of the archiving process. I wrote a utility that checks a single "to be archived" folder, fixes obvious mistakes such as using "_" or "." instead of "-" and so on, checks the database to make sure that the document has been registered and then drops the file into the file system. Make the archive read-only for everyone except the person in charge (and any utilities of course).
7) Clean up your existing archive. This can be a semi-automated process. I wrote a utility to do this partially, but it just takes a lot of painstaking effort. With 70,000 documents this was a slow and painful process but it can be done.
8) STICK TO IT. Any exception will erode the system over time making it useless.
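To give an idea of what the utility in step 6 does, here is a minimal Python sketch. It is not the original utility. It assumes three-character type and sub type codes and a two-digit revision, upper-cases the document number (my assumption, not part of the convention above), and uses a plain set named registered as a stand-in for the database lookup in step 5.

```python
import re
from pathlib import PurePath

# Assumed convention from the list above: XXX-YYY-ZZ.eee
# (equipment type, sub type, two-digit revision, extension).
DOC_RE = re.compile(r"^([A-Za-z0-9]{3})-([A-Za-z0-9]{3})-(\d{2})\.(\w+)$")


def normalize(filename):
    """Fix obvious separator mistakes ("_" or "." used instead of "-")."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no extension: {filename}")
    fixed = re.sub(r"[_.\s]+", "-", stem.strip()).upper() + "." + ext.lower()
    if not DOC_RE.match(fixed):
        raise ValueError(f"cannot repair to XXX-YYY-ZZ.eee: {filename}")
    return fixed


def archive_path(root, filename, registered):
    """Return where a document belongs, or raise if it is not registered."""
    name = normalize(filename)
    if name not in registered:
        raise LookupError(f"{name} is not registered in the database")
    equipment, subtype, _revision, _ext = DOC_RE.match(name).groups()
    # Folder structure from step 4: root\xxx\yyy\XXX-YYY-ZZ.eee
    return PurePath(root) / equipment.lower() / subtype.lower() / name
```

Anything that cannot be repaired or is not registered raises instead of being filed, so it stays in the "to be archived" folder for the person in charge to look at.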
There are a ton of document management systems out there; our company uses http://www.opentext.com/ (look for DM). You can use Microsoft SharePoint as a document management system, but it is not really what it was designed for. DM will integrate with all the Microsoft applications. It will give you document numbers, version numbers, etc... you can profile your emails as well if you want. We have had some performance problems for the remote locations, but it is still usable. I did a search for open source document management systems on Google and there are a ton out there if you don't feel like paying for something.
Curious about Storage and Virtualization? Check out
Manipulating the SQL backend is a pretty bad idea. It's not quite -THAT- straightforward, since a lot of the elements end up crunched into one table as XML, so you have to be careful with that. Things are pretty duplicated and it's not supported, plus it changes drastically between versions, making migrations difficult.
WebDav however is indeed the way to go (for documents), especially since Vista lets you map a webdav folder as a drive (letter), and Linux has tools to mount them like any other volume, too. Good stuff.
or Confluence Hosted: http://www.atlassian.com/software/confluence/hosted/
I've worked with Oracle UCM (formerly Stellent) for a few years now and would thoroughly recommend it. It's scalable into (at least) the 10s of billions of documents. A single repository for Doc Management, Records, Web Content Management, workflow, imaging. It comes with security, library services, metadata, and search OOTB. Using the WCM, you can make your documents available on an intranet, extranet or internet site, according to specified security policies.
;-)
BTW... offices on satellites... that's so cool!
My other account has mod points!
We're an old engineering company, and our products last decades, so we need to keep lots of records.
Recently, we started scanning old documents (a warehouse full of them) to make room for expansion.
It is a very tedious process, because we can't risk shredding the old files unless we know for sure that the scans are correct. Anyway, for storage, we decided to go for an in-house web-based system (someone developed it for us) that is quite basic, and does two important things for us:
1- it references the file in its location, rather than storing the file in a database and copying it to the webserver
2- gives us the ability to change metadata (the document indexes) as we find errors in them
Referencing a file in its "physical" location gives us two layers of access control: one through the database permissions, and the other through file system permissions. This is important for restricted files...
Obviously, searching is the important part, and indexing is absolutely critical and the most time-consuming process.
Someone suggested a Google appliance to us, but none of the scanned documents can be searched. They are all images.
The actual application is a pretty basic concept (nice interface features, but the concept is simple):
1- A database to hold the info
2- a table per document type containing the metadata and the filename and filepath
3- a web interface to search and re-search to narrow down the list.
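The three pieces above can be sketched in a few lines of Python with sqlite3. This is not our actual application, just an illustration: the table name drawings, its columns and the function names are all invented, and a real system would have one such table per document type with a web page in front of search().

```python
import sqlite3


def open_index(path=":memory:"):
    """Create the index; one table per document type (here: drawings)."""
    db = sqlite3.connect(path)
    db.execute(
        """CREATE TABLE IF NOT EXISTS drawings (
               id INTEGER PRIMARY KEY,
               title TEXT, project TEXT, year INTEGER,
               filename TEXT NOT NULL,
               filepath TEXT NOT NULL   -- the file stays where it is on disk
           )"""
    )
    return db


def add_drawing(db, title, project, year, filename, filepath):
    """Index one scanned file by reference; the file itself is not copied."""
    db.execute(
        "INSERT INTO drawings (title, project, year, filename, filepath) "
        "VALUES (?, ?, ?, ?, ?)",
        (title, project, year, filename, filepath),
    )


def search(db, **criteria):
    """Return (filepath, filename) rows matching all given criteria.

    Calling it again with one more criterion narrows the previous result.
    """
    allowed = {"title", "project", "year"}
    clauses, values = [], []
    for column, value in criteria.items():
        if column not in allowed:
            raise ValueError(f"unknown column: {column}")
        if column == "year":
            clauses.append("year = ?")
            values.append(value)
        else:
            clauses.append(f"{column} LIKE ?")
            values.append(f"%{value}%")
    where = " AND ".join(clauses) or "1"
    return db.execute(
        f"SELECT filepath, filename FROM drawings WHERE {where} ORDER BY id",
        values,
    ).fetchall()
```

Because only the path is stored, file system permissions still apply when the user actually opens the file.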
I'm sure it's been said by now, but you really should be looking at a content management system. There are several vendors out there that sell various types of document control systems; Pilgrim, Master Control, I'm sure Oracle has something that does that. There are also open source frameworks that you can develop in-house like Drupal. All of those are online document management systems. Users upload documents to them. File naming conventions can be enforced as well as directory structure etc. Many of them allow for document collaboration and approval. It's a complex problem, and a valuable solution will take some serious thought and time. I've heard some people use google documents, but for a company of your size I wouldn't recommend it. In any case, folders on network drives are NOT the answer.
Google appliance.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
I work at a midwestern public university in the USA, and we've been using this program for several years and a few versions. Backend can work on AIX, Linux, or Windows, and the frontend at least Windows (don't know if Macs or *nix are supported, we don't have many of those on users' desks). We probably have several gigs of imaged documents in this system, and it seems to work pretty well.
You'll have to import all the documents into the system, of course. The company recommends certain tractor-feed scanners for this; lighter-duty ones are USB, heavier are SCSI. I think it also has a software printer emulator to let you dump e.g. Word documents into the system; how you organize things is up to you.
Hail Eris, full of mischief...
E pluribus sanguinem
Whatever the solution, you have to get staff to declare what it is on the front end. It's not all about the technology. I see some of the benefits of Sharepoint, but depending on your audience (tech-savvy or not) it may become a training issue. Prepare for change management.
What I like about Sharepoint is the Office integration, the improvements over the last few years, document history (versions), and mostly, the ability to require metadata. If you have a taxonomy of topics, it will make it much easier to create a search appliance that can find what people are looking for. You may be forced to look at auto-classification if you can't get staff to do it, or hire knowledge managers (librarians) to properly catalogue. Trouble for us is getting to agreed-upon taxonomies and hierarchies across divisions (I'm in the knowledge management trenches here).
A good way to start might be Sharepoint repositories, require a topic field, seed it with however many topics you can come up with, and leave an OTHER field so you can collect what you have not organized. If you analyze what comes into the OTHER topic, you may keep adding new topics.
Find the logical buckets to start the search in before they think about searching, too. If your staff only care about one project at a time, break it up into project searches. Basically, offer them one level of selection before they get to search. It may make things easier (if you are structured that way). They may look for something from a particular function: Marketing search vs. Operations search.
Also, sharepoint can leverage active directory info, so you may be able to get some metadata automation (Docs from sales staff vs. R&D, etc.)
Hope these points help. Contact me if you need more.
Jim - your name is Jim...
I implemented swish-e, http://swish-e.org/ for a client with html and .pdf indexing (nightly) in 11 hours from a standing start (never used swish-e before).
Verbum caro factum est
Hi there, I am one of the developers of a nice web tool which was in fact designed to meet the requirements you describe. We are calling it anydata, but dunno if we'll need to change its name as it's a registered trademark; at least you see our goal ;)
:D
http://devel.anydata.tv/
Try it out with Firefox if you don't want to see something ugly right now. It's a beta, but in less than 1 month you will see it complete. It looks like a file manager, a pretty well-known user interface for browsing documents and information. This system enables you to store files, bookmarks, text notes, contacts and soon pgp'ed passwords for secure sharing across system administrators.
In short, keeps the 'tree-browsing' typical schema of filesystems plus generating and showing previews of documents, tagging, automatic keyword gathering from documents and a search engine.
By the way, it's GPL
Anyone interested just send me an email to kenneth at gnun d-o-t net and I'll give you a testing user or whatever needed.
Cheers!
Kenneth
OK, so it is a bit hard to get your documents out once you put them in to this system, but man, does it tidy up a mess of documents.
-ted
How about Desksite (formerly iManage) or PC Docs?
You could set up a Document Management System like Alfresco or god-forbid, Sharepoint. Or you could run OS X Server and let Spotlight index everything.
I've got this car, and it doesn't run and it's got all these strange bits inside under this hood thingie. . . . Hire a librarian or someone with a degree in knowledge management who has experience in the corp world.
First, you're potentially trying to solve more than one problem here: slowness, and naming convention. I'm guessing they're somewhat related (large directory listings due to lack of organization), but there might be a deeper infrastructure issue that needs to be dealt with, too.
As for organizing files, you need a naming convention for your project files, first and foremost. Throwing a bunch of disparate files at a CMS is going to do nothing but complicate things more (from a sane-management perspective).
Data categorization is key. You need to figure out a way to organize it in a fashion which is both contextual to how people use it as well as how it relates to the other data (in, say, a project).
For instance, you will want (at a minimum) the equivalent of user-level and group-level data shares. This would, in all likelihood, get kind of tricky with shifting working groups. For this there are multiple ways to use ACLs (as opposed to just user/group/all permissions) within Samba (with or without shackling the machine to a Windows domain/authentication server). ext3 and XFS both have the ability to use ACLs (XFS natively), last I checked. Ultimately, this would probably be better than just using user/group, as it would be more extensible.
As for a Solution...
Something to look into that is specific to Samba is the "veto files" directive for smb.conf. It is per-share. I am uncertain whether it supports regex (it didn't in early 2005 when I last used it); if it did, it could be very useful for enforcing a specific namespace (going forward).
I would recommend "enforcing" namespace. While this is likely a self-created problem (ie you or your predecessor did not set things up properly in the first place), you really need to push to your users the importance of this. You need to tell them "organize your files, it'll make things faster" if there's any bitching.
There was an article in LinuxMagazine a while ago about determining the age of data. Utilizing this in some sort of auto-sort script to move "old" data to a "pre$date" directory within the original messy directory might speed things up. Also, archiving (or at least moving it to an "old shit" directory) past, unused data is important. It eases the "human element" of data organization.
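A sketch of such an auto-sort script in Python. I don't remember how the article measured age, so this simply assumes last-modified time; plan_archive is a made-up name, and it only lists the moves so you can review them before anything is touched.

```python
import os
from datetime import date, datetime


def plan_archive(directory, cutoff):
    """List (source, destination) moves for files last modified before cutoff.

    Nothing is moved here; review the plan, then apply it with shutil.move.
    """
    target = os.path.join(directory, f"pre{cutoff:%Y-%m-%d}")
    moves = []
    for entry in sorted(os.scandir(directory), key=lambda e: e.name):
        if not entry.is_file():
            continue  # leave sub-directories (including earlier pre* ones) alone
        modified = datetime.fromtimestamp(entry.stat().st_mtime).date()
        if modified < cutoff:
            moves.append((entry.path, os.path.join(target, entry.name)))
    return moves
```

Keep in mind that a backup or copy that does not preserve timestamps will make old files look new, so check a sample of the plan by hand.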
Projects should all have a reference number (because there is, in all certainty, hard paper associated with the projects, and sometimes you need to cross reference). Keeping this consistent is important. Use what works, keep it short/demarked so users don't avoid using them. I like each project folder to have the project number to relate to contract/etc. start (short) date (eg. 080112 for Jan 12th, '08) followed by a 2-3 digit number (depending on how many projects are started per day) followed by major revision. End result: something like "080112.01.a Jennings Construction" Or organize by client ID. Or something.
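If you want scripts to build and check those folder names, something like the following Python sketch would do. The function names are mine, and it assumes a single-letter major revision and a two-digit daily number by default (pass width=3 for busier shops).

```python
import re
from datetime import date, datetime

# Assumed layout: YYMMDD.NN.r Client Name, e.g. "080112.01.a Jennings Construction"
FOLDER_RE = re.compile(r"^(\d{6})\.(\d{2,3})\.([a-z]) (.+)$")


def project_folder(start, sequence, revision, client, width=2):
    """Build a project folder name from start date, daily number and revision."""
    if not 1 <= sequence < 10 ** width:
        raise ValueError(f"sequence must fit in {width} digits")
    if len(revision) != 1 or not revision.isalpha():
        raise ValueError("revision must be a single letter")
    return f"{start:%y%m%d}.{sequence:0{width}d}.{revision.lower()} {client.strip()}"


def parse_project_folder(name):
    """Split a folder name back into (start date, sequence, revision, client)."""
    match = FOLDER_RE.match(name)
    if match is None:
        raise ValueError(f"not a project folder name: {name}")
    stamp, sequence, revision, client = match.groups()
    start = datetime.strptime(stamp, "%y%m%d").date()
    return start, int(sequence), revision, client
```

The parser is what makes cross-referencing against the paper files practical: a folder that does not parse is a folder somebody named by hand.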
Requiring and/or encouraging project naming conventions through the managers (at the behest of your manager/CIO/whomever, or just pleading) might also be worth a try. One department out of 5 doing it would be better than none.
IMO, once you've reached this step, you can consider putting it in a CMS to help perpetuate/encourage the organization. But remember that a CMS is not a panacea, and might even complicate things further (ie, instead of navigating to a file, -everyone- just searches the whole index, slowing things down further).
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
A document management system is a must for that many documents. Check out Alfresco. It's open source and as such isn't outrageously expensive like its competitors. If setup seems too daunting for you, check out tsgrp.com. Technology Services Group is a consulting firm in Chicago with experience working with Alfresco and may be able to make this transition easier for you.
Or DMS. Commercial packages include Docs Open, and Soft Solutions.
Open Source DMS = http://mydms.sourceforge.net/
It sounds like they're heading for an epic fail. Aerospace == Process + CMS. They will never survive the NTSB audits and safety Nazis without both. They will need to prove the change trail for every nut/bolt/software path/data item/paper clip and who authorised/designed/checked/tested it for the rest of their natural lives. So if they don't have Process + CMS, they are screwed beyond belief. To me it sounds like a medium-sized software house that's decided to switch to aerospace because it's cool or high tech or the marketing guys sold some product.
http://www.nasw.org/users/nbauman/txtsrch.htm
http://www.nasw.org/users/nbauman/lawdb.htm
http://www.nasw.org/users/nbauman/discover.htm
They were imaging and indexing up to several million documents. During a civil suit, in discovery, companies on each side of the lawsuit have to disclose every relevant document to each other.
Lawyers probably use the most flexible and all-encompassing systems, since they have to deal with every industry, every profession, everything. They also spend more money on their systems than most people can afford. They told me it costs them about $1 a page to thoroughly index big databases.
Information scientists told me the best model of a document database was PubMed, which indexes virtually every significant published medical article. http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed
The big limitation of Google is that you can't search too well by date. Another limitation of text searches is that you can't search for concepts -- just words. Sometimes words (particularly names) match concepts very well, but if they don't, you've got a problem.
Yeah, it would have been nice if you had set up coding and naming conventions at the beginning, so the original authors could have sorted them as you went along. It may be difficult or impossible to go back and re-code them after the fact. It could wind up costing $1 a document. OTOH, you could be lucky -- some industries have been using standardized filing schemes and standardized jargon since the days of slide rules and T-squares.
There should be standard filing schemes and procedures throughout your industry, so your solutions may be industry-specific. There should be consultants that deal with your industry who would be happy to talk to you (for the prospect of maybe getting your business). There should be trade magazines in your industry that have covered the same issue for companies of your size. (Hell, if the price is right I'll write a roundup for them.) Or you might have a trade or professional association with some friendly people who have done it before. Trade and professional associations usually have a computer or information technology section, and if you're a member of the association, you can call up the members of the section.
everyone is talking about document management software and search appliances. You're going about it all wrong...
Hire a document management staff.
Librarians. Hot librarians.
Back in the 90's I helped create a media department for a large textbook publisher. One of the first projects was an asset library and tracking system. To keep this message brief: we first needed a naming convention. Look for a constant throughout your products; ours was ISBN numbers. That became the main identity of the product/project and its main digital folder. Every item or product was dropped in a sub folder such as images, design, text, etc. From here the main folders were always scanned by Portfolio, and it was told/programmed that the main descriptions should come from the folder names. This allowed anyone with knowledge of the product ISBN to find details on the project. It also greatly minimized needless keyboarding of metadata onto the files. Portfolio then allows check in and check out (versioning) to stay abreast of any edits or updates. The whole metadata catalog would also be exported and brought into Filemaker for secondary backup. Look to a constant for the naming convention, keep it simple, look at ways to minimize keyboarding metadata, and go over the counter (off-the-shelf products are much easier to work with and you can experiment; they are also more than capable of handling 100K documents). Last, good luck, and if needed look for help.
Why not use subversion? Files will be accessible using a subversion client (including log + history), as webdav (only current version) and through a standard browser (read-only).
The company I work for uses a system called Document Locator. It is a Windows-shell integrated document management system. Basically, it is what you would get if you took Subversion and gave yourself extremely fine-grained control of repositories, folders and the like. It scales decently, too -- we have millions of documents spread across 25 major repositories, many of which include AutoCAD, Bentley Microstation, Smartplant 3D and other sizable files. The system is also fairly extensible, as we've built quite a few internal applications off of the DL system and there are plenty of third-party plug-ins available (a notable one being Brava, an application that allows adding QC and other markup to repository files). And if you don't want to be constrained to Windows, there is a web client available, which works decently. While it is not without its problems, the overall experience has been pretty good.
Full disclosure: My company is ColumbiaSoft's largest customer and, as such, we know a good many of the development team.
My UID is a prime number. Yeah, I planned that.
My last company relied on a program called isys to index and search documents and email. You don't have to worry about what a document is named, just the type of content you're looking for. This solution can save a lot of time, especially if your users are good at phrasing queries. On the other hand, I did not have to maintain it, so I have no idea how much administration time was devoted to keeping it working.
Make love, not reality television.
One of your options is to use Salesforce Content, which is a very usable content & collaboration piece from salesforce.com. It's fully wired in to the rest of the force.com platform and CRM apps suite too, so if you're looking to build out more of your company's apps in the cloud, it's worth taking a look at it. http://www.salesforce.com/crm/marketing-automation/document-content-management/
After looking at backup systems and maintaining libraries of data
our company found that we needed something that fit our needs.
We designed a system that worked and knuckled down to programming it.
We now have a searchable database of documents and files with attributes
as well as context from content for over 20 years of data and documents.
We can pretty much find any file in less than 5 minutes.
We could still make it better but we sure couldn't have done anything like
it C.O.T.S., Google included.
If Google failed tomorrow, where would your documents be then?
establish a naming convention. come up with a few simple rules regarding:
file names
directory names
customer names
job/project names
department names
limit the number of total allowable characters in a file name, and publish and distribute your rules in an easy to follow cheatsheet. for example:
all files for client "Smith Inc." reside in a directory named "SI"
all files for Smith Inc for project "Widget X" reside in a subdirectory named "WX"
all files for Smith Inc for project Widget X have a unique number generated by your accounting system
all files generated by the sales department need to have "S" after the project number
enforce using file name extensions for all file types
so a powerpoint deck created by the sales department for a sales pitch to smith inc for project Widget X with an internal job number of 1234 would be named "SIWX1234_S.ppt".
a well structured naming convention with simple but rigid rules will allow users to navigate a file system to find files and identify wrongly filed assets.
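the rules above are rigid enough to check by machine. here is a minimal Python sketch, assuming hypothetical lookup tables for the client, project and department codes and a made-up 32 character limit (the rules above only say to pick one). it builds a name from its parts and flags names that break the convention:

```python
import re

# Hypothetical lookup tables; real codes would come from your published cheatsheet.
CLIENT_CODES = {"Smith Inc.": "SI"}
PROJECT_CODES = {"Widget X": "WX"}
DEPARTMENT_CODES = {"sales": "S"}

MAX_NAME_LENGTH = 32  # assumed limit, not taken from the post

# client code, project code, job number, "_", department code, mandatory extension
NAME_PATTERN = re.compile(r"^[A-Z]{2}[A-Z]{2}\d+_[A-Z]\.[a-z0-9]+$")


def build_name(client, project, job_number, department, extension):
    """Build a file name such as SIWX1234_S.ppt from its parts."""
    name = "{}{}{}_{}.{}".format(
        CLIENT_CODES[client],
        PROJECT_CODES[project],
        job_number,
        DEPARTMENT_CODES[department],
        extension.lower().lstrip("."),
    )
    if len(name) > MAX_NAME_LENGTH:
        raise ValueError("file name too long: " + name)
    return name


def is_valid_name(name):
    """Return True if a file name follows the convention."""
    return len(name) <= MAX_NAME_LENGTH and bool(NAME_PATTERN.match(name))
```

running is_valid_name over an existing share is one way to produce a list of wrongly named files for cleanup.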
invest in a digital asset management system with a database backend.
there are many DAM systems available both commercially and opensource.
utilize one that has a web front end, so you can enforce consistency in end user experience (as opposed to a fat client).
embed metadata into the files themselves in XML format thru the DAM if possible.
based on the naming convention you've established and the DAM system you've deployed, you should be able to track when a file was created, modified, and last accessed. establish rules regarding when a file moves from disk to tape, and from online tape(in jukebox) to offline tape(out of jukebox), to cold storage(offsite).
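the disk-to-tape rules can be as simple as a table of age limits. a rough Python sketch, where the thresholds (1, 3 and 7 years since last access) are purely hypothetical placeholders for whatever your own retention rules say:

```python
from datetime import datetime, timedelta

# Hypothetical thresholds; substitute your own retention rules.
TIERS = [
    (timedelta(days=365), "disk"),
    (timedelta(days=3 * 365), "online tape"),
    (timedelta(days=7 * 365), "offline tape"),
]
FINAL_TIER = "cold storage"


def storage_tier(last_accessed, now):
    """Pick a storage tier from how long ago a file was last accessed."""
    age = now - last_accessed
    for limit, tier in TIERS:
        if age < limit:
            return tier
    return FINAL_TIER
```

the last-accessed timestamp would come from the DAM or the file system; this sketch only shows the decision itself.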
three can keep a secret, if two are dead - benjamin franklin
I know that I'll probably get verbally lynched for saying this here, but MOSS 2007 enterprise search is a REALLY nice way of dealing with this. Since MOSS can index your file shares, then all of your users can search for documents contextually using a simple web portal across multiple sites... I better leave before I'm hanging from the Slashdot tree.
I decommissioned a document management system at my client, a smallish law firm, because the system was too complicated, insecure, and expensive. Updating it to run w/ the latest version of MS-Office would have cost thousand$ just for the s/w. We replaced it with Google Search, and we defined a file hierarchy and naming convention for all documents created after the switchover. Client is very happy, their file access is more efficient, and they saved a bundle of money on administration, not to mention all the h/w and s/w they never bought.
Obviously documents are the lifeblood of any law firm. These guys only have about 100,000 or so, less than the aerospace company in question, but the lesson applies. It's extremely unlikely the IT admin of the aerospace company has the resources to manage, much less install, a proprietary document management system.
The ONLY reason to have a formal document management system with a database (like Microsoft SQL *ugh*) is to control access. But access control is something that really, really should be done through the directory. So unless you're NASA or another organization with many, many millions of documents and a legally mandated auditing requirement, there's no reason to make this more complicated than necessary. And even then....
Of course, if we're talking about images with no searchable text, that's another story.
-- "The only thing that is ever new in the world is the history you do not know." -- Harry Truman
I don't at all mean to be pat or facetious with such a short answer. But, seriously, you're asking the wrong crowd. Librarians have master's degrees in answering just the question you're asking and it goes far beyond just books. A couple of dozen hours of consulting contract with a good librarian can set you straight - whether you keep the samba store or you pony up for document management software. Because if you have a strategy for organizing your information and execute on it you will reap benefits that don't show up on any productivity spreadsheet. And a good librarian will tailor the system to how the people in your organization actually use the information. Get an internship program going with a library school to have someone remotely do the cleaning and maintenance every once in a while. Whole thing should be doable for a few grand.
You need to deal with this issue on multiple points
1. Consider PDF with OCR. That way you can search within files for specific words
2. File naming. Use a standard like date_headline.pdf
3. Hire a library sciences major, as an earlier poster suggested. They spend years studying how to organize and retrieve.
Outside of shoring up your connectivity to the remote site, you should use the structure of your company to your advantage.
It sounds like the wild west. You gave everyone full RW access to the fileserver.
Build a file structure that mirrors the organization of the company and apply permissions appropriately.
Map drives in the same fashion. An added advantage to this, you can split the files across separate Samba servers later with a minor map change.
The finance department has no reason digging around in your design documents.
The engineers don't have any reason to poke around in your sales collateral.
Does everyone in the company need to be tempted to open "DOD_GPS_NOYB_47-090611.xls"?
Getting every employee to adhere to a single naming convention is like herding cats. Delegate responsibility to the directors and managers to keep their areas on the server organized to their own needs. Then you just need to deal with the occasional outlaw.
You may also want to deploy Samba servers to the local offices and back them up to a central server regularly. Use this for personal shares and anything that is primarily used ONLY in the local office.
In most cases, I doubt that "the single person" working on Project X at Remote Site A needs to work off of a centralized copy of their document. Do you really need to share this document across your entire organization? Let the employee keep their file on the local office's share. Let an employee or a manager share it with the entire department. Let the director share it with sales.
In the end, you may find one small part of the organization that REALLY needs a naming or numbering convention. You can address that when they approach you. For now, you need to stop everyone from treating the company share like their own desktop.
It's called a "database". You might want to look into it.
As a PC user, I have found one of the best products to manage hundreds of thousands of documents (*.doc, *.txt, *.wpd, *.xls, *.ppt, and email, images, etc.) is Isys by Odyssey. It requires very little work on the part of the endusers. Just searching. For the IT person, it requires very little to be up-and-running. You can set up automatic indexing to run anytime, without restricting usage and searching. This can be done across all hard drives. I found this little company (and their software) about 15 years ago when I was still using DOS. They have, of course, developed their software to match all the Windows versions that have come out, and have Web versions also. I manage a huge library of both physical and digital documents - all that must be located within seconds. Without this software, I would not be able to perform this job in the high-level capacity that I currently do. Yes, Google is a great contender, but it has its limitations. Google desktop, for example, does not index all different types of software that the hundreds of users may have/use/need. I have found the Isys by Odyssey to not only be extremely fast, high quality, but they have great customer service, and their prices are reasonable. You can always start slow - with a low number of licenses, and work your way up, depending on the company's finances and needs. We have 2 licenses, where I work. I currently am the main end-user to the product, and people request documents or information from me, which I can find and email to them in an instant. It's worth the time to check them out. Their home web page is: http://www.isys-search.com/
The content engines like IBM/FileNet are set up to manage millions of documents. Many also have the ability to add remote cache servers to improve local performance for repeat document access in satellite offices. Contact Dave at Softech-assoc.com if you need help.
Search engine
Move sig!
TRIM also has good Sharepoint integration if you're so inclined.
Some of the suggestions above say that you should just chuck everything haphazardly into a big pile and then use search engines to trawl the whole mess. I don't buy that. Instead, (like some others) I'd suggest a proper content management system such as the ones from http://www.alfresco.com/, http://www.interwoven.com/ or http://www.hummingbird.com/.
The reason for this suggestion is that I know that these systems are being used by organisations which handle, as OP said, hundreds of thousands of documents and which have satellite offices (e.g. large multinational law firms). They provide several benefits, such as the ability to structure projects, to save and index both project-related documents and e-mails in the project folders, to search, and to keep proper document version chains (meaning that you can revert to older versions of documents if some klutz breaks a newer version).
Of course, this means quite an investment, a learning curve for everyone at your company and, most likely, the hiring of an individual with experience of the chosen system.
Firstly, you can absolutely forget about any system that requires users to name documents in a way that is descriptive, consistent, unique or anything else that a sane person would do.
Secondly, MacOS X Spotlight Server (as of version 10.5.7) doesn't work as one would expect/hope. Users' files stored on the server get indexed by the server but this index can only be read by users logged in to the server console (or via ssh), not clients that access the files by mounting them as shared volumes. If a client wishes to search the files, it must build its own index over the network. The workload on the server/network can cause severe performance issues until the clients have built their indexes, a process that will take hours and may take days to complete if you have a lot of files.
Namgge
Dark Green have just this week gone live with Gina2, a web solution for document archives.
Have a look at http://www.gina2.net/ - the text is currently in German, but the English translation will be up there in the next couple of weeks.
Dark Green are offering Gina2 as a hosted service for companies whose core business is not managing IT infrastructure.
I work for a company that stores terabytes of documents. There are two products that do this well: EMC's Documentum and Microsoft SharePoint. Pick your poison depending on whom you want to abuse you.
I've written a recruitment app that has 200k resumes and other types of folder indexed in text.
The files live on the disk in /TYPE/YEAR/MONTH and are converted to text and inserted into MySQL database.
They can be searched on name, date, record id, free text, type, etc, etc; or just browsed to on disk.
The front end is PHP on MySQL.
These were imported from a files-on-disk approach.
It can scale with master slave replication, etc. Just keeping it simple helps.
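To give an idea of how little is needed, here is a minimal sketch of the same layout in Python. It is not the PHP/MySQL app described above: it assumes the text has already been extracted, uses the sqlite3 module as a stand-in for MySQL, and the table and function names are made up for illustration.

```python
import sqlite3
from pathlib import PurePosixPath


def open_index():
    """Create an in-memory index; a real deployment would use a server database."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY, path TEXT UNIQUE,"
        " doc_type TEXT, year INTEGER, month INTEGER, body TEXT)"
    )
    return db


def add_document(db, path, body):
    """Index one file whose path looks like /TYPE/YEAR/MONTH/name."""
    parts = PurePosixPath(path).parts  # ('/', TYPE, YEAR, MONTH, name)
    doc_type, year, month = parts[1], int(parts[2]), int(parts[3])
    db.execute(
        "INSERT INTO documents (path, doc_type, year, month, body)"
        " VALUES (?, ?, ?, ?, ?)",
        (path, doc_type, year, month, body),
    )


def search(db, text, doc_type=None):
    """Return paths whose extracted text contains the given phrase."""
    # LIKE is a plain substring scan; a full-text index would replace it at scale.
    sql = "SELECT path FROM documents WHERE body LIKE ?"
    args = ["%" + text + "%"]
    if doc_type is not None:
        sql += " AND doc_type = ?"
        args.append(doc_type)
    return [row[0] for row in db.execute(sql + " ORDER BY path", args)]
```

Because the path is stored alongside the text, anything found by search can still be browsed to on disk.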
Go to Google main page and look for business solutions. They have a scheme where they'll charge you x dollars to index y hundred thousand documents, and they throw in the tinware (a custom pre-configured rack of search hardware, very scaleable) for you to plug into your LAN. All strictly inside your firewall. Set it up to crawl all your file shares and it won't matter whether you have a document management system or not. Most document management systems depend on keywords, taxonomies and special file name codes, all of which are decidedly old-hat. Index it and let 'em go search. The smallest version is kind of basic, but go up one level and they'll crawl pdf's, word docs, pretty much anything with text in it compressed or in source libraries or whatnot. They're pretty good. Not cheap, but then you're an aerospace firm...
Do not mock my vision of impractical footwear
Google is just a search engine. They need document management. :-) Correct me if it is not the thing called content management they need?
Import it into some CMS, sort it and make it available through a password-secured website. We did something like this for http://www.olympus-ims.com/ (but these are public documents) and it really contains thousands of documents (in a dozen languages); together with all the document revisions it comes to over a hundred thousand documents. Easy to search, easy to navigate, easy to manage.
Simply: CMS is what you need. Do research.
Well, I've got to get back to work. When I stop rowing, the slave ship just goes in circles.
Several years ago I worked for a NASA project called the National Technology Transfer Center. A big part of the job there is organizing and searching through tens of thousands of pages of research documents. They used a document oriented database at the time although they may have migrated to something else since then. You might want to contact them for advice.
A friend of mine was the person primarily responsible for scanning in the documents. IIRC, the process involved OCR of the scans for key word search and indexing and then storing a compressible graphic image of the page - this got them around the problem of text databases not storing technical drawings, etc.
Time's fun when you're having flies. - Kermit the Frog
Samba shared over a VPN? Man, you are asking for no end of painful trouble. There are many good ways of sharing docs, but putting MS docs in a filesystem shared over a VPN is not one of them. A simple way to improve things would be to drop all the filesystem sharing and create some sort of searchable index on a web server. If you want more sophistication and have money to burn (who hasn't these days?), go and talk to Oracle, they have some very good software for this very purpose.
I don't know why companies always do it this way - it is the worst possible way of organizing your documents. When you put them in a filesystem, people have to try to remember how to find the one they need; a directory is like a hierarchical database, badly implemented. Sharing it via a networked filesystem makes it even worse, because now you have a huge network overhead and the risk of undetectable corruption when the network stumbles. And the VPN means that your network traffic is something like 10 times as heavy because of the encryption.
The latest installment of Visual Source Safe is pretty good, they improved the performance over the network which used to kill on a domain spread across multiple cities (back during vb6 days), but it is now a really good repository tool. I also used another , but it lacked the history/detail section and could only keep a max number of files....seeing as you have hundreds of thousands
Smeadsoft might work for you. -- note: I don't work for them or any affiliate of theirs, and have no vested interest in them being used --
I'm in the process of setting up one of their systems for document management, it seems to be quite capable of that. It's not open source and it would involve some cash to set it up, but I think it's worth looking into if those two things don't eliminate it from consideration. (they also handle management of physical files, which is where they came from)... Thus far setup involves setting up a lot of framework and tags for the actual documents, and scanning a lot of physical files to be stored. There is this system of using large scanners with something called VRS, and putting barcode identifier sheets with stacks of documents.
So for example you could have a large stack of papers, of which half belong to one category (or subcat or subsubetc), the others to a second. You put barcode sheet (a blank paper save for one barcode) for the first category, then all those papers, then a barcode sheet for the next category, and so on. You load them into the scanner (obviously a high capacity one) and it reads them all and puts the scanned documents into the proper location in the database automatically.
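The filing logic behind the separator sheets is simple enough to sketch. This is not how Smeadsoft or VRS actually implement it; it is a plain Python illustration that assumes the scanner software has already decoded each page into a (barcode, content) pair, with made-up category codes:

```python
def file_scanned_pages(pages):
    """Group a scanned stack into categories using barcode separator sheets.

    Each page is a (barcode, content) pair. A separator sheet carries a
    barcode and no content; an ordinary page has barcode None.
    """
    filed = {}
    current = None
    for barcode, content in pages:
        if barcode is not None:
            # Separator sheet: switch category, do not store the sheet itself.
            current = barcode
            filed.setdefault(current, [])
        elif current is None:
            raise ValueError("page found before the first separator sheet")
        else:
            filed[current].append(content)
    return filed
```

A stack of one "INVOICES" sheet, two pages, one "CONTRACTS" sheet and one page comes back as two invoice pages and one contract page, each under its category.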
"Waste not one watt!" - CZ
Don't try to use the file name or directory structure. This is difficult to adapt or relocate as the namespace becomes distorted from its original content over time.
Try this instead:
Assign arbitrary file names.
Adopt a directory structure derivable from those names, if you must.
Build a database of several tables to link keywords, project names, authors, etc, to the arbitrary file names.
Award small prizes for verified corrections to the database.
See http://en.wikipedia.org/wiki/Content-addressable_storage for more information.
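A minimal sketch of the first three steps in Python, under a few assumptions of my own: the arbitrary name is a SHA-256 digest of the file content (one way to do it, as in the linked article), the directory is taken from the first four hex characters, and a single sqlite3 table stands in for the "several tables". All function and table names are hypothetical.

```python
import hashlib
import sqlite3


def arbitrary_name(content):
    """Derive a file name from the content itself (a SHA-256 hex digest)."""
    return hashlib.sha256(content).hexdigest()


def directory_for(name):
    """Derive the directory from the name: first two and next two hex characters."""
    return name[:2] + "/" + name[2:4]


def open_catalog():
    """Create an in-memory catalog linking tags to stored file names."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE tags (name TEXT, kind TEXT, value TEXT)")
    return db


def tag(db, name, kind, value):
    """Link a keyword, project name, author, etc. to a stored file."""
    db.execute("INSERT INTO tags VALUES (?, ?, ?)", (name, kind, value))


def find(db, kind, value):
    """Return the names of all files carrying the given tag."""
    rows = db.execute(
        "SELECT DISTINCT name FROM tags WHERE kind = ? AND value = ? ORDER BY name",
        (kind, value),
    )
    return [row[0] for row in rows]
```

Since the name never encodes a project or author, renaming a project only touches the database, not the files.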
Teamcenter. It freaking rules. Also, as evil as StarTeam is, it will do the job for you as well.
I have been a user/admin of both Teamcenter and StarTeam.
We use EMC's Documentum suite here to manage our large volumes of documents. Expensive, but works great...and integrates with Fax software, MS Exchange, etc.
Similar to CamelCase. Limits the number of variations on the same name considerably (no: camelcase, Camelcase, Camel case, Cam El Case,...)
Reminds me of the command 'passwd' in *nix, I always have to 'apropos password' to find the correct spelling. Why is it not 'password' or 'psswrd'? Arbitrarily dropping 1 vowel and 1 consonant is silly.
What you need is a content management system. Such systems do more than store and find documents. They allow true document taxonomy management, records management for compliance and control, and many other features.
I personally specialize in IBM Content Manager. It's great for companies like yours where you have distributed offices. You can keep your metadata at one central location but have the documents themselves stored at your remote locations, all while maintaining centralized control.
Doug Hansknecht
Certified IT Architect
DougFromOhio@us.ibm.com
To help you with the challenge of sharing documents with your remote sites, there are universal web-based document viewers on the market that you can use to embed document viewing capability into your intranet or web site. The documents can be of different file formats too, they don't all need to be PDF. Some options use Adobe Flash, so a plug-in needs to be downloaded by the end user, but other options do not. Adeptol and Vuzit are two examples, but if you search for "online document viewer" in Google you'll find a number of options.
I'm surprised that there were quite a few programs not mentioned on the DMS Wikipedia page -- People might consider them to be more as repository software than DMS (or RMS), but some other ones to mention that would be useful for managing already existing documents:
And if you're looking for librarians with an IT background, in the libraries they're called "Systems Librarians". You might also check out the oss4lib and code4lib communities.
Build it, and they will come^Hplain.
It seems that the first response to a request like this is to suggest new technologies and programs to solve the problem. It sounds, though, like 95% of the problem is that there are no procedures and organization in place already so that files have a purpose or place to go. A good file storage policy with the appropriate instructions sent to the users could just as easily make this work going forward. I've seen collections of millions of files that were perfectly fine as they were organized by user, purpose, source, destination, etc...and then subdivided as needed...and users knew what the organization was and how to maintain it (to their own benefit as it means they can find their own stuff). You can also institute a more structured system where organization is already there for them to use, but it's your call. ALWAYS ALWAYS ALWAYS figure out how you want things to be organized first! What are the functions of these files, why are they saved, who created them, who accesses them? This will make the job of sorting the mess out easier.
Species.
Related question, has anyone used, or would recommend using IntraLinks to help manage a similar scenario?
If you don't want to go the google appliance route, Notes works great, is cheap to set up, and simple to administer.
One db.
One form with a couple of fields
One view
Render to the web
Write a simple agent that crawls your directory structure, snags the files and attach each one to a Notes doc. Stuff in the directory/file name if you care.
Let Notes build an index (and it can index damn near any file).
Poof - done.
Remove user's rights to leave crap in file directories and make 'em put new stuff into Notes and you have something that's maintainable without a ton of work.
If you then want to get fancy, you can make users enter some meta data before they can save new docs.
You can set up access control, etc, etc, etc.
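The crawl step is the only part that needs any real code. This is not a Notes agent (those are written inside Notes itself); it is a plain Python sketch of the walk that collects one record per file, with field names I made up. Attaching each file to a Notes doc would happen where the record is produced.

```python
import os


def crawl(root):
    """Walk a directory tree and yield one metadata record per file."""
    for directory, _subdirs, files in os.walk(root):
        for filename in sorted(files):
            full_path = os.path.join(directory, filename)
            info = os.stat(full_path)
            yield {
                "directory": os.path.relpath(directory, root),
                "filename": filename,
                "size": info.st_size,
                "modified": info.st_mtime,
            }
```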
Documentum costs about a quarter mil just to get it in the door and a boat load of cash to make it useful. (at least it did in the late '90s).
A Notes server license is a couple grand. If you need user authentication, it's around $150/client (ask your rep for prices because IBM is working tons of price schemes). If you don't need authentication, all you need is the server license.
Where I work we wrote our own Document Management System that integrates with the rest of our systems. The integration has proven quite beneficial. Off-the-shelf systems can integrate but it generally doesn't work very well. Anyway, we were looking at using SharePoint as our back-end to get the indexing support and improved versioning. What we discovered is that SharePoint just doesn't scale very well. When you get into the hundreds of thousands of documents it has problems. When you get into the tens of millions it has major problems.
Given that the submitter already needs to file 500,000 documents I question if SharePoint is feasible.
Consider Office Evolve by Documatics. They have a system that will organise your directories into projects, provide fully indexed searching of all your documents, cater for document generation from templates, keep a complete history of all your documents, integrate with Outlook and manage workflow. It's in use at GE. We love it.
Not a student. This is not a summer job. Even if someone at your office has nothing else to do, they will not be able to do a better job than a pro.