How To Manage Hundreds of Thousands of Documents?
ajmcello78 writes "We're a mid-sized aerospace company with over a hundred thousand documents stored out on our Samba servers that also need to be accessed from our satellite offices. We have a VPN set up for the remote sites and use the Samba net use command to map the remote shares. It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for. Does anybody know of a good system or method to manage all these documents, and also make them available to our satellite offices?"
I think it's in beta though.
Isn't this the sort of thing that a Google Search Appliance would be helpful for? Then you don't need to know the exact filename, just some specific information that can identify the file. This certainly solved my problem with having thousands of emails.
09 F9 11 02 9D 74 E3 5B D8 41 56 C5 63 56 88 C0
http://en.wikipedia.org/wiki/Hummingbird_Ltd
and
http://connectivity.hummingbird.com/home/connectivity.html?cks=y
Google them? http://www.google.com/enterprise/search/gsa.html
I Need someone to rebuild a Digitech Digital Delay pedal for me....for me...for me...for me.
Sometimes you just have to do the work and not look for the magic bullet.
Store it on a single FAT32 partition and hope for the best. Only meant for people with guts or really really nice bosses.
Knowledge is power. Knowledge shared is power lost.
and there is really no naming or numbering convention in place for the files and directories.
I think you already know the answer.
"linux is just DOS with a UNIX like syntax" -- Galactic Dominator (944134)
The lack of a naming convention for the filenames and directories is neither here nor there. What matters is how well it's indexed.
Now I use naming conventions for my files (photos, MP3s, etc.). Am I contradicting myself? No, it's because I don't have enough of them that I need a separate index.
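To make the "index, not names" point concrete, here is a minimal sketch of an inverted index in Python. It assumes the document text has already been extracted into strings; the file paths and contents are made up for illustration, and a real deployment would use an existing search engine rather than this toy.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of paths whose text contains it."""
    index = defaultdict(set)
    for path, text in docs.items():
        for word in tokenize(text):
            index[word].add(path)
    return index

def search(index, query):
    """Return the paths that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical, badly named files: the names never matter to the search.
docs = {
    "MISC/Final-REPORT&notes.doc": "Wing spar fatigue test report",
    "old stuff/untitled7.txt": "Fatigue analysis of landing gear",
}
index = build_index(docs)
print(search(index, "wing fatigue"))
```

The search only looks at content, so the messy file names do not get in the way.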
Confucius say, "Find worm in apple - bad. Find half a worm - worse."
OpenDocMan has helped a lot with our Graphics and Engineering department issues, similar to yours. LDAP access to storage helped sort out who could put what where. The implementation took a bit of time to get the original files into the right locations, but it's easier to manage now.
Darwin Enforcement Agent
You have a massive project on your hands! You need a tiered storage solution and a document management system backed by SAN storage. How big of a budget do you have to solve this problem? Double it.
Tiered storage requires the business to prioritize data by level (1...n): 1 is the highest priority, 2 is lower than 1, and 3 is lower than both 1 and 2.
Generally three levels are employed, sometimes more.
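As a rough illustration of a three-level scheme, here is a Python sketch that assigns a tier from the number of days since a document was last accessed. Using access age as the priority signal and the 30/365-day cutoffs are assumptions for the example only; the business would have to define the real criteria.

```python
from datetime import date

# (maximum days since last access, tier) pairs; the cutoffs are assumed values.
TIER_LIMITS = [(30, 1), (365, 2)]
LOWEST_TIER = 3

def assign_tier(last_access, today):
    """Return 1 for recently used data, 2 for older data, 3 for the rest."""
    age_days = (today - last_access).days
    for limit, tier in TIER_LIMITS:
        if age_days <= limit:
            return tier
    return LOWEST_TIER

print(assign_tier(date(2009, 1, 20), date(2009, 2, 1)))
```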
Does the mgmt understand the complexity of the issue? Do they support the project? You have a lot of data gathering to do before you can even determine what you need.
Godspeed!
http://www.google.com/enterprise/gsa/
Or some other corporate content management system
Here's to the crazy ones
Then if you can be bothered, you can start going through older files and updating the naming conventions or entering them into the document management system of your choice...
Laters Sol "Have you found the secrets of the universe? Asked Zebade "I'm sure I left them here somewhere"
I happen to have written one:
http://sourceforge.net/projects/docdb-v/
could be what you are looking for. Of course, it'll take effort to catalog the documents.
I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...
...in bed
Invest in SANs and get a SharePoint server. You don't need SANs for SharePoint, though.
Google Search Appliance
I'm not affiliated with them, but I do use their product, and it's a steal for the cost.
www.documentlocator.com
You get version control, auditing control, web access, and a bunch more stuff.
Cygnet ECM might work for you.
Use EMC's document solution, where you have all documents in a central database with metadata that can describe the content. They can be accessed through a cached server from different sites.
Look into Enterprise Content Management solutions, there are many. Many of them are very expensive but depending on your needs it may be worth it. Several examples are EMC Documentum, Alfresco, and even Sharepoint to an extent. Alfresco is open source so that may be a good place to start.
If you need to use just plain documents, store them in one big directory and update the meta information.
Let people move links onto their system and organize the links how they like, but don't let them move the documents.
Think iTunes for documents. I loathe that example, since I have set this sort of thing up since long before iTunes came around.
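A minimal Python sketch of that idea: documents live in one flat store with metadata, and each user organizes links (here, just document IDs) however they like. The class and field names are hypothetical, and a real system would persist this in a database.

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    doc_id: str
    path: str                      # location in the flat store; never changes
    meta: dict = field(default_factory=dict)

class Library:
    def __init__(self):
        self.docs = {}             # doc_id -> Document
        self.views = {}            # user -> {folder name -> [doc_id, ...]}

    def add(self, doc):
        self.docs[doc.doc_id] = doc

    def link(self, user, folder, doc_id):
        """File a link to a document in a user's personal folder."""
        if doc_id not in self.docs:
            raise KeyError(doc_id)
        self.views.setdefault(user, {}).setdefault(folder, []).append(doc_id)

    def find(self, **criteria):
        """Return documents whose metadata matches every given key/value."""
        return [d for d in self.docs.values()
                if all(d.meta.get(k) == v for k, v in criteria.items())]

lib = Library()
lib.add(Document("D1", "/store/D1.doc", {"project": "X", "type": "spec"}))
lib.link("alice", "my stuff", "D1")
print([d.doc_id for d in lib.find(project="X")])
```

Linking never touches `path`, so users can rearrange their own views without moving the documents.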
If you need collaborative use of your documents, get something like this:
Jive.com
The Kruger Dunning explains most post on
Sounds like you need a real document management system.
Depending on your requirements, you could go with something open source like Alfresco or one of the big boys like EMC Documentum or IBM/FileNet P8. Either way, you will end up with an indexed repository of documents that makes it easy to find old documents, add new ones, etc. (assuming you and/or your integrator do the project correctly). It will also provide a web front-end so you don't have as much killer WAN traffic as you do now.
With a good document management system in-place, you are also on your way to having a workflow and other benefits as well. e.g. When Bob submits a document with XYZ as an index value, automatically tell Joe that it is in and ask Joe to approve it. When Joe approves it, tag it "Approved", and let Jim know.
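The Bob/Joe/Jim example can be sketched as a tiny workflow in Python. This is only an illustration of the idea, not how any particular product implements it; the class names are made up, and the "outbox" list stands in for real email or notification delivery.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    name: str
    index_value: str
    tags: set = field(default_factory=set)

class Workflow:
    def __init__(self, trigger_value, approver, notify_after):
        self.trigger_value = trigger_value
        self.approver = approver
        self.notify_after = notify_after
        self.outbox = []           # (recipient, message) pairs

    def submit(self, doc, submitter):
        """Ask the approver to review documents with the trigger index value."""
        if doc.index_value == self.trigger_value:
            self.outbox.append(
                (self.approver, f"{submitter} submitted {doc.name}; please approve"))

    def approve(self, doc, user):
        """Tag the document as approved and tell the next person."""
        if user != self.approver:
            raise PermissionError(f"{user} may not approve {doc.name}")
        doc.tags.add("Approved")
        self.outbox.append(
            (self.notify_after, f"{doc.name} was approved by {user}"))

wf = Workflow(trigger_value="XYZ", approver="Joe", notify_after="Jim")
spec = Doc("spec.doc", "XYZ")
wf.submit(spec, "Bob")
wf.approve(spec, "Joe")
print(wf.outbox)
```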
Depending on your requirements for document retention, archiving, e-discovery, etc. the document management system can help you fulfill all of those automatically.
Hire human beings to sift through it and label each file with a numbering/labeling system devised by your engineers. The human mind is a relatively inexpensive and already well designed piece of machinery. A few dozen of them, given enough time, can work through those hundreds of thousands of documents and get them sorted correctly. The problem you have is that you have unsorted, improperly labeled material. It is cheaper to hire sufficiently (or even insufficiently) evolved groups of people than to invent a machine capable of doing so. And, with the economy the way it is, you'll be doing everyone a favor by giving them years of employment. When the Manhattan Project needed to create a large excess of fissile material for the war with Japan, and with all the men away at war, they hired dozens of women to sit at machines, turning knobs, checking meter levels, and verifying output. The scientists themselves did not even need to be there; they designed a process, and the women were trained in it and followed it.
Most print companies like Xerox have their own proprietary Document management tools you can buy, and a bunch of CRM and ERP solutions (like OpenERP - it's free AND Open Source) provide some good simple document searching and indexing tools.
Really it comes down to how complex you want searching to be? Are there specific keys in the document you could index by? Do you require the full-text search capabilities of a Google search appliance?
A really good solution I've come across for some clients in Edmonton is called MetalTrace by Trace Applications. Don't let the name fool you about the specificity; software like this can scan, index, and even read barcodes on all sorts of documents, then let people search for them via the web. Their "killer app" has multiple user-defined document types with multiple search fields, which, combined with some back-filing (digital and scanning), really saved the day.
Do your research, though, on "document management" and see what product best fits your needs. It's a really well established field, so reinventing the wheel is a little masochistic... not that there's anything wrong with that. ;)
-Matt
--- Need web hosting?
http://www.knowledgetree.com/ If you're looking for a no-cost (read as no license fee) option then Knowledge Tree Community Edition is a decent Document Management tool. We've been using it for a couple of years.
Don't know enough about your company, budget, policies, real requirements. But throwing Documentum at it is probably good. Either that or something simple like Sharepoint. Both provide rich web based access and documentum can support long term archiving and version control. I have no idea how google appliances would do jack for access. However, if you need search there are google or cheaper/better commercial solutions from companies that actually do it right.
JamWiki.org, for instance, has search capabilities built in. Security is built in too and is easily manageable. You can upload the documents and even migrate them to wiki format later. Keeping the documents in a near-text open format will help you re-migrate them sometime in the future.
A man without religion is like a fish without a bicycle. -- Ron "Doc" Ferrell
http://www.worldox.com
Document management is generally very good. Forces people to fill out required fields. I've seen it implemented in law offices.
http://cdsware.cern.ch/invenio/index.html
http://www.google.com/search?hl=en&safe=off&q=document+management+system
a web application 4 years ago at Konica-Minolta. It is called DocuBreeze. I am not sure whether you need all the functionality it provides, but you may want to take a look. Google Docubreeze and you will find it.
I am in no way related to this company any more and I have nothing to gain from recommending this to you.
You forgot the link to your website: www.nasa.gov
I used an old version a while ago and it was pretty good then. Does versioning and other things.
http://www.knowledgetree.com/
While this may be an odd suggestion, here's two things:
1) Get yourself a damn good document or content management system. Get it set up on the baddest machines you can afford. Overshoot the capability you need, so that you have room to grow.
2) Get a librarian to look at the kinds of documents you create, and develop a system to catalog documents while maintaining reasonable standards for file names. As the super simplest system, maybe document names that indicate (at a minimum) what project or what overhead department they belong to, a broad category of subject matter, and if it's versioned, a version number.
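As one possible shape for such a minimal naming standard, here is a Python sketch that checks and parses names of the form project_category_title_vNN.ext. The exact pattern is an assumption made up for this example; whatever the librarian and engineers agree on would replace it.

```python
import re

# Hypothetical convention: project_category_title[_vNN].ext, all lowercase.
NAME_RE = re.compile(
    r"^(?P<project>[a-z0-9]+)_(?P<category>[a-z0-9]+)_(?P<title>[a-z0-9-]+)"
    r"(?:_v(?P<version>\d+))?\.(?P<ext>[a-z0-9]+)$"
)

def parse_name(filename):
    """Return the name's fields as a dict, or None if it breaks the convention."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    info = match.groupdict()
    if info["version"] is not None:
        info["version"] = int(info["version"])
    return info

print(parse_name("falcon_test_wing-spar-fatigue_v03.doc"))
print(parse_name("Final REPORT&notes.DOC"))
```

A check like this could run when a file is saved or checked in, so bad names are rejected before they pile up.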
I tried to bludgeon a small company I worked for (around 40 engineers, one overworked Q&A person, and one system administrator) into moving towards a storage system for word documents that was not "Create a new folder for each version of the document set, place them all in the right folder, and if you don't Ray will eat your head." We wound up using (of all things) Perforce SCM to house fifty thousand word documents, and were starting on putting actual code revisions for automated test sets into the system when our avionics testing focus became a serious liability, and overhead workers were drastically cut. (Why have one Q&A guy and one system admin guy? We can get an intern to do BOTH!)
Any of many document management systems. They allow the extraction of metadata, which is in turn used to 'find' the document you are looking for. Nearly all contain some security settings and a viewer for many types of files. One thing to note: this magic doesn't happen by itself. If you get stuck doing this, be prepared for two things. a. No one really knows how they want to do this; they all want to wonder whether one of the many docs has their answer and have the correct doc located and opened for them. b. You are about to become a stranger to all those who know you outside of work.
If you don't like the idea of sending your information to Google to have it indexed, you can look into some server-side applications (with associated client apps) that do the indexing and searching for you. I'm not familiar with Windows ones (although I'm sure there are some), but there are quite a few for Linux, and primarily Spotlight for the Mac. The option to have the actual indexing done server side would save on your bandwidth tremendously. You may also want to consider using a different filesystem, one that has indexing capabilities built in.
Science will save us. The question is, will it destroy us first?
Sure, with any number of ECM solutions. At the simplest end many of them simply enforce naming conventions; at the more robust end, they support many different file types for viewing, indexing, etc. and can also provide rich metadata on a document-by-document basis. Some of them have been named in the comments, including but certainly not limited to SharePoint 2007, Cygnet, Documentum, Open Text, FileNet, etc. Any system worth looking at has a web-based interface, at least for searching, and many of them allow for more meaningful interaction as well. Alfresco, Hyland, and SpringCM all have web-based ECM solutions, and more comprehensive web-based offerings are becoming available all the time. Oh - and if you're aerospace, there are a number of regulatory requirements for information management you'll need to comply with, which does complicate the situation, but spending the ducats for software and/or consulting help is probably cheaper than whatever your litigation and regulatory audit support processes cost today. Hope this helps, Jesse Wilkins ECM and other stuff consultant jwilkins13 at gmail dot com
"We're a mid-sized aerospace company with... satellite offices. Wow... apparently the state-of-the-art in aerospace is a lot more advanced than I thought! What kind of rocket do you use for commuting to those satellite offices?
I've abandoned my search for truth; now I'm just looking for some useful delusions.
I work on a product whose focus is to address this very problem. Check us out at http://www.kalexo.com/
It's integrated file/document/project management. It's targeted at industries that are geographically spread far and wide but need collaborative, secure access to common files to work on stuff.
I think step one is to pick a storage/naming convention and stick with it. Also, depending on your needs, a document management system could help. The other thing I would do is figure out where the bottleneck is for your speed issue: is it the VPN connection, the network not being able to keep up, or the computer running Samba? Once you know more about where the slowdown is, work on that spot.
I only partly jest, I know such a thing is damn near impossible to actually do, but in our Mac shop, such things are trivial. With one click of the mouse we enable spotlight searching on our Leopard AFP server and bam... all the clients have almost instantaneous search access to their docs.
If you don't know what AltaVista is (was), get off my lawn.
I'm gonna say nothing beats a proper folder structure and naming convention. I'd also recommend using svn. Also spend some time to develop some macros to assist in the creation/saving/retrieval of said documents from the repository. Maybe create some standard templates too... just my 2cents!
That's such a silly solution to the problem.
Shift+Delete
works so much better!
OpenAFS will speed up local access, and also provide an automatic backup of important files at all the satellite offices. (could be a full backup if you mirror everything).
As for the lack of any naming convention or other organization - first, the fact that you somehow manage to continue operating with a hundred thousand documents indicates that you actually DO have some form of organization in place.
If it isn't structured - get on it.
If your users are naming their files with strange characters in them (assuming it's not due to Samba), then they will just have to live with it; you won't have time to sort out all the weird names that (mostly MS Word) users give to their files. The primary objective should be to give your users access to the files. Making the directory listing pretty ought to be a secondary concern.
..something like FileNet or SAP. Sounds like you have big corporation needs; get a big corporation solution.
Laws are rules for the court, but merely a bottom bar to hit for life. Think beyond laws in your actions always.
If you need an easy way to find things, you're looking at a good searching algorithm. In order to use a good searching algorithm I'd have to recommend the bubblesort first. That way you don't need to worry about the data for a good millennium or two!
Mindoka (http://www.mindoka.com) has a document management product that is designed to solve the problem that you have.
Put Steelhead mobile on all the clients. Document transfer over the VPN will GREATLY improve. Since it's mostly text/pictures, there will be so much duplicate data that doesn't need to be transferred over the wire multiple times, the round trip time will decrease so much they'll forget they're on a VPN.
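The "duplicate data doesn't cross the wire twice" idea can be sketched in Python with fixed-size chunks and a hash-keyed cache at the far end. This is a simplified illustration of deduplication in general, not a description of how Steelhead works internally; the 4096-byte chunk size and the dictionary standing in for the remote cache are assumptions.

```python
import hashlib

def send(data, remote_cache, chunk_size=4096):
    """Send data as chunk references; only unseen chunks cost bandwidth.

    Returns the list of chunk keys and the number of payload bytes sent.
    """
    refs, bytes_sent = [], 0
    for start in range(0, len(data), chunk_size):
        chunk = data[start:start + chunk_size]
        key = hashlib.sha256(chunk).hexdigest()
        if key not in remote_cache:
            remote_cache[key] = chunk      # the chunk itself crosses the wire
            bytes_sent += len(chunk)
        refs.append(key)                   # otherwise only the key is sent
    return refs, bytes_sent

def receive(refs, remote_cache):
    """Rebuild the original data from cached chunks."""
    return b"".join(remote_cache[key] for key in refs)

cache = {}
document = b"A" * 8192 + b"B" * 4096
refs, first_cost = send(document, cache)
_, second_cost = send(document, cache)
print(first_cost, second_cost)
```

In this toy run the second transfer of the same document sends no chunk payload at all, only references.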
I worked at a place that used FileNet, which is now an IBM product, to do this sort of thing. We had millions of scanned documents in the system. I wasn't personally very impressed with it, in that whenever anything "bad" happened, you had to call IBM because finding support online was impossible, and at that, their support wasn't very good. It was also a very picky system, though it seemed to handle the load well. If you go with it, I strongly encourage doing it for UNIX/Oracle, because it screamed "poorly ported" when we used it for Windows/MSSQL. It has an API for integration, but it is also poorly documented and would take some time to integrate into your existing business systems.
This is more of a rant at this point, but it is a stop-gap solution that allows people to continue to use outdated business processes, storing important data in image formats or in documents scattered about with minimal indexing/search capabilities, rather than analyzable "data" that can lead to "information." I always take the position that if the goal is something on paper, or the goal is to store something that "was" on paper, it is time to rethink the business process to see if we can automate it, or store/present the data electronically in the first place. The old school fights against it, but no one has ever been able to say it wasn't more efficient in the end, and it enabled IT to say "yes we can" when the next great idea came along versus "here is a stack of papers, figure out $trend."
Forgive my spelling from time to time. I'm often posting during short breaks.
Hire a document manager / clerk person who will create order. Your engineers won't.
Boing boing boing....
I think the right option for you would have to be ordering the documents in a database and serving them up through a website. I think that would be helpful for your satellite offices, since mapping shares through Samba over VPN is sometimes unstable and always nontrivial. Besides, the system doesn't seem to be working for you. You really don't have to be that proficient with functional webpages to make something like this, especially if you use Ruby on Rails. A Ruby on Rails guy would probably need only a couple of hours to make such an application. Then you could have functionality like searching and sorting by author, department, type and so on.
I forgot to mention Alfresco as well, although I've never personally tried it.
http://www.alfresco.com/index-b2.html
Can't really suggest a good document management program but I can tell you one to avoid. We use Livelink at my place of work and its indexing and search capabilities are horrible (some would say non-existent). For example every document added to Livelink gets a document number assigned to it. One would expect to be able to retrieve that document by using the same document number but if you enter it into the search bar Livelink returns no results found. Huh? Not to mention some odd UI behaviours like when you add a folder to the favourites box the original folder disappears from the standard file listing (meaning there is no single canonical listing of files and directories, you need to always look in 2 places).
What kind of documents are they? If they're mostly text and you want versioning, the only drawback to subversion is getting people to learn the tools, but that might be too much.
If they're archival/static documents, an institutional repository could work. Something like DSpace isn't that hard to deploy and will provide basic archival and search features.
The middle ground between those two solutions is probably what you want, though. Everyone I work with uses SharePoint for that, and I hate recommending proprietary lock-in.
Laserfiche (or LF) is just what this is for. It is DOD, DOJ certified and crap, and is used by all branches of the military and several other areas of the government as their document management system. With several different software offerings, just about any situation can be taken care of. Its features include the ability to search based on document name, template information, or OCR'd text (which the software also takes care of). With add-on features such as Quick Fields, it may be able to automatically sort, add template information, OCR, name and then store the documents. It really is a nice way to go. Satellite offices can access it and be either full or read-only users. It has the ability and modules to connect to just about any other type of data/information system (GIS, financial software, etc.) and is very scalable.
I was a tech for 5 years with a LF VAR. I'm not there anymore. We were constantly cleaning up messes left by other document management systems. Take your time with this thing and really plan your naming convention, folder hierarchy and user setup. It's easier to get it right (or as close to it as possible) than to go back and fix it later. A good LF VAR should help you with this. Definitely check references of competing companies. Some VARs are A LOT better than others.
Digital Asset Management
http://www.lmgtfy.com/?q=digital+asset+management
We have extensive documentation and tracking needs. We use two sets of software for records and also keep a hard copy for long-term storage. For tracking parts on/off and hours in service, TSO, TSI, etc., we use TRAX Evo2. We scan all written paperwork into a database, which is interfaced with via Alchemy. This allows us to view the current status of all of our aircraft and their parts and track the paperwork for each action taken. Alchemy has a browser interface and we use IE to access it. This allows a person to access the documentation from any of our stations and/or offices internally on the network. Both Alchemy and TRAX are acceptable to our local FSDO. The hardware setup for this is not something I can shed light on, as I do not get to play with computers that are ground bound. Hope that helps, maric
As may have been pointed out, organizing the files is really the best way. Develop a strict schema for naming conventions as well as a hierarchical directory structure for maintaining and organizing. Something like:
/projectname/projectpart/data (contains the final draft of any document)
/projectname/projectpart/working (contains files that people are modifying so that they can be merged/checked in to the data dir)
/projectname/projectpart/misc (contains misc. notes or files that need to be filed with the project)
The "projectpart" dirs are really just logical groupings of data/files for the project. Say you are designing a plane, well, break it up into relevant systems, like electronics, power plant, structure, etc., and each of those are the "projectpart" directories. The "projectname" is simply the overall project itself, be it the name of the plane, maybe the name of the contract, etc.
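The layout above can be created consistently with a small Python helper, so nobody has to build the tree by hand. This is a sketch only; the project and part names in the usage example are placeholders, and it writes into a temporary directory rather than a real share.

```python
import tempfile
from pathlib import Path

SUBDIRS = ("data", "working", "misc")   # the three directories described above

def create_project(root, project, parts):
    """Create <root>/<project>/<part>/{data,working,misc} for every part."""
    created = []
    for part in parts:
        for sub in SUBDIRS:
            directory = Path(root) / project / part / sub
            directory.mkdir(parents=True, exist_ok=True)
            created.append(directory)
    return created

with tempfile.TemporaryDirectory() as tmp:
    dirs = create_project(tmp, "plane", ["electronics", "powerplant", "structure"])
    for d in dirs:
        print(d.relative_to(tmp))
```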
We were all warned a long time ago that MS products sucked, remember the Magic 8 Ball said, "Outlook not so good"
The OP did not mention exactly how many remote branches or computers need to access the documents at once, however, windows Terminal Server licenses aren't too expensive and the remote desktop experience is silky smooth. Also the documents would all reside on a central server raid array or NAS device and never need to travel over the internet to remote sites. This would also free up massive amounts of bandwidth over the VPN, considering TS just needs an internet connection and uses SSL encryption. (although I don't know what you would even need a VPN for after making this conversion)
Who else read this and thought... working in a satellite office for an aerospace company would involve a lot of cool travel perks?
-- Terry
Odd that the next story has a great idea for document management right in the summary...
Hadoop!
Support FSF: Stop thinking with your wallet, and think with your imagination. (cc/non-commercial)
...seems like a natural solution for your connectivity issues, or perhaps whatever the open source variety of Sharepoint is. You really do need to tackle the naming convention question though. You can have all the file indexing you want, but sometimes a nice, logical, clean file name will get you what you're after much faster than any kind of searching.
It's going to be horrible, painful, thankless work that will put you on the shit list of just about every department manager and administrative assistant ("You want me to rename how many files?"), but it has to be done.
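Some of the renaming pain can be scripted. Below is a Python sketch that normalizes the mixed case, ampersands and odd punctuation mentioned in the question, and builds a rename plan without touching any files. The target style (lowercase, hyphen-separated) is just one assumed convention, and the sample names are invented.

```python
import re

def normalize_name(name):
    """Lowercase a file name, spell out '&', and collapse punctuation to '-'."""
    stem, dot, ext = name.rpartition(".")
    if not dot:                     # no extension present
        stem, ext = name, ""
    stem = stem.lower().replace("&", " and ")
    stem = re.sub(r"[^a-z0-9]+", "-", stem).strip("-")
    return f"{stem}.{ext.lower()}" if ext else stem

def rename_plan(names):
    """Map old names to new ones, refusing to create duplicate names."""
    plan, seen = {}, set()
    for old in names:
        new = normalize_name(old)
        if new in seen:
            raise ValueError(f"{old!r} would collide with an existing name: {new!r}")
        seen.add(new)
        if new != old:
            plan[old] = new
    return plan

print(rename_plan(["Final REPORT & Notes.DOC", "budget.xls", "R&D_Plan-2.Doc"]))
```

Reviewing the plan before applying it lets the department managers object to specific names instead of the whole exercise.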
www.Mindwrap.com
What worries me more than anything else is that you claim to be a mid-sized aerospace company. If you are having problems finding documents, what happened to your traceability processes necessary for your QMS and how do you guarantee that employees use up-to-date documents? How did you handle the process in the past??? And, what does your QMS stipulate for records and traceability?
In a previous job we dealt with the same problem but on a smaller scale: One main office with ~ 60 people with a branch office at quite some distance with ~ 6 people working there. In our case the problem wasn't documents but a combination of large profiles which had to be pumped through a VPN link over a rather narrow ADSL line at the branch office.
In that case we placed an offsite login server which contains all the information that was also present on the main server, with nightly delta synchronisation. Users still use the main server for work that requires write access, but we were able to offer ~ 300 GB of data locally, instead of over the network.
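For readers unfamiliar with delta synchronisation, here is a minimal Python sketch of the comparison step: build a content-hash manifest for each side and transfer only what differs. It works on in-memory data for illustration; a real setup would more likely use an existing tool such as rsync, and the function names here are made up.

```python
import hashlib

def manifest(files):
    """Map each path to the SHA-256 digest of its content (files: path -> bytes)."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}

def delta(main, replica):
    """Return (paths to copy to the replica, paths to delete from the replica)."""
    to_copy = sorted(p for p, digest in main.items() if replica.get(p) != digest)
    to_delete = sorted(p for p in replica if p not in main)
    return to_copy, to_delete

main_site = manifest({"a.doc": b"1", "b.doc": b"2", "c.doc": b"3"})
branch = manifest({"a.doc": b"1", "b.doc": b"old", "d.doc": b"4"})
print(delta(main_site, branch))
```

Only the changed and new files cross the link each night, which is what keeps the sync window short on a narrow line.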
We also placed a so-called WAFS device in both offices. This is basically a network optimizer which intercepts inefficient network traffic and wraps this data with compression in its own network protocol. Next to that, it also caches network traffic, which means that to some extent, often-referenced data / network traffic is also available locally. So far I've been positively surprised with the increased throughput we've seen (about a five-fold increase compared to the old situation).
Lastly, we've been trying to push a version tracker system as a basis for documents, but hit a lot of walls with users who preferred their 'known' Samba environments over a versioning system. It does allow you to re-design your data structure for documents and string together old/related documents in an interesting way.
Regardless, you'll have to rethink and restructure how you want to store documents, if only by using better directories and creating a 'method' which users will have to adhere to. And in the end you'll need some poor cheap students who will have the pleasure of migrating all this data to your new system.
Just my 2 cents.
S V N
Do it right, or just don't freaking do it.
What SVN repository manages hundreds of thousands of documents between users that do not know how to deal with SVN?
IBM OmniFind should do the trick. It indexes your files, and then you can search the index very quickly. It also does caching of documents and other nifty stuff. It is based on Apache Lucene, and there is a free (as in beer) version, IBM OmniFind Yahoo Edition. The free version will work with up to 500 000 documents. I used it for searching a number of networked drives with circa 50 000 files on them, which it did very well.
NASA is a big user of SharePoint, strangely enough. My coworkers run into their folks at conferences from time to time.
I personally am ambivalent about SharePoint. Its roots are in document management, so it seems to do that relatively well. The publishing features are fairly nice as well. I don't think it's the best system for making web sites, but it may some day get there. Currently it feels like a 2.0 product (the magic rule is to never buy anything from Microsoft before 3.0).
There are gotchas. SharePoint is tightly coupled with your clients. If everyone accessing the documents is using the latest version of Office, you'll be okay. If not, you'll run into problems. You may also need to throw a lot of hardware at SharePoint, as storing files inside of SQL has some built-in inefficiencies.
Still, some of our users seem to love SharePoint, so it might be a good option for you.
When I worked for the state Attorney General's office as I.T. Director a request came into I.T. that immediately gave me an upset stomach. The request was for all documents on the server that contained the word "lead" as in the chemical element Pb. The issue was that the word lead and the element share the same spelling.
I kicked in and wrote an app that generated a web list on the fly and had clickable links so the documents could be examined and then marked as part of discovery.
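A rough Python sketch of that kind of tool: find documents containing a word as a whole word and emit an HTML list of links for a human to review. This is not the original app, just an illustration; it assumes the text is already extracted, and the sample documents are invented. Note that the search cannot tell the metal from the verb, which is exactly why the manual review step is needed.

```python
import html
import re
from urllib.parse import quote

def find_matches(docs, word):
    """Return sorted paths whose text contains `word` as a whole word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    return sorted(path for path, text in docs.items() if pattern.search(text))

def render_list(paths):
    """Build an HTML list with one clickable link per matching document."""
    items = "\n".join(
        f'<li><a href="{html.escape(quote(p))}">{html.escape(p)}</a></li>'
        for p in paths
    )
    return f"<ul>\n{items}\n</ul>"

docs = {
    "a.txt": "Lead paint found in the basement",
    "b.txt": "A misleading claim was made",
    "c.txt": "She is the lead investigator",
}
print(render_list(find_matches(docs, "lead")))
```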
I also brought in three Xerox 490's. Those were the hardware part of the document management system. I don't know if they ever got the servers for it but at least they had the gear. In the meantime I suggested using meta-data in filenames.
Hire a real librarian, it's what they do.
On the plus side, you also get to hire a librarian. nudge, nudge, wink, wink, say no more.
I'm guessing that wasn't on their radar screen...
It can scale extremely well. It is the backend to Adobe's acrobat.com website! So you know it can handle millions of documents if you need it to. Sharepoint requires MS SQL Server for searching documents. With Alfresco, that feature is built in.
Sharepoint is teaming software and not really designed for large document repositories. Alfresco has a teaming interface (Alfresco Share) and a more generic document repository interface.
Alfresco can expose the repository via FTP, SMB, WebDAV, and a web client interface.
Your solution:
http://xkcd.com/208/
A black cat crossing your path signifies that the animal is going somewhere. -- Groucho Marx
Maybe not the best solution for this particular job, but man am I glad we started using Dokuwiki for all our scattered documents.
http://en.wikipedia.org/wiki/Document_management_system
For that level of documentation you need to have a staff and get it properly indexed. You need a high level librarian. This would be someone with a masters degree at minimum in library science and at least a bachelors in information technology. They will not come cheap and they are a long term investment. The software is available, it is not trivial. Hiring a large number of people to recategorize and tag all the documents for the length of time that takes is also an expense but worth it. Once it's all in place maintaining it gets much easier.
I've seen a system developed for Raytheon. They took all the old compartmentalized data Hughes had and put every scrap of paper through a scanner. It was exceptionally well done. This would display electronic files and would have the location of hard copy. Classified documents were in some cases indexed but were hard copy only afaik. There were some documents that were hard copy only, those were usually ones with an NDA or other restriction on making electronic copies. It had every thing mentioned wrt versioning and such. Documents spanned decades with hundreds of revisions and you could pull up and view any revision. Depending on how recent and what type of document you could view a change log. Older scanned ones did not have that unless they'd been important enough to reenter as modern documents which meant OCR or manually transcribed. Some schematics were reentered into the system in a modern format. The effort was worth it. Having that data is the only way some devices or parts could be made or repaired.
http://en.wikipedia.org/wiki/Document_management_system
I'd go on a Vegan diet but the delivery time from Vega is too long. --brownkitty
ls | grep
amiright?
If the geiger counter does not click, the coffee, she is not thick.
It's called an index or a bibliography. There exists a profession known as 'librarian' specifically trained in the creation of such and in the management of large numbers of documents.
Sure, you could use a big solution from IBM (filenet) or one of the other products they make that compete with each other that were mashed together from years of acquisitions. Not to mention the large costs of "test" databases and extensive configuration.
I have used OnBase from Hyland Software for years at my office. They are a family-run company in Ohio with Google-esque leanings (see the photos of the large plastic slides on Wikipedia). Their products have always been easy to use, robust, and point-and-click configurable, and they can screen-grab from almost any legacy application (COLD/DIP as well). Straightforward pricing, out-of-the-box functionality. It just works.
I strongly recommend you take a look for yourself. (I won't post any links... I am not an OnBase stooge bot, just a fan.)
In my last company, which was a leading semiconductor designer with a large document repository and several branch offices, we used Cognidox:
http://www.cognidox.com/
This worked well for us; it has good document workflow management, tagging, search capability, user rights management, etc.
We went through this for both document management and a web front end for access. We looked at Sharepoint, Alfresco, Oracle UCM, Reddot and a few others. We dropped most due to cost, functionality, and ease of use for non-developers doing page work. Sharepoint was dropped due to cost in an internet setting (CALs), the lack of a non-developer front end for page layout (they couldn't use HTML), and the fact that it stores everything in the database. From prior experience this made backup/restore difficult, as it keeps the IP of the web site in the database when you back up. If you restore to a different machine it gets confused. It was between Oracle and Alfresco. You cannot go wrong with either. Both are extensible, and both either have what you need built in or let you add it easily. Both are good for non-developers to use. Support is very good with either. We went with Oracle. While it did cost more, it matched our existing infrastructure.
This is built for the exact situation you described:
http://www.opentext.com/2/global/sol-products/sol-pro-docmgmt-collaboration.htm
You can either import the files into the system, or leave them in place, index them and use the search engines to locate the needles in your haystacks...
About Open Text:
http://en.wikipedia.org/wiki/Open_Text
Hummingbird is a subsidiary of Open Text, the solution mentioned above...
Full Disclosure:
I am an Open Text employee.
Google?
http://www.google.com.au/enterprise/mini/index.html
Seriously, if you can't be bothered collecting/maintaining the metadata that more structured solutions require, then just let Google index the lot. It'll work just as well (or not) as it does on the Internet. Although it's not free, it seems reasonably priced. It could be a quick answer to your problem.
I know I'm gonna get hit for blurting out the Microsoft Solution but...give SharePoint a shot...
Just avoid the wiki functionality like the plague. It completely sucks.
Since your organization probably has Windows clients, you can only long for something as nice as Mac OS X Spotlight Server.
Google Search Appliance is definitely what you want.
If you have a mid sized company you definitely don't have the surplus of highly talented systems administrator talent laying about to run one of the document management systems that others here are likely to suggest. Be very careful going down the document management server path. It's far, far more work than you think it will be, than the vendor will tell you it is. Not simply more work for you, but for your IT staff and your users, too.
The Google Search Appliance, by contrast, is "fire and forget". Plug it in. Turn it on. Patch it when Google suggests you do so. That's about it.
If you mod me down, I shall become more powerful than you could possibly imagine.
We use a Bentley product called ProjectWise. It is a document management system with file attribution, among other things. It is primarily useful for Bentley's line of products, but we have also used it as an archival system and as a store for working documents that are not Bentley-specific. No... I do not work for Bentley, but my job heavily uses their products.
You could do worse than to look into KnowledgeTree
http://www.knowledgetree.com/
it's released under GPL2.
A good example is Cisco WAAS, a cool video showing how it works is here: http://www.cisco.com/cdc_content_elements/flash/ans/index.html
See here for data sheets and specs: http://www.cisco.com/en/US/products/ps5680/Products_Sub_Category_Home.html
Cisco's solution is inexpensive and you can use your existing router investment to do all the heavy lifting.
Pat
Unsurprisingly, the answer to managing many documents is to use a document management system. There are several commercial and free products available, both linked here and on the Wikipedia page for Document Management Systems.
I've worked next to the team who administered Bentley ProjectWise in a previous engineering job, which is expensive but definitely suited to your task. There may be other good options out there.
DMS -- http://en.wikipedia.org/wiki/Document_management_system
-- botsex is {grep;touch;strip;unzip;head;mount}
We're using a Win3.1 app called LaserFiche on XP with > 250,000 documents and it's lightning fast, works with TIFF files and PDF and probably more. Includes file and folder permissions.
moox. for a new generation.
Step 1: Print out all 100 thousand docs and draw different little smiley faces on each of them. Step 2: scan all your docs back in as jpegs. Step 3: import all those jpegs into iPhoto and use "Faces" to magically organize them - just like on the television commercial.
No matter what system you use, it's still going to be slow. To overcome the slowness you will need something that makes the SMB/CIFS protocol less chatty. I would suggest:
Cisco WAAS www.cisco.com/go/waas
Riverbed www.riverbed.com
As two great WAN acceleration products that will help you speed up document retrieval, access, and writes across the satellite link.
Check out Thunderstone. It's what they do, and they do it very well.
Documentum, docushare, livelink, sharepoint. I've heard of documentum installs with 100m+ docs. It's quite good, but expensive.
Take a look at NetDocuments. It's a SaaS (Software as a Service) document management system. It handles millions of documents, can be accessed from anywhere, and is relatively inexpensive compared to maintaining your own servers.
Look into Alfresco: http://www.alfresco.com/
It's becoming quite a mess, sometimes quite slow, and there is really no naming or numbering convention in place for the files and directories. We end up with mixed casing, all uppercase, all lowercase, dashes and ampersands in the file names, and there are literally hundreds of directories to sort through before you can find the document you are looking for.
Slow. Upgrade your network and VPN. You know that VPN layer is just killing your performance.
No naming or numbering convention. Get one.
Mixed casing. Learn How to Properly Case Folders (and documents).
Dashes and ampersands. Are they a problem? Aesthetically unpleasant? I personally restrict punctuation in a filesystem to dashes, periods, and parenthesis (unless the punctuation is a replicable part of the name of the file/folder).
Examples:
01 - The First Track (vocal)
02 - $lashhvertisements Attack!
03 - Where Have All the A.C.'s Gone
Develop your own method that works and be obsessed about it to the point where you would reburn a disc if one of the filenames was "01-Name" instead of "01 - Name".
Hundreds of directories.
Each file should have its own folder.
"That's insane!" you say. Start out with this mentality. If there is no reason at all to separate two files (they are part of the same thing) then place them in one folder, and make sure the folder is named all-encompassingly. Repeat for all files. If you get into an AB, BC, but not ABC situation, the solution is to have A and B and C, with A and C linking to B with your choice of shortcut/link/symlink/etc.
Do this until all files are in folders. Then repeat with folders.
There is NO substitute for organization and getting people on the same page. Develop some conventions. Task people to fix as they go. Check up to make sure people accessing documents are fixing as they go, and doing so according to convention. Once people are used to the convention, and once things are relatively organized, they won't ever need to search again. They'll instantly know where 99% of things are, and will be able to dig around and find anything else within seconds.
The main problem you face is getting organized after already being unorganized. It isn't easy, but at least you're not dealing with millions of paper documents.
Alfresco is likely your answer. We dumped the unbelievably expensive FileNet and jumped onboard with an Open Source solution. It can be done for free, but likely your company, like mine, would opt for paying a small license fee for support benefits. See http://www.alfresco.com/index-b1.html.
CVS, SVN, Git & friends with some sort of check-in verification scripts could provide what you look for. All of these
can interact with your LDAP directory as well.
A Tortoise client can provide Win Explorer integration and simplify user operations but a nice How-To with pictures
could probably help you sell it.
Literally LOL'd
OnBase -> http://www.onbase.com/english/index.aspx
Digital Asset Management applications solve this problem; one of which is NetXposure (netx.net).
The OP doesn't grok vector space. He should search for, "how to make a vector space search engine in 12 lines of Perl".
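For anyone who doesn't feel like reading Perl, here is a rough sketch of the same idea in Python. It is not the article's code, just a minimal bag-of-words version: each document becomes a term-count vector and results are ranked by cosine similarity to the query. The two documents are made up for illustration.

```python
import math
import re
from collections import Counter


def to_vector(text):
    """Turn text into a sparse bag-of-words term-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, docs):
    """Return (score, name) pairs for docs sharing a term with the query, best first."""
    q = to_vector(query)
    scored = [(cosine(q, to_vector(body)), name) for name, body in docs.items()]
    return sorted((pair for pair in scored if pair[0] > 0), reverse=True)


# Hypothetical documents, invented for this example.
docs = {
    "pump_report.txt": "Fuel pump vibration test report for engine program",
    "travel_form.txt": "Travel expense reimbursement form",
}
results = search("pump vibration", docs)
```

A real index would precompute the document vectors and probably weight terms (tf-idf or similar) instead of re-tokenizing on every query, but the ranking idea is the same.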
I use the 'job' system, which I learned from working at Digital Domain (the Visual Effects Company) and then passed it on to the Aerospace company where I now work.
Effects companies deal with enormous amounts of data, and many different versions of a shot as well as all the elements that make up that shot, along with other data such as project settings files from software used in the making of that shot. They had a very specific file naming system to keep that all organized, and it was referred to as the job system, because first and foremost everything was logically separated by project.
How that has translated for me into the Aerospace field is at the root of the main drive share, there are two primary folders, job and departments. Departments contains generic documents for each department such as forms, standards, etc.
The 'job' folder contains several categories of jobs or projects, such as vehicles, engines, pumps, etc.
Inside those are folders with the project name. Inside each project folder is a series of folders for different data types, such as solidworks, reports, proposals, documentation images, etc.
File naming:
File naming should be consistent, and I always start my own files with the date with year first, because I do not trust meta-data one single iota. I have had dates wiped out when a backup system kept a backup, but did not preserve the file creation / modify date on copy.
After that it is the thing, then the version.
So 09-06-10_widget_v01.sldprt
version two should be exactly the same, with the number iterated up. There should never be a document named something_FINAL because you always end up with FINAL_FINAL_FINAL etc. :)
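If you want a script to do the iterating for you, something like this would work. It is only a sketch: I'm assuming the date is YY-MM-DD as in the example above and that versions are zero-padded to two digits. The function names and the regex are mine, not anything Digital Domain used.

```python
import re
from datetime import date

# Assumed pattern: YY-MM-DD_thing_vNN.ext (matches the example name above).
NAME_RE = re.compile(r"^(\d{2}-\d{2}-\d{2})_(.+)_v(\d{2,})(\.\w+)$")


def make_name(day, thing, version, ext):
    """Build a name like 09-06-10_widget_v01.sldprt from its parts."""
    return f"{day:%y-%m-%d}_{thing}_v{version:02d}.{ext}"


def next_version(filename):
    """Return the same name with only the version number iterated up."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not in date_thing_vNN.ext form: {filename!r}")
    stamp, thing, version, ext = match.groups()
    return f"{stamp}_{thing}_v{int(version) + 1:02d}{ext}"


first = make_name(date(2009, 6, 10), "widget", 1, "sldprt")
second = next_version(first)
```

Anything that doesn't match the pattern (something_FINAL, for instance) raises an error instead of being silently renamed.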
Now, as you probably know, the difficulty is enforcing a uniform standard when people are busy doing actual work. Things get sloppy, things get messy. You have to keep up after people, and policing stuff like this is not fun. At Digital Domain it was an urgent necessity for everyone to use the standard, and there was automated software that relied upon it. At the aerospace company, I gave up years ago trying to enforce a perfect policy. Now, people generally follow the example I set to a point where you can easily find things. When I first got to this company, when it was really small, all files were (seriously) piled nearly into a single folder. It was already a disaster and it was impossible to find anything. People were used to working on their own computer and did not have a concept of a shared file server, at least not in a modern sense.
Now you can just swatch down the left pane in windows explorer and get what you want very quickly.
This system is designed to use the left pane (lots of folders for organization) and people who were used to the Windows 3.1 way of double clicking through folders without the left pane had to change their (awful) habits. That was the biggest concession among the old school users.
The trick is also not to over-do the nested folders. Just enough to keep it nice and tidy.
Every once in a long while you run into a file that really wants to belong to several folders, and that's what shortcuts are for. Even if the shortcut gets broken you can look at the shortcut file to see what it originally pointed to, and you can probably find it that way.
At home I use the same methodology to archive 30,000 photographs. I can find anything in an instant by expanding folder icons. When that fails, plain old windows search is able to turn up what I am looking for, in those rare instances.
I have always been against anything that 'collects' your files into meta data, such as iTunes, or various photo editing programs. It's a big mess because one day that software won't be around and your files will be a mess.
Even my MP3s are organized by genre/album/1.song.MP3. I just drag album folders or songs into Winamp and I am off and running as my own DJ. I don't use a media organize
Oh no, not another CMS.
I've never seen a CMS that was anywhere near up to date.
The only way to index more than a few dozen documents is to use Enterprise search.
For the really cheap, you can install Google Desktop on the PC that holds the Enormous Shared Drive, and then let people log in via Remote Desktop or VMC and look stuff up. (Is there a Google Desktop API?)
You eventually could have a lot of people making personal indexes of the Enormous Shared Drive with Google Desktop, which is going to cause problems that will motivate you to obtain a real enterprise search package.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Oh man NextPage NXT sucks. Just stay away from it. Anything is better. It's consulting ware. You pay a ton of money for a mediocre product with mediocre support, and then a ton more money to pay their experts to set it all up and integrate it for you since it's so poorly documented.
Their IIS plug-in also allowed unauthenticated users to shut down the NXT web site with a simple GET request. We accidentally shut down their support web site one Friday afternoon after trying a command that was listed in their own documentation on their support site from a web browser with no special access.
Thank God I got a better job, so I don't ever have to work with that piece of crap ever again.
Obviously throw them on the desktop. Once it fills, throw them into a New Folder. Once your desktop fills with Folders, throw those in My Documents. Repeat until your computer crashes.
Ginga no Rekshiya Mata Each page.
Basic unix tools can do the trick. find (atime,ctime,etc) mixed with egrep, or just egrep with -R... all sorts of solutions, right at your command line.
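If the share is mounted on a box without the usual tools, roughly the same thing can be sketched in Python. This is an approximation of find piped into egrep, not a replacement for either; the function and parameter names are my own.

```python
import os
import re
import time


def find_files(root, name_pattern=None, content_pattern=None, modified_within_days=None):
    """Yield paths under root that pass every filter that was given."""
    name_re = re.compile(name_pattern, re.IGNORECASE) if name_pattern else None
    content_re = re.compile(content_pattern) if content_pattern else None
    cutoff = None
    if modified_within_days is not None:
        cutoff = time.time() - modified_within_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for fname in filenames:
            path = os.path.join(dirpath, fname)
            # Like find -iname, but with a regex on the file name.
            if name_re and not name_re.search(fname):
                continue
            # Like find -mtime: skip files last modified before the cutoff.
            if cutoff is not None and os.path.getmtime(path) < cutoff:
                continue
            # Like egrep -R: keep the file only if some line matches.
            if content_re:
                try:
                    with open(path, errors="ignore") as handle:
                        if not any(content_re.search(line) for line in handle):
                            continue
                except OSError:
                    continue
            yield path


# Example call (the path is hypothetical):
# for path in find_files("/srv/samba/docs", name_pattern=r"report", modified_within_days=30):
#     print(path)
```

Content matching only makes sense for text files; Word documents or scanned PDFs would need a text extraction step first.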
20th century Marxism is not progress...
I'm pretty sure there are databases that can store and serve up documents based on criteria. Couldn't you set up a centralized web server with an SQL backend that hosts those files for you? You would be able to then keep track of who is using which document and when, and regulate who can do what with different documents as well. As a bonus you should be able to ditch SMB while you're at it and move to a more robust OS for your critical files. Centralizing those documents would also make it dramatically easier to back them up at regular intervals.
Damn_registrars has no butt-hole. Damn_registrars has no use for a butt-hole.
Have you checked out IntraLinks? (www.intralinks.com)
This is not going to help you with your 'finding the right document' problem, but it is essential for your remote offices to be able to open (and save) those documents in a reasonable time. It will also have the added benefit of dramatically reducing your WAN traffic (think 50% reduction). When I initially trialled these, Riverbed was miles ahead of Cisco. That was 2 years ago, but they are still the only one with a remote client and a few other tricks. Well worth the investigation & money.
Yes, I know it's been mentioned before. Yes, I know it's Microsoft. But SharePoint is an excellent document management system. It supports clustering natively, load balancing, search, information rights management, web editing for most Office formats, InfoPath web-integration. Users can also save natively to SP via WEBDAV through Office apps directly, or through Explorer. There's a whole crapload more that you may want to check out at the SP site.
.Net libraries available for you to natively access SP and manipulate the whole system via scripts. Importing and exporting files is a cinch using these APIs. There's also exposed web services via SOAP that let you do the same thing. And, in the end, there's the actual SQL backend that is very straight-forward so if you don't want to use the SOAP or SP .Net libraries, you can manipulate the database directly.
To get yourself organized and imported, there are
So no, you are not locked in. And, the licensing cost is the most reasonable out of all the document management software out there.
Microsoft Office SharePoint Server 2007 with search capabilities would be a wonderful place to store all of this stuff.
Before you look for a technical solution, hire a Document Controller.
Real men use an old TI-99/4A machine with a casette recorder, and files sent via RS-232 connections.
Law firms are experts at managing millions of documents using document management software. If you want state-of-the-art document management, then the software that law firms use is what you're looking for.
What's up with this box everyone has to think inside of or outside of? Why does there have to be a box?
I'm on the IT Applications side of things, not operations so my experience with this has been more as a user than as an admin (though I've helped that group on a few things)...
...but we implemented Documentum and have found it to be slow, difficult to deal with and I've heard no end of horror stories about how hard it was to implement.
In all honesty we had a properly set up sharepoint (tsk!) solution at another company and it pretty much ran itself and did the job we needed it to do. YMMV.
Kneel before Sig!
ask about digital repository
Simple.. use CVS. Documentation is centralized and de-centralized. You have versioning, log, comment, and overall this... it's free
I suggest you put documents in filing cabinets, lots of filing cabinets. You need a good indexing system, but the documents will be pretty safe then.
Pretty easy really. People have been using this technology for hundreds of years and it's pretty stable. You don't have to worry about magnetic fields wiping your drives, or dyes leaking out the edges of your DVDs/CDs, or file corruption, or power blackouts, or haxors getting in, or people deleting random stuff by accident.
I personally dealt with an issue like this at the Australian arm of a large international mining equipment manufacturer. I wrote the software solutions mentioned and went on to do my engineering honors project in the area. My first recommendation is: stay away from document management systems. They are bulky, inefficient and tend to lock you into "their way" of doing things. As soon as you want something different, you will find yourself stuck. This is a simple problem; don't make it too hard for yourself.
My solution was multi-layered:
1) Place exactly 1 person in charge.
2) Enforce a naming convention. - Our CAD Drafters and Engineers (of which I did both) were notoriously bad at naming their documents correctly. Most of this was ignorance. Document your naming convention and make it well known.
3) Write or come up with a standardized way of generating document numbers. In my current job as a software engineer I would recommend a simple, incremental numbered approach. Every document, every revision, simply gets a new number. Our engineers did not like this. So we went for a middle ground. Something like XXX-YYY-ZZ.eee Where XXX is the equipment type, YYY is the sub type, ZZ is the revision no, eee is the extension/file type.
4) Standardize the way you store your documents. For instance, make a folder structure . C:\xxx\yyy\XXX-YYY-ZZ.eee
5) Register ALL documents in a database with location, comments, purpose, revision, author name etc etc.
6) Take the Draftsperson or the Engineer out of the archiving process. I wrote a utility that checks a single "to be archived" folder, fixes obvious mistakes such as using "_" or "." instead of "-" and so on, checks the database to make sure that the document has been registered, and then drops the file into the file system. Make the archive read-only for everyone except the person in charge (and any utilities, of course).
7) Clean up your existing archive. This can be a semi-automated process. I wrote a utility to do this partially, but it just takes a lot of painstaking effort. With 70,000 documents this was a slow and painful process but it can be done.
8) STICK TO IT. Any exception will erode the system over time making it useless.
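To give an idea of what the utility in step 6 does, here is a minimal sketch in Python. It is not the original utility. It assumes three-character equipment and sub-type codes and a two-digit revision, stands in a plain set for the registration database, and uses a made-up archive root.

```python
import re
from pathlib import PureWindowsPath

# Assumed field widths for XXX-YYY-ZZ; adjust to your own convention.
DOC_RE = re.compile(r"^([A-Z0-9]{3})-([A-Z0-9]{3})-(\d{2})$")


def normalize(filename):
    """Fix obvious separator mistakes and return the name as XXX-YYY-ZZ.eee."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:
        raise ValueError(f"no extension: {filename!r}")
    # Treat "_", "." and spaces inside the stem as a mistyped "-".
    fixed = re.sub(r"[_.\s]+", "-", stem.strip()).upper()
    if not DOC_RE.match(fixed):
        raise ValueError(f"cannot repair name: {filename!r}")
    return f"{fixed}.{ext.lower()}"


def archive_path(filename, registered, root="C:/archive"):
    """Return the archive location, refusing documents that are not registered."""
    name = normalize(filename)
    doc_number = name.rpartition(".")[0]
    if doc_number not in registered:
        raise LookupError(f"{doc_number} is not registered")
    equipment, sub_type, _revision = doc_number.split("-")
    return PureWindowsPath(root, equipment.lower(), sub_type.lower(), name)


# Hypothetical register; a real one would be a database query.
registered = {"PMP-IMP-03"}
target = archive_path("pmp_imp.03.DWG", registered)
```

The real thing would then move the file to that path and leave anything it cannot repair, or that is not registered, in the "to be archived" folder for the person in charge to sort out.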
There are a ton of document management systems out there. Our company uses http://www.opentext.com/ (look for DM). You can use Microsoft SharePoint as a document management system, but it is not really what it was designed for. DM will integrate with all the Microsoft applications. It will give you document numbers, version numbers, etc., and you can profile your emails as well if you want. We have had some performance problems at the remote locations, but it is still usable. I did a search for open source document management systems on Google and there are a ton out there if you don't feel like paying for something.
Curious about Storage and Virtualization? Check out
Look at a document management system. Interwoven makes a great one. Some things to consider:
* Security
* Version Control
* Document History (Access, Changes, etc.)
* Search Capability (Profile Search, Full Text Search, Date Search)
or Confluence Hosted: http://www.atlassian.com/software/confluence/hosted/
Microsoft Sharepoint seems to handle lots of documents well. It includes document libraries, which are like folders that you can store documents in. It also has a built-in search function, which is described as being able to search through multiple levels of documents and retrieve results. It's also not too expensive. I think there are some specific web parts, or plugins, that even help facilitate document storage and handling.
The only downside that I can think of is that it requires knowledge of .Net as a framework, but that isn't so hard to learn - check it out, it might take you a long way!
I've worked with Oracle UCM (formerly Stellent) for a few years now and would thoroughly recommend it. It's scalable into (at least) the 10s of billions of documents. A single repository for Doc Management, Records, Web Content Management, workflow, imaging. It comes with security, library services, metadata, and search OOTB. Using the WCM, you can make your documents available on an intranet, extranet or internet site, according to specified security policies.
;-)
BTW... offices on satellites... that's so cool!
My other account has mod points!
We're an old engineering company, and our products last decades, so we need to keep lots of records.
Recently, we started scanning old documents (a warehouse full of them) to make room for expansion.
It is a very tedious process, because we can't risk shredding the old files unless we know for sure that the scans are correct. Anyway, for storage, we decided to go for an in-house web-based system (someone developed it for us) that is quite basic, and does two important things for us:
1- it references the file in its location, rather than storing the file in a database and copying it to the webserver
2- gives us the ability to change metadata (the document indexes) as we find errors in them
Referencing a file in its "physical" location gives us two layers of access control: one through the database permissions, and the other through file system permissions. This is important for restricted files...
Obviously, searching is the important part, and indexing is absolutely critical and the most time-consuming process.
Someone suggested the Google appliance to us, but none of the scanned documents can be searched. They are all images.
The actual application is pretty basic concept (nice interface features, but the concept is simple)
1- A database to hold the info
2- a table per document type containing the metadata and the filename and filepath
3- a web interface to search and re-search to narrow down the list.
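As a rough illustration of that concept (not our actual system), here is a sketch using SQLite from Python. The table name, columns and sample rows are all invented. The table stores only metadata plus the filename and filepath, the files themselves stay where they are on disk, and each search can be narrowed by adding another criterion.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per document type; "drawings" is a hypothetical example type.
conn.execute(
    """CREATE TABLE drawings (
           id INTEGER PRIMARY KEY,
           title TEXT, project TEXT, year INTEGER,
           filename TEXT, filepath TEXT)"""
)
conn.executemany(
    "INSERT INTO drawings (title, project, year, filename, filepath) VALUES (?, ?, ?, ?, ?)",
    [
        ("Impeller housing", "Pump A", 1987, "scan_0001.tif", "/archive/drawings/1987"),
        ("Impeller shaft", "Pump A", 1992, "scan_0002.tif", "/archive/drawings/1992"),
        ("Valve body", "Valve B", 1992, "scan_0003.tif", "/archive/drawings/1992"),
    ],
)

SEARCHABLE = {"title", "project", "year"}


def search(conn, **criteria):
    """Return (filepath, filename) rows matching every criterion given."""
    clauses, params = [], []
    for column, value in criteria.items():
        if column not in SEARCHABLE:
            raise ValueError(f"unknown column: {column}")
        if isinstance(value, str):
            clauses.append(f"{column} LIKE ?")
            params.append(f"%{value}%")
        else:
            clauses.append(f"{column} = ?")
            params.append(value)
    where = " AND ".join(clauses) or "1"
    sql = f"SELECT filepath, filename FROM drawings WHERE {where} ORDER BY filepath, filename"
    return conn.execute(sql, params).fetchall()


broad = search(conn, title="impeller")              # first search
narrow = search(conn, title="impeller", year=1992)  # re-search to narrow down

# Correcting a metadata error later is a plain UPDATE; the scan is untouched.
conn.execute("UPDATE drawings SET year = 1988 WHERE filename = ?", ("scan_0001.tif",))
```

Because only the path is stored, file system permissions still apply on top of whatever the database allows.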
I'm sure it's been said by now, but you really should be looking at a content management system. There are several vendors out there that sell various types of document control systems; Pilgrim, Master Control, I'm sure Oracle has something that does that. There are also open source frameworks that you can develop in-house like Drupal. All of those are online document management systems. Users upload documents to them. File naming conventions can be enforced as well as directory structure etc. Many of them allow for document collaboration and approval. It's a complex problem, and a valuable solution will take some serious thought and time. I've heard some people use google documents, but for a company of your size I wouldn't recommend it. In any case, folders on network drives are NOT the answer.
Google appliance.
Oliver's law of assumed responsibility: If you're seen fixing it, you will be blamed for breaking it.
Check out http://www.columbiasoft.com
document locator google it, this is the solution we use
manages all files in a sql database
organizes etc.
There's a bunch of different document management solutions out there. I'm very unhappy with the one my company uses, so I'm not going to mention it, but if you do a search for document management on Google, I'm sure you'll come up with tons of stuff. There are probably open source solutions.
I work at a midwestern public university in the USA, and we've been using this program for several years and a few versions. Backend can work on AIX, Linux, or Windows, and the frontend at least Windows (don't know if Macs or *nix are supported, we don't have many of those on users' desks). We probably have several gigs of imaged documents in this system, and it seems to work pretty well.
You'll have to import all the documents into the system, of course. The company recommends certain tractor-feed scanners for this; lighter-duty ones are USB, heavier are SCSI. I think it also has a software printer emulator to let you dump e.g. Word documents into the system; how you organize things is up to you.
Hail Eris, full of mischief...
E pluribus sanguinem
Whatever the solution, you have to get staff to declare what it is on the front end. It's not all about the technology. I see some of the benefits of Sharepoint, but depending on your audience (tech-savvy or not) it may become a training issue. Prepare for change management.
What I like about Sharepoint is the Office integration, the improvements over the last few years, document history (versions), and mostly, the ability to require metadata. If you have a taxonomy of topics, it will make it much easier to create a search appliance that can find what people are looking for. You may be forced to look at auto-classification if you can't get staff to do it, or hire knowledge managers (librarians) to properly catalogue. Trouble for us is getting to agreed-upon taxonomies and hierarchies across divisions (I'm in the knowledge management trenches here).
A good way to start might be Sharepoint repositories, require a topic field, seed it with however many topics you can come up with, and leave an OTHER field so you can collect what you have not organized. If you analyze what comes into the OTHER topic, you may keep adding new topics.
Find the logical buckets to start search before they think about searching too. Does your staff only care about 1 project at a time, break it up into project searches. Basically offer them one level of selection before they get to search - it may make things easier (if you are structured that way). They may look for something from a particular function - Marketing search vs. Operations search.
Also, sharepoint can leverage active directory info, so you may be able to get some metadata automation (Docs from sales staff vs. R&D, etc.)
Hope these points help. Contact me if you need more.
Jim - your name is Jim...
I work for a mid-tier medical company and we use Objective http://www.objective.com/.
It has its limitations, but it indexes, searches and does version control. Oh, and the FDA know about it.
No idea of the cost :)
I implemented swish-e, http://swish-e.org/ for a client with html and .pdf indexing (nightly) in 11 hours from a standing start (never used swish-e before).
Verbum caro factum est
Go with alfresco....can be a pain to setup but its a clustering champ
A company with the guts to challenge the big guys, IBM and EMC, and usually wins: http://onbase.com
Besides, their office has two slides. One for speed (metal) and one twisty (plastic, like a playground!).
They also have a hosted version.
Hi there, I am one of the developers of this nice web tool, which was in fact designed to meet the requirements you describe. We are calling it anydata, but we don't know whether we'll need to change its name, as it's a registered trademark. At least you can see our goal ;)
:D
http://devel.anydata.tv/
Try it out with Firefox if you don't want to see something ugly right now. It's a beta, but in less than a month you will see it complete. It looks like a file manager, a pretty well-known user interface for browsing documents and information. The system lets you store files, bookmarks, text notes, contacts and, soon, PGP-encrypted passwords for secure sharing among system administrators.
In short, keeps the 'tree-browsing' typical schema of filesystems plus generating and showing previews of documents, tagging, automatic keyword gathering from documents and a search engine.
By the way, it's GPL
Anyone interested just send me an email to kenneth at gnun d-o-t net and I'll give you a testing user or whatever needed.
Cheers!
Kenneth
OK, so it is a bit hard to get your documents out once you put them in to this system, but man, does it tidy up a mess of documents.
-ted
You could use mediawiki as a front end to your documents, possibly with the semantic mediawiki plugin.
I'm serious! If all your documents have a URL, you can link to them from the wiki, and then build a comprehensive system of summaries, categorisation, and semantic data about the documents.
But that's just one tool. There are many such tools. There's no magic bullet; you just need someone to organize all your data.. It sounds like you need something like a librarian, possibly you could hire one part time?
if there is decent metadata or the content is somehow indexable, you can try a digital asset management system, perhaps open source, to get some kind of organization and accessibility.
How about Desksite (formerly iManage) or PC Docs?
You could set up a Document Management System like Alfresco or god-forbid, Sharepoint. Or you could run OS X Server and let Spotlight index everything.
I've got this car, and it doesn't run and it's got all these strange bits inside under this hood thingie. . . . Hire a librarian or someone with a degree in knowledge management who has experience in the corp world.
First, you're potentially trying to solve more than one problem here: slowness and naming convention. I'm guessing they're somewhat related (large directory listings due to lack of organization), but there might be a deeper infrastructure issue that needs to be dealt with, too.
As for organizing files, you need a naming convention for your project files, first and foremost. Throwing a bunch of disparate files at a CMS is going to do nothing but complicate things more (from a sane-management perspective).
Data categorization is key. You need to figure out a way to organize it in a fashion which is both contextual to how people use it as well as how it relates to the other data (in, say, a project).
For instance, you will want (at a minimum) the equivalent of user-level and group-level data shares. This would, in all likelihood, get kind of tricky with shifting working groups. For this there are multiple ways to use ACLs (as opposed to just user/group/all permissions) within Samba (with or without shackling the machine to a Windows domain/authentication server). ext3 and XFS both have the ability to use ACLs (XFS natively), last I checked. Ultimately, this would probably be better than just using user/group, as it would be more extensible.
As for a Solution...
Something to look into that is specific to Samba is the "veto files" directive for smb.conf. It is per-share. I am uncertain whether it supports regex (it didn't in early 2005 when I last used it); if it did, it could be very useful for enforcing a specific namespace (going forward).
I would recommend "enforcing" namespace. While this is likely a self-created problem (ie you or your predecessor did not set things up properly in the first place), you really need to push to your users the importance of this. You need to tell them "organize your files, it'll make things faster" if there's any bitching.
There was an article in LinuxMagazine a while ago about determining the age of data. Utilizing this in some sort of auto-sort script to move "old" data to a "pre$date" directory within the original messy directory might speed things up. Also, archiving (or at least moving it to an "old shit" directory) past, unused data is important. It eases the "human element" of data organization.
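A minimal Python sketch of that auto-sort idea, assuming "old" simply means "not modified for a given number of days". The function name `archive_old_files` and the `pre<YYYYMMDD>` folder format are illustrative assumptions, not taken from the article mentioned above.

```python
import shutil
import time
from pathlib import Path


def archive_old_files(directory, cutoff_days, now=None):
    """Move files not modified in `cutoff_days` into a pre<YYYYMMDD> subdirectory.

    Returns the list of destination paths for the files that were moved.
    """
    directory = Path(directory)
    now = time.time() if now is None else now
    cutoff = now - cutoff_days * 86400
    # Name the archive folder after the cutoff date (assumed naming scheme).
    archive = directory / ("pre" + time.strftime("%Y%m%d", time.localtime(cutoff)))
    moved = []
    # sorted() takes a snapshot, so creating the archive folder is safe.
    for entry in sorted(directory.iterdir()):
        # Only plain files in this directory; subdirectories are left alone.
        if not entry.is_file():
            continue
        if entry.stat().st_mtime < cutoff:
            archive.mkdir(exist_ok=True)
            target = archive / entry.name
            shutil.move(str(entry), str(target))
            moved.append(target)
    return moved
```

Run it against a copy of a share first. Modification time is only a rough stand-in for "unused", and access times are often unreliable on file servers.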
Projects should all have a reference number (because there is, in all certainty, hard paper associated with the projects, and sometimes you need to cross reference). Keeping this consistent is important. Use what works, keep it short/demarked so users don't avoid using them. I like each project folder to have the project number to relate to contract/etc. start (short) date (eg. 080112 for Jan 12th, '08) followed by a 2-3 digit number (depending on how many projects are started per day) followed by major revision. End result: something like "080112.01.a Jennings Construction" Or organize by client ID. Or something.
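To make the folder naming above concrete, here is a small Python sketch that builds and checks names of the form "080112.01.a Jennings Construction". The helper names and the exact pattern (two or three sequence digits, a single lowercase revision letter) are my reading of the example, so treat them as assumptions.

```python
import re
from datetime import date

# Matches names like "080112.01.a Jennings Construction" (assumed pattern).
FOLDER_RE = re.compile(r"^(\d{6})\.(\d{2,3})\.([a-z]) (.+)$")


def project_folder_name(start, sequence, revision, client):
    """Build '<YYMMDD>.<NN>.<rev> <client>' from its parts."""
    return f"{start:%y%m%d}.{sequence:02d}.{revision} {client}"


def parse_project_folder(name):
    """Return (date_code, sequence, revision, client), or None if the name does not fit."""
    match = FOLDER_RE.match(name)
    if match is None:
        return None
    date_code, sequence, revision, client = match.groups()
    return date_code, int(sequence), revision, client
```

A parser like this is handy for a nightly report of folders that do not follow the convention, which is gentler than rejecting them outright.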
Requiring and/or encouraging project naming conventions through the managers (at the behest of your manager/CIO/whomever, or just by pleading) might also be worth a try. One department out of 5 doing it would be better than none.
IMO, once you've reached this step, you can consider putting it in a CMS to help perpetuate/encourage the organization. But remember that a CMS is not a panacea, and might even complicate things further (ie, instead of navigating to a file, -everyone- just searches the whole index, slowing things down further).
~/ssh slashdot.org ssh: connect to host slashdot.org port 22: too many beers
A document management system is a must for that many documents. Check out Alfresco. It's open source and as such isn't outrageously expensive like its competitors. If setup seems too daunting for you, check out tsgrp.com. Technology Services Group is a consulting firm in Chicago with experience working with Alfresco and may be able to make this transition easier for you.
Depending on what you need to store, this might or might not be of help. Here where I work we use Atlassian's Confluence. We've created spaces for each team and then have pages for things like 'manuals', 'system documentation' or whatever, and attach the files to those. The attachment can then be linked to the page.
Or DMS. Commercial packages include Docs Open, and Soft Solutions.
Open Source DMS = http://mydms.sourceforge.net/
You could use Documentum. Not inexpensive. It can manage anywhere from 1,000s to 1,000,000,000 docs. Support for remote cache servers is available via several methods. Security, H/A, D/R, distributed docbases, and much more can address a very wide range of problems.
It sounds like they're heading for an epic fail. Aerospace == Process + CMS. They will never survive the NTSB audits and safety Nazi without both. They will need to prove the change trail for every nut/bolt/software path/data item/paper clip, and who authorised/designed/checked/tested it, for the rest of their natural lives. So if they don't have Process + CMS, they are screwed beyond belief. To me it sounds like a medium-sized software house that's decided to switch to aerospace because it's cool or high tech or the marketing guys sold some product.
Check out http://www.dspace.org/ Cheers m
$ whatis msft msft: nothing appropriate
I think you are in over your head with this issue. The short list of solutions will fail without clear backing from your executive management to provide "incentives" for users to help whatever new system you deploy be successful.
The flexibility of your user community will be a huge factor in this solution. The rigor with which documents contain useful metadata and whether the documents are specifically organized or just stored anywhere in the current system(s) will be factors too.
Here's the list of products that I'd start researching in order: ... something
- Xerox Docushare
- Alfresco
- MS-Sharepoint
- Filenet
- EMC/Documentum
There are other possibilities too, but if document versioning is required, be certain that capability is part of them at the start.
The simplicity of Docushare is simply amazing when compared to **all the other solutions**. It is worth the first look for anyone dealing with CMS/DMS.
I've deployed Docushare, Alfresco, Sharepoint, and Documentum. By far, users were happiest with Docushare and this was back in 2000. I can only imagine the progress that's been made since. It isn't the cheapest nor is it anywhere near the costs of the last two which usually require huge infrastructure and expensive per-user licenses.
Sharepoint had so many issues that it was worthless as a DMS. Heck, searching didn't work. It does have some other interesting features that can be useful in an open, trusting environment, but these are not useful when record level security is a requirement. It's been about a year since I saw sharepoint. It appears cheaper than Docushare and could be a good fit for trivial needs.
Like I said before, you are in over your head and need to hire a consultant to gather real requirements, learn your workflow, and help you select the best answer to trial in your environment. This isn't really the business that my company is in, but you can contact us at http://algoloma.com./ We aren't affiliated with any of the options listed above and these opinions are my own, not necessarily that of the company.
http://www.nasw.org/users/nbauman/txtsrch.htm
http://www.nasw.org/users/nbauman/lawdb.htm
http://www.nasw.org/users/nbauman/discover.htm
They were imaging and indexing up to several million documents. During a civil suit, in discovery, companies on each side of the lawsuit have to disclose every relevant document to each other.
Lawyers probably use the most flexible and all-encompassing systems, since they have to deal with every industry, every profession, everything. They also spend more money on their systems than most people can afford. They told me it costs them about $1 a page to thoroughly index big databases.
Information scientists told me the best model of a document database was PubMed, which indexes virtually every significant published medical article. http://www.ncbi.nlm.nih.gov/sites/entrez?db=pubmed
The big limitation of Google is that you can't search too well by date. Another limitation of text searches is that you can't search for concepts -- just words. Sometimes words (particularly names) match concepts very well, but if they don't, you've got a problem.
Yeah, it would have been nice if you had set up coding and naming conventions at the beginning, so the original authors could have sorted them as you went along. It may be difficult or impossible to go back and re-code them after the fact. It could wind up costing $1 a document. OTOH, you could be lucky -- some industries have been using standardized filing schemes and standardized jargon since the days of slide rules and T-squares.
There should be standard filing schemes and procedures throughout your industry, so your solutions may be industry-specific. There should be consultants that deal with your industry who would be happy to talk to you (for the prospect of maybe getting your business). There should be trade magazines in your industry that have covered the same issue for companies of your size. (Hell, if the price is right I'll write a roundup for them.) Or you might have a trade or professional association with some friendly people who have done it before. Trade and professional associations usually have a computer or information technology section, and if you're a member of the association, you can call up the members of the section.
You could have them hosted online, something like Google Docs.
Everyone is talking about document management software and search appliances. You're going about it all wrong...
Hire a document management staff.
Librarians. Hot librarians.
Back in the 90's I helped create a media department for a large textbook publisher. One of the first projects was an asset library and tracking system. To keep this message brief: we first needed a naming convention. Look for a constant throughout your products; ours was ISBN numbers. That became the main identity of the product/project and its main digital folder. Every item or product was dropped in a subfolder such as images, design, text, etc. From here the main folders were always scanned by Portfolio, and it was told/programmed that the main descriptions should come from the folder names. This allowed anyone with knowledge of the product ISBN to find details on the project. It also greatly minimized needless keyboarding of metadata onto the files. Portfolio then allows check-in and check-out (versioning) to stay abreast of any edits or updates. The whole metadata catalog would also be exported and brought into Filemaker for secondary backup. Look to a constant for your naming convention, keep it simple, look at ways to minimize keyboarding metadata, and go with over-the-counter products (they are much easier to work with and you can experiment; they are also more than capable of handling 100K documents). Last, good luck, and if needed look for help.
Why not use subversion? Files will be accessible using a subversion client (including log + history), as webdav (only current version) and through a standard browser (read-only).
The company I work for uses a system called Document Locator. It is a Windows-shell integrated document management system. Basically, it's as if you took Subversion and gave yourself extremely fine-grained control of repositories, folders and the like. It scales decently, too -- we have millions of documents spread across 25 major repositories, many of which include AutoCAD, Bentley Microstation, Smartplant 3D and other sizable files. The system is also fairly extensible, as we've built quite a few internal applications off of the DL system and there are plenty of third-party plug-ins available (a notable one being Brava, an application that allows adding QC and other markup to repository files). And if you don't want to be constrained to Windows, there is a web client available, which works decently. While it is not without its problems, the overall experience has been pretty good.
Full disclosure: My company is ColumbiaSoft's largest customer and, as such, we know a good deal of the development team.
My UID is a prime number. Yeah, I planned that.
My last company relied on a program called isys to index and search documents and email. You don't have to worry about what a document is named, just the type of content you're looking for. This solution can save a lot of time, especially if your users are good at phrasing queries. On the other hand, I did not have to maintain it, so I have no idea how much administration time was devoted to keeping it working.
Make love, not reality television.
I was on a team that implemented this for a very large aircraft engine company in the 90s. (Still cooking along today.) I'll outline what we did (overkill for you, but the principles are the same and the techniques may well be borrowed).
We had over 3 million drawings mostly E size scanned at 200 DPI in our spinning cache. Millions more on optical jukebox (10 inch write-only platters.) (when we were done, scanning was an early part of the project.)
What we were moving to was not saving the drawings but the 3-D solid models. We standardized on one CAD/CAM software solution. Sometimes we used others for conversion and interfaces, but get one standard. We did have a long-standing drawing number and naming convention, but when we were looking for files now we were getting into new territory, and the naming was breaking down because people start by screwing around and then get discipline later (except they don't).
We started by fixing what the process was rather than trying to fix the data first. We had a large network of Unix and Windows workstations and used AFS as the "official" file system where the final drawings were stored, because of its ability to create an abstract file structure with security and consistency. This would be your analogue to NAS. When files were issued, they became read-only and were stored under a serialized number, with the path in a database. (This is the data management system you don't have, but I used to call it the bag and tag system: the programs come down to a database with a number of searchable fields and a pointer to a filename path with a unique serial number identifying that version of the file.) To get a copy of a file, you log on to the database and find what you want, and it copies that serialized file to your local path and renames it with the proper drawing/part number.
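A rough Python sketch of the "bag and tag" idea, using SQLite purely for illustration. The table layout, the function names and the `/issued` store root are hypothetical; the system described above ran on AFS with its own database.

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the catalog of issued files."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS issued ("
        " serial INTEGER PRIMARY KEY AUTOINCREMENT,"
        " drawing_number TEXT NOT NULL,"
        " revision TEXT NOT NULL,"
        " stored_path TEXT NOT NULL)"
    )
    return db


def tag(db, drawing_number, revision, store_root="/issued"):
    """Register an issued file version and return (serial, stored_path)."""
    cur = db.execute(
        "INSERT INTO issued (drawing_number, revision, stored_path) VALUES (?, ?, '')",
        (drawing_number, revision),
    )
    serial = cur.lastrowid
    # The stored file is named after its serial number, not the drawing number.
    stored_path = f"{store_root}/{serial:08d}"
    db.execute("UPDATE issued SET stored_path = ? WHERE serial = ?", (stored_path, serial))
    db.commit()
    return serial, stored_path


def find(db, drawing_number):
    """List (serial, revision, stored_path) for every issued version of a drawing."""
    rows = db.execute(
        "SELECT serial, revision, stored_path FROM issued"
        " WHERE drawing_number = ? ORDER BY serial",
        (drawing_number,),
    )
    return rows.fetchall()
```

The point of the design is that users search by fields they know (drawing number, revision) while the store itself only ever sees opaque serial numbers, so nobody's naming habits can break it.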
We actually got into the drawing formats (the "frame" around the drawing where the drawing number goes) and turned everything in them into parameters. When you were going to save the files for sharing and add them to the formal process, the drawing format forced the proper drawing number and other official info (engineer, drafter, issued by, etc.); all were parameters. When you saved the file, it created the filename based on the drawing number and part.suffix. This took care of standard filenames (not paths yet). We then created a script (actually, I did) using Perl and Tcl/Tk that did all the legwork of a simple electronic sign-off system. We had access to the full electronic sign-off systems and found them too inflexible and, in general, a nightmare to use. Our engineers had good discipline WRT drawings and when to issue them, so our system made use of that with a few users in certain roles. An engineer could theoretically sign off on someone else's model, but no one would unless authorized. So we had several roles. When someone wanted to issue a drawing, they chose whom to notify by name and position, so they usually knew who needed it, and if that person was out with someone else covering, the cover could still pick it up and move it along without the bureaucracy.
Once signed off, the files would be opened by script to get the metadata parameters from the drawing format which was then put into the proper place.
We also ran into a problem where, even when a file was only opened, looked at and closed, the file contents changed because metadata was automatically saved to show that the file had been opened. We worked with the software company to stop that behavior. We then went to a POSIX checksum program to test whether the checksum of the file on the system matched the checksum of the local file; no match means a change was made.
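The same check is easy to sketch in Python. Note the swap: this uses SHA-256 from the standard `hashlib` module rather than the POSIX checksum tool we used, and the function names are just for illustration.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to spare memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(stored_path, local_path):
    """True when the local copy no longer matches the stored (issued) file."""
    return file_digest(stored_path) != file_digest(local_path)
```

This only works if merely opening a file does not rewrite it, which is why the auto-saved metadata had to go first.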
I'm way overdoing this, but I guess I'm saying get discipline by finding out where discipline already exists (there must be some somewhere) and hooking into that discipline by automating in software what is stored and where. Then start fixing what you have. Otherwise you're herding cats forever. Also, it starts to take care of itself, because the hot stuff is getting used and revised, so order starts to come to old files because of the new discipline enforced in software.
OK, so everyone else said "google appliance" for a reason, but here are some solutions that do something similar:
Use Lucene to index your data ( by The Apache Foundation, so you know it's good ) - http://lucene.apache.org/java/docs/
Use Droids ( to crawl your existing data ) - http://incubator.apache.org/droids/
Use Tika ( to make your existing document formats into an index-able format - excel, word, powerpoint, gzip, bzip, zip, tar, mp3, xml, html, class, jar, odf, plain text, pdf, rtf - all supported by default. ) - http://lucene.apache.org/tika/
Use Solr ( high performance search server built using Lucene Java, with XML/HTTP and JSON/Python/Ruby APIs, hit highlighting, faceted search, caching, replication, and a web admin interface. ) - http://lucene.apache.org/solr/
Optionally you might also find Forrest useful. - "Apache Forrest™ software is a publishing framework that transforms input from various sources into a unified presentation in one or more output formats." ( http://forrest.apache.org/ ) - it's designed to work with Solr and Droids. :-)
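For anyone wondering what "index your data" buys you, here is a toy inverted index in plain Python. To be clear, this is not Lucene or Solr and uses none of their APIs; it only illustrates the basic word-to-documents lookup those tools build on, with made-up function names.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

File names never enter into it: a search for "wing report" finds the document whatever it was called, which is the whole argument for indexing over renaming.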
One of your options is to use Salesforce Content, which is a very usable content & collaboration piece from salesforce.com. It's fully wired in to the rest of the force.com platform and CRM apps suite too, so if you're looking to build out more of your company's apps in the cloud, it's worth taking a look at it. http://www.salesforce.com/crm/marketing-automation/document-content-management/
After looking at backup systems and maintaining libraries of data
our company found that we needed something that fit our needs.
We designed a system that worked and knuckled down to programming it.
We now have a search-able database of documents and files with attributes
as well as context from content for over 20 years of data and documents.
We can pretty much find any file in less than 5 minutes.
We could still make it better but we sure couldn't have done anything like
it C.O.T.S., Google included.
If Google failed tomorrow, where would your documents be then?
If you're using some kind of *nix machine to host smb, you could write a perl/bash script to find all files in/under a directory, parse the file name, make it all lower case, turn spaces into _, take whole words and add them to the metadata (for Mac OS X Spotlight or any indexing), and use an index server of some kind...
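Here is roughly what that script could look like, sketched in Python rather than perl/bash. It is a dry run that only reports the renames it would make. The function names are made up, and writing the keywords into Spotlight or another indexer is left out because that part depends on the platform.

```python
import re
from pathlib import Path


def normalized_name(filename):
    """Lower-case a file name and turn runs of spaces into single underscores."""
    return re.sub(r" +", "_", filename.strip().lower())


def keywords(filename):
    """Split the stem of a file name into whole words usable as metadata tags."""
    stem = Path(filename).stem.lower()
    return [word for word in re.split(r"[^a-z0-9]+", stem) if word]


def plan_renames(root):
    """Yield (old_path, new_path) pairs for files under `root` whose names would change."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            target = path.with_name(normalized_name(path.name))
            if target != path:
                yield path, target
```

Review the plan before renaming anything: two files such as "Report.doc" and "report.doc" would collide after normalization, and existing links to the old names will break.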
establish a naming convention. come up with a few simple rules regarding:
file names
directory names
customer names
job/project names
department names
limit the number of total allowable characters in a file name, and publish and distribute your rules in an easy to follow cheatsheet. for example:
all files for client "Smith Inc." reside in a directory named "SI"
all files for Smith Inc for project "Widget X" reside in a subdirectory named "WX"
all files for Smith Inc for project Widget X have a unique number generated by your accounting system
all files generated by the sales department need to have "S" after the project number
enforce using file name extensions for all file types
so a powerpoint deck created by the sales department for a sales pitch to smith inc for Project X with an internal job number of 1234 would be named "SIWX1234_S.ppt".
a well structured naming convention with simple but rigid rules will allow users to navigate a file system to find files and identify wrongly filed assets.
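as an illustration, the rules above can be sketched in a few lines of Python. the two-letter client and project codes, the helper names, and every department code other than sales ("S") are assumptions for the example.

```python
import re

# Hypothetical department codes; only the sales code "S" comes from the rules above.
DEPARTMENT_CODES = {"sales": "S"}

# <client:2 letters><project:2 letters><job number>[_<department letter>].<extension>
NAME_RE = re.compile(r"^([A-Z]{2})([A-Z]{2})(\d+)(?:_([A-Z]))?\.(\w+)$")


def build_name(client_code, project_code, job_number, extension, department=None):
    """Assemble a file name such as 'SIWX1234_S.ppt'."""
    suffix = f"_{DEPARTMENT_CODES[department]}" if department else ""
    return f"{client_code}{project_code}{job_number}{suffix}.{extension}"


def parse_name(filename):
    """Split a conforming file name into its parts, or return None if it does not conform."""
    match = NAME_RE.match(filename)
    if match is None:
        return None
    client, project, job, dept, ext = match.groups()
    return {
        "client": client,
        "project": project,
        "job": int(job),
        "department": dept,
        "extension": ext,
    }
```

a checker like `parse_name` is what makes the rules enforceable: anything that returns None is a wrongly named or wrongly filed asset.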
invest in a digital asset management system with a database backend.
there are many DAM systems available both commercially and opensource.
utilize one that has a web front end, so you can enforce consistency in end user experience (as opposed to a fat client). embed metadata into the files themselves in XML format thru the DAM if possible.
based on the naming convention you've established and the DAM system you've deployed, you should be able to track when a file was created, modified, and last accessed. establish rules regarding when a file moves from disk to tape, and from online tape(in jukebox) to offline tape(out of jukebox), to cold storage(offsite).
three can keep a secret, if two are dead - benjamin franklin
I know that I'll probably get verbally lynched for saying this here, but MOSS 2007 enterprise search is a REALLY nice way of dealing with this. Since MOSS can index your file shares, all of your users can search for documents contextually using a simple web portal across multiple sites... I better leave before I'm hanging from the Slashdot tree.
http://en.wikipedia.org/wiki/Document_management_system
A DMS sounds like it is what you could use in this case.
I decommissioned a document management system at my client, a smallish law firm, because the system was too complicated, insecure, and expensive. Updating it to run w/ the latest version of MS-Office would have cost thousand$ just for the s/w. We replaced it with Google Search, and we defined a file hierarchy and naming convention for all documents created after the switchover. Client is very happy, their file access is more efficient, and they saved a bundle of money on administration, not to mention all the h/w and s/w they never bought.
Obviously documents are the lifeblood of any law firm. These guys only have about 100,000 or so, less than the aerospace company in question, but the lesson applies. It's extremely unlikely the IT admin of the aerospace company has the resources to manage, much less install, a proprietary document management system.
The ONLY reason to have a formal document management system with a database (like Microsoft SQL *ugh*) is to control access. But access control is something that really, really should be done through the directory. So unless you're NASA or another organization with many, many millions of documents and a legally mandated auditing requirement, there's no reason to make this more complicated than necessary. And even then....
Of course, if we're talking about images with no searchable text, that's another story.
-- "The only thing that is ever new in the world is the history you do not know." -- Harry Truman
I don't at all mean to be pat or facetious with such a short answer. But, seriously, you're asking the wrong crowd. Librarians have master's degrees in answering just the question you're asking, and it goes far beyond just books. A couple of dozen hours of consulting contract with a good librarian can set you straight - whether you keep the samba store or you pony up for document management software. Because if you have a strategy for organizing your information and execute on it, you will reap benefits that don't show up on any productivity spreadsheet. And a good librarian will tailor the system to how the people in your organization actually use the information. Get an internship program going with a library school to have someone remotely do the cleaning and maintenance every once in a while. Whole thing should be doable for a few grand.
You need to deal with this issue on multiple points
1. Consider PDF with OCR. That way you can search within files for specific words
2. File naming. Use a standard like date_headline.pdf
3. Hire a library sciences major, as an earlier poster suggested. They spend years studying how to organize and retrieve.
Easy, just rename all the files with an 8-digit index number and provide an Excel spreadsheet with the index number and a description of the file!
Airbus? Is that you?
Outside of shoring up your connectivity to the remote site, you should use the structure of your company to your advantage.
It sounds like the wild west. You gave everyone full RW access to the fileserver.
Build a file structure that mirrors the organization of the company and apply permissions appropriately.
Map drives in the same fashion. An added advantage to this, you can split the files across separate Samba servers later with a minor map change.
The finance department has no reason digging around in your design documents.
The engineers don't have any reason to poke around in your sales collateral.
Does everyone in the company need to be tempted to open "DOD_GPS_NOYB_47-090611.xls"?
Getting every employee to adhere to a single naming convention is like herding cats. Delegate responsibility to the directors and managers to keep their areas on the server organized to their own needs. Then you just need to deal with the occasional outlaw.
You may also want to deploy Samba servers to the local offices and back them up to a central server regularly. Use this for personal shares and anything that is primarily used ONLY in the local office.
In most cases, I doubt that "the single person" working on Project X at Remote Site A needs to work off of a centralized copy of their document. Do you really need to share this document across your entire organization? Let the employee keep their file on the local office's share. Let an employee or a manager share it with the entire department. Let the director share it with sales.
In the end, you may find one small part of the organization that REALLY needs a naming or numbering convention. You can address that when they approach you. For now, you need to stop everyone from treating the company share like their own desktop.
It's called a "database". You might want to look into it.
As a PC user, I have found one of the best products to manage hundreds of thousands of documents (*.doc, *.txt, *.wpd, *.xls, *.ppt, and email, images, etc.) is Isys by Odyssey. It requires very little work on the part of the endusers. Just searching. For the IT person, it requires very little to be up-and-running. You can set up automatic indexing to run anytime, without restricting usage and searching. This can be done across all hard drives. I found this little company (and their software) about 15 years ago when I was still using DOS. They have, of course, developed their software to match all the Windows versions that have come out, and have Web versions also. I manage a huge library of both physical and digital documents - all that must be located within seconds. Without this software, I would not be able to perform this job in the high-level capacity that I currently do. Yes, Google is a great contender, but it has its limitations. Google desktop, for example, does not index all different types of software that the hundreds of users may have/use/need. I have found the Isys by Odyssey to not only be extremely fast, high quality, but they have great customer service, and their prices are reasonable. You can always start slow - with a low number of licenses, and work your way up, depending on the company's finances and needs. We have 2 licenses, where I work. I currently am the main end-user to the product, and people request documents or information from me, which I can find and email to them in an instant. It's worth the time to check them out. Their home web page is: http://www.isys-search.com/
The content engines like IBM/FileNet are set up to manage millions of documents. Many also have the ability to add remote cache servers to improve local performance for repeat document access in satellite offices. Contact Dave at Softech-assoc.com if you need help.
KnowledgeTree or Alfresco. Open source and no charge.
Search engine
Move sig!
There are typically 2 approaches to this. One option for the files is to pull them into a document management tool like Enterprise Vault or Documentum, to name a couple. Those applications will help classify content and rein in administrative controls. As for enhancing the speed to remote offices, there are two options. One option is something like Microsoft's DFS or the Andrew File System. These file systems spread data files to where they are needed. As well, some of the storage array vendors have caching appliances or capabilities in their gear. In that case you'd have a smaller remote storage array that acts as a read-through cache to the central storage array where files are managed. But for CIFS traffic that gets pretty complicated because SMB, at least in 1.x, is a persistent connection. Option B for remote user performance enhancement might be to look at some packet-level de-duplication technology like Cisco WAAS or Riverbed. The WAAS device is really cool because it has a disk cache in it that holds frequently requested information (thus acting as a quasi file cache). The nice part about these things is you don't have to back them up or worry about managing the content.
How about getting a real document management system, with version control, unique document numbers, and structured metadata?
Kronodoc [www.kronodoc.com], Documentum, or something along those lines
Some of the suggestions above say that you should just chuck everything haphazardly into a big pile and then use search engines to trawl the whole mess. I don't buy that. Instead, (like some others) I'd suggest a proper content management system such as the ones from http://www.alfresco.com/, http://www.interwoven.com/ or http://www.hummingbird.com/.
The reason for this suggestion is that I know that these systems are being used by organisations which handle, as OP said, hundreds of thousands of documents and which have satellite offices (e.g. large multinational law firms). They provide several benefits: you can structure projects, have both project-related documents and e-mails saved and indexed in the project folders, search them, and keep proper document version chains (meaning that you can revert to older versions of documents if some klutz breaks a newer version).
Of course, this means quite an investment, a learning curve for everyone at your company and, most likely, the hiring of an individual with experience of the chosen system.
Firstly, you can absolutely forget about any system that requires users to name documents in a way that is descriptive, consistent, unique or anything else that a sane person would do.
Secondly, MacOS X Spotlight Server (as of version 10.5.7) doesn't work as one would expect/hope. Users' files stored on the server get indexed by the server, but this index can only be read by users logged in to the server console (or via ssh), not by clients that access the files by mounting them as shared volumes. If a client wishes to search the files, it must build its own index over the network. The workload on the server/network can cause severe performance issues until the clients have built their indexes, a process that will take hours and may take days to complete if you have a lot of files.
Namgge
Dark Green have just this week gone live with Gina2, a web solution for document archives.
Have a look at http://www.gina2.net/ - the text is currently in German, but the English translation will be up there in the next couple of weeks.
Dark Green are offering Gina2 as a hosted service for companies whose core business is not managing IT infrastructure.
I work for a company that stores terabytes of documents. There are two products that do this well: EMC's Documentum and Microsoft SharePoint. Pick your poison depending on whom you want to abuse you.
I've written a recruitment app that has 200k resumes and other types of folder indexed in text.
The files live on the disk in /TYPE/YEAR/MONTH and are converted to text and inserted into a MySQL database.
They can be searched on name, date, record id, free text, type, etc.; or just browsed to on disk.
The front end is PHP on MySQL.
These were imported from a files-on-disk approach.
It can scale with master slave replication, etc. Just keeping it simple helps.
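A rough sketch of that layout, for anyone who wants to try the idea. This is not the poster's PHP/MySQL code: it uses Python with an in-memory SQLite database as a stand-in for MySQL, and the table, column, and function names are all made up for illustration. Text conversion is assumed to have happened already.

```python
import sqlite3
from datetime import date
from pathlib import PurePosixPath


def storage_path(doc_type: str, added: date, filename: str) -> PurePosixPath:
    """Build the /TYPE/YEAR/MONTH/filename location for a document."""
    return PurePosixPath(
        "/", doc_type.upper(), f"{added.year:04d}", f"{added.month:02d}", filename
    )


def open_index() -> sqlite3.Connection:
    """Create the (hypothetical) index table; SQLite stands in for MySQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY,"
        " doc_type TEXT, added TEXT, path TEXT, body TEXT)"
    )
    return conn


def add_document(conn, doc_type, added, filename, body) -> int:
    """Record where the file lives on disk plus its extracted text."""
    path = storage_path(doc_type, added, filename)
    cur = conn.execute(
        "INSERT INTO documents (doc_type, added, path, body) VALUES (?, ?, ?, ?)",
        (doc_type.upper(), added.isoformat(), str(path), body),
    )
    return cur.lastrowid


def search(conn, text, doc_type=None):
    """Free-text search on the body, optionally narrowed by type."""
    sql = "SELECT path FROM documents WHERE body LIKE ?"
    params = [f"%{text}%"]
    if doc_type is not None:
        sql += " AND doc_type = ?"
        params.append(doc_type.upper())
    sql += " ORDER BY id"
    return [row[0] for row in conn.execute(sql, params)]
```

The database only holds paths and text, so the files stay browsable on disk exactly as described above. A LIKE scan is the simplest thing that works; a real full-text index would be the obvious next step at this volume.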
Go to Google main page and look for business solutions. They have a scheme where they'll charge you x dollars to index y hundred thousand documents, and they throw in the tinware (a custom pre-configured rack of search hardware, very scaleable) for you to plug into your LAN. All strictly inside your firewall. Set it up to crawl all your file shares and it won't matter whether you have a document management system or not. Most document management systems depend on keywords, taxonomies and special file name codes, all of which are decidedly old-hat. Index it and let 'em go search. The smallest version is kind of basic, but go up one level and they'll crawl pdf's, word docs, pretty much anything with text in it compressed or in source libraries or whatnot. They're pretty good. Not cheap, but then you're an aerospace firm...
Do not mock my vision of impractical footwear
Google is just a search engine. They need document management. :-) Correct me if I'm wrong, but isn't content management what they need?
Import it into some CMS, sort it and make it available through a password-protected website. We did something like this for http://www.olympus-ims.com/ (but these are public documents) and it really contains thousands of documents (in a dozen languages); together with all the document revisions it comes to over a hundred thousand documents. Easy to search, easy to navigate, easy to manage.
Simply: CMS is what you need. Do research.
Well, I've got to get back to work. When I stop rowing, the slave ship just goes in circles.
*not knowing what format your docs are written in*
- Write a script(s) (Python, Perl, ... whatever) which goes through docs, converts them and uploads them to a MediaWiki installation (http://www.mediawiki.org/wiki/API).
- Categorize your docs based upon the directory names.
- Teach your people how to write their documents in MediaWiki syntax.
- Everything is web based, which makes less overhead on the network for remote offices, simplifies management.
- MediaWiki is a controlled document system, with detailed history.
- MediaWiki is FREE and has WikiPedia (and more) as a reference.
- Check http://www.smetj.net/wiki/wikiinject for some ideas
- http://openwetware.org/wiki/Converting_documents_to_mediawiki_markup
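The first two steps of the list above could start out something like this. It is only a sketch: it assumes the documents have already been converted to plain text, it derives categories from directory names, and it builds the parameters for a MediaWiki `action=edit` API request without sending anything. The function names are invented for the example, and obtaining the edit token and posting the request are left out.

```python
from pathlib import Path


def categories_from_path(relative_path: Path) -> list[str]:
    """Turn each directory in the path into a category name."""
    return [part.replace("_", " ") for part in relative_path.parent.parts]


def to_wikitext(body: str, categories: list[str]) -> str:
    """Append one [[Category:...]] link per directory to the page text."""
    lines = [body.strip(), ""]
    lines += [f"[[Category:{name}]]" for name in categories]
    return "\n".join(lines)


def edit_request(relative_path: Path, body: str, csrf_token: str) -> dict:
    """Parameters for a MediaWiki action=edit API call (not sent here)."""
    return {
        "action": "edit",
        "title": relative_path.stem.replace("_", " "),
        "text": to_wikitext(body, categories_from_path(relative_path)),
        "token": csrf_token,
        "format": "json",
    }
```

For a file at `Engineering/Wing_Design/spar_notes.txt` this yields a page titled "spar notes" filed under the categories "Engineering" and "Wing Design". Converting Word or PDF content into wiki markup is the hard part and is not covered here.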
Several years ago I worked for a NASA project called the National Technology Transfer Center. A big part of the job there is organizing and searching through tens of thousands of pages of research documents. They used a document oriented database at the time although they may have migrated to something else since then. You might want to contact them for advice.
A friend of mine was the person primarily responsible for scanning in the documents. IIRC, the process involved OCR of the scans for key word search and indexing and then storing a compressible graphic image of the page - this got them around the problem of text databases not storing technical drawings, etc.
Time's fun when you're having flies. - Kermit the Frog
Since you mention you are an aerospace company - are you managing CAD documents? These can be a headache because of all the dependencies - assemblies, parts, drawings. CAD data can also have some very funky naming conventions - especially the older systems like CATIA V4 and CADDS 5.
We have developed a distributed doc mgmt/vaulting system based on Open Source technologies (Apache, MySQL, Perl, etc...) called DMXchange - that we currently market and sell as a product with services. All of the source code is included and open.
Given that you are looking to access the documents from many sites which are connected over a WAN - most of the client/server based approaches will not work very well. For more info see www.dmforge.com
Samba shared over a VPN? Man, you are asking for no end of painful trouble. There are many good ways of sharing docs, but putting MS docs in a filesystem shared over a VPN is not one of them. A simple way to improve things would be to drop all the filesystem sharing and create some sort of searchable index on a web server. If you want more sophistication and have money to burn (who hasn't these days?), go and talk to Oracle, they have some very good software for this very purpose.
I don't know why companies always do it this way - it is the worst possible way of organizing your documents. When you put them in a filesystem, people have to try to remember how to find the one they need; a directory is like a hierarchical database, badly implemented. Sharing it via a networked filesystem makes it even worse, because now you have a huge network overhead and the risk of undetectable corruption when the network stumbles. And the VPN means that your network traffic is something like 10 times as heavy because of the encryption.
The latest installment of Visual Source Safe is pretty good. They improved the performance over the network, which used to kill it on a domain spread across multiple cities (back in the VB6 days), and it is now a really good repository tool. I also used another, but it lacked the history/detail section and could only keep a max number of files....seeing as you have hundreds of thousands
for what you need I suggest Alfresco. It has indexing, and publishing options (CIFS, WebDAV, etc). And it is extensible.
We are implementing it right now in our company.
Smeadsoft might work for you. -- note: I don't work for them or any affiliate of theirs, and have no vested interest in them being used --
I'm in the process of setting up one of their systems for document management, and it seems to be quite capable of that. It's not open source and it would involve some cash to set it up, but I think it's worth looking into if those two things don't eliminate it from consideration. (They also handle management of physical files, which is where they came from.) Thus far, setup has involved building a lot of framework and tags for the actual documents, and scanning a lot of physical files to be stored. There is this system of using large scanners with something called VRS, and putting barcode identifier sheets in with stacks of documents.
So for example you could have a large stack of papers, of which half belong to one category (or subcat or subsubetc), the others to a second. You put barcode sheet (a blank paper save for one barcode) for the first category, then all those papers, then a barcode sheet for the next category, and so on. You load them into the scanner (obviously a high capacity one) and it reads them all and puts the scanned documents into the proper location in the database automatically.
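The separator-sheet logic is simple enough to sketch. This is just an illustration of the idea, not how the Smeadsoft or VRS software actually works: it assumes the scanner hands back pages in scan order as (page id, barcode) pairs, with the barcode set only on separator sheets, and the function name is made up.

```python
from typing import Iterable, Optional


def file_by_separator(
    pages: Iterable[tuple[str, Optional[str]]],
) -> dict[str, list[str]]:
    """Group scanned pages under the most recent barcode separator sheet.

    Each item is (page_id, barcode); barcode is None for ordinary pages
    and a category code for a separator sheet.
    """
    filed: dict[str, list[str]] = {}
    current: Optional[str] = None
    for page_id, barcode in pages:
        if barcode is not None:
            # A separator sheet switches category; the sheet itself is not filed.
            current = barcode
            filed.setdefault(current, [])
            continue
        if current is None:
            raise ValueError(f"page {page_id!r} came before any separator sheet")
        filed[current].append(page_id)
    return filed
```

Refusing pages that arrive before the first separator is a deliberate choice here: a stack loaded without its barcode sheet gets flagged rather than silently filed somewhere wrong.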
"Waste not one watt!" - CZ
Onbase from Hyland
I would go with Worldox. It allows remote branches to search documents across a WAN and provides security too. It does not use a SQL database (i.e. no expensive licensing).
Don't try to use the file name or directory structure. This is difficult to adapt or relocate as the namespace becomes distorted from its original content over time.
Try this instead:
Assign arbitrary file names.
Adopt a directory structure derivable from those names, if you must.
Build a database of several tables to link keywords, project names, authors, etc, to the arbitrary file names.
Award small prizes for verified corrections to the database.
See http://en.wikipedia.org/wiki/Content-addressable_storage for more information.
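A minimal sketch of the scheme above, under a few assumptions of my own: the "arbitrary" name is a SHA-256 hash of the file contents (one possible choice, in the spirit of content-addressable storage), the directory is derived from the first characters of that name, and an in-memory SQLite database holds the keyword tables. Table and function names are hypothetical.

```python
import hashlib
import sqlite3
from pathlib import PurePosixPath


def arbitrary_name(content: bytes) -> str:
    """Name a file after the SHA-256 hash of its contents."""
    return hashlib.sha256(content).hexdigest()


def directory_for(name: str) -> PurePosixPath:
    """Derive a two-level directory from the name itself, e.g. ab/cd/abcd..."""
    return PurePosixPath(name[:2], name[2:4], name)


def open_catalog() -> sqlite3.Connection:
    """Create the tables linking metadata to the arbitrary names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (name TEXT PRIMARY KEY, original_name TEXT)")
    conn.execute("CREATE TABLE tags (name TEXT, kind TEXT, value TEXT)")
    return conn


def register(conn, content: bytes, original_name: str, tags) -> str:
    """Store a file's arbitrary name with (kind, value) tags such as
    ("project", "X1") or ("author", "jsmith")."""
    name = arbitrary_name(content)
    conn.execute("INSERT OR IGNORE INTO files VALUES (?, ?)", (name, original_name))
    conn.executemany(
        "INSERT INTO tags VALUES (?, ?, ?)",
        [(name, kind, value) for kind, value in tags],
    )
    return name


def find(conn, kind: str, value: str) -> list[str]:
    """Look up file names by one metadata field."""
    rows = conn.execute(
        "SELECT DISTINCT name FROM tags WHERE kind = ? AND value = ? ORDER BY name",
        (kind, value),
    )
    return [row[0] for row in rows]
```

With hashed names, identical files collapse to one entry, and renaming or moving a project changes only database rows, never the files themselves.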
Teamcenter. It freaking rules. Also, as evil as StarTeam is, it will do the job for you as well.
I have been a user/admin of both Teamcenter and StarTeam.
Hire a librarian. Seriously. Get someone in there with a degree in library science, and let them do their thing.
Organizing a large collection of related documents would be right up their alley...
Look at www.Blinkedm.com
It's easy to use, offers a lot of features and far less expensive than a lot of the products out there.
We use EMC's Documentum suite here to manage our large volumes of documents. Expensive, but works great...and integrates with Fax software, MS Exchange, etc.
Similar to CamelCase. Limits the number of variations on the same name considerably (no: camelcase, Camelcase, Camel case, Cam El Case,...)
Reminds me of the command 'passwd' in *nix, I always have to 'apropos password' to find the correct spelling. Why is it not 'password' or 'psswrd'? Arbitrarily dropping 1 vowel and 1 consonant is silly.
What you need is a content management system. Such systems do more than store and find documents. They allow true document taxonomy management, records management for compliance and control, and many other features.
I personally specialize in IBM Content Manager. It's great for companies like yours where you have distributed offices. You can keep your metadata at one central location but have the documents themselves stored at your remote locations, all while maintaining centralized control.
Doug Hansknecht
Certified IT Architect
DougFromOhio@us.ibm.com
...lotus notes teamrooms
To help you with the challenge of sharing documents with your remote sites, there are universal web-based document viewers on the market that you can use to embed document viewing capability into your intranet or web site. The documents can be of different file formats too, they don't all need to be PDF. Some options use Adobe Flash, so a plug-in needs to be downloaded by the end user, but other options do not. Adeptol and Vuzit are two examples, but if you search for "online document viewer" in Google you'll find a number of options.
I'm surprised that there were quite a few programs not mentioned on the DMS wikipedia page -- People might consider them to be more repository software than DMS (or RMS), but some other ones to mention that would be useful for managing already existing documents:
And if you're looking for librarians with an IT background, in the libraries they're called "Systems Librarians". You might also check out the oss4lib and code4lib communities.
Build it, and they will come^Hplain.
It seems that the first response to a request like this is to suggest new technologies and programs to solve the problem. It sounds, though, like 95% of the problem is that there are no procedures or organization in place already, so that files have a purpose or place to go. A good file storage policy with the appropriate instructions sent to the users could just as easily make this work going forward. I've seen collections of millions of files that were perfectly fine as they were organized by user, purpose, source, destination, etc...and then subdivided as needed...and users knew what the organization was and how to maintain it (to their own benefit as it means they can find their own stuff). You can also institute a more structured system where organization is already there for them to use, but it's your call. ALWAYS ALWAYS ALWAYS figure out how you want things to be organized first! What are the functions of these files, why are they saved, who created them, who accesses them? This will make the job of sorting the mess out easier.
Species.
In terms of making the WAN experience less painful, you need to get some WAN optimization appliances in your network:
http://en.wikipedia.org/wiki/WAN_optimization
http://www.riverbed.com/
I've found solr (http://lucene.apache.org/solr/) super easy to install and very effective.
I can't believe no one has suggested Solr yet! It's probably the most flexible and mature search product available and it's free (part of the Apache project)! Set up a solr server and hire a student or three to meta-tag your documents.
Related question, has anyone used, or would recommend using IntraLinks to help manage a similar scenario?
HIRE A REFERENCE LIBRARIAN! Seriously. You have all sorts of ad-hoc suggestions here, none of which addresses the core issue: You have a metric shitload of written, unorganized data. There is a category of professionals who specialize in organizing, cataloging, abstracting and making written data available in easily-usable formats. Reference librarians. They even use IT extensively. Check ala.org for more.
(Besides, some real-life reference librarians are hawt - just not where you live.)
Use Knowledge Tree. Download a free version from http://www.knowledgetree.com/
Here's a nickle, kid. Get yourself a real computer.
The simple solution is to put them all into a GOOD filesystem that keeps journaling, metadata, and file indexing services, and simply have the person search the Index for the file they want. Then you don't have to deal with creating hierarchies.
If you don't want to go the google appliance route, Notes works great, is cheap to set up, and simple to administer.
One db.
One form with a couple of fields
One view
Render to the web
Write a simple agent that crawls your directory structure, snags the files and attaches each one to a Notes doc. Stuff in the directory/file name if you care.
Let Notes build an index (and it can index damn near any file).
Poof - done.
Remove user's rights to leave crap in file directories and make 'em put new stuff into Notes and you have something that's maintainable without a ton of work.
If you then want to get fancy, you can make users enter some meta data before they can save new docs.
You can set up access control, etc, etc, etc.
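To show what the crawl-and-index steps amount to, here is a small sketch in plain Python rather than Notes. It is an analogy only, not Notes agent code: it assumes the files are plain text, it keeps the directory/file name searchable alongside the contents, and all names in it are invented for the example.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Iterable, Iterator

WORD = re.compile(r"[a-z0-9]+")


def crawl(root: Path) -> Iterator[tuple[str, str]]:
    """Yield (relative path, text) for every .txt file under root."""
    for path in sorted(root.rglob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        yield path.relative_to(root).as_posix(), text


def build_index(docs: Iterable[tuple[str, str]]) -> dict[str, set[str]]:
    """Map each word to the set of documents that contain it."""
    index: dict[str, set[str]] = defaultdict(set)
    for rel_path, text in docs:
        # Index the directory/file name as well as the body.
        for word in WORD.findall(f"{rel_path} {text}".lower()):
            index[word].add(rel_path)
    return index


def search(index: dict[str, set[str]], query: str) -> list[str]:
    """Return documents containing every word in the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(word, set()) for word in words))
    return sorted(hits)
```

Usage would be along the lines of `search(build_index(crawl(Path("/some/share"))), "spar fatigue")`, where the share path is a placeholder. The point of a product like Notes is that it also handles Word, PDF, and other formats, which this sketch does not.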
Documentum costs about a quarter mil just to get it in the door and a boat load of cash to make it useful. (at least it did in the late '90s).
Notes server license a couple grand. If you need user authentication, it's around $150/client (ask your rep for prices because IBM is working tons of price schemes). If you don't need authentication, all you need is the server license.
Where I work we wrote our own Document Management System that integrates with the rest of our systems. The integration has proven quite beneficial. Off-the-shelf systems can integrate but it generally doesn't work very well. Anyway, we were looking at using SharePoint as our back-end to get the indexing support and improved versioning. What we discovered is that SharePoint just doesn't scale very well. When you get into the hundreds of thousands of documents it has problems. When you get into the tens of millions it has major problems.
Given that the submitter already needs to file 500,000 documents I question if SharePoint is feasible.
i'm surprised so many folks here would jump to a goog appliance conclusion here. it's one of many search-only answers.
part of the problem is text search. the other part is how, why and when the docs ended up in these directories in the first place. a file directory alone does not a solution make. you need something that can hold and search your current directory based docs, but get you past this obviously inappropriate way of doing things.
merge blog (journal what's going on and why one may care about a document) and wiki (stable organization of docs over time) and better-than-google search - and you have a solution. www.tractionsoftware.com has an approach for this. there may be others that satisfy as well.
I think that Microsoft Office SharePoint Server 2007 is what you are looking for.
(http://sharepoint.microsoft.com/Pages/Default.aspx)
There is also a more robust application called Documentum by EMC2.
(http://www.emc.com/products/category/subcategory/collaboration-and-document-management.htm)
If you need a consultancy service to help you, please contact us: www.iteris.com.br
Good luck!
Consider Office Evolve by Documatics. They have a system that organises your directories into projects, provides fully indexed searching of all your documents, caters for document generation from templates, keeps a complete history of all your documents, integrates with Outlook and manages workflow. It's in use at GE. We love it.
OnBase from Hyland Software
Not a student. This is not a summer job. Even if someone at your office has nothing else to do, they will not be able to do a better job than a pro.
the real answer is of course to start to migrate to a model driven environment based on ontology and UML/SysML
word, excel, ppt etc were all designed to automate paper. in a truly digital world these will go the way of the computer drawn blue print (now replaced with 3-D models). information must go the same route.
if you look at your customers (Lockheed Martin, Northrop Grumman etc) they are all on some path to move to this paradigm.
Documents will simply become a proxy representation of a model, like a 2-D plot of a 3-D model.
some day designs will be on source forge in UML, SysML and OWL and other sources. just like code.