Best Way to Build a Searchable Document Index?
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?
You're the one gettin' paid, you figure it out.
Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.
At the previous place I worked, we had a Google Mini, and it was better than anything we had come up with in-house.
We even pointed it at the web-CVS server and Bugzilla, and it was great at searching those too.
To see all the bugs still open against v2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.
It only cost something like $3k, IIRC.
Not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.
It will cost you some bucks, but you could just buy MS SharePoint Portal Server and leave the indexing to SharePoint.
You're not even really required to add tags (most people will put in poor tags anyway).
But if you like, you can add tags even with SharePoint.
I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
I'd suggest you consider a full-text search engine. Start here:
http://en.wikipedia.org/wiki/Full_text_search
If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.
Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.
Xapian seems to be based on somewhat more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based bindings for many languages. It might not have as polished a front end or as fancy a website as Lucene, but I believe it's a better choice if you have really, really huge data sets or want to venture outside the Java universe.
There are also many self-contained indexers, mostly built around web indexing, with spiders, query forms, etc. all bundled together: ht://Dig, mnoGoSearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
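For anyone wondering what a "raw indexing engine" actually does underneath, here is a toy inverted index in Python. It is only a sketch of the general idea, not how Lucene or Xapian are implemented: it assumes the documents have already been converted to plain text, the sample file names and contents are made up, and it does no stemming, ranking, or phrase matching.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term (AND search)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample documents, already reduced to plain text.
docs = {
    "vpn-setup.doc": "How to configure the VPN client on agency laptops",
    "backup.pdf": "Nightly backup schedule for the file server",
    "laptop-policy.html": "Policy for agency laptops and remote VPN access",
}

index = build_index(docs)
print(sorted(search(index, "vpn laptops")))
# prints ['laptop-policy.html', 'vpn-setup.doc']
```

The real engines add the hard parts on top of this: format converters, relevance scoring, a query language, and an on-disk index that scales.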
Directories full of random documents in random formats and random versions, with varying degrees of completeness and accuracy, tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.
If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.
You can use something like MediaWiki or TWiki or... heck, you can use a whole variety of content management systems.
The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.
Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
DON'T TRUST USERS TO ENTER META DATA!!!
I've worked in electronic document management in 3 different businesses, and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.
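If you want to see how bad the tags are before deciding, a rough audit is easy to script. The sketch below is an assumption-laden illustration, not a real validation tool: it handles only HTML, looks only at the `<meta name="keywords">` tag, and flags any keyword that never appears in the page's visible text using a crude substring check. The sample page is made up.

```python
from html.parser import HTMLParser


class MetaAudit(HTMLParser):
    """Collect <meta name="keywords"> values and the page's visible text."""

    def __init__(self):
        super().__init__()
        self.keywords = []
        self.text_parts = []
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.text_parts.append(data)


def unsupported_keywords(html):
    """Return meta keywords that never occur in the visible text."""
    parser = MetaAudit()
    parser.feed(html)
    parser.close()
    body = " ".join(parser.text_parts).lower()
    return [k for k in parser.keywords if k not in body]


# Hypothetical page whose author tagged it with an unrelated keyword.
sample = """<html><head><title>VPN setup</title>
<meta name="keywords" content="vpn, payroll, laptops">
</head><body><p>Configure the VPN client on agency laptops.</p></body></html>"""

print(unsupported_keywords(sample))
# prints ['payroll']
```

A keyword missing from the text is not automatically wrong, but a high miss rate across the collection is a decent hint that the tags can't carry the search on their own.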
Just to prove that your question is missing critical data:
- how many documents?
- how large are the average and largest documents?
- what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
- what search tools do you use elsewhere?
- any budget constraints?
- did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
- Did you consider OSS solutions? ht://Dig, Swish-e, custom searching?
- A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.
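The first three questions in the list above (how many documents, how large, which formats) can be answered with a quick survey script. This is a minimal sketch that assumes the documents live on a file share you can walk; the `survey` function and the report keys are names invented for this example, and file extensions are only a rough proxy for the real format.

```python
import os
import sys
from collections import Counter


def survey(root):
    """Walk root and report file count, sizes, and counts per extension."""
    sizes = []
    extensions = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append(os.path.getsize(path))
            except OSError:
                continue  # skip files that vanished or cannot be read
            ext = os.path.splitext(name)[1].lower() or "(none)"
            extensions[ext] += 1
    count = len(sizes)
    return {
        "count": count,
        "average_bytes": sum(sizes) / count if count else 0,
        "largest_bytes": max(sizes, default=0),
        "by_extension": dict(extensions),
    }


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: python survey.py <directory>
    print(survey(sys.argv[1]))
```

Run it against the root of the document share and you have real numbers to bring to a vendor or to size an open source setup.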
AND if I didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS Office files will completely destroy the usefulness of your searches.