Best Way to Build a Searchable Document Index?

Posted by ScuttleMonkey on Monday October 1, 2007 @09:36AM from the build-a-better-boss-trap dept.

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDFs." What methods or tools have others seen that work? Anything to avoid?

Meta tags placed? by harmonica · 2007-10-01 09:40 · Score: 3, Insightful

Who places what types of meta tags in the documents? I don't understand the requirements.

Generally, Lucene does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
1. Re:Meta tags placed? by rainmayun · 2007-10-01 10:43 · Score: 2, Insightful
  I don't understand the requirements.
  I don't either, and that's because the submitter didn't give enough information. I'm working on a fairly large enterprise content management system for the feds (think 2.5 TB/month of new data), and I don't see any of the solution components we use mentioned in any thread yet. If I were being a responsible consultant, I'd want to know the answers to the following questions at minimum before making any recommendations:
  
  What is the budget?
  
  How many documents are we talking about? The answer for 10,000 is different than for 10,000,000.
  
  Are you looking for off-the-shelf, or is software development + integration going to be involved?
  
  Who is going to maintain the integrity of this data?
  
  Although I am as much a fan of open source as anybody, I don't think the offerings in this area are anywhere near the maturity of commercial offerings. But some of those offerings cost a pretty penny, so it might be worthwhile to hire a developer or two for a few weeks or months to get what you want.
Google Desktop or Appliance by wsanders · 2007-10-01 09:40 · Score: 3, Insightful

Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.

The inexpensive Google appliances don't have very fine-grained access control, though. I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.

--
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
Beagle, Spotlight? by Lord+Satri · 2007-10-01 09:43 · Score: 2, Insightful

Is this something that would suit your needs: Beagle for Linux, Spotlight for OSX? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.

--
Animoog.org
what to avoid by Anonymous Coward · 2007-10-01 09:52 · Score: 2, Insightful

You should avoid any system that relies on individual employees putting in these meta-tags. It won't work; they either won't do it, or will do it wrong (spelling errors, inventing their own tags on the fly, and so on). And then you'll catch hell when they can't find one of those documents they mislabeled. Trust me.
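
If hand-entered tags are unavoidable, one way to limit the damage described above is to check every tag against a fixed vocabulary before it is stored. This is only a rough sketch: the tag list, the `check_tag` name, and the 0.8 similarity cutoff are all made up for illustration, and it uses Python's standard `difflib` to catch near-miss spellings.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would come from
# whoever owns the taxonomy (ideally a librarian, not end users).
ALLOWED_TAGS = {"network", "backup", "security", "procurement", "helpdesk"}

def check_tag(raw_tag, allowed=ALLOWED_TAGS, cutoff=0.8):
    """Return (status, tag) where status is 'ok', 'corrected', or 'rejected'."""
    tag = raw_tag.strip().lower()
    if tag in allowed:
        return ("ok", tag)
    # get_close_matches returns the closest vocabulary entries whose
    # similarity ratio is at or above the cutoff (an assumed threshold).
    close = difflib.get_close_matches(tag, sorted(allowed), n=1, cutoff=cutoff)
    if close:
        return ("corrected", close[0])
    # Anything else is an invented tag and should go back to the author.
    return ("rejected", tag)

for raw in [" Backup ", "secuirty", "my-stuff"]:
    print(raw, "->", check_tag(raw))
```

This catches misspellings and tags invented on the fly, but it cannot tell when someone picks a valid tag that is simply wrong for the document, which is the harder half of the problem.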
Depends on size of document base by Nefarious+Wheel · 2007-10-01 10:14 · Score: 2, Insightful

It depends on the size of your document base, and how you're going to store it -- if you're using something industrial-strength like Documentum or Hummingbird then the Google Mini won't index it; you have to go up a notch and use the yellow box solutions. And if you're using Lotus Notes, you'll need a third-party crawler such as C-Search. Google Desktop can be bent into some solutions, and it's free, but for many users you're better off having a separate server do the indexing. Google bills on the number of documents you need to keep in the index at once, and they throw in a bit of tinware to support that on a 2 year contract.
Disclaimer: I flog Google search solutions at work, so I'm way biased.

--
Do not mock my vision of impractical footwear
use a Wiki instead by poopie · 2007-10-01 10:20 · Score: 4, Insightful

Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.

If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.

You can use something like MediaWiki or TWiki or... heck, you can use a whole variety of content management systems.

The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.

Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
Meta tags are worthless, generally by Anonymous Coward · 2007-10-01 10:35 · Score: 4, Insightful

Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
DON'T TRUST USERS TO ENTER META DATA!!!
I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

Just to prove that your question is missing critical data:
- how many documents?
- how large are the average and largest documents?
- what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
- what search tools do you use elsewhere?
- any budget constraints?
- did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
- Did you consider OSS solutions? htdig, Swish-e, custom searching?
- A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.

AND if I didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS-Office files will completely destroy the usefulness of your searches.
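
For anyone wondering what "full text" buys you over tags, here is a toy inverted index in Python. It is a sketch under loud assumptions: the file names and contents are invented, text extraction from Word, Excel, or PDF files is assumed to have happened already, and real engines such as Lucene or Swish-e add stemming, ranking, and on-disk storage that this leaves out.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to the set of document names that contain it.

    `docs` is a dict of {name: extracted_text}; how the text gets
    extracted from each file format is outside this sketch.
    """
    index = defaultdict(set)
    for name, text in docs.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Invented sample documents, purely for illustration.
docs = {
    "vpn-setup.doc": "How to configure the VPN client on laptops",
    "backup-policy.pdf": "Nightly backup policy for file servers",
    "laptop-imaging.html": "Imaging laptops and restoring from backup",
}
index = build_index(docs)
print(sorted(search(index, "backup")))
print(sorted(search(index, "laptops backup")))
```

The index is built from what the documents actually say, so nobody has to type a tag correctly for a document to be findable.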
Requirements spec by Anonymous Coward · 2007-10-01 10:50 · Score: 1, Insightful

The requirements spec there reads like most of the projects I've worked on in the last few years. *sigh*

In light of the above I can't (IGC) recommend anything specific, but I can advise you to avoid:

1) In house solutions (expensive, usually buggy).
2) Anything from Thunderstone (If they've fixed the numerous Vortex bugs over the years I might revise my opinion but my last experience was painful).
3) MS Full text search/indexing (slow - and yeah you can throw a load of hardware at this but hardly the optimal solution).
4) Lucene (I've seen too many sites with dead Lucene searches).

The recommendations re Google are probably safe bets ("nobody ever got fired for buying Google") and I've had a lot of success with Swish-e for smaller (20,000 docs) projects.
Google backdoor appliance by Anonymous Coward · 2007-10-01 10:56 · Score: 1, Insightful

And it reports everything straight back to Google! Such a deal!
Re:Most easy solution by Anonymous Coward · 2007-10-01 11:46 · Score: 1, Insightful

Just a couple bucks, $60-70 per seat, plus how much for the server software?
Re:Google by rgaginol · 2007-10-01 13:48 · Score: 2, Insightful

I'd have to agree - I'm a Java developer so if I was doing the solution, I'm sure I could whip up something cool with Lucene or whatever. But... in terms of long-term maintenance costs, why develop anything yourself if the problem is already solved? And on the point of cool: good IT systems aren't cool... they do a job and do it well... maybe this is the first project where you find a "cool" solution is just not justifiable. I'm sure the Google appliance would let you put some quite extensive customizations on top of their API... well, that's been my experience with other Google products/services (Google Web Toolkit or Google Maps). Still, I've also found that some of the "nice to have" APIs are kept out of reach with some things - I was trying to put a listener on a custom tile image in a map application and suddenly came up against a barrier; "oh-boy-we'll-let-you-play-but-don't-dare-touch-that" sure came to mind. I guess the only way forward is to scope your requirements well - and I'll bet half of the "must haves" aren't really that important. After that, some research on each of the possible solutions would be good, along with the cost to implement/maintain them. If you've got programming expertise in house, it may be tempting to use them as a no-brainer, but it is worth finding out their cost to implement a good solution (one with maintenance and documentation factored in... basically, whatever amount of time they say it will take to code, times 4).
Re:Gee I don't know.. by Anonymous Coward · 2007-10-01 19:08 · Score: 1, Insightful

Why do you bother answering if you're just going to be an idiot? The guy asked a legit question for a legit problem......