Slashdot Mirror


Best Way to Build a Searchable Document Index?

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, and Access to PDFs." What methods or tools have others seen that work? Anything to avoid?
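For the HTML portion of such a collection, pulling out the meta tags can be sketched with Python's standard library alone. This is only an illustration of the idea, not a recommendation from the thread: the class name `MetaTagExtractor`, the helper `extract_meta`, and the sample tag names (`department`, `keywords`) are all made up here, and Word, Excel, Access, and PDF files would each need their own metadata reader.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        content = attr_map.get("content")
        if name and content is not None:
            self.meta[name.lower()] = content


def extract_meta(html_text):
    """Return a dict of meta tag names to their content strings."""
    parser = MetaTagExtractor()
    parser.feed(html_text)
    return parser.meta


# Hypothetical sample document; the tag names are assumptions.
sample = """<html><head>
<meta name="department" content="IT">
<meta name="keywords" content="backup, restore, tape">
<title>Backup procedure</title>
</head><body>...</body></html>"""

print(extract_meta(sample))
```

The resulting dictionary is the kind of per-document record that an indexer could then consume.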

3 of 216 comments

  1. Lucene by v_1_r_u_5 · Score: 5, Informative

    Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.

    1. Re:Lucene by Anonymous Coward · Score: 5, Informative

      Yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-built web crawler, check out Nutch, which is based on Lucene.
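To make "indexing engine" a bit more concrete, here is a toy inverted index over metadata fields in plain Python. It is emphatically not Lucene's API and has no scoring or real query language; `FieldIndex`, its methods, and the sample file names are invented for this sketch, which only shows the underlying idea of mapping (field, term) pairs to document ids.

```python
from collections import defaultdict


class FieldIndex:
    """Toy inverted index: maps (field, term) to the set of matching doc ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """Index a document given a dict of field name -> text value."""
        for field, value in fields.items():
            # Crude tokenizer: lowercase, treat commas as spaces, split on whitespace.
            for term in value.lower().replace(",", " ").split():
                self.postings[(field, term)].add(doc_id)

    def search(self, **criteria):
        """Return sorted ids of documents matching every field=term pair (AND)."""
        result = None
        for field, term in criteria.items():
            hits = self.postings.get((field, term.lower()), set())
            result = set(hits) if result is None else result & hits
        return sorted(result or [])


# Hypothetical documents and metadata, for illustration only.
index = FieldIndex()
index.add("backup.doc", {"department": "IT", "keywords": "backup, restore, tape"})
index.add("vpn.pdf", {"department": "IT", "keywords": "vpn, remote"})
index.add("leave.xls", {"department": "HR", "keywords": "leave, backup"})

print(index.search(department="IT", keywords="backup"))
print(index.search(keywords="backup"))
```

A real engine adds analyzers, ranking, and persistent storage on top of this structure, which is where working "at the API level" comes in.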

  2. Re:Google by rta · Score: 5, Informative

    At the previous place I worked, we had a Google Mini, and it was better than anything we had come up with in-house.

    We even pointed it at the web-cvs server and Bugzilla, and it was great at searching those too.

    To see all the bugs still open against v2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.

    It only cost something like $3k, IIRC.

    Not exactly what you asked about, but you should definitely see whether this would work for you instead.