Slashdot Mirror


Best Way to Build a Searchable Document Index?

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?

7 of 216 comments (clear)

  1. Lucene by v_1_r_u_5 · · Score: 5, Informative

    Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.

    1. Re:Lucene by Anonymous Coward · · Score: 5, Informative

      yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-build web crawler, check out Nutch, which is based on Lucene.

    2. Re:Lucene by knewter · · Score: 4, Informative

      Lucene's good. If you haven't yet have a look at Ferret, a port of Lucene for Ruby. It's listed as faster than Lucene. I've used it in 20+ projects now as my built-in fulltext index of choice, and it's pretty great. You can easily define your own ranking algorithms if you'd like. You can find more information on Ferret here: http://ferret.davebalmain.com/trac/

      I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.

      Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.

      --
      -knewter
    3. Re:Lucene by passthecrackpipe · · Score: 4, Informative

      "If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it."

      Yes, they have. In my previous job we had to search 2 terrabytes of plain text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results, and still requiring . One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend, and it was faster (full index in a day) required less resources (a single HP DL 585, 16GB RAM, 4xdual core AMD as opposed to 10 of the same), had a smaller index (about a 5th of Autonomies'), returned results faster, the result set was more accurate, and was significantly more flexible in making it do what we actually needed it to do.

      Lucene wins hands down

      --
      People who think they know everything are a great annoyance to those of us who do.
  2. Re:Google by rta · · Score: 5, Informative

    Previous place i worked we had a Google Mini and it was better than anything we had come up with in-house.

    We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.

    To see all the bugs still open against v 2.2.1 or something like that bugzilla's own search was better. but for searching for "bugs about X" the google mini was great.

    It only cost something like $3k ircc.

    not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.

  3. Most easy solution by PermanentMarker · · Score: 4, Informative

    it wil cost you some bucks just buy MS sharepoint portal server, and leave the indexing over to sharepoint.
    Your not even realy required to use added tags... (as most people will put in poor tags).

    But if you like you can add tags even with sharepoint.

    --
    I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
  4. Also see Xapian by dmeranda · · Score: 4, Informative

    I'd suggest you should consider a full-text search engine. First start here:
    http://en.wikipedia.org/wiki/Full_text_search

    If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.

    Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.

    Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.

    There are also many other wholely-contained indexers too, mostly which are based on web indexing (they have spiders, query forms, etc.) all bundled together. Like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).