Slashdot Mirror


Open Source Analog to Microsoft's Index Server?

An Anonymous Coward asks: "I have been tasked by my noble employer to find a better way of accessing the 4,000-odd management documents and procedures we have. Currently MS Index Server is being used to provide a fairly good searching system. Index Server (for those who don't know) trawls through files and indexes their content. ASP is then used to search the resulting database. My question is: is there a way to do this with nice open source software? Does anyone know of any competitors to Index Server that can index Microsoft Office documents? Thanks!" Might not ht://Dig be a good foundation on which to build such a system?

4 of 38 comments

  1. ASPSeek worked well for me by maeglin · Score: 5, Informative

    I had about 2 GB of documentation dumped on me for a project. The documentation had no visible structure and no obvious place to start tackling it, so I decided to just index it all. The documentation was on my Windows 2000 machine, and I put ASPSeek (a GPL'd search engine that no one seems to know about) on one of my Linux workstations. I used pdftotext and word2txt as filters and let it chew through the documentation. The results were good enough that, when I left the project and shut down the ASPSeek interface, it took about 15 minutes before someone (who already had it all indexed on his Windows 2000 workstation) was at my desk trying to get me to turn it back on.
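
    The filter step described above can be sketched roughly as follows. This is only an illustration, not ASPSeek's actual converter configuration: the FILTERS table and the build_command and extract_text helpers are hypothetical names, and catdoc stands in for word2txt because that tool's command line is not given in the comment. It assumes that "pdftotext FILE -" and "catdoc FILE" write plain text to standard output.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to an external converter command.
# Assumption: "pdftotext FILE -" and "catdoc FILE" print plain text to stdout.
FILTERS = {
    ".pdf": ["pdftotext", "{path}", "-"],
    ".doc": ["catdoc", "{path}"],
}


def build_command(path):
    """Return the converter argv for a file, or None if no filter applies."""
    template = FILTERS.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [part.replace("{path}", str(path)) for part in template]


def extract_text(path):
    """Return plain text for a file, running an external filter if one is configured."""
    command = build_command(path)
    if command is None:
        # No filter registered: treat the file as plain text already.
        return Path(path).read_text(errors="replace")
    # check=True raises CalledProcessError if the converter exits with an error.
    result = subprocess.run(command, capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

    The text returned by extract_text would then be handed to whatever indexer is in use. Keeping the command table separate from the code that runs it means a new document type only needs one more entry.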

  2. Re:Ironic by ericski · Score: 2, Informative

    The original post does specifically state "microsoft office documents", so it is far from likely that they're ASCII text.

  3. Apache Lucene by danpat · Score: 4, Informative

    I highly recommend taking a look at the Apache Lucene Project, at http://jakarta.apache.org/lucene/

    It's a full-text search engine API, so some coding for your specific requirements would be needed. However, it's fast, extremely flexible, and has a pluggable interface for documents. It comes with native support for plain text, and for proprietary document types, we've written simple wrappers around tools like "pdf2text" and "catdoc" to index PDFs and Word docs.
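
    To illustrate the general idea of a pluggable document interface feeding a full-text index, here is a toy sketch. It is not Lucene's API (Lucene is a Java library with its own classes); TinyIndex, register_handler, add_document and search are hypothetical names invented for this example, and the tokenizer is a deliberately crude one that keeps only ASCII letters and digits.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps each lowercase term to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.handlers = {}  # file extension -> function(raw bytes) -> text

    def register_handler(self, extension, handler):
        """Plug in a converter for one document type."""
        self.handlers[extension.lower()] = handler

    def add_document(self, doc_id, extension, raw):
        """Convert raw bytes to text with the matching handler and index the terms."""
        handler = self.handlers.get(extension.lower())
        if handler is None:
            raise ValueError("no handler registered for " + extension)
        text = handler(raw)
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents that contain every term in the query."""
        terms = re.findall(r"[a-z0-9]+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get(terms[0], set()))
        for term in terms[1:]:
            result &= self.postings.get(term, set())
        return result


index = TinyIndex()
index.register_handler(".txt", lambda raw: raw.decode("utf-8", errors="replace"))
index.add_document("policy-1", ".txt", b"Travel expense procedure")
print(index.search("expense procedure"))
```

    A handler for PDFs or Word documents would wrap an external converter in the same way the plain-text handler wraps a decode call. A real engine adds ranking, stemming and on-disk storage on top of this.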

  4. and Apache POI by tpv · Score: 2, Informative

    POI (http://jakarta.apache.org/poi) is an MS Office file reader.

    There has been much talk of integrating Lucene + POI to provide indexing of MS Office docs, but I don't know what stage that is at.

    --
    Read more of this story at Slashdot.