Slashdot Mirror


Indexing and Searching Text and PDF Repositories?

foobar104 asks: "My company is trying to set up a permanent archive of PDF documents, lots and lots of PDF documents. The goal is to be able to collect a ton of documents and assemble them into a directory structure, then burn the whole tree to a DVD-ROM (or similar medium) for permanent storage. Of course, 4 GB of PDFs would be pretty useless if you couldn't find what you were looking for, so I want to include one or more indices and a search program on the disk. I wanted to put Glimpse to the task, but it looks like somewhere along the line the Glimpse folks turned commercial, so they're no longer my first choice. (Unless there's still a source for Glimpse 4.0 source code?) Search tools like ht://dig and such are fine, but they're designed to index hyperlinked web content, and this collection isn't organized that way. Are there any open-source tools around, with a command-line interface, for indexing and searching plain-text and PDF documents that aren't tied to HTML/HTTP?"

1 of 10 comments

  1. ht://Dig is what you are looking for by BigJim.fr · Score: 5

    We index dozens of gigs of txt, html, pdf, xls, doc and ps files. Not 100% of the documents get indexed, but that's a parser problem with some of the files (a few pdf, xls, doc and ps files seem to make their parsers choke).

    And besides being flexible, ht://Dig is fast.

    4.9. How do I index PDF files?
    http://www.htdig.org/FAQ.html#q4.9
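
For readers who want to see what indexing a plain directory tree (no hyperlinks) involves, here is a minimal sketch in Python. It is not how ht://Dig or Glimpse work internally; it only illustrates the general inverted-index idea. The function names (`extract_text`, `build_index`, `search`) are made up for this example, and PDF handling assumes the external `pdftotext` command-line tool is installed and on the PATH. Files that cannot be parsed are skipped, much like the choking parsers mentioned above.

```python
import os
import re
import subprocess
from collections import defaultdict


def extract_text(path):
    """Return a file's text. PDFs go through pdftotext (assumed to be installed)."""
    if path.lower().endswith(".pdf"):
        # "pdftotext FILE -" writes the extracted text to stdout.
        result = subprocess.run(
            ["pdftotext", path, "-"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout
    with open(path, encoding="utf-8", errors="replace") as fh:
        return fh.read()


def build_index(root):
    """Walk the tree and map each lowercase word to the relative paths containing it."""
    index = defaultdict(set)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".txt", ".pdf")):
                continue
            path = os.path.join(dirpath, name)
            try:
                text = extract_text(path)
            except (OSError, subprocess.CalledProcessError):
                continue  # skip files the parser chokes on
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].add(os.path.relpath(path, root))
    return index


def search(index, query):
    """Return the sorted paths that contain every word of the query (AND search)."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return []
    hits = set.intersection(*(index.get(w, set()) for w in words))
    return sorted(hits)
```

Something like `search(build_index("/path/to/archive"), "annual report")` would then list the matching files. For a multi-gigabyte archive burned to disc, the index would have to be written out to a file alongside the documents rather than rebuilt in memory on every search; that step is left out here.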