Indexing and Searching Text and PDF Repositories?
foobar104 asks: "My company is trying to set up a permanent archive of PDF documents, lots and lots of PDF documents. The goal is to be able to collect a ton of documents and assemble them into a directory structure, then burn the whole tree to a DVD-ROM (or similar medium) for permanent storage. Of course, 4 GB of PDFs would be pretty useless if you couldn't find what you were looking for, so I want to include one or more indices and a search program on the disk. I wanted to put Glimpse to the task, but it looks like somewhere along the line the Glimpse folks turned commercial, so they're no longer my first choice. (Unless there's still a source for Glimpse 4.0 source code?) Search tools like ht://dig and such are fine, but they're designed to index hyperlinked web content, and this collection isn't organized that way. Are there any open-source tools around, with a command-line interface, for indexing and searching plain-text and PDF documents that aren't tied to HTML/HTTP?"
We index dozens of gigs of txt, html, pdf, xls, doc and ps. Not 100% of the documents get indexed, but that's a parser problem with some of the files (a few pdf, xls, doc and ps files seem to make their parser choke).
And besides being flexible, ht://Dig is fast.
4.9. How do I index PDF files?
http://www.htdig.org/FAQ.html#q4.9
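For anyone who wants to see the general idea without a web-oriented tool, here is a rough sketch in Python of indexing a directory tree of plain-text and PDF files and searching it from the command line. It is not how ht://Dig or Glimpse work internally. It assumes the `pdftotext` utility is installed and on the PATH for the PDF conversion, and the function names (`extract_text`, `build_index`, `search`) are made up for this example. Files that the converter chokes on are skipped and reported rather than stopping the run, much like the parser failures mentioned above.

```python
import os
import re
import subprocess
from collections import defaultdict

# A "word" here is simply a run of lowercase letters and digits.
WORD = re.compile(r"[a-z0-9]+")


def extract_text(path):
    """Return the text of a .txt or .pdf file, or None if it cannot be read."""
    ext = os.path.splitext(path)[1].lower()
    try:
        if ext == ".txt":
            with open(path, encoding="utf-8", errors="replace") as handle:
                return handle.read()
        if ext == ".pdf":
            # Assumption: pdftotext is on PATH; "-" sends the text to stdout.
            result = subprocess.run(
                ["pdftotext", path, "-"], capture_output=True, check=True
            )
            return result.stdout.decode("utf-8", errors="replace")
    except (OSError, subprocess.CalledProcessError):
        # Missing converter, unreadable file, or a document the parser rejects.
        return None
    return None  # Unsupported file type.


def build_index(root):
    """Walk root and map each word to the sorted relative paths containing it.

    Returns (index, skipped), where skipped lists files that were not indexed.
    """
    index = defaultdict(set)
    skipped = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            text = extract_text(path)
            if text is None:
                skipped.append(rel)
                continue
            for word in WORD.findall(text.lower()):
                index[word].add(rel)
    return {word: sorted(paths) for word, paths in index.items()}, skipped


def search(index, query):
    """Return the sorted paths that contain every word of the query."""
    words = WORD.findall(query.lower())
    if not words:
        return []
    hits = set(index.get(words[0], []))
    for word in words[1:]:
        hits &= set(index.get(word, []))
    return sorted(hits)
```

Because the index stores paths relative to the root, it stays valid wherever the disc is mounted. The dictionary returned by `build_index` contains only strings and lists, so it could be written to a file with `json.dump` and burned alongside the documents. A real archive of several gigabytes would probably want a more compact on-disk format than this.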