Indexing and Searching Text and PDF Repositories?
foobar104 asks: "My company is trying to set up a permanent archive of PDF documents, lots and lots of PDF documents. The goal is to be able to collect a ton of documents and assemble them into a directory structure, then burn the whole tree to a DVD-ROM (or similar medium) for permanent storage. Of course, 4 GB of PDFs would be pretty useless if you couldn't find what you were looking for, so I want to include one or more indices and a search program on the disk. I wanted to put Glimpse to the task, but it looks like somewhere along the line the Glimpse folks turned commercial, so they're no longer my first choice. (Unless there's still a source for Glimpse 4.0 source code?) Search tools like ht://dig and such are fine, but they're designed to index hyperlinked web content, and this collection isn't organized that way. Are there any open-source tools around for indexing and searching, with a command-line-interface, plain-text and PDF documents that aren't tied to HTML/HTTP?"
The retrieval section is here.
The MS Index Server supports indexing PDF files with an extension from Adobe. These files can be anywhere, not necessarily on a web site or linked to by web pages. Even so, it is quite easy to search the index (which supports free-text searching) via the web. It's a fairly easy solution to set up, too.
I see that Google is now indexing PDFs. Maybe you can check out what they are using.
Yep, I never spell check.
More incorrect spellings can be found he
The main problem I see is that it is a) no longer actively developed and b) platform-limited. However, it is a great product.
I'd switch if I could find another search engine that would do simple things like phrase/near searching, fielded searches, etc. The main problem is that most search engines are good at web-like searches where you don't need a lot of control over the search. Perhaps some of the other readers will be able to help.
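For anyone wondering what phrase/near searching needs under the hood: the usual trick is to store word positions in the index, not just which documents contain a word. Here is a minimal sketch in Python, not how any of the products mentioned in this thread actually work. The function names (`tokenize`, `build_index`, `near_search`, `phrase_search`) and the `max_gap` parameter are made up for illustration, and it assumes the documents have already been converted to plain text.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each word to {doc_id: [positions]} (a positional inverted index)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word][doc_id].append(pos)
    return index

def near_search(index, first, second, max_gap=5):
    """Return doc ids where the two words occur within max_gap positions
    of each other, in either order."""
    a = index.get(first.lower(), {})
    b = index.get(second.lower(), {})
    hits = set()
    for doc_id in a.keys() & b.keys():
        if any(abs(p - q) <= max_gap for p in a[doc_id] for q in b[doc_id]):
            hits.add(doc_id)
    return hits

def phrase_search(index, phrase):
    """Return doc ids that contain the words of the phrase consecutively."""
    words = tokenize(phrase)
    if not words:
        return set()
    postings = [index.get(word, {}) for word in words]
    # Only documents containing every word can contain the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            if all(start + i in postings[i][doc_id] for i in range(1, len(words))):
                hits.add(doc_id)
                break
    return hits
```

Fielded searching would be a variation on the same idea: keep one such index per field (title, author, date) and intersect the results.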
I've thought several times about just writing a new search engine. Then I think about what is really involved..... :)
When working as a consultant, I came across a company which was using a product called 'Fulcrum KnowledgeServer', made by Hummingbird. I can't go into too many details because of an NDA, but let's just say they had what appeared to be hundreds of thousands of documents in PDF all over their file servers.
It looked like they were publishing everything in their company in PDF, and avoiding other office document formats entirely. In any case, Fulcrum apparently supports just about everything you asked for.
It comes with a desktop app which everyone was previously using to perform the searches. All documents were categorized by metadata such as date, creator, maintainer, and a whole tree of categories.
My job in this case was to integrate the search capabilities into their intranet site, a task which was surprisingly simple once I was able to track down the documentation and libraries (of the correct version, mind) I needed. It seemed to be a pretty powerful product, if managed well. I have absolutely no idea how much it cost or exactly how it worked; I just interfaced with it. However, the documents were available at the file level via network shares, so it sounds like it'll do what you ask.
You can accomplish anything you set your mind to. The impossible just takes a little longer.
We index dozens of gigs of txt, html, pdf, xls, doc and ps. Not 100% of the documents get indexed, but that is a parser problem with some of the files (a few pdf, xls, doc and ps files seem to make their parser choke).
And besides being flexible, ht://Dig is fast.
4.9. How do I index PDF files?
http://www.htdig.org/FAQ.html#q4.9
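If you would rather roll your own for the DVD case, the general recipe of converting each PDF to text with an external tool and indexing the result can be sketched in a few lines of Python. This is not the ht://Dig configuration from the FAQ, just a rough illustration under some assumptions: the `pdftotext` utility from Xpdf/Poppler is on the PATH, the index is stored as a JSON file (my choice, nothing standard), and all function names are invented for the example.

```python
import json
import os
import re
import subprocess
from collections import defaultdict

def extract_text(path):
    """Return the text of a file; PDFs go through the external pdftotext tool."""
    if path.lower().endswith(".pdf"):
        # Assumption: pdftotext is installed; "-" sends the text to stdout.
        result = subprocess.run(["pdftotext", path, "-"],
                                capture_output=True, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.read()

def build_index(root):
    """Walk root and map each word to the sorted relative paths containing it."""
    index = defaultdict(set)
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith((".txt", ".pdf")):
                continue
            path = os.path.join(folder, name)
            try:
                text = extract_text(path)
            except (OSError, subprocess.CalledProcessError):
                continue  # skip files the converter chokes on (or a missing tool)
            rel = os.path.relpath(path, root)
            for word in set(re.findall(r"[a-z0-9]+", text.lower())):
                index[word].add(rel)
    return {word: sorted(paths) for word, paths in index.items()}

def save_index(index, index_path):
    """Write the index as JSON so it can be burned to the disc with the tree."""
    with open(index_path, "w", encoding="utf-8") as handle:
        json.dump(index, handle)

def search(index_path, *words):
    """Return the relative paths that contain every query word."""
    with open(index_path, encoding="utf-8") as handle:
        index = json.load(handle)
    results = None
    for word in words:
        paths = set(index.get(word.lower(), []))
        results = paths if results is None else results & paths
    return sorted(results or [])
```

Storing paths relative to the root of the tree means the index still works no matter where the disc ends up mounted. For 4 GB of documents a single JSON file would get unwieldy, so treat this as a starting point only.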