Best Way to Build a Searchable Document Index?
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, and Access to PDFs." What methods or tools have others seen that work? Anything to avoid?
I posted this before on Slashdot. A while ago I discovered a cool system called Alfresco. There are free (as in liberty) and commercial versions. It acts as an SMB (like SAMBA), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of a file, and, most importantly, search the documents in the system.
Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.
I am not affiliated with Alfresco, just a happy user.
Don't use SAMBA for .doc and .pdf's, use Alfresco.
And Copernic Desktop Search outperforms Google Desktop Search (GDS) by a long way...
Seconded. I've done implementations (hosted and in-house) of both Google Minis and the full-blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.
We've been using Confluence, from Atlassian, for our wiki, and it's pretty full-featured for a wiki.
As with software development, the quality of the outcome depends on the implementation.
We run six thesauri, plus a number of different controlled lists for our users to input metadata. We don't publish any documents that don't have metadata attached, and we perform random quality audits. Our users have been trained in the fundamentals of classification and also in the payoff for getting it right.
We use metadata to structure the navigation for some sites, and we depend on it for searching our internal documents. Our metadata implementation works incredibly well for us, clearly and consistently outperforming plain text indexing.
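To make the "no publishing without metadata" rule concrete, here is a minimal sketch of checking a document's tags against controlled lists before it goes out. The field names, the allowed values, and the validate_metadata helper are all made up for illustration; a real setup would load them from its own thesauri.

```python
# Hypothetical controlled lists; a real deployment would load these
# from its thesauri rather than hard-coding them.
CONTROLLED_VOCAB = {
    "department": {"finance", "hr", "it"},
    "doc_type": {"policy", "procedure", "manual"},
}


def validate_metadata(meta):
    """Return a list of problems; an empty list means the tags pass."""
    problems = []
    for field, allowed in CONTROLLED_VOCAB.items():
        value = meta.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value.lower() not in allowed:
            problems.append(f"unknown {field}: {value}")
    return problems


# A document is only published when the problem list is empty.
print(validate_metadata({"department": "IT", "doc_type": "policy"}))
print(validate_metadata({"department": "sales"}))
```

Rejecting unknown values at input time is what keeps the search facets clean; the random audits then only have to catch tags that are valid but wrong.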
It's true. I finished a double major at university, worked in a relevant field the whole time, have excellent references, and now I can't find work... Hire me and I will do this for you.
Dealing with large data sets isn't really technologically challenging. You can grow to an arbitrarily large data set size simply by partitioning: it works for Google. It may be expensive to stand up a bunch of servers, but I don't think it's really that hard.
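As a rough illustration of what partitioning means here, the sketch below spreads documents over several shards by hashing the document id and answers a query by asking every shard and combining the hits. Everything in it (shard_for, PartitionedIndex, the in-memory dicts standing in for separate servers) is an assumption for the sake of the example, not how Google or any particular product does it.

```python
import zlib


def shard_for(doc_id, num_shards):
    # crc32 is stable across runs, unlike Python's built-in hash() on strings.
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


class PartitionedIndex:
    """Toy index: each shard is a dict mapping term -> set of doc ids."""

    def __init__(self, num_shards):
        self.shards = [dict() for _ in range(num_shards)]

    def add(self, doc_id, text):
        # A document lives in exactly one shard, chosen by its id.
        shard = self.shards[shard_for(doc_id, len(self.shards))]
        for term in text.lower().split():
            shard.setdefault(term, set()).add(doc_id)

    def search(self, term):
        # Fan the query out to every shard and union the results.
        hits = set()
        for shard in self.shards:
            hits |= shard.get(term.lower(), set())
        return hits


idx = PartitionedIndex(num_shards=4)
idx.add("a.doc", "Backup policy")
idx.add("b.pdf", "backup schedule")
print(sorted(idx.search("backup")))
```

Growing the data set then means adding shards; each one only ever holds its own slice of the documents.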
What is more complicated is dealing with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene. I seem to recall reading that Technorati uses Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one-time computational problem to build the index. Lucene can recombine indexes, so you could distribute index creation across a lot of servers.
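The "build in pieces, recombine later" idea can be sketched in plain Python. This is not Lucene's API, just a toy inverted index with made-up helper names (build_partial_index, merge_indexes) showing why the one-time build can be split across machines: each worker indexes its own batch, and merging is a simple union of posting sets per term.

```python
from collections import defaultdict


def build_partial_index(docs):
    """Index one batch of documents: term -> set of doc ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Combine independently built indexes into one."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)


# Each batch could be indexed on a different server.
part1 = build_partial_index({"a.doc": "disk backup"})
part2 = build_partial_index({"b.pdf": "tape backup"})
full = merge_indexes([part1, part2])
print(sorted(full["backup"]))
```

Because the merge does not depend on the order of the partial indexes, the batches can be built in parallel and combined whenever they finish.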
I know this will get me flamed, but:
You might try Oracle Text (also part of Oracle XE).
It supports 140 document formats, has a lot of options, and works via SQL.
It can build indexes for documents stored in the DB or in the file system.
You can even join the search terms from the document with the database records where metadata might be stored by your application.
I found that very helpful in similar projects. And it's free.