Best Way to Build a Searchable Document Index?

Posted by ScuttleMonkey on Monday October 1, 2007 @09:36AM from the build-a-better-boss-trap dept.

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files to HTML, Excel, Access, and PDFs." What methods or tools have others seen that work? Anything to avoid?

Check out Alfresco! by thule · 2007-10-01 10:01 · Score: 2, Interesting

I posted this before on Slashdot. A while ago I discovered a cool system called Alfresco. There are free (as in liberty) and commercial versions. It acts as an SMB (like Samba), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of a file, and, most importantly, search the documents in the system.

Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

Don't use Samba for .doc and .pdf files; use Alfresco.

I am not affiliated with Alfresco, just a happy user.
1. Re:Check out Alfresco! by G1369311007 · 2007-10-01 10:47 · Score: 1, Interesting
  
  Along the same lines as Alfresco is Plone (www.plone.org). I'm currently the sole admin of a Plone site serving ~50 users on an intranet. We use it for document management, etc. Just another option. The CIA's website is made in Plone, so it can't be that bad, right?! www.cia.gov
  
  --
  "Don't blink. Don't even blink. Blink and you're dead."
Re:Google Desktop or Appliance by Anonymous Coward · 2007-10-01 10:11 · Score: 1, Interesting

And Copernic Desktop Search outperforms GDS by a long way...
Re:Google by TooMuchToDo · 2007-10-01 10:20 · Score: 2, Interesting

Seconded. I've done implementations (hosted and in-house) of both Google Minis and the full-blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.
Re:use a Wiki instead by LadyLucky · 2007-10-01 12:12 · Score: 2, Interesting

We've been using Confluence, from Atlassian, for our wiki, and it's pretty full-featured as wikis go.

--
dominionrd.blogspot.com - Restaurants on
Re:Google Desktop or Appliance by anthonys_junk · 2007-10-01 15:25 · Score: 2, Interesting

Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.
Like software development, the quality of the outcome is implementation dependent.

We run six thesauri, plus a number of different controlled lists for our users to input metadata. We don't publish any documents that don't have metadata attached, and we perform random quality audits. Our users have been trained in the fundamentals of classification and also in the payoff for getting it right.

We use metadata to structure the navigation for some sites, and we depend on it for searching our internal documents. Our metadata implementation works incredibly well for us, clearly and consistently outperforming plain-text indexing.
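
For anyone wondering what "controlled lists" plus a publish gate might look like in practice, here is a minimal sketch in Python. It is not the poster's actual system: the field names, the approved terms, and the helper names (CONTROLLED_LISTS, validate_metadata, can_publish) are all made up for illustration, and a real thesaurus would also handle synonyms and broader/narrower terms.

```python
# Hypothetical controlled lists: field name -> set of approved terms.
# These fields and terms are assumptions for illustration only.
CONTROLLED_LISTS = {
    "department": {"it", "finance", "hr"},
    "doc_type": {"policy", "procedure", "manual"},
}


def validate_metadata(metadata, controlled_lists=CONTROLLED_LISTS):
    """Return a list of problems; an empty list means the metadata is acceptable."""
    problems = []
    for field, allowed in controlled_lists.items():
        value = metadata.get(field)
        if not value:
            # The field is absent or empty.
            problems.append(f"missing {field}")
        elif value.strip().lower() not in allowed:
            # The value is present but not one of the approved terms.
            problems.append(f"{field}: '{value}' is not an approved term")
    return problems


def can_publish(metadata):
    """Publish gate: refuse any document whose metadata fails validation."""
    return not validate_metadata(metadata)
```

Calling can_publish({"department": "IT", "doc_type": "policy"}) passes, while a document tagged only with department "Sales" is held back with two problems reported. Rejecting at input time is what keeps the index clean; the random audits then only have to catch terms that are approved but wrongly applied.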

--
Barbara Felden claims prior art on the flip phone, sues Motorola, Nokia.
Re:Gee I don't know.. by ditto999999999999999 · 2007-10-01 15:33 · Score: 3, Interesting

It's true. I finished a double major at university, worked in a relevant field the whole time, have excellent references, and now I can't find work... Hire me and I will do this for you.
Re:Lucene by bwt · 2007-10-01 17:03 · Score: 2, Interesting

Dealing with large data sets isn't really technologically challenging. You can grow to an arbitrarily large data set size simply by partitioning: it works for Google. It may be expensive to stand up a bunch of servers, but I don't think it's really that hard.

What is more complicated is dealing with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene. I seem to recall reading that Technorati uses Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one-time computational problem to build the index. Lucene can recombine indexes, so you could distribute index creation across a lot of servers.
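
To make the partition-then-recombine idea concrete, here is a toy sketch in plain Python. It is not Lucene's API and says nothing about how Lucene stores or merges its on-disk indexes; the function names (build_partial_index, merge_indexes, search) are hypothetical. Each partition builds its own inverted index independently, which is the step you would farm out to separate servers, and the partial indexes are then unioned into one.

```python
from collections import defaultdict


def build_partial_index(docs):
    """Build an inverted index (term -> set of doc ids) for one partition.

    `docs` maps a doc id to its text. Tokenizing by whitespace is a
    deliberate simplification.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def merge_indexes(partials):
    """Recombine partial indexes by unioning the doc-id sets for each term."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return merged


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Two partitions, indexed independently (in practice, on different machines).
partition_a = {"a1": "backup policy for file servers"}
partition_b = {"b1": "weekly backup schedule", "b2": "printer setup guide"}

merged = merge_indexes(
    [build_partial_index(partition_a), build_partial_index(partition_b)]
)
```

Here search(merged, "backup") finds a document from each partition, and search(merged, "backup policy") narrows to the one in partition_a. The merge only works this simply because doc ids are unique across partitions, which a real deployment would have to guarantee.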
How about Oracle text by iq1 · 2007-10-01 22:34 · Score: 2, Interesting

I know this will get me flamed, but:
you might try Oracle Text (also part of Oracle XE).

Supports 140 document formats, has a lot of options and works via SQL.
Can build indexes for documents stored in DB or in the file system.
You can even join the search results from the documents with the database records where your application might store its metadata.
I found that very helpful in similar projects. And it's free.