Best Way to Build a Searchable Document Index?
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?
You're the one gettin' paid, you figure it out.
Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.
We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.
If you value your privacy, invest in a Google mini, though.
Who places what types of meta tags in the documents? I don't understand the requirements.
Generally, Lucene does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.
The inexpensive Google appliacances don't have very fine-grained access control, though. But I am involved in several semi-failed projects of this nature in my organization, but new and legacy, and my Google Desktop outperforms all of them.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
http://www.swish-e.org/
Is this something that would suit your needs: Beagle for Linux, Spotlight for OSX? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.
Animoog.org
http://www.theregister.co.uk/2001/11/05/the_bofh_content_management_system/
The secret to creativity is knowing how to hide your sources. -- Albert Einstein
If you're using Office 2007, you can probably hack something together really quickly to pull the meta tags from the files and put them in a database. Not sure about the other formats you need, though--and support from Google, for instance, would probably be beneficial for your company anyway. Hope that helps!
US businesses that currently accept chip and PIN/signature
You should avoid any system that relies on individual employees putting in these meta-tags. It won't work; they either won't do it, or will do it wrong (spelling errors, inventing their own tags on the fly, and so on.) And then you'll catch hell when they can't find one of those documents they mislabled. Trust me.
it wil cost you some bucks just buy MS sharepoint portal server, and leave the indexing over to sharepoint.
Your not even realy required to use added tags... (as most people will put in poor tags).
But if you like you can add tags even with sharepoint.
I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
Let google do the indexing!
I posted this before on slashdot. I discovered a while ago a cool system called Alfresco. There is a free (as is liberty) and commercial versions. It acts like a SMB (like SAMBA), ftp, and WebDAV server so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of the file, and most importantly, search the documents in the system.
.doc and .pdf's, use Alfresco.
Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.
Don't use SAMBA for
I am not affiliated with Alfresco, just a happy user.
You are in marketing aren't you?
(I'm sold anyway)
I'd suggest you should consider a full-text search engine. First start here:
http://en.wikipedia.org/wiki/Full_text_search
If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.
Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.
Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.
There are also many other wholely-contained indexers too, mostly which are based on web indexing (they have spiders, query forms, etc.) all bundled together. Like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
Disclaimer: I flog Google search solutions at work, so I'm way biased.
Do not mock my vision of impractical footwear
Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.
If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.
You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.
The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.
Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
DON'T TRUST USERS TO ENTER META DATA!!!
I've worked in electronic document management in 3 different businesses and metadata entered by end users is worst than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.
Just to prove that you're question is missing critical data:
- how many documents?
- how large is the average and largest documents?
- what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
- what search tools do you use elsewhere?
- any budget constraints?
- did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
- Did you consider OSS solutions? htdig, e-swish, custom searching?
- A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.
AND if i didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad Metadata in MS-Office files will completely destroy the usefulness of your searches.
If you host all of your documentation on a website, take a look at ht://dig [http://www.htdig.org].
/cgi-bin/search.cgi to your page, and you can auto-magically search your documentation.
I've deployed it across a handful of servers, and it does a good job of crawling, but doesn't do well with javascript. If you have javascript for your web's frontend, you can write a shell script to find . -print, prepend the urls into a file, and point htdig at that file. It will dig into each file it finds, and create a searchable database of everything that it finds.
You add
- Avron
There are 2 problems: getting plain text out of documents, then indexing the plain text
A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.
The Apache POI project (Java) can read and write several Microsoft Office formats.
For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).
I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby:
http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html
Here is another short snippet for reading OpenOffice.org documents in Ruby:
http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html
---
You might just want to use the entire Nutch stack:
http://lucene.apache.org/nutch/
stack that collects documents, spiders the web, has plugins for many document types, etc. Good stuff!
As a Documentum developer, especially in light of the recent 6.0 release, I'd be remiss not to recommend it for such a purpose. It's expensive, rather complex, and requires solid development talent to implement, but is almost infinitely configurable and customizable, and there are separate components (at cost, of course) that can add on all sorts of fun functionality like collaboration, digital asset management, etc. It has the ability to auto-tag documents based on configurable rules using Content Intelligence Services and supports extensible object hierarchies, workflows, lifecycles, taxonomies, web services, you name it. It's probably overkill for the user in question, and it's far from open source (although EMC is doing an admirable job at encouraging code exchange, and the new dev. environment is based on Eclipse), but it's pretty darn slick when you look at the ground it covers, functionally.
If you're looking for an index, a document management system probably makes sense. This one is inexpensive and very good.
Sorry, mangled the URL in the parent: Wumpus-Search.org
i know this will give me flames, but:
you might try Oracle Text (also part of Oracle XE).
Supports 140 document formats, has a lot of options and works via SQL.
Can build indexes for documents stored in DB or in the file system.
You can even join the serach terms from the document with the database records where metadata might be stored by your application.
I found that very helpful in similar projects. And it's free.
Terrier - LINK
Indri/Lemur - LINK / LINK
MG - LINK
As someone who has made a 30-year career out of designing and building document management systems, I would urge you to look first at how you expect your users to find the documents they need. The expected results of a search should guide your choice of indexing methods - and the popular "meta tagging" method isn't always the best. There are shortcomings with all methods.
Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.
Using meta-tags gives the appearance of pre-classifying documents and having the users do it themselves means you don't have to have a dedicated person to assign the tags. The disadvantage is that everybody makes up their own tags or if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed of missing some documents using this method.
Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every doucment is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.
Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.
So - understand your users search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.