Slashdot Mirror


Best Way to Build a Searchable Document Index?

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?

27 of 216 comments

  1. Lucene by v_1_r_u_5 · · Score: 5, Informative

    Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.

    1. Re:Lucene by Anonymous Coward · · Score: 5, Informative

      Yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-built web crawler, check out Nutch, which is based on Lucene.

    2. Re:Lucene by BoberFett · · Score: 3, Informative

      I haven't used Lucene, but as for commercial software I've used dtSearch and ISYS and they are both excellent full text search engines. Both have web interfaces as well as desktop programs, and SDKs are available for custom applications. They scale to a massive number of documents per index, in a large variety of formats and are very fast. They have additional features like returning documents of all types in HTML so no reader is required on the front end other than a browser so legacy formats are easier to access.

    3. Re:Lucene by knewter · · Score: 4, Informative

      Lucene's good. If you haven't yet, have a look at Ferret, a port of Lucene for Ruby. It's listed as faster than Lucene. I've used it in 20+ projects now as my built-in fulltext index of choice, and it's pretty great. You can easily define your own ranking algorithms if you'd like. You can find more information on Ferret here: http://ferret.davebalmain.com/trac/

      I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.

      Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
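
      The 'after upload' step described above can be sketched roughly as follows. This is a minimal Python illustration, not the prototype mentioned in the post: `extract_strings` only approximates what the Unix `strings` utility does, and `build_index_record` and its field names are hypothetical.

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of printable ASCII, similar in spirit to Unix `strings`."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def build_index_record(filename: str, data: bytes, keywords: list[str]) -> dict:
    """Hypothetical 'after upload' hook: the fields to hand to an indexer."""
    return {
        "filename": filename,
        "keywords": keywords,                        # typed in by the uploader
        "content": " ".join(extract_strings(data)),  # pulled out automatically
    }
```

      The resulting `content` field is noisy for binary formats, which is the trade-off of going strings-only instead of using a format-aware parser.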

      --
      -knewter
    4. Re:Lucene by caseydk · · Score: 2, Informative


      DocSearcher - http://docsearcher.henschelsoft.de/ - already does it. A friend with the US Coast Guard wrote it 4+ years ago, I deployed it within the Department of Justice for a few projects, and it's pretty widely used among some of the local tech circles. It even plugs into Tomcat if you want a web-based UI.

    5. Re:Lucene by dilute · · Score: 2, Informative

      Lucene is strictly an indexing engine. It wants to index text. It can index metadata as well as full text. Your surrounding application gets the files to index from wherever (local hard drive, database BLOBS, remote Windows shares or what have you). We don't care if the files are Word, PDF, Powerpoint, HTML, or whatever. A parser (many free ones available) extracts the text. We also don't care what web server you are using - using the index to identify and retrieve files is a totally separate process. Lucene indexes the text stream one-by-one and stores the results in a very efficiently organized index. It has been ported to a bunch of languages, including dotNET. I haven't tried it on terabytes of data but it rips through gigabytes very fast. Assuming all 8 terabytes don't change between runs the scale should be no problem.
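
      To make the idea concrete, here is a toy inverted index in Python that indexes metadata fields alongside full text. It is only a sketch of the general technique and is not Lucene's API; `TinyIndex` and its methods are made-up names, and the text is assumed to have already been extracted by a separate parser.

```python
import re
from collections import defaultdict


class TinyIndex:
    """Toy inverted index: maps (field, term) -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        # `fields` is a dict such as {"content": "...", "author": "..."}.
        for field, text in fields.items():
            for term in re.findall(r"\w+", text.lower()):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        # AND semantics: a document must contain every query term in `field`.
        terms = re.findall(r"\w+", query.lower())
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result
```

      Retrieving the matching files is then a separate step keyed on the returned document ids, as the post says.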

      If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it.
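
      Conceptually, merging indexes built in parallel means taking the union of the posting lists for each term. The Python sketch below shows only that idea on plain dictionaries; it is not how Lucene stores or merges its index files, and `merge_indexes` is a hypothetical helper.

```python
from collections import defaultdict


def merge_indexes(partial_indexes):
    """Merge several term -> set(doc_id) mappings into one mapping."""
    merged = defaultdict(set)
    for partial in partial_indexes:
        for term, doc_ids in partial.items():
            # Union into a fresh set so the inputs are left untouched.
            merged[term] |= doc_ids
    return dict(merged)
```

      This assumes each worker indexed a disjoint set of documents with ids that are unique across workers.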

    6. Re:Lucene by Cato · · Score: 2, Informative

      You might also like to investigate Plucene and KinoSearch which are both Perl ports of the Lucene engine. It's also worth considering combining your search engine with Wikis where possible - then you can find documents by keywords and also navigate to a Wiki page providing context and related documents or Wiki pages. TWiki, which is the most popular open source enterprise Wiki engine, has plugins for both these engines, see http://twiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn

    7. Re:Lucene by passthecrackpipe · · Score: 4, Informative

      "If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it."

      Yes, they have. In my previous job we had to search 2 terabytes of plain text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results, and still requiring . One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend, and it was faster (full index in a day), required fewer resources (a single HP DL 585, 16GB RAM, 4xdual core AMD as opposed to 10 of the same), had a smaller index (about a 5th of Autonomy's), returned results faster, the result set was more accurate, and was significantly more flexible in making it do what we actually needed it to do.

      Lucene wins hands down

      --
      People who think they know everything are a great annoyance to those of us who do.
  2. Google by Anonymous Coward · · Score: 3, Informative

    We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.

    If you value your privacy, invest in a Google mini, though.
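
    The "check IP, not user-agent" advice could look something like this in Python. It is a sketch only: the networks listed are reserved documentation ranges used as placeholders, not real crawler or office addresses, and `may_fetch_full_document` is a hypothetical name.

```python
import ipaddress

# Placeholder ranges (reserved documentation networks), NOT real crawler
# or office addresses; substitute the ranges you actually trust.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def may_fetch_full_document(client_ip: str, user_agent: str = "") -> bool:
    # The User-Agent header is deliberately ignored: it is trivially forged.
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)
```

    Anyone outside the listed networks would then be sent to a login page instead of the document.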

    1. Re:Google by rta · · Score: 5, Informative

      Previous place i worked we had a Google Mini and it was better than anything we had come up with in-house.

      We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.

      To see all the bugs still open against v 2.2.1 or something like that, Bugzilla's own search was better, but for searching for "bugs about X" the Google Mini was great.

      It only cost something like $3k IIRC.

      not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.

    2. Re:Google by shepmaster · · Score: 3, Informative

      The company I work for, Vivisimo, makes an awesome search engine. Although I've never dealt with the Google box directly, I know that we have had customers get fed up with the Google box and replace it quite easily with our software. Click the first link to see a pretty flash demo, or go to Clusty.com to try out a subset of the functionality for real. We specialize in "complex, heterogenous search solutions", which exactly fits most intranet sites I've seen. Files are on SMB shares, local disks, Sharepoint, Lotus, Documentum, IMAP, Exchange, etc, etc, etc. We connect to all those sources and provide a unified interface. You can do really neat tricks with combining content across multiple repositories, such as metadata from a database added to files on SMB shares. We support Linux, Solaris, and Windows, all 32 and 64 bit. Although I may work here, it really is a great product, and I use it at home to crawl my email archives and various blogs, websites, forums, things that I use frequently but have sucky search.

  3. Swish-E by ccandreva · · Score: 2, Informative
  4. Most easy solution by PermanentMarker · · Score: 4, Informative

    It will cost you some bucks, but just buy MS SharePoint Portal Server and leave the indexing to SharePoint.
    You're not even really required to use added tags... (as most people will put in poor tags).

    But if you like you can add tags even with sharepoint.

    --
    I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
    1. Re:Most easy solution by DigitalSorceress · · Score: 3, Informative

      Actually, if you are an MS shop and have Microsoft Server 2003, SharePoint Services 3.0 is free (as opposed to the SharePoint Portal Server, now renamed, I believe, to Microsoft Office SharePoint Server, which does indeed cost a packet).

      I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have a MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications, especially for collaboration. You can set up work flows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL server on the back end (as opposed to the express edition that it defaults to), you can have full text indexing even with the free SharePoint Services version. The only need for the full blown Portal/MOSS version is if you think you are going to have a large number of SharePoint sites, and want to simplify cross-connecting and management. (At least as far as I can recall)

      I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.

      I'd strongly advise avoiding it if you plan to do Internet-based stuff though... at least until you get a good enough understanding of the security issues involved that you feel that you really know what you're doing.

      Just my $0.02 worth.

      --

      The Digital Sorceress
    2. Re:Most easy solution by guruevi · · Score: 2, Informative

      I am just done with 6 months of SharePoint integration in an MS shop. From a development and security standpoint: STAY AWAY FROM IT. 2003 seems to be an Alpha version, 2007 is still full of bugs (better than Beta but still) and it's also very, very slow (it's based on .NET). To work fairly well for 100 users it requires 1 SQL Server (MSDE will not work for non-development purposes), 2 Frontends and 1 loadbalancer/firewall based on Microsoft Forefront just for security purposes (since the built-in SharePoint requires a lot of stuff to be opened to all users). It's also expensive... 10k/server + CALs for every user and that was in a big MS shop.

      Next to that, there are a lot of caveats and as soon as you start modifying the layout (even though it's just the HTML) in SharePoint Designer, Microsoft Support will not help you (as if they could in the first place). Simple things like whitespace in between table structures can make your list workflows screw up (yes there is an actual open case with Microsoft Support for that very issue). A lot of things will not work either and require a nasty hack or workaround (like attachment upload on modified forms); these are known to Microsoft and have been known for the last 9 months.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
  5. Also see Xapian by dmeranda · · Score: 4, Informative

    I'd suggest you should consider a full-text search engine. First start here:
    http://en.wikipedia.org/wiki/Full_text_search

    If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.

    Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.

    Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.

    There are also many other wholly contained indexers too, mostly based on web indexing (they have spiders, query forms, etc.) all bundled together, like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).

    1. Re:Also see Xapian by risk+one · · Score: 2, Informative

      While I agree that Lucene is a great choice specifically for Java shops, it does have ports for pretty much all major languages. The Java implementation is the 'mothership' but you can use Lucene with PHP, Python, .NET or C++ or whatever.

      Secondly, I'd like to point out Lemur. It's an indexing engine similar to Lucene, but geared much more toward the language modeling approach of information retrieval. All IR approaches will use either a vector space based approach or a language model approach. Lucene does vector space very well, but it's difficult to get it to do language model based retrieval (although extensions are available), Lemur can do both. Lemur also has Indri, a search engine written on top of Lemur, which can parse html, PDF and xml. And like Lucene, Lemur has multiple language ports of the API.
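
      For readers unfamiliar with the vector space approach mentioned above, here is a small Python sketch of TF-IDF weighting with cosine similarity. It illustrates the textbook technique only; the smoothing used here is one common choice, and this is not the scoring formula of Lucene or Lemur. The function names are made up for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"\w+", text.lower())


def rank(query, documents):
    """Rank documents (dict of id -> text) by TF-IDF cosine similarity."""
    doc_terms = {doc_id: Counter(tokenize(text)) for doc_id, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for counts in doc_terms.values() for term in counts)

    def vector(counts):
        # Weight = term frequency * smoothed inverse document frequency.
        return {t: tf * (math.log((1 + n_docs) / (1 + df[t])) + 1)
                for t, tf in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    q_vec = vector(Counter(tokenize(query)))
    scores = {doc_id: cosine(q_vec, vector(counts))
              for doc_id, counts in doc_terms.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

      A language model approach would instead estimate, for each document, the probability of generating the query, which is a different scoring function over the same term counts.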

      A final point I would like to make is that IR is a very actively researched field. If you're going to do your own coding (specifically the retrieval model), I suggest you buy a book and get reading. Most of the basic problems (and there are many) have been figured out and it'll save you a lot of trouble if you just read up on how to update an index or find spelling suggestions, instead of figuring it out for yourself. It's possible to index your documents with Lucene and run searches on them in half an afternoon, but it takes some basic knowledge to get it right, and make the app useful. (Look at wikipedia's search for an example of what you get when you don't follow through, and stop after it seems to work ok).

  6. Re:Google Desktop or Appliance by NoNeeeed · · Score: 2, Informative

    Yep, a Google appliance (or equivalent, there are others on the market such as X1) is the way to go.

    I set up a Google Mini for indexing an internal wiki, our bug tracking system, and some other systems, and it is very straight-forward.

    I know the original question mentioned meta-data, but you have to ask yourself if the meta-data is going to be maintained well enough that the search index will be valid. Going the Google Appliance route is so much simpler. It takes a bit of tweaking to set up the search restrictions, but once up and running, it works flawlessly. Most importantly, it doesn't require everyone to make sure that all their document meta-data is perfect.

    Google appliance pricing is really quite cheap when you compare it to the time cost of setting up a meta-data driven system.

    Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

    Paul

  7. Re:Easy by avronius · · Score: 2, Informative

    If you host all of your documentation on a website, take a look at ht://dig [http://www.htdig.org].

    I've deployed it across a handful of servers, and it does a good job of crawling, but doesn't do well with javascript. If you have javascript for your web's frontend, you can write a shell script to find . -print, prepend the urls into a file, and point htdig at that file. It will dig into each file it finds, and create a searchable database of everything that it finds.
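
    The "find . -print, prepend the urls" step could be sketched in Python like this. The base URL and the `build_url_list` name are placeholders for illustration, and how the resulting file is handed to ht://Dig depends on your configuration.

```python
import os
from urllib.parse import quote


def build_url_list(doc_root, base_url):
    """Walk doc_root (like `find . -print`) and turn each file path into a URL."""
    urls = []
    for dirpath, _dirnames, filenames in os.walk(doc_root):
        for name in filenames:
            rel_path = os.path.relpath(os.path.join(dirpath, name), doc_root)
            # Use forward slashes and percent-encode spaces and the like.
            url_path = quote(rel_path.replace(os.sep, "/"))
            urls.append(base_url.rstrip("/") + "/" + url_path)
    return sorted(urls)
```

    Writing `"\n".join(build_url_list(...))` to a file gives one URL per line.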

    You add /cgi-bin/search.cgi to your page, and you can auto-magically search your documentation.

    - Avron

  8. I do this in several programming languages by MarkWatson · · Score: 2, Informative

    There are 2 problems: getting plain text out of documents, then indexing the plain text

    A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.

    The Apache POI project (Java) can read and write several Microsoft Office formats.

    For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).

    I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby:

    http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html

    Here is another short snippet for reading OpenOffice.org documents in Ruby:

    http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html
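
    For comparison, here is a rough Python equivalent of the idea (not the Ruby snippet from the linked post). It relies on the fact that an OpenDocument file is a zip archive whose text lives in `content.xml`; the `odf_paragraphs` name is made up, and nested or unusual structures are not handled.

```python
import zipfile
import xml.etree.ElementTree as ET

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def odf_paragraphs(path_or_file):
    """Return the text of each paragraph and heading in an ODF text document."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("content.xml"))
    paragraphs = []
    for element in root.iter():
        if element.tag in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            # itertext() also collects text nested in spans, links, etc.
            paragraphs.append("".join(element.itertext()))
    return paragraphs
```

    The returned list can be joined and passed straight to whichever indexer you pick.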

    ---

    You might just want to use the entire Nutch stack, which collects documents, spiders the web, has plugins for many document types, etc. Good stuff!

    http://lucene.apache.org/nutch/

  9. If money's not an object... by djpretzel · · Score: 2, Informative

    As a Documentum developer, especially in light of the recent 6.0 release, I'd be remiss not to recommend it for such a purpose. It's expensive, rather complex, and requires solid development talent to implement, but is almost infinitely configurable and customizable, and there are separate components (at cost, of course) that can add on all sorts of fun functionality like collaboration, digital asset management, etc. It has the ability to auto-tag documents based on configurable rules using Content Intelligence Services and supports extensible object hierarchies, workflows, lifecycles, taxonomies, web services, you name it. It's probably overkill for the user in question, and it's far from open source (although EMC is doing an admirable job at encouraging code exchange, and the new dev. environment is based on Eclipse), but it's pretty darn slick when you look at the ground it covers, functionally.

  10. Re:Easy off the shelf by homey+of+my+owney · · Score: 2, Informative

    If you're looking for an index, a document management system probably makes sense. This one is inexpensive and very good.

  11. Re:Install Wumpus Search by gvc · · Score: 2, Informative

    Sorry, mangled the URL in the parent: Wumpus-Search.org

  12. Re:Check out Alfresco! by dsgfh · · Score: 2, Informative

    The parent here speaks the truth.
    I notice a lot of the comments in the thread are coming from developers or sysadmins who want to solve everything with libraries or command line tools. But it really sounds to me like you need a reasonable document management system (and of course being a slashdot reader you want it for free).

    Again, I'm not affiliated with Alfresco, but did quite a bit of research into open source DMS's that would run in a java environment for a couple of recent projects. I found Alfresco to be well architected, easily extendible if I needed it to be and importantly simple to deploy & get running. It will integrate with your LDAP for access and while it's marketed as an Enterprise CMS, is quite capable of doing DMS.

    It uses Lucene under the hood, and while it has a web UI, isn't focused on indexing web sites. You can record meta-data against docs, and it's also capable of extracting some metadata from common MS Office formats. I've no doubt this could be extended if there were other doc properties you wanted access to (although I've never tried myself).

    Most importantly is that the project & community is quite healthy with very active forums. You can get paid support (the Enterprise License) if you so desire, but I expect you'd probably start with the GPL version just to get yourself up & running.

    I wouldn't recommend the SMB interface for the time being as there's currently an outstanding bug with it that causes it to die after a while (the rest of the app continues to run happily), however the FTP interface is great for an initial import of docs. Also take a look at the rules capability for classifying/sorting docs as they're imported.

    It does the basics like check-in/check-out & workflow, and can be backed by your DB of choice as it uses Hibernate for ORM. Searching can be done against keywords or meta-data (classifications, dates, authors etc) & in my experience is more powerful/useful than SharePoint's keyword-based searching. If you're really keen you can use the Java or Web Service APIs for integrating into other solutions.

    Again, I'm not affiliated, but clearly I'm a fan-boy :) I'd recommend installing the base Alfresco Community release (no need for Web Content Management, Records Management etc to start with), loading some docs into it via the FTP interface (or upload a zip via the web interface which it will explode out for you) & giving it a test run. I've got people asking me every couple of days when we're rolling it out internally (just got to finish the sharepoint comparison first).

  13. Search software by j.leidner · · Score: 2, Informative
    Lucene

    Terrier

    Indri/Lemur

    MG

  14. Re:Meta tags are worthless, generally by Keith_Beef · · Score: 2, Informative

    Meta tags are worthless, generally, unless you have a librarian who ensures correctness.

    DON'T TRUST USERS TO ENTER META DATA!!!

    I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

    Unless you can pin responsibility for a document to a named person, you can't trust anything in the document. Not metadata, not content, not presentation.

    The meta tags in most of the documents I deal with are inserted by the applications, and only the content is human-drafted. Those meta tags contain information like creation date, modification date, application name, character encoding, etc. They are generally trustworthy.

    I'm also in the process of building a documentation system; it will be a set of documents in various formats, with an HTML interface, Tomcat server and Lucene to make it fully searchable.

    In a previous job, I did a similar thing with Apache and ht://dig on an old Dell I recycled. Document files could be uploaded by anybody with an FTP account on the server, and index files were automatically regenerated by a CRON task at 04h00 each day.
    I could have made a trigger to regenerate the index after each FTP upload session, but using CRON was easier and sufficiently frequent to be useful.

    This time around, the whole system of Tomcat webserver and Lucene search engine is bundled on a CD-ROM with the docs to run on any of the firm's laptops. Because I control the documents, I can build the index files and burn them to the CD-ROM before distribution.

    Beef

  15. Look at how you will access the docs first by rclandrum · · Score: 3, Informative

    As someone who has made a 30-year career out of designing and building document management systems, I would urge you to look first at how you expect your users to find the documents they need. The expected results of a search should guide your choice of indexing methods - and the popular "meta tagging" method isn't always the best. There are shortcomings with all methods.

    Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.

    Using meta-tags gives the appearance of pre-classifying documents and having the users do it themselves means you don't have to have a dedicated person to assign the tags. The disadvantage is that everybody makes up their own tags or if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed of missing some documents using this method.
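
    One common mitigation for drifting tags is to map whatever people type onto a controlled vocabulary at indexing time. The Python sketch below is illustrative only; the synonym table and the `normalize_tags` name are hypothetical, and a real table would need someone to maintain it.

```python
# Hypothetical synonym table mapping tags people actually type to one
# canonical tag; the entries are illustrative only.
CANONICAL_TAGS = {
    "floppy": "removable media",
    "cd": "removable media",
    "dvd": "removable media",
}


def normalize_tags(tags):
    """Lower-case, trim, and map each tag to its canonical form."""
    normalized = set()
    for tag in tags:
        key = tag.strip().lower()
        # Unknown tags pass through unchanged rather than being dropped.
        normalized.add(CANONICAL_TAGS.get(key, key))
    return sorted(normalized)
```

    This helps only for variants someone has anticipated, so it reduces the missed-document problem rather than removing it.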

    Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every document is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.
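
    As a rough illustration of database indexing, here is a minimal catalog using Python's built-in sqlite3 module. The table layout, column names, and helper functions are assumptions made for the example, not a recommended schema.

```python
import sqlite3


def create_catalog():
    conn = sqlite3.connect(":memory:")  # in-memory database for the sketch
    conn.execute(
        "CREATE TABLE documents ("
        " id INTEGER PRIMARY KEY, title TEXT, author TEXT,"
        " created TEXT, path TEXT)"
    )
    # Indexes on the searched columns keep lookups fast as the table grows.
    conn.execute("CREATE INDEX idx_author ON documents(author)")
    conn.execute("CREATE INDEX idx_created ON documents(created)")
    return conn


def add_document(conn, title, author, created, path):
    conn.execute(
        "INSERT INTO documents (title, author, created, path) VALUES (?, ?, ?, ?)",
        (title, author, created, path),
    )


def find_by_author(conn, author, since="0000-00-00"):
    # Dates are stored as ISO 8601 text, so string comparison orders them.
    rows = conn.execute(
        "SELECT title FROM documents WHERE author = ? AND created >= ?"
        " ORDER BY created",
        (author, since),
    )
    return [title for (title,) in rows]
```

    The work "on input" mentioned above is exactly the `add_document` call: someone or something has to supply correct field values for every file.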

    Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.

    So - understand your users' search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.