Slashdot Mirror


Best Way to Build a Searchable Document Index?

Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?

216 comments

  1. Gee I don't know.. by Anonymous Coward · · Score: 4, Funny

    You're the one gettin' paid, you figure it out.

    1. Re:Gee I don't know.. by ditto999999999999999 · · Score: 3, Interesting

      It's true. I finish a double major in University, worked in a relevent field the whole time, have excellent references, and now I can't find work... Hire me and I will do this for you.

    2. Re:Gee I don't know.. by Anonymous Coward · · Score: 1, Insightful

      Why do you bother answering if you're just going to be an idiot? The guy asked a legit question for a legit problem.

    3. Re:Gee I don't know.. by Anonymous Coward · · Score: 0, Troll

      I agree with the above. This is what communities are for. I've been a programmer for 8 years, and forums/newsgroups/web have helped me and every other programmer I've ever met a lot over the years. For the guy with the "double major" stick in his butt: If you weren't such a pompous ass, someone might have hired you already, or better yet, you might still be working at your old job with those "excellent references" of yours. If you read the question, you'd have understood that the guy works in an agency. This means he needs to be a jack of all trades. Agencies ask the impossible from programmers on ridiculously short deadlines. You have to get the job done now. You don't have the luxury of taking months for planning, years for coding and months for debugging. You can't know everything about everything. You need to know how to search, and if you can't find what you are looking for, you have to know how to ask questions. Like the guy before me said, he asked a legit question for a legit problem. If you don't know the answer, don't waste everyone's time. There is always someone who knows better than you. If you understood that, you might still be working. I pray to god that the original poster ends up being your boss one day so he can fire your ass.

      F-ing Tool....

      I almost forgot about the question (lol sorry, idiots piss me off): Lucene is the way to go. There's a small learning curve, but it's robust and stable. After you've used it once, you'll find it easy for other projects that come along.

    4. Re:Gee I don't know.. by rhyder128k · · Score: 1

      The guy's not in college now. In the real world, the correct answer is often, "I'd do my research in order to find the best solution."

      --
      Michael Reed, freelance tech writer.
    5. Re:Gee I don't know.. by intheshelter · · Score: 1

      Did it ever occur to you that is exactly what he is doing? He polls a large group of vocal and knowledgeable (your comment excluded of course) people and based on their suggestions narrows his search quickly. Not surprising your sig was AC with an asinine comment like that.

    6. Re:Gee I don't know.. by strings42 · · Score: 1

      He's *doing* the research, that's why he's asking the question. Geez ...

    7. Re:Gee I don't know.. by tha_mink · · Score: 2, Funny

      It's true. I finish a double major in University, worked in a relevent field the whole time, have excellent references, and now I can't find work... Hire me and I will do this for you.

      Perhaps it's your grammar?
      --
      You'll have that sometimes...
    8. Re:Gee I don't know.. by ditto999999999999999 · · Score: 1

      I devote time to proof-reading my cover letter and resume. My slashdot posts OTOH...

    9. Re:Gee I don't know.. by Anonymous Coward · · Score: 0

      Isn't the point of a resume to ask someone to hire you and to tell them why they should? Isn't that what you did in that post?

    10. Re:Gee I don't know.. by Anonymous Coward · · Score: 0

      Yes, but can you tell the difference two? I know I can.

    11. Re:Gee I don't know.. by EvanED · · Score: 1

      I don't have anything meaningful to add to this conversation, but I would just like to say that I find the juxtaposition of your post and signature amusing:
      Perhaps it's your grammar?
      --
      You'll have that sometimes.

  2. Easy by Anonymous Coward · · Score: 0, Redundant

    Grep and flat files. The way God intended.

    1. Re:Easy by avronius · · Score: 2, Informative

      If you host all of your documentation on a website, take a look at ht://dig [http://www.htdig.org].

      I've deployed it across a handful of servers, and it does a good job of crawling, but doesn't do well with javascript. If you have javascript for your web's frontend, you can write a shell script to find . -print, prepend the urls into a file, and point htdig at that file. It will dig into each file it finds, and create a searchable database of everything that it finds.

      You add /cgi-bin/search.cgi to your page, and you can auto-magically search your documentation.

      - Avron
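
      A rough sketch of the "find the files, turn them into URLs, write them to a file" step described above, in Python rather than shell. The document root, the base URL and both function names are made up for illustration; how the resulting file gets handed to htdig is left to its own configuration.

```python
import os
from urllib.parse import quote


def build_url_list(doc_root, base_url):
    """Map every file under doc_root to a URL under base_url."""
    urls = []
    for dirpath, dirnames, filenames in os.walk(doc_root):
        dirnames.sort()  # visit subdirectories in a stable order
        for name in sorted(filenames):
            rel = os.path.relpath(os.path.join(dirpath, name), doc_root)
            # Forward slashes for URLs, percent-encoding for spaces and the like.
            rel_url = quote(rel.replace(os.sep, "/"))
            urls.append(base_url.rstrip("/") + "/" + rel_url)
    return urls


def write_url_list(urls, out_path):
    """Write one URL per line so a crawler can use the file as its start list."""
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(urls) + "\n")
```

      Something like write_url_list(build_url_list("/var/www/docs", "http://intranet.example/docs"), "urls.txt") would then produce the list, assuming the web server really does serve that directory at that address.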

  3. Lucene by v_1_r_u_5 · · Score: 5, Informative

    Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.

    1. Re:Lucene by Anonymous Coward · · Score: 5, Informative

      Yes. It's hard to beat Lucene if you don't mind working at the API level. If you want a ready-built web crawler, check out Nutch, which is based on Lucene.

    2. Re:Lucene by BoberFett · · Score: 3, Informative

      I haven't used Lucene, but as for commercial software I've used dtSearch and ISYS and they are both excellent full text search engines. Both have web interfaces as well as desktop programs, and SDKs are available for custom applications. They scale to a massive number of documents per index, in a large variety of formats and are very fast. They have additional features like returning documents of all types in HTML so no reader is required on the front end other than a browser so legacy formats are easier to access.

    3. Re:Lucene by knewter · · Score: 4, Informative

      Lucene's good. If you haven't yet, have a look at Ferret, a port of Lucene for Ruby. It's listed as faster than Lucene. I've used it in 20+ projects now as my built-in fulltext index of choice, and it's pretty great. You can easily define your own ranking algorithms if you'd like. You can find more information on Ferret here: http://ferret.davebalmain.com/trac/

      I've got a prototype of the system described in the OP that we did while quoting a fairly large project. It's really easy to have an 'after upload' action that'll push the document through strings (or some other third party app that can operate similarly, given the document type) and throw the strings into a field that gets indexed as well. That pretty much handles everything you may need.

      Obviously I'd also allow someone to specify keywords when uploading a document, but if this engine's going to just be thrown against an existing cache of documents, strings-only's the way to go.
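
      The "push it through strings and index the result" idea might look roughly like this. It is a Python approximation of what strings(1) does (runs of printable ASCII), not the prototype mentioned above; the function names and the record layout are assumptions, and a real setup would use a proper parser per document type where one exists.

```python
import re


def extract_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes, roughly like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


def build_index_record(path, data, keywords=()):
    """Bundle user-supplied keywords and extracted text into one record to index."""
    return {
        "path": path,
        "keywords": list(keywords),  # optional, entered at upload time
        "content": " ".join(extract_strings(data)),  # crude full text
    }
```

      The record would then be handed to whatever full-text engine is in use, with "content" as an indexed field.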

      --
      -knewter
    4. Re:Lucene by Anonymous Coward · · Score: 0

      One of the few products I've found that you can search for partial strings, not just from the beginning of a word, is Copernic.


    5. Re:Lucene by caseydk · · Score: 2, Informative


      DocSearcher - http://docsearcher.henschelsoft.de/ - already does it. A friend with the US Coast Guard wrote it 4+ years ago, I deployed it within the Department of Justice for a few projects, and it's pretty widely used among some of the local tech circles. It even plugs into Tomcat if you want a web-based UI.

    6. Re:Lucene by Dadoo · · Score: 1

      I checked out the documentation on Lucene, and it appears to be designed for searching the documents on a few web servers.

      In my situation, I've got a couple dozen servers (mostly Windows, but some Linux), and maybe 8TB of data, mostly in Word documents, Excel spreadsheets, etc. Can Lucene (or Nutch) scale up to something like that? I'd also like it to search Windows network drives. Is that possible?

      --
      Sit, Ubuntu, sit. Good dog.
    7. Re:Lucene by jafac · · Score: 1

      About 10 years ago, I used a product called Folio, which was the same product Novell used for their "Novell Support Encyclopedia" - they had a great set of robust tools, including support filters for a wide enough variety of formats (and tools to write your own), and your data would all compile down to what they called an "Infobase" - which was a single indexed file containing full text, markup, graphics, and index - readable in a free downloadable reader with a pretty decent (for 1996) search engine, that did boolean, stem, and proximity searches.

      It was very convenient to be able to give to either a reseller, or high-end customer, this single-file, containing reasonably up-to-date support information on our product that they could search on. It was probably the #1 thing our tech support department spent money on that reduced call volume. Then we got bought, and the new tech support director, an IBM guy, replaced everything with Lotus Notes - (and other headcount-increasing, empire building tools).

      The Folio company was bought up and I think the technology is now used by the company that does Lexis-Nexis.

      For now, my current employer is using Lucene.

      --

      These are my friends, See how they glisten. See this one shine, how he smiles in the light.
    8. Re:Lucene by dilute · · Score: 2, Informative

      Lucene is strictly an indexing engine. It wants to index text. It can index metadata as well as full text. Your surrounding application gets the files to index from wherever (local hard drive, database BLOBS, remote Windows shares or what have you). We don't care if the files are Word, PDF, Powerpoint, HTML, or whatever. A parser (many free ones available) extracts the text. We also don't care what web server you are using - using the index to identify and retrieve files is a totally separate process. Lucene indexes the text stream one-by-one and stores the results in a very efficiently organized index. It has been ported to a bunch of languages, including dotNET. I haven't tried it on terabytes of data but it rips through gigabytes very fast. Assuming all 8 terabytes don't change between runs the scale should be no problem.

      If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it.
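
      To make the "it only indexes text, including metadata fields" point concrete, here is a toy inverted index in Python. It is not Lucene's API or data structure, just the general technique; the class name, field names and tokenizer are invented for the example.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


class TinyIndex:
    """Toy inverted index: maps (field, term) to the ids of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, doc_id, fields):
        """fields maps a field name (e.g. 'keywords', 'body') to already-extracted text."""
        for field, text in fields.items():
            for term in TOKEN.findall(text.lower()):
                self.postings[(field, term)].add(doc_id)

    def search(self, field, query):
        """Return ids of documents whose field contains every term in the query."""
        terms = TOKEN.findall(query.lower())
        if not terms:
            return set()
        result = set(self.postings.get((field, terms[0]), set()))
        for term in terms[1:]:
            result &= self.postings.get((field, term), set())
        return result
```

      The index never sees a Word or PDF file, only the text a parser pulled out of it, and the doc_id (a path or URL) is what the surrounding application uses to fetch the original.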

    9. Re:Lucene by BoberFett · · Score: 1

      Yep, Folio Views is another one, but I have no personal experience with it. I wrote two commercial software packages (CD and internet legal research systems) using ISYS and dtSearch, and I'm familiar with Folio because Lexis was a competitor of the company I worked for. I'm not sure what its complete capabilities are, but I have to imagine it's comparable.

    10. Re:Lucene by bwt · · Score: 2, Interesting

      Dealing with large data sets isn't really technologically challenging. You can grow to an arbitrarily large data set size simply by partitioning: it works for google. It may be expensive to stand up a bunch of servers, but I don't think it's really that hard.

      What is more complicated is to deal with large numbers of concurrent requests. Then you need clustering. There are big sites that do both partitioning and clustering simultaneously with Lucene. I seem to recall reading that Technorati uses Lucene on a cluster of 40 servers. With 8TB of data you are going to have a big one time computational problem to build the index. Lucene can recombine indexes, so you could distribute index creation to a lot of servers.
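
      The "build partial indexes on many servers, then recombine" idea can be sketched in a few lines of Python. This shows the general technique only, with plain dictionaries standing in for real index files; it is not how Lucene stores or merges its segments, and the function names are made up.

```python
from collections import defaultdict


def build_partial_index(docs):
    """docs maps doc_id -> text; returns term -> set of doc ids for one partition."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)


def merge_indexes(partials):
    """Combine per-partition indexes by taking the union of each term's postings."""
    merged = defaultdict(set)
    for partial in partials:
        for term, doc_ids in partial.items():
            merged[term] |= doc_ids
    return dict(merged)
```

      Each server would run build_partial_index over its own share of the files, and because merging is just a union per term, the order in which partial indexes arrive doesn't matter.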

    11. Re:Lucene by Cato · · Score: 2, Informative

      You might also like to investigate Plucene and KinoSearch, which are both Perl ports of the Lucene engine. It's also worth considering combining your search engine with Wikis where possible - then you can find documents by keywords and also navigate to a Wiki page providing context and related documents or Wiki pages. TWiki, which is the most popular open source enterprise Wiki engine, has plugins for both these engines, see http://twiki.org/cgi-bin/view/Plugins/SearchEnginePluceneAddOn

    12. Re:Lucene by passthecrackpipe · · Score: 4, Informative

      "If you needed to run, say 100 indexing engines in parallel and merge the indexes, you'd have to research that. Somebody's probably done it."

      Yes, they have. In my previous job we had to search 2 terrabytes of plain text data (HTML) really fast. The company chose Autonomy, and many developers spent many months trying to make it work, consuming insane amounts of hardware resources for mediocre results, and still requiring . One lone (and brilliant) dev whipped up a Lucene proof of concept in a weekend, and it was faster (full index in a day), required fewer resources (a single HP DL 585, 16GB RAM, 4x dual-core AMD, as opposed to 10 of the same), had a smaller index (about a 5th of Autonomy's), returned results faster, gave a more accurate result set, and was significantly more flexible in making it do what we actually needed it to do.

      Lucene wins hands down

      --
      People who think they know everything are a great annoyance to those of us who do.
    13. Re:Lucene by mynickwastaken · · Score: 1, Funny

      From the Nutch! web site: "Search this site with Google" ;-)

    14. Re:Lucene by Anonymous Coward · · Score: 0

      If you want something that doesn't require programming but still uses lucene have a look at focuseek searchbox: http://www.focuseek.com/products.php

    15. Re:Lucene by Thuktun · · Score: 4, Funny

      Yes, they have. In my previous job we had to search 2 terrabytes [...]

      ...of good ole down-to-earth data.
    16. Re:Lucene by daemous · · Score: 1

      Solr is the de facto search server implementation of the Lucene library. http://lucene.apache.org/solr/ There is also a Ruby client system that Erik Hatcher (who co-authored "Lucene In Action") has made called "Solr Flare".

    17. Re:Lucene by M.+Baranczak · · Score: 1

      Lucene's the bomb. And if you want something sort of in between Nutch and the bare Lucene library, check out Solr. It's a J2EE web application that provides an XML-based front-end to Lucene.

    18. Re:Lucene by Anonymous Coward · · Score: 0

      Another option related to Lucene is to use Solr. It's basically a RESTful API layer that uses Lucene behind the scenes. It's nice because it plays nice with others. I'm developing a Rails interface to a large document repository and am using solr-ruby to talk to the Solr servlet and get native Ruby objects out of it. It's very cool, easy to set up and extremely fast.

      The only problem is in the parsing of the different file formats. You'll have to pull in some external parsers to get the text out of the documents that you've got.

      Someone else suggested that you use Nutch to do a crawl/parse of your data. If you do this you could conceivably use Solr to interface with the lucene indexes created by nutch. This would allow you to use the Nutch parsers while keeping your full-text interface abstracted through Solr.

    19. Re:Lucene by ubrgeek · · Score: 1

      So silly question: Is there a way to take the next step and have pre-determined words be used to have the documents auto-populate a wiki? (i.e. the documents go in and the links automatically exist between the wiki pages.)

      --
      Bark less. Wag more.
    20. Re:Lucene by pthisis · · Score: 1

      What is more complicated is to deal with large numbers of concurrent requests. Then you need clustering.


      But note that with a good indexer, "large numbers" should be pretty large. We have no problem handling about 500 requests/second for full-text paragraph-level searches of about 1 TB of data on a fairly modest machine. I can't remember the exact details, but it's something like a 1.8 GHz Pentium IV with 512 MB of RAM, and it's got plenty of room to grow (otherwise we'd at least add more RAM).

      Failover for availability is certainly nice, but with a decent design it'd be surprising to need clustering for performance until you get well past the "what indexers are out there?" question-asking phase of your data mining career.

      (The main thing we use clustering for is full regex searches, but the text indexer and smart caching can be used to make many common applications of those pretty fast as is).
      --
      rage, rage against the dying of the light
  4. Google by Anonymous Coward · · Score: 3, Informative

    We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.

    If you value your privacy, invest in a Google mini, though.
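
    The "check IP, not user-agent" rule could be sketched like this in Python. The networks below are placeholder documentation and private ranges, not real crawler addresses, and the function name is invented; the actual crawler ranges would have to come from a source you trust.

```python
import ipaddress

# Placeholder ranges for illustration only (RFC 5737 / RFC 1918), NOT real crawler networks.
CRAWLER_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]
STAFF_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]


def may_fetch_full_document(remote_ip, user_agent=""):
    """Decide by source address only.

    The user_agent argument is accepted but deliberately ignored,
    because any client can claim to be a crawler in that header.
    """
    addr = ipaddress.ip_address(remote_ip)
    return any(addr in net for net in CRAWLER_NETWORKS + STAFF_NETWORKS)
```

    A request from outside both lists gets refused (or sent to a login page) no matter what it says in its User-Agent header.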

    1. Re:Google by rta · · Score: 5, Informative

      At the previous place I worked we had a Google Mini and it was better than anything we had come up with in-house.

      We even pointed it at the web-cvs server and bugzilla and it was great at searching those too.

      To see all the bugs still open against v 2.2.1 or something like that, Bugzilla's own search was better. But for searching for "bugs about X" the Google Mini was great.

      It only cost something like $3k ircc.

      not exactly what you asked about, but you should definitely see if this wouldn't work for you instead.

    2. Re:Google by TooMuchToDo · · Score: 2, Interesting

      Seconded. I've done implementations (hosted and in-house) of both Google Minis as well as the full blown Enterprise appliances. They are amazing creatures. I would recommend the Mini to almost anyone, while the Enterprise costs a pretty penny.

    3. Re:Google by shepmaster · · Score: 3, Informative

      The company I work for, Vivisimo, makes an awesome search engine. Although I've never dealt with the Google box directly, I know that we have had customers get fed up with the Google box and replace it quite easily with our software. Click the first link to see a pretty flash demo, or go to Clusty.com to try out a subset of the functionality for real. We specialize in "complex, heterogenous search solutions", which exactly fits most intranet sites I've seen. Files are on SMB shares, local disks, Sharepoint, Lotus, Documentum, IMAP, Exchange, etc, etc, etc. We connect to all those sources and provide a unified interface. You can do really neat tricks with combining content across multiple repositories, such as metadata from a database added to files on SMB shares. We support Linux, Solaris, and Windows, all 32 and 64 bit. Although I may work here, it really is a great product, and I use it at home to crawl my email archives and various blogs, websites, forums, things that I use frequently but have sucky search.

    4. Re:Google by TooMuchToDo · · Score: 1

      How does the pricing compare to the Google product offerings? And does your licensing allow us to offer hosted solutions?

    5. Re:Google by shepmaster · · Score: 1

      I'm just a code monkey, but I know that we are targeted at the enterprise and intranet markets. I'm not 100% sure on the hosting, but our salesfolk are fairly straightforward about answering such things... You can check out some of our customers to get a feeling of some of the companies that use our software. Chances are good that you have actually used our code at some point in the past and never even known it...

    6. Re:Google by shepmaster · · Score: 1

      And one day, I will learn how to make links! clusty.com

    7. Re:Google by rgaginol · · Score: 2, Insightful

      I'd have to agree - I'm a Java developer so if I was doing the solution, I'm sure I could whip up something cool with Lucene or whatever. But... in terms of long term maintenance costs, why develop anything yourself if the problem is already solved? And on the point of cool: good IT systems aren't cool... they do a job and do it well... maybe this is the first project where you find a "cool" solution is just not justifiable. I'm sure the Google appliance would let you put some quite extensive customizations on top of their API... well, that's been my experience with other Google products/services (Google Web Toolkit or Google Maps). Still, I've also found that some of the "nice to have" APIs are kept out of reach with some things - I was trying to put a listener on a custom tile image in a map application and suddenly came up against a barrier; "oh-boy-we'll-let-you-play-but-don't-dare-touch-that" sure came to mind. I guess the only way forward is to scope your requirements well - and I'll bet half of the "must haves" aren't really that important. After that, some research on each of the possible solutions would be good, along with the cost to implement/maintain them. If you've got programming expertise in house, it may be tempting to use them as a no brainer, but it is worth finding out their cost to implement a good solution (one with maintenance and documentation factored in... basically, whatever amount of time they say it will take to code, times 4).

    8. Re:Google by aoteoroa · · Score: 1

      I have to agree. How much is your time worth? I looked at rolling my own search engine but we were able to purchase a Google mini for $6000. When it arrived at 9am, I had it set up and crawling our intranet by lunch time (and spent the whole next day fooling around with it and showing it to everybody). Maybe with a month or two of development time I could have developed a database-based CMS that was more tailored to our company, but who has that much spare time in a day?

    9. Re:Google by A+Non+Mouse+Cowhand · · Score: 1

      I can't deny that Lucene, Nutch, Solr and Hadoop are all useful. But boy, they take some real time to get to the point where your boss is all "sweet as a nut boys!". Personally, for medium sized document collections like this I simply use Google Desktop Search and turn it into a server using 2 things:

      1) DNKA - it acts as a web server (search server) by interacting as a layer between Google Desktop Search (Enterprise) and user http://www.dnka.com/
      2) Kongulo - it crawls websites and transmits the HTML documents it finds back into your new GDS+DNKA server http://goog-kongulo.sourceforge.net/

      The cool thing about Kongulo is that it's written in Python, so you can easily modify it to suit your needs more closely, for example with a list of no-follow sites/titles/URIs, or a meta-tag extraction component.

      We have at least one index here available to about 1000 folks on one dual-core 2GB RAM Dell generic server that has now indexed over half a million documents, and it does not look like it's going to slow down any time soon. I find this a really useful solution; even though your meta-tags might be difficult to deal with, it's worth a look IMHO.
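
      For the HTML part of the collection, a meta-tag extraction component could be as small as the following, using only Python's standard html.parser module. This is a standalone sketch, not Kongulo code; the class and function names are made up, and Word, Excel or PDF properties would need their own extractors.

```python
from html.parser import HTMLParser


class MetaTagExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from an HTML document."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        # Skip tags such as <meta charset="..."> that carry no name/content pair.
        if name and content is not None:
            self.meta.setdefault(name.lower(), []).append(content)


def extract_meta(html_text):
    """Return a dict mapping lower-cased meta names to lists of content values."""
    parser = MetaTagExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.meta
```

      The resulting dictionary can be stored next to the page text so that searches can be restricted to, say, the "keywords" field.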

    10. Re:Google by Anonymous Coward · · Score: 0

      Dude, iirc - if I remember correctly. Not irrc or ircc

  5. Meta tags placed? by harmonica · · Score: 3, Insightful

    Who places what types of meta tags in the documents? I don't understand the requirements.

    Generally, Lucene does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).

    1. Re:Meta tags placed? by rainmayun · · Score: 2, Insightful

      I don't understand the requirements.

      I don't either, and that's because the submitter didn't give enough information. I'm working on a fairly large enterprise content management system for the feds (think 2.5 TB/month of new data), and I don't see any of the solution components we use mentioned in any thread yet. If I were being a responsible consultant, I'd want to know the answers to the following questions at minimum before making any recommendations:

      • What is the budget?
      • How many documents are we talking about? The answer for 10,000 is different than for 10,000,000.
      • Are you looking for off-the-shelf, or is software development + integration going to be involved?
      • Who is going to maintain the integrity of this data?

      Although I am as much a fan of open source as anybody, I don't think the offerings in this area are anywhere near the maturity of commercial offerings. But some of those offerings cost a pretty penny, so it might be worthwhile to hire a developer or two for a few weeks or months to get what you want.

    2. Re:Meta tags placed? by Ctrl-Z · · Score: 1

      This doesn't sound like an enterprise-scale problem. I work for a large ECM vendor, and unless the IT department that we're talking about is huge, ECM is going to be overkill. Not that that would stop vendors from trying to sell it to you if you have enough money to put on the table.

      --
      www.timcoleman.com is a total waste of your time. Never go there.
    3. Re:Meta tags placed? by rainmayun · · Score: 1

      You're probably right. Then again, I think a lot of smaller and growing organizations underestimate the volume of their data and the value of organizing it well. Get it right now, and maybe they grow enough to need a real ECM solution.

  6. Google Desktop or Applicance by wsanders · · Score: 3, Insightful

    Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.

    The inexpensive Google appliances don't have very fine-grained access control, though. But I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.

    --
    Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
    1. Re:Google Desktop or Applicance by Anonymous Coward · · Score: 1, Interesting

      and copernic desktop search outperforms GDS by a long way...

    2. Re:Google Desktop or Applicance by sootman · · Score: 1

      Since Google Desktop works by running its own little webserver, can you install Google Desktop on a server and access it by visiting http://server.ip.address:4664/ ? (I'm at work and my only Windows box has its firewall options set by group policy.)

      --
      Dear Slashdot: next time you want to mess with the site, add a rich-text editor for comments.
    3. Re:Google Desktop or Applicance by slazzy · · Score: 1

      Yes, it is possible to configure Google Desktop that way (disabled by default). There are also a few programs out there that were designed to access Google Desktop search remotely: http://www.asabox.com/goolag/index_en.htm

      --
      Website Just Down For Me? Find out
    4. Re:Google Desktop or Applicance by NoNeeeed · · Score: 2, Informative

      Yep, a Google appliance (or equivalent, there are others on the market such as X1) is the way to go.

      I set up a Google Mini for indexing an internal wiki, our bug tracking system, and some other systems, and it is very straight-forward.

      I know the original question mentioned meta-data, but you have to ask yourself if the meta-data is going to be maintained well enough that the search index will be valid. Going the Google Appliance route is so much simpler. It takes a bit of tweaking to set up the search restrictions, but once up and running, it works flawlessly. Most importantly, it doesn't require everyone to make sure that all their document meta-data is perfect.

      Google appliance pricing is really quite cheap when you compare it to the time cost of setting up a meta-data driven system.

      Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

      Paul

    5. Re:Google Desktop or Applicance by Anonymous Coward · · Score: 0

      Take a look at IBM/Yahoo's free (as in beer) enterprise search box. I have had good experiences with it and especially like its open forum (the IBM developers help out as much as they can).

      http://omnifind.ibm.yahoo.net/
      http://www.networkcomputing.com/channels/enterpriseapps/showArticle.jhtml?articleID=199905064&pgno=9

    6. Re:Google Desktop or Applicance by shepmaster · · Score: 1

      If security is important, you should take a look at Vivisimo Velocity. We offer access control down to the content level. Have a single result document with a title, a snippet, and a piece of sensitive info like money amounts? You can make it so that only select users (from LDAP, or AD, or whatever system) can see the sensitive information, but everyone can see the title and snippet. We also respect the security restrictions from our content sources, such as Windows fileshares or Documentum. And I've heard from customers that we have easily replaced their Google Appliances, and we can be installed on any commodity Linux/Solaris/Windows box.

    7. Re:Google Desktop or Applicance by anthonys_junk · · Score: 2, Interesting

      Meta-data is one of those things that seems like a really good idea, but like all plans, doesn't tend to survive contact with the enemy, which in this case is the user.

      Like software development, the quality of the outcome is implementation dependent.

      We run six thesauri, plus a number of different controlled lists for our users to input metadata. We don't publish any documents that don't have meta attached, and we perform random quality audits. Our users have been trained in fundamentals of classification and also in the payoff for getting it right.

      We use metadata to structure our navigation for some sites, we depend on it for search for our internal documents. Our metadata implementation works incredibly well for us; clearly and consistently outperforming plain text indexing.
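
      As a guess at what enforcing a controlled list at input time can look like, here is a small Python sketch. The vocabulary, the synonym map and the function names are all invented for the example and say nothing about how the system described above is actually built.

```python
# Hypothetical controlled vocabulary and synonym map, for illustration only.
PREFERRED_TERMS = {"backup", "firewall", "virtual private network"}
SYNONYMS = {"vpn": "virtual private network", "backups": "backup"}


def normalize_tags(raw_tags):
    """Map user-entered tags onto preferred terms; return (accepted, rejected)."""
    accepted, rejected = [], []
    for raw in raw_tags:
        term = " ".join(raw.lower().split())  # fold case and stray whitespace
        term = SYNONYMS.get(term, term)
        if term in PREFERRED_TERMS:
            if term not in accepted:
                accepted.append(term)
        else:
            rejected.append(raw)
    return accepted, rejected


def ready_to_publish(raw_tags):
    """Publishable only with at least one valid tag and no unknown ones."""
    accepted, rejected = normalize_tags(raw_tags)
    return bool(accepted) and not rejected
```

      Rejected tags go back to the author rather than into the index, which is the part that keeps the metadata searchable over time.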

      --
      Barbara Felden claims prior art on the flip phone, sues Motorola, Nokia.
  7. Swish-E by ccandreva · · Score: 2, Informative
    1. Re:Swish-E by nuzak · · Score: 1

      swish-E is fast, but the quality of its search results is just awful. We use socialtext at work, which uses swish-e for search, and the search results may as well just be random.

      I don't think it even handles unicode, either.

      --
      Done with slashdot, done with nerds, getting a life.
    2. Re:Swish-E by PinkPanther · · Score: 1
      Swish-E's configuration is pretty flexible even when it comes to relevancy ranking, though it is also quite non-intuitive for lots of different aspects of the configuration.

      And yes, it does not support UTF-8/Unicode/anything-non-ASCII-8.

      But the developer list is quite active and responses are usually accurate (though they also can be terse and sometimes overly-authoritative).

      --
      It's a simple matter of complex programming.
  8. Open Source by HartDev · · Score: 1

    There are many open source solutions for what you are trying to do, also if you want it be portable then I would suggest a CMS that does not require a MySQL database like "Limbo" what does the organization do?

    --
    To see a few of my Android apps goto: www.hartwired.com
    1. Re:Open Source by Anonymous Coward · · Score: 0

      There are many open source solutions for what you are trying to do, also if you want it be portable then I would suggest a CMS that does not require a MySQL database like "Limbo" what does the organization do?

      Exactly what's wrong with a DB based solution? After all, how much more portable can it get? And why MySQL specifically? (That wouldn't be a slam on Alfresco or Mondrian, would it? Both run on more than MySQL)
  9. Avoid kat and beagle by Anonymous Coward · · Score: 0

    Hmm, avoid kat and beagle. Consider using a one-line Perl script using find and grep instead...

    1. Re:Avoid kat and beagle by Tablizer · · Score: 1

      Hmm, avoid kat and beagle. Consider a using a one line perl script using find and grep instead...

      But it still needs some kind of indexing. I don't think they want a sequential search every time a query is run, unless it's a small set. Thus, a database of some sort is probably in order. SQLite may be a quick way to go, although I've heard nasty things about its ODBC drivers (maybe since fixed). But I envision a schema something like:

          table: tags
          -----------
          tag_ID
          tag_descript

          table: tag_doc
          -----------
          tag_ref
          document_path

      Perhaps also have a "document" table to give IDs to documents instead of using paths, and also a document summary description.
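      The schema above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the `documents` table follows the suggestion to give documents IDs and a summary, and the function names (`create_schema`, `add_document`, `find_by_tag`) and the `doc_ref` column are assumptions made for the example, not part of the original post.

```python
import sqlite3


def create_schema(conn):
    """Create the tags / tag_doc tables plus the optional documents table."""
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            doc_id        INTEGER PRIMARY KEY,
            document_path TEXT UNIQUE NOT NULL,
            summary       TEXT
        );
        CREATE TABLE IF NOT EXISTS tags (
            tag_id       INTEGER PRIMARY KEY,
            tag_descript TEXT UNIQUE NOT NULL
        );
        CREATE TABLE IF NOT EXISTS tag_doc (
            tag_ref INTEGER NOT NULL REFERENCES tags(tag_id),
            doc_ref INTEGER NOT NULL REFERENCES documents(doc_id),
            PRIMARY KEY (tag_ref, doc_ref)
        );
    """)


def add_document(conn, path, summary, tag_names):
    """Register a document and link it to each (normalized) tag."""
    conn.execute(
        "INSERT OR IGNORE INTO documents (document_path, summary) VALUES (?, ?)",
        (path, summary),
    )
    doc_id = conn.execute(
        "SELECT doc_id FROM documents WHERE document_path = ?", (path,)
    ).fetchone()[0]
    for name in tag_names:
        name = name.strip().lower()
        conn.execute("INSERT OR IGNORE INTO tags (tag_descript) VALUES (?)", (name,))
        tag_id = conn.execute(
            "SELECT tag_id FROM tags WHERE tag_descript = ?", (name,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO tag_doc (tag_ref, doc_ref) VALUES (?, ?)",
            (tag_id, doc_id),
        )
    conn.commit()


def find_by_tag(conn, tag_name):
    """Return the paths of all documents carrying the given tag, sorted."""
    rows = conn.execute(
        """
        SELECT d.document_path
        FROM documents d
        JOIN tag_doc td ON td.doc_ref = d.doc_id
        JOIN tags t ON t.tag_id = td.tag_ref
        WHERE t.tag_descript = ?
        ORDER BY d.document_path
        """,
        (tag_name.strip().lower(),),
    )
    return [row[0] for row in rows]
```

      Because the lookup goes through the primary keys and the unique tag column, a query does not have to scan every document, which is the point of indexing rather than grepping each time.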

  10. Beagle, Spotlight? by Lord+Satri · · Score: 2, Insightful

    Is this something that would suit your needs: Beagle for Linux, Spotlight for OSX? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.

    1. Re:Beagle, Spotlight? by hejog · · Score: 0

      Spotlight doesn't support indexing of network volumes, which is a killer. It'll support them in Leopard (you'll be able to search Spotlight indexes of remote servers); we can't wait.

    2. Re:Beagle, Spotlight? by Constantine+XVI · · Score: 1

      I don't think Beagle and Spotlight are really network-friendly. There's not really any point in having each and every machine index all the drives on the network. It'd be better to have some sort of networked solution.

      --
      "I think an etch-a-sketch with an ethernet port would beat IE7 in web standards compliance."
    3. Re:Beagle, Spotlight? by M-RES · · Score: 1

      Nope, true. But Spotlight does currently allow you to search network drives/mounted volumes, so once Leopard adds searching remote indexes then it'll save a lot of time and energy and might just do the trick.

    4. Re:Beagle, Spotlight? by jambarama · · Score: 1

      If all you're looking for is desktop search, like beagle or spotlight, it seems to me there are dozens of alternatives. On Windows you have google, msn, yahoo, and my favorite - copernic. On Linux you've got beagle, catfish, and my favorite - kat. Heck, you can just grep most stuff, or shoehorn google desktop. On Macs you've got spotlight, and google desktop, plus frontends like quicksilver and blinkx.

      I'm sure you could find a way to shoehorn some of these apps to work centralized on a server (maybe move the index to the server, point the apps at that index, then point "my documents" to a shared drive so access may be had by all). BUT - if you want a real server app, I think the best suggestions - lucene, google appliances - have already been made.

    5. Re:Beagle, Spotlight? by GiMP · · Score: 1

      Beagle is based on Lucene.NET -- others have recommended Lucene (and its clones) as well, as do I.

    6. Re:Beagle, Spotlight? by Anonymous Coward · · Score: 0

      Leopard (Server) will support network Spotlight searching.

  11. Google Appliance by spcherub · · Score: 1

    If you have a reasonable budget *and* an intranet, you can consider implementing a Google Appliance and pointing it at the network location that houses the documents. The side benefit is that documents can be found/accessed via a browser.

  12. All depends on the document filters ... by molarmass192 · · Score: 1

    Since you're indexing non-text data, you'll need a search engine that has plenty of document filters. We use Oracle Text to do something similar to this, but it's not for the faint of heart. The nice thing about Oracle Text is it includes filters for pretty much any document you'd want to index (PDF, Word, Excel, etc). Of course, Oracle Text query syntax needs an awful lot of lipstick to be made to look like Google query syntax. WMMV.

    --

    Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
  13. google by Anonymous Coward · · Score: 0

    get your site indexed by google then add parameters like

    site:mysitename.com
    filetype:xml ;-)

    cheap quick and easy

  14. mnogosearch by nereid666 · · Score: 1

    Try http://www.mnogosearch.org/ (it's like a small free Google spider).

    --
    Damia
  15. BOFH by wilymage · · Score: 2, Funny
    --
    The secret to creativity is knowing how to hide your sources. -- Albert Einstein
  16. XML document formats by mind21_98 · · Score: 2

    If you're using Office 2007, you can probably hack something together really quickly to pull the meta tags from the files and put them in a database. Not sure about the other formats you need, though--and support from Google, for instance, would probably be beneficial for your company anyway. Hope that helps!

    1. Re:XML document formats by flyingfsck · · Score: 1

      "pull the meta tags from the files" You think so? Usually there is absolutely no relationship between meta data and the file contents. Just think of meta tags on web pages...

      --
      Excuse me, but please get off my Pennisetum Clandestinum, eh!
    2. Re:XML document formats by mind21_98 · · Score: 1

      I was thinking of pulling the XML tags from Office 2007's XML format...
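      As a rough sketch of that idea: Office 2007 files (.docx, .xlsx, .pptx) are ZIP archives, and Office normally writes the document properties to a part named `docProps/core.xml`. The example below, using only the Python standard library, assumes that conventional part name and reads just four fields; the function name `read_core_properties` is made up for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of an Office 2007 package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Output field name -> qualified element name inside core.xml.
FIELDS = {
    "title": "dc:title",
    "creator": "dc:creator",
    "subject": "dc:subject",
    "keywords": "cp:keywords",
}


def read_core_properties(source):
    """Return the non-empty core properties of an Office 2007 file.

    source may be a file path or a file-like object. An archive without a
    docProps/core.xml part yields an empty dict.
    """
    with zipfile.ZipFile(source) as archive:
        try:
            xml_bytes = archive.read("docProps/core.xml")
        except KeyError:
            return {}
    root = ET.fromstring(xml_bytes)
    props = {}
    for name, tag in FIELDS.items():
        element = root.find(tag, NS)
        if element is not None and element.text:
            props[name] = element.text.strip()
    return props
```

      The resulting dict could then be loaded into whatever database or index is chosen. Older binary formats (.doc, .xls), Access files, and PDFs would each need a different extractor.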

  17. Google corporate search by Gwala · · Score: 1

    A googlebox. Indexes file shares and internal websites and makes them searchable. Can be a little pricey though.

    --
    #!/bin/csh cat $0
    1. Re:Google corporate search by Anonymous Coward · · Score: 0

      Another alternative to the Google box are the Thunderstone search solutions.

  18. what to avoid by Anonymous Coward · · Score: 2, Insightful

    You should avoid any system that relies on individual employees putting in these meta-tags. It won't work; they either won't do it, or will do it wrong (spelling errors, inventing their own tags on the fly, and so on.) And then you'll catch hell when they can't find one of those documents they mislabeled. Trust me.
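    If hand-entered tags cannot be avoided, one partial mitigation for the spelling-error and invented-tag problems is to check every tag against a fixed vocabulary at import time. The sketch below is only an illustration using Python's standard difflib; the function name `normalize_tags`, the sample vocabulary, and the 0.8 similarity cutoff are all assumptions chosen for the example.

```python
import difflib


def normalize_tags(raw_tags, vocabulary, cutoff=0.8):
    """Map user-entered tags onto a controlled vocabulary.

    Exact matches (ignoring case and surrounding whitespace) are accepted,
    near-misses such as typos are corrected to the closest vocabulary entry,
    and anything else is returned separately for a human to review.
    """
    vocab = sorted({entry.lower() for entry in vocabulary})
    accepted, rejected = [], []
    for raw in raw_tags:
        tag = raw.strip().lower()
        if tag in vocab:
            match = tag
        else:
            close = difflib.get_close_matches(tag, vocab, n=1, cutoff=cutoff)
            match = close[0] if close else None
        if match is None:
            rejected.append(raw)
        elif match not in accepted:
            accepted.append(match)
    return accepted, rejected
```

    For example, `normalize_tags(["Securty", "backup ", "misc"], ["network", "backup", "security"])` corrects the first tag, accepts the second, and flags "misc" for review. This does nothing about people who skip tagging entirely.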

    1. Re:what to avoid by Mspangler · · Score: 1

      avoid a program called keyfile.

      Evil, Evil, Evil!

      At least the part that's not brain-damaged.

    2. Re:what to avoid by shepmaster · · Score: 1

      Completely correct. However, having a taxonomy of some type is still useful. That's why an automatic taxonomy, generated from your result set, can be extremely useful. Check out clusty.com for an example.

  19. Most easy solution by PermanentMarker · · Score: 4, Informative

    It will cost you some bucks, but just buy MS SharePoint Portal Server and leave the indexing to SharePoint.
    You're not even really required to use added tags... (as most people will put in poor tags).

    But if you like, you can add tags even with SharePoint.

    --
    I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
    1. Re:Most easy solution by DigitalSorceress · · Score: 3, Informative

      Actually, if you are an MS shop and have Windows Server 2003, SharePoint Services 3.0 is a free download (as opposed to SharePoint Portal Server, now renamed, I believe, to Microsoft Office SharePoint Server, which does indeed cost a packet).

      I do a lot of LAMP development, and I'm not the strongest fan of Microsoft for a lot of things, but if you have an MS desktop and MS Office environment, SharePoint Services really is quite decent for INTRANET applications, especially for collaboration. You can set up workflows for check-out/check-in, and it integrates really nicely with some of the more recent MS Office releases. If you connect it to a real MS SQL server on the back end (as opposed to the express edition that it defaults to), you can have full text indexing even with the free SharePoint Services version. The only need for the full-blown Portal/MOSS version is if you think you are going to have a large number of SharePoint sites and want to simplify cross-connecting and management. (At least as far as I can recall)

      I'm not saying SharePoint is the way to go, but I'd at least read up on it and consider it IF you have a lot of MS Office stuff that you plan on indexing/sharing.

      I'd strongly advise avoiding it if you plan to do Internet-based stuff though... at least until you get a good enough understanding of the security issues involved that you feel that you really know what you're doing.

      Just my $0.02 worth.

      --

      The Digital Sorceress
    2. Re:Most easy solution by Anonymous Coward · · Score: 1, Insightful

      Just a couple bucks, $60-70 per seat, plus how much for the server software?

    3. Re:Most easy solution by JacobO · · Score: 1

      hahahahaha...

      I've never seen a SharePoint site where the search worked well at all (particularly in the document libraries.) You might think by its observable behavior that it is simply offering up documents at random instead of searching their contents.

    4. Re:Most easy solution by Anonymous Coward · · Score: 0

      If you upload the files to SharePoint (as opposed to having it index file shares or another web site) you can use the free version of SharePoint, WSS. You need MOSS if you want to index external content. Microsoft has a free download of a virtual machine that has MOSS installed so you can give it a try without putting out any money.

      tk

    5. Re:Most easy solution by guruevi · · Score: 2, Informative

      I am just done with 6 months of SharePoint integration in an MS shop. From a development and security standpoint: STAY AWAY FROM IT. 2003 seems to be an Alpha version, 2007 is still full of bugs (better than Beta but still) and it's also very, very slow (it's based on .NET). To work fairly well for 100 users it requires 1 SQL Server (MSDE will not work for non-development purposes), 2 frontends and 1 load balancer/firewall based on Microsoft Forefront just for security purposes (since the built-in SharePoint requires a lot of stuff to be opened to all users). It's also expensive... 10k/server + CALs for every user and that was in a big MS shop.

      On top of that, there are a lot of caveats, and as soon as you start modifying the layout (even though it's just the HTML) in SharePoint Designer, Microsoft Support will not help you (as if they could in the first place). Simple things like whitespace in between table structures can make your list workflows screw up (yes, there is an actual open case with Microsoft Support for that very issue). A lot of things will not work either and require a nasty hack or workaround (like attachment upload on modified forms); these are known to Microsoft and have been for the last 9 months.

      --
      Custom electronics and digital signage for your business: www.evcircuits.com
    6. Re:Most easy solution by smutt · · Score: 1

      I have one problem that I've learned from my interaction with SharePoint. It doesn't scale. Period end of story. It works great when you have 1000 users or less accessing it concurrently. But try going higher and you're in for a world of hurt.

      --
      The Information Revolution will be fought on the command line.
    7. Re:Most easy solution by big+ben+bullet · · Score: 1

      (Sorry, I don't like sharepoint at all!)
      Or build your own app on top of a Microsoft SQL Server 2005 with Full Text Search
      Technet Article

      No need for tags... let the document itself be the tags...

      The free-as-in-beer express ed. (with advanced blahblah..) however is limited to 2Gb. So you will at the least need a Standard ed. though.

    8. Re:Most easy solution by PermanentMarker · · Score: 1

      I'm sorry, but it does scale; you're just unable to design a proper SharePoint storage solution. Buy a good book
      and you can learn how to create SharePoint farms....

      --
      I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
    9. Re:Most easy solution by perky · · Score: 1

      Yep - totally agree. If you are a MS shop, and have Server 2003 infrastructure (which you will), then WSS 3 is a free download. I have had great success with it for replacing a network file share for document sharing, and for replacing "Tracking" spreadsheets with Sharepoint lists.

      It has basic search built in, and there is an upgrade path. IFilters are built in for the MS formats, and third-party ones are available for non-MS formats.

      One thing - Office Sharepoint designer essentially doesn't work. Just don't bother with it.

      --
      "The new wave is not value-added; it's garbage-subtracted" - Esther Dyson, Dec 1994
    10. Re:Most easy solution by PermanentMarker · · Score: 1

      Where most SharePoint installs go wrong is with people who customize it through their own HTML.
      Try it as it is, plain out of the box.
      I'm not really going to defend SharePoint here (I'm an Exchange Server tech kid), but I did some of those support cases for SharePoint, and 9 out of 10 cases were related to people thinking they could change it in a certain way while they were no great programmers after all. In all of these cases MS helped the customers out; none went unsolved. Well, I can say it's handy to be a premier partner.

      --
      I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
    11. Re:Most easy solution by pnutjam · · Score: 1

      Try OWL, it's a neat little project that looks like it can easily handle your requirements.

    12. Re:Most easy solution by profaneone · · Score: 1

      Actually, if you think you want SharePoint, then you will really want the service upon which SharePoint is based: Xpedio (formerly owned by IntraNetSolutions). Purchased by Oracle and rebranded as "Universal Content Management",
      it has several modules from which to choose. It has converters to convert _all_ document types to PDFs, indexing, workflows, web publishing...

      Overview - http://www.oracle.com/technology/products/content-management/ucm/ucm.pdf
      Downloads - http://www.oracle.com/technology/software/products/content-management/index.html

    13. Re:Most easy solution by Inda · · Score: 1

      Yeah Shitpoint is great...

      It doesn't search on filenames. lol

      You cannot use characters like "&" in your filenames because it's a web server, not a file server. Wait, that shouldn't matter either, but it does.

      It doesn't find anything useful even if you know the exact title of a document you wrote last week.

      Most of us here map our portals to drive letters and use XP's built-in file search. Um, we did that before spending the money...

      Wait, I have an alert... "User XYZ has modified a folder". Well gee thanks for letting me know. Of course, my links still work even though the folder has been changed? No? Bah.

      It's useless.

      --
      This post contains benzene, nitrosamines, formaldehyde and hydrogen cyanide.
    14. Re:Most easy solution by hobo+sapiens · · Score: 1

      In my experience, Sharepoint is useful for one thing: generating pages of links to other pages of links to other pages of links wherein there are word documents with links to other pages of links which, *maybe*, have some real search results, though never the ones you really needed or were searching for. Sharepoint does that *really* well.

      I am sure some of the fault lies with the people implementing Sharepoint sites. But if Sharepoint site after Sharepoint site sucks and is generally worthless, you have to begin suspecting the technology. It's like so much other stuff coming out of Redmond these days: probably a decent idea in its own right, but not well thought out. And that means it fails miserably in sub-optimal real world conditions.

      --
      blah blah blah
    15. Re:Most easy solution by DigitalSorceress · · Score: 1

      I don't have direct experience with scaling it (having only set up on my server at home to get a feel for installation and to play with it a bit), but it would seem that if you're having scaling issues, that the MOSS (Formerly SharePoint Portal Server) may be a better fit.

      Apparently, there's a fairly decent upgrade path from SharePoint Services to full-on MOSS, but again, I do not state that from experience... just from what I've heard.

      --

      The Digital Sorceress
    16. Re:Most easy solution by dekemoose · · Score: 1

      Couple points. Xpedio wasn't owned by IntranetSolutions, it was IntranetSolutions. IntranetSolutions changed its name to Xpedio for a brief period of time and then Xpedio changed its name to Stellent. Oracle bought Stellent. And how is Sharepoint based on that product?

  20. Livelink by Anonymous Coward · · Score: 0

    We use Livelink.

    It's huge, kludgy, awkward, slow, resource intensive, but it works (when it's up).

    1. Re:Livelink by Ajehals · · Score: 2, Funny

      You are in marketing aren't you?

      (I'm sold anyway)

    2. Re:Livelink by Anonymous Coward · · Score: 0

      Actually no, I'm not. I'm one of the techies.

      Similar reasoning as to why our business back-end is SAP on Windows 2k3 servers with a SQL2005 back end.

  21. Upload it to the web by omgamibig · · Score: 2, Funny

    Let google do the indexing!

  22. Check out Alfresco! by thule · · Score: 2, Interesting

    I posted this before on Slashdot. I discovered a while ago a cool system called Alfresco. There are free (as in liberty) and commercial versions. It acts as an SMB (like Samba), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of the file, and most importantly, search the documents in the system.

    Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.

    Don't use SAMBA for .doc and .pdf's, use Alfresco.

    I am not affiliated with Alfresco, just a happy user.

    1. Re:Check out Alfresco! by G1369311007 · · Score: 1, Interesting

      Along the same lines of Alfresco is Plone. www.plone.org I'm currently the sole admin of a Plone site serving ~50 users on an intranet. We use it for document management etc. Just another option. The CIA's website is made in Plone so it can't be that bad right?! www.cia.gov

      --
      "Don't blink. Don't even blink. Blink and you're dead."
    2. Re:Check out Alfresco! by dsgfh · · Score: 2, Informative

      The parent here speaks the truth.
      I notice a lot of the comments in the thread are coming from developers or sysadmins who want to solve everything with libraries or command line tools. But it really sounds to me like you need a reasonable document management system (and of course being a slashdot reader you want it for free).

      Again, I'm not affiliated with Alfresco, but did quite a bit of research into open source DMS's that would run in a java environment for a couple of recent projects. I found Alfresco to be well architected, easily extendible if I needed it to be and importantly simple to deploy & get running. It will integrate with your LDAP for access and while it's marketed as an Enterprise CMS, is quite capable of doing DMS.

      It uses Lucene under the hood, and while it has a web UI, isn't focused on indexing web sites. You can record meta-data against docs, and it's also capable of extracting some metadata from common MS Office formats. I've no doubt this could be extended if there were other doc properties you wanted access to (although I've never tried myself).

      Most importantly is that the project & community is quite healthy with very active forums. You can get paid support (the Enterprise License) if you so desire, but I expect you'd probably start with the GPL version just to get yourself up & running.

      I wouldn't recommend the SMB interface for the time being as there's currently an outstanding bug with it that causes it to die after a while (the rest of the app continues to run happily), however the FTP interface is great for an initial import of docs. Also take a look at the rules capability for classifying/sorting docs as they're imported.

      It does the basics like check-in/check-out & workflow, and can be backed by your DB of choice as it uses Hibernate for ORM. Searching can be done against keywords or meta-data (classifications, dates, authors etc) & in my experience is more powerful/useful than SharePoint's keyword-based searching. If you're really keen you can use the Java or Web Service APIs for integrating into other solutions.

      Again, I'm not affiliated, but clearly I'm a fan-boy :) I'd recommend installing the base Alfresco Community release (no need for Web Content Management, Records Management etc to start with), loading some docs into it via the FTP interface (or upload a zip via the web interface which it will explode out for you) & giving it a test run. I've got people asking me every couple of days when we're rolling it out internally (just got to finish the sharepoint comparison first).

  23. X1 by bhovinga · · Score: 1

    For searching Microsoft products you can't beat X1 as far as user interface.

  24. Also see Xapian by dmeranda · · Score: 4, Informative

    I'd suggest you should consider a full-text search engine. First start here:
    http://en.wikipedia.org/wiki/Full_text_search

    If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.

    Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.

    Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.

    There are also many other wholly self-contained indexers, most of which are based on web indexing (they have spiders, query forms, etc., all bundled together), like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
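    To make "raw indexing engine" a little more concrete, here is a toy inverted index in Python: the core data structure behind full-text search. It is a teaching sketch only, not how Lucene or Xapian are implemented; the class name `InvertedIndex` and its methods are invented for the example, and it assumes the text has already been extracted from the Word, PDF, or other source files.

```python
import re
from collections import defaultdict


class InvertedIndex:
    """Map each term to the set of documents that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    @staticmethod
    def _terms(text):
        # Lowercase and split on anything that is not a letter or digit.
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, doc_id, text):
        """Index the plain text of one document."""
        for term in self._terms(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the ids of documents containing every query term (AND)."""
        terms = self._terms(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = set(ids) if result is None else result & ids
        return result
```

    A query only touches the posting sets for its own terms, so lookup cost depends on how common those terms are rather than on the total size of the collection. Real engines add ranking, stemming, phrase queries, and on-disk storage on top of this.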

    1. Re:Also see Xapian by mikeboone · · Score: 1

      I've had good luck for several years using Xapian integrated with PHP. It did take some work to integrate but it's fast and flexible.

    2. Re:Also see Xapian by risk+one · · Score: 2, Informative

      While I agree that Lucene is a great choice specifically for Java shops, it does have ports for pretty much all major languages. The Java implementation is the 'mothership', but you can use Lucene with PHP, Python, .NET or C++ or whatever.

      Secondly, I'd like to point out Lemur. It's an indexing engine similar to Lucene, but geared much more toward the language modeling approach of information retrieval. All IR approaches will use either a vector space based approach or a language model approach. Lucene does vector space very well, but it's difficult to get it to do language model based retrieval (although extensions are available), Lemur can do both. Lemur also has Indri, a search engine written on top of Lemur, which can parse html, PDF and xml. And like Lucene, Lemur has multiple language ports of the API.

      A final point I would like to make is that IR is a very actively researched field. If you're going to do your own coding (specifically the retrieval model), I suggest you buy a book and get reading. Most of the basic problems (and there are many) have been figured out and it'll save you a lot of trouble if you just read up on how to update an index or find spelling suggestions, instead of figuring it out for yourself. It's possible to index your documents with Lucene and run searches on them in half an afternoon, but it takes some basic knowledge to get it right, and make the app useful. (Look at wikipedia's search for an example of what you get when you don't follow through, and stop after it seems to work ok).
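      For readers unfamiliar with the vector space approach mentioned above, the following Python sketch ranks documents by TF-IDF cosine similarity. It is a textbook-style illustration, not Lucene's or Lemur's actual scoring formula; the smoothed IDF variant and the names `tokenize` and `rank` are choices made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, docs):
    """Rank documents against a query by TF-IDF cosine similarity.

    docs maps a document name to its extracted plain text. Returns a list of
    (name, score) pairs, best first, omitting documents that share no term
    with the query.
    """
    counts = {name: Counter(tokenize(text)) for name, text in docs.items()}
    n_docs = len(docs)

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for term_counts in counts.values():
        df.update(term_counts.keys())

    def weigh(term_counts):
        # Smoothed IDF, so query terms unseen in the collection still get a weight.
        return {
            term: tf * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, tf in term_counts.items()
        }

    def norm(vec):
        return math.sqrt(sum(w * w for w in vec.values()))

    q_vec = weigh(Counter(tokenize(query)))
    q_norm = norm(q_vec)
    results = []
    for name, term_counts in counts.items():
        d_vec = weigh(term_counts)
        dot = sum(w * d_vec.get(term, 0.0) for term, w in q_vec.items())
        if dot > 0:
            results.append((name, dot / (q_norm * norm(d_vec))))
    return sorted(results, key=lambda pair: pair[1], reverse=True)
```

      Rare terms get a higher weight than common ones, and dividing by the vector lengths keeps long documents from winning simply because they contain more words.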

    3. Re:Also see Xapian by mr_da3m0n · · Score: 1

      Another vote for Xapian here. I indexed a Wikipedia dump with it and found it to be eerily accurate when searching. The API is really simple, and fairly easy to work with -- I'm a total retard when it comes to C++, but I managed to write a simple interface that did what I wanted.

      It also has bindings for Java, Python, PHP and friends.

      (Disclaimer: I like Xapian so much I elected to build and package the Python bindings for Windows, even though it is not my platform of choice.

      Cheap plug: Xapian Python 2.5 Bindings for Win32)

  25. Depends on size of document base by Nefarious+Wheel · · Score: 2, Insightful
    It depends on the size of your document base, and how you're going to store it -- if you're using something industry-strength like Documentum or Hummingbird then the Google Mini won't index it, you have to go up a notch and use the yellow box solutions. And if you're using Lotus Notes, you'll need a third party crawler such as C-Search. Google Desktop can be bent into some solutions, and it's free, but for many users you're better off having a separate server do the indexing. Google bills on the number of documents you need to keep in the index at once, and they throw in a bit of tinware to support that on a 2 year contract.

    Disclaimer: I flog Google search solutions at work, so I'm way biased.

    --
    Do not mock my vision of impractical footwear
    1. Re:Depends on size of document base by shepmaster · · Score: 1
      If you find yourself connecting to many different repositories, you should check out Vivisimo Velocity. We have some awesome connectors to the most popular repositories:
      • Offers connectors to repositories such as file servers, MS SharePoint, Documentum, Lotus Notes, Exchange, Legato etc.
      • Crawls information in databases such as SQL Server, MySQL, PostgreSQL, DB2, Oracle and Sybase.
      • Supports many file formats including Microsoft Word, Excel, PowerPoint, WordPerfect, PDF, Postscript, email archives, XML, HTML, RTF and others.
      • Rich media including images, audio and video.
  26. use a Wiki instead by poopie · · Score: 4, Insightful

    Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.

    If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.

    You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.

    The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.

    Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.

    1. Re:use a Wiki instead by LadyLucky · · Score: 2, Interesting

      We've been using Confluence, from Atlassian for our wiki, and it's pretty fully featured for a wiki.

      --
      dominionrd.blogspot.com - Restaurants on
    2. Re:use a Wiki instead by jbarr · · Score: 1

      A Wiki can be great, but the downside is that it requires the user(s) to actually enter the data into the Wiki. How would you handle disparate business documents with a Wiki? For example, how would you manage hundreds of Word, Excel, PowerPoint and .PDF documents? Documents created by multiple people over a long time in many locations? Wouldn't it be easier for this scenario to use an indexing tool? I've used Google Desktop to manage these types of documents, and it's amazingly effective. Wikis are absolutely wonderful going forward but they're cumbersome with existing, varying documents--cumbersome at least until the data contained in the documents is incorporated into the Wiki. Also, check out TiddlyWiki.com. It's a single-file "personal Wiki". It's essentially a self-updating HTML file built around JavaScript. It's truly innovative. (Of course, it's for capturing small amounts of data, not creating large, multi-user repositories.)

      --
      My mom always said, "Jim, you're 1 in a million." Given the current population, there are 7000 of me. God help us all!
    3. Re:use a Wiki instead by linatux · · Score: 0

      If lots of existing MS Office docs & PDF's need to be indexed, take a look at Perspective.
      Nice Wiki, easy install (Windows/IIS based meant it was easy to get it running in a largely Windows shop).

      It uses the indexing service in Windows to search external documents. This can (I've had mixed results) index PDF's as well as any Office format file.

      Results of searches are returned as links straight from the Wiki. Works great for me (much of our doc was already in word/visio etc). As time goes on, more stuff gets documented directly in the Wiki.

      May not be the best solution for you but suits us very well.

    4. Re:use a Wiki instead by Blinocac · · Score: 1

      Yeah, I've been trying to sell them on the Wiki route, I like TiddlyWiki myself, but it's been shot down.

  27. I'd Be Interested... by morari · · Score: 1

    In an image-based solution. My business requires customer access to literally thousands of individual images. It would be nice to be able to scan them all in and tag them appropriately (multiple tags!) so as to create an easily searchable database.

    --
    "He who can destroy a thing, controls a thing." --Paul Atreides, Dune
    1. Re:I'd Be Interested... by Bronster · · Score: 1

      Wow, you run a porn site too?

  28. True hackers by athloi · · Score: 1

    Always write their own homebrew search engines.

  29. FileNet by drhamad · · Score: 1

    How about IBM FileNet? Or are you looking for something free? We use FileNet everywhere I've been.

    The downside to the suggestions like Google Appliances is that you're then storing this information on Google servers.... something that most companies find HIGHLY objectionable (security).

    --
    -Daniel
    1. Re:FileNet by James+Youngman · · Score: 1

      No, if you buy a Google appliance, the index is stored on the appliance, not on Google's servers. That's kinda the point.

    2. Re:FileNet by rainmayun · · Score: 1

      Um, they aren't "Google servers" in the sense that Google owns and operates them... you buy it and run it in your own enterprise, the same way someone might run, say an "Oracle server".

    3. Re:FileNet by drhamad · · Score: 1

      I should amend what I said. I was referring to Google Desktop Search, not the standalone, separated Google Appliance application/hw.

      --
      -Daniel
  30. I'll plug my software by vondo · · Score: 1

    DocDB (http://docdb-v.sourceforge.net/) can interface to the search engines others are suggesting, but organizing your documents with decent meta-data in the first place (and not on a Wiki that is allowed to rot) is also important. That's what DocDB does.

  31. Extensis + SQL Connect by Anonymous Coward · · Score: 0

    I'm doing something similar at work. The Extensis web stuff they sell is a bit pathetic, so you're better off buying their SQL connect stuff and building your own web frontend.

  32. Personal GSA experience by PIPBoy3000 · · Score: 1

    We've been quite happy with our Google Search Appliance.

    The two exceptions are the way it handles secured documents (on our mostly-Windows network, that meant authenticating twice or doing complicated Kerberos stuff), and hardware (we've had two boxes fail with drive issues in the last year).

    Still, when it comes to search results and speed, it's been very good. I'm also a fan of Google Desktop, but that's a completely different story and more difficult to centrally manage.

  33. $$$ - Universal Content Management by hrieke · · Score: 1

    So far, everything I've read really doesn't sound like it's geared towards an enterprise level; mostly put a bunch of files out there in a folder somewhere and let a crawler index them. That's all good and fine until someone gets the idea to search for the payroll documents...

    At work, and granted it's a fair-sized HMO, we use Universal Content Management by Oracle, formerly known as Stellent's Content Management System.
    UCM allows for named accounts with control over access, plus a full audit log, full change log (plus revisions), and a centralized location for searching for documents of any type (Word, Excel, PowerPoint, AutoCAD, MPG movies, TIFF, etc.).

    Supports work flows as well, a nice plus if something needs to go through a formal process, gives nice audit trails, and supports 3 different full text indexes (FAST, Verity, and database) of the stored content.

    This might not be for everyone, but it is a decent tool for large companies to manage documents.

    --
    III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIIIV IIVIIIIIIVIII...
  34. Google by JimDaGeek · · Score: 1

    Seriously, spend a tiny bit of money on a Google Appliance and get excellent search. I tried to use MS stuff, like the built-in index server and it just wasn't good enough.

    We got a Google "appliance" and the damn thing just works, and works well. I don't work for Google, nor do I get paid if they make a sale. Just saying what worked great for us.

    --
    General, you are listening to a machine! Do the world a favor and don't act like one.
  35. Anything to avoid? by Anonymous Coward · · Score: 0

    How about avoiding Word docs, Excel spreadsheets and Access databases?

  36. Meta tags are worthless, generally by Anonymous Coward · · Score: 4, Insightful

    Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
    DON'T TRUST USERS TO ENTER META DATA!!!
    I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

    Just to prove that your question is missing critical data:
      - how many documents?
      - how large are the average and largest documents?
      - what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
      - what search tools do you use elsewhere?
      - any budget constraints?
      - did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
      - Did you consider OSS solutions? htdig, Swish-e, custom searching?
      - A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.

    AND if I didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS-Office files will completely destroy the usefulness of your searches.

    1. Re:Meta tags are worthless, generally by Keith_Beef · · Score: 2, Informative

      Meta tags are worthless, generally, unless you have a librarian who ensures correctness.

      DON'T TRUST USERS TO ENTER META DATA!!!

      I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.

      Unless you can pin responsibility for a document to a named person, you can't trust anything in the document. Not metadata, not content, not presentation.

      The meta tags in most of the documents I deal with are inserted by the applications, and only the content is human-drafted. Those meta tags contain information like creation date, modification date, application name, character encoding, etc. They are generally trustworthy.

      I'm also in the process of building a documentation system; it will be a set of documents in various formats, with an HTML interface, Tomcat server and Lucene to make it fully searchable.

      In a previous job, I did a similar thing with Apache and ht://dig on an old Dell I recycled. Document files could be uploaded by anybody with an FTP account on the server, and index files were automatically regenerated by a CRON task at 04h00 each day.
      I could have made a trigger to regenerate the index after each FTP upload session, but using CRON was easier and sufficiently frequent to be useful.

      This time around, the whole system of Tomcat webserver and Lucene search engine is bundled on a CD-ROM with the docs to run on any of the firm's laptops. Because I control the documents, I can build the index files and burn them to the CD-ROM before distribution.

      Beef

    2. Re:Meta tags are worthless, generally by Blinocac · · Score: 1

      Well the users would be us IT folks, and if we can't trust ourselves to put the Meta Data in the docs, then we have bigger problems.

      The number of documents is roughly 9000. Most of the documents are 1-100k, but some get a bit heavier. By input I'm assuming you mean the files to be indexed, which would be multiple formats, including but not limited to HTML, Word, RTF, Excel, PDF. As far as other search tools, we use the standard MS stuff mostly. I have no budget. We've looked at SharePoint, but it's a bit clunky for what we're doing here. The only solution I have looked at so far has been MS Desktop Search, but I'm an advocate of using OSS whenever possible.

    3. Re:Meta tags are worthless, generally by dekemoose · · Score: 1

      I work for a content management company and I've seen a lot of successes and failures with content management. Metadata is inconvenient for users, and a lot of times they just don't understand it, so they rarely get it right. The best bet for decent metadata is when you can ingest it in such a fashion that the metadata is applied automatically. Batch ingestion of content which has the same metadata is one way, depending on your content. Another way is to provide users a way to get content into the system without having to specify the metadata. We have a WebDAV interface that customers can use to drag and drop content into folders from their desktop. The content inherits predefined metadata depending on the folder. It's a lot easier to train a user that this type of document needs to go to this folder. There are other ways, but these are two simple ones.
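
      The folder-inheritance idea can be sketched in a few lines of Python. This is only an illustration, not the poster's product: the folder names, metadata keys and the `inherited_metadata` helper are all made up for the example.

```python
from pathlib import PurePosixPath

# Hypothetical folder -> default metadata table; every name here is invented
# for illustration and would come from your own folder layout.
FOLDER_DEFAULTS = {
    "policies": {"doc_type": "policy", "owner": "IT"},
    "policies/security": {"topic": "security"},
    "runbooks": {"doc_type": "runbook"},
}

def inherited_metadata(relative_path):
    """Merge the defaults of every ancestor folder, most specific folder last."""
    parts = PurePosixPath(relative_path).parent.parts
    metadata = {}
    for depth in range(1, len(parts) + 1):
        folder = "/".join(parts[:depth])
        # Deeper folders override or extend what their parents supplied.
        metadata.update(FOLDER_DEFAULTS.get(folder, {}))
    return metadata
```

      A file dropped into policies/security would pick up the doc_type and owner of policies plus the topic of the subfolder, without the user typing any of it.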

  37. Swish++, HyperEstraier by ecloud · · Score: 1

    I once integrated Swish++ as a document search system for a MediaWiki installation, to handle uploaded documents. I liked the results so then I started using it to build an index on a large codebase so I could quickly find all usages of a particular symbol (in source files, libraries and executables too). The catch is you have to define how to translate each type of file into plain text so it can be indexed. There are plenty of tools available for Word docs, PDFs, nm for libraries, etc. Compared to some others I think Swish++ has the advantage of speed. I haven't tried Lucene but my feeling is I'd rather not use Java for that unless the whole system is in Java.

    HyperEstraier has an excellent reputation but I haven't tried it yet. It's harder to get going with it.

    Too bad Beagle is written in .net; sounded like a well-integrated solution otherwise...

  38. Requirements spec by Anonymous Coward · · Score: 1, Insightful

    The requirements spec there reads like most of the projects I've worked on the last few years. *sigh*

    In light of the above I can't (IGC) recommend anything specific, but I can advise you to avoid :-

    1) In house solutions (expensive, usually buggy).
    2) Anything from Thunderstone (If they've fixed the numerous Vortex bugs over the years I might revise my opinion but my last experience was painful).
    3) MS Full text search/indexing (slow - and yeah you can throw a load of hardware at this but hardly the optimal solution).
    4) Lucene (I've seen too many sites with dead Lucene searches).

    The recommendations re Google are probably safe bets ("nobody ever got fired for buying Google") and I've had a lot of success with Swish-e for smaller (20,000 docs) projects.

  39. Google backdoor appliance by Anonymous Coward · · Score: 1, Insightful

    And it reports everything straight back to Google! Such a deal!

  40. cough cough by Evets · · Score: 1

    Microsoft Index... oh nevermind. I can't get it out with a straight face.

    Lucene is the way to go. There are APIs for Perl for dealing with Lucene data sets and for many other languages as well. Nutch is a good place to start getting to know the power of Lucene - you can get a Nutch crawler interface up and running quickly and you can browse through some of the source files to get an understanding of how to bring in various file formats - Office documents, PDFs, etc.

    The Google Search boxes are decent, but with any commercial solution you end up paying fees for the amount of documents in your index. They open source the code, presumably because of OSS components (maybe even Lucene) but the documentation they publish is laughable.

  41. Alfresco is doc based, Plone is web based by thule · · Score: 0

    Last I saw, Plone is very web-centric. It is designed to manage a web site. Alfresco is designed to handle documents. It is more similar to SharePoint than Plone.

    1. Re:Alfresco is doc based, Plone is web based by seanmeister · · Score: 1

      Actually, Plone does a pretty good job of document indexing/management, particularly when you plug in TextIndexNG3. Add Enfold Desktop on top of that, and you've got desktop and MS Office integration easy enough for any office drone to use.

    2. Re:Alfresco is doc based, Plone is web based by Flambergius · · Score: 1

      And, with the Web Content Management in Alfresco 2.0, Alfresco does a pretty good job on the web side.

      I can't comment too much on Plone as I have been out of touch with it for 5 years. Anyways, I'm proud to be an Alfresco fanboy today.

      --
      Computers are useless. They can only give you answers - Pablo Picasso
  42. I do this in several programming languages by MarkWatson · · Score: 2, Informative

    There are 2 problems: getting plain text out of documents, then indexing the plain text

    A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.

    The Apache POI project (Java) can read and write several Microsoft Office formats.

    For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).

    I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby:

    http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html

    Here is another short snippet for reading OpenOffice.org documents in Ruby:

    http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html

    ---

    You might just want to use the entire Nutch stack:

    http://lucene.apache.org/nutch/

    It collects documents, spiders the web, has plugins for many document types, etc. Good stuff!
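
    The two-step split described above (get plain text out, then index the plain text) can be shown with a toy Python sketch. It is not Lucene, Ferret or Montezuma: the HTML stripper and the in-memory inverted index below are stand-ins built on the standard library, and the function names are made up for illustration.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

class _TextOnly(HTMLParser):
    """Collects text nodes; script and style contents are not filtered out."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_to_text(markup):
    """Step 1: turn one document format (here HTML) into plain text."""
    parser = _TextOnly()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)

def build_index(texts):
    """Step 2: map each lowercased word to the set of documents containing it."""
    index = defaultdict(set)
    for name, text in texts.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(name)
    return index

def search(index, query):
    """Return the documents that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))
```

    Each extra format only needs its own step-1 converter (antiword, POI and so on); the indexing step never has to know where the text came from.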

    1. Re:I do this in several programming languages by doas777 · · Score: 1

      Yep, I'm following a similar algorithm in .NET/T-SQL.

      1) upload the document server side and store a bin representation of the file in SQL server for retrieval later.

      2) extract the text of the document into a SQL full-text indexed field. Our requirements only specify PDFs, but libraries exist to extract text from most common file types. For PDF I used PDFBox, a Java library ported to .NET through the IKVM project. It's pretty kool.

      3) code a stored procedure (or several if you like) to query the metadata and the full-text index. It's up to you if you want to combine the queries into one select or do them separately and then cross-join or do a left-outer or whatever you want. cross for inclusive personality, left or right outer to favor either the FTI or the metadata depending.

      that way, with a data-layer call or two, you can search via FTI or metadata or both.
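
      The "search via FTI or metadata or both" idea can be sketched like this. It is only a stand-in under stated assumptions: it uses Python's sqlite3 module instead of SQL Server, a plain LIKE match where the real system would use the full-text index, and hypothetical table and column names (docs, meta).

```python
import sqlite3

def make_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE docs (id INTEGER PRIMARY KEY, name TEXT, body TEXT);
        CREATE TABLE meta (doc_id INTEGER, key TEXT, value TEXT);
    """)
    return db

def search(db, text=None, meta=None):
    """Return names of docs matching the text, a (key, value) pair, or both."""
    sql = ("SELECT DISTINCT d.name FROM docs d "
           "LEFT JOIN meta m ON m.doc_id = d.id WHERE 1=1")
    params = []
    if text is not None:
        # LIKE stands in for a real full-text index lookup.
        sql += " AND d.body LIKE ?"
        params.append(f"%{text}%")
    if meta is not None:
        sql += " AND m.key = ? AND m.value = ?"
        params.extend(meta)
    sql += " ORDER BY d.name"
    return [row[0] for row in db.execute(sql, params)]
```

      One function with optional filters keeps the data-layer call count low; splitting it into separate queries and joining the results afterwards would work just as well.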

      also I've heard good things about Lucene, but I haven't tried it myself.

      good luck

  43. Many options, here are a couple by Anonymous Coward · · Score: 0

    General purpose indexing is not very fine tuned; when it comes to enterprise documentation and the indexing around it, you get what you put into it. So if you don't have standards for meta-data then it's not going to meet the needs of your management. For example if they type in "insert name of internal project" then they feel they should get all the stuff surrounding that project regardless if the name of that project is nowhere to be found in that document.

    Give EMC a call and talk to them about Documentum and all the things around it. Or organize your documents around projects/collaborations and use SharePoint and all the tools around it. Those are a couple of things to get you started.

  44. If money's not an object... by djpretzel · · Score: 2, Informative

    As a Documentum developer, especially in light of the recent 6.0 release, I'd be remiss not to recommend it for such a purpose. It's expensive, rather complex, and requires solid development talent to implement, but is almost infinitely configurable and customizable, and there are separate components (at cost, of course) that can add on all sorts of fun functionality like collaboration, digital asset management, etc. It has the ability to auto-tag documents based on configurable rules using Content Intelligence Services and supports extensible object hierarchies, workflows, lifecycles, taxonomies, web services, you name it. It's probably overkill for the user in question, and it's far from open source (although EMC is doing an admirable job at encouraging code exchange, and the new dev. environment is based on Eclipse), but it's pretty darn slick when you look at the ground it covers, functionally.

  45. WSS 3.0 by madagajs · · Score: 1

    I'd recommend WSS 3.0. It can search any document that you can find/write an IFilter for with many built in, out of the box. It also provides an area for people to discuss a document without having to place comments in emails or the document itself (which could inadvertently make their way to a client).

  46. Re;Easy off the shelf by homey+of+my+owney · · Score: 2, Informative

    If you're looking for an index, a document management system probably makes sense. This one is inexpensive and very good.

    1. Re:Re;Easy off the shelf by Anonymous Coward · · Score: 0

      I'll have to disagree.

      First of all, it's not that cheap; the kit for a decent set of functionality for a medium size organisation is going to be around 50K.

      It has quite a few bugs too, especially in the Windows and web clients; it can corrupt the DB at times; is a bit slow (Java) and you need a new license key from Xerox each time you reinstall it.

      PS: I used to work for a Xerox representative, and everybody I know that used DocuShare hated it.

    2. Re:Re;Easy off the shelf by homey+of+my+owney · · Score: 1

      Hmmm, I guess that's everybody... but me and all the people I've worked with.

  47. Install Wumpus Search by gvc · · Score: 1

    It is free, libre. Wumpus Search.

    1. Re:Install Wumpus Search by gvc · · Score: 2, Informative

      Sorry, mangled the URL in the parent: Wumpus-Search.org

    2. Re:Install Wumpus Search by Hatta · · Score: 1

      Wow, this option looks especially nice considering they use fschange to obviate the need for constant reindexing of the drive. fschange tells wumpus when a file changes and what particular portion of that file, so it can reindex the part it needs to. The constant reindexing and related performance problems are what stopped me from using Beagle, etc. Of course fschange requires a kernel patch, but big deal.

      --
      Give me Classic Slashdot or give me death!
  48. Full text search? by CoffeeIsMyGod · · Score: 1

    You may want to consider something besides full text searching (Google and company) as this usually starts to degrade fairly quickly with the size of the documents. So far I have not seen anything that actually comes close to human-indexed documents. There are several tools that help users tie documents into a pre-built taxonomy or thesaurus so you get consistency, accuracy, and a *well designed* solution, not random machine learning grouped results. They usually cost money, though, so be prepared. I think that Lucene has a taxonomy module that is in beta mode so that really helps. Automated categorization is still quite terrible so you will need to sit a few users down and have them tag the data. You will be much happier with the result, honestly.

  49. IBM OmniFind by bbdd · · Score: 1

    well, this may not apply to you, as you do not mention the size/number of items to index.

    but, for small shops where there is no money to throw at this type of thing, try IBM OmniFind Yahoo! Edition. can't beat the price.

    http://omnifind.ibm.yahoo.net/index.php

  50. Lucene by AtomicDevice · · Score: 1

    I had an internship over the summer at a large scholarly journal archiving company that used Lucene. I found it to be very easy to learn and powerful to use and customize. I was easily able to manage the tags and whatnot for documents. I also didn't really notice any issues with scalability; we indexed millions of documents and were able to search them just fine. It also has some nice basic options to get you on the way with semantic indexing if that is your bag (there are some better tools for that, but a Lucene index is a good place to start for those).

    --
    Ze Atomic Device! It iz Ztolen!
  51. You may be asking the WRONG question... by ivi · · Score: 1

    Haven't you listened to the TED talk (cf: http://ted.com/)
    on Spaghetti Sauces, which refers to Moskowitz's idea that:

      There isn't a (one) best , only best

    eg, best spaghetti sauceS, etc.

    Some find one best, others find another best [for them]...

    Whatcha think?

  52. Access Control and Searching by queenb**ch · · Score: 1

    The project that I worked on was also concerned with who was able to access the data. For that reason, we used a wiki-like format, converted everything into text, using a variety of conversion methods, and assigned access controls to it via an in-house web-based application.

    This allowed for the full text to be searchable and provided a reference back to the original file. If it was in a digital format, like a Word document, it was also stored in the database. If it wasn't, the entry referenced a physical file. The user could suggest modifications to the searchable entry if an error was found. The archive team would investigate the suggested correction and usually implement it.

    2 cents,

    QueenB.

    --
    HDGary secures my bank :/
  53. Hack Something Together? by LionKimbro · · Score: 1

    I once spent 4 hours hacking together a symbol indexer for the 10,000+ CPP files in our source code repository. I wrote it in Python. It worked by brute force: "For every directory, for every .h or .cpp or .c file, crack it open, and, line by line, look for all instances of this regex..."

    It's a little slow-- 10 seconds to look up all instances of a symbol. And it takes ~3 hours to refresh the full index.

    But it saves an enormous amount of time, makes impossible tasks possible, and I have used it every day since. It's been about 8 months now, and it's been absolutely wonderful.

    It would have taken far longer and many more resources to begin to figure out how to hook in Lucene, or some other heavy duty package.
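
    For anyone curious what such a brute-force scan might look like, here is a minimal Python sketch. It is not the poster's actual script: the whole-word regex, the `find_symbol` name and the extension list are assumptions chosen to match the description above.

```python
import os
import re

# Extensions taken from the description above.
SOURCE_EXTENSIONS = (".h", ".c", ".cpp")

def find_symbol(root, symbol):
    """Yield (path, line_number, line) for each whole-word hit of symbol."""
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.endswith(SOURCE_EXTENSIONS):
                continue
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps the scan going on oddly encoded files.
            with open(path, encoding="utf-8", errors="replace") as handle:
                for number, line in enumerate(handle, start=1):
                    if pattern.search(line):
                        yield path, number, line.rstrip("\n")
```

    Writing the hits for every symbol to a file once, and looking symbols up in that file afterwards, is what turns the scan into an index with a slow refresh and a fast lookup.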

  54. An answer and a question - metadata tagging? by JonToycrafter · · Score: 1

    To answer the original question, I'm currently using Microsoft's Indexing Server. I bought SearchSimon for $25 or so - basically a disappointment, but still a worthwhile purchase for me, since I found it useful to look at the included source code. I found an article which I can't find right now about rolling your own page based on the Index Server engine.

    But this raises my question - how do you enforce metadata tagging? I can't even find a decent Windows-based metadata browser/editor for after-the-fact bulk tagging. I'm aware that certain commercial projects hook the save function on MS Office and replace your save dialog box with one that has required metadata fields. Is there a way to get that functionality without paying thousands and thousands of dollars? Or, if one is going to pay that kind of money, what project should one go with?

    PS - Thanks to everyone mentioning Nutch/Xapian, I will definitely check them out tomorrow.

  55. Grep and strings by jshriverWVU · · Score: 1
    Personally I find you can search all text data with grep, and if you need to search binary data just pass it through strings and grep that :)

    1/2 joking 1/2 serious.

    grep -R "foobar" /

  56. Commercial tool also available by lotus87 · · Score: 1

    Inxight (now apparently part of Business Objects) has a very good knowledge search, data mining, and concept analysis system in their Smart Discovery servers. I don't work for them, but helped evaluate and deploy the product in a previous job. Definitely had some useful features beyond just indexed search. http://www.inxight.com/products/

  57. htDig by mattr · · Score: 1

    Someone mentioned htDig. I would just like to mention that I had much success with it. It's a C++ based crawler and search engine with customizable templates. I built a mod_perl wrapper to search 60 databases, a total of 1GB and got response times of about 0.1 seconds per query, including fuzzy searching. Actually it has so many things to tweak it is crazy. However this was a while ago and you may want to check the others mentioned here.

  58. Funnelback - it solved my problems by Anonymous Coward · · Score: 0

    just my 2c, our organization uses Funnelback (www.funnelback.com) to do exactly this and it works a treat. it really changed the way we work. You can ask them for a quote or trial it yourself.

  59. Forget meta, it's all about the content by aussiedood · · Score: 1

    Don't search meta tags, search the content. Adding meta tags to documents is time-consuming and unreliable, and the tags are difficult to maintain. If you index the content of the files to be searched you don't run the risk of missing a vital nugget of information buried deep in a document. I have used ISYS for years to do just this. It's easy to set up and maintain, fast and accurate and indexes files that Google Desktop won't touch.

  60. I agree by StarkRG · · Score: 0, Troll

    We use Google apps at the place I work and it's great. Gmail, search, maps, etc.

    Of course, I work AT Google... :P

    1. Re:I agree by Anonymous Coward · · Score: 0

      What an irrelevant post; it just shows how arrogant a help desk contractor at Google can be.

  61. IBM OmniFind Yahoo Edition by Anonymous Coward · · Score: 0

    Look at IBM OmniFind Yahoo Edition. It's free, based upon the Lucene engine, supports 200 document types and websites, easily customized through a GUI, supports up to 500,000 documents, etc. This product's engine will be used in IBM's new version of OmniFind Enterprise Edition to be released sometime next year.

  62. Why? by Anonymous Coward · · Score: 0

    Users should not be allowed to search for files in the first place.

  63. Google search by PeteyG · · Score: 1

    Consider a Google search box

    Or, you know, you could add meta data to each and every single page you want to index... I'd personally rather stab my eyes out with a ballpoint pen.

    --
    no thanks
  64. Another suggestion and things to look out for by PassMark · · Score: 1

    People have made a lot of good suggestions.

    My suggestion is the Zoom Search Engine.
    But I am way biased, as I wrote half the code.

    Some other things to consider.
    1) Some of the solutions are Linux or Windows only. And some of the Linux solutions can't index Office documents. (Linux modules to extract text from all Office documents are not always available)

    2) Don't forget about the new Office 2007 document formats (the compressed XML formats). They are really different from the Office 2003 formats.

    3) You stated that you wanted to index Access databases. In this case you will probably need to expose the content of the database via web pages, to allow the spider to crawl them. For example,
    http://www.yourwebsite.com/AccessDBRecord.php?id=1
    http://www.yourwebsite.com/AccessDBRecord.php?name=Project1
    http://www.yourwebsite.com/AccessDBRecord.php?name=Project2
    etc..

    4) You might need to manually edit the meta data on some documents. If the document is read-only and can't be regenerated, then you might need a method, like Zoom's .desc files, to add meta data to read-only Office files.

    5) Get a native code solution. The search time benchmarks we did show that compiled C++ code will outperform PHP and other scripting languages by 10 times or more.
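
    On point 2, the Office 2007 Word format (.docx) is a ZIP archive whose main body lives in word/document.xml, so a rough text extractor needs nothing beyond the Python standard library. The sketch below is an illustration, not how Zoom does it; `docx_text` is a made-up name, and it ignores headers, footers, footnotes and anything else stored in other parts of the archive.

```python
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def docx_text(source):
    """Return body text, one line per paragraph, from a .docx path or file object."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across <w:t> runs; join them back up.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

    The older binary .doc files need a different extractor entirely, which is why an indexer has to treat the two generations of Office formats as separate cases.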

  65. There are lots of choices... by LauraW · · Score: 1

    Other people have said most of this already, but the ones I've seen used the most are:

    • Swish-E: easy to set up, easy to script from Perl or whatever, but not very good results. I used this on a web site I ran about 7 or 8 years ago, and it worked pretty well, especially considering the state of the art at the time. I can't remember the licensing terms.
    • Lucene: Parses lots of document formats, easy to program in Java, works pretty well. Apache license
    • Google Mini: Easy to set up, good indexing, limited repository size. Closed source.
    • Google Search Appliance: Expensive, fast, big repository size. I don't know anything about administering it. Closed source.

    Disclaimer: I work for Google but not on search. I definitely think you should use only as big a hammer as you need for the job, or maybe a little bigger to allow for growth. I've even seen Lucene used on small, internal, Java projects at Google where our full-blown web search infrastructure would have been the equivalent of a thermonuclear flyswatter.

  66. Alfresco by golemwashere · · Score: 1

    Check out Alfresco (http://alfresco.org/).
    It's an OpenOffice / Lucene / Tomcat based content management system. It has a powerful SMB/CIFS interface and indexes all office docs out of the box.

  67. openkast by DangerousDriver · · Score: 1

    I've seen their product in action - it's fast and will index almost anything: http://www.openkast.com/

  68. So, what are the requirements? by Tim+C · · Score: 1

    You have some documents you want to index. How many? How many users? What advanced features do you need (if any)? What's your budget? What technologies and languages are you comfortable with? What OS does it need to run on?

    Where I work, we've used htdig, Verity K2 and Google search appliance, and have looked at (and heard good things about) Lucene.

    Which one I'd recommend would depend entirely on the answers to my questions.

  69. Need some advice by omega4711 · · Score: 1

    Hello, I have a similar issue as the article writer. I work in 3D visualisation/animation and we have multiple texture archives which are sorted by applying a filename structure and putting them into specific folders, e.g. x:\nature\plants\tree_d_1024r_n.jpg (d = diffuse, 1024 = resolution of the longer side, r = rectangular, n = non-tileable). However, I think that a tag-based system would be much better to sort all our stuff (I spend roughly 20% of my worktime searching for specific files). Our files are currently hosted via Samba on Ubuntu Linux. My question: Is there an EASY, fool-proof way of using tags to sort our archives? Thanks in advance
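
    One low-effort starting point is to derive tags from the naming scheme that already exists, so nothing has to be tagged by hand up front. The Python sketch below is an assumption-laden illustration: it only knows the codes spelled out in the post (d, r, n), passes any other code through unchanged, and the `tags_for` name and the exact pattern are made up.

```python
import re
from pathlib import PureWindowsPath

# Only the codes explained in the post are expanded; other codes stay as-is.
MAP_TYPES = {"d": "diffuse"}
SHAPES = {"r": "rectangular"}
TILING = {"n": "non-tileable"}

# Assumed layout: <subject>_<map type>_<size><shape>_<tiling>
NAME_PATTERN = re.compile(
    r"^(?P<subject>.+)_(?P<map>[a-z]+)_(?P<size>\d+)(?P<shape>[a-z])_(?P<tiling>[a-z])$"
)

def tags_for(path):
    """Turn a texture path that follows the naming scheme into a set of tags."""
    p = PureWindowsPath(path)
    # Every folder name becomes a tag; the drive root is skipped.
    tags = {part.lower() for part in p.parent.parts if part != p.anchor}
    match = NAME_PATTERN.match(p.stem.lower())
    if match:
        tags.add(match["subject"])
        tags.add(MAP_TYPES.get(match["map"], match["map"]))
        tags.add(match["size"] + "px")
        tags.add(SHAPES.get(match["shape"], match["shape"]))
        tags.add(TILING.get(match["tiling"], match["tiling"]))
    return tags
```

    The resulting tag sets could be loaded into any of the indexers mentioned in this thread, and extra hand-made tags added on top later.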

  70. Re:Only one real choice by pasamio · · Score: 1

    Docs...you've got to be kidding me right? This bloated ugly piece of trash rarely works properly, just look at the mess they made in their latest release. I'd pitch at it being worse than Windows Vista because Microsoft is at least in a position to improve on Vista, DOCS went back and rewrote their last version to get an upgrade because the one they had been working on didn't work properly and it had customers left right and centre complaining about it. We've got a DOCS deployment and Helpdesk is still fighting to get it working properly in their SOE, at least these days it properly supports Office 2000 (we had a hack at one point that prevented people clicking the cross on Excel because it broke the integration system). Last time I checked it had issues with Office 2003 let alone anything beyond that. If you're in the "real world" and on Windows, stay the hell away from DOCS.

    --
    I always wondered where this setting was...
  71. Have a look at Concept Searching by Anonymous Coward · · Score: 0

    Here is the blurb on the Concept Searching homepage:

    "Most meaning is expressed in short patterns of words and conceptSearch is the only search product to automatically recognize multi-word concepts and use these as the basis for searching. Single words in isolation are highly ambiguous resulting in Low Precision. Whilst phrase searching can be used to improve Precision it does so at the expense of Recall since any document that does not match the exact phrase will be ignored. conceptSearch delivers High Precision and High Recall, with better ranking of results, compared to all other search engines that utilise an index of single works."

    If you are running a taxonomy to organise your information then it will also do automatic document classification.

    There is also a version available for MS SharePoint.

  72. dunno, but probably not the way I did it... by tomandlu · · Score: 1

    ...when I had to write a quick-n-dirty wiki attachment search tool..

    # perform search, checking if fname_only, cs, etc.,
    # stripping non-printable ascii
    # ($searchterm, $search_ext, $cs, $fname_only, @twikipaths and
    # @matched_files are globals set elsewhere in the script)
    use File::Find;

    sub DoSearch{
      find({wanted=>\&wanted,
        untaint=>1,untaint_pattern=>'^([\040-\176]*)$',untaint_skip=>1},
        @twikipaths);
    }

    sub wanted{
      # skip RCS history files (,v) and files without the wanted extension
      return if /,v$/i or $_ !~ /.+\.$search_ext$/i;
      # compile the term once; case-insensitive unless $cs is set
      my $re = $cs ? qr/$searchterm/ : qr/$searchterm/i;
      if($_ =~ $re){
        # the file name itself matches
        push @matched_files, $File::Find::name;
      }
      elsif(!$fname_only){
        # otherwise scan the contents line by line
        open(my $doc, '<', $File::Find::name)||
          die "Couldn't open $File::Find::name:$!\n";
        while(my $line = <$doc>){
          $line =~ s/[^\011\012\015\040-\176]//g;
          if($line =~ $re){
            push @matched_files, $File::Find::name;
            last;
          }
        }
        close $doc;
      }
    }

  73. Regain works well by Anonymous Coward · · Score: 0

    Try Regain:

        http://regain.sourceforge.net/

  74. Aduna Autofocus/Metadata Server by xanton159 · · Score: 1

    http://www.aduna-software.com/products/autofocus/overview.view
    http://www.aduna-software.com/products/autofocus_server/overview.view

    Fast, free (as in freedom, and as in beer), efficient document and metadata searching for a single desktop or large enterprise. I use it for searching thousands of HTML pages in local website mirrors.

    http://www.aduna-software.com/images/screenshots/autofocus_server/autofocus_server3.png
    http://www.aduna-software.com/images/screenshots/autofocus/query-answer-3.png

  75. Managing Gigabytes by chris_sawtell · · Score: 1

    http://www.cs.mu.oz.au/mg/

    To get more info including a peep into the book do a Google search on "Managing Gigabytes"

    otoh for something cheap and cheerful there is htdig.

    http://htdig.org/

    It's remarkably good for indexing an intranet.

  76. Mnogosearch - an efficient solution by Anonymous Coward · · Score: 0

    I'm a long-time user of mnogosearch and have tested other solutions such as Omega (part of Xapian) and Lucene. If you want a ready-to-use solution that supports a large set of document formats via filters, multiple languages, and so on, mnogosearch (http://www.mnogosearch.org/) is a really good choice. It works like a web scraper but can also work on a mounted filesystem, and it supports caching. It's worth a test if you want a good search engine.

  77. How about Oracle text by iq1 · · Score: 2, Interesting

    I know this will get me flamed, but:
    you might try Oracle Text (also part of Oracle XE).

    It supports 140 document formats, has a lot of options and works via SQL.
    It can build indexes for documents stored in the DB or in the file system.
    You can even join the search terms from the document with the database records where metadata might be stored by your application.
    I found that very helpful in similar projects. And it's free.

  78. If you have any spare time left over...... by heffrey · · Score: 1

    ......why don't you try re-inventing the wheel?!

  79. Search software by j.leidner · · Score: 2, Informative
    Lucene - LINK

    Terrier - LINK

    Indri/Lemur - LINK / LINK

    MG - LINK

  80. Lucene Subprojects by esme · · Score: 1

    I see a lot of people have already recommended Lucene, and I heartily agree.

    But I suggest you look at the various Lucene sub-projects to see if one of them meets your needs. For example, Nutch includes a crawler and parsers for Word/PowerPoint/PDF/HTML/etc., so you wouldn't have to write that part yourself. Solr is a webapp that wraps a Lucene index in a simple web service and comes preconfigured to run inside its own servlet container on a separate port, so that's pretty easy to set up and use.

    -Esme

  81. None of the current systems work by CarpetShark · · Score: 1

    I've tried quite a few of the current systems, and looked at a number of the available APIs so far, in the hopes of creating something that'll do what I want.

    Basically, I think all document indexers currently suck. Must-have features:

    * Indexing documents on a per-sentence, per-paragraph, per-page, per-chapter, per-section (etc.) basis: I should be able to search for books that have the words "people" and "crimewave", or just for sentences that contain those words. There's no point indexing a thousand cross-referenced and cited PDFs about psychology for terms like "neurons and fear". When I search a document collection for neurons and fear, I want it to show me paragraphs or sections that discuss those two topics together, in relation to each other, in depth. I guess this is similar to proximity searching, BUT...

    * It MUST be able to bring up the right section. If the search engine just throws up a list that says, "yep, book121314 --- "Everything in the human body, in detail" (which is 98435 pages long) has both those words in it", then it's no better than grep. Not a single PDF viewer I've looked at on unix has the ability to open a PDF at a particular page, much less a certain anchor on a page, with given words highlighted.

    Not so crucial, but important:

    * Tagging. It should allow me to tag documents, pages, etc.

    * Cross-referencing and comparison. Side-by-side scrolling of documents in different languages, or just different translations and commentaries on documents, a bit like what sword's UIs do, but more generally.
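
    The per-paragraph indexing asked for above can be sketched roughly as follows. This is only a minimal illustration, not any existing product's API: the `docs` dictionary, the function names, and the assumption that paragraphs are separated by blank lines are all hypothetical. It records which paragraph each word occurs in, so a query returns paragraphs that contain all the terms rather than whole books.

```python
import re
from collections import defaultdict


def build_paragraph_index(docs):
    """Map each lowercase word to the set of (doc_id, paragraph_no) it occurs in.

    `docs` is a hypothetical dict of doc_id -> plain text. Paragraphs are
    assumed to be separated by blank lines and are numbered from 0.
    """
    index = defaultdict(set)
    for doc_id, text in docs.items():
        paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
        for para_no, para in enumerate(paragraphs):
            for word in re.findall(r"[a-z0-9]+", para.lower()):
                index[word].add((doc_id, para_no))
    return index


def search_paragraphs(index, *words):
    """Return sorted (doc_id, paragraph_no) pairs that contain every word."""
    hits = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []


docs = {
    "book1": "Neurons fire in bursts.\n\nFear is an emotion.",
    "book2": "Intro text.\n\nFear responses involve neurons in the amygdala.",
}
index = build_paragraph_index(docs)
# book1 mentions both words, but never in the same paragraph.
print(search_paragraphs(index, "neurons", "fear"))  # [('book2', 1)]
```

    Because each hit carries a paragraph number, a front end could jump to the right section instead of merely naming the document; page or section numbers could be stored the same way if the text extractor provides them.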

    1. Re:None of the current systems work by pasamio · · Score: 1

      evince appears to have the ability to return to a specific page on a pdf document you've opened previously. So if it has the ability to remember there must be a way to get it to jump to the right page, though it may not be obvious.

      --
      I always wondered where this setting was...
    2. Re:None of the current systems work by CarpetShark · · Score: 1

      KPDF might be coerced to do it too -- it has a DCOP call that lets you change the page. You can also load it up as a component in your app, and talk to that particular instance through DCOP, which is the best solution I've found so far. Still, it's a lot of work just to find a word on a page :/

  82. kinosearch, swish-e, zebra, ht://Dig, etc. by ericleasemorgan · · Score: 1

    There are many ways to skin this cat. I believe most of them have been mentioned, but I will outline my experiences anyway.

    swish-e is a grand-daddy of an indexer. It can act as a robot, crawl your local file system, or get its input from STDIN. If indexing HTML, swish-e will index the document's metatags and provide field searching against them. Swish-e comes with a C, Perl, and PHP API. I don't think swish-e supports anything but ASCII very well.

    kinosearch is my new favorite. Written in C but with a Perl API, this indexer works a lot like Lucene. Its resulting indexes (files) may be readable by Lucene. Kinosearch works by initializing a "document" with attributes, filling each attribute with values, and saving the document. Searching is fast and easy. It does not support wildcard searching, but uses extensive stemming instead. Kinosearch does not index files from your file system; you must parse your data and feed it to Kinosearch.

    ht://Dig is nice, but the last time I looked, it had no API. I found this to be too limiting. It indexes documents.

    The Google Appliance is cool (and kewl) but also very expensive. This black box (well, it is really gold or blue) does a lot of the work for you. Configuring its output is dependent on your ability to do XSLT. You can feed the Google Appliance database dumps and other streams of data. Nice. I still think the price is steep.

    There's Plucene, a Perl port of Lucene. Too slow, and seemingly unsupported.

    Lucene and its kin seem to be the Gold Standard these days. I appreciate that, but alas, I don't have any Java experience. Increasingly people swear by SOLR, a Web Services-based interface to Lucene.

    Zebra is an unsung hero. It has been around for more than ten years, actively supported and used extensively in Library Land. (I'm a librarian.) This thing can index just about any kind of document. It supports every type of searching feature (stemming, wild card, fielded, Boolean logic, relevance ranked, etc.). It can read files or be fed things from STDIN. Fast!

    As an added bonus, I advocate readers explore abstracting their search interfaces with something like OpenSearch or Search/Retrieve via URL (SRU). These abstract layers allow you to create user interfaces to your underlying indexers without worrying what those indexers are. In other words, these abstract layers define the syntax for queries, the transport mechanism to the index, and the structure of the returned result. Given such a framework, you can write an OpenSearch or SRU interface to your index, but if you decide that Lucene is not what you want to use anymore but Kinosearch is, then you can change your indexer without the need to change your user interface. Very nice. OpenSearch is simpler to implement but is weak when it comes to expressive searches and search results. SRU is more robust but also more complicated.
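
    The idea of putting an abstract layer between the user interface and the indexer can be sketched as below. This is not the OpenSearch or SRU wire format; it is only a minimal illustration of the design, and the `SearchBackend` contract, both toy backends, and `render_results` are hypothetical names invented for the example.

```python
import re
from typing import Protocol


class SearchBackend(Protocol):
    """What the user interface expects from any indexer (hypothetical contract)."""

    def search(self, query: str) -> list[dict]:
        ...


class SubstringBackend:
    """Toy indexer: case-insensitive substring match on titles."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.docs = docs

    def search(self, query: str) -> list[dict]:
        q = query.lower()
        return [{"id": doc_id, "title": title}
                for doc_id, title in self.docs.items() if q in title.lower()]


class WordBackend:
    """Toy indexer: matches whole words only, same result structure."""

    def __init__(self, docs: dict[str, str]) -> None:
        self.docs = docs

    def search(self, query: str) -> list[dict]:
        q = query.lower()
        return [{"id": doc_id, "title": title}
                for doc_id, title in self.docs.items()
                if q in re.findall(r"\w+", title.lower())]


def render_results(backend: SearchBackend, query: str) -> list[str]:
    """User-interface code: depends only on the contract, not on the indexer."""
    return [f"{hit['id']}: {hit['title']}" for hit in backend.search(query)]


docs = {"d1": "Lucene in Action", "d2": "Kinosearch notes"}
print(render_results(SubstringBackend(docs), "luc"))
print(render_results(WordBackend(docs), "lucene"))
```

    Swapping one backend for the other changes matching behaviour but leaves `render_results` untouched, which is the benefit described above for changing indexers behind a fixed query and result format.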

  83. Look at how you will access the docs first by rclandrum · · Score: 3, Informative

    As someone who has made a 30-year career out of designing and building document management systems, I would urge you to look first at how you expect your users to find the documents they need. The expected results of a search should guide your choice of indexing methods - and the popular "meta tagging" method isn't always the best. There are shortcomings with all methods.

    Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.

    Using meta-tags gives the appearance of pre-classifying documents, and having the users do it themselves means you don't have to have a dedicated person to assign the tags. The disadvantage is that everybody makes up their own tags, or if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed to miss some documents using this method.
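
    One common way to soften the tag-drift problem is a controlled vocabulary that maps variant tags onto a canonical one at both indexing and query time. The sketch below is a minimal illustration using the "removable media" example; the `CANONICAL` mapping, the function names, and the file names are hypothetical, and someone still has to maintain the mapping.

```python
# Hypothetical controlled vocabulary: variant tag -> canonical tag.
CANONICAL = {
    "floppy": "removable media",
    "cd": "removable media",
    "dvd": "removable media",
}


def normalize_tag(tag):
    """Lowercase and trim a tag, then map it to its canonical form if known."""
    t = tag.strip().lower()
    return CANONICAL.get(t, t)


def find_by_tag(tagged_docs, tag):
    """Return the documents whose normalized tags include the normalized query."""
    wanted = normalize_tag(tag)
    return sorted(doc for doc, tags in tagged_docs.items()
                  if wanted in {normalize_tag(t) for t in tags})


tagged_docs = {
    "backup-howto-old.doc": ["Floppy"],
    "backup-howto-new.doc": ["DVD", "backup"],
    "network-plan.doc": ["LAN"],
}
print(find_by_tag(tagged_docs, "removable media"))
```

    With the mapping in place, a search for "CD" also finds the document tagged "Floppy", at the cost of keeping the vocabulary up to date.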

    Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every document is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.
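
    A minimal sketch of database indexing, using SQLite from Python's standard library purely for illustration. The table layout, column names, sample rows, and the `find_documents` helper are assumptions made for the example, not a recommended schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE documents (
        id      INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        author  TEXT NOT NULL,
        created TEXT NOT NULL,   -- ISO date, e.g. 2007-01-15
        path    TEXT NOT NULL
    )""")
conn.execute("CREATE INDEX idx_documents_author ON documents(author)")
conn.executemany(
    "INSERT INTO documents (title, author, created, path) VALUES (?, ?, ?, ?)",
    [
        ("VPN Setup Guide", "J. Jones", "2006-03-01", "it/vpn.doc"),
        ("Backup Policy", "F. Smith", "2007-01-15", "it/backup.pdf"),
        ("Password Policy", "J. Jones", "2007-06-30", "it/passwords.doc"),
    ],
)


def find_documents(conn, author=None, year=None):
    """Return (title, path) rows matching the given fields, ordered by title."""
    sql = "SELECT title, path FROM documents WHERE 1=1"
    params = []
    if author is not None:
        sql += " AND author = ?"
        params.append(author)
    if year is not None:
        sql += " AND substr(created, 1, 4) = ?"
        params.append(str(year))
    sql += " ORDER BY title"
    return conn.execute(sql, params).fetchall()


print(find_documents(conn, author="J. Jones", year=2007))
```

    Every document is described by the same fields, so the query is precise, but each row has to be entered by someone, which is the input cost mentioned above.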

    Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.

    So - understand your users' search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.

  84. And if you want some reading. by StarfishOne · · Score: 1

    This is a rather nice book:

    Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
    http://people.ischool.berkeley.edu/~hearst/irbook/

    Amazon link for the reviews (no, no referer tricks, don't worry)
    http://www.amazon.com/Modern-Information-Retrieval-Ricardo-Baeza-Yates/dp/020139829X

  85. similar situation by kuruptacus · · Score: 0

    I had the same issue with a smallish agency and after defining criteria and scoring qualified courses of action, I determined that COTS/appliances were the most cost-effective and management (in the stuffed shirt sense of the word) 'friendly' solutions. The Google Search Appliance (I would avoid the mini because you might be shocked by the number of documents you really have), and Vivisimo (vivisimo.com) were my top choices. For extensibility and access control via LDAP my recommendation was Vivisimo over Google, but management chose Google and it wasn't a bad choice. It's tough to match that 'bang for the buck' even if you are a good developer or have one on your staff.

    --
    Shop as usual. Avoid panic buying.
  86. KnowledgeTree by bondjamesbond · · Score: 0

    ...has a free version that works really well. You'll probably need your own box for it but I've done it with CentOS on an old Dell box and it worked beautifully. Install is a snap and uploading docs is done from a zip. The PDF indexing is a little light, but everything else indexes just fine.

  87. meta search by Anonymous Coward · · Score: 0

    try tag2find

  88. Recommend the FAST product by Evil+W1zard · · Score: 1

    I think you should take a look at the search capabilities provided by something called FAST ESP. They are based out of Norway but used all over the US govt and tons of commercial entities (like LexisNexis). The website for them is www.fast-search.com and from people I talk to it is supposedly pretty robust and can do intelligent searches, data tagging, authorization against data and stores, geo-tagging, yada yada yada and etc...

    W1z

    --
    News Reporters Make Tasty Polar Bear Treats!
  89. SCAN by Johann_NL · · Score: 1

    You can try this: http://scan.sf.net/

  90. Ah, if anybody could follow Dublin Core... by brouits · · Score: 1

    http://dublincore.org/ is making an effort on document metadata, improving indexing through document headers. To me this is a straight line to follow.
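
    For HTML documents, Dublin Core elements are conventionally written as `<meta name="DC.title" content="...">` tags in the page header. The sketch below shows one way an indexer might collect them using only Python's standard library; the `DublinCoreParser` class, the `extract_dublin_core` helper, and the sample page are hypothetical, and other formats (Word, PDF) would need their own extractors.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        content = attributes.get("content")
        if name.lower().startswith("dc.") and content is not None:
            # An element such as DC.subject may repeat, so keep a list.
            self.metadata.setdefault(name[3:].lower(), []).append(content)


def extract_dublin_core(html):
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.metadata


sample_page = """
<html><head>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">
<meta name="DC.title" content="VPN Setup Guide">
<meta name="DC.creator" content="J. Jones">
<meta name="DC.subject" content="vpn">
<meta name="DC.subject" content="remote access">
<meta name="keywords" content="not Dublin Core, ignored">
</head><body></body></html>
"""
print(extract_dublin_core(sample_page))
```

    The resulting dictionary can be fed to whichever indexer is chosen as fielded metadata.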

    --
    -- "Since the best cannot be had, we must take the next best." -- Abraham Platz, mayor of Leipzig, 1723.
  91. Folio Views Experience by tjstork · · Score: 1

    I used Folio Views a long time ago for a litigation support project. To really pull it off, I had to back it with an RDBMS. So, I had a full text search using Folio Views, and then an RDBMS to do metadata searches. All the documents were supplied as both plaintext and as OCR. Paralegals coded the RDBMS portion as part of the document review, and the OCR was fed into a Folio Views NFO. The thing is, the NFO was really only good for static datasets... you took all of your text, built a sort of a hyperlink system onto it using Folio Views' proprietary tags, and then you were off to the races. The full text search, though, was really good.

    Were I able to do it again, today, and I had plaintext, I'd probably be tempted to drop the whole lot directly into the database. Oracle and SQL Server both have some sort of a full text search thing you can get, although I've never used either. But that way, you could build queries that went against both the meta data and the document text at the same time.
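
    A rough sketch of querying metadata and document text in one statement. It does not use Oracle's or SQL Server's full-text features; instead it assumes a hand-built term table in SQLite, so the schema, the `add_document` and `search` helpers, and the sample rows are all hypothetical and only illustrate the shape of such a query.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, author TEXT, body TEXT);
    CREATE TABLE terms (
        term   TEXT,
        doc_id INTEGER REFERENCES docs(id),
        PRIMARY KEY (term, doc_id)
    );
""")


def add_document(conn, title, author, body):
    """Store the document and one row per distinct word in its text."""
    cur = conn.execute(
        "INSERT INTO docs (title, author, body) VALUES (?, ?, ?)",
        (title, author, body))
    doc_id = cur.lastrowid
    words = set(re.findall(r"[a-z0-9]+", body.lower()))
    conn.executemany(
        "INSERT INTO terms (term, doc_id) VALUES (?, ?)",
        [(word, doc_id) for word in words])
    return doc_id


def search(conn, word, author):
    """Titles of documents by `author` whose text contains `word`."""
    rows = conn.execute(
        "SELECT d.title FROM docs d JOIN terms t ON t.doc_id = d.id "
        "WHERE t.term = ? AND d.author = ? ORDER BY d.title",
        (word.lower(), author))
    return [row[0] for row in rows]


add_document(conn, "Deposition A", "Smith", "The contract was signed in March.")
add_document(conn, "Deposition B", "Jones", "No contract existed.")
add_document(conn, "Memo C", "Smith", "Lunch schedule.")
print(search(conn, "contract", "Smith"))  # ['Deposition A']
```

    A database's built-in full-text index would replace the `terms` table, but the join between text hits and metadata columns would look much the same.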

    --
    This is my sig.
  92. greenstone by Anonymous Coward · · Score: 0

    Use the digital library software: www.greenstone.org
    Highly configurable and it's all GPL.

    The same professor(s) behind Greenstone also developed something called 'weka': machine learning in Java.

    Enjoy

  93. The best searchable software? by hellmuthchileno · · Score: 1

    Virginia Systems from Canada, with the software we ordered customized for a Danish newspaper. Talk to Philip van Cleave and ask for WebSonar, for the Mac of course. At the moment it automatically indexes every single word (excluding the excluded words like "the", "a", "one", "two", etc.) and the search is almost instantaneous. It also searches by phonetic approximation. In the application I mentioned, the index contains over 1509ñ000 articles over 10 years. Is that enough for you? Ask for the price and sit down, because it's really che... Give my kind regards to Philip (philip@virginiasystems.com). He does not know that I now live in Chile. We met while I worked in Denmark. DO NOT SEARCH ANYWHERE ELSE. By far the best software ever.

    Kind regards,
    Hellmuth Stuven Lira
    Civ Ing, Webmaster, Apple Developer and writer

  94. Alternative to Lucene by Anonymous Coward · · Score: 0

    I'm not suggesting this as a solution to your problem, since there are a fair number of off-the-shelf solutions, but for reference, if you were going to create something using an API such as Lucene, check out the awesome open source Sphinx: http://sphinxsearch.com/

    It's very powerful and very fast, but you'd have to figure out how to convert the PDFs, the docs, etc. into a format that it will take. I'm currently using it for a scientific journal article system. I'm not indexing the full text of the articles but I am indexing the citation information, including abstracts. It works very well for my purposes.

  95. Have you heard of Coveo? by Anonymous Coward · · Score: 0

    They make a great tool for indexing large numbers of documents. It's commercial, so it's not free, and it's Windows only. That depends on what you need. http://www.coveo.com/

  96. Thanks by Blinocac · · Score: 1

    I appreciate all the input, and am reviewing many of the different solutions mentioned.

  97. When is a directory with files no longer good? by jollyreaper · · Score: 1

    I've looked into online document management solutions for use in my own company and have passed each and every time. For starters, we're just not big enough to justify that sort of thing. But even if we were, I'm doubtful that the work is worth the effort. I'm willing to be convinced otherwise, I just haven't encountered a good enough argument yet.

    My biggest concern would be the end users. People are just not tech savvy. Even when you go to great lengths to tell, show, and teach, they will still do boneheaded things that will amaze and confound. Take the simple example of "Don't store shit on your laptop! You can put something there temporarily but it better permanently reside on the server." It does not sink in. I can even try the approach of "Look, if you lose the data on your laptop, it's your ass on the line. Put it on the server and if it gets lost, it's my ass on the line." It still does not sink in.

    End users will create tons of duplicate data on the server. Oh, marketing pictures are in this directory? Rather than make a shortcut to get there, why not copy everything to my folder instead? Yes, yes, that's good. Ugh. I've got duplicate file detective so I can see just how much waste we have here. Management has been informed and has done nothing. When we run out of space, I'll just hand them the report again and tell them we have to start removing dupes.

    Another perfect example of how people don't listen ... our accounting system splits data for different subsidiaries across multiple files. So if Accounting wants to know what's outstanding for vendor A across multiple projects, they need a spreadsheet that pulls the info from all those companies. I've explained to Accounting that when they set up a vendor, they need to use the same 6-character abbreviation so that we can group the data properly by that vendor. Do they listen? No. And this directly creates more work for them!

    So given all this chaos, the best document storage system I can come up with is still just plain NTFS drives with directories on the server. I can lock access control with the security system and groups, people are told to manually file their work in sensible places. If they don't, Windows search still works well enough and can poke inside office documents for keywords. Contracts are scanned in as PDF's and a naming scheme is required so that we have lot, buyer, community, and contract edition in the filename. Revisions to the contract are scanned as they arrive and named as such. Additionally, contracts are placed in their own folders so if a data entry error is made in the indexing, we can still manually browse to where it's supposed to be to find it. Doing a full listing of the entire directory will also sort screwups to the top or bottom of the list where they can be located and corrected. What's more, shadow copy is working so I can go back and restore files in case anyone makes screwups. The whole array is backed up to external hard drives nightly with point-in-time saves going back six months in case an error is located after the backup is gone from shadow copy. So, at what point would this system be too primitive and actually impede the business of the day?

    Unrelated note: this is typical. Just got an email back from the photographer. I'm asking for larger versions of the photos that were already sent, they're just too small.

    I can try to make them bigger but the problem is that they start to blur.
    I will try and you can tell me what you think.


    Shoot me now.

    --
    Kwisatz Haderach
    Sell the spice to CHOAM
    This Mahdi took Shaddam's Throne
  98. Re:It's not MS's fault you suck by pasamio · · Score: 1

    Way to pro-MS troll on a post that actually gave Microsoft a compliment! The problem in this case wasn't with Microsoft; in fact I never said there was a problem with Microsoft. The problem lies with DOCS. These days it's improving, sure, but it's still a piece of trash (case in point: why does DocsX exist?). I don't disagree that Microsoft Office is the best office suite out there for those who use its more advanced features, although for the most part users are happy with changing font size, making things bold or underlined and adding images. But again, that's not the point of my post. The point of my post is how bad DOCS has been for a large number of organisations and how messed up their software is. I'm sure you use it with no problem; you seem to be able to misread with no problems either. In fact, to reply to your title "It's not MS's fault": it's true, it's not MS's fault at any point. It's DOCS's fault.

    --
    I always wondered where this setting was...
  99. Navigate the metadata, use Dieselpoint by ccleve · · Score: 1
    For most corporate-type search apps, you want both full-text search *and* navigation. Navigation, sometimes called faceted navigation, gives you the results broken down by category down the left-hand side of the results page. Something like this:

    --Departments-- (search results go down the middle of the page here)
    Sales (100)
    Marketing (200)
    HR (15)

    --Year--
    2006 (50)
    2007 (90)

    --Author--
    Joe Jones (40)
    Frank Smith (99)

    etc. Each row is a clickable link you can use to narrow the search. You can build these menus based on any available metadata for your documents.
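
    The counts in such a navigation menu can be computed directly from the metadata of the current result set. The sketch below is a minimal illustration, not how any particular product does it; the `results` list, the field names, and both helper functions are hypothetical.

```python
from collections import Counter


def facet_counts(results, fields):
    """For each metadata field, count how many results carry each value."""
    return {field: Counter(doc[field] for doc in results if field in doc)
            for field in fields}


def narrow(results, field, value):
    """What clicking a facet link does: keep only results with that value."""
    return [doc for doc in results if doc.get(field) == value]


results = [
    {"title": "Q3 forecast", "department": "Sales", "year": 2006, "author": "Joe Jones"},
    {"title": "Campaign plan", "department": "Marketing", "year": 2007, "author": "Frank Smith"},
    {"title": "Q4 forecast", "department": "Sales", "year": 2007, "author": "Joe Jones"},
]

facets = facet_counts(results, ["department", "year", "author"])
for field, counts in facets.items():
    print(f"--{field.title()}--")
    for value, count in counts.most_common():
        print(f"{value} ({count})")
```

    Over large collections the counts are usually precomputed or served from the index rather than recalculated per query, but the menu structure is the same.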

    Navigation is especially useful in a corporate setting because the relevance ranking isn't going to be as good as you get with a web search engine. The reason is that web search engines can take advantage of the links in documents to discover which pages are more important, and likely more relevant. You probably don't have a lot of links in your Word and PDF docs. So breaking them down by category is really helpful.

    There are a handful of companies that sell search software that can do this. The company I work for, Dieselpoint, makes enterprise search software that can create navigation contexts over really large collections.