Best Way to Build a Searchable Document Index?
Blinocac writes "I am organizing the IT documentation for the agency I work for, and we would like to make a searchable document index that would render results based on meta tags placed in the documents, which include everything from Word files, HTML, Excel, Access, and PDF's." What methods or tools have others seen that work? Anything to avoid?
You're the one gettin' paid, you figure it out.
Grep and flat files. The way God intended.
Check out Apache's free Lucene engine, found at lucene.apache.org/. Lucene is a powerful indexing engine that handles all kinds of docs, and you can easily mod it to handle whatever it doesn't. It also allows custom scoring and a very powerful query language.
We have a Google appliance, but you can do it with regular Google, too. Just make sure you disable caching (with headers or by encrypting documents). Then place an IP or password restriction for non-Google crawlers (check IP, not user-agent). People will be able to search with the power of Google, but only people you allow in will be able to get the full documents.
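One way to "check IP, not user-agent" is a reverse DNS lookup followed by a forward lookup that must return the same address. The sketch below is only an illustration: the function name, the accepted hostname suffixes, and the injectable resolver arguments are assumptions, and it handles IPv4 only.

```python
import socket

# Hostname suffixes treated as "Google crawler" here (an assumption for this sketch).
CRAWLER_SUFFIXES = (".googlebot.com", ".google.com")

def is_google_crawler(ip, reverse_lookup=None, forward_lookup=None):
    """Return True if `ip` reverse-resolves to a crawler hostname
    and that hostname resolves back to the same IP."""
    # Resolvers are injectable so the check can be exercised without network access.
    reverse_lookup = reverse_lookup or (lambda addr: socket.gethostbyaddr(addr)[0])
    forward_lookup = forward_lookup or (lambda host: socket.gethostbyname_ex(host)[2])
    try:
        host = reverse_lookup(ip).lower().rstrip(".")
        if not host.endswith(CRAWLER_SUFFIXES):
            return False
        # The forward lookup guards against someone faking their reverse DNS record.
        return ip in forward_lookup(host)
    except OSError:  # covers socket.herror and socket.gaierror
        return False
```

A web server or application filter could call this once per unknown client IP and cache the answer, since DNS lookups are slow.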
If you value your privacy, invest in a Google mini, though.
Who places what types of meta tags in the documents? I don't understand the requirements.
Generally, Lucene does a good job. It's easy to learn and performance was fine for me and my data (~ 2 GB of textual documents).
Because if you have to spend more than an hour on this kind of project nowadays, you're wasting your time.
The inexpensive Google appliances don't have very fine-grained access control, though. I am involved in several semi-failed projects of this nature in my organization, both new and legacy, and my Google Desktop outperforms all of them.
Give a man a fish and you have fed him for today. Teach a man to fish, and he'll say "WHERE'S MY FISH, YOU IDIOT?"
http://www.swish-e.org/
There are many open source solutions for what you are trying to do. Also, if you want it to be portable, I would suggest a CMS that does not require a MySQL database, like "Limbo". What does the organization do?
Hmm, avoid kat and beagle. Consider using a one-line Perl script with find and grep instead...
Is this something that would suit your needs: Beagle for Linux, Spotlight for OSX? I haven't tried Beagle (I don't have root access on my Debian installation at work), but Spotlight is probably my most cherished feature in OSX... it's so useful.
If you have a reasonable budget *and* an intranet, you can consider implementing a Google Appliance and pointing it at the network location that houses the documents. The side benefit is that documents can be found/accessed via browser.
Since you're indexing non-text data, you'll need a search engine that has plenty of document filters. We use Oracle Text to do something similar to this, but it's not for the faint of heart. The nice thing about Oracle Text is it includes filters for pretty much any document you'd want to index (PDF, Word, Excel, etc). Of course, Oracle Text query syntax needs an awful lot of lipstick to be made to look like Google query syntax. WMMV.
Good people do not need laws to tell them to act responsibly, while bad people will find a way around the laws-Plato
get your site indexed by google then add parameters like
;-)
site:mysitename.com
filetype:xml
cheap quick and easy
Try http://www.mnogosearch.org/; it is like a small free Google spider.
Damia
http://www.theregister.co.uk/2001/11/05/the_bofh_content_management_system/
The secret to creativity is knowing how to hide your sources. -- Albert Einstein
If you're using Office 2007, you can probably hack something together really quickly to pull the meta tags from the files and put them in a database. Not sure about the other formats you need, though--and support from Google, for instance, would probably be beneficial for your company anyway. Hope that helps!
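As a rough illustration of pulling meta tags out of Office 2007 files: those files are ZIP packages, and the title, subject, and keywords normally sit in a part named docProps/core.xml. The function name and the keyword-splitting rule below are assumptions for this sketch; storing the result in a database is left out.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the core-properties part of an Office 2007 package.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
}

def read_core_properties(path_or_file):
    """Return title, subject, creator and keywords from a .docx/.xlsx/.pptx file."""
    with zipfile.ZipFile(path_or_file) as package:
        # Assumes the conventional part name written by Office itself.
        root = ET.fromstring(package.read("docProps/core.xml"))

    def text(tag):
        node = root.find(tag, NS)
        return node.text.strip() if node is not None and node.text else ""

    # Keywords are free text; splitting on commas and semicolons is a guess
    # at how users separate them.
    raw = text("cp:keywords").replace(";", ",")
    keywords = [word.strip() for word in raw.split(",") if word.strip()]
    return {
        "title": text("dc:title"),
        "subject": text("dc:subject"),
        "creator": text("dc:creator"),
        "keywords": keywords,
    }
```

Older binary formats (.doc, .xls) and PDFs need different extractors, which is where the "not sure about the other formats" caveat bites.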
A googlebox. Indexes file shares and internal websites and makes them searchable. Can be a little pricey though.
#!/bin/csh cat $0
You should avoid any system that relies on individual employees putting in these meta-tags. It won't work; they either won't do it, or will do it wrong (spelling errors, inventing their own tags on the fly, and so on). And then you'll catch hell when they can't find one of those documents they mislabeled. Trust me.
It will cost you some bucks, but just buy MS SharePoint Portal Server and leave the indexing to SharePoint.
You're not even really required to use added tags... (as most people will put in poor tags).
But if you like, you can add tags even with SharePoint.
I know you're out there. I can feel you now. I know that you're afraid. You're afraid of us. You're afraid of change.
We use Livelink.
It's huge, kludgy, awkward, slow, resource intensive, but it works (when it's up).
Let google do the indexing!
Don't use SAMBA for .doc and .pdf's, use Alfresco.
I posted this before on Slashdot. A while ago I discovered a cool system called Alfresco. There are free (as in liberty) and commercial versions. It acts like an SMB (like SAMBA), FTP, and WebDAV server, so you don't have to use the web interface to get files into the system. Users can map it as a network drive. The web interface allows users to set metatags, retrieve previous versions of the file, and most importantly, search the documents in the system.
Alfresco also has plugins for Microsoft Office so you can manage the repository from Word, etc. They are also working on OpenOffice integration.
I am not affiliated with Alfresco, just a happy user.
For searching Microsoft products you can't beat X1 as far as user interface.
I'd suggest you should consider a full-text search engine. First start here:
http://en.wikipedia.org/wiki/Full_text_search
If you're not afraid to do a little reading and potentially coding a custom front end, you may want to look at two of the big open source engines: Lucene and Xapian.
Lucene is quite popular now, and is an Apache Java project. It's a good choice if you're a Java shop.
Xapian seems to be based on a little more solid and modern information retrieval theory and is incredibly scalable and fast. It's written in C++, with SWIG-based front ends to many languages. It might not have as polished of a front end or as fancy of a website as Lucene, but I believe it's a better choice if you have really really huge data sets or want to venture outside the Java universe.
There are also many other wholly contained indexers, mostly based on web indexing (they have spiders, query forms, etc., all bundled together), like ht://Dig, mnogosearch, and so forth. They are good, especially if you want more of a drop-in solution rather than a raw indexing engine, and if you're indexing web sites (and not complex entities like databases, etc).
Disclaimer: I flog Google search solutions at work, so I'm way biased.
Do not mock my vision of impractical footwear
Directories full of random documents in random formats of random version with varying degrees of completeness and accuracy tend to get less useful as an information source as time goes on. Docs get abandoned and continue to provide outdated information and dead links. Doc formats change and require converters to import. Doc maintainers leave the company.
If you work somewhere where people are not trained to attach Office docs to every email, where people don't use Word to compose 10 bullet points, where people don't use a spreadsheet as a substitute for all sorts of CRM and business applications... a Wiki is actually a good solution.
You can use something like MediaWiki or Twiki or... heck you can use a whole variety of content management systems.
The key to success is to *EMPOWER* people to actually update information, and have a few people who are empowered to actually edit, rehash, sort, move, prune wiki pages and content. As the content improves, it will draw in more users and more content creators. Pretty soon, employees will *COMPLAIN* when someone sends out information and doesn't update the wiki.
Some corporate cultures are not wiki-friendly. Some management chains *fear* the wiki. Some companies have whole webmaster groups who believe it is their job to delay the process of getting useful content onto the web by controlling it. If you're in one of those companies... start up your own wiki and beg for forgiveness later.
In an image-based solution. My business requires customer access to literally thousands of individual images. It would be nice to be able to scan them all in and tag them appropriately (multiple tags!) so as to create an easily searchable database.
"He who can destroy a thing, controls a thing." --Paul Atreides, Dune
Always write their own homebrew search engines.
technical writing / development
How about IBM FileNet? Or are you looking for something free? We use FileNet everywhere I've been.
The downside to the suggestions like Google Appliances is that you're then storing this information on Google servers.... something that most companies find HIGHLY objectionable (security).
-Daniel
DocDB (http://docdb-v.sourceforge.net/) can interface to the search engines others are suggesting, but organizing your documents with decent meta-data in the first place (and not on a Wiki that is allowed to rot) is also important. That's what DocDB does.
I'm doing something similar at work. The Extensis web stuff they sell is a bit pathetic, so you're better off buying their SQL connect stuff and building your own web frontend.
We've been quite happy with our Google Search Appliance.
The two exceptions are the way it handles secured documents (on our mostly-Windows network, that meant authenticating twice or doing complicated Kerberos stuff), and hardware (we've had two boxes fail with drive issues in the last year).
Still, when it comes to search results and speed, it's been very good. I'm also a fan of Google Desktop, but that's a completely different story and more difficult to centrally manage.
So far, everything I've read really doesn't sound like it's geared towards an enterprise level; mostly put a bunch of files out there in a folder somewhere and let a crawler index them. That's all good and fine until someone gets the idea to search for the payroll documents...
At work, and granted it's a fair size HMO, we use Universal Content Management by Oracle, formerly known as Stellent's Content Management System.
UCM allows for named accounts with control over access, plus a full audit log, full change log (plus revisions), and a centralized location for searching for document of any type (Word, Excel, Powerpoint, AutoCAD, MPG movies, TIFF, etc.).
Supports work flows as well, a nice plus if something needs to go through a formal process, gives nice audit trails, and supports 3 different full text indexes (FAST, Verity, and database) of the stored content.
This might not be for everyone, but it is a decent tool for large size companies to manage documents.
III.IIVIVIXIIVIVIIIVVIIIIXVIIIXIIIIIIIIVIIIIVVIII
Seriously, spend a tiny bit of money on a Google Appliance and get excellent search. I tried to use MS stuff, like the built-in index server and it just wasn't good enough.
We got a Google "appliance" and the damn thing just works, and works well. I don't work for Google, nor do I get paid if they make a sale. Just saying what worked great for us.
General, you are listening to a machine! Do the world a favor and don't act like one.
How about avoiding Word docs, Excel spreadsheets and Access databases?
Meta tags are worthless, generally, unless you have a librarian who ensures correctness.
DON'T TRUST USERS TO ENTER META DATA!!!
I've worked in electronic document management in 3 different businesses and metadata entered by end users is worse than worthless - it is wrong. Searches that don't use full text for general documents are less than ideal.
Just to prove that your question is missing critical data:
- how many documents?
- how large are the average and largest documents?
- what format will be input? PDF, HTML, XLS, PPT, OO, C++, what?
- what search tools do you use elsewhere?
- any budget constraints?
- did you look at general document management systems? Documentum, Docushare, Filenet, Sharepoint? If so, what didn't work with these systems?
- Did you consider OSS solutions? htdig, Swish-e, custom searching?
- A buddy of mine wrote an article on "how to index anything" that was in the Linux Journal a few years ago. Google is your friend.
AND if I didn't get this across yet - DON'T TRUST META DATA IN HIDDEN DOCUMENT FIELDS - bad metadata in MS-Office files will completely destroy the usefulness of your searches.
I once integrated Swish++ as a document search system for a MediaWiki installation, to handle uploaded documents. I liked the results so then I started using it to build an index on a large codebase so I could quickly find all usages of a particular symbol (in source files, libraries and executables too). The catch is you have to define how to translate each type of file into plain text so it can be indexed. There are plenty of tools available for Word docs, PDFs, nm for libraries, etc. Compared to some others I think Swish++ has the advantage of speed. I haven't tried Lucene but my feeling is I'd rather not use Java for that unless the whole system is in Java.
HyperEstraier has an excellent reputation but I haven't tried it yet. It's harder to get going with it.
Too bad Beagle is written in .net; sounded like a well-integrated solution otherwise...
The requirements spec there reads like most of the projects I've worked on the last few years. *sigh*
In light of the above I can't (IGC) recommend anything specific, but I can advise you to avoid:
1) In-house solutions (expensive, usually buggy).
2) Anything from Thunderstone (if they've fixed the numerous Vortex bugs over the years I might revise my opinion, but my last experience was painful).
3) MS Full text search/indexing (slow - and yeah, you can throw a load of hardware at this, but that's hardly the optimal solution).
4) Lucene (I've seen too many sites with dead Lucene searches).
The recommendations re Google are probably safe bets ("nobody ever got fired for buying Google") and I've had a lot of success with Swish-e for smaller (20,000 docs) projects.
And it reports everything straight back to Google! Such a deal!
Microsoft Index... oh nevermind. I can't get it out with a straight face.
Lucene is the way to go. There are APIs for Perl for dealing with Lucene data sets and for many other languages as well. Nutch is a good place to start getting to know the power of Lucene - you can get a nutch crawler interface up and running quickly and you can browse through some of the source files to get an understanding of how to bring in various file formats - Office documents, PDFs, etc.
The Google Search boxes are decent, but with any commercial solution you end up paying fees for the amount of documents in your index. They open source the code, presumably because of OSS components (maybe even Lucene) but the documentation they publish is laughable.
Last I saw, Plone is very web-centric. It is designed to manage a web site. Alfresco is designed to handle documents. It is more similar to SharePoint than Plone.
There are 2 problems: getting plain text out of documents, then indexing the plain text.
A good tool for getting plain text out of various versions of Word documents is the "antiword" command line utility.
The Apache POI project (Java) can read and write several Microsoft Office formats.
For indexing: I like Lucene (Java), Ferret (Ruby+C), and Montezuma (Common Lisp).
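To make the second step concrete, here is a toy inverted index over text that has already been extracted (by antiword, POI, or whatever fits the format). It is a sketch of the idea only, not how Lucene, Ferret, or Montezuma work internally; the function names are made up for illustration, and there is no stemming or ranking.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """docs maps a document name to its already-extracted plain text.
    Returns a mapping of term -> set of document names containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index

def search(index, query):
    """Return the names of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result
```

For example, after `index = build_index({"a.doc": "Backup policy for mail servers"})`, calling `search(index, "mail backup")` returns the set containing "a.doc". Real engines add stemming, scoring, and on-disk storage on top of this basic structure.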
I have mostly been using Ruby the last few years for text processing. Here is a short article I wrote using the Java Lucene library using JRuby:
http://markwatson.com/blog/2007/06/using-lucene-with-jruby.html
Here is another short snippet for reading OpenOffice.org documents in Ruby:
http://markwatson.com/blog/2007/05/why-odf-is-better-than-microsofts.html
---
You might just want to use the entire Nutch stack:
http://lucene.apache.org/nutch/
It's a stack that collects documents, spiders the web, has plugins for many document types, etc. Good stuff!
General purpose indexing is not very fine tuned. When it comes to enterprise documentation and the indexing around it, you get what you put into it. So if you don't have standards for meta-data then it's not going to meet the needs of your management. For example, if they type in "insert name of internal project" then they feel they should get all the stuff surrounding that project, regardless of whether the name of that project is anywhere to be found in that document.
Give EMC a call and talk to them about Documentum and all the things around it. Or organize your documents around projects/collaborations and use sharepoint and all the tools around it. There are a couple of things to get you started.
As a Documentum developer, especially in light of the recent 6.0 release, I'd be remiss not to recommend it for such a purpose. It's expensive, rather complex, and requires solid development talent to implement, but is almost infinitely configurable and customizable, and there are separate components (at cost, of course) that can add on all sorts of fun functionality like collaboration, digital asset management, etc. It has the ability to auto-tag documents based on configurable rules using Content Intelligence Services and supports extensible object hierarchies, workflows, lifecycles, taxonomies, web services, you name it. It's probably overkill for the user in question, and it's far from open source (although EMC is doing an admirable job at encouraging code exchange, and the new dev. environment is based on Eclipse), but it's pretty darn slick when you look at the ground it covers, functionally.
I'd recommend WSS 3.0. It can search any document that you can find/write an IFilter for with many built in, out of the box. It also provides an area for people to discuss a document without having to place comments in emails or the document itself (which could inadvertently make their way to a client).
If you're looking for an index, a document management system probably makes sense. This one is inexpensive and very good.
It is free, libre. Wumpus Search.
You may want to consider something besides full text searching (Google and company), as this usually starts to degrade fairly quickly with the size of the documents. So far I have not seen anything that actually comes close to human-indexed documents. There are several tools that help users tie documents into a pre-built taxonomy or thesaurus so you get consistency, accuracy, and a *well designed* solution, not random machine-learning-grouped results. They usually cost money, though, so be prepared. I think that Lucene has a taxonomy module that is in beta mode, so that really helps. Automated categorization is still quite terrible, so you will need to sit a few users down and have them tag the data. You will be much happier with the result, honestly.
well, this may not apply to you, as you do not mention the size/number of items to index.
but, for small shops where there is no money to throw at this type of thing, try IBM OmniFind Yahoo! Edition. can't beat the price.
http://omnifind.ibm.yahoo.net/index.php
I had an internship over the summer at a large scholarly journal archiving company that used Lucene. I found it to be very easy to learn and powerful to use and customize. I was easily able to manage the tags and whatnot for documents. I also didn't really notice any issues with scalability; we indexed millions of documents and were able to search them just fine. It also has some nice basic options to get you on the way with semantic indexing if that is your bag (there are some better tools for that, but a Lucene index is a good place to start for those).
Ze Atomic Device! It iz Ztolen!
Haven't you listened to the TED talk (cf: http://ted.com/)
on spaghetti sauces, which refers to Moskowitz's idea that:
There isn't a (one) best , only best
eg, best spaghetti sauceS, etc.
Some find one best, others find another best [for them]...
Whatcha think?
The project that I worked on was also concerned with who was able to access the data. For that reason, we used a wiki-like format, converted everything into text using a variety of conversion methods, and assigned access controls to it via an in-house web-based application.
This allowed the full text to be searchable and provided a reference back to the original file. If it was in a digital format, like a Word document, it was also stored in the database. If it was not, the entry referenced a physical file. The user could suggest modifications to the searchable entry if an error was found. The archive team would investigate the suggested correction and usually implement it.
2 cents,
QueenB.
HDGary secures my bank
I once spent 4 hours hacking together a symbol indexer for the 10,000+ CPP files in our source code repository. I wrote it in Python. It worked by brute force: "For every directory, for every .h or .cpp or .c file, crack it open, and, line by line, look for all instances of this regex..."
It's a little slow: 10 seconds to look up all instances of a symbol. And it takes ~3 hours to refresh the full index.
But it saves an enormous amount of time, makes impossible tasks possible, and I have used it every day since. It's been about 8 months now, and it's been absolutely wonderful.
It would have taken far longer and many more resources to begin to figure out how to hook in Lucene, or some other heavy duty package.
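The brute-force walk described above might look roughly like this. It is not the poster's script: the word-boundary regex, the function name, and the return format are assumptions, and this version scans on every lookup instead of building a stored index.

```python
import os
import re

SOURCE_EXTENSIONS = (".h", ".c", ".cpp")

def find_symbol(root, symbol):
    """Walk `root` and return (path, line number, line) for every line in a
    C/C++ source file that mentions `symbol` as a whole word."""
    # Whole-word match is a guess; the original regex is not given.
    pattern = re.compile(r"\b" + re.escape(symbol) + r"\b")
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            if not filename.endswith(SOURCE_EXTENSIONS):
                continue
            path = os.path.join(dirpath, filename)
            # errors="replace" keeps odd encodings from aborting the scan.
            with open(path, encoding="utf-8", errors="replace") as source:
                for lineno, line in enumerate(source, start=1):
                    if pattern.search(line):
                        hits.append((path, lineno, line.rstrip()))
    return hits
```

Caching the hits per symbol in a dictionary or a small database would be the obvious next step toward the "refresh the full index" workflow.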
To answer the original question, I'm currently using Microsoft's Indexing Server. I bought SearchSimon for $25 or so - basically a disappointment, but still a worthwhile purchase for me, since I found it useful to look at the included source code. I found an article which I can't find right now about rolling your own page based on the Index Server engine.
But this raises my question - how do you enforce metadata tagging? I can't even find a decent Windows-based metadata browser/editor for after-the-fact bulk tagging. I'm aware that certain commercial projects hook the save function on MS Office and replace your save dialog box with one that has required metadata fields. Is there a way to get that functionality without paying thousands and thousands of dollars? Or, if one is going to pay that kind of money, what project should one go with?
PS - Thanks to everyone mentioning Nutch/Xapian, I will definitely check them out tomorrow.
1/2 joking 1/2 serious.
grep -R "foobar" /
Inxight (now apparently part of Business Objects) has a very good knowledge search, data mining, and concept analysis system in their Smart Discovery servers. I don't work for them, but helped evaluate and deploy the product in a previous job. Definitely had some useful features beyond just indexed search. http://www.inxight.com/products/
Someone mentioned htDig. I would just like to mention that I had much success with it. It's a C++ based crawler and search engine with customizable templates. I built a mod_perl wrapper to search 60 databases, a total of 1GB, and got response times of about 0.1 seconds per query, including fuzzy searching. Actually it has so many things to tweak it is crazy. However, this was a while ago and you may want to check the others mentioned here.
just my 2c, our organization uses Funnelback (www.funnelback.com) to do exactly this and it works a treat. it really changed the way we work. You can ask them for a quote or trial it yourself.
Don't search meta tags, search the content. Adding meta tags to documents is time-consuming and unreliable, and the tags are difficult to maintain. If you index the content of the files to be searched you don't run the risk of missing a vital nugget of information buried deep in a document. I have used ISYS for years to do just this. It's easy to set up and maintain, fast and accurate, and indexes files that Google Desktop won't touch.
We use Google apps at the place I work and it's great. Gmail, search, maps, etc.
:P
Of course, I work AT Google...
Look at IBM OmniFind Yahoo Edition. It's free, based upon the Lucene engine, supports 200 document types and websites, easily customized through a GUI, supports up to 500,000 documents, etc. This product's engine will be used in IBM's new version of OmniFind Enterprise Edition to be released sometime next year.
Users should not be allowed to search for files in the first place.
Consider a Google search box
Or, you know, you could add meta data to each and every single page you want to index... I'd personally rather stab my eyes out with a ballpoint pen.
no thanks
People have made a lot of good suggestions.
My suggestion is the Zoom Search Engine.
But I am way biased, as I wrote half the code.
Some other things to consider.
1) Some of the solutions are Linux or Windows only. And some of the Linux solutions can't index Office documents. (Linux modules to extract text from all Office documents are not always available.)
2) Don't forget about the new Office 2007 document formats (the compressed XML formats). They are really different from the Office 2003 formats.
3) You stated that you wanted to index Access databases. In this case you will probably need to expose the content of the database via web pages, to allow the spider to spider them. For example,
http://www.yourwebsite.com/AccessDBRecord.php?id=1
http://www.yourwebsite.com/AccessDBRecord.php?name=Project1
http://www.yourwebsite.com/AccessDBRecord.php?name=Project2
etc..
4) You might need to manually edit the meta data on some documents. If the document is read only and can't be regenerated, then you might need a method, like the Zoom's .desc files, to add meta data to read only Office files.
5) Get a native code solution. The search time benchmarks we did show that compiled C++ code will outperform PHP and other scripting languages by 10 times or more.
Other people have said most of this already, but the ones I've seen used the most are:
Disclaimer: I work for Google but not on search. I definitely think you should use only as big a hammer as you need for the job, or maybe a little bigger to allow for growth. I've even seen Lucene used on small, internal, Java projects at Google where our full-blown web search infrastructure would have been the equivalent of a thermonuclear flyswatter.
Check out Alfresco (http://alfresco.org/).
It's an open office / lucene / tomcat based content management system. It has a powerful smb/cifs interface and indexes all office docs out of the box.
I've seen their product in action - it's fast and will index almost anything: http://www.openkast.com/
You have some documents you want to index. How many? How many users? What advanced features do you need (if any)? What's your budget? What technologies and languages are you comfortable with? What OS does it need to run on?
Where I work, we've used htdig, Verity K2 and Google search appliance, and have looked at (and heard good things about) Lucene.
Which one I'd recommend would depend entirely on the answers to my questions.
It's official. Most of you are morons.
Hello, I have a similar issue to the article writer's. I work in 3D visualisation/animation and we have multiple texture archives which are sorted by applying a filename structure and putting them into specific folders, e.g. x:\nature\plants\tree_d_1024r_n.jpg (d=diffuse, 1024=resolution of the longer side, r=rectangular, n=non-tileable). However, I think that a tag-based system would be much better for sorting all our stuff (I spend roughly 20% of my work time searching for specific files). Our files are currently hosted via Samba on Ubuntu Linux. My question: is there an EASY, fool-proof way of using tags to sort our archives? Thanks in advance.
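Since the naming scheme already encodes tags, a first step could be to derive them automatically from the existing paths. The sketch below only knows the codes given in the example (d, r, n); the code tables, the function name, and the "1024px" tag format are assumptions, and unknown codes are kept as-is.

```python
import re
from pathlib import PureWindowsPath

# Code tables: only the values named in the example are known; extend as needed.
MAP_TYPES = {"d": "diffuse"}
SHAPES = {"r": "rectangular"}
TILING = {"n": "non-tileable"}

# name _ map type _ resolution + shape _ tiling, e.g. tree_d_1024r_n
FILENAME = re.compile(
    r"(?P<name>.+)_(?P<map>[a-z])_(?P<res>\d+)(?P<shape>[a-z])_(?P<tiling>[a-z])"
)

def tags_from_path(path):
    """Turn a texture path into a set of tags taken from folders and filename."""
    p = PureWindowsPath(path)
    # Folder names become tags; the drive/root part is skipped.
    tags = {part.lower() for part in p.parent.parts if part != p.anchor}
    match = FILENAME.fullmatch(p.stem.lower())
    if match is None:
        tags.add(p.stem.lower())  # filename does not follow the scheme
        return tags
    tags.add(match["name"])
    tags.add(MAP_TYPES.get(match["map"], match["map"]))
    tags.add(match["res"] + "px")
    tags.add(SHAPES.get(match["shape"], match["shape"]))
    tags.add(TILING.get(match["tiling"], match["tiling"]))
    return tags
```

The resulting tag sets could then be loaded into any of the indexers discussed in this thread, so that a search for "plants diffuse" works without anyone retyping metadata.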
Docs...you've got to be kidding me right? This bloated ugly piece of trash rarely works properly, just look at the mess they made in their latest release. I'd pitch at it being worse than Windows Vista because Microsoft is at least in a position to improve on Vista, DOCS went back and rewrote their last version to get an upgrade because the one they had been working on didn't work properly and it had customers left right and centre complaining about it. We've got a DOCS deployment and Helpdesk is still fighting to get it working properly in their SOE, at least these days it properly supports Office 2000 (we had a hack at one point that prevented people clicking the cross on Excel because it broke the integration system). Last time I checked it had issues with Office 2003 let alone anything beyond that. If you're in the "real world" and on Windows, stay the hell away from DOCS.
I always wondered where this setting was...
Here is the blurb on the Concept Searching homepage:
"Most meaning is expressed in short patterns of words and conceptSearch is the only search product to automatically recognize multi-word concepts and use these as the basis for searching. Single words in isolation are highly ambiguous resulting in Low Precision. Whilst phrase searching can be used to improve Precision it does so at the expense of Recall since any document that does not match the exact phrase will be ignored. conceptSearch delivers High Precision and High Recall, with better ranking of results, compared to all other search engines that utilise an index of single words."
If you are running a taxonomy to organise your information then it will also do automatic document classification.
There is also a version available for MS SharePoint.
...when I had to write a quick-n-dirty wiki attachment search tool..
# perform search, checking if fname_only, cs, etc.,
# stripping non-printable ascii
sub DoSearch{
    my $path = $searchpath;
    $path .= $twiki;
    find({wanted=>\&wanted,
          untaint=>1,untaint_pattern=>'^([\040-\176]*)$',untaint_skip=>1},
         @twikipaths);
    sub wanted{
        # skip RCS ",v" files; only consider files with the wanted extension
        if($_ !~ /,v$/i and /.+\.$search_ext$/i){
            # first try to match the search term against the file name
            if($cs && /$searchterm/){
                push @matched_files, $File::Find::name;
            }
            elsif(!$cs && /$searchterm/i){
                push @matched_files, $File::Find::name;
            }
            # otherwise scan the file contents line by line
            elsif(!$fname_only){
                open(DOC, $File::Find::name)||
                    die "Couldn't open $File::Find::name:$!\n";
                THISFILE: while(my $line = <DOC>){
                    $line =~ s/[^\011\012\015\040-\176]//g;
                    if($cs && $line =~ /$searchterm/){
                        close DOC;
                        push @matched_files, $File::Find::name;
                        last THISFILE;
                    }
                    elsif(!$cs && $line =~ /$searchterm/i){
                        close DOC;
                        push @matched_files, $File::Find::name;
                        last THISFILE;
                    }
                }
            }
        }
    }
}
Try Regain:
http://regain.sourceforge.net/
http://www.aduna-software.com/products/autofocus/overview.view
http://www.aduna-software.com/products/autofocus_server/overview.view
Fast, free (as in freedom, and as in beer), efficient document and metadata searching for a single desktop or large enterprise. I use it for searching thousands of HTML pages in local website mirrors.
http://www.aduna-software.com/images/screenshots/autofocus_server/autofocus_server3.png
http://www.aduna-software.com/images/screenshots/autofocus/query-answer-3.png
http://www.cs.mu.oz.au/mg/
To get more info including a peep into the book do a Google search on "Managing Gigabytes"
otoh for something cheap and cheerful there is htdig.
http://htdig.org/
It's remarkably good for indexing an intranet.
I'm a long-time user of mnogosearch and have tested other solutions like omega (part of xapian), lucene and the like. If you want a ready-to-use solution that supports a large set of document formats via filters, multiple languages... mnogosearch (http://www.mnogosearch.org/) is really a good solution. It works like a web crawler but can also work on a mounted filesystem, and it supports caching... worth a test if you want a good search engine.
i know this will give me flames, but:
you might try Oracle Text (also part of Oracle XE).
Supports 140 document formats, has a lot of options and works via SQL.
Can build indexes for documents stored in DB or in the file system.
You can even join the search terms from the document with the database records where metadata might be stored by your application.
I found that very helpful in similar projects. And it's free.
......why don't you try re-inventing the wheel?!
Terrier
Indri/Lemur
MG
I see a lot of people have already recommended Lucene, and I heartily agree.
But, I suggest you look at the various Lucene sub-projects to see if one of them meets your needs. For example, Nutch includes a crawler and parsers for Word/PowerPoint/PDF/HTML/etc. so you wouldn't have to write that part yourself. Solr is a webapp that wraps a Lucene index in a simple web service and comes preconfigured to run inside its own servlet container on a separate port, so that's pretty easy to set up and use.
-Esme
I've tried quite a few of the current systems, and looked at a number of the available APIs so far, in the hopes of creating something that'll do what I want.
Basically, I think all document indexers currently suck. Must-have features:
* Indexing documents on a per-sentence, per-paragraph, per-page, per-chapter, per-section (etc.) basis: I should be able to search for books that have the words "people" and "crimewave" or just sentences that contain those words. There's no point indexing a thousand cross-referenced and cited PDFs about psychology for terms like "neurons and fear". When I search a document collection for neurons and fear, I want it to show me paragraphs or sections that discuss those two topics together, in relation to each other, in depth. I guess this is similar to proximity searching, BUT...
* It MUST be able to bring up the right section. If the search engine just throws up a list that says, "yep, book121314 --- "Everything in the human body, in detail" (which is 98435 pages long) has both those words in it", then it's no better than grep. Not a single PDF viewer I've looked at on unix has the ability to open a PDF at a particular page, much less a certain anchor on a page, with given words highlighted.
Not so crucial, but important:
* Tagging. It should allow me to tag documents, pages, etc.
* Cross-referencing and comparison. Side-by-side scrolling of documents in different languages, or just different translations and commentaries on documents, a bit like what sword's UIs do, but more generally.
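To make the first two must-haves concrete, here is a minimal sketch of paragraph-level indexing in Python. It is an illustration only, not how any of the tools in this thread work: the function names, the sample documents, and the "blank line ends a paragraph" rule are all assumptions. The point is that each hit carries a location (document plus paragraph number) instead of just a document name.

```python
import re
from collections import defaultdict

def split_paragraphs(text):
    """Split plain text into paragraphs on blank lines (an assumed convention)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def build_paragraph_index(docs):
    """Map each lowercase word to the set of (doc_name, paragraph_number) it occurs in."""
    index = defaultdict(set)
    for name, text in docs.items():
        for number, paragraph in enumerate(split_paragraphs(text)):
            for word in re.findall(r"[a-z0-9]+", paragraph.lower()):
                index[word].add((name, number))
    return index

def search_paragraphs(index, *words):
    """Return sorted (doc_name, paragraph_number) pairs that contain every word."""
    hits = [index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*hits)) if hits else []

# Hypothetical sample data: both files mention both words,
# but only one paragraph discusses them together.
docs = {
    "anatomy.txt": "Neurons carry signals.\n\nFear changes how neurons fire.\n\nBones are rigid.",
    "essay.txt": "Fear is common.\n\nNeurons are cells.",
}
index = build_paragraph_index(docs)
results = search_paragraphs(index, "neurons", "fear")
```

A document-level index would return both files for "neurons" and "fear"; here only the paragraph that uses both words comes back, along with where it is. Opening a viewer at that spot is a separate problem that this sketch does not touch.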
There are many ways to skin this cat. I believe most of them have been mentioned, but I will outline my experiences anyway.
swish-e is a grand-daddy of an indexer. It can act as a robot, crawl your local file system, or get its input from STDIN. If indexing HTML, swish-e will index the document's metatags and provide field searching against them. Swish-e comes with a C, Perl, and PHP API. I don't think swish-e supports anything but ASCII very well.
kinosearch is my new favorite. Written in C but with a Perl API, this indexer works a lot like Lucene. Its resulting indexes (files) may be readable by Lucene. Kinosearch works by initializing a "document" with attributes, filling each attribute with values, and saving the document. Searching is fast and easy. It does not support wildcard searching, but uses extensive stemming instead. Kinosearch does not index files from your file system; you must parse your data and feed it to Kinosearch.
ht://Dig is nice, but the last time I looked, it had no API. I found this to be too limiting. It indexes documents.
The Google Appliance is cool (and kewl) but also very expensive. This black box (well, it is really gold or blue) does a lot of the work for you. Configuring its output is dependent on your ability to do XSLT. You can feed the Google Appliance database dumps and other streams of data. Nice. I still think the price is steep.
There's Plucene, a Perl port of Lucene. Too slow, and seemingly unsupported.
Lucene and its kin seem to be the Gold Standard these days. I appreciate that, but alas, I don't have any Java experience. Increasingly people swear against SOLR, a Web Services-based interface to Lucene.
Zebra is an unsung hero. It has been around for more than ten years, actively supported and used extensively in Library Land. (I'm a librarian.) This thing can index just about any kind of document. It supports every type of searching feature (stemming, wild card, fielded, Boolean logic, relevance ranked, etc.). It can read files or be fed things from STDIN. Fast!
As an added bonus, I advocate readers explore abstracting their search interfaces with something like OpenSearch or Search/Retrieve via URL (SRU). These abstract layers allow you to create user interfaces to your underlying indexers without worrying what those indexers are. In other words, these abstract layers define the syntax for queries, the transport mechanism to the index, and the structure of the returned result. Given such a framework, you can write an OpenSearch or SRU interface to your index, but if you decide that Lucene is not what you want to use anymore but Kinosearch is, then you can change your indexer without the need to change your user interface. Very nice. OpenSearch is simpler to implement but is weak when it comes to expressive searches and search results. SRU is more robust but also more complicated.
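The "swap the indexer without touching the user interface" idea can be sketched in a few lines of Python. To be clear, this is not OpenSearch or SRU (those define real query syntaxes and response formats); it only illustrates the layering. The class names and the two toy backends are made up for the example, and a real backend would wrap Lucene, Kinosearch, or whatever you settle on.

```python
from abc import ABC, abstractmethod

class SearchBackend(ABC):
    """Hypothetical common interface that every indexer wrapper must implement."""

    @abstractmethod
    def search(self, query):
        """Return a list of {'title': ..., 'url': ...} dicts."""

class SubstringBackend(SearchBackend):
    """Toy stand-in for one indexer: case-insensitive substring match."""

    def __init__(self, docs):
        self.docs = docs  # mapping of url -> (title, text)

    def search(self, query):
        q = query.lower()
        return [{"title": title, "url": url}
                for url, (title, text) in sorted(self.docs.items())
                if q in text.lower()]

class AllWordsBackend(SearchBackend):
    """Toy stand-in for a different indexer: every query word must appear."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, query):
        words = query.lower().split()
        return [{"title": title, "url": url}
                for url, (title, text) in sorted(self.docs.items())
                if all(w in text.lower().split() for w in words)]

def render_results(backend, query):
    """User-interface code: it depends only on the SearchBackend interface."""
    return [f"{hit['title']} <{hit['url']}>" for hit in backend.search(query)]

docs = {
    "/docs/backup.html": ("Backup policy", "Nightly backup to external drives"),
    "/docs/vpn.html": ("VPN setup", "How to configure the VPN client"),
}
first = render_results(SubstringBackend(docs), "backup")
second = render_results(AllWordsBackend(docs), "vpn client")
```

`render_results` never changes when the backend does, which is the whole benefit. OpenSearch and SRU give you the same separation over HTTP, with the query and result formats standardized.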
As someone who has made a 30-year career out of designing and building document management systems, I would urge you to look first at how you expect your users to find the documents they need. The expected results of a search should guide your choice of indexing methods - and the popular "meta tagging" method isn't always the best. There are shortcomings with all methods.
Full-text indexing allows users to search the entire contents of documents, but the results are imprecise and voluminous and not terribly useful in most cases (think web search engines here). Yes, you can find all documents that contain the word "patent", but you get a lot of old references to patent leather shoes in addition to what you were probably after. So, with full-text search you get it all, but force the user to subsearch for what they really want.
Using meta-tags gives the appearance of pre-classifying documents, and having the users do it themselves means you don't have to have a dedicated person to assign the tags. The disadvantage is that everybody makes up their own tags or, if you have a standard set, you have to rely on people being diligent about applying them. And tag popularity can easily change over time. For example, if you want to find docs that refer to "removable media", this might have garnered a "floppy" tag 15 years ago and "CD" or "DVD" today. You are therefore almost guaranteed to miss some documents using this method.
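One common way to soften the tag-drift problem is a controlled vocabulary that maps whatever people typed onto a canonical term at index and query time. A rough Python sketch follows; the mapping table, function names, and file names are invented for illustration, and somebody still has to maintain the table.

```python
# Hypothetical controlled vocabulary: free-form tags mapped to one canonical tag.
CANONICAL_TAGS = {
    "floppy": "removable media",
    "cd": "removable media",
    "dvd": "removable media",
    "removable media": "removable media",
}

def normalize_tags(tags):
    """Lowercase each tag and replace known synonyms with the canonical tag."""
    return {CANONICAL_TAGS.get(t.strip().lower(), t.strip().lower()) for t in tags}

def find_by_tag(documents, tag):
    """Return names of documents whose normalized tags include the normalized query tag."""
    (wanted,) = normalize_tags([tag])
    return sorted(name for name, tags in documents.items()
                  if wanted in normalize_tags(tags))

# Hypothetical documents tagged years apart by different people.
documents = {
    "backup-1993.doc": ["Floppy", "backup"],
    "backup-2007.doc": ["DVD", "backup"],
    "vpn.doc": ["network"],
}
matches = find_by_tag(documents, "removable media")
```

This only catches the synonyms someone thought to list, so it reduces the number of missed documents rather than eliminating them.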
Database indexing means that you list all your docs in a database, perhaps by title, author, date, or other fields that your users would find useful for searching. The advantage is that every document is indexed the same way, searching is really fast, and the results are usually relevant if your schema is meaningful. The disadvantages are that indexing the docs takes work on input and users need to know how to search to get the best results.
Finally, you could organize the docs by simple name and folder. This works fine for the desktop and users usually can identify the category that points them to the folder they want. The disadvantage is that this only works well for limited document sets. Once you start getting hundreds of categories and thousands and thousands of documents, things become too hard to find.
So - understand your users' search requirements and the size of your expected database. Only then can you make an informed decision about how to create and index the repository.
This is a rather nice book:
Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://people.ischool.berkeley.edu/~hearst/irbook/
Amazon link for the reviews (no, no referer tricks, don't worry)
http://www.amazon.com/Modern-Information-Retrieval-Ricardo-Baeza-Yates/dp/020139829X
I had the same issue with a smallish agency and after defining criteria and scoring qualified courses of action, I determined that COTS/appliances were the most cost-effective and management (in the stuffed shirt sense of the word) 'friendly' solutions. The Google Search Appliance (I would avoid the mini because you might be shocked by the number of documents you really have), and Vivisimo (vivisimo.com) were my top choices. For extensibility and access control via LDAP my recommendation was Vivisimo over Google, but management chose Google and it wasn't a bad choice. It's tough to match that 'bang for the buck' even if you are a good developer or have one on your staff.
Shop as usual. Avoid panic buying.
...has a free version that works really well. You'll probably need your own box for it but I've done it with CentOS on an old Dell box and it worked beautifully. Install is a snap and uploading docs is done from a zip. The PDF indexing is a little light, but everything else indexes just fine.
try tag2find
I think you should take a look at the search capabilities provided by something called FAST ESP. They are based out of Norway but used all over the US govt and tons of commercial entities (like LexisNexis). The website for them is www.fast-search.com and from people I talk to it is supposedly pretty robust and can do intelligent searches, data tagging, authorization against data and stores, geo-tagging, yada yada yada and etc...
W1z
News Reporters Make Tasty Polar Bear Treats!
You can try this: http://scan.sf.net/
http://dublincore.org/ is making an effort on document metadata, improving indexing through document headers. To me this is a straight line to follow.
-- "Since the best cannot be had, we must take the next best." -- Abraham Platz, mayor of Leipzig, 1723.
I used Folio Views a long time ago for a litigation support project. To really pull it off, I had to back it with an RDBMS. So, I had a full text search using Folio Views, and then an RDBMS to do meta data searches. All the documents were supplied as both plaintext and as OCR. Paralegals coded the RDBMS portion as part of the document review, and the OCR was fed into a Folio Views NFO. The thing is, the NFO was really only good for static datasets... you took all of your text, built a sort of a hyperlink system onto it using Folio Views proprietary tags, and then you were off to the races. The full text search, though, was really good.
Were I able to do it again, today, and I had plaintext, I'd probably be tempted to drop the whole lot directly into the database. Oracle and SQL Server both have some sort of a full text search thing you can get, although I've never used either. But that way, you could build queries that went against both the meta data and the document text at the same time.
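For what it's worth, the "one query against both the meta data and the document text" idea looks roughly like this. The sketch uses Python's built-in sqlite3 module with a plain LIKE match as a stand-in; it is not Oracle or SQL Server full text search, whose syntax is different and not shown here. The table layout and sample rows are invented.

```python
import sqlite3

# In-memory database with hypothetical coded meta data plus the document text.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE documents (
    id INTEGER PRIMARY KEY, title TEXT, author TEXT, year INTEGER, body TEXT)""")
conn.executemany(
    "INSERT INTO documents (title, author, year, body) VALUES (?, ?, ?, ?)",
    [("Deposition A", "Smith", 2006, "The patent was filed in March."),
     ("Deposition B", "Jones", 2006, "She wore patent leather shoes."),
     ("Memo C", "Smith", 2007, "No relevant text here.")])

def search(conn, author, year, phrase):
    """Filter on meta data columns and on the document text in a single query."""
    rows = conn.execute(
        "SELECT title FROM documents "
        "WHERE author = ? AND year = ? AND body LIKE ? ORDER BY title",
        (author, year, f"%{phrase}%"))
    return [title for (title,) in rows]

titles = search(conn, "Smith", 2006, "patent")
```

LIKE scans every row, so on a real collection you would want the database's own full text index in place of that clause. The shape of the query stays the same either way.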
This is my sig.
Use the digital library software: www.greenstone.org
Highly configurable and it's all GPL.
The same professor(s) behind greenstone also developed something called 'weka' -- machine learning in Java.
Enjoy
Virginia Systems from Canada, with the software we ordered customized for a Danish newspaper. Talk to Philip van Cleave and ask for WebSonar, for the Mac of course. At the moment it automatically indexes every single word (excluding the excluded words like the, a, one, two, etc.) and the search is almost instantaneous. It also searches by phonetic approximation. In the application I mentioned, the index contains over 1509ñ000 articles over 10 years. Is that enough for you? Ask for the price and sit down because it's really che Give my kind regards to Philip. philip@virginiasystems.com He does not know that I now live in Chile. We met while I worked in Denmark. DO NOT SEARCH ANYWHERE ELSE. By far the best software ever. Kind regards, Hellmuth Stuven Lira, Civ Ing, Webmaster, Apple Developer and writer
I'm not suggesting this as a solution to your problem since there are a fair number of off-the-shelf solutions, but for reference, if you were going to create something using an API such as Lucene, check out the awesome open source Sphinx: http://sphinxsearch.com/
It's very powerful and very fast, but you'd have to figure out how to convert the PDFs, the docs, etc. into a format that it will take. I'm currently using it for a scientific journal article system. I'm not indexing the full text of the articles but I am indexing the citation information including abstracts. It works very well for my purposes.
They made a great tool to index a large number of documents. It's commercial, so it's not free, and it's Windows only. That depends on what you need. http://www.coveo.com/
I appreciate all the input, and am reviewing many of the different solutions mentioned.
I've looked into online document management solutions for use in my own company and have passed each and every time. For starters, we're just not big enough to justify that sort of thing. But even if we were, I'm doubtful that the work is worth the effort. I'm willing to be convinced otherwise, I just haven't encountered a good enough argument yet.
... our accounting system splits data for different subsidiaries across multiple files. So if Accounting wants to know what's outstanding for vendor A across multiple projects, they need a spreadsheet that pulls the info from all those companies. I've explained to Accounting that when they setup a vendor, they need to use the same 6 character abbreviation so that we can group the data properly by that vendor. Do they listen? No. And this directly creates more work for them!
My biggest concern would be the end users. People are just not tech savvy. Even when you go to great lengths to tell, show, and teach, they will still do boneheaded things that will amaze and confound. Take the simple example of "Don't store shit on your laptop! You can put something there temporarily but it better permanently reside on the server." It does not sink in. I can even try the approach of "Look, if you lose the data on your laptop, it's your ass on the line. Put it on the server and if it gets lost, it's my ass on the line." It still does not sink in.
End users will create tons of duplicate data on the server. Oh, marketing pictures are in this directory? Rather than make a shortcut to get there, why not copy everything to my folder instead? Yes, yes, that's good. Ugh. I've got duplicate file detective so I can see just how much waste we have here. Management has been informed and has done nothing. When we run out of space, I'll just hand them the report again and tell them we have to start removing dupes.
Another perfect example of how people don't listen
So given all this chaos, the best document storage system I can come up with is still just plain NTFS drives with directories on the server. I can lock access control with the security system and groups, people are told to manually file their work in sensible places. If they don't, Windows search still works well enough and can poke inside office documents for keywords. Contracts are scanned in as PDF's and a naming scheme is required so that we have lot, buyer, community, and contract edition in the filename. Revisions to the contract are scanned as they arrive and named as such. Additionally, contracts are placed in their own folders so if a data entry error is made in the indexing, we can still manually browse to where it's supposed to be to find it. Doing a full listing of the entire directory will also sort screwups to the top or bottom of the list where they can be located and corrected. What's more, shadow copy is working so I can go back and restore files in case anyone makes screwups. The whole array is backed up to external hard drives nightly with point-in-time saves going back six months in case an error is located after the backup is gone from shadow copy. So, at what point would this system be too primitive and actually impede the business of the day?
Unrelated note: this is typical. Just got an email back from the photographer. I'm asking for larger versions of the photos that were already sent, they're just too small.
I can try to make them bigger but the problem is that they start to blur.
I will try and you can tell me what you think.
Shoot me now.
Kwisatz Haderach
Sell the spice to CHOAM
This Mahdi took Shaddam's Throne
Way to pro MS troll on a post that actually gave Microsoft a compliment! The problem in this case wasn't with Microsoft, in fact I never said there was a problem with Microsoft, the problem lies with DOCS. These days it's improving, sure, but it's still a piece of trash (case in point: why does DocsX exist?). I don't disagree that Microsoft Office is the best Office suite out there for those who are using its more advanced features; however, for the most part most users are happy with changing font size, making things bold or underlined and adding images, but again that's not the point of my post. The point of my post is how bad DOCS has been for a large number of organisations and how messed up their software is. I'm sure you use it with no problem, you seem to be able to misread with no problems either. In fact, to reply to your title "Its not MS's fault", it's true, it's not MS's fault at any point, it's DOCS's fault.
I always wondered where this setting was...
--Departments-- (search results go down the middle of the page here)
Sales (100)
Marketing (200)
HR (15)
--Year--
2006 (50)
2007 (90)
--Author--
Joe Jones (40)
Frank Smith (99)
etc. Each row is a clickable link you can use to narrow the search. You can build these menus based on any available metadata for your documents.
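Building those menus amounts to counting metadata values over the current result set and filtering when a row is clicked. Here is a small Python sketch under assumed names; the field names and sample results are invented, and a real engine would compute the counts inside the index instead of looping over results in memory.

```python
from collections import Counter

def facet_counts(results, fields):
    """Count how many results carry each value of each metadata field."""
    return {field: Counter(doc[field] for doc in results if field in doc)
            for field in fields}

def narrow(results, field, value):
    """Keep only the results whose metadata field equals the clicked value."""
    return [doc for doc in results if doc.get(field) == value]

# Hypothetical search results with their metadata.
results = [
    {"title": "Q3 forecast", "department": "Sales", "year": 2006, "author": "Joe Jones"},
    {"title": "Price list", "department": "Sales", "year": 2007, "author": "Frank Smith"},
    {"title": "Brochure", "department": "Marketing", "year": 2007, "author": "Frank Smith"},
]
facets = facet_counts(results, ["department", "year", "author"])
sales_only = narrow(results, "department", "Sales")
```

After each click you would recompute `facet_counts` on the narrowed list so the numbers in the menu stay consistent with what is on screen.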
Navigation is especially useful in a corporate setting because the relevance ranking isn't going to be as good as you get with a web search engine. The reason is that web search engines can take advantage of the links in documents to discover which pages are more important, and likely more relevant. You probably don't have a lot of links in your Word and PDF docs. So breaking them down by category is really helpful.
There are a handful of companies that sell search software that can do this. The company I work for, Dieselpoint, makes enterprise search software that can create navigation contexts over really large collections.