Search Engines for Your Intranet or Small Business?
coreboarder asks: "Google recently revamped their nifty little Google Mini. It now does 100,000 documents of 220 different formats, makes your bed, and pours your beer. Where I work we have a reasonably large amount of technical data files (~80,000) of varying formats stored on a number of Windows 2000 and 2003 servers. File access is handled by permissions on the containing folder(s). Over time duplication has crept in because people cannot find what they need where they expect it to be. The $3,000 price point on the Google Mini is very attractive, but is there a better way of making files and their content easily findable on a 1000-node network while still retaining their security? We also use ht://dig but it cannot handle all the file formats that would be involved here."
In that same vein, Gneral Tsao asks: "As an IT worker for a small research business, I'm trying to find a good text search engine for our subscriber-facing publications. After much searching, I've found a few prospects such as Mnogosearch (which we currently use), Nutch, and Swish-e, but really no discussion of or comparison between them. This seems like a job for the Slashdot community. An ideal solution for me would be able to handle 20,000 or so pages, have a customizable PHP frontend, and allow for some amount of control over categorization." Any suggestions?
I run the Boutell search engine on my company's internal website.
Give the users a shell and tell them to read the grep manpage.
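For what it's worth, the grep approach boils down to something like the sketch below. It walks a directory tree and reports matching lines, and it only makes sense for plain-text files, which is exactly the limit the question runs into with mixed formats. The name grep_tree is made up for illustration.

```python
import os
import re


def grep_tree(root, pattern):
    """Yield (path, line_number, line) for every line under root matching pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            try:
                # errors="ignore" keeps binary files from raising decode errors;
                # their content is still not searched in any meaningful way.
                with open(path, "r", encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            yield path, number, line.rstrip("\n")
            except OSError:
                continue  # unreadable file (permissions, locks): skip it


if __name__ == "__main__":
    for path, number, line in grep_tree(".", r"pump"):
        print(f"{path}:{number}: {line}")
```

Every search rereads every file, so this is fine for a few hundred documents and painful for 80,000. That is the gap an indexing engine fills.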
The guy deserves all the "Yuo should switch to Lunix beacuase whats wrong with find and locate? i can always find hello.c." replies he's gonna get.
We picked up a Mini about 2 weeks ago. The thing is amazing. It took only 90 minutes from the time I cut open the box it was delivered in to having our entire intranet and internet sites indexed and serving results. It's very easy to configure. Overall, it's a steal for any organization needing search.
One project in this area I've been playing with is Nutch.
I run Nutch, a project which is now part of the Apache Incubator. I'm indexing a few dozen gaming-related websites on www.playfuls.com. There is a lack of documentation, but if you read and play with the config files, you'll do fine.
I can't figure out from the original post how you expect the Google Mini to crawl your content. The Mini is limited to content accessible via a website interface. Also, the Google Mini doesn't have any way for you to securely restrict search access to your various content.
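To make the security point concrete: one common way to keep search from leaking restricted files is to trim the hit list against folder permissions before showing it to the user. Here is a minimal sketch of that idea, assuming a made-up folder-to-group table. FOLDER_ACL, can_read and trim_results are hypothetical names, not part of any product mentioned in this thread, and a real setup would query the Windows ACLs rather than a hard-coded dict.

```python
def _parts(path):
    """Split a UNC-style path into lowercase components, ignoring slash style."""
    return tuple(p.lower() for p in path.replace("\\", "/").split("/") if p)


# Hypothetical permissions: folder -> groups allowed to read everything below it.
FOLDER_ACL = {
    r"\\fs1\engineering": {"engineers"},
    r"\\fs1\finance": {"finance", "managers"},
}


def can_read(path, user_groups, acl=FOLDER_ACL):
    """Return True if any of the user's groups may read the folder containing path."""
    file_parts = _parts(path)
    for folder, allowed in acl.items():
        folder_parts = _parts(folder)
        inside = file_parts[:len(folder_parts)] == folder_parts
        if inside and allowed & set(user_groups):
            return True
    return False  # folders with no ACL entry are hidden by default


def trim_results(hits, user_groups, acl=FOLDER_ACL):
    """Keep only the search hits the user is allowed to open, preserving order."""
    return [hit for hit in hits if can_read(hit, user_groups, acl)]
```

Denying by default matters here: a file in a folder the table doesn't know about stays out of the results instead of showing up for everyone.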
Has filters for lots of doc types, you can write more.
http://www.namazu.org/
The long-term solution is to put your data into groupware. Lotus Workplace and Domino/Notes are the example of how this can and should be done.
Of course, Workplace limits the number of formats you can import into it, but definitely not the amount of data (well, hard disk space and whatever limit DB2 has still apply).
Why don't you use the recently released Google Desktop Enterprise Edition? It has access controls, the ability to be pushed out to all of the client computers seamlessly, filters for a huge number of file types, the option of plugins to read more file types, and is completely free.
http://www.enterfind.com/
Supports indexing docs on Windows shares directly (as well as HTTP crawling), supports hundreds of document formats (including exotic ones like dwg files), allows precise control over the indexing process, and allows access via a Web Services API as well as a browser.
No limitations on number of users or documents and fully customizable search page.
Disclaimer: I participated in the development of this product. They (company) are good people, take care of their customers.