Domain: htdig.org
Stories and comments across the archive that link to htdig.org.
Comments · 38
-
Re:Sphinx Search
Here's another vote in favor of Sphinx. I recently was presented with an online shopping site whose search functions were pathetically slow and inaccurate. I replaced these with Sphinx and now get incredibly fast results which are nearly always on target. You'll want to play with the weights assigned to fields and other features to optimize the searches, but if your content is already stored in a MySQL or PostgreSQL database, Sphinx should be one of your top contenders.
As the parent says, the indexing isn't real-time, but Sphinx has features to enable you to keep live indexes active while you reindex. The frequency of re-indexing will obviously depend on how important recency is for your users.
If your content is just text files, I'd consider htdig as well. While it's no longer being actively maintained, I've used it for years to index web archives of listserver postings with great success.
-
kinosearch, swish-e, zebra, ht://dig, etc.
There are many ways to skin this cat. I believe most of them have been mentioned, but I will outline my experiences anyway.
swish-e is a grand-daddy of an indexer. It can act as a robot, crawl your local file system, or get its input from STDIN. If indexing HTML, swish-e will index the document's metatags and provide field searching against them. Swish-e comes with a C, Perl, and PHP API. I don't think swish-e supports anything but ASCII very well.
kinosearch is my new favorite. Written in C but with a Perl API, this indexer works a lot like Lucene. Its resulting indexes (files) may be readable by Lucene. Kinosearch works by initializing a "document" with attributes, filling each attribute with values, and saving the document. Searching is fast and easy. It does not support wildcard searching, but uses extensive stemming instead. Kinosearch does not index files from your file system; you must parse your data and feed it to Kinosearch.
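To make the "document with attributes" model concrete, here is a minimal sketch in Python. It is not the Kinosearch or Lucene API; the class and method names (TinyIndex, add_document, search) are made up for illustration, and it does no stemming at all.

```python
from collections import defaultdict


class TinyIndex:
    """Toy in-memory inverted index keyed by (field, term). Hypothetical, not Kinosearch."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of document ids
        self.docs = []                    # document id -> stored fields

    def add_document(self, **fields):
        """Store one document and index every whitespace-separated term per field."""
        doc_id = len(self.docs)
        self.docs.append(fields)
        for field, text in fields.items():
            for term in text.lower().split():
                self.postings[(field, term)].add(doc_id)
        return doc_id

    def search(self, field, term):
        """Return sorted ids of documents whose given field contains the exact term."""
        return sorted(self.postings.get((field, term.lower()), set()))


index = TinyIndex()
index.add_document(title="Managing Gigabytes", body="compressing and indexing documents")
index.add_document(title="Search notes", body="indexing a mailing list archive")
hits = index.search("body", "indexing")
```

As with Kinosearch, nothing here reads files for you: the caller parses the data and feeds it in one document at a time.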
ht://Dig is nice, but the last time I looked, it had no API. I found this to be too limiting. It indexes documents.
The Google Appliance is cool (and kewl) but also very expensive. This black box (well, it is really gold or blue) does a lot of the work for you. Configuring its output is dependent on your ability to do XSLT. You can feed the Google Appliance database dumps and other streams of data. Nice. I still think the price is steep.
There's Plucene, a Perl port of Lucene. Too slow, and seemingly unsupported.
Lucene and its kin seem to be the Gold Standard these days. I appreciate that, but alas, I don't have any Java experience. Increasingly people swear against SOLR, a Web Services-based interface to Lucene.
Zebra is an unsung hero. It has been around for more than ten years, actively supported and used extensively in Library Land. (I'm a librarian.) This thing can index just about any kind of document. It supports every type of searching feature (stemming, wild card, fielded, Boolean logic, relevance ranked, etc.). It can read files or be fed things from STDIN. Fast!
As an added bonus, I advocate readers explore abstracting their search interfaces with something like OpenSearch or Search/Retrieve via URL (SRU). These abstract layers allow you to create user interfaces to your underlying indexers without worrying what those indexers are. In other words, these abstract layers define the syntax for queries, the transport mechanism to the index, and the structure of the returned result. Given such a framework, you can write an OpenSearch or SRU interface to your index, but if you decide that Lucene is not what you want to use anymore but Kinosearch is, then you can change your indexer without the need to change your user interface. Very nice. OpenSearch is simpler to implement but is weak when it comes to expressive searches and search results. SRU is more robust but also more complicated.
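A rough sketch of that kind of abstraction, in Python. This is neither OpenSearch nor SRU, just the same idea in miniature; SearchBackend, ListBackend and render_results are hypothetical names, and the stand-in backend does plain substring matching.

```python
from typing import Dict, List, Protocol


class SearchBackend(Protocol):
    """The only thing the user interface knows about an indexer (hypothetical interface)."""

    def search(self, query: str, limit: int) -> List[Dict[str, str]]: ...


class ListBackend:
    """Stand-in backend: case-insensitive substring match over an in-memory list."""

    def __init__(self, docs):
        self.docs = docs

    def search(self, query, limit):
        q = query.lower()
        hits = [d for d in self.docs if q in d["text"].lower()]
        return hits[:limit]


def render_results(backend: SearchBackend, query: str, limit: int = 10) -> str:
    """User-interface code: depends only on the search() contract, not on the indexer."""
    lines = [f"Results for {query!r}:"]
    for hit in backend.search(query, limit):
        lines.append(f"- {hit['title']}")
    return "\n".join(lines)


backend = ListBackend([
    {"title": "Zebra notes", "text": "Zebra is used in libraries"},
    {"title": "Swish-e notes", "text": "swish-e can act as a robot"},
])
page = render_results(backend, "zebra")
```

Swapping indexers then means writing another class with the same search() method; render_results does not change.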
-
Managing Gigabytes
http://www.cs.mu.oz.au/mg/
To get more info including a peep into the book do a Google search on "Managing Gigabytes"
otoh for something cheap and cheerful there is htdig.
http://htdig.org/
It's remarkably good for indexing an intranet.
-
sounds like htdig
http://www.htdig.org/: no Google ads either, custom search pages can be designed, indexes updated, etc. No Google widgets required here.
-
Thanks MS for feeding Google.
PS: Oh I forgot to add.
It's MS that drove us to Google's big yellow Linux box for indexing our web pages. Yes, it's the Google software that lets us index the ugly MS-Word and MS-PowerPoint files. If it were not for that I would be using ht://dig.
Thanks MS for feeding Google.
-
Choice C: Working search engine in short order
The main advantage to Google from an end user's point is the ranking algorithm which may be of dubious value for smaller sites. For $17000 per year, you can pay for a lot of maintenance and customization of your own search engine. It would be re-inventing the wheel, and not necessarily a better wheel, to write a search engine from scratch.
No need to write one from scratch; there are plenty out there, including some not on the list. Some of these are quite customizable: you can prune various servers, directories or file types from indexing. It's even possible to do custom pre-processing, for example removing all navigation menus identified by 'class' from the index. At the low end, there's even Swish and htdig.
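The pre-processing idea can be sketched with nothing but Python's standard library. This is an illustration, not a hook from any particular search engine: NavStripper and indexable_text are hypothetical names, the class name "nav" is an assumption, and the depth counting assumes every non-void tag is properly closed.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class NavStripper(HTMLParser):
    """Collect page text, skipping any element carrying the given class."""

    def __init__(self, skip_class):
        super().__init__()
        self.skip_class = skip_class
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.skip_depth or self.skip_class in classes:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def indexable_text(html, skip_class="nav"):
    """Return the text a crawler would hand to the indexer, minus the menus."""
    parser = NavStripper(skip_class)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ('<body><div class="nav menu"><a href="/">Home</a><br>'
          '<a href="/x">X</a></div><p>Real <b>content</b> here</p></body>')
text = indexable_text(sample)
```

Run something like this between the crawler and the indexer and the menu words stop matching on every page of the site.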
If you're a sucker for punishment you can even front end one of the higher end search engines with other protocols. For example, Z39.50 allows search clients like BookWhere, Procite and Endnote to do the search, something which is useful if you have a lot of research documents. Perhaps there is a use for LDAP here, too.
However, no way would it take months to install and configure an existing search engine in its basic form. If you have a machine, it takes 20 minutes to slap Debian (or your favorite Linux or BSD) on it and a few more to install the search engine and its prerequisites. Then you spend the rest of the week reading about it and tweaking it.
-
....or you could....
Save the $30k or more and use COTS hardware with mnogosearch or htdig (or ...)
-
Re:Gee - if only I used MS products....
The obvious thing to try then is to set up Apache (or Squid, or similar software) running as a reverse proxy on that machine.
The first thing I did when finding out about this tool was to install it on a Windows machine with a couple of Samba mounted network drives (I'm hoping that it will index the content of these drives, but I can't tell yet), then set up Apache as a reverse proxy to provide the indexed material as a URL that would be widely accessible on the local LAN.
So far I can't quite get it to work -- I can connect from another computer (a Mac running Safari), but first I get complaints about running the wrong browser, and then I get errors about invalid URLs that apparently aren't being passed through.
Still though, it seems certain that this should be doable, and if it can be done, this would beat the living snot out of my company's current ht://Dig based search engine.
Google is right to make this tool inaccessible from non-localhost access -- the average home user does not need to have the contents of their hard drive set up with an easy to browse, globally accessible search interface. And I can see where Google wouldn't want this to work on LANs either -- it would cut into their business of selling search appliances. But come on, this is right on the cusp of working as it is, and it's only in beta. If Google doesn't provide a way to turn on access for local (e.g. 192.168.x.x) addresses, I'm sure that Apache or something like it can be configured to do this.
-
Re:For Linux?
Have you tried ht://Dig?
-
Um, what?
So, what's the deal? I actually RTFA'ed, but did I miss something? What will KDE do that ht://Dig and mnogosearch and the like don't? User-friendly setup and use, I suppose.
-
Re:Will we see something like this on linux?
locate is good for a fast prebuilt database of filenames.... I'd maybe recommend htdig for a prebuilt database of contents.
-
Search engines are not always internet portals
Something a lot of folks are missing here is that search engines are used in applications, intranets, individual sites etc as well as Google type whole-internet portals.
When you click on 'Find Files' in Windows, or look for a song in your chosen P2P app, or look something up on your O'Reilly CD Bookshelf, or search /. for an old article, that's using a search engine just as much as Google is.
If you're interested in something for your own project, lucene is a great application-centric search engine. It's just a bunch of Java classes that you call from your application. Or you can use a website-centric engine such as htdig if you're dealing with an intranet or website rather than an app. They're both GPLed I think.
-
Re:not a good idea....
What they seem to be doing is offering an alternative in the area of Enterprise search.
Oh, you mean like what ht://dig has been doing since 1995?
-
Re:Hook it up to slashdot!
167 posts and no mention of ht://dig? It's a great open source search engine, and I've been using it daily (well, cron really uses it now, not me) to spider about 100 sites on my intranet, which has servers all over the world.
While not currently designed for massive whole-web spidering (it's aimed at single websites or intranets), ht://dig is a great starting point (and a lot further along than the Nutch 'nascent effort' mentioned in the story). Some database optimization to ht://dig seems easier than starting over with Nutch. Plus, the name 'Nutch' sucks.
-
Re:Googling your harddisk
I'm much more interested in having a google search available over my harddisk.
I thought I remember Google having a product like this, but I can't find it now.
MS Win2k and WinXP have an indexing service that's supposed to do just what you want. It's not enabled by default in 2k; not sure about XP. I've been afraid to try it for various paranoia and stability reasons.
HTdig was my next thought. It's designed for web pages, but I bet you could restrict it to your hard disk. However, the site says they don't index non-text files yet.
For some reason I felt like searching Freshmeat and came up with SWISH++. It says it can index hard drives and non-text files "such as Microsoft Office documents", although the method they describe they use is not one I'm sure would work since Office docs can be in Unicode.
Both HTdig and SWISH++ are GPL. There were other possibilities on Freshmeat, too.
-
Re:So, Where's the Web Site?
-
Re:That's not how it works though
Ok, so then when you need info on one of those resumes you do a word search through lots of compressed files? I don't think so...
You don't? I do. htdig with gzip/zip and word doc reading addons does a great job of looking inside all sorts of files for me all the time, compressed or not.
Nice try FUD-master.
-
Re:I might have prior art
Prior art of using hyperlinks from other documents, pointing to a document, and increasing the score. Of course, Google slightly refines this by saying, "removing documents from the sub-set that are from the same host or from an affiliated host as the particular one of the relevant documents" in claim #2. However, the MEAT of their patent (claim #1) was already done before their filing.
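For anyone wondering what "hyperlinks pointing to a document increase its score" looks like in code, here is a toy version in Python. It is not the method from the patent or from any prior-art system; inlink_scores is a hypothetical name, same-host links are simply dropped, counting each linking host once per target is my own simplification, and no attempt is made to detect "affiliated" hosts.

```python
from collections import Counter
from urllib.parse import urlsplit


def inlink_scores(links):
    """Score each target URL by the number of distinct other hosts linking to it.

    links: iterable of (source_url, target_url) pairs.
    """
    scores = Counter()
    seen = set()  # (source host, target url) pairs already counted
    for source, target in links:
        src_host = urlsplit(source).hostname
        dst_host = urlsplit(target).hostname
        if src_host == dst_host:
            continue  # a site linking to itself earns nothing
        if (src_host, target) in seen:
            continue  # each linking host counts once per target
        seen.add((src_host, target))
        scores[target] += 1
    return scores


links = [
    ("http://a.example/1", "http://c.example/doc"),
    ("http://a.example/2", "http://c.example/doc"),
    ("http://b.example/", "http://c.example/doc"),
    ("http://c.example/index", "http://c.example/doc"),
]
scores = inlink_scores(links)
```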
-
AltaVista appliance for intranet searching?
AltaVista used to be the best search engine; its strength lies in basic text searching and its incredible speed and scalability. Unfortunately it did not account much for the interlinked nature of the web and was easily subverted by web author tricks. These faults were mostly solved by Google.
However, just as Google offers a stand-alone embedded box, the Google Appliance, for use within corporate intranets, I suspect that is an area where AltaVista's technology could thrive much better.
Intranet searching and indexing is still a rather underexploited market. There's basically Microsoft's Index Server, flaws and all, the Google Appliance, and several good but not great minor choices such as ht://Dig. If we could get an AltaVista appliance that ran under Unix (or at least not bound to Microsoft) and underpriced the Google Appliance I would have to believe that a lot of companies would take notice.
-
Re:I have Beta Version
I have the same thing on my Linux laptop.
It's called htdig. I hit Alt-F2 to bring up the KDE "Run Command" dialog and type "s:<my_search_terms_here>". I get a Konqueror window with the search results.
Very useful.
-
Installed ht://Dig today... what timing!
What amazing timing... the meta keyword tag is declared dead on the same day I (finally) got around to setting up a search engine for my website.
Starting this morning I began reading the docs and installing the ht://Dig search engine. There are a lot of configurable settings.
When I first got it working, I immediately realized that the 350-some static html files on my site really only have a couple dozen different sets of meta tags (due to starting new pages by copying existing ones). In fact, many of my pages don't even have really unique titles that differentiate them from other similar pages on the site. If you're interested in seeing it, it's not yet linked from the rest of the site, but will be soon, at this new search page. The results still suck, mostly due to my poor meta and title tags.
That's not ht://Dig's fault, of course, and they do give you options to configure the weight for various things... and luckily I've used <h2> and <h3> tags for labeling sections on almost all the pages, so I turned up the weighting for the text in those and in the link text on the site.
Still I have a lot of work to do to make my little site nicely searchable... and most of it is in the titles and meta tags. The keyword meta tags are the one place where you can list words and be certain that a local search engine like ht://Dig will make use of them and display those pages.
Too bad the meta keyword tag was declared dead today.
-
Re:mod_google
You should check out htdig. It now comes with Redhat. It will crawl a web site or web sites, index them, and provide a web search. You can set it up to look a lot like google. You can tweak the parameters so that it pays attention to how often a page is linked and you can set up weights for how important a word is based on where in the page it is and even if it is in the link text that points to that page. I don't think that google has much more than that, but they seem to have their values well tuned.
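The "weights based on where in the page the word is" idea boils down to something like the following Python sketch. The field names and weight values are invented for illustration and are not htdig's actual configuration attributes or defaults.

```python
# Hypothetical weights: a hit in the title counts five times as much as one in the body.
FIELD_WEIGHTS = {"title": 5.0, "heading": 3.0, "link_text": 2.0, "body": 1.0}


def score(doc_fields, term, weights=FIELD_WEIGHTS):
    """Sum the term's occurrences per field, each multiplied by that field's weight."""
    term = term.lower()
    total = 0.0
    for field, text in doc_fields.items():
        count = text.lower().split().count(term)
        total += weights.get(field, 1.0) * count
    return total


doc = {
    "title": "Search tips",
    "heading": "Tuning search",
    "body": "search search results",
}
result = score(doc, "search")  # 5.0*1 + 3.0*1 + 1.0*2
```

Tuning a search engine like this is mostly a matter of adjusting those numbers until the pages you expect come out on top.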
It isn't the easiest thing to configure since there are so many options for crawling and ranking pages. The look and feel for the pages it spits out isn't as clean looking as google, so when I've set it up I've had to modify that as well. It doesn't do caching, or tie into a directory, but for a local search, those aren't much use anyway.
-
Dude, this article is more than 2 months old.
It's a very interesting article, but it came out in February. That aside it's good that some of these are getting mainstream press.
Protocols to mention besides OpenLDAP and OAI are Whois++ and Z39.50. OAI actually is transported over HTTP. You could do the same with EAD or others.
Projects which implemented Z39.50 for the purposes of interoperability are ONE and ONE-2, EUROPAGATE, Desire and Desire II, DECOMATE and DECOMATE II, and Renardus just to touch the surface. Don't forget OHIOLINK...
Other older, but interesting, metadata activities have been SGML MARC and the corresponding XML MARC.
Those that are interested in more detailed reading can check out the Nordic Metadata Project and Nordic Metadata Project II, which studied the practical implications of cross browsing multiple databases and especially the use of Dublin Core. Even if you get agreement on the protocol and data standard, cross searching's not as easy as it sounds. One of the tools is the Dublin Core Metadata Template (get it while you still can).
The BYTE article was exciting to see again and could have benefited further from pointing out the relative ease of use of Dublin Core. OAI uses unqualified Dublin Core; SAFARI uses qualified Dublin Core to create an up-to-date index over academic research in Sweden. Shoot, since it already uses some META tags, you could even tweak htdig to use Dublin Core on your own site for those high precision searches.
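To show how little is involved in reading Dublin Core out of a page, here is a small Python sketch that collects the usual "DC.xxx" META tags. It does not touch htdig itself; DublinCoreExtractor and dublin_core are hypothetical names, and it assumes the common convention of putting the element name after a "DC." prefix in the name attribute.

```python
from html.parser import HTMLParser


class DublinCoreExtractor(HTMLParser):
    """Collect <meta name="DC.xxx" content="..."> values, grouped by element name."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = attr.get("name") or ""
        content = attr.get("content")
        if name.lower().startswith("dc.") and content:
            element = name[3:].lower()  # "DC.creator" -> "creator"
            self.fields.setdefault(element, []).append(content)


def dublin_core(html):
    """Return a dict mapping each Dublin Core element to its list of values."""
    parser = DublinCoreExtractor()
    parser.feed(html)
    parser.close()
    return parser.fields


page = ('<html><head>'
        '<meta name="DC.title" content="Annual report">'
        '<meta name="DC.creator" content="A. Author">'
        '<meta name="DC.creator" content="B. Writer">'
        '<meta name="keywords" content="x">'
        '</head></html>')
fields = dublin_core(page)
```

The resulting fields could then be indexed separately from the body text, which is where the high precision comes from.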
With the interest in structured data (XML?) maybe we'll see some sites serving up not just HTML with Dublin Core, but maybe even Docbook or even TEI / TEI Lite. There are great tools for converting from Docbook to HTML, PDF, RTF, etc. and AbiWord and Kword already have partial support for docbook. If there were more, then we could see some real changes on searching the web. Coding for SGML is more difficult, so the obvious choice would be to start from Docbook XML.
-
Users or developers?
Are you documenting for users of the project or are you documenting for present and future developers of the project? The two are completely different and have different requirements as such.
A web application by nature should almost always be self explanatory. A help link or button should be available prominently on every page. The better you do this part, the less it costs to support your app.
Developer documentation for a web app also works well with HTML. Not only can you use comments extensively, you can link variables and functions from where they are used to their actual definition. A common way to structure HTML documentation is to have a frame with the left frame containing a tree of links, an index, and a search. I would use something like ht://dig rather than a database to index your docs and allow searches.
-
This was my final year project thesis
This was my final year project thesis. Just remember the golden rule: unstructured 2 structured == convert 2 XML. I wrote a [very bad] program in C++/Perl/tcsh (IPC=pipes) to add XML tags to English, and then index them into a search engine which would use the lingual data stored in the XML tags to help the search.
NIST does a MASSIVE competition on this annually. I don't want to be an XML-buzzword whore <Arnold Schwarzenegger accent> (XML commando eats Green berets, C++, Java, Perl, COBOL for breakfast)</Arnold Schwarzenegger accent> but you can't beat XML for easily converting anything that you can make sense out of into computer readable format. Real h3cKoRs use SGML, but us underlings have to stick with things we can understand like XML. As for expandability, if we want to encode something else into the document, then just tag-it-and-go
It took me 200 hours to fish out all these links (before the Google days), I don't want anyone to have to waste as much time as I did feeding the search engines exotic foods. It's a year old so pardon me for the odd broken link, armed with these you could probably turn jello into XML ;-)
My favourite bookmarx
PROJect[21 links]
Beginners' Guide[13 links]
Berkeley Linguistics Dept. Course Summaries, general stuff
Cryptic IR Vocabulary defined
Explanations of weird words like hypernym
How do we produce and understand speech
How Inverted Files are Created - University of Berkeley
NLP Univ. of Indiana, very good basics e.g. word sense d
Simple language - useful....
What is Natural Language Processing, links
What is POS tagging........
Word Sense Disambiguation defined
Word Sense Disambiguation in detail, scroll down far
Word Sense Disambiguator - LOLITA (tested at MUC-7 and SENSEVAL competition as best)
XML for the absolute beginner
HTML, XML stuff + parsers[19 links]
Apache plug-in that uhhh does stuff with XML
Convert COM to XML
convert XML, HTML to Unix pipeable formats
converters to and from HTML
expat XML parser
HTML Tidy - converts HTML 2 XML + source code!!
Parse DB (RDBMS, whatever) to XML
Perl-XML Module List
PHP Manual XML parser functions - what the hell are they talking about, PHP Virtual M...
Public SGML-XML Software
Pyxie - XML Processor for Python, Perl, etc.
SGML+XML tools.org
The XML Resource Centre - massive number of links
W4F wrapper - wrapper converts XML to HTML
XFlat - convert flat file into XML
XML Parsers and other XML stuff
XML.com - Parsers, etc.
XML-Data Catalog System - uhhhh looks close
XTAL's general converter - convert anything 2 XML
other Background[8 links]
Is Linux ready for the Enterprise, scalable...
Linux reliability
Linux Versus Windows NT, Mark(sysinternals bloke)
PC reliability (pcworld)
SPEC - Standard Performance Evaluation Corp.
Systems benchmarks
TPC - Transaction Processing Performance Council
Unix Beats Back NT In EDA Workstation Arena
Proper TREC(-8) QA systems[2 links]
pg. 387 LIMSI-CNRS pretty deep parsing[2 links]
More links....
NLP, IR links - lots to corpii, etc.
pg. 575 U. of Ottawa and NRL (shit system, got 0%)[1 links]
LAKE Lab
pg. 607! University of Sheffield (crap system, but OPEN SOURCE!)[2 links]
GATE - FREE IE app w`source code
LaSIE - ER, coreference, template (cv)
pg. 617 Univ of Surrey (inconclusive matches)[2 links]
System Quirk - Or is this their search system..... Hmmmmmm
Univ of Surrey - pointers (hopefully this is their WILDER search system...)
SMU - Pg. 65[1 links]
Natural Language Processing Laboratory at SMU
Textract[2 links]
Cymfony - Technology
Textract - State of the Art Information Extraction
Xerox uhhhhh maybe[1 links]
Xerox Palo Alto Research Center
(OVERVIEW) 1999 TREC-8 Q&A Track Home Page
NLP bloke, Univ Sussex
Tcl-Tk[4 links]
Tcl tutorial
Tcl-Tk Contributed Programs Index
Tcl-Tk Resources, sources
TclXML - manipulating XML using Tcl-Tk
Artificial Natural Language - Is this what I'm trying to parse into...
Comparison of Indexers - Prise vs. Inquery vs. MG, etc.
Eagles - Language Engineering Standards
Language Technology Group - lots of modules!
LDC - Linguistic Data Consortium, lots of corpora
Lexical Resources
Links 2 resources, indexers.....
Lots of IR stuff, University of uhhh
Managing Gigabytes Indexer
Managing Gigabytes Manuals and stuff
Htdig search system
NLP & IR (NLPIR, NIST) Group
OVERVIEW OF MUC-7-MET-2
Perl XML Indexing - XML search engine type thing
Phrasys Language Processing Software Components (money)
QA HCI bullshit
SIGIR - TREC-type thing, resources
SMART indexer system documentation
Text REtrieval Conference (TREC) Home Page
The Natural Language Software Registry
Thunderstone IE and IR products
WordNet - FREE DOWNLOADABLE lexical English database
Page created with URL+, nice utility for working with internet shortcuts
-
Heard of ht://Dig before? Any good?
I've never seen ht://Dig before. Where I've needed search engines, I've deployed Harvest or WAIS.
Aside from the GNU license and association with SourceForge, I'm not sure what advantages ht://Dig has over the other free/commercial indexing products. Perhaps somebody has a comparison page?
-
Re:Looking for a good internal search engine
try ht://Dig. It's free and works with *nix. Info about pdf indexing is here: http://www.htdig.org/FAQ.html#q4.9
It's a good solution for a small to medium sized website. If you run Linux, it might be on your install CD's, or might be installed already.
-
Re:Looking for a good internal search engine
Try htDig. It does all these things and is free software. I used it on a corporate intranet in the past. Not as good as Google, but you can't argue with the price.
-
Cheaper to beef up...
... the ht://dig search engine.
In this climate of IT layoffs, I reckon it would prove cheaper and better to hire a programmer to take the GPL'ed ht://dig code and hack in some Google-like improvements.
The major improvement needed is the ability to search on phrases, and to do boolean searches.
Such a beefed up search/indexing system would not be subject to licensing fees, and would be freely redistributable (say, to other company offices).
-
Ouch. Try HTDIG.
Yes, quite CLEARLY it's only for those who've got some cash to blow. If you've got a modest-sized Intranet site, I would highly recommend htDig. I've installed and configured it in several places and it works like a charm. Best of all, it's GPLed! Sure, it doesn't have all the fancy matching algorithms used by Google, but it does a damned good job nonetheless.
-
ht://Dig is what you are looking for
We index dozens of gigs of txt, html, pdf, xls, doc and ps. Not 100% of the documents are indexed but it's a parser problem with some of the files (a few pdf, xls doc and ps seem to make their parser choke).
And besides being flexible, ht://Dig is fast.
4.9. How do I index PDF files?
http://www.htdig.org/FAQ.html#q4.9
-
Re:So what the hell is Altavista going to do?
What concerns me most about this is the effect it could have on excellent free search engine programs like ht://dig and swish.
Given that using one of the big search engines to index content or buying the engine for your own site has become a market niche, it isn't hard to imagine AV's next target being the elimination of free products using their <sarcasm>oh so unique and never before thought of search techniques</sarcasm>
-
Usefulness is important
Of course it's important to be 'accessible', but beyond that, make it _useful_. An exceptionally bad design, for example, is the New Jersey Transit homepage. I can't see a complete line schedule for the trains, and all of the bus schedules are scanned PDFs (meaning I can't text search anything).
A slightly better example is the Texas state homepage. There's lots of information available about laws and whatnot, but unfortunately none of it is searchable. On the state legislation page there is (as far as I can tell) a complete legislation listing, but none of it has been indexed.
If I could make one suggestion, it would be this: Include a search capability.
-
ht://Dig may work
-
Re:Anyone have experience with it?
http://www.htdig.org/ is a GPL'd search engine that will crawl your site. It can go a certain depth, or start from any given page (like a site index page). We use it at saintjoe.edu and it works wonderfully for everything we need.
We have the indexer running on a cron job twice a week during the middle of the night. It does kinda screw up webalizer results, but you can work around that.
There's also one called glimpse, but my experience with that a few years ago showed it to not be as useful as htdig. Things might have changed, though, and YMMV.
-
Re:Anyone have experience with it?
I'm obviously a bit biased, but there *are* strong, open-sourced search engines. Try ht://Dig for example www.htdig.org or if you don't like that, you should check out the excellent SearchTools.com website. Cheers, -Geoff
-
Make wild claims; get free /. publicity
Need a free search engine? Try ht://dig. It's been around awhile, and is stable and highly configurable. It includes a spider, but is more suitable for medium sized collections, not the whole Web.
Examination of their ftp distribution site reveals this is an early work in progress...most docs are "under construction," and even their helpers.txt (supposedly giving credit to others) is basically empty.
I'll post more if/when their src tarball ever finishes downloading (54M - whew!...and the site is getting /.'ed right now). My guess is they drew heavily from ht://dig, WAIS, SMART and other public-source search engines and spiders.
For those who can't get through to the site: they hope to sell subscriptions to their database, so that you can run their search engine internally. It's not clear whether they intend to license the spider/crawler or just the database.
Meanwhile, to those who have complained that easy searches turn up with nil results: read the page, dudes! It says clearly that you're searching a minimal test collection, but can search the whole thing (on your local system, seems like) for a subscription fee.
Credibility break: I'm an information science professor and design/evaluate alternate information retrieval systems.