Open Source Search Engine Benchmarks
Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
Nothing else to say, really
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
I may have to poke around in the Lucene code after work tonight to figure out what kind of strange majick those Apache developers employ. Hopefully I'll walk away with some extra spells in my bag.
My work here is dung.
Okay so the fastest engine is using Lucerne, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).
C++ and C both fail to deliver the same level of performance as the Java virtual machine.
Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?
But hell this is Slashdot and Java is Slooooooow...
An Eye for an Eye will make the whole world blind - Gandhi
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats?
Is it really that big a surprise? Given that some of the largest, most information-heavy sites on the Internet (e.g. Wikipedia) use it for their internal search?
Meh, look at any /. article about Java and you'll see somebody complain about the speed of Java, and a reply explaining that Java isn't particularly slow. It has some weaknesses that mean it isn't as optimal as really good C, but it also has some capacity for dynamic optimisation which can make it faster than poorly optimised C. Regardless in a DB type application, a lot of your time will be spent in vendor supplied code. Whether that is disk access supplied by the OS or some functions available as part of the language standard library. A lot of actually runs this type of app isn't particularly guaranteed to be written in the same language as the app.
Also, most of the Java code you run across in real life is crap. That's not a dig at the language itself. IMO, it's the volume of poor coders that give Java a reputation for slowness more than anything else. You probably won't find any secret double ninja techniques in Lucene as much as you will just find relatively few embarrassing fuckups.
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Lucene is a great search tool. As TFA pointed out, however, if you're looking for a "search solution" rather than "search engine" then you should check out Solr instead. Lucene is a toolkit that you build on top of, not something you really want to deploy by itself. Solr is that thing built on top of Lucene.
Be aware that while Lucene/Solr has made terrific progress, it is not quite in the "enterprise search" category. For superscale implementations you'll still likely need to look at a high-priced product like FAST.
Oh wait - seems TFA is saying a lot of sites just use an SQL DB and use like '%FOO%' as a "search engine....
Ok, this is reasonable, however, I don't see why anyone would choose sqllite as a benchmark. If you are trying to compare search engines, and consider an RDBMS to be a 'search engine' category, then you at least need to include 4 or 5 of the most popular open source RDBMSs in the benchmark (SQL lite, POstgreSQL, MySQL, Derby, Firebird), not just one.
All the other search engines except lucene are written in C/C++. Why didn't Vik Singh test also CLucene (http://sourceforge.net/projects/clucene/)?
Here is the CLucene's description on SourceForce: "CLucene is a C++ port of Lucene: the high-performance, full-featured text search engine written in Java. CLucene is faster than lucene as it is written in C++."
But Wikipedia's internal search is the suckiest thing that ever sucked! Seriously, does anyone use it, instead of just sticking "wikipedia" into their Google search?
Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucne and serve out millions of search requests a day. It's not google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.
In theory, there is no difference between theory and practice, in practice there is.
Last time I had to implement an indexing and searching solution, swish++ was by far the performance winner.
The Web is like Usenet, but
the elephants are untrained.
Solr/Lucene real-time search (or near real-time) is one of its weaker points. I think it could keep up with the updates but making them appear in the index immediately and having the caching still perform can be tricky.
We have one index with that's updated every 20 minutes, but only has about 50k documents and a combination of Solr cache auto-warming and squid's stale-while-re-validate logic works there.
In another system where updates need to be faster, we had to do some custom work to make it perform where there is an in memory index for recent changes, an on-disk index of previous changes, and process for moving from one to another. Hopefully these improvements will make their way back to Lucene in the future.
In theory, there is no difference between theory and practice, in practice there is.
Kind of like... CLucene?
Ah, thank you. So indeed, an implementation of the same algorithm turns out to be _three times_ as fast in C++ than it is in Java (see here).
I wonder if eldavojohn wishes to comment on that?