Open Source Search Engine Benchmarks
Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
Nothing else to say, really
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
I may have to poke around in the Lucene code after work tonight to figure out what kind of strange majick those Apache developers employ. Hopefully I'll walk away with some extra spells in my bag.
My work here is dung.
It was a foregone conclusion that lucene would trounce the others, if you ask me. And comparing sqlite vs lucene is slightly absurd, since most people with a clue already use lucene on top of sqlite (and mysql as well) to get good search results.
Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).
C++ and C both fail to deliver the same level of performance as the Java virtual machine.
Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?
But hell this is Slashdot and Java is Slooooooow...
An Eye for an Eye will make the whole world blind - Gandhi
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats?
Is it really that big a surprise? Given that some of the largest, most information-heavy sites on the Internet (e.g. Wikipedia) use it for their internal search?
Meh, look at any /. article about Java and you'll see somebody complain about the speed of Java, and a reply explaining that Java isn't particularly slow. It has some weaknesses that mean it isn't as optimal as really good C, but it also has some capacity for dynamic optimisation which can make it faster than poorly optimised C. Regardless, in a DB-type application, a lot of your time will be spent in vendor-supplied code, whether that is disk access supplied by the OS or functions available as part of the language's standard library. A lot of what actually runs in this type of app isn't guaranteed to be written in the same language as the app.
Also, most of the Java code you run across in real life is crap. That's not a dig at the language itself. IMO, it's the volume of poor coders that gives Java a reputation for slowness more than anything else. You probably won't find any secret double ninja techniques in Lucene as much as you will just find relatively few embarrassing fuckups.
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Lucene is a great search tool. As TFA pointed out, however, if you're looking for a "search solution" rather than "search engine" then you should check out Solr instead. Lucene is a toolkit that you build on top of, not something you really want to deploy by itself. Solr is that thing built on top of Lucene.
Be aware that while Lucene/Solr has made terrific progress, it is not quite in the "enterprise search" category. For superscale implementations you'll still likely need to look at a high-priced product like FAST.
This isn't really surprising to me. Disk I/O is the slowdown for almost all programs, so efficient disk access is more important than the application code, no matter how it is written. OTOH, a well designed system that minimizes wasteful I/O will do very well - even if it is written in, cough, java.
Way to go Apache guys!
BTW, I use Lucene on our document management system. It works well enough, but definitely eats more RAM than I'd like. Did anyone look at the RAM trade-off?
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Of course you are, fool! Everyone else on slashdot knows exactly how Lucene and sqlite's indexing systems work. I don't know why they bothered to take the benchmarks at all, anyone with half a clue has integrated a Java engine running Lucene into sqlite and hooked it into MyISAM already..
// MD_Update(&m,buf,j);
...have used it on several projects and always gotten good results. Setting it up is easy and the Ruby API is solid, although I needed a tiny bit of additional code for special character escaping. Highly recommended!
Oh wait - seems TFA is saying a lot of sites just use an SQL DB and use LIKE '%FOO%' as a "search engine"...
Ok, this is reasonable, however, I don't see why anyone would choose SQLite as a benchmark. If you are trying to compare search engines, and consider an RDBMS to be a 'search engine' category, then you at least need to include 4 or 5 of the most popular open source RDBMSs in the benchmark (SQLite, PostgreSQL, MySQL, Derby, Firebird), not just one.
The open source search engines are being measured by an open source benchmark. Must be a conspiracy. I want to see proprietary benchmarks measuring these. I'm sure M$'s Bing would be the best.
All the other search engines except lucene are written in C/C++. Why didn't Vik Singh also test CLucene (http://sourceforge.net/projects/clucene/)?
Here is CLucene's description on SourceForge: "CLucene is a C++ port of Lucene: the high-performance, full-featured text search engine written in Java. CLucene is faster than lucene as it is written in C++."
why index something as useless as twitter?
No, they are what keep Livejournals from becoming Deadjournals.
Does anybody know? That'd be a great comparison.
I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
In the "benchmark," it wasn't just impressive in those areas: it had the lowest search time, the smallest index, and the highest relevance. That makes top honors, in my book.
Put identity in the browser.
Please, can we avoid the "Java vs C/C++" thread again?
the lucene based nutch has been a big help to our group. we currently index 60 sites across the company, dive through PDF files and even shockwave flash and powerpoint with ease. the searches are extremely fast and the results are so accurate they've blown our corporate engine completely out of the water.
Good people go to bed earlier.
Yeah and look at the memory stats. It uses nearly twice the memory of the next one down and more than 6 times the memory of the best*. I don't imagine it gets better over a long period of time either. I see that time and again, long running Java processes are no good.
* With that said, SQLite needs a lot of tweaking and I can tell from the memory usage that they didn't tweak it much if at all. That pretty much invalidates SQLite's results in these tests.
Far more likely to be because of the choice of algorithms and the resources behind the project. Would be interesting to see how CLucene performs.
But Wikipedia's internal search is the suckiest thing that ever sucked! Seriously, does anyone use it, instead of just sticking "wikipedia" into their Google search?
SQLite, Oracle, MySQL, PostgreSQL, etc. all have full-text indexing engines as part of the RDBMS, or as add-on packages. From TFA "I had some text issues with sqlite (also needs to be recompiled with FTS3 enabled) ...". According to this, it uses Full Text Search 3 (FTS3) as its text indexing engine. They all parse the CHAR(N) or CLOB(N) columns into tokens (words), and index those.
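To make the "parse into tokens and index those" step concrete, here is a minimal sketch in plain Python. It is not how FTS3 or any particular RDBMS implements it; the `tokenize`, `build_inverted_index` and `search` helpers and the sample rows are made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

# Hypothetical rows standing in for a CHAR(N)/CLOB(N) column.
docs = {
    1: "Open source search engine benchmarks",
    2: "Full-text search in SQLite with FTS3",
    3: "Indexing medical journals",
}
index = build_inverted_index(docs)
print(sorted(search(index, "search")))
print(sorted(search(index, "sqlite search")))
```

A lookup is then a dictionary access per query word plus a set intersection, instead of a scan over every row.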
The standard SQL predicate "...WHERE columnN LIKE '%FOO%' " cannot use an ordinary index in any RDBMS. That is a non-indexable search inside a CHAR(N) or CLOB(N) string. Only left-anchored LIKE queries can use the index, e.g. "...WHERE columnN LIKE 'Foo%' ".
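You can see the difference for yourself with SQLite's EXPLAIN QUERY PLAN. A rough sketch using Python's bundled sqlite3 module follows; the `docs` table, the index name and the sample rows are made up. The column is declared COLLATE NOCASE because SQLite's LIKE is case-insensitive by default and, as far as I know, will only use an index with a matching collation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: NOCASE collation so the index matches LIKE's default
# case-insensitive comparison.
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT COLLATE NOCASE)"
)
conn.execute("CREATE INDEX idx_docs_title ON docs (title)")
conn.executemany(
    "INSERT INTO docs (title) VALUES (?)",
    [("Foo fighters",), ("Kung Foo",), ("Bar none",)],
)

def query_plan(sql):
    """Return the human-readable detail strings from EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]

# Left-anchored pattern: the planner can turn this into an index range lookup.
prefix_plan = query_plan("SELECT id FROM docs WHERE title LIKE 'Foo%'")
# Leading wildcard: every row has to be examined.
infix_plan = query_plan("SELECT id FROM docs WHERE title LIKE '%Foo%'")

print(prefix_plan)
print(infix_plan)
```

The exact wording of the plan differs between SQLite versions, but the left-anchored query should be reported as a SEARCH using the index and the '%Foo%' query as a SCAN.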
Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.
In theory, there is no difference between theory and practice, in practice there is.
Java can't seem to get past its reputation for being slow - which quite simply is no longer true. Java can match and even exceed the speed of C/C++ implementations. This often seems like an impossible, even outrageous claim to many C/C++ developers. What they fail to see is that Java's HotSpot compiler compiles critical code sections at runtime on the client computer. This has the advantage over C/C++ programs that the compiler has detailed info about the system it's running on and therefore can perform specific optimizations that a C/C++ program - which is compiled only on the developer's PC - can't.
Although they might have full text indexing and searching, databases and search engines/libraries work differently.
E.g. you come to an online DVD shop and search for "Tom Criuse" (hint: misspelled surname). Every decent search engine (including the Lucene library, not sure of the others evaluated here) would yield a result, despite the misspelling. I am not sure whether the database fulltext thing would spit out anything at all. It's simply built to do a different job, that's it.
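For the curious, here is a rough sketch of how a misspelling like that can still match. Lucene's fuzzy queries are based on edit distance, but this is not Lucene's code, just a plain Levenshtein distance in Python; the `fuzzy_lookup` helper, the threshold of 2 and the sample names are my own assumptions.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def fuzzy_lookup(query, titles, max_distance=2):
    """Return the titles within max_distance edits of the query, ignoring case."""
    query = query.lower()
    return [t for t in titles if levenshtein(query, t.lower()) <= max_distance]

# Hypothetical catalogue entries.
names = ["Tom Cruise", "Tom Hanks", "Penelope Cruz"]
print(fuzzy_lookup("Tom Criuse", names))
```

Swapping two adjacent letters counts as two substitutions here, which is why the threshold is 2 rather than 1. Comparing the query against every title is fine for a toy list, but a real engine would not do a linear pass like this.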
Solr/Lucene power a number of sites that would be in the enterprise search category (Apple, Netflix, C-Net). Where I work, we index 5 million docs in Solr/Lucene and serve out millions of search requests a day. It's not google scale, but most people don't need that. The markets where one needs a FAST are dwindling quickly.
I work in a shop that uses FAST, despite pressure from some to move to Solr. As I understand it, Solr can't keep up with the volume of changes we need to make to our data. I'm talking millions of documents of 100+ fields changed per day, with any given change visible to the customer within a short timeframe (10 minutes). Solr can index that much data easily, but it can't keep up with that kind of volume. That's what I've been told anyway.
Last time I had to implement an indexing and searching solution, swish++ was by far the performance winner.
The Web is like Usenet, but
the elephants are untrained.
Nothing else to say, really
Really? Am I the only person that found it interesting that Lucene, the only non C/C++ implementation, gave some pretty impressive stats? I mean, it's written in Java and although it has a slower index time its search time, index size and relevancy are impressive.
Yes, that's pretty much just you. Different algorithms, therefore different performance. Reimplement Lucene in C++, then see what the differences are in terms of speed (and if you care, code size, complexity, etc.). Until then the comparison is totally meaningless.
And gee, what's with the defensive attitude...
DBSight uses Lucene's inverted index, and beats any database based B-tree search. And it's dead simple to use. Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
Solr/Lucene real-time search (or near real-time) is one of its weaker points. I think it could keep up with the updates but making them appear in the index immediately and having the caching still perform can be tricky.
We have one index that's updated every 20 minutes, but it only has about 50k documents, and a combination of Solr cache auto-warming and squid's stale-while-revalidate logic works there.
In another system where updates need to be faster, we had to do some custom work to make it perform: there is an in-memory index for recent changes, an on-disk index of previous changes, and a process for moving from one to the other. Hopefully these improvements will make their way back to Lucene in the future.
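The general shape of that kind of two-tier setup can be sketched in a few lines of Python. This is only an illustration of the idea, not the custom work described above: the `TwoTierIndex` class is hypothetical, a plain dict stands in for the on-disk index, and updates and deletes of existing documents are ignored.

```python
from collections import defaultdict

class TwoTierIndex:
    """Hypothetical sketch: a small 'recent' index in front of a large 'main' index."""

    def __init__(self):
        self.recent = defaultdict(set)  # in-memory index for fresh changes
        self.main = defaultdict(set)    # stands in for the on-disk index

    def add(self, doc_id, text):
        """New documents become searchable immediately via the recent tier."""
        for token in text.lower().split():
            self.recent[token].add(doc_id)

    def search(self, token):
        """Query both tiers and merge the hits."""
        token = token.lower()
        return self.recent.get(token, set()) | self.main.get(token, set())

    def flush(self):
        """Move everything from the recent tier into the main tier."""
        for token, doc_ids in self.recent.items():
            self.main[token] |= doc_ids
        self.recent.clear()

index = TwoTierIndex()
index.add(1, "lucene benchmark")
print(index.search("lucene"))          # served from the recent tier
index.flush()                          # e.g. run periodically by a background job
index.add(2, "lucene near real time")
print(sorted(index.search("lucene")))  # hits merged from both tiers
```

The point of the split is that the big index and whatever caches sit on top of it only change at flush time, while new documents are still visible right away.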
It's no surprise to me. Java has long since been the best technology for all things internet. Streaming servers, forum software, indexing/archiving, Web2.0 sites; it's only several dozen times faster than Ruby or PHP, with similar memory usage. And I'm not talking applets here - I mean the backend. Tomcat is even significantly faster than mod_php or fastCGI with their C backends.
Keep in mind that anything Java based has VM overhead. If they included that in the Lucene graphs, then it performed the best while using about as much memory as sqlite. If they didn't, then it's a bit RAM hungry (add another 30MB), but still performs the best.
I've always been a big advocate of using easy languages for complex software. When I was first learning programming, I opted to create Tetris in Javascript. It took me a few days - about 12 hours - but I did it from scratch, without help! Now I could probably do the same task in Java in 2 hours, but working in an "easy" language certainly does help when the code is almost above your head. It helps you keep a larger part of the project in focus, instead of having to focus on the actual code.
And then there are the gains from when you make a mistake. I'm sure some of you will claim to be perfect - but in C/C++, if you mess up and introduce memory leaks, you have to waste time tracking them down, rather than spending that time optimizing, thinking up new algorithms, etc. Easier languages are so much better for the average programmer, who may think up an impressive algorithm from time to time, but struggles with implementing it in a low level language.
Kind of like... CLucene?
Ah, thank you. So indeed, an implementation of the same algorithm turns out to be _three times_ as fast in C++ as it is in Java (see here).
I wonder if eldavojohn wishes to comment on that?
clucene beats jlucene (or simply Java Lucene) in everything.
http://clucene.wiki.sourceforge.net/Benchmarks
Sticking "wiki" into it usually suffices. :)
Any sufficiently advanced intelligence is indistinguishable from stupidity.