Open Source Search Engine Benchmarks
Sean Fargo writes "This article has benchmarks for the latest versions of Lucene, Xapian, zettair, sqlite, and sphinx. It tests them by indexing Twitter and Medical Journals, providing comparative system stats and relevancy scores. All the benchmark code is open source."
yeah. this is boring.
Nothing else to say, really
Are those anything like Livejournals?
Okay so the fastest engine is using Lucene, a Java search engine, and this is neither tuned nor horizontally scaled (which it can do very well).
C++ and C both fail to deliver the same level of performance as the Java virtual machine.
Oh wait hang on... does this mean that for complex applications the most important performance piece is normally actually the efficiency of the code rather than the efficiency of the base platform and therefore having a language in which it is easier to write efficient code is better than just having the one that is fastest to execute a for loop?
But hell this is Slashdot and Java is Slooooooow...
An Eye for an Eye will make the whole world blind - Gandhi
This isn't really surprising to me. Disk I/O is the slowdown for almost all programs, so efficient disk access is more important than the application code, no matter how it is written. OTOH, a well designed system that minimizes wasteful I/O will do very well - even if it is written in, cough, java.
Way to go Apache guys!
BTW, I use Lucene on our document management system. It works well enough, but definitely eats more RAM than I'd like. Did anyone look at the RAM trade-off?
...have used it on several projects and always gotten good results. Setting it up is easy and the Ruby API is solid, although I needed a tiny bit of additional code for special character escaping. Highly recommended!
The Army reading list
Oh wait - seems TFA is saying a lot of sites just use an SQL DB and use LIKE '%FOO%' as a "search engine"...
Ok, this is reasonable; however, I don't see why anyone would choose SQLite as a benchmark. If you are trying to compare search engines, and consider an RDBMS to be a 'search engine' category, then you at least need to include 4 or 5 of the most popular open source RDBMSs in the benchmark (SQLite, PostgreSQL, MySQL, Derby, Firebird), not just one.
The open source search engines are being measured by an open source benchmark. Must be a conspiracy. I want to see proprietary benchmarks measuring these. I'm sure M$'s Bing would be the best.
All the other search engines except Lucene are written in C/C++. Why didn't Vik Singh also test CLucene (http://sourceforge.net/projects/clucene/)?
Here is CLucene's description on SourceForge: "CLucene is a C++ port of Lucene: the high-performance, full-featured text search engine written in Java. CLucene is faster than lucene as it is written in C++."
why index something as useless as twitter?
Does anybody know? That'd be a great comparison.
Please, can we avoid the "Java vs C/C++" thread again?
the lucene based nutch has been a big help to our group. we currently index 60 sites across the company, dive through PDF files and even shockwave flash and powerpoint with ease. the searches are extremely fast and the results are so accurate they've blown our corporate engine completely out of the water.
Good people go to bed earlier.
SQLite, Oracle, MySQL, PostgreSQL, etc. all have full-text indexing engines as part of the RDBMS, or as add-on packages. From TFA "I had some text issues with sqlite (also needs to be recompiled with FTS3 enabled) ...". According to this, it uses Full Text Search 3 (FTS3) as its text indexing engine. They all parse the CHAR(N) or CLOB(N) columns into tokens (words), and index those.
The standard SQL predicate "...WHERE columnN LIKE '%FOO%' " cannot use an ordinary (B-tree) index in any RDBMS. That is a non-indexable CHAR(N) or CLOB(N) search inside a string. Only left-anchored LIKE queries can use the index, "...WHERE columnN LIKE 'Foo%' ".
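To make the LIKE point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name `docs`, the index name `idx_title`, and the sample rows are made up for illustration. It assumes SQLite's default case-insensitive LIKE, which is why the column is declared with the NOCASE collation; the exact plan text varies between SQLite versions, so treat the printed output as indicative only.

```python
import sqlite3

# In-memory database; table, index and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
# NOCASE collation lets SQLite's default (case-insensitive) LIKE use the index.
conn.execute("CREATE TABLE docs (title TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_title ON docs(title)")
conn.executemany(
    "INSERT INTO docs VALUES (?)",
    [("Foo fighters",), ("Kung Foo",), ("Bar",)],
)

def plan(sql):
    """Return the human-readable query plan SQLite reports for a statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)

def titles(sql):
    """Run a query and return the matching titles in sorted order."""
    return sorted(row[0] for row in conn.execute(sql))

PREFIX_SQL = "SELECT title FROM docs WHERE title LIKE 'Foo%'"
INFIX_SQL = "SELECT title FROM docs WHERE title LIKE '%Foo%'"

prefix_plan = plan(PREFIX_SQL)   # left-anchored: can seek into the index
infix_plan = plan(INFIX_SQL)     # leading wildcard: has to scan every row

print("prefix:", prefix_plan, titles(PREFIX_SQL))
print("infix: ", infix_plan, titles(INFIX_SQL))
```

On the SQLite builds I am aware of, the left-anchored query is reported as a SEARCH (an index range lookup) and the '%Foo%' query as a SCAN, which is the difference the comment above describes.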
Java can't seem to get past its reputation for being slow - which quite simply is no longer true. Java can match and even exceed the speed of C/C++ implementations. This often seems like an impossible, even outrageous claim to many C/C++ developers. What they fail to see is that Java's HotSpot compiler compiles critical code sections at runtime on the client computer. This has the advantage over C/C++ programs that the compiler has detailed info about the system it's running on and therefore can perform specific optimizations that a C/C++ program (which is compiled only on the developer's PC) can't.
Although they might have full text indexing and searching, databases and search engines/libraries work differently.
E.g. you come to an online DVD shop and search for "Tom Criuse" (hint: misspelled surname). Every decent search engine (including the Lucene library, not sure of the others evaluated here) would yield a result, despite the misspelling. I am not sure whether a database full-text feature would spit out anything at all. It's simply built to do a different job, that's it.
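A rough sketch of the misspelling case, in Python. It uses the standard library's difflib similarity ratio as a stand-in for fuzzy matching; this is not how Lucene implements it, and the name list and the 0.6 cutoff are assumptions chosen for the example.

```python
import difflib

# Hypothetical catalogue of names; not taken from the benchmark data.
names = ["Tom Cruise", "Tom Hanks", "Penelope Cruz"]
query = "Tom Criuse"  # misspelled on purpose, as in the comment above

# Exact substring matching, roughly what LIKE '%...%' would do.
exact_hits = [name for name in names if query in name]

# Approximate matching: keep names whose similarity ratio is at least 0.6.
fuzzy_hits = difflib.get_close_matches(query, names, n=1, cutoff=0.6)

print("exact:", exact_hits)
print("fuzzy:", fuzzy_hits)
```

The exact match comes back empty, while the similarity-based lookup still finds the intended name. A real engine would do this against its term index rather than by comparing the query to every stored string.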
Last time I had to implement an indexing and searching solution, swish++ was by far the performance winner.
The Web is like Usenet, but
the elephants are untrained.
DBSight uses Lucene's inverted index, and beats any database based B-tree search. And it's dead simple to use. Lucene Database Search in 3 minutes: http://wiki.dbsight.com/index.php?title=Create_Lucene_Database_Search_in_3_minutes
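For anyone wondering what an inverted index actually is, here is a toy sketch in Python. It is not Lucene's or DBSight's implementation; the sample documents, the `tokenize` and `search` helpers, and the AND-only query semantics are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical documents; the position in the list is the document id.
documents = [
    "Lucene is a full-text search library",
    "SQLite is an embedded database",
    "Sphinx and Lucene both build an inverted index",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

# Inverted index: token -> set of ids of the documents containing it.
index = defaultdict(set)
for doc_id, text in enumerate(documents):
    for token in tokenize(text):
        index[token].add(doc_id)

def search(query):
    """Return ids of documents that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

print(search("lucene"))
print(search("lucene index"))
```

Each lookup touches only the posting sets for the query's words instead of scanning the text of every row, which is the basic reason a term index suits keyword search better than matching strings row by row.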
clucene beats jlucene (or simply Java Lucene) in everything.
http://clucene.wiki.sourceforge.net/Benchmarks