Cassandra Rewritten In C++, Ten Times Faster
urdak writes: At Cassandra Summit opening today, Avi Kivity and Dor Laor (who had previously written KVM and OSv) announced ScyllaDB — an open-source C++ rewrite of Cassandra, the popular NoSQL database. ScyllaDB claims to achieve a whopping 10 times more throughput per node than the original Java code, with sub-millisecond 99%ile latency. They even measured 1 million transactions per second on a single node. The performance of the new code is attributed to writing it in Seastar — a C++ framework for writing complex asynchronous applications with optimal performance on modern hardware.
Because it was written in Seastar
Seriously. WTF?
That is a lie!
I think they mean the C++ port is 10X SLOWER than Java.
Java is faster than C,C++ everyone knows that!
Maybe if they ran the code on a java interpreter, written in java, running on a java interpreter...
More recursive use of java == more speed!
Why slow a system down with all that C++ bloatware?
Almost as fast as native! Maybe even faster for some tasks!
sure
Ah... Crapple will do that Macbooks are much worse though.
This is the trademark reason why Java shouldn't be used in performance sensitive environments in the first place.
As for would it have been any faster if it was written in C or straight ASM, probably not worth chasing down that extra 1%. Generally the justification for straight C or ASM is to remove runtime bloat, and you'd first have to give up using any frameworks to get there.
Just to remind potential programmers: learn C before you learn any other programming language, otherwise you will not understand why your code's performance is terrible.
Or is this another shitty Slashdot headline?
yaaaa... but are they using Lightning Memory-Mapped Database (LMDB) as the back-end? http://developers.slashdot.org... https://en.wikipedia.org/wiki/...
Sure, but is it Web scale?
Sans sarcasm I would've also accepted: "duh"
--- Need web hosting?
I miss the Golden Girls cosmonaut troll personally. And that one guy who went on about host files.
We need a new merged copypasta supertroll. Can somebody get on that?
They also boosted performance by never freeing memory, too!
If you post it, they will read.
I suppose that modern hardware means a desktop, workstation or computational server CPU, aka x86_64 based. I wonder if modern hardware also includes low power or portable CPUs, aka ARM ones. It looks like portability has given way to almost to the metal optimization.
Oracle has just launched a new series of patent infringement lawsuits. Oracle allegations include reverse engineering Java to improve the speed of applications like Cassandra, benchmarking Java without permission. They are seeking an immediate cease and desist order, in addition to immediate financial relief for sustaining PPS (More commonly known as Poopy Pants Syndrome.).
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Merged with Goatse links! The trifecta!
-The wise argue that there are few absolutes, the fool argues that there are no probabilities.
Databases are usually I/O bound, and improving the storage structure or network protocol matters more than spot optimization of code. A more likely statement is that ScyllaDB performed ten times faster than Cassandra in one particular benchmark for which Cassandra has not yet been specifically optimized, and is ten percent faster in the average case.
In either case, good luck maintaining speed and stability after 5 releases when you implement every corner case of every feature and have to deal with legacy support.
I find it depressing that so little attention is paid to efficient computing. People now just throw memory and cycles at problems because they can with passable results. But I wonder how much more we could get out of our machines if software was carefully crafted from bottom to top.
Read a summary of how Cassandra works, and you will see why it can be so much faster, given what you already know (that thing about databases usually having an I/O problem).
# make clean sig
Databases used to be disk bound, sure. But these days we have huge RAM caches and SSDs - no spinning disks. It's very common for the vast majority of requests to be served entirely from cache. Read the guys' site - it looks like they know what they're doing.
Imagine if Redis was ten times slower or ten times faster. It would matter.
i was like seastore star whats it, had to read the post ic spanNpluss+, thought it was timmy
I still don't see why some sort of filter hasn't been implemented. These troll posts are usually just direct copy/pastes, which would be really easy to filter once identified.
herpa derp, IT drone doesn't know how Cassandra works.
Wow, two years ago everyone here told us that NoSQL is evil and tried to convince us that we should stick to MySQL.
Now everyone tells us Java is evil, because a rewrite in C++ is faster.
What a surprise.
If I rewrote Cassandra from scratch, in Java, it would also be faster than the current code.
Why? Because all the learning the original team did over a course of a decade I can reuse and improve on.
Keep in mind, the rewrite uses a new framework and new concepts for concurrency. Concurrency is one of the core areas where computing in future will certainly make lots of progress.
For my part, I'm waiting for a Lucene rewrite, regardless of the language. It is probably the worst OSS code I have ever seen ... actually the worst code, regardless of OSS or closed source.
Cost free eBook I read (by iBook/Kobo/Amazon/ObookO/Gutenberg etc.): "The Green Odyssey" by Philip Jose Farmer.
Really,
a. Was it a 1-to-1 port? Obviously, using Seastar, no.
b. If they rewrote the Java implementation with the latest the language offers and such (like using Seastar), would it be 10x faster? Likely not... likely on par with performance.
Use the tool for the right reason, C/C++ --> optimized compilers, Java--> Robust VM & framework flexibility. Yes, in 3yrs that DB is going to be a mess to maintain... really. But to those IT/Maintenance coders, they will deny that and boast great justifications why it's still better, mainly better that it's them with the $100K+ O&M jobs/contracts.
I will only use MongoDB because it is web scale.
Let me fix that for you: C/C++ --> optimized software, Java --> write once, debug very slowly everywhere.
But but but... That would interfere with his vested interests in java programmer spawn points in India, Pakistan, Vietnam etc....
Think about the database projects that switched to a vectorized engine and claim x10 to x100 improvements. Databases are not necessarily I/O bound.
Check the benchmark that ScyllaDB published. They use a standard cassandra-stress command. I am sure that Cassandra has been heavily optimized to work well with cassandra-stress. In fact, I wouldn't be surprised if it is the most optimized benchmark for Cassandra.
Yes. It's now easy to scale to a million or more IOPS on a single server. That makes the CPU the bottleneck again.
Generally the justification for straight C or ASM is to remove runtime bloat, and you'd first have to give up using any frameworks to get there.
Another is if you have to security audit the result and protect it from attack, as in OSes. C++ can generate stuff that isn't obvious from the local source code - thanks to definitions, overridings, and the like. (Linus makes this point - it's why the Linux kernel is in C and will stay there for the foreseeable future.)
But that shouldn't be enough of an issue here to drop the helpful things the C++ compiler can do for you. (Especially when you're porting from an original in Java: C++ is a good match for a target language, C is not.)
Bantam Dominique roosters crow a four-note song. Once you've heard it as "Happy BIRTHday" you can't NOT hear it that way
From TFS: "The performance of the new code is attributed to writing it in Seastar".
I look forward to this being ignored in favour of more shallow, ignorant Java bashing. After all, it's owned by Oracle now and they are baaaad so it must baaaad too! Baaaad! Baaad baaa baaaaa!
what's the fun in that? Some of us enjoy the interstitial trolls among the more 'serious' posts
This bypasses the kernel networking stack, so it is faster.
No sense...
The real reason is much more nuanced than language differences between C++ and Java. The Seastar network architecture bypasses the kernel TCP/IP stack entirely and instead implements a user-mode TCP/IP stack using DPDK, which lets user-mode code poll the network card's packet buffers directly over memory-mapped I/O. Each instance of the user-mode stack runs on a single core only, but you can run multiple instances on multiple cores. It can scale linearly because there is very little shared state across cores.
C++ with custom network stack vs. Java with traditional network stack is not an apples-to-apples comparison. In theory, you could implement a Java based custom network stack over dpdk as well to make the comparison more fair.
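To make the "very little shared state across cores" point concrete, here is a minimal sketch of the shared-nothing idea in plain standard C++. It is not Seastar's API and does no DPDK or network I/O; `Write`, `Shard`, `shard_for` and `apply_sharded` are made-up names for illustration. Each key is routed to exactly one shard, and each shard's thread touches only its own table, so no locks are needed.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical request type; not taken from Seastar or ScyllaDB.
struct Write {
    std::string key;
    std::string value;
};

// Each shard owns its data outright; no other thread ever touches it.
struct Shard {
    std::vector<Write> inbox;                            // requests routed to this shard
    std::unordered_map<std::string, std::string> table;  // shard-private state
};

// Route a key to exactly one shard, so no two shards ever hold the same key.
std::size_t shard_for(const std::string& key, std::size_t shard_count) {
    assert(shard_count > 0);
    return std::hash<std::string>{}(key) % shard_count;
}

// Partition the writes, then let one thread per shard apply its own inbox.
// The threads have no state in common, so no locking is required.
std::vector<Shard> apply_sharded(const std::vector<Write>& writes,
                                 std::size_t shard_count) {
    std::vector<Shard> shards(shard_count);
    for (const Write& w : writes) {
        shards[shard_for(w.key, shard_count)].inbox.push_back(w);
    }

    std::vector<std::thread> workers;
    workers.reserve(shard_count);
    for (Shard& shard : shards) {
        workers.emplace_back([&shard] {
            for (const Write& w : shard.inbox) {
                shard.table[w.key] = w.value;  // last write to a key wins
            }
        });
    }
    for (std::thread& t : workers) {
        t.join();
    }
    return shards;
}
```

A real system would pin each thread to a core and feed the shards continuously instead of from a prefilled inbox; the sketch only shows why partitioned state removes the need for cross-core synchronization.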
I once had a signature.
Related thread - Why was Cassandra written in Java?
https://www.quora.com/Why-was-Cassandra-written-in-Java
Also, how much easier or more difficult would it be to have a Java/C++ mix? That is, identify the bottlenecks in the Java code and recode only those in C++.
I suspect that during the rewrite they changed algorithms as well. Surely when devs do a rewrite they don't just blindly copy code line by line so that it just works. It is also possible that during the rewrite they fixed some problems that could only be fixed by rewriting huge chunks of code... If they are really saying Java is 10x slower, they are just incompetent, which would make me think twice about using their C++ Cassandra. Nothing against C++, just saying that screwing up with it is easier.
Fast, but is it web-scale?
10x faster in C++? Imagine 100x faster using plain C instead of the overhead of C++
I have seen terrible C++ code, and it takes a long time to get right. The arrogance of "C++ code is fast" is well known. With Golang they could have written the whole thing faster, with similar speed. But even then, I am suspicious of the claim that C++ is 10x faster than Java for this kind of task. The frameworks or the quality of the Java code make a huge difference. I am even tempted to take a look at the Cassandra codebase to see what went wrong. Maybe the problem is that they have not updated to Java 7 or 8 practices, which in my opinion accounts for half of the lost performance. The other half is that people try to get something working correctly and optimise later. But even in that case, I would still try C or D before delving into C++. I really don't buy the whole argument.
Aerospike got a strong competitor now! The main advantage of using C++ instead of Java is the latency
I will use it
Even with SSDs there is still a bus between the CPU and the disk. That bus just returns a successful result faster.
In fact, if you could keep your writes small enough, and didn't worry about the write acknowledgements, you could consider your write done when it hit the spinning disk's cache. It is not a suitable way of writing for most databases, but if you are effectively using journaling techniques with small writes, you can get near-bus performance at the cost of three writes per actual write.
The CPU is rarely the bottleneck anymore. It's the bus, which is mostly being asked for RAM access. We've let our programs grow so big that they cannot be easily cached, and so we have lousy cache performance. That's because everyone's in a rush to write performance code in the first pass, so we get a lot of crap loaded that's completely unnecessary, but takes up enough RAM that you overflow your cache.
A 1-to-1 port does not mean what you think it does.
You think it means mapping every method to the same written in a different language. That's just crazy.
What they have done is create the same functional beast as Cassandra, using whatever programming methods and libraries their chosen language (which just happens to be C++) has. Hence a 1-to-1 reimplementation of the project.
It's pointless to think that a C++ port should implement a Java string class when it would use its own. The methods would be different, and various bits of code would therefore differ in how they operate on strings. That's just a simplistic example.
I've not seen the C++ code, but there's no reason to think it's unmaintainable, any more than there is to say the Java code was unmaintainable.
Seastar looks like it might be a useful library, but the documentation I can find on the web site seems a little thin. Any suggestions for code samples/documentation besides the distribution?
So many excuses and so much snark in this thread.
If you think your superior intellect can speed up the Java version to equal this C++ version, then prove it. Until then, STFU.
Right now, the only proof we have is that the C++ version is ~10 times faster than the Java version. This is unsurprising to many people, but your wishing it wasn't so doesn't change the reality.
The similarity of the name 'Seastar' to Connection Machines' dataparallel programming language C* can't be an accident. But C* needed to run in shared memory or at least atomic synchrony on low latency distributed memory in order to preserve consistency. And of course, it needed SIMD algorithms (do the same op concurrently on a large pool of data) or it could add no value over using C.
Sounds like a misnomer to me.
The purpose of Cassandra is to optimize durable writes to disk. If you don't care about occasional data loss or your main concern is read performance, it's probably not the right tool. If you want correct results, the node you are talking to has to wait for data hash from at least one other node, and these network hops can not be THAT fast, even if everything is memory.
Why not put a nice, optimized C, memcached instance in front of your cluster for the cases where you don't care about durability or consistency?
But no one believed me.
"Win treats sysadmins better than users. Mac treats users better than sysadmins. Linux treats everyone like sysadmins."
In 2007, there were no appropriate frameworks or other multitasking facilities within C++. The optimisation was the pits.
Now, 8 years beyond, one can use C++, along with templates and all the other goodies, provided.....
provided that Bjarne Stroustrup's latest standards for coding clean C++ are followed. Bjarne proposes methods to follow, in lieu of "don't use arguments".
That means I may actually start using C++ again.
Leslie Satenstein Montreal Quebec Canada
No. "Lean" does not mean "fast" (nor does it mean "slow.") Lean means low-ish byte count. Fast means fast. Slow means slow. Lean means a smaller executable/dataset. There are many instances where a lean program will be slower than a heavyweight one. For instance, a precalculated table of complex formula results typically takes much more space than the actual calculation. The precalculating program will be faster, but less lean, than the one that does the calculations every time they are needed. Further, this kind of thing can be an excellent use of resources if higher speed is desirable, which it usually is -- because while memory has become relatively inexpensive, CPU speed remains a scarce and hard-limited resource. Virtual or otherwise.
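The table-versus-calculation trade-off above can be sketched like this. It is only an illustration: the one-degree resolution and the names `sine_computed`, `sine_table` and `sine_lookup` are assumptions, and sine stands in for whatever "complex formula" is being precalculated.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kSteps = 360;  // one entry per whole degree (assumed resolution)
constexpr double kPi = 3.14159265358979323846;

// "Lean" version: no table, the formula is evaluated on every call.
double sine_computed(std::size_t degrees) {
    return std::sin(static_cast<double>(degrees % kSteps) * kPi / 180.0);
}

// Heavier version: spend 360 doubles of memory once, on first use.
const std::array<double, kSteps>& sine_table() {
    static const std::array<double, kSteps> table = [] {
        std::array<double, kSteps> t{};
        for (std::size_t d = 0; d < kSteps; ++d) {
            t[d] = std::sin(static_cast<double>(d) * kPi / 180.0);
        }
        return t;
    }();
    return table;
}

// After the table exists, each call is a single array read.
double sine_lookup(std::size_t degrees) {
    return sine_table()[degrees % kSteps];
}
```

Whether the lookup actually wins depends on how expensive the formula is and on whether the table stays in cache, so it is worth measuring rather than assuming.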
I've fallen off your lawn, and I can't get up.
NASDAQ's big Wall Street stock exchange system, called INET, with sub-100-microsecond latency and extreme throughput, is written entirely in Java. NASDAQ claims INET is the fastest in the world. All the world's fastest stock exchanges are written in either Java or C++. High Frequency Traders go to the fastest stock exchange, and there is a lot of money involved. If C++ were faster, everybody would ditch Java in favor of C++. The SECRET NASDAQ uses to get fast Java is not to use real-time Java. No, NASDAQ shuts off the Garbage Collector. In effect, the NASDAQ INET stock exchange system preallocates a lot of objects that are constantly reused. That way GC is never triggered and Java speed can rival C++.
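The preallocate-and-reuse trick is not Java-specific. Below is a rough sketch of the same idea, written in C++ to match the rest of the thread. It is not NASDAQ's code; `Order` and `OrderPool` are invented for illustration. In Java the payoff is that the garbage collector has nothing to collect; in C++ the payoff is that the hot path never calls the allocator.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical order record; the fields are illustrative only.
struct Order {
    long id = 0;
    long price = 0;
    long quantity = 0;
};

// Fixed-size pool: every Order is allocated once, at construction.
class OrderPool {
public:
    explicit OrderPool(std::size_t capacity) : storage_(capacity) {
        free_.reserve(capacity);
        for (Order& o : storage_) {
            free_.push_back(&o);
        }
    }

    // Copying would leave free_ pointing into the other pool's storage.
    OrderPool(const OrderPool&) = delete;
    OrderPool& operator=(const OrderPool&) = delete;

    // Hand out a recycled object, or nullptr when the pool is exhausted.
    Order* acquire() {
        if (free_.empty()) return nullptr;
        Order* o = free_.back();
        free_.pop_back();
        return o;
    }

    // Reset the object and make it available again; nothing is freed.
    void release(Order* o) {
        assert(o != nullptr);
        *o = Order{};
        free_.push_back(o);
    }

    std::size_t available() const { return free_.size(); }

private:
    std::vector<Order> storage_;  // allocated once, never resized
    std::vector<Order*> free_;    // objects currently not in use
};
```

The cost is that the pool must be sized for peak load up front, and callers have to remember to release objects, which is exactly the manual bookkeeping a garbage collector normally removes.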
In addition, adaptive optimizing Java compilers can in theory be faster than C++. When you compile C++, you typically target only a least-common-denominator CPU with no special assembler instructions. There is not much optimizing going on, because the code needs to run on all CPUs, including old ones without support for, say, vector instructions. So there is not much special optimizing going on in C++ compilers when you finally release your C++ binary. OTOH, Java can examine the CPU and turn on vector instructions - C++ can not do that. Sure, you can release several C++ binaries targeting different CPUs, or provide different code paths in the same binary, so that the code path on a modern CPU uses vector instructions.
In addition to this optimization, Java can also optimize like this (which C++ can not do): if your code path handles a lot of objects of one subclass for the first hour and another subclass the next hour, Java can adapt and optimize according to subclass X or subclass Y. C++ can not do that at run time. C++ has static optimization; Java has dynamic optimization and adapts almost in real time depending on the different types of objects. So this is at least one optimization where Java is better. And in theory, adaptive JIT is faster. Compiler theory advances, so it is just a matter of time before JIT is faster than static binaries.
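For what it's worth, the "different code paths" option mentioned above looks roughly like this in practice. This is a sketch that assumes GCC or Clang on x86, where the `__builtin_cpu_supports` builtin and the `target` function attribute are available; on any other toolchain it falls back to the portable path. The names `sum_scalar`, `sum_avx2` and `pick_sum` are made up.

```cpp
#include <cassert>
#include <cstddef>

using SumFn = long (*)(const int*, std::size_t);

// Portable baseline: compiled for the lowest-common-denominator CPU.
long sum_scalar(const int* data, std::size_t n) {
    assert(data != nullptr || n == 0);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))

// Same loop, but the compiler is allowed to use AVX2 inside this one
// function. Whether it actually vectorizes depends on optimization flags.
__attribute__((target("avx2")))
long sum_avx2(const int* data, std::size_t n) {
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += data[i];
    return total;
}

// Ask the CPU at run time which path is safe to call.
SumFn pick_sum() {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? sum_avx2 : sum_scalar;
}

#else

// Assumption: on other compilers or architectures, only the baseline exists.
SumFn pick_sum() { return sum_scalar; }

#endif
```

A caller would do `SumFn sum = pick_sum();` once at startup and use `sum` from then on. This covers CPU feature detection only; it does nothing for the profile-driven adaptation that a JIT can do while the program runs.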
In fact, I know that several High Frequency Trading firms obsessed with speed, are using Java. /Former NASDAQ employee
I learned Java way back in the Java 5 days. Obviously things have changed since then (and I was much less experienced). What examples are there of "the right way" Java applications?
For instance, is there a performant HTTP server/proxy that keeps pace with something like nginx whose source I could browse to see the state of the art?
-Bucky
Looking at the big picture, I am paid to write software for a manufacturing company. Everyone here is talking about execution speed, but how about the time it takes to develop software? That cost is what concerns the people paying you. The language you use should be the best fit for life-cycle development in the particular situation. I started in assembly in 1980 and have progressed to C++ and Java OO now. For instance, I use C or C++ for low-level hardware (microcontrollers and data collection devices) and Java for enterprise-level applications over the Web. Many times C libraries can be used from Java with JNI or JNA, and you get the best of both worlds. Although I prefer C++ personally, if I can get the job done quicker in Java or another language, I will.