Java Performance under Linux

Posted by emmett on Wednesday January 19, 2000 @03:12AM from the waiting-for-guarana dept.

krshultz writes "IBM has posted a great technical article on Java performance on its DeveloperWorks site. I learned a lot about Java and Linux in general." This is a nice big well-indexed article. Go.

Interesting... by jd · 2000-01-18 22:30 · Score: 5

This is not the first time someone's commented on the Linux scheduler. There have been unofficial patches for it for some time, and there have been more than a few complaints as to the way it operates.
There seem to be three directions people want to go with the scheduler - coarse-grain, fine-grain and real-time. Instead of arguing which is "best", why don't the developers do what they've always done in the past - put the stuff in, and use menu options to let people choose! If one (or two) of the options turn out to be really redundant, back them out! Nothing's lost but a few cycles of human time. And it's better spent with code than with a flame-thrower.

--
It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Wait a minute by aheitner · 2000-01-18 23:09 · Score: 5

I've got a fundamental disconnect here ...

Okay, the Linux scheduler is slower than it could be. It is taking up "up to 20% of CPU cycles" in the very process-intensive (given that native threads are no lighter weight than processes) benchmark, 400-2000 processes.

But there's a more fundamental problem: a 20% speedup isn't significant. I'm not saying we should abandon all speedups that don't affect asymptotic complexity; I'm just saying that I'm looking for speedups of at least 2x-10x before I'm impressed with anything. 20% is small stuff.

There's a bigger issue here: this many processes will never be fast. The cost of a context switch is high given current processor designs, and is not likely to get lower. Even assuming that on a thread switch, since you're dealing with the same data as the previous thread was using, the TLB and code/data caches remain useful (on a process switch in general they don't, and refilling the caches is very expensive), you still have to store a whole bunch of stuff to memory for the old thread and bring a whole bunch of stuff out of memory for the new thread. And you've got to leave userland for a bit to do that. Slow slow slow slow.

It seems to me that in general we need to reconsider the approach of relying on the operating system to schedule and share resources (in the case of chatservers and ftpservers and especially webservers, where we see the real performance hits for massive thread/process expenses). Right now all this stuff is based on the Berkeley sockets API, a high level network API (i.e. one that doesn't at all consider what the transport will be). This has been a tremendously successful API; it's used on all platforms (well I can't speak for sure for Mac :) and it can be reasonably argued that Berkeley sockets paved the way for the Internet.

But the fact remains that your ethernet card is fundamentally a serial device. I have to wonder if it wouldn't be possible to write a webserver which does know about the transport for a change, and which could, in only one process, sit there putting packets onto the wire at a level much closer to the hardware, and therefore save a lot of expense in making the operating system arbitrate all these zillions of threads that want to share the connection.

It would be an interesting project to say the least.
More on thread mappings by JohnZed · 2000-01-18 23:47 · Score: 5

Interestingly enough, a heated thread on a related topic cropped up in the kernel-dev mailing list the other week. Check out Kernel Traffic for the details, but basically it had to do with some SGI engineers who wanted to make a change in a threading mechanism to facilitate 3D graphics performance on Linux. Linus explained that he felt their method was, basically, an unmaintainable, inelegant hack that has crept its way into Irix for marketing purposes but will never be in the Linux kernel.
The relevant thing in relation to the IBM article is Linus' discussion of the philosophy of fork() and how strongly committed he is to this model. He's stated quite often, in fact, that this thread scheduling mechanism (which schedules threads as separate processes) is a very intentional part of the kernel design.
Personally, I think this opinion will pretty much have to change over time when people are able to demonstrate very elegant patches for the many-to-many threading model discussed in the IBM article. In fact, if I remember correctly, this is the sort of threading model that TowerJ uses in their native Java compilation system to achieve such great scalability on Linux. You can find plenty of examples of in-process scheduling code if you're interested in checking it out: GNU portable threads is the first one that comes to mind, but almost every Java implementation offers this model as an option (green threads). The method IBM is talking about combines this intra-process tactic with the current inter-process scheduler.
It just makes sense that if you have 10,000 processes in a queue and you have to recompute goodness for each every time you enter the schedule, this will be a less scalable approach than if you'd created 100 processes with 100 threads each, so that thread_goodness only needs to be computed when that particular process is entered. Think about the management of a large corporation: does the top management allocate resources, set timetables, and otherwise schedule every single employee? No, they schedule a number of departments and projects, then the next level of managers schedules each of the employees within those.
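To put rough numbers on the 10,000 vs. 100 x 100 comparison, here is a toy sketch that only counts goodness evaluations per scheduling decision. It is not the kernel's code: the `Thread` and `Process` classes, the helper names, and the made-up goodness values are all hypothetical, and it assumes each process carries a single goodness value of its own.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Thread:
    name: str
    goodness: int  # made-up stand-in for a per-thread goodness value


@dataclass
class Process:
    name: str
    goodness: int  # made-up stand-in for a per-process goodness value
    threads: List[Thread] = field(default_factory=list)


def schedule_flat(threads: List[Thread]) -> Tuple[Thread, int]:
    """One run queue: every thread is examined on every scheduling decision."""
    if not threads:
        raise ValueError("nothing to schedule")
    evaluations = 0
    best = threads[0]
    for t in threads:
        evaluations += 1
        if t.goodness > best.goodness:
            best = t
    return best, evaluations


def schedule_two_level(processes: List[Process]) -> Tuple[Thread, int]:
    """Pick the best process first, then the best thread inside it only."""
    if not processes:
        raise ValueError("nothing to schedule")
    evaluations = 0
    best_proc = processes[0]
    for p in processes:
        evaluations += 1
        if p.goodness > best_proc.goodness:
            best_proc = p
    best_thread = best_proc.threads[0]
    for t in best_proc.threads:
        evaluations += 1
        if t.goodness > best_thread.goodness:
            best_thread = t
    return best_thread, evaluations


def build_workload(n_procs: int, n_threads: int) -> List[Process]:
    """Deterministic fake workload; the goodness values are arbitrary."""
    return [
        Process(
            name=f"p{i}",
            goodness=(i * 7) % 101,
            threads=[Thread(f"p{i}.t{j}", (i * 7 + j * 13) % 101)
                     for j in range(n_threads)],
        )
        for i in range(n_procs)
    ]


procs = build_workload(100, 100)
all_threads = [t for p in procs for t in p.threads]
_, flat_cost = schedule_flat(all_threads)
_, nested_cost = schedule_two_level(procs)
print(flat_cost, nested_cost)
```

In this model the flat queue examines all 10,000 threads per decision, while the two-level version examines 100 processes plus the 100 threads of the chosen one. The trade-off is that the two-level pick is only the best thread within the chosen process, not necessarily the best thread overall.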
So far, I think this has been much less of an issue not just because Linux hasn't been focused on the enterprise space (where scalability to tens of thousands of threads is crucial), but more because the key server-side applications in Linux (Apache, etc), have been multi-process rather than multithreaded. Now, with the increase in multithreaded apps from Java (say what you will about the language, it makes threading MUCH easier than C) and, for example, the new Apache process models, we'll start to see serious real-world performance benefits for those OSes that have the best thread scalability. Linus, being the bright guy he is, will surely pick up on this and make whatever changes are necessary. At least, that's the way I see it working out. --JRZ
AWESOME! by FascDot Killed My Pr · 2000-01-18 22:36 · Score: 5

This article gave me a hard-on.

It's not so much about Java. It's mostly about threading under Linux. The meat of the article is about how to improve the scheduler.

But the BEST part was the scientific attitude AND clear explanation (and proof) of the issues. This is EXACTLY what Linux needs. Maybe IBM would like to fund an idea I've had for a while:

Set up a lab that does nothing but Linux benchmarking. This lab would research things like the scheduler issue from this article, memory access patterns, filesystem layout, etc. All of this research would be available to the public for kernel development, third-party developers, benchmarketing (and rebuttals thereof), etc. The lab could also provide patches to "fix" issues, but that would be of secondary concern. The main purpose would be to supplement the (usually excellent) intuition of the kernel programmers with some hard science.

To do it right this should really be a separate non-profit, but it could start out as an internal project at some large company.
---
This comment powered by Mozilla!

--
Linux MAPI Server!
http://www.openone.com/software/MailOne/
(Exchange Migration HOWTO coming soon)
SGI's IRIX scheduler - "less is more" by john@iastate.edu · 2000-01-19 00:03 · Score: 5

I'm reaching way back in my memory here, but I recall a white paper (perhaps from Usenix) from SGI where they investigated how to keep their scheduler from using so many cycles - not so much from an "improve throughput" thrust, but more so to "improve responsiveness".
Their conclusion was that what you wanted to do was have a two-level scheduler -- a real quick + dirty part that ran at interrupt level and just grabbed the next runnable process from a circular list of the highest priority processes -- in and out in just a few cycles, but perhaps not grabbing *the* highest priority process this time -- then "every so often" (in computer terms, e.g. some fraction of a second) a lower level scheduler ran which did a more thorough re-ordering of the processes.
Of course, one immediately sees that this lower level scheduler could even be a regular process (making syscalls) which means you can plug in whatever scheduling algorithm you like.
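The two-level idea described above can be sketched roughly as follows. This is a toy model written from the description in this comment, not SGI's actual design: the class name `TwoLevelScheduler` and the parameters `ring_size` and `rebuild_every` are hypothetical, and "every so often" is modeled as a fixed number of ticks rather than a time interval.

```python
from collections import deque
from typing import Dict


class TwoLevelScheduler:
    """Cheap round-robin over a small ring, plus an occasional full re-sort."""

    def __init__(self, priorities: Dict[str, int], ring_size: int,
                 rebuild_every: int) -> None:
        if not priorities or ring_size < 1 or rebuild_every < 1:
            raise ValueError("need at least one process and positive sizes")
        self.priorities = dict(priorities)  # name -> priority, higher is better
        self.ring_size = ring_size          # how many top processes stay in the ring
        self.rebuild_every = rebuild_every  # ticks between thorough re-orderings
        self.ring: deque = deque()
        self.ticks_since_rebuild = 0
        self.rebuild()

    def set_priority(self, name: str, priority: int) -> None:
        """Priority changes are not visible to the fast path until a rebuild."""
        self.priorities[name] = priority

    def rebuild(self) -> None:
        """Slow path: sort everything and keep the top ring_size processes."""
        ranked = sorted(self.priorities, key=self.priorities.get, reverse=True)
        self.ring = deque(ranked[:self.ring_size])
        self.ticks_since_rebuild = 0

    def pick_next(self) -> str:
        """Fast path: take the head of the ring and rotate it to the back."""
        name = self.ring[0]
        self.ring.rotate(-1)
        return name

    def tick(self) -> str:
        """One scheduling decision; run the slow path only when it is due."""
        if self.ticks_since_rebuild >= self.rebuild_every:
            self.rebuild()
        self.ticks_since_rebuild += 1
        return self.pick_next()


sched = TwoLevelScheduler({"a": 1, "b": 5, "c": 3, "d": 4},
                          ring_size=2, rebuild_every=4)
before = [sched.tick(), sched.tick()]
sched.set_priority("a", 10)          # "a" is now the highest priority process
after = [sched.tick() for _ in range(4)]
print(before, after)
```

Note that "a" becomes the highest priority process after the second tick but is not picked until the next rebuild, which is the "perhaps not grabbing *the* highest priority process this time" trade-off. Because `rebuild` is an ordinary method kept separate from the fast path, it could be swapped for any other ordering policy.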

--
Shut up, be happy. The conveniences you demanded are now mandatory. -- Jello Biafra