Slashdot Mirror


Java Performance under Linux

krshultz writes "IBM has posted a great technical article on Java performance on its DeveloperWorks site. I learned a lot about Java and Linux in general." This is a nice big well-indexed article. Go.

14 of 141 comments

  1. How do other OSes do it? by Telcontar · · Score: 3

    How do the *BSD schedulers cope with that problem? What alternatives to an evaluation of the "Goodness functions" have been thought of?

    Is it maybe possible that one only makes a rough (heuristic) estimate of that function, maybe based on older (exact) values, which are only updated from time to time? The same goes for the ranking of the results of these functions (apparently much time is lost here). After all, with so many threads
    a) one does not have to select the best process to run - choosing a good one is OK
    b) having a bigger data structure in the Kernel should not be a problem - the testing machines had 1 GB of RAM...
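
    A minimal sketch of the idea above: keep a cached goodness value per task, refresh all of them only every few picks, and choose "a good one" from the stale values in between. The Task fields, exact_goodness() and CachingScheduler are hypothetical names for illustration, not the kernel's actual data structures or formula.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    counter: int              # remaining time slice (hypothetical field)
    priority: int             # static priority (hypothetical field)
    cached_goodness: int = 0  # last known estimate, possibly stale


def exact_goodness(task: Task) -> int:
    """Stand-in for the expensive exact evaluation (assumed formula)."""
    return 0 if task.counter == 0 else task.counter + task.priority


class CachingScheduler:
    def __init__(self, tasks, refresh_every=4):
        self.tasks = list(tasks)
        self.refresh_every = refresh_every
        self.picks = 0
        self.exact_evaluations = 0  # how often the expensive path ran

    def _refresh(self):
        # Exact pass over every task, done only from time to time.
        for task in self.tasks:
            task.cached_goodness = exact_goodness(task)
            self.exact_evaluations += 1

    def pick(self) -> Task:
        if self.picks % self.refresh_every == 0:
            self._refresh()
        self.picks += 1
        # Choose by the cached (possibly stale) estimate.
        chosen = max(self.tasks, key=lambda t: t.cached_goodness)
        # Cheap local adjustment instead of a full re-evaluation.
        chosen.counter = max(0, chosen.counter - 1)
        chosen.cached_goodness = max(0, chosen.cached_goodness - 1)
        return chosen
```

    With three tasks and refresh_every=4, eight picks cost six exact evaluations instead of twenty-four, at the price of sometimes running a task that is good rather than best.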

  2. Interesting... by jd · · Score: 5
    This is not the first time someone's commented on the Linux scheduler. There have been unofficial patches for it for some time, and there have been more than a few complaints as to the way it operates.

    There seem to be three directions people want to go with the scheduler - coarse-grain, fine-grain and real-time. Instead of arguing which is "best", why don't the developers do what they've always done in the past - put the stuff in, and use menu options to let people choose! If one (or two) of the options turn out to be really redundant, back them out! Nothing's lost, but a few cycles of human time. And it's better spent with code than with a flame-thrower.

    --
    It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
  3. No by aheitner · · Score: 3

    You can't rely on user-space thread switches, it's just too messy. Remember, in theory Win3.1 did user space timesharing, and we all know how much of a joke that was. If you're going to be doing more than one thing at once (conceptually of course) in a single userland context, you should structure your code appropriately for the grain you want and build an event structure or whatever is appropriate. Calls to yield are just flaky -- you should be able to structure things much more intelligently than that.

    For example, a webserver has the fundamental building block of a packet on the transport, which has an MTU. So a webserver ought to be able to build a grainsize based on sending at least one (or perhaps several if the packet size is very small, such as ATM; this would be decided by looking at the hardware at configtime or runtime on the server, not a big deal) packet, then considering which client to serve next. Of course it helps massively if you can collect intelligence on what type of connection the client has, i.e. don't try to send more than a few kilobit to modem users, etc etc.
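
    A rough sketch of that grain-size idea, as a pure simulation with no real sockets: one loop sends at most one packet's worth of data to a client, then moves on to the next. The function name, the pending-bytes dictionary and the 1460-byte default (a typical TCP payload on a 1500-byte Ethernet MTU) are assumptions for illustration.

```python
from collections import deque


def serve_round_robin(pending, payload_size=1460, packets_per_turn=1):
    """Return the (client, bytes_sent) order for a single-process server.

    pending maps a client name to the number of bytes still to send.
    Each turn sends at most payload_size * packets_per_turn bytes.
    """
    grain = payload_size * packets_per_turn
    queue = deque((client, left) for client, left in pending.items() if left > 0)
    order = []
    while queue:
        client, remaining = queue.popleft()
        chunk = min(grain, remaining)
        order.append((client, chunk))
        remaining -= chunk
        if remaining > 0:
            # Not finished: go to the back of the line.
            queue.append((client, remaining))
    return order
```

    Raising packets_per_turn for small-packet transports, or lowering the grain for slow clients, would be the knob the comment describes.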

    It's bad in the first place if there has to be a VM choosing which code to generate (slow slow slow!). I'm only talking about the "real-code" (i.e. traditional, compiled webservers and other multiconnection servers) case here. I don't believe one has the right to complain about performance at all if one chooses to use Java, so even tho "green-threads" style threads might be appropriate and effective for Java, they're not very useful for a real application.

  4. Wait a minute by aheitner · · Score: 5

    I've got a fundamental disconnect here ...

    Okay, the Linux scheduler is slower than it could be. It is taking up "up to 20% of CPU cycles" in the very process-intensive (given that native threads are no lighter weight than processes) benchmark, 400-2000 processes.

    But there's a more fundamental problem: a 20% speedup isn't significant. I'm not saying we should abandon all speedups that don't affect asymptotic complexity; I'm just saying that I'm looking for speedups of at least 2x-10x before I'm impressed with anything. 20% is small stuff.

    There's a bigger issue here: this many processes will never be fast. The cost of a context switch is high given current processor designs, and is not likely to get lower. Even assuming that on a thread switch, since you're dealing with the same data as the previous thread was using, the TLB and code/data caches remain useful (on a process switch in general they don't, and refilling the caches is very expensive), you still have to store a whole bunch of stuff to memory for the old thread and bring a whole bunch of stuff out of memory for the new thread. And you've got to leave userland for a bit to do that. Slow slow slow slow.

    It seems to me that in general we need to reconsider the approach of relying on the operating system to schedule and share resources (in the case of chatservers and ftpservers and especially webservers, where we see the real performance hits for massive thread/process expenses). Right now all this stuff is based on the Berkeley sockets API, a high level network API (i.e. one that doesn't at all consider what the transport will be). This has been a tremendously successful API; it's used on all platforms (well I can't speak for sure for Mac :) and it can be reasonably argued that Berkeley sockets paved the way for the Internet.

    But the fact remains that your ethernet card is fundamentally a serial device. I have to wonder if it wouldn't be possible to write a webserver which does know about the transport for a change, and which could in only one process sit there putting packets onto the wire at a level much closer to the hardware, and therefore save a lot of expense in making the operating system arbitrate all these zillions of threads that want to share the connection.

    It would be an interesting project to say the least.

  5. Idle criticism by jbert · · Score: 3

    Whilst it seems that these people are nice and thorough, a couple of points:

    1) If you are running one heck of a lot of processes/threads (same thing on Linux) you would expect the time spent in the scheduler to be big.
    That is an unavoidable overhead of *all* thread models. (You can try and reduce it - that's good...but run enough threads and it will dominate).

    2) {I am not a hacker but} If they are at the level of seeing improvements in the scheduler by tweaking things like structure layout to improve cacheline locality, then can we be sure that the "low performance impact" IBM Kernel trace patch is not having an effect? What was the throughput (i.e. the main benchmark measurement) like on a stock kernel?

    3) If you move to a many-many scheduling model you *will* reduce the time spent in the kernel scheduler. However, you *will* spend time in your user-land scheduler. Which is the win?


    I don't mean to suggest that these people don't have some good points (I hope that they develop patches and I hope that the best patch wins), but it is important not to jump to conclusions.

    PS - I only skimmed the article, so I may have got the wrong end of the stick. I'm sure someone will put me right if so :-)

  6. great technical article..and send feedback! by tuffy · · Score: 3
    I hope this patch, or something equivalent hacked out between IBM and Linus/Alan/etc. will make it into the 2.3 tree prior to 2.4. IBM looks to be continuing their great Java and Linux software development, much to everyone's benefit.

    And don't forget that little feedback thing at the bottom. Let IBM know these are the sorts of things we like to hear!

    --

    Ita erat quando hic adveni.

  7. Re:"Why threads Are A Bad Idea (for most purposes) by JohnZed · · Score: 3

    Curiously enough, two years later Ousterhout turned around and touted TCL's threading features as a major advantage that it enjoys over Perl.
    I've programmed a fair amount with both threads in Java and non-blocking I/O in C, and the one-thread-per-connection model is VASTLY easier to program, maintain, and use. Non-blocking I/O leads to code that's extremely non-linear, and much more confusing, than multithreaded implementations. It's like having to work with code that uses a million goto's; you never know where you'll be executing next. Threading, on the other hand, achieves the same benefits, but it lets the programmer work at a higher level of abstraction.
    Are C++ and Java broken because they use, for example, object-oriented representations of streams rather than a series of calls to "write" on a file descriptor? Well, this difference does cause a performance impact. But if you can get your product to market twice as quickly by using technologies that exact a 15% performance hit, isn't that worth the difference? As operating systems improve more and more to cooperate with sophisticated threading models, the performance hit for using them will continue to decline.
    Rather than sticking our heads in the sand and saying, "Well, there's another, more confusing, less modern way to do it that doesn't require us to change the way we've done things for years," let's actually try to find ways to make programming easier AND produce a high-performance result.
    --JRZ

  8. More on thread mappings by JohnZed · · Score: 5

    Interestingly enough, a heated thread on a related topic cropped up in the kernel-dev mailing list the other week. Check out Kernel Traffic for the details, but basically it had to do with some SGI engineers who wanted to make a change in a threading mechanism to facilitate 3D graphics performance on Linux. Linus explained that he felt their method was, basically, an unmaintainable, inelegant hack that has crept its way into Irix for marketing purposes but will never be in the Linux kernel.

    The relevant thing in relation to the IBM article is Linus' discussion of the philosophy of fork() and how strongly committed he is to this model. He's stated quite often, in fact, that this thread scheduling mechanism (which schedules threads as separate processes) is a very intentional part of the kernel design.

    Personally, I think this opinion will pretty much have to change over time when people are able to demonstrate very elegant patches for the many-to-many threading model discussed in the IBM article. In fact, if I remember correctly, this is the sort of threading model that TowerJ uses in their native Java compilation system to achieve such great scalability on Linux. You can find plenty of examples of in-process scheduling code if you're interested in checking it out: GNU portable threads is the first one that comes to mind, but almost every Java implementation offers this model as an option (green threads). The method IBM is talking about combines this intra-process tactic with the current inter-process scheduler.

    It just makes sense that if you have 10,000 processes in a queue and you have to recompute goodness for each every time you enter the schedule, this will be a less scalable approach than if you'd created 100 processes with 100 threads each, so that thread_goodness only needs to be computed when that particular process is entered. Think about the management of a large corporation: does the top management allocate resources, set timetables, and otherwise schedule every single employee? No, they schedule a number of departments and projects, then the next level of managers schedules each of the employees within those.
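
    The arithmetic behind that comparison, as a small sketch. It assumes one goodness evaluation per runnable entity per scheduling decision and that everything is runnable; the function names are made up for illustration.

```python
def flat_evaluations(total_threads: int) -> int:
    """One queue: every entry is examined on every scheduling decision."""
    return total_threads


def two_level_evaluations(processes: int, threads_per_process: int) -> int:
    """Pick a process first, then a thread inside only that process."""
    return processes + threads_per_process


flat = flat_evaluations(10_000)
nested = two_level_evaluations(100, 100)
print(flat, nested, flat // nested)
```

    Under those assumptions the flat queue costs 10,000 evaluations per decision and the two-level arrangement costs 200, a factor of 50.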

    So far, I think this has been much less of an issue not just because Linux hasn't been focused on the enterprise space (where scalability to tens of thousands of threads is crucial), but more because the key server-side applications in Linux (Apache, etc), have been multi-process rather than multithreaded. Now, with the increase in multithreaded apps from Java (say what you will about the language, it makes threading MUCH easier than C) and, for example, the new Apache process models, we'll start to see serious real-world performance benefits for those OSes that have the best thread scalability. Linus, being the bright guy he is, will surely pick up on this and make whatever changes are necessary. At least, that's the way I see it working out. --JRZ

  9. Re:VM's will always be slow by noom · · Score: 3

    There are compilers available for Linux (TowerJ being one) but their primary benefit is for server-side code; it'd be much more difficult (but not entirely impossible -- proof-carrying-code would work) to ensure safety if you distribute binaries to clients. Indeed, the whole point of using platform independent byte-codes is so that the JVM can ensure safety. Platform-specific machine code running on a server will probably coexist with platform-independent Java byte-codes for client applications.

    -NooM

  10. AWESOME! by FascDot+Killed+My+Pr · · Score: 5

    This article gave me a hard-on.

    It's not so much about Java. It's mostly about threading under Linux. The meat of the article is about how to improve the scheduler.

    But the BEST part was the scientific attitude AND clear explanation (and proof) of the issues. This is EXACTLY what Linux needs. Maybe IBM would like to fund an idea I've had for a while:

    Set up a lab that does nothing but Linux benchmarking. This lab would research things like the scheduler issue from this article, memory access patterns, filesystem layout, etc. All of this research would be available to the public for kernel development, third-party developers, benchmarketing (and rebuttals thereof), etc. The lab could also provide patches to "fix" issues, but that would be of secondary concern. The main purpose would be to supplement the (usually excellent) intuition of the kernel programmers with some hard science.

    To do it right this should really be a separate non-profit, but it could start out as an internal project at some large company.
    ---
    This comment powered by Mozilla!

    --
    Linux MAPI Server!
    http://www.openone.com/software/MailOne/
    (Exchange Migration HOWTO coming soon)
  11. Re:VM's will always be slow by javatips · · Score: 4

    I have been developing in Java since the end of 1995 (Java 1.0 Beta2).

    Over the years, I have seen a drastic increase in the performance of the JVM.

    Now I have to disagree with you. In multithreaded applications I have seen Java perform better than C++. The applications were built by the same person and used the same architecture. (The guy was a beginner in Java and experienced in C++.)

    Currently I develop server-side, component-based (EJB) applications (using an application server written in Java - WebLogic) and batch processes written in Java. I can say that they perform really well.

    From my experience, by coding carefully you can achieve wonderful performance. We had a batch application that must process 6 to 12 million records per day (and do a lot of processing on each record). The first version of the batch was doing 7 records per second (which did not meet our requirements); by optimizing the code and changing the algorithm we went to > 300 records per second.
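
    A quick check of those figures, assuming the batch could run continuously for all 24 hours of a day (the comment does not say how long the batch window was). The helper names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def records_per_day(records_per_second: float) -> float:
    """Daily capacity at a sustained rate, assuming a 24-hour run."""
    return records_per_second * SECONDS_PER_DAY


def required_rate(records: float) -> float:
    """Sustained records per second needed to finish within one day."""
    return records / SECONDS_PER_DAY


print(records_per_day(7))           # first version
print(records_per_day(300))         # optimized version
print(required_rate(6_000_000))     # low end of the daily load
print(required_rate(12_000_000))    # high end of the daily load
```

    At 7 records per second a full day yields about 0.6 million records, well short of 6 million. The 12 million figure needs roughly 139 records per second, so 300 per second leaves about a factor of two in hand even on the heaviest day.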

    Maybe we could gain 10% to 20% more speed if we rewrote the whole thing in C++. But it would take at least twice the time to develop and would not be as stable as the Java version.

    I concede that Java is a little bit slower than C++ (not in all cases) but the gain in programmer productivity and stability is really worth it.

  12. Want something native with Java semantics? by Scurrilous+Knave · · Score: 3

    If you're looking for a portable language that compiles to native machine code and which implements much of Java's semantics, check out Ada 95. You can find information here, or download a complete GPL'ed compiler here.

    I'm totally serious, folks. Do not regale me with tales of how much Ada sucks--most originate from introductory CS classes where Ada83 was shoved down unwilling throats by indifferent or hostile educators. Please, go read and experience for yourself before replying. And for those who dispute my claim about Java semantics, please pay special attention to the links on this page before you comment.

  13. Thread Oriented OS Needed by Baldrson · · Score: 3
    Systems in which light-weight threads are first order constructs, such as Mozart, illustrate why relational programming will eventually subsume functional or procedural programming:

    Functions are special cases of relations.

    It's important to build relational semantics (light-weight threads with logic variables or their equivalent) in at the kernel. Otherwise, you end up kludging around, either recreating it at the higher levels or malanalyzing your relational task to fit your functional tool.

    Open source studies like this one are increasing awareness of the need for light weight threads.

    That's good.

    The next step will be for people to recognize that what they are doing with all those threads is essentially relational in nature so they can really address the impedance mismatch between relational database and object oriented programming.

  14. SGI's IRIX scheduler - "less is more" by john@iastate.edu · · Score: 5
    I'm reaching way back in my memory here, but I recall a white paper (perhaps from Usenix) from SGI where they investigated how to keep their scheduler from using so many cycles - not so much from an "improve throughput" thrust, but more so to "improve responsiveness".

    Their conclusion was that what you wanted to do was have a two-level scheduler -- a real quick + dirty part that ran at interrupt level and just grabbed the next runnable processes from a circular list of the highest priority processes -- in and out in just a few cycles, but perhaps not grabbing *the* highest priority process this time -- then "every so often" (in computer terms, e.g. some fraction of a second) a lower level scheduler ran which did a more thorough re-ordering of the processes.

    Of course, one immediately sees that this lower level scheduler could even be a regular process (making syscalls) which means you can plug in whatever scheduling algorithm you like.
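
    A toy model of the two-level arrangement as recalled above, not SGI's actual design: the fast path takes the head of a circular list in constant time, and a separate, pluggable slow path re-orders the list "every so often". The class, method and policy names are hypothetical.

```python
from collections import deque


def by_priority(priorities):
    """Example slow-path policy: highest priority first."""
    return sorted(priorities, key=priorities.get, reverse=True)


class TwoLevelScheduler:
    def __init__(self, priorities, reorder_policy=by_priority):
        self.priorities = dict(priorities)  # name -> priority
        self.reorder_policy = reorder_policy
        self.ring = deque()
        self.reorder()

    def next_process(self):
        """Fast path: grab the head of the ring and rotate, nothing else.

        Raises IndexError if there are no processes.
        """
        name = self.ring[0]
        self.ring.rotate(-1)
        return name

    def reorder(self):
        """Slow path: thorough re-ordering, run only occasionally."""
        self.ring = deque(self.reorder_policy(self.priorities))


sched = TwoLevelScheduler({"a": 1, "b": 3, "c": 2})
picks = [sched.next_process() for _ in range(4)]
sched.priorities["a"] = 10      # priority changes between reorders
stale = sched.next_process()    # fast path still follows the old order
sched.reorder()                 # the slow path catches up
fresh = sched.next_process()
```

    Swapping by_priority for any other function that returns an ordering is the "plug in whatever scheduling algorithm you like" part; the fast path never changes.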

    --
    Shut up, be happy. The conveniences you demanded are now mandatory. -- Jello Biafra