Java IO Faster Than NIO
rsk writes "Paul Tyma, the man behind Mailinator, has put together an excellent performance analysis comparing old-school synchronous programming (java.io.*) to Java's asynchronous programming (java.nio.*) — showing a consistent 25% performance deficiency with the asynchronous code. As it turns out, old-style blocking I/O with modern threading libraries like Linux NPTL and multi-core machines gives you idle-thread and non-contending thread management for an extremely low cost; less than it takes to switch-and-restore connection state constantly with a selector approach."
That one person per coffee cup is the best solution. Or not. It could go either way. :p
Of course old school techniques are faster. We don't drop old school because we want better performance, we drop it because we're lazy, and want easier ways to get the job done!
I need trepanation like I need a hole in the head.
JDK7 will bring a new IO API that underneath uses epoll (Linux) or completion port (Windows). High performance servers will be possible in Java too.
Look at the timestamp of this presentation :) It's a bit of old news.
It was discussed here: http://www.theserverside.com/news/thread.tss?thread_id=48449
And it mostly shows that NIO is deficient. I encountered similar problems in my tests. Solved them by using http://mina.apache.org/ .
Never really used the NIO API, and never saw a reason to. It's not even included in the SCJP exam, which is the basis for all Sun certification.
Now we just need a Sun spokesperson to have a press conference and show that old-school blocking I/O with modern threading libraries affects all languages, not just Java.
Trolling is a art,
I'm not sure where / when NIO got equated to lower latency. The primary benefit of NIO (from my understanding, having designed and deployed both IO- and NIO-based servers) is that it allows you to have better concurrency on a single box, i.e. you can service many more calls / transactions on a single machine since you aren't limited by the number of threads you can spawn on that box (and you aren't limited as much by memory, since each thread consumes a fair number of resources on the box).
For the most part (and from my experimentation), NIO actually has slightly higher latency than standard IO (especially with heavy loaded boxes).
The question you need to ask yourself is... do you require higher concurrency and fewer boxes (cheaper to run / maintain) at the expense of slightly higher latency (which would work well for most web sites), or are your transactions latency sensitive / real-time, in which case using standard IO would work better (at the cost of requiring more hardware and support).
Java NIO is actually rather old at this point and is slated for a huge overhaul with NIO.2 which will be released with Java 7 this Fall.
In the case where you have to manage many thousands of connections the select loop is far more efficient than spawning 1 thread per connection. I can multiplex several hundred connections to the same listen port on one thread and let a fixed thread pool do the computations and let the same (connection) thread serve the result.
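For anyone who hasn't seen a select loop in Java, here is a minimal sketch of the idea, not a production server. The class name SelectLoopEcho and the roundTrip helper are made up for illustration, and as a simplification the echo happens on the selector thread itself rather than being handed off to a fixed worker pool as described above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.CancelledKeyException;
import java.nio.channels.ClosedSelectorException;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

// Hypothetical demo class: one thread multiplexes every connection via a Selector.
final class SelectLoopEcho implements Runnable {
    private final Selector selector;
    private final ServerSocketChannel server;

    SelectLoopEcho() throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0)); // port 0: OS picks a free port
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() { return server.socket().getLocalPort(); }

    void stop() throws IOException { selector.close(); server.close(); }

    @Override
    public void run() {
        ByteBuffer buf = ByteBuffer.allocate(4096);
        try {
            while (true) {
                selector.select(); // blocks until at least one channel is ready
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (!key.isValid()) continue;
                    if (key.isAcceptable()) {
                        SocketChannel client = server.accept();
                        if (client != null) {
                            client.configureBlocking(false);
                            client.register(selector, SelectionKey.OP_READ);
                        }
                    } else if (key.isReadable()) {
                        echo((SocketChannel) key.channel(), buf);
                    }
                }
            }
        } catch (IOException | ClosedSelectorException | CancelledKeyException e) {
            // Selector was closed or a fatal I/O error occurred: leave the loop.
        }
    }

    private static void echo(SocketChannel client, ByteBuffer buf) throws IOException {
        buf.clear();
        int n;
        try { n = client.read(buf); } catch (IOException e) { n = -1; }
        if (n < 0) { client.close(); return; } // peer closed or reset
        buf.flip();
        // Simplification: spin until written. A real server would register OP_WRITE.
        while (buf.hasRemaining()) client.write(buf);
    }

    // Starts the loop, sends one message from a blocking client, returns the echo.
    static String roundTrip(String msg) {
        try {
            SelectLoopEcho loop = new SelectLoopEcho();
            Thread t = new Thread(loop, "select-loop");
            t.setDaemon(true);
            t.start();
            byte[] out = msg.getBytes(StandardCharsets.UTF_8);
            byte[] in = new byte[out.length];
            try (Socket s = new Socket("127.0.0.1", loop.port())) {
                s.getOutputStream().write(out);
                s.getOutputStream().flush();
                InputStream is = s.getInputStream();
                int off = 0;
                while (off < in.length) {
                    int r = is.read(in, off, in.length - off);
                    if (r < 0) break;
                    off += r;
                }
            } finally {
                loop.stop();
            }
            return new String(in, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) { System.out.println(roundTrip("ping")); }
}
```

The point of the design is that the number of threads stays at one no matter how many clients register with the selector; per-connection state lives in the channel and its key instead of on a thread stack.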
It depends on what you are trying to do. While I did not go over the article completely I fail to see how asynchronous IO can be faster than NIO with buffers and DMA.
This looks like polling vs. pending, and if it is, pending won that war about 40 years ago.
the entire point of asynchronous is to acknowledge you will be waiting for IO, and try to do something else useful rather than just wait... asynchronous will obviously end up taking more time because of the overhead of managing states and performing the switches, but the tradeoff is something useful was getting done while waiting for IO a little longer instead of doing nothing except wait for the IO to complete. which method is best is completely application specific.
I wonder how much slower nio2 will be...
You'll laugh, hysterically.
First post... ..man, I knew I shouldn't have used that Java interface to post on Slashdot.
On Windows, the fastest way to do multithreaded I/O with a producer/consumer queue pattern is IO Completion Ports.
The fastest way to write a bunch of buffers to disk is WriteFileGather. The fastest way to read a bunch of data from disk is ReadFileScatter.
SQL Server uses these APIs to scale.
When I used to work at MS in evangelism, there was a big debate about how Unix does things one way, and Microsoft does it a COMPLETELY different way that you just can't #define away - it's just different. A guy named Michael Parkes said "I cannot go to these clients and say REPENT! and use IO completion ports! They do thread per client, because they have fork()".
When you listen to the technical explanations, the Microsoft way actually IS better - it's just that it's totally incompatible with everything else.
Learn IOCP and watch your context switches drop.
NIO was not created to be faster but rather to be scalable. Those are two completely different things. Fail.
Yawn ... slow day on Slashdot.
AIO gives you latency & scale improvements, not throughput. It's not surprising that synchronous I/O has faster throughput since it saves you a bunch of syscalls. AIO though means that you can handle more concurrent I/O operations even though each individual one may have a lower throughput & you offer more consistently low latency (whereas with synchronous I/O calls may block indefinitely, locking up the UI).
If you have multiple cores that do nothing otherwise (as in pretty much every benchmark), multithreading will use them and asynchronous nonblocking I/O won't, so the maximum transfer rate for static data in memory over a low-latency network will always be faster with blocking threads.
In real-life applications, if you always have enough work to distribute between cores/processors, your nonblocking I/O process or thread will only depend on the data production and transfer rate, not the raw throughput of the combination of syscalls that it makes. If output buffers are always empty, and input buffers are empty every time a transaction happens, then data transfer speed is maxed out, and adding more threads that perform I/O simultaneously will only increase overhead. If it is not maxed out, the same applies to queued data before/after processing -- that is, if there is processing. So if worker threads/processes do more than copy data, then giving additional cores to them is more useful than using those cores for I/O.
Contrary to the popular belief, there indeed is no God.
This presentation is actually from 2008 (as indicated by every single slide in the PDF -- and thanks for the PDF warning, BTW). Aside from being old, is there any indication that it's still true?
https://www.eff.org/https-everywhere
That still makes it a .COM language. As in COMMAND.COM, though...
What do strips to hang things on the wall have to do with Perl?
But seriously though, Perl is a .exe language not a .com language because its interpreter is bigger than 65536 bytes. So my best guess is that by ".com language" you meant a language in which programs are stored as source code and parsed when they are run, as opposed to compilation ahead of time.
Java people desperately crowing about anything performance related are desperate. It doesn't matter what you do, Java people, Java will never be the fastest thing around because a thing running in a thing will never be faster than a single thing. But that's okay! We all understand how it all works and we can just accept it for what it is. Focus on the important stuff, like eliminating the need to find, download, and install piles of stuff before being able to run a Java program. Make it so that I can download a binary and run it and not know whether the source code was written in C, C++, or Java ... since I really don't care anyway.
101 Reasons why Java is better than .NET
All of which are null and void if there's no JVM for your (non-PC) target platform. Which Java edition lets the public program for a game console? .NET does; XNA for Xbox 360 is based on the .NET Compact Framework.
IIRC, even in Novell NetWare the old blocking I/O calls which prevented the client from doing anything else while it waited for a response from the server were much faster than the new non-blocking I/O...
I've abandoned my search for truth; now I'm just looking for some useful delusions.
I was given the task of handling 10k concurrent connections on a $500 Linux box. Standard IO (multi-threaded) quite simply could not cut it - the OS context-switching overhead is prohibitive after a certain number of threads. NIO can handle 10k concurrent connections easily, and keep scaling. At lower volume, even up to 2000 concurrent connections, regular IO will be fine. But for higher volume, NIO is a must.
This may be true for Java.
It isn't true for C/C++.
With C/C++ and NPTL, the many-thread blocking IO style yields slightly lower latency at low IO rates, but offers significant latency variability and sharply decreased thruput at higher IO rates.
It seems that the linux scheduler is much to blame for this-- the number of times that a thread is scheduled on a different CPU increases dramatically with more threads, and this trashes the caches.
I've seen order-of-magnitude decreases in performance and order-of-magnitude increases in latency as a result of what appears to be the cache trashing.
The best way to write IO is to use one thread or process per CPU core and in that thread use non-blocking IO. I thought everyone knew this.
it won't matter, they'll rename nio2 back to nio after realizing that nobody can figure out the java numbering system.
What really pains me is when people decide to do the "easy" threaded I/O and then evolve strange monstrosities like thread pools to try to deal with the scaling problems. Allocating N*k worker threads to N processors and then doing select-style polling in each of them is just so ugly. Either the OS and language runtime should provide high performance async I/O or unlimited threading via smarter compilers and schedulers (or both). Having every application evolve from "trivial but slow" (simplistic async I/O) or "trivial but fragile" (simplistic threaded I/O) into "very complex but usually fast and scalable" (weird hybrids with thread pooling and application level work dispatchers) is just a terrible waste of software engineering resources.
shut up and mod me up.
I've done similar benchmarking. On average, Java socket IO takes 3000 cpu cycles to handle a packet, including kernel time. That isn't bad for TCP which is a complex protocol.
Java NIO implementation easily adds another 1000 cycles on top of that. It tries to provide too much abstraction and it does excessive locking.
Therefore, Java NIO is about 25% slower, it looks like, in a naive saturation test.
But it's meaningless. 1000 CPU cycles is nothing in a Java program. Let's do the simplest non-trivial thing with the data, say, XOR each byte. That's about 2 cycles per byte, and 2000-3000 per packet. That's another 50% slowdown. A real application needs to do much, much more than that, and the overhead of Java NIO becomes negligible.
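To put numbers on that, here is a rough back-of-the-envelope sketch. The 3000 and 1000 cycle figures are the ones quoted above; the class name, the 2500-cycle XOR midpoint and the 100,000-cycle "real application" figure are assumptions picked purely for illustration.

```java
// Hypothetical helper: how much throughput is left when every packet costs extra cycles.
final class NioOverhead {
    // Fraction of blocking-IO throughput that remains when `overhead` cycles are
    // added to a packet that already costs `io + work` cycles.
    static double relativeThroughput(long io, long work, long overhead) {
        return (double) (io + work) / (io + work + overhead);
    }

    public static void main(String[] args) {
        // Naive saturation test: no application work at all.
        System.out.printf("no app work:    %.3f%n", relativeThroughput(3000, 0, 1000));
        // XOR over the payload, taking 2500 cycles as a midpoint of 2000-3000.
        System.out.printf("xor payload:    %.3f%n", relativeThroughput(3000, 2500, 1000));
        // Arbitrary stand-in for a real application doing 100,000 cycles per packet.
        System.out.printf("heavy app work: %.3f%n", relativeThroughput(3000, 100_000, 1000));
    }
}
```

With no application work the ratio is 3000/4000, which is where the 25% figure comes from; as the per-packet work grows, the same 1000 cycles shrink toward noise.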
NIO is better if there are a lot of connections blocked on read/write. But this is odd if you really think about it! Why couldn't a thread implementation achieve the same thing? As of today, the only reason is that thread stack size is fixed and quite large. Java's default stack size is 256K. A few thousand threads will consume all the memory on board. If you maintain connection state yourself, it's far smaller than that, and it pays to not create one thread for every connection.
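On the stack size point: Java does let you ask for a smaller stack per thread, either globally with the -Xss JVM option or per thread through the four-argument Thread constructor. A minimal sketch, with the caveat that the Javadoc describes the stackSize argument as a hint the VM is free to round or ignore; the class name and the 128K figure are just illustrative choices.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo: start many threads that each request a small stack.
final class SmallStackThreads {
    // Starts `count` threads requesting `stackBytes` of stack each; returns how many ran.
    static int runAll(int count, long stackBytes) {
        AtomicInteger ran = new AtomicInteger();
        Thread[] threads = new Thread[count];
        for (int i = 0; i < count; i++) {
            // Thread(ThreadGroup, Runnable, String, long stackSize): the last argument
            // is a requested stack size in bytes, treated by the JVM as a suggestion.
            threads[i] = new Thread(null, ran::incrementAndGet, "conn-" + i, stackBytes);
            threads[i].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return ran.get();
    }

    public static void main(String[] args) {
        System.out.println(runAll(200, 128 * 1024) + " threads ran");
    }
}
```

This only trims the per-thread cost; it doesn't change the fact that hand-kept connection state is usually far smaller than any stack.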
You use threads when your app is CPU-bound and async I/O (it's actually sync, it's just multiplexed, but the name stuck) when it's I/O bound. Using threaded/synchronous models works better when you have a small number of connections, because the OS scheduler can cope with them. Going to 10k connections over the same model is downright stupid. Your software would spend more time switching contexts than performing useful I/O. Hybrid models, with per-thread I/O loops, are a good approach. IIRC, Mina does the same thing, moving connection objects from one thread to another when their state changes, while doing async I/O in each thread. I'm not entirely sure on this, I'm not a Java guy, but I remember reading this in a paper at some point.
If you know how to code high-bandwidth, low-latency IO, the operating system doesn't matter, because you'll be hardware limited.
And believe it or not, you can do a passable job of that with Java.
But you sure as shit CAN'T with C++ IO streams. There's nobody worse at coding high-speed IO than some noob who can't get past his >> and <<.
Java will run on platforms that support C.
XNA Game Studio does not support C or Standard C++. It supports a largely incompatible C++ dialect called "C++/CLI with /clr:safe".
Please see GNU gcj and the Classpath project.
Does gcj compile Java into C or directly into object code? The iPhone developer program requires that Xcode and only Xcode compile your program's source code, which must be written in Objective-C (of which C is a subset) or Objective-C++ (of which Standard C++ is a subset).
I don't think NIO ever claimed to be faster than old IO. Its primary selling point is (and always has been, IMO), its ability to handle tens of thousands of connections on a single thread. With old IO, ten thousand connections would require ten thousand threads. I don't care how good your context switcher is -- that is a high price to pay, especially if the machine has other high-thread-count processes running as well.
So when you're pushing data as fast as you can through a socket, the old read(byte[]) or write(byte[]) are faster? Wow, no kidding.
You do NOT use java.nio (like Jetty's SelectChannelConnector) for maximum throughput. You use it to handle persistent connections, like all those long polling requests via AJAX which return on an event or timeout after a minute. This article is like recommending Apache with its hard limits on how many requests it can serve concurrently over newer, asynchronous servers like Nginx for static media servers with keep alive enabled.
The slides even mention the C10K problem, but what they don't do is mention when to use either technology - async IO for concurrency and endless scaling, and synchronous IO for pushing a 10G Ethernet link to the limits. No wait, the nio setup can do that too, 700MB/s or 5.6Gbit/sec per core on 2008 hardware should be enough to max out anything you can buy now. It's great that synchronous IO can hit 1GB/s, a whopping 30% faster, but useful? I'd say no.
For most users, you don't use either API. Let's be honest here, writing highly concurrent software is hard, so why reinvent the wheel when you can get off-the-shelf software that can do it better? You use Jetty and choose between the SelectChannelConnector or SocketConnector, or choose between Apache or Lighttpd/Nginx depending on the traffic pattern. What you do write is the bit that accepts a whole HTTP request and returns an HTTP response, everything before and after is magic.
Unless you're a file server, each 50k sized HTTP response will require enough work to make sure you run out of CPU or Disk IO long before you hit even the 100Mb/s ceiling in most rack switches. Even if your app is fast, 16 cores x 100ms per request x 50K is only 62 Mbits. Not 5600.
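The 62 Mbit figure can be reproduced with a few lines, assuming "50K" means 50 KiB per response and that a megabit is counted as 1024*1024 bits (with decimal megabits you get 65.5 instead). The class and method names are made up for this sketch.

```java
// Hypothetical helper: network output of a CPU-bound app with every core busy.
final class AppThroughput {
    // Mbit/s (1 Mbit = 1024 * 1024 bits) when each core finishes one request
    // every `millisPerRequest` ms and every response is `responseBytes` long.
    static double mbitPerSecond(int cores, int millisPerRequest, long responseBytes) {
        double requestsPerSecond = cores * 1000.0 / millisPerRequest;
        return requestsPerSecond * responseBytes * 8 / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // 16 cores, 100 ms per request, 50 KiB responses: 160 requests per second.
        System.out.println(mbitPerSecond(16, 100, 50 * 1024) + " Mbit/s");
    }
}
```

Either way it is roughly two orders of magnitude below the 5.6 Gbit/s per core that the socket layer can push.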
But if you need to scale in concurrent client count, there's no way around async IO. The latest name to watch is Netty. In Plurk Comet: Handling 100,000+ Concurrent Connections with Netty, it scales up to 100000 concurrent connections on a quad core server with 20% CPU load.
Just stop worrying about sockets already, and start worrying about your SQL server suffering a meltdown. Even if you manage to grow into a Facebook, it's not like using synchronous IO will save you from deploying 30000 servers, it's the application code that's slow. Zero copy, one copy, "string concatenation style twenty copies response building" socket writes don't matter at all, memcpy is cheap compared to a few lines of interpreted code, servers are cheap compared to developers, and never mind the cost of the programming gods giving these presentations.
Async I/O predates Java and pthreads by decades.
On all sane programming environments Async IO is faster for non CPU bound applications as context switches are minimized.
The slides are dated 2008. Was there any follow up? Were these experiments repeated and/or confirmed anywhere else?
Thou shalt not begin a subject line or post with the word "Umm".
My understanding is that it is not supposed to be faster. It is non-blocking and asynchronous which serves a different need.
Weird, Paul's document has not changed since I last read it 2 years ago, but it's still news?
Actually, Paul published this on his blog probably a bit over a year ago. I remember it well, it's a very nice presentation.
Move sig!
Is it me or is Mailinator a blatant rip-off of http://spamgourmet.com/?
In simple tests traditional thread based IO is faster than NIO. On the other hand NIO is more scalable. See this blog: http://weblogs.java.net/blog/sdo/archive/2008/04/more_on_the_sim.html
The referenced PDF seems to be outdated (over 2 years old); I'm not sure whether it is still valid.
Biggest problem of Java IO are ints. More specifically, unsigned ones.
Lots of actual data files and data protocols have integer numbers (of some size) that are supposed to be unsigned.
But Java won't let you do unsigned anything. "Use the next larger size!" Well, what if you need unsigned long long values? What about having to remember the unsigned nature and convert eternally? (That also gives the problem that you can't override "+" to add two unsigned values together, so you have to use unsignedint.add(unsignedint2).)
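For reference, the usual workaround looks something like the sketch below: mask a 32-bit value into a long, and fall back to BigInteger for 64-bit fields. The class and method names are invented for illustration. Java 8 later added helpers such as Integer.toUnsignedLong and Long.toUnsignedString, which reduce the boilerplate but don't add unsigned types.

```java
import java.math.BigInteger;
import java.nio.ByteBuffer;

// Hypothetical helpers for reading unsigned fields out of a binary protocol.
final class UnsignedReads {
    // Reads a 32-bit field (ByteBuffer is big-endian by default) and widens it
    // into a non-negative long by masking off the sign extension.
    static long readUnsignedInt(ByteBuffer buf) {
        return buf.getInt() & 0xFFFFFFFFL;
    }

    // Reads a 64-bit field. No wider primitive exists, so use BigInteger.
    static BigInteger readUnsignedLong(ByteBuffer buf) {
        byte[] raw = new byte[8];
        buf.get(raw);
        return new BigInteger(1, raw); // signum 1: bytes are an unsigned magnitude
    }

    public static void main(String[] args) {
        byte ff = (byte) 0xFF;
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {
            ff, ff, ff, ff,                  // 32-bit field, all bits set
            ff, ff, ff, ff, ff, ff, ff, ff   // 64-bit field, all bits set
        });
        System.out.println(readUnsignedInt(buf));  // prints 4294967295
        System.out.println(readUnsignedLong(buf)); // prints 18446744073709551615
    }
}
```

It works, but it illustrates the complaint: the caller has to remember which fields are unsigned and convert at every boundary.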
There's also the problem that in its violent attempt to avoid any system visibility, you can ask for the environment and even update it, but you can't actually change it because it won't affect the current process and it won't pick up the actual environment of the shell the process is opened in.
If you want $PATH set, you have to use -DPATH=....
But, and this is the pisser, the documentation doesn't tell you you can't manipulate the environment.
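As far as I can tell, what the API actually offers is this: System.getenv() hands back an unmodifiable view of the current process's environment, while ProcessBuilder.environment() gives a writable copy that only affects a child process you launch from it. A small sketch of both, where DEMO_VAR and the class name are made up and the child command is never started:

```java
import java.util.Map;

// Hypothetical demo of the two environment maps Java exposes.
final class EnvDemo {
    // True if the JVM's view of its own environment rejects modification.
    static boolean ownEnvIsReadOnly() {
        try {
            System.getenv().put("DEMO_VAR", "x"); // DEMO_VAR is a made-up name
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    // Builds the environment a child process would inherit; this map is writable.
    static Map<String, String> childEnvWith(String key, String value) {
        ProcessBuilder pb = new ProcessBuilder("placeholder-command"); // never started
        pb.environment().put(key, value);
        return pb.environment();
    }

    public static void main(String[] args) {
        System.out.println("own environment read-only: " + ownEnvIsReadOnly());
        System.out.println("child sees DEMO_VAR=" + childEnvWith("DEMO_VAR", "x").get("DEMO_VAR"));
    }
}
```

Note that -DPATH=... sets a Java system property (read with System.getProperty), which is a separate namespace from the environment.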
From Wiki page on NPTL threads in Linux 2.6 and higher: "In tests, NPTL succeeded in starting 100,000 threads on a IA-32 in two seconds."
This is a wonderful achievement. However, the memory footprint issue still remains. A thread will typically consume a couple of pages of real memory right off the bat for a runtime stack. Then of course there is the matter of how many pages of address space need to be reserved per thread to allow for stack growth and thread-local heap growth.
So on a 32-bit OS a couple of pages will usually be 8K. 100K threads consume 781MB just to exist and be scheduled by the OS task switcher.
Using an NIO framework like, say, Apache MINA, I get a relatively easy framework to program to and can couple it to a thread pool (probably 25 worker threads could handle 100K socket connections, give or take a few). The memory footprint from threads is only around 200K.
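The arithmetic behind those two numbers, as a quick sketch. It assumes 4 KiB pages and two resident pages per thread, which is what "a couple of pages will usually be 8K" implies; the class name is made up.

```java
// Hypothetical helper: resident stack memory for a given number of threads.
final class ThreadFootprint {
    static final long PAGE_BYTES = 4096; // assumed IA-32 page size

    static long stackBytes(long threads, long pagesPerThread) {
        return threads * pagesPerThread * PAGE_BYTES;
    }

    public static void main(String[] args) {
        // 100,000 threads, thread-per-connection style.
        System.out.println(stackBytes(100_000, 2) / (1024.0 * 1024.0) + " MB"); // prints 781.25 MB
        // 25 pooled worker threads behind an NIO framework.
        System.out.println(stackBytes(25, 2) / 1024.0 + " KB"); // prints 200.0 KB
    }
}
```

This counts only the pages touched up front, not the address space reserved for stack growth, so the real gap is larger.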
Threading is an extremely resource-inefficient way to scale concurrency, even with the improvements of NPTL (which only apply to Linux anyway). Hence the actor model (which multiplexes many actors onto just a few threads), coupled with its messaging approach (instead of shared memory that must be locked), has become a popular alternative to a pure threading approach to concurrency.
-1, O.T. reply to O.T. question:
1) Find the name of your touch device - $ xinput --list. For my MacBook it's "appletouch".
2) Find the id of the touch device using the name - $ xinput --list-props name | grep 'Device Enabled'. Don't literally use "name", use the name you found. E.g. My command line is $ xinput --list-props appletouch | grep 'Device Enabled'.
3) Issue the command $ xinput --set-int-prop name id 8 -d . Where name is the name from #1, id is from #2, "8" is literally "8", and "-d" is literally "-d". So for my laptop at this moment in time it was $ xinput --set-int-prop appletouch 115 8 -d.
HTH
I don't really care how Java does it but I would expect they wouldn't always require one thread per socket. This quickly becomes a bad approach on a server application that has more than a few connections.
Short answer: java.util.concurrent.ThreadPoolExecutor. It has been available in the Standard Edition since Java 5.0 and is very simple to use.
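A minimal sketch of what that looks like, assuming a fixed-size pool fed from an unbounded queue. The class name and the "record which worker ran me" task are invented for the demo, and it uses lambdas, so it needs Java 8 or later even though the executor itself dates from Java 5.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo: many small jobs share a handful of worker threads.
final class PoolDemo {
    // Runs `tasks` jobs on `poolSize` threads; returns how many distinct workers were used.
    static int distinctWorkers(int poolSize, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize,              // core and maximum pool size
                0L, TimeUnit.MILLISECONDS,       // keep-alive for idle extra threads
                new LinkedBlockingQueue<Runnable>());
        Set<String> workers = ConcurrentHashMap.newKeySet();
        for (int i = 0; i < tasks; i++) {
            // Each job stands in for "handle one request".
            pool.execute(() -> workers.add(Thread.currentThread().getName()));
        }
        pool.shutdown(); // stop accepting work, let queued jobs finish
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return workers.size();
    }

    public static void main(String[] args) {
        System.out.println("1000 jobs ran on " + distinctWorkers(4, 1000) + " worker thread(s)");
    }
}
```

However many jobs are submitted, the thread count never exceeds the pool size; extra jobs just wait in the queue.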
NIO provides three additional capabilities to Java:
* select style I/O - managing a bunch of connections with one or more threads (instead of requiring a thread per I/O)
* direct buffers - allow Java to use raw byte buffers from the native OS without having to copy all of the data into and out of the virtual machine for every operation
* channels - which have better defined characteristics for interrupting I/O
Select style I/O is an architectural change to your app that may or may not improve performance depending on your design. Direct buffers are an implementation improvement that will generally improve performance for simple copy operations.
Pat Niemeyer
Author of Learning Java, O'Reilly & Associates
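On the direct-buffer point above, here is a small sketch (not from the book) of a direct ByteBuffer moving data through a FileChannel. The class name and the temp-file round trip are assumptions chosen to keep the example self-contained; whether a direct buffer is actually faster depends on the workload.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo: write and re-read a file through one direct buffer.
final class DirectBufferDemo {
    static String writeThenRead(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // allocateDirect asks for memory outside the Java heap, so the channel
        // can hand it to the OS without an intermediate copy.
        ByteBuffer buf = ByteBuffer.allocateDirect(bytes.length);
        try {
            Path tmp = Files.createTempFile("direct-demo", ".bin");
            try (FileChannel ch = FileChannel.open(tmp,
                    StandardOpenOption.READ, StandardOpenOption.WRITE)) {
                buf.put(bytes);
                buf.flip();                               // switch from filling to draining
                while (buf.hasRemaining()) ch.write(buf);
                buf.clear();                              // reuse the same buffer for reading
                ch.position(0);
                while (buf.hasRemaining() && ch.read(buf) >= 0) {
                    // keep reading until the buffer is full or EOF
                }
                buf.flip();
            } finally {
                Files.deleteIfExists(tmp);
            }
            byte[] back = new byte[buf.remaining()];
            buf.get(back);
            return new String(back, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(writeThenRead("hello from a direct buffer"));
    }
}
```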
A thread needs memory. If you want to support a large number of connections, using old-school IO will use a large amount of memory. Java stacks are typically 512KB. With NIO you can keep the number of threads low. Performance is not everything.