Jens Axboe On Kernel Development

Posted by ScuttleMonkey on Wednesday January 31, 2007 @05:41AM from the nuts-and-bolts dept.

BlockHead writes "Kerneltrap.org is running an interview with Jens Axboe, 15-year Linux veteran and the maintainer of the Linux kernel block layer, 'the piece of software that sits between the block device drivers (managing your hard drives, cdroms, etc) and the file systems.' The interview examines what's involved in maintaining this complex portion of the Linux kernel, and offers an accessible explanation of how IO schedulers work. Jens details his own CFQ, or Completely Fair Queuing scheduler, which is the default Linux IO scheduler. Finally, the article examines the current state of Linux kernel development, how it's changed over the years, and what's in store for the future."

Khmm... Block devices? How quaint! by mi · 2007-01-31 06:04 · Score: 2, Informative

FreeBSD dispensed with them altogether years ago...
Character devices only, thank you very much.
*Duck*

--
In Soviet Washington the swamp drains you.
Scared me... by __aaclcg7560 · 2007-01-31 06:11 · Score: 2, Funny

I thought the title was: Ewe Boll On Kernel Development...
Disagree with Mr. Axboe... by isaac · 2007-01-31 06:13 · Score: 5, Interesting

JA: In your opinion, with the increased rate of development happening on the 2.6 kernel, has it remained stable and reliable?

Jens Axboe: I think so. With the new development model, we have essentially pushed a good part of the serious stabilization work to the distros.
I respectfully disagree that the new development model works well from an end-user's perspective (an "end user" of many thousands of Linux hosts, not a toy desktop environment). Minor point releases now contain major changes in e.g. schedulers. This makes for a lot of work for real Linux users, backporting the useful bugfixes while retaining older algorithms for which workloads are optimized. Result: a severely splintered kernel and a lot more work for us.

If core changes of such magnitude are no longer sufficient to merit a dev branch or even a major point release, why bother with the "2.6" designation at all? Just pull a Solaris and call the next release "Linux 20" or "Linux XX."

-Isaac

--
I am not a lawyer, and this is not legal advice. For Entertainment Purposes Only.
1. Re:Disagree with Mr. Axboe... by Kjella · 2007-01-31 06:57 · Score: 5, Insightful
  
  Well, on the other side distros were backporting *huge* amounts of patches from 2.5 to 2.4, so while plain vanilla 2.4 was stable, almost no one was running it. The 2.6 releases mean the distros are shipping "stabilized unstables" instead of "destabilized stables"; I guess that works out better for some and worse for others. Are RHEL, SLES, Debian stable kernels not good enough kernels to start out with, if stability is what you need? I feel there's quite a few things I see come which I find great that arrive in a timely fashion, not at the release of 2.8 in a few years. I think most that use a distro's kernel feel that way.
  
  If you're the kind of kernel hacker who liked to get yours directly from kernel.org, yes then it sucks. But IMO the kernel has grown too big for just the core devs, think of it as an "extended" kernel team including the distros, where kernel.org releases are "internal betas". I think if you cut it back and expect just kernel.org to deliver stable kernels with the resources they have (which admittedly, they used to) then kernel development will slow way down.
  
  --
  Live today, because you never know what tomorrow brings
2. Re:Disagree with Mr. Axboe... by ComputerSlicer23 · 2007-01-31 06:57 · Score: 4, Insightful
  
  Don't take this the wrong way, but your complaint sounds a lot like the story about a patient and a doctor:
  "Doctor, when I do this, it hurts", and the doctor replies, "Well don't do that".
  I mean, if you are following bleeding edge kernels and complaining that they aren't as stable as you'd like, why not just follow a vendor's kernel? If you use or install "many thousands", you are either maintaining your own de-facto distribution or you are using someone else's distribution. Vendors do exactly the work you want done on your behalf.
  I patiently wait for my vendor kernel, which might be 10 point releases behind, to integrate bug fixes, and then upgrade in a year or two to a much newer point release (I think RedHat has used 2.6.9 and/or 2.6.13 in recent memory)... Incrementing a different number wouldn't really make any difference anyways. At that point it's all semantics; if you know the rules of the game, it's not hard to tell what's dangerous as an upgrade and what's not.
  It's not like 2.4.13 (or whichever one in the 2.4 series introduced serious disk corruption) was safe merely because it was a point release... They are safe because somebody took it out back and beat on the kernel for a while and it didn't cause any problems. If you upgrade without proper testing and it breaks, you get to keep the pieces.
  Kirby
3. Re:Disagree with Mr. Axboe... by diegocgteleline.es · 2007-01-31 08:21 · Score: 2, Insightful
  
  The kernel development model is optimized to make distros happy, not end users. Just like Gnome/KDE, BTW. This is because, well, in the Real World most desktops/servers use (or should use) the kernel shipped by their distro. And because distros are the ones who employ most kernel hackers.
  
  In other words, the previous development model made happy say 1% of people (you) and 99% unhappy (distros and hence people using distros). The current model makes 99% of people happy (distros) and 1% unhappy.
  
  IMO it was a good change. And if you don't like it, just use OpenSolaris. There's nothing wrong with it.
Where are they now? by LaminatorX · 2007-01-31 06:18 · Score: 3, Interesting

I did a double take when I saw this, as Jens was an exchange student at my high-school way back when. Small internet.
Wow ... by ravee · 2007-01-31 06:23 · Score: 2, Funny

15 year Linux veteran and the maintainer of the linux kernel block layer,...

In the interview he says he is now 30 years old. Wow that means he started working in Linux at the age of 15 - a real prodigy. A very interesting interview.

Btw, it is nice that kerneltrap.org has finally had a makeover. The earlier website design looked rather drab.

--
Linux Help
for all things on Linux
1. Re:Wow ... by Error27 · 2007-01-31 09:46 · Score: 3, Informative
  
  Marcelo was only 18 when he took over the 2.4 branch. He was working for Conectiva at age 13 or 14... Debian has had a bunch of really young package maintainers for critical packages.
What about the process' priority? by mi · 2007-01-31 06:29 · Score: 4, Insightful

CFQ now uses a time slice concept for disk sharing, similar to what the process scheduler does. Classic work conserving IO schedulers tend to perform really poorly for shared workloads.

I wonder if the originating process' priority is taken into account at all... It has always annoyed me that the "nice" (and especially the idle-only) processes are still treated equally when it comes to I/O...

--
In Soviet Washington the swamp drains you.
1. Re:What about the process' priority? by mi · 2007-01-31 06:43 · Score: 2, Insightful
  
  The article mentions an "ionice".
  
  Indeed, it does — but should not the I/O-niceness be automatically derived from the process' niceness?
  
  --
  In Soviet Washington the swamp drains you.
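
For reference, the ionice(1) and ioprio_set(2) documentation for later kernels describes a default along these lines: a process that never sets an I/O priority gets a best-effort level derived from its CPU nice value as (nice + 20) / 5. Whether the kernel discussed in this thread already behaved that way is an assumption here. A minimal Python sketch of that mapping, with a hypothetical function name:

```python
def default_io_priority(nice: int) -> int:
    """Map a CPU nice value (-20..19) to a best-effort I/O priority level (0..7).

    Assumption: this follows the (nice + 20) / 5 rule from the ionice
    documentation; level 0 is the highest priority, 7 the lowest.
    """
    if not -20 <= nice <= 19:
        raise ValueError("nice must be in the range -20..19")
    return (nice + 20) // 5


if __name__ == "__main__":
    for nice in (-20, 0, 10, 19):
        print(f"nice {nice:>3} -> best-effort I/O level {default_io_priority(nice)}")
```

Under this rule a reniced process does get a lower I/O priority automatically, but only within the best-effort class; the idle class still has to be requested explicitly with ionice.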
CFQ not the default scheduler? by rehabdoll · 2007-01-31 06:36 · Score: 4, Informative

Anticipatory is, according to my menuconfig:

The anticipatory I/O scheduler is the default disk scheduler. It is
generally a good choice for most environments, but is quite large and
complex when compared to the deadline I/O scheduler, it can also be
slower in some cases especially some database loads.

Anticipatory is also preselected with a fresh .config.
1. Re:CFQ not the default scheduler? by darkwhite · 2007-01-31 06:46 · Score: 2, Informative
  
  CFQ was committed relatively recently and there was discussion for a while as to whether and when to make it default. I think 2.6.19 uses Anticipatory by default, but 2.6.20 will use CFQ by default (not 100% sure though).
  
2. Re:CFQ not the default scheduler? by zdzichu · 2007-01-31 06:56 · Score: 4, Informative
  
  CFQ is default since 2.6.18, released back in September 2006.
  
  --
  :wq
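
To check what a running system actually uses, rather than what the menuconfig help says: 2.6 kernels expose the per-device scheduler in /sys/block/<device>/queue/scheduler, listing the available schedulers with the active one in square brackets. A minimal Python sketch of reading that line follows; the function names, the device name "sda", and the sample string are illustrative assumptions, not taken from the thread.

```python
from typing import List, Tuple


def parse_scheduler_line(line: str) -> Tuple[str, List[str]]:
    """Return (active, available) from a line such as 'noop anticipatory deadline [cfq]'."""
    active = None
    available = []
    for name in line.split():
        # The active scheduler is the one wrapped in square brackets.
        if name.startswith("[") and name.endswith("]"):
            name = name[1:-1]
            active = name
        available.append(name)
    if active is None:
        raise ValueError("no active scheduler marked in: %r" % line)
    return active, available


def read_active_scheduler(device: str = "sda") -> str:
    """Read the active scheduler for a block device (the device name is an assumption)."""
    with open("/sys/block/%s/queue/scheduler" % device) as f:
        return parse_scheduler_line(f.read())[0]


if __name__ == "__main__":
    print(parse_scheduler_line("noop anticipatory deadline [cfq]"))
```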
No block devices = no disk scheduling? by Kadin2048 · 2007-01-31 07:34 · Score: 4, Interesting

So how does that work?

At risk of starting a holy war, is there any reason why one approach would be superior? And do they lend themselves to different methods of scheduling? In TFA, Axboe talks about [1] the scheduling mechanism used in later versions of the 2.6 kernel series, which alleviates a problem that I (and most other people, probably) have run into before.

I'm curious because I don't use any of the 'real' BSDs very often. I spend most of my time (at home, anyway) using either Mac OS X, which uses the Mach/XNU kernel (which is derived from 4.3BSD, although I don't know if the I/O scheduler has been rewritten since then), or Linux with the 2.6 kernel, and it seems to me that OS X's disk I/O leaves something to be desired compared to Linux's.

Does BSD handle I/O differently in some fundamental fashion than Linux? It sounds like, by eliminating block devices, that they basically remove the kernel from doing any re-ordering or caching of data, which makes things "safer" (in the event of a crash) but seems like it would have big performance penalties when using drives that aren't very smart, and don't do a lot of caching and optimization on their own. It seems like getting rid of I/O scheduling altogether is a stiff price to pay for "safety."

[1] (quoting because there doesn't seem to be anchors in TFA)
Classic work conserving IO schedulers tend to perform really poorly for shared workloads. A good example of that is trying to edit a file while some other process(es) are doing write back of dirty data. ... Even with a fairly small latency of a few seconds between each read, getting at the file you wish to edit can take tens of seconds. On an unloaded system, the same operation would take perhaps 100 milliseconds at most. By allowing a process priority access to the disk for small slices of time, that same operation will often complete in a few hundred milliseconds instead. A different example is having two or more processes reading file data. A work conserving scheduler will seek back and forth between the processes continually, reducing a sequential workload to a completely seek bound workload. ...
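
The last point in the quote, two sequential readers turning into a seek-bound workload, can be illustrated with a toy model. This is only a sketch under stated assumptions: it counts requests per turn rather than using time-based slices as CFQ does, it treats any non-adjacent block as one seek, and every name in it is hypothetical.

```python
from typing import List, Optional


def interleave(streams: List[List[int]], slice_len: int) -> List[int]:
    """Serve up to slice_len requests from each stream in turn until all are drained."""
    positions = [0] * len(streams)
    order: List[int] = []
    while any(pos < len(s) for pos, s in zip(positions, streams)):
        for i, stream in enumerate(streams):
            chunk = stream[positions[i]:positions[i] + slice_len]
            order.extend(chunk)
            positions[i] += len(chunk)
    return order


def count_seeks(order: List[int]) -> int:
    """Count requests that do not directly follow the previously served block."""
    seeks = 0
    prev: Optional[int] = None
    for block in order:
        if prev is None or block != prev + 1:
            seeks += 1
        prev = block
    return seeks


if __name__ == "__main__":
    reader_a = list(range(0, 8))        # sequential read near the start of the disk
    reader_b = list(range(1000, 1008))  # sequential read far away
    for slice_len in (1, 4, 8):
        order = interleave([reader_a, reader_b], slice_len)
        print(f"slice of {slice_len} requests: {count_seeks(order)} seeks")
```

With a slice of one request every request costs a seek; letting each reader keep the disk for a few requests in a row cuts the seek count sharply, which is the effect the quote attributes to CFQ's time slices.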

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
1. Re:No block devices = no disk scheduling? by jd · 2007-01-31 19:28 · Score: 2, Interesting
  
  Block devices lend themselves nicely to offload engines, as you can RDMA the processed data into a thin driver that basically just offers the data to the userspace application in the expected format but does little or no actual work. You can even do direct data placement into the application and just use the kernel as a notification system. So, the smarter the hardware, the more you can get from being able to handle large chunks of data or large numbers of commands in a single shot. Arguably, you can still do some of this with a character device - you can RDMA into the kernel, but direct data placement would be a headache and I can't see you getting much from either offloading or kernel bypass.
  However, that is actually one of the benefits of character devices. They're lightweight on the hardware and the software, making "routine" activity extremely fast and efficient, and making it easier to be sure everything is correct and robust. For most "normal" activity, you're not wanting to do anything particularly complex. Word processors, by and large, are not based on scatter/gather algorithms, and it is rare to find non-sequential MP3s. Also bear in mind that most CPUs outpace memory tens, if not hundreds, of times over - they are certainly going to outpace any peripherals a person might have. Why accelerate the kernel, if the kernel isn't the bottleneck? That just risks introducing bugs with no obvious gain.
  Myself, I believe that it's stupid to design limitations into one component because of limitations in another. The limitations in the other component will be subject to change, but the designed limitations will hang around for much longer. I also think it's stupid to look at current typical use. Current typical use is dictated by what is currently practical. If you change what is practical, you will change what is typical use. The OS and the users are not independent of one another. What people wanted is unimportant, it's what people want to want that should dictate what OS writers should want to offer. And, yes, I believe that direct data placement has the potential to eliminate the need for both binary-only drivers and heavy-weight kernels.
  (Linux contains a huge number of very low-level drivers, and is limited in what it can absorb in the way of new high-level functionality because of the risk of breakage and the difficulty of maintaining such a gigantic tree. If those had all been intelligent peripherals, the same amount of effort and coding would have produced a kernel with staggering capabilities and electronic superpowers. The drivers can't go away, even if intelligent devices replace the dumb ones of today, because people will use legacy stuff. Actually, it's worse. As Microsoft showed with Winmodems and Winprinters, it's possible to sell people dumber-than-dumb devices and even heavier-weight software that does a worse job, slower.)
  
  --
  It's a small world and it smells funny; I'd buy another if it wasn't for the money; Take back what I paid (SoM)
Missing Question: How do you pronounce your name? by chuck · 2007-01-31 07:48 · Score: 2, Interesting

As a native English speaker, comfortable with Spanish and aware of the basics of French (so I'm not entirely uneducated), I am entirely unequipped to reason the pronunciation of "Jens Axboe." Can someone help me out?

--
My Freakin Blog
Re:Missing Question: How do you pronounce your nam by LaminatorX · 2007-01-31 08:05 · Score: 3, Interesting

Back in school we pronounced it with a "y" sound for the "j": "Yens" rhymed with "mens." Now, as to whether that was actually the correct pronunciation or merely something close enough that he didn't bother correcting us, I couldn't say.
This is what Slashdot is about by bcmm · 2007-01-31 08:39 · Score: 3, Interesting

Thank you very much. Much of this article is informative, technical and really, really nerdy. I for one sit through dupes and rubbish like today's meaningless benchmarking of differing minor kernel versions in the hope of reading articles like this.

BTW, does anyone have a good set of benchmarks of the performance of different IO schedulers when running one or two or three IO intensive tasks, when running one intensive and many small tasks, etc.? That would actually help me decide whether to rebuild my kernel with CFQ.

Also, ionice would have made my old machine much more usable when doing backups... Oh well.

--
# cat /dev/mem | strings | grep -i llama
Damn, my RAM is full of llamas.
Scheduling better than no scheduling? by Kadin2048 · 2007-01-31 08:58 · Score: 4, Interesting

Are there any hard metrics on what the performance advantages are of various schedulers, under typical load conditions?

Reading TFA piqued my interest in I/O scheduling and I've been doing some reading on it, and it seems like there are several competing schools of thought, of which Axboe's (and potentially that of the Linux kernel developers generally) is only one.

An alternative view, such as this from Justin Walker (a Darwin developer) on the darwin-kernel mailing list, holds that it's not worthwhile for the OS kernel to do much disk scheduling, since "the OS does not have a good idea of the actual disk geometry and other performance characteristics, and so we [kernel developers] leave that level of scheduling up to the controllers in the disk drive itself. I think, for example, that recent IBM drives have some variant of OS/2 running in the controller. Since the OS knows nothing about heads, tracks, cylinders for modern commodity disks, it's futile to try to schedule I/O for them." (written Mar 2003)

Axboe seems to acknowledge that this may sometimes be the case, because they do have the 'non-scheduling scheduler,' which he recommends only for use with very intelligent hardware. However, it seems like some people think that commodity drives are already 'smart enough' to do their own scheduling.

It seems like determining which approach was superior would be relatively straightforward, and yet I've never seen it done (although maybe I'm just not looking in the right places). Anecdotally, I'm tempted to agree with Axboe, since it seems like, when doing things where several processes are all thrashing the disk simultaneously, my Linux machine feels faster than my OS X one, but this is by no means scientific (they don't have the same drives in them, not working with the same datasets, etc.).

On what drives, and under what conditions, is it advantageous to have the OS kernel perform scheduling, and on which ones is it best just to pass stuff to the drive and let the controller do all the thinking?

--
"Ladies and gentlemen, my killbot features Lotus Notes and a machine gun. It is the finest available."
1. Re:Scheduling better than no scheduling? by axboe · 2007-01-31 20:07 · Score: 3, Informative
  
  It depends on what you need to schedule. If your drive does queuing and only one process IO is active, then the OS can do very little to help. The OS usually has a larger depth of ios to work with, so it's still often beneficial to do some sorting at that level as well.
  
  IO scheduling is a lot more than that, however. If you have several active processes issuing IO, the IO scheduler can make a large difference to throughput. I actually just did a talk at LCA 2007 with some results on this, you can download the slides here:
  
  LCA2007 CFQ talk
Re:Missing Question: How do you pronounce your nam by Ysangkok · 2007-01-31 09:32 · Score: 2, Informative
Well, he's a Dane. I'm a Dane too so I'll tell you how I would pronounce it:
Jens is NOT pronounced "Djens". "J" is pronounced as a palatal approximant in Danish - just like "y" in English. Yens is somewhat more correct, but the "e" has to be pronounced like the IPA [æ]. Danish is not logical at all. If it were, "Jens" would be spelled with an "æ". Take a look at Jens.
IPA: [jæns]
Axboe is more complicated:
- A is pronounced flat. (like when you say "aah" at the dentist. Just like "spa". Take a look at Open back unrounded vowel)
- X is pronounced "ks". When the word is pronounced quickly it may sound like "gs".
- B is b.
- "Oe" is usually pronounced like the Danish "ø". See Close-mid front rounded vowel
IPA transcription of Axboe would be something like: Open back unrounded vowel + [ksbø]
(I can't get the IPA sign for "Open back unrounded vowel" to display in Slash)
Err, no by Fweeky · 2007-01-31 09:43 · Score: 2, Informative

"It sounds like, by eliminating block devices, that they basically remove the kernel from doing any re-ordering or caching of data, which makes things "safer""

No; FreeBSD's shifted the buffer cache away from individual devices and into the filesystem/VM, where it caches vnodes rather than raw data blocks. The IO queue (below all this block/character/GEOM stuff) is scheduled using a standard elevator algorithm called C-LOOK. It's showing its age in places, and there's been some effort towards replacing/improving it, making it pluggable etc (e.g. Hybrid); sadly it's a tricky problem to solve properly. See this recent thread.
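
For readers unfamiliar with C-LOOK: the head services pending requests in increasing block order up to the highest one, then jumps back to the lowest pending request and sweeps upward again. The following is a textbook-style Python sketch of that ordering, not FreeBSD's actual implementation; the function name and the sample request list are made up for illustration.

```python
from typing import List


def c_look_order(head: int, requests: List[int]) -> List[int]:
    """Return the order in which C-LOOK would service the pending requests.

    Requests at or beyond the head position are served first in ascending
    order; the head then wraps to the lowest pending request and continues
    ascending. Requests are only ever serviced while moving upward.
    """
    ahead = sorted(r for r in requests if r >= head)
    behind = sorted(r for r in requests if r < head)
    return ahead + behind


if __name__ == "__main__":
    pending = [82, 17, 43, 140, 24, 16, 190]
    print(c_look_order(50, pending))
```

Because every sweep runs in one direction, no request waits for more than roughly one full pass, but the algorithm knows nothing about which process issued a request, which is the gap that schedulers like CFQ try to fill.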
Re:Missing Question: How do you pronounce your nam by axboe · 2007-01-31 20:01 · Score: 3, Informative

Hi John!

That is correct, like a "y", rhymes with "mens". I saw another question on the last name; I typically tell foreigners that it is pronounced ax-bow. Europeans often think the 'oe' is like the Danish "ø", however that is not the case.
Re:Reiser4 by axboe · 2007-01-31 20:09 · Score: 2, Interesting

That's largely because they do more than traditional file systems. Some of the ZFS functionality Linux would put in other layers, for instance. Once the IO is issued to the block layer, there's no difference.