Domain: kerneltrap.com
Stories and comments across the archive that link to kerneltrap.com.
Comments · 15
-
World of Warcraft 3.3.5 fix made it into 2.6.35
The fix for World of Warcraft under WINE made it into 2.6.35, though it is not mentioned in the changelist above. WoW 3.3.5 crashed under recent Linux kernels because it apparently made use of the "icebp" instruction (the undocumented x86 INT1 opcode, 0xF1), and an earlier 2.6 kernel had stopped sending SIGTRAP for icebp instructions, for whatever reason.
Diff of fix
Source code of file, showing the icebp fix merged in (search for "icebp")
WINE compat page -
Re:2.5 impressions
I've been running the 2.5 kernel on my laptop for a couple of weeks now to get the new cpufreq support.
This reminds me: one problem I've always had is that new stuff that gets thrown into the kernel isn't clearly explained, even in the most basic ways. I.e., what the heck is it? I remember lots of versions of 2.4 had features and options with no help text to explain what they did. Google searches don't always turn up anything handy; often they turn up lots of hits on patches or posts talking about the feature, but not describing what it actually is.
Anyway, for those wondering what the heck cpufreq is... from a KernelTrap interview:
JA: You also mentioned working on the x86 side of Russell King's cpufreq code. We spoke with Russell King in an earlier interview, but we didn't talk about cpufreq. What is it?
Dave Jones: Quite a few CPUs these days allow changing of the voltage/multiplier/bus speed through software. Russell and Erik Mouw did a bunch of work on the ARM CPUs that support this feature, and started writing a generic framework for this type of technology so that they wouldn't have to duplicate code that, e.g., recalculates loops_per_sec in every speed-scaling driver.
etc.
-
Re:Links to actual documentation
braz's article cleaned up, and with links for easy clicking (and AC to avoid karma-whoring)...
========
Congratulations to Mel Gorman for actually managing to get down to this low a level and still explain it sensibly.
Here are the links for interested readers:
The kerneltrap article
Actual documentation:
The documentation comes in two parts. The first is "Understanding the Linux Virtual Memory Manager" and it does pretty much as described. It is available in three formats, PDF, HTML and plain text.
"Understanding the Linux Virtual Memory Manager" as PDF, HTML, and Text.
The second part is a code commentary which is literally a guided tour through the code. It is intended to help decipher the more cryptic sections as well as identify the code patterns that are prevalent throughout the code. I decided to have the code separate from the first document as maintaining the code in the document would be too painful.
Code Commentary on the Linux Virtual Memory Manager
"VM Code Commentary" as PDF, HTML, and Text.
-
Drupal's Throttle
There's a PHP script called Drupal that has a "throttle" module. Jeremy, the owner of KernelTrap, developed it after many /. stories linked to his pages.
It generates static files (similar to caches) when traffic gets too high. You can check Drupal's CVS (drupal -> modules -> throttle) or go straight to it.
PS: Some links may contain stray spaces; cut, paste and edit... -
Re:No! No! OpenBeos! OpenBeos!
why the hell is everybody so intent on making some sort of BE/Linux hybrid?
Not to "save" the BeOS legacy/religion/apps, obviously, but to save the Linux kernel with all its drivers/features/fans/developers/sponsors/bounty from becoming a platform used for running nothing but POSIX web servers on headless PC hardware, when it could be better (in design) than OS X on (even old) PC hardware.
This BefrankensteinAtOS is just a step toward what is my dreamworld:
- a cheap nForce-like mainboard with onboard graphics (nVidia, nuff said), audio (Dolby 5.1 encoder), network (100 Mbit is 100 Mbit) and FireWire (USB is now a "legacy connector" ;-))
- a DVB-C card
- two or four ClawHammer CPUs
- cooling that makes sense, not noise
- a Linux-based kernel that loads directly from EEPROM instead of an ugly old BIOS that doesn't even understand today's hard drives but still loads MS-DOS 3.00
- no more X, just every bit of experience nVidia has with performance drivers
- a really fast GUI; just try going back from Be's BeOS to Windows
- a simple GUI and CLI shell that doesn't eat more resources than it offers functionality, but has a nice look and feel
- configurable translators
- a filesystem that is fast, doesn't need complex journaling because the OS writes metadata in a recoverable order and the hardware is fast enough to offer reasonably fast recovery anyway, and has optional metadata (like the BeFS MIME file type)
I think this is really close to what others on Slashdot want; note the lack of "evil" technology (except perhaps nVidia).
After reading it back I found it also lacks girls and a social life, but then again you can't have it all ;-)
I guess for now I will have to make do with the Dano leak.... -
Re:Asserts
Yes. Asserts are easy to use and have a huge payoff. They are especially good at catching nasty subtle bugs.
They are easy; they catch bugs that might not otherwise ever be noticed (or noticed only as some pervasive flakiness); they save you a lot of time that would otherwise be spent debugging; they are good documentation; they don't cost anything in non-debug builds; they save you from a lot of pain. Despite all this, I haven't yet successfully convinced a single person who is unfamiliar with them of their value. Christ, I get dismayed sometimes.
This interview with FreeBSD kernel hacker Matt Dillon has some interesting things to say about assertions ("it greatly contributed to our famed stability in 4.0 and later releases [of the FreeBSD kernel]").
Java finally has them? Cool. What year is it again? 2002? Jeez ... -
An earlier interview
With Robert Love at KernelTrap
And another one with John Levon (OProfile) where he discusses the patch. (at the end of the interview)
For me, these 2 interviews offer more insight than the interview in the article. Especially the comment from Levon that this patch is really just a "band-aid"...and that the real problem lies in fixing the kernel. -
Re:They Need to
http://urbanlegends.about.com/library/blbush-quote.htm
http://www.msnbc.com/news/629589.asp
I've actually become rather tired of it, and plan on changing it soon. You are welcome to vote on your favorite of the two...
the 'green blob' school of navi-gation hasn't really caught on. i can't imagine why
--
Suck.com
OR
OpenBSD development has a long tradition of stealing free code from other projects, and then improving it ;-)
--
Theo De Raadt -
Re:.20 has been -pre for months
2.2.x is a very stable kernel series. Alan Cox is in charge, and intentionally being very cautious about making changes. (If it's not broken, don't fix it) That's why it took so long to go from 2.2.19 to 2.2.20.
Sadly, Alan's not planning to take over the 2.4 series. This is a pity, as he's done such a good job with 2.2... and 2.4 could use his help. -
Below is the article copied from Byte; Byte.com is pretty unresponsive from my part of the world. It's a good read if you have time...
Linux Kernel Pillow Talk
By Moshe Bar
October 29, 2001
And you thought the netherworlds of dry kernel engineering were free of politics, egos, and prima donnas? Guess again. The events of the last four to six weeks and the e-mails flying to and from the Linux kernel mailing list show how Byzantine and complex the dynamics of decision-making, feature design, and implementation can be. Go to http://www.tux.org/lkml/ to subscribe to the kernel mailing list, but be careful: this is a very high-traffic list. Subscribe only if you really want to follow every single detail of the Linux kernel; otherwise, read the weekly digest at Kernel Traffic at http://kt.zork.net/kernel-traffic.
Sure, the lively debates have always existed. In the past there have been disputes about the Linux firewalling code, networking code, scheduler, installer, driver model, and many more. One recurrent theme has always been the Virtual Memory (VM) manager. Nothing determines the peculiar behavior, the feel -- even the ultimate success or failure of an operating system -- like its virtual memory design. Sometime during the development cycle leading up to the Linux 2.4.0 kernel, in other words in 2.3.xx times, Rik van Riel (http://www.surriel.com), a Dutch kernel hacker working for Brazil-based Conectiva (one of the smaller Linux distributions), introduced radically new VM code. It was based on what seemed to be new and advanced algorithms for efficiently finding, allocating, and disposing of the virtual memory pages requested by programs. Rik later introduced an interesting new kernel feature called the "OOM killer." OOM stands for Out Of Memory. The OOM killer attempts to locate a killable process when the system runs out of memory. Without such a feature the whole machine can go nuts, or enter a vicious cycle: it swaps out a few pages, realizes immediately afterwards that those pages are needed, and searches again for swappable page candidates, keeping the kernel busy doing only this instead of letting user processes run.
Rik is a gifted hacker, and among other things he has been trying to improve the efficiency and speed of maintenance of those lists in the kernel responsible for managing all the virtual memory pages in the system. One of the main questions to address in every operating system VM code is: "How do you choose which page to steal next when there is a RAM shortage?"
In the 2.4.0 release, the Linux kernel scans the processes' page tables and decides which page to remove. The problem with this approach is that sometimes a lot of page tables have to be scanned to free just one page, or very few pages. Also, this approach does not guarantee that the pages stolen are ones that will not be needed again very soon. Some UNIXes introduced the notion of the working set; that is, the minimum number of pages a process needs to function efficiently. This solution is, however, limited to per-process pages and does not consider other kinds of pages, such as the filesystem cache. Stealing from these pages might in some cases even prove counter-productive. Very often in VM theory, a solution to one problem can worsen another; that's why kernel programming is difficult.
Rik van Riel and I have discussed at various times another approach, called "reverse mapping," which implements a reverse lookup from each physical page to the page-table entries of the processes that map it. Once you have reverse-mapped pages, the VM can simply scan the pages themselves for the ones to be freed. Naturally, some extra fields need to be added to the appropriate control tables to allow this reverse mapping. My own implementation has an overhead of 14 bytes and is therefore certainly a lesser solution than Rik's -- his overhead is just 8 bytes.
Other extremely talented kernel hackers such as Marcelo Tosatti and Ben LaHaise have made other important contributions to the Linux VM.
However, even though all these intelligent people tried hard to make the Linux VM fast, efficient, and powerful, user reports since the 2.4.0 release indicated poor Linux kernel performance and erratic, unstable behavior. Up to kernel 2.4.7, for instance, on machines with small memory footprints (less than 40 MB of RAM), sudden swap storms could erupt that would virtually freeze the system while it inexplicably started swapping pages in and out like crazy. In some cases, the aforementioned OOM killer would choose the wrong process to kill; I have seen the all-important init process killed erroneously. Many fringe kernel projects, like my own Mosix project or others such as Win4Lin, suffered because users blamed these projects for unstable operation, assuming that a released kernel like 2.4.0 must be free of such nasty bugs. Even though the kernel had gradually evolved from 2.4.0 to 2.4.9, it was evident that the VM design was more of a liability than an advantage.
Linus himself said in a recent post to the kernel mailing list that he wasn't happy yet with the VM. These problems were enough for many Linux shops to resist the migration to the 2.4 kernels and instead continue using 2.2.19-era kernels. Obviously, compared to 2.4, the 2.2 series has many shortcomings -- like no zero-copy networking, the division of page cache and buffer cache in filesystem operations, big spinlocks (serializations of kernel execution paths for computers with more than one CPU) for many parts of the kernel, and so on.
A simple C program like the one below shows how kernels up to 2.4.9 had problems dealing with stress workloads on the VM system. If, after running this program, you turned the swap partition off with swapoff, your server or workstation would become totally unresponsive for up to 15 minutes.
```c
/* based on code originally proposed by Andrew Tanenbaum, later by Derek Glidden and many others since */
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Allocate 200MB (50,000,000 ints). The kernel does not actually allocate a
       virtual memory page until it is used, so we also have to touch every page.
       The amount allocated should reflect the total RAM on your computer; this
       test runs well on machines with, say, 256MB. */
    size_t size = 50000000 * sizeof(int);
    char *p = malloc(size);
    if (p == NULL)
        return 1;
    memset(p, 1, size);  /* write to every page so it must really be backed */
    sleep(12);           /* let the system calm down a bit after allocating pages */
    free(p);             /* and now release it all again */
    return 0;
}
```
Back in February 2001, I ran an informal and unscientific benchmark comparing FreeBSD 4.1.1 to kernel 2.4.0 (visit http://www.byte.com/documents/s=558/byt20010130s0010/) on exactly the same hardware and with exactly the same subsystem versions (MySQL, Sendmail, Apache, and others). The results clearly showed that, indeed, there were major problems with the efficiency and speed of the early 2.4 kernels.
A New VM
Then, on September 24, with the kernel standing at version 2.4.9, everything suddenly changed. Andrea Arcangeli, an Italian kernel hacker (read my interview with him two years ago at http://www.byte.com/documents/s=287/byt20000229s0008/) and a very prolific contributor, decided that enough was enough. He sat down and, in one of those marathon hacking bouts, completely rebuilt the VM from scratch. In short succession he sent Linus Torvalds over 150 patches to the 2.4.9 kernel to implement a new VM engine. This is a remarkable feat. A VM is a major piece of software and by nature very complex. It needs to satisfy many opposing objectives: efficient handling of server-type loads and interactive-type loads at the same time; ease of implementation and, at the same time, optimized use of every last feature of the CPU. The VM must also be able to run well on Intel CPUs spanning 4 or 5 generations, as well as on AMD chips, Alphas, MIPSes, Sparcs, ARMs, and what have you. Andrea, by the way, does all his development on a Compaq AlphaServer with two 500-MHz CPUs and 3 GB of RAM.
Out of the blue, Linus accepted the new VM and incorporated it into the official Linux kernel tree.
Recently, I spent two days with Andrea giving speeches. During the two days, over many bottles of beer, we had plenty of time to discuss his new VM. I was mainly interested in how the new VM affects Mosix. Because Mosix must migrate virtual memory pages belonging to the program's address spaces between cluster nodes, it is important to correctly understand the VM and interface efficiently to it.
Specifically, Andrea took exception to the following problems in the 2.4 VM:
- kswapd looping forever on DMA or NORMAL class-zones.
- swap+RAM not adding up to (almost) all of the available address space (modulo when the swap cache serves to avoid swapin of shared anonymous memory after a fork).
- swapout storms.
- benchmarks, when run repeatedly, gradually slow down.
The new VM is much simpler and faster. Let me explain how it works.
The old 2.4 VM had a major design problem that manifested itself mainly when freeing physically dirty pages (remember, dirty pages are 4-KB page frames in RAM whose contents have been modified and not yet written back to disk or swap). The last owner of the page (usually the VM, except in swapoff) has to clear the dirty flag before freeing the page. During swapoff it may be a little more complicated: we may need to grab the pagecache_lock to ensure nobody starts using the page while we clear it.
So, Andrea went and did the following: all physical pages are now divided into active and inactive pages, and each of these two groups is further divided into dirty and clean pages. When the active dirty pages reach about 66 percent of the total number of pages, the VM starts to scan them for the oldest ones, moves those to inactive dirty and then, later still, from there to swap when memory becomes tight. This part is central to the new VM, and its simplicity is...well, simply stunning.
This elegant mechanism totally changes the behavior of the 2.4.10 kernel under heavy load and also makes the system much more predictable. Another very important change is that swap is now added to RAM, just like in 2.2 times. All earlier 2.4 kernels (since 2.3.12) needed at least as much swap as RAM, and then more on top of that, to give you any additional virtual memory. This meant that on an 8-GB server, you needed to put aside almost a full 9-GB disk just to be able to swap, similar to some versions of Solaris or other UNIXes.
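Here is the swap arithmetic, under a simplified reading of the paragraph above. This is my own model, not taken from the kernel source. Before 2.4.10, swap first had to shadow RAM, so total virtual memory was roughly the larger of the two. With the new VM the two simply add up. Using the article's 8-GB server with 9 GB of swap:

```latex
V_{\text{old}} \approx \max(\text{RAM}, \text{swap}) = \max(8, 9) = 9~\text{GB}
\qquad
V_{\text{new}} = \text{RAM} + \text{swap} = 8 + 9 = 17~\text{GB}
```

So the same disk that bought only about 1 GB of virtual memory beyond RAM under the old scheme buys a full 9 GB under the new one.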
Finally, the page scanner doesn't scan at all if there are theoretically no freeable pages, whereas before it did. Oh, and the OOM killer never really worked, so Andrea disabled it, as I did in all my kernels. In 2.4.12 it is enabled again; this time, however, it works much better. Try it with the above program to see it in action.
Arcangeli's VM is stable, acts predictably -- something that the old VM never really achieved -- and it makes the swap space look like it did in 2.2 days. Additionally, the design is much simpler and easier to understand. People will catch up fast with it.
However, many kernel hackers disagree. Upon the release of kernel 2.4.10, a virulent and sometimes aggressive debate flared up, with many people trying to show why one of the two VMs was good and the other was not. Some comments got a bit out of control, and only in the last two weeks or so has some calm been restored.
However, one nasty side effect remains. Alan Cox, the number two man after Linus Torvalds, does not yet like the new VM, and in his own kernel tree (called the "ac tree") he continues to use and patch the old VM. As a consequence, users and system administrators now find themselves facing two very different kernel trees to choose from: the official Linux tree and the Alan Cox tree. Quite often, the latest driver patches and new features are only in Alan Cox's tree. Those who want to go with the official Linux source code may find themselves unable to apply those patches because the VM code differs all over the place. It is acceptable for the two trees to differ for a few days on a subsystem as important as the VM, but it is not acceptable for them to differ for months and across many kernel versions.
Nobody has yet dared to speak of a Linux source fork, but this is dangerously close to one.
It became obvious that the VM up to 2.4.10 was a design liability. You can try to fix something that was designed badly, but it will never become a beauty. I think Linus' decision to scrap the old VM and go with the Arcangeli VM was courageous and right. Having a functioning and stable Linux box should not be deferred to 2.5 when we can have it already with 2.4.
Kernel Preemption
But apart from the VM issues, there are other lively debates in the kernel community. There was an interesting interview at http://kerneltrap.com/article.php?sid=328&mode=thread&order=0 with Robert Love, who is leading one of two projects trying to make the Linux kernel fully preemptible. Making the kernel preemptible means making it possible to interrupt whatever the kernel is doing (say, executing a system call) to process some other outstanding task, and then return to the original task. Linux, as a multitasking OS, has obviously always done that for user-land processes. However, many people, Robert Love among them, feel that Linux's refusal up to now to let itself be interrupted while in the kernel has contributed to poor latency. Latency describes how quickly you can expect a response from your kernel when you actually need something from it. Note that Linux is not designed as a real-time OS (though there is at least one real-time Linux implementation out there), and therefore does not explicitly guarantee latency. User-land programs must be aware of this because, especially with kernel preemption, latencies can be very unpredictable.
Theoretically, an OS will respond faster if it can be interrupted. What does suffer from kernel preemption is overall throughput. If you have a task that needs n seconds in the kernel to complete (say, a given system call takes 0.005 seconds), then every interruption adds some overhead to switch from one kernel task to another. So finishing that system call will ultimately require n + o·p seconds, where p is the number of switches that occur during the call (the switching frequency times n) and o is the fixed overhead of one switch. Notice that kernel context switching does not invalidate the CPU cache, and is therefore not as expensive as process switching. However, kernel preemption will surely lead to more frequent switching from kernel space to user space, because upon preemption the scheduler might decide to give a user process higher priority.
In other words, kernel preemption does decrease latency, but it slows down overall throughput. It's simple math; there is nothing to be done about it.
Furthermore, in his interview, Robert Love heavily criticized Linus Torvalds for adopting Andrea Arcangeli's new VM in 2.4.10 and dropping the old van Riel VM.
Well, I did try the patch with kernel 2.4.12 and with pre13. While accurate measurement (which Robert Love provides with the preemption kernel patches) does indeed report an improvement in latency, for the life of me I have not noticed it on an empirical basis.
I really do appreciate Love's work, but I do not fully agree with some of his comments in the interview. First, as Linus himself said, if latency sucks in the kernel then we should check why it sucks, with or without preemptive scheduling. If the latency is bad in the stock kernel, then it should be fixed anyway.
The preemptive kernel 2.4.12 worked fine on my laptops and on my SGI 550 workstation where I do interactive work. The MP3 player very rarely skipped beats when doing heavy background work such as kernel compiling or opening large files in the editor. But for my servers and clusters, the decrease in performance and the unpredictability of latency is a problem. Also, some important patches will not apply to a Love-patched kernel. Mosix, the clustering kernel extension, does not patch correctly, and neither do some versions of the LIDS intrusion detection system.
It is up to each individual user to decide whether or not to use the patch, but it is important to understand the implications of using it.
Linux and FreeBSD Revisited
Upon returning home the other week after meeting with Andrea, I went to my lab and searched for the disk images of the server comparison I ran back in January of this year (of FreeBSD 4.1.1 versus Linux 2.4.0). I took the Compaq ML500 server I have been reviewing (2x 1-GHz CPUs, 2 GB of RAM) and upgraded the FreeBSD disk image to 4.4-Stable and the Linux image to 2.4.12. Then, I reduced the memory to 192 MB of RAM so as to stress the VM system more. I also upgraded to the latest stable versions of Sendmail (8.12.1) and MySQL (3.23.42). Finally, I compiled everything with the latest version of gcc, 3.0.2, and tuned the two instances to the best of my knowledge (softupdates and increased maxusers for FreeBSD, and untouched default values for Linux).
The results were very interesting indeed. Since this benchmark is too much to be handled in this article, Byte.com will post it here soon for you to read.
The story of this article is that the 2.4 kernel has finally grown up with the 2.4.10 release. Not many users outside the relatively small kernel community realize that. Now you know about it, too. Spread the good news and immediately install 2.4.12 on your busy server. The server will thank you for it.
Moshe Bar is a systems administrator and OS researcher who started learning UNIX on a PDP-11 with AT&T UNIX Release 6, back in 1981. Moshe has an M.Sc. and a Ph.D. in computer science and writes UNIX-related books.
For more of Moshe's columns, visit the Serving With Linux index page.
-
More here...
More info linked from here...
Includes links to more DMCA info, and some of Alan's thoughts on the matter
Alan Cox is a major figure in the Linux world. He maintains the 2.2 stable series, as well as a 2.4.x-ac stable series. When Linus Torvalds moves on to the 2.5 Linux development series (soon), Alan will be fully in charge of the current stable 2.4 series.
-
Keep systems programming alive ;)
You low-level folks are hard to find and are exactly the type of people who should be reading -- and contributing to -- Kerneltrap.com.
-
Results of OpenBSD's code audit
I was just wondering: are the results of OpenBSD's code audit shared with the other BSD projects (FreeBSD, NetBSD)?
(Aside to other Slashdot readers: By the way, www.kerneltrap.com is preparing to interview some developers at BSDi, so if you're interested in systems programming or kernel architecture, why not suggest some good questions there as well?)