Sangui5 · Slashdot Mirror

Re:Back and forth on Linux Hardware Looks at Core 2 · 2006-08-23 07:38 · Score: 1

OTOH, Intel consistently manages to make smaller SRAM cells at a particular process size than anybody else--which was a godsend to their cache hungry long pipelines. So yeah, their logic may be bigger than you'd expect, but they can really pack in the cache to make up for it a bit.

Re:Offline rootkit scanner? on Windows Rootkit Wars Escalate · 2006-07-13 06:51 · Score: 1

You mean like Tripwire? http://www.tripwire.com/products/enterprise/server s_desktops.cfm

Tripwire, btw, says that it "can monitor Windows alternative data streams to detect and log changes, additions, and deletions. FS and DT components help prevent negative impact from hidden, unknown, or malicious configuration data to improve baseline control." So it would be able to detect this.

Re:Actually followed this... on Supreme Court to Rule on 'Obvious' Patents · 2006-06-27 03:53 · Score: 4, Interesting

some of which would probably lose patent protection if this gets rid of the suggestion requirement.

Or rather, virtually all. There really aren't that many truly new drugs--mostly just applying a few standard tricks to old drugs to extend the patent protection. The worst (IMHO) are:

1) Obvious compounding. A good example is pain medication. Acetaminophen (Tylenol) has an unusual method of action which is synergistic with nearly every other analgesic, and rarely interacts with other drugs. So, the drug company will file a patent on their new painkiller, and then (just before the patent is made public/the drug is approved), they'll patent mixing it with acetaminophen. Doctors prefer prescribing the mixture because it has a percieved lower risk of abuse (due to the liver toxicity of acetaminophen), so the generic unmixed version isn't used so much.

2) Racemic mixtures. Many drugs have left handed and right handed versions. Often, one version or the other is more effective/safer. Especially since the thalidomide incident (anti-nausia drug where one versoin (left?) caused birth defects) testing both versions is standard. Yet the drug companies can get separate patents on the left, right, and mixture versions. Sometimes, the patent on the left or right can be used to control the mixture, especially if it is difficult to make just one version or the other. Regardless, it gives the company a "new" drug to market and to compete with the generics. Prilosec and Nexium are an example of this.

3) Particle size patents. Hmm, it just so happens that a certian size granule is "better" than others, and the standard manufacturing technique (whose patent is expiring) makes that particle size (or at least contains it)...

4) Time release/enteric versions. Coating something (with a standard, commonly used coating) to make it time released or gentle on the stomach isn't obvious, for some silly reason.

Sometimes I wonder if the problems with the high cost of healthcare aren't really caused at all by the healthcare providers or insurance companies, but are almost entirely a regulatory problem--stupid patents on drugs & medical devices driving costs up.

Re:Folks always forget the VAT on Nintendo Announces Japanese Wii Price · 2006-05-25 06:16 · Score: 1

Prices don't include tax mostly because the tax structure is complex. First, there is no national sales tax. Never has been, and likely never will be. There are national taxes on some specific items (e.g. gasoline, tobacco, and until just recently telephone service), but no general sales tax.

There are still, however, up to three different govenment bodies that can impose tax: the states, the counties, and individual cities. Which of these bodies does (or is allowed to) varies from place to place. The problem is that geographically, these entities can be quite small, and so you can't run an advertising campaign and report the post-tax price accurately. Two stores 10 feet away from each other can have different tax rates. But those two stores are serviced by the same media--radio, television, newspapers, billboards, etc. I've been to the same national chain restaraunt in the same metropolitan area and paid three different after-tax prices for the exact same meal. Reporting the post-tax price in infeasible.

Another compliation is that different areas treat different items differently for tax purposes. You can get different tax rates on food, prepared food, and general goods. Some places tax takeout food differently than eating in the restaraunt. Also, services aren't generally taxed (even though that would count as a "value add" under a VAT system), so when I go to a mechanic, he splits the charge between labor and parts. I get taxed only on the parts, but putting them in is untaxed.

It is important to remember that the US is geographically twice as large as the EU. Many of our states are larger than many countries (esp. Texas, or California). People in different areas have different opinions about how things ought be taxed--it would be like Italy deciding about how to tax things in the UK.

Re:Wrong Side of Bed? on Torvalds Has Harsh Words For FreeBSD Devs · 2006-04-21 07:39 · Score: 1

From the article:
make the vmsplice call block or just return -EAGAIN if we are not ready yet

Returning -EAGAIN would be not-blocking.

Re:Wrong Side of Bed? on Torvalds Has Harsh Words For FreeBSD Devs · 2006-04-21 07:36 · Score: 1

Unfortunately, CPU cycles is not the end-all-be all of performance.

For details, see my old comment on the topic. The short version is that the "fast" JVM allocaters return cold memory (the coldest, in fact) which in all likelyhood is NOT in cache. So you pay about 300 instructions worth of time waiting for a cache fill. If you're under memory pressure, you may miss main memory entirely with the JVM allocator--that'll cost you, oh, 24 million clock cycles waiting for the disk to bring in your page.

malloc() has to think harder about what memory to give you back, but it gives you hot memory, which is probably even still in L1, and nearly guaranteed to be in main memory still.

This sort of thing hits the heart of why Linus is ranting--it is important to consider all aspects of what is going on in order to tune performance. In this case, the BSD implementation does zero-copy for a buffer, but in many cases will end up copying whole pages and incurring two page faults--worse than just copying the buffer in the first place. That is, unless you jump through hoops (tiny ones which are on fire) to avoid the COW. If you're willing to do that, you may as well use the explicit interface the Linus proposes--it is simpler that avoiding the COW, even though it is more complex than using BSD & not worrying about COW overhead.

Now, GC is great and all for saving programmer effort, but it isn't that great for performance. I can put a medium amount of effort into using explicit heap management, and get fast results. I can put minimal effort into GC and get poor performance. Or I can put a lot more effort into playing games with the "smart" GC, and get slightly slower results. If you need the performance, you'd just better go with manual management in the first place.

Re:Wrong Side of Bed? on Torvalds Has Harsh Words For FreeBSD Devs · 2006-04-21 07:12 · Score: 3, Informative

Then it would operate quickly on FreeBSD. The problem then becomes exactly when do you free all those malloc()s?

No, it'd be slower than just copying on FreeBSD too.

while(read < totalSize){
buffer = malloc(1024); //1024 is < pagesize!
length = fread(buffer, 1, 1024, &file);
read += length; //Do some stuff, but don't free the buffer!
}

This is where VM games really bite you in the ass, because you get false sharing. Even if you never reuse the buffer, this can cause 3 copies--each group of 4 (3.99ish) buffers will be on the same page, and therefore each call will cause a fault from the previous one.

In theory the OS could be allow itself write & check for overlapping calls (& avoid the COW fault), but note that the read() example really isn't interesting for zero-copy unless you're using hardware TCP offloading. Zero copy is more interesting for write(). The usual case is then:

while(){
b = malloc();
fill_in_buffer(b);
write(b);
}

and that fill_in_buffer step *must* cause a fault if sets of buffers are on the same page. To avoid COW faults you have to be really careful that you don't accidentally write to the same page as the buffer--even indirectly by malloc updating it's inline data structures. That's pretty nasty to do--the easiest way is to allocate 8K at a time, and use a page-aligned chunk from the middle of it. Talk about a waste of memory.

Re:Best solution? on Contact Lenses for Computer Professionals? · 2006-04-13 09:29 · Score: 2, Informative

WaveFront is where they analyze for "more complex" distortion. Most glasses have two portions--the spherical (simple near-sightedness), and cylindrical (for basic astigmatism). Most contact lenses only do spherical--Toric lenses do cylindrical too.

However, there are more ways your vision can be distorted than spherical and cylindrical. The idea behind WaveFront is to analyse these other forms of astigmatism and include a correction for them, too.

You can also get expensive contacts with these corrective terms, & I'm told they work fairly well. Glasses are available too, but they don't work so well. The laser-eye-surgery versions work extremely well. An added bonus is that wavefront lasik tends to have much lower rates of night vision problems such as haloing.

Intralase replaces the first step of lasik. Usually first you cut a flap with a hand-held knife. Scary. Intralase uses a laser instead. Benefits are faster healing, less infection, less side-effects.

Re:Emulation would have worked too? on Skype 5-way Calling Limit Cracked · 2006-03-04 07:02 · Score: 1

x86 is a pain to virtualize--many "privledged instructions" have two (nearly identical) versions--one for privledged mode which does what it ought, and another for unprivledged modes, which does something else (e.g. silently fail). It would be much easier to virtualize if they either worked or generated a fault for the virtual machine monitor to catch, but such is not the case.

Therefore, VMWare does dynamic binary translation to catch instructions which don't generate a fault when executed in an unprivledged context. Calling CPUID does touch one memory page--the code page. VMWare will decompile your code before running it. In the case of self-modifying code, it will remove your execution permission after you start writing, and then check again before letting you run it. Any calls to instructions that ought be virtualized are replaced with code which will fault, and the virtual machine monitor will emulate them. That is, they dynamically change the binary. Yes, it is horribly complex. Yes, it can occasionally be dog slow (although VMWare has done a good job of making it as fast as possible). Yes, it would be much better to be able to avoid DBT (and you can if you have a Pacifica or VT chip). Unfortunately, it is the only way to virtualize an unmodified* OS without even more massive amounts of emulation. Which is why VMWare is "fast" and QEMU is "slow".

* Xen uses paravirtualization, where the guest OS has been slightly modified to make virtualization easy. So it sidesteps the whole problem, but can't run an unmodified Windows** (that is, without VT or Pacifica support).

** It can run a modified Windows, and does. However, they can't distribute such modified Windows to you, so you're SOL.

Re:Not a load of dung, just expensive on Internet Immunization · 2005-12-02 13:59 · Score: 1

Read this. Tested against Blaster, Slammer, and Code Red. It gives specifics of what signitures it generated, how fast they were generated, how fast they were distributed, and how well the infection was contained in their simulated outbreaks. Also includes an analysis of how it would work if the worm tried to DOS the detection system itself.

If you insist on a dead tree copy, well, the printed proceedings are generally only distributed at the conference itself, which is already past, and you can't have mine. Maybe you can buy them from the ACM somehow, since they organize SOSP. I'd assume a link to the PDF posted by the authors would be the most useful way to point you at work which does what you want.

As for moving the boundaries, the story talks about stopping a fast-spreading internet worm. I'm wasn't away that we were discussing something else.

You can detect most other worms when they try to write to the disk to a file they oughtn't (again, the honeypot server is yours so you KNOW where it should be writing, e.g. the log files and nowhere else), or if it executes code that it shouldn't. Since it is on a VM, it can't slip writes past you, because your disk is emulated. If you intercept every instruction fetch (which is what makes this slow), you can verify that the executable bits you are fetching match the ones that you started with. Or use the NX bits. Since I totally "own" the VM, you can only hide by not doing anything. And if you don't do anything, then how do you expect to spread?

Re:Not a load of dung, just expensive on Internet Immunization · 2005-12-02 13:21 · Score: 2, Insightful

Lord Kelvin, president, Royal Society, 1895 "Heavier-than-air flying machines are impossible."

The difference here is that Lord Kelvin said it before it had been done.

The problem, boiled down to its smallest, is to find inputs to the computer which cause it to emit bad outputs (e.g. cause it to try to spread the worm). We control the honeypot, so we can strictly classify what good outputs are (generally nothing, or some small set of fixed responses)--everything else is therefore bad. Any message to the honeypot, therefore, can be easily classified into causing a bad output or not.

If our signatures are composed of the inputs (in their entirety) which cause bad outputs, there can be no false positives--if that input is fed into the same system, it will spread the worm. Hosts recieving the signature can verify this by testing the signature in a virtual machine. "Gee, I fed it into my machine, and it started spewing traffic all over the place! Guess that really is a worm."

This is less than ideal for polymorphic worms (because you only get one signature), but polymorphic worms are slower than non-polymorphic ones, so they aren't as much of a threat (there are techniques for detecting polymorphic worms but they have non-zero (but quite small) false-positive rates). Also, worms which don't cause the honeypot to output anything for a long time can also slip by with false negatives. But if the worm takes a long time to spread itself, then it is, by definition, NOT a fast-spreading worm, and NOT the target of an automatic immune system.

Most work makes a trade off between a small false positive rate to faster/more powerful detection--here, false positives are measured in the 1 to a billion, or even lower. They also shortcut the detection some--you just need to be running code that wasn't on the machine to start with. Unless your web server is in the habit of accepting code from strangers to run, this is a surefire indication of a bad input.

Of course, these improvements aren't necessary to show that it is possible to have zero-false-positive detection; the scheme I describe above will work. Everything else is tradeoffs to make it faster, more sensitive, etc.

If you don't want to "wade" through lots of work, try just one: Vigilante, Unlike the paper from the story, Vigilante is actually implemented, and has been tested on simulated worm outbreaks using real worms. It also covers the current art of the field.

Re:Vigilante on Internet Immunization · 2005-12-02 12:03 · Score: 1

Yep, the story article is of rather low quality. As I state in my earlier post, they neglect quite a bit of good CS work, and instead cite such CS heavyweights as "Physical Review E".

Except for Yuval Shavitt the authors barely even register in DBLP (a database of CS bibliographies). Not big players in the CS community, and obviously not fully aware of the existing work.

Re:Autoimmunity on Internet Immunization · 2005-12-02 11:53 · Score: 1

The general way to implement such monitoring honeypots is to run them in a virtual machine. So, the virus writers cannot cause damage to anything important.

On the other hand, the virus writers have wised up to the VM trick--there are already exploits which try to detect that they are running in VMWare and refuse to behave "normally" if they are. The solution is, of course, better VMs (and better VM detectors, and better yet VMs...).

Not a load of dung, just expensive on Internet Immunization · 2005-12-02 11:12 · Score: 4, Informative

There are a lot of techniques to do automatic identification of viruses, the problem is that they are too expensive for everyday use--your programs run 40x slower or worse. Below is a selection (small and randomly generated) of related work.

Mostly, you need to do extensive monitoring of what your program is doing, and look for out-of-bound writes (e.g. buffer overflows/stack smashing), or do taint analysis (that is, don't execute or make "important" decisions based on data "tainted" from an untrusted source). But this requires performing many anaysis operations for every "real" operation, so it isn't feasible to do everywhere.

Just google the titles for electronic copies.

Kreibich, C., and Crowcroft, J. Honeycomb - creating intrusion detection signatures using honeypots. In HotNets (Nov. 2003).

Kim, H., and Karp, B. Autograph: Toward automated, distributed worm signature detection. In USENIX Security Symposium (Aug. 2004).

Zou, C. C., Gao, L., Gong, W., and Towsley, D. Monitoring and early warning for internet worms. In ACM CCS (Oct. 2003).

Wilander, J., and Kamkar, M. A comparison of publicly available tools for dynamic buffer overflow prevention. In NDSS (Feb. 2003).

Newsome, J., and Song, D. Dynamic taint analysis: Automatic detection and generation of software exploit attacks. In NDSS (Feb. 2005).

This isn't a very good paper. on Internet Immunization · 2005-12-02 10:59 · Score: 2, Interesting

I didn't know that Nature was such a high end CS publication. At SOSP this year Vigilante (http://research.microsoft.com/~manuelc/MS/Vigilan teSOSP.pdf) was presented--a much more complete paper in a more salient venue.

The citations list at the end of the Nature paper also is missing a large body of relevant work. Check the citations list of the Vigilante paper for details--50 references most of which are missing from the Nature pub. Also, the publications the Nature paper cites are mixed--some are good (like http://www.icsi.berkeley.edu/~nweaver/containment/ ), but I don't think the editors of "Physical Review Letters" (a physics journal) are really up to speed on the latest in computer security research. Indeed, most of the works they cite are either from physics journals, Nature, or Science.

The analysis is quite math heavy, and makes some unrealistic assumptions (i.e. worms only spread to their neighbors). In the end, they "show" that it is theoretically possible to stop worms with a side-channel network. Vigilante, on the other hand, has an implementation of a vaccination system, and simulation results run against Blaster, Slammer, and Code Red. Now, which is more convincing to you?

Re:From a Coder in Rural America on Outsourcing to Rural America · 2005-11-22 07:50 · Score: 1

For the parent: Champaign, Urbana, or Bloomington?

Also, (for those who belive that "rural" areas have nothing to do): check out the Krannert Center (http://www.krannertcenter.com/) and Assembly Hall (http://www.uofiassemblyhall.com/) web sites. Big name acts come out here, and the tickets are (gasp!) affordable without huge lines to buy them. Even the shows with people lining up to get tickets will still have some not-bad seats available a day or two before the show.

Just because there aren't skyscrapers doesn't mean there's nothing to do.

Re:Was it obvious? on JPEG Patent Challenged · 2005-11-18 04:46 · Score: 1

Compress was an implementation of LZW, based on Welch's 1984 paper in Computer. Only later was I informed that it was patented. After it had been incorporated into Berkeley Unix releases and into the GIF format. I was happy when that patent finally expired, but I had absolutely no doubt of its legitimacy.

Although both LZ77 and LZ78 are clearly legitimate, LZW is questionable. The "new idea" in LZW boils down to limiting the size of your LZ78 dictionary. Given that using an LZ78 dictionary of unlimited size is clearly a non-starter, limiting the dictionary size seems somewhat obvious to me. The other "ideas" in LZW all follow simply from doing LZ78 with a fixed dictionary.

Not that this would have changed much legally--compress would have still fallen under the scope of the LZ78 patent

Re:And while they are at it... on SCO Demands Linux 2.7 Information · 2005-11-08 06:41 · Score: 1

That'll do them no good until they get a Phantom Game Console to run it on.

Re:Except SBC won't offer naked DSL. on SBC CEO: Pay up if you want to use our pipes · 2005-10-31 06:27 · Score: 1

But what you can do on an SBC voice line is buy metered voice service, and then not use it in leu of VoIP. And in some areas even SBC is (allegedly) unbundling the line and the service, depending on the state. It's called local loop unbundling, and there is litigation and public service commision hearings about it pending/ongoing in all of the states and in DC.

Re:What a great 20th Century business model !!! on SBC CEO: Pay up if you want to use our pipes · 2005-10-31 06:23 · Score: 1

This is right up there with banks and bank tellers: I used to interact with a human teller and get my funds, or deposit the til. Now, I 'interact' with a machine - a machine that doesn't have health care, disability insurance or a retirement plan - and the bank charges me - ME! - for the privilege of said machine use.

I don't know about you, but every bank I have an account with still has human tellers, and I still visit them every time I need to make a deposit or withdrawl. I've been to an ATM about 3 times ever--I've been to a human teller 3 times this month. Perhaps you need to shop for a bank that doesn't suck.

Re:What a bunch of FUD on Java Urban Performance Legends · 2005-10-10 03:49 · Score: 1

The miss penalty times I googled for review site where they'd benchmarked it. If you look, you can probably get more accurate numbers from Intel. For the disk latency, the manufacturer of a disk will generally report some sort of access latency number, but overall, a "fast" 7200 RPM disk has a latency of about 8 ms.

So far as how malloc() works, well, it is open source. But some discussion on malloc design can be found at http://gee.cs.oswego.edu/dl/html/malloc.html. More simple information can be found linked at http://www.cs.utk.edu/~plank/plank/classes/cs360/l ecture_notes.html (again, just a quick google).

Mostly, it comes down to that my area is systems and architecture. My job is to know all those little details about how things interact. This was just a quickie. No guarantees it is right--I'm using other people's numbers, and just my recollection of how malloc() behaves. If this were more serious than /. then I'd dig through Intel's processor manuals for more exact numbers, figure out which (if any) of those latencies can happen in parallel, measure things myself to verify, etc. I'd also write a small program to do a lot of mallocs and frees and see what addresses are returned to verify the locality properties.

If I were really really serious, I'd come up with a benchmark program that allocates and frees memory, of various sizes with various working sets and object lifetimes, and measure the execution time for both Java and C. I'd use something like VTune (http://www.intel.com/cd/software/products/asmo-na /eng/vtune/index.htm) or SimICS (http://www.virtutech.com/) (with a timing model, say GEMS (http://www.cs.wisc.edu/gems/tutorial.html)) to analyze exactly where the differences are coming from.

And then I'd write it up and submit the results to SIGMETRICS, PLDI, OOPSLA, or some other conference, because I just spent a metric butload of time tracing down shortcomings in the GC implementation of various JVMs as compared to programmer managed memory allocation. If somebody is willing to pay me, I still might, but as it is, back of the envelope is all you'll get.

Re:What a bunch of FUD on Java Urban Performance Legends · 2005-10-09 11:21 · Score: 3, Informative

The allocators in the current JVMs are smart enough to optimize for cache misses
Allocation in any modern JVM is a pointer increment, nothing more

So which is it? Smart enough to optimize for cache misses or just a pointer increment? You can't have both.

The author of the article claims that, for 'new' objects, a modern VM uses a "stop-and-copy" style garbage collector with a few twists to cheapen the liveness detection (i.e. escape analysis). A disadvantage of such an allocator is that any new allocation to the heap will happen in some of the least-recently used memory. Also, every time you GC, you have to relocate all of your memory, forcing you to horribly pollute your cache, and turning all of your previously hot pages cold.

unlike malloc() which returns whatever the OS deems ok at the moment.
malloc() is a library call, not a system call. The OS is only involved if you need to extend the heap or if you've just allocated a very large object. So far as 'deems ok at the moment', the most recently used memory is what is generally returned (plus or minus size fitting to keep fragmentation under control). This maximizes locality, and minimizes your physical footprint. With a smaller resident set, more physical memory is available for things like other programs, disk buffers/caches, etc. which leads to even better performance.

A simple allocator like described in the article may be "cheap" in terms of cycles, but it is very expensive in its side-effects.

Re:What a bunch of FUD on Java Urban Performance Legends · 2005-10-09 11:07 · Score: 1

malloc() does not context switch into the kernel for most allocation events. The general case is that it looks at your request size, and then it goes to its list of that size and gives you the top chunk. malloc() is a library call, not a system call.

malloc() will call into the kernel for two rare cases. The first is if it needs to extend the heap, which happens as you've consumed a lot of heap. By default most of your address space is not mapped by the OS to VM pages--accesses to unmapped pages is a segmentation fault. malloc() will call brk() to ask the OS to extend the heap--in multiples of a page at a time, and only occassionally. This is only near the "normal" case if you never free memory.

The other time malloc traps into the kernel is if you've allocated a very large chunk in one call (I believe the glibc implementation considers 1MB to be large). In this case, malloc will request a separate mmap from the OS, separate from the normal heap space. Since chunks this large are rare, this too is not a signifigent source of context switch overhead.

Re:What a bunch of FUD on Java Urban Performance Legends · 2005-10-09 06:32 · Score: 5, Informative

Additionally, even if it takes more instructions per call, glibc malloc() will return freshly used "hot" memory to you, while the JVM will return the coldest memory to you--the author of the article admits that this is a problem, but it is worse than they say.

First, you definately have an L1 cache miss. On a P4, that is a 28 cycle penalty. You also likely have an L2 cache miss. Fast DDR2 memory has an access latency of about 76 ns--call it 220 processor clock cycles. If you miss the D-TLB, then add another 57 cycles. Total cost of your "fast" allocator--305 cycles of memory access latency. 305 is somewhat higher than 100.

Even worse, if you are on a system with a lesser amount of memory, you may miss main memory entirely, causing a page fault. That'll cost you several milliseconds waiting for the disk. That really really hurts, since 8 milliseconds latency from disk is twenty-four million cycles at 3 GHz. 24 million, 24,000,000, 2.4 * 10^6, no matter how you write it, that's a lot of cycles to allocate memory.

All of a sudden, 100 instructions to hit hot memory seems cheap.

Re:from MPI to multithreaded ? on SGI to Scale Linux Across 1024 CPUs · 2004-07-19 05:00 · Score: 1

and can scale to tens of thousands of nodes

The largest current machines (in terms of # of processors) I can find reference to are ASCI Q, BlueGene/L DD1, and ASCI White (Los Alamos, IBM, and Lawrence Livermore respectively). Each has 8192 processors. ASCI Red at Sandia is 9632, but aging (from 1999). Even if those processors are in clusters of 1 processor per node (they aren't--BlueGene isn't even a cluster), that is not tens of thousands. Unless you're talking about some classified machines, tens of thousands is out. The communication interconnects for this scale of machine stop scaling so well past 1000 nodes or so. Many of the largest of the largest clusters are, in production, split into smaller subclusters most of the time.

Thus it is economically stupid to run MPI apps on a shared memory computer.

Hence why I pointed out that NCSA still has 3 large cluster machines, including the 3rd fastest in the world (the Earth Simulator and BlueGene aren't traditional clusters in the strictest sense). Also, economy isn't everything in HPC. If it is faster to run an MPI job on a shared-memory machine, then there are times that it will make sense to do so, even though running it on a cluster would be cheaper. Remember, it would be cheapest and most efficient of all to run everything on one processor, except that everything would take forever to finish.

Gang schedulers exist and are in use on clusters too.

Gang scheduling has absolutely nothing to do with the lack of premption. Gang scheduling means that a job will get all of the processors it needs at once. Which is one of the causes of the inefficiency.

Suppose you have a 100 node cluster. Your job queue has 60 jobs that each want 1 node, 1 that wants 50 nodes, and 1 that wants the whole cluster (arriving in that order). Suppose also that the 50 node job is currently running, and another 50 node job just finished (so you have 50 unused nodes). Finally, assume that more small jobs are being entered into the queue relatively frequently.

Since the scheduling is non-preemptive, in order to run the 100 node job, the entire cluster must first be idle. Suppose you start allocating the 1 node jobs in the 50 free nodes, since they fit. They each take an unpredictable amount of time to run, so when any one finishes, the others are likely to be still going. Even when the 50 node job finishes, you'll have some number of small jobs running (remember, more are entering the ready queue). You *cannot* preempt them, so the 100 node job cannot run. If you keep scheduling jobs that fit, it is likely that it will never run.

Suppose you run jobs in strict order of arrival. Then the most likely outcome will be that a few of the 1 node jobs will take much longer than the others. You'll have most (99?) of the nodes free, but one or two straggler jobs still going. That 100 node job can't run, and although the jobs after it would fit, they can't run either. You have to sit with most of your cluster idle.

The schedulers in use don't do first come first served, and make an effort to keep efficiencies high, but it is still the case that large numbers of nodes lie idle when there are jobs that could fit in the idle nodes, in an effort to get the larger jobs through. If it was possble to suspend jobs, aka preempt them, this wouldn't be a problem, but on current clusters, that isn't possible (there are some research-grade cluster systems that can, but they have other very bad weak points (i.e. proprietary MPI extensions), and aren't in production use). It also isn't feasible/allowable to share a node amoung more than one job--for the same reasons that gang scheduling is efficient.

On large shared memory machines, however, preemptive scheduling is much easier. Whether this particular machine will allow it or not, I don't know. But it is much more likely than on the clusters.

Slashdot Mirror

User: Sangui5

Comments · 455