Slashdot Mirror


Linux Needs Resource Management For Complex Workloads

storagedude writes: Resource management and allocation for complex workloads has been a need for some time in open systems, but no one has ever followed through on making open systems look and behave like an IBM mainframe, writes Henry Newman at Enterprise Storage Forum. Throwing more hardware at the problem is a costly solution that won't work forever, he notes.

Newman writes: "With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow. I think the time has come for Linux – and likely other operating systems – to develop a more robust framework that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries."

25 of 161 comments (clear)

  1. Re:This obsession with everything in RAM needs to by Lisias · · Score: 5, Insightful

    I know you're afraid of the garbage collector, but it won't bite. I promise.

    Yes, it will. It's not common, but it happens - and when it happens, it's nasty. Pretty nasty.

    But not so nasty as micromanaging the memory by myself, so I keep licking my wounds and moving on with it.

    (but sometimes would be nice to have fine control on it)

    --
    Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
  2. From the "is it 2005? department" by dbIII · · Score: 2

    "next-generation technology like non-volatile memories and PCIe SSDs"

    That generation has been going on for a while storagedude. People have been scaling according to load to deal with it.

    1. Re:From the "is it 2005? department" by K.+S.+Kyosuke · · Score: 2

      That's the former, not the latter, but OK. (I also said "in many places", one would have thought it obvious that these things sort of trickle down from the top over time, especially given the initial limitations on the technology.)

      --
      Ezekiel 23:20
  3. Re: This obsession with everything in RAM needs to by JMJimmy · · Score: 5, Funny

    Boobs.

  4. This belongs in the cluster manager by Animats · · Score: 4, Informative

    That level of control probably belongs at the cluster management level. We need to do less in the OS, not more. For big data centers, images are loaded into virtual machines, network switches are configured to create a software defined network, connections are made between storage servers and compute nodes, and then the job runs. None of this is managed at the single-machine OS level.

    With some VM system like Xen managing the hardware on each machine, the client OS can be minimal. It doesn't need drivers, users, accounts, file systems, etc. If you're running in an Amazon AWS instance, at least 90% of Linux is just dead weight. Job management runs on some other machine that's managing the server farm.

    1. Re:This belongs in the cluster manager by K.+S.+Kyosuke · · Score: 2

      Honestly, in MVS (z/OS), it probably makes perfect sense to have this in an OS, especially if you're paying through the nose for the hardware already. But solving it on the VM level surely makes it a huge win for everyone.

      --
      Ezekiel 23:20
    2. Re:This belongs in the cluster manager by Tough+Love · · Score: 2

      If you're running in an Amazon AWS instance, at least 90% of Linux is just dead weight

      Which 90% would that be, and in what way would it be dead weight? If you don't mind my asking.

      --
      When all you have is a hammer, every problem starts to look like a thumb.
    3. Re:This belongs in the cluster manager by Lennie · · Score: 4, Interesting

      Yes and no.

      No, large (Linux using) companies like Google, Facebook, Twitter have always used some kind of Linux container solution, not virtualization.

      Yes, policy is controlled by the cluster manager.

      But for example Google uses nested CGroups for implemeting those policies for controlling resources/priorities on their hosts.

      Virtualization is very ineffcient and Docker/Linux containers are a perfect example of how peole are starting to see that again:
      https://www.youtube.com/watch?... / https://www.youtube.com/watch?...

      Suppposedly, CPU utilization on AWS is very low, maybe even only 7%:
      http://huanliu.wordpress.com/2...

      The reason for that is, is that VMs get allocated resources they never end up using. Because the host kernel/hypervisor doesn't know what the VM (kernel) is going to do/need.

      For their own services Google doesn't use VMs, but Google does offer VMs to customers and to control the resources used by VM they run the VM inside a container.

      Here are some talks Google did at DockerCon that mentions some of the details of how they work:
      https://www.youtube.com/watch?...
      https://www.youtube.com/watch?...

      --
      New things are always on the horizon
  5. Linux Cgroups by corychristison · · Score: 3, Informative

    Is this not what Linux Cgroups is for?

    From wikipedia (http://en.m.wikipedia.org/wiki/Cgroups):
    cgroups (abbreviated from control groups) is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups.

    From what I understand, LXC is built on top of Cgroups.

    I understand the article is talking about "mainframe" or "cloud" like build-outs but for the most part, what he is talking about is already coming together with Cgroups.

    1. Re:Linux Cgroups by Anonymous Coward · · Score: 2, Informative

      the article is not about "mainframe" or "cloud"... it is "advertising" for IBM... a company in the middle of multi-billion dollar deals with apple, all the while fighting to remain even slightly relevant.

      IBM has the magic solution to finally allow the world to run simple web queries.

      FUCK OFF

  6. Is this real or fantasy? by m00sh · · Score: 3, Interesting

    I read the article and I can't tell if this is a real problem that is really affecting thousands of users and companies, or a fantasy that the author wrote up in 30 minutes after having a discussion with an old IBM engineer.

    Sure, IBM has all these resource prioritization in mainframes because mainframes cost a lot of money. Nowadays, hardware is so cheap you don't have to do all that stuff.

    If some young programmer undertook the challenge and created the framework, would anyone use it and test it? Will there be an actual need for something like this?

    My point is that an insider information to what is really going on in the cutting edge usage of linux or just some smoke being blown around to an obligated write up.

  7. complex application example by lkcl · · Score: 4, Insightful

    i am running into exactly this problem on my current contract. here is the scenario:

    * UDP traffic (an external requirement that cannot be influenced) comes in
    * the UDP traffic contains multiple data packets (call them "jobs") each of which requires minimal decoding and processing
    * each "job" must be farmed out to *multiple* scripts (for example, 15 is not unreasonable)
    * the responses from each job running on each script must be collated then post-processed.

    so there is a huge fan-out where jobs (approximately 60 bytes) are coming in at a rate of 1,000 to 2,000 per second; those are being multiplied up by a factor of 15 (to 15,000 to 30,000 per second, each taking very little time in and of themselves), and the responses - all 15 to 30 thousand - must be in-order before being post-processed.

    so, the first implementation is in a single process, and we just about achieve the target of 1,000 jobs but only about 10 scripts per job.

    anything _above_ that rate and the UDP buffers overflow and there is no way to know if the data has been dropped. the data is *not* repeated, and there is no back-communication channel.

    the second implementation uses a parallel dispatcher. i went through half a dozen different implementations.

    the first ones used threads, semaphores through python's multiprocessing.Pipe implementation. the performance was beyond dreadful, it was deeply alarming. after a few seconds performance would drop to zero. strace investigations showed that at heavy load the OS call futex was maxed out near 100%.

    next came replacement of multiprocessing.Pipe with unix socket pairs and threads with processes, so as to regain proper control over signals, sending of data and so on. early variants of that would run absolutely fine up to some arbitrarry limit then performance would plummet to around 1% or less, sometimes remaining there and sometimes recovering.

    next came replacement of select with epoll, and the addition of edge-triggered events. after considerable bug-fixing a reliable implementation was created. testing began, and the CPU load slowly cranked up towards the maximum possible across all 4 cores.

    the performance metrics came out *WORSE* than the single-process variant. investigations began and showed a number of things:

    1) even though it is 60 bytes per job the pre-processing required to make the decision about which process to send the job were so great that the dispatcher process was becoming severely overloaded

    2) each process was spending approximately 5 to 10% of its time doing actual work and NINETY PERCENT of its time waiting in epoll for incoming work.

    this is unlike any other "normal" client-server architecture i've ever seen before. it is much more like the mainframe "job processing" that the article describes, and the linux OS simply cannot cope.

    i would have used POSIX shared memory Queues but the implementation sucks: it is not possible to identify the shared memory blocks after they have been created so that they may be deleted. i checked the linux kernel source: there is no "directory listing" function supplied and i have no idea how you would even mount the IPC subsystem in order to list what's been created, anyway.

    i gave serious consideration to using the python LMDB bindings because they provide an easy API on top of memory-mapped shared memory with copy-on-write semantics. early attempts at that gave dreadful performance: i have not investigated fully why that is: it _should_ work extremely well because of the copy-on-write semantics.

    we also gave serious consideration to just taking a file, memory-mapping it and then appending job data to it, then using the mmap'd file for spin-locking to indicate when the job is being processed.

    all of these crazy implementations i basically have absolutely no confidence in the linux kernel nor the GNU/Linux POSIX-compliant implementation of the OS on top - i have no confidence that it can handle the load.

    so i would be very interested to hear from anyone who has had to design similar architectures, and how they dealt with it.

    1. Re:complex application example by Mr+Thinly+Sliced · · Score: 5, Insightful

      > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation.

      I stopped reading when I came across this.

      Honestly - why are people trying to do things that need guarantees with python?

      The fact you have strict timing guarantees means you should be using a realtime kernel and realtime threads with a dedicated network card and dedicated processes on IRQs for that card.

      Take the incoming messages from UDP and post them on a message bus should be step one so that you don't lose them.

    2. Re:complex application example by lkcl · · Score: 4, Informative

      > the first ones used threads, semaphores through python's multiprocessing.Pipe implementation.

      I stopped reading when I came across this.

      Honestly - why are people trying to do things that need guarantees with python?

      because we have an extremely limited amount of time as an additional requirement, and we can always rewrite critical portions or later the entire application in c once we have delivered a working system that means that the client can get some money in and can therefore stay in business.

      also i worked with david and we benchmarked python-lmdb after adding in support for looped sequential "append" mode and got a staggering performance metric of 900,000 100-byte key/value pairs, and a sequential read performance of 2.5 MILLION records. the equivalent c benchmark is only around double those numbers. we don't *need* the dramatic performance increase that c would bring if right now, at this exact phase of the project, we are targetting something that is 1/10th to 1/5th the performance of c.

      so if we want to provide the client with a product *at all*, we go with python.

      but one thing that i haven't pointed out is that i am an experienced linux python and c programmer, having been the lead developer of samba tng back from 1997 to 2000. i simpy transferred all of the tricks that i know involving while-loops around non-blocking sockets and so on over to python. ... and none of them helped. if you get 0.5% of the required performance in python, it's so far off the mark that you know something is drastically wrong. converting the exact same program to c is not going to help.

      The fact you have strict timing guarantees means you should be using a realtime kernel and realtime threads with a dedicated network card and dedicated processes on IRQs for that card.

      we don't have anything like that [strict timing guarantees] - not for the data itself. the data comes in on a 15 second delay (from the external source that we do not have control over) so a few extra seconds delay is not going to hurt.

      so although we need the real-time response to handle the incoming data, we _don't_ need the real-time capability beyond that point.

      Take the incoming messages from UDP and post them on a message bus should be step one so that you don't lose them.

      .... you know, i think this is extremely sensible advice (which i have heard from other sources) so it is good to have that confirmed... my concerns are as follows:

      questions:

      * how do you then ensure that the process receiving the incoming UDP messages is high enough priority to make sure that the packets are definitely, definitely received?

      * what support from the linux kernel is there to ensure that this happens?

      * is there a system call which makes sure that data received on a UDP socket *guarantees* that the process receiving it is woken up as an absolute priority over and above all else?

      * the message queue destination has to have locking otherwise it will be corrupted. what happens if the message queue that you wish to send the UDP packet to is locked by a *lower* priority process?

      * what support in the linux kernel is there to get the lower priority process to have its priority temporarily increased until it lets go of the message queue on which the higher-priority task is critically dependent?

      this is exactly the kind of thing that is entirely missing from the linux kernel. temporary automatic re-prioritisation was something that was added to solaris by sun microsystems quite some time ago.

      to the best of my knowledge the linux kernel has absolutely no support for these kinds of very important re-prioritisation requirements.

    3. Re:complex application example by Mr+Thinly+Sliced · · Score: 4, Informative

      First - the problem with python is that because it's a VM you've got a whole lot of baggage in that process out of your control (mutexes, mallocs, stalls for housekeeping).

      Basically you've got a strict timing guarantee dictated by the fact that you have incoming UDP packets you can't afford to drop.

      As such, you need a process sat on that incoming socket that doesn't block and can't be interrupted.

      The way you do that is to use a realtime kernel and dedicate a CPU using process affinity to a realtime receiver thread. Make sure that the only IRQ interrupt mapped to that CPU is the dedicated network card. (Note: I say realtime receiver thread, but in fact it's just a high priority callback down stack from the IRQ interrupt).

      This realtime receiver thread should be a "complete" realtime thread - no malloc, no mutexes. Passing messages out of these realtime threads should be done via non-blocking ring buffers to high (regular) priority threads who are in charge of posting to something like zeromq.

      Depending on your deadlines, you can make it fully non-blocking but you'll need to dedicate a CPU to spin lock checking that ring buffer for new messages. Second option is that you calculate your upper bound on ring buffer fill and poll it every now and then. You can use semaphores to signal between the threads but you'll need to make that other thread realtime too to avoid a possible priority inversion situation.

      > how do you then ensure that the process receiving the incoming UDP messages is high enough priority to make sure that the packets are definitely, definitely received

      As mentioned, dedicate a CPU mask everything else off from it and make the IRQ point to it.

      > what support from the linux kernel is there to ensure that this happens

      With a realtime thread the only other thing that could interrupt it would be another realtime priority thread - but you should make sure that situation doesn't occur.

      > is there a system call which makes sure that data received on a UDP socket *guarantees* that the process receiving it is woken up as an absolute priority over and above all else

      Yes, IRQ mapping to the dedicated CPU with a realtime receiver thread.

      > the message queue destination has to have locking otherwise it will be corrupted. what happens if the message queue that you wish to send the UDP packet to is locked by a *lower* priority process

      You might get away with having the realtime receiver thread do the zeromq message push (for example) but the "real" way to do this would be lock-free ring buffers and another thread being the consumer of that.

      > what support in the linux kernel is there to get the lower priority process to have its priority temporarily increased until it lets go of the message queue on which the higher-priority task is critically dependent

      You want to avoid this. Use lockfree structures for correctness - or you may discover that having the realtime receiver thread do the post is "good enough" for your message volumes.

      > to the best of my knowledge the linux kernel has absolutely no support for these kinds of very important re-prioritisation requirements

      No offense, but Linux has support for this kind of scenario, you're just a little confused about how you go about it. Priority inversion means you don't want to do it this way on _any_ operating system, not just Linux.

  8. Re:This obsession with everything in RAM needs to by Tough+Love · · Score: 3, Insightful

    Garbage collector with no overhead, hmm? Easy peasy with no satanic complexity I suppose. And of course no obnoxious corner cases. Equivalently in engineering, when your bridge won't stay up you just add a sky hook. Easy.

    --
    When all you have is a hammer, every problem starts to look like a thumb.
  9. Re:mainframe is old crap for geezers by Anonymous Coward · · Score: 2, Informative

    Yeah - the sky is the limit!!!
    Use your Microsoft cloud capabilities without hesitation....

    This message was brought by you by your friendly NSA..

  10. Re:mainframe is old crap for geezers by viperidaenz · · Score: 3, Informative

    On the contrary, if you can increase the performance of each node by 2x with 100,000 nodes, you've just saved 50,000 of them.

    That's a pretty big cost saving.

    The larger the installation, the more important resource management is. If you need to add more node, not only do you need to buy them, increase network capacity and power them, you also need to increase your cooling capacity, and floor space. Your failure rate goes up too. The higher the failure rate, the more staff you need to replace things.

  11. Re:Lotta work for an OS nobody uses by Z00L00K · · Score: 2

    2% may be the desktop share for Linux, but when it comes to servers and handheld devices like Android it's a different story.

    --
    If builders built buildings the way programmers wrote programs, then the first woodpecker would destroy civilization.
  12. Re:mainframe is old crap for geezers by K.+S.+Kyosuke · · Score: 2

    I don't dispute the possible savings and their value on large scale, but in general, it seemed to me that these proposals (what TFA describes) covered inter-application interactions, and not intra-application performance management. That's what I had in mind. With application-dedicated nodes (in cloud systems), improving performance is still of paramount importance but you do that with better data structures, careful application design, basically using internal domain knowledge etc., not with some some sort of app/OS generic resource allocation protocols. Or did I miss something?

    --
    Ezekiel 23:20
  13. Linux ALREADY has it! by Cyberax · · Score: 2

    Really. Author is an idiot. He should actually read something that is not a documentation volume for his beloved IBM mainframe.

    Linux has cgroups support which allows to partition a machine into multiple hierarchic containers. Memory and CPU partitioning works well, so it's easy to give only a certain percentage of CPU, RAM and/or swap to a specific set of tasks. Direct disk IO is getting in shape.

    Lots of people are cgroups in production on very large scales. There are still some gaps and inconsistencies around the edges (for example, buffered IO bandwidth can't be metered) but kernel developers are working on fixing them.

  14. Re:This obsession with everything in RAM needs to by Lisias · · Score: 2

    And yes, a garbage collector with zero overhead. Who would have thought? Well, pretty much anyone in the know, I guess.

    MARK / RELEASE from the Pascal days used to work pretty well - this is the less overhead "garbage collector" possible.

    It's impossible to have a Garbage Collector without some kind of overhead - all you can do is try to move the overhead to a place where it's not noticed.

    There's no such thing as Free Lunch.

    --
    Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
  15. Re:This obsession with everything in RAM needs to by IamTheRealMike · · Score: 2

    Not sure what you're getting at, but the Azul collector is well known for pulling off apparently magical GC performance. They do it with a lot of very clever computer science that involves, amongst other things, modifications to the kernel. I believe they also used to use custom chips with extended instruction sets designed to interop well with their custom JVM. Not sure if they still do that. The result is that they can do things like GC a 20 gigabyte heap in a handful of milliseconds. GC doesn't have to suck.

  16. Re:Linux Cgroups are a good subset of this by davecb · · Score: 3, Informative

    The only thing mainframes have that Unix/Linux Resource Managers lack is "goal mode". I can't set a TPS target and have resources automatically allocated to stay at or above the target. I *can* create minimum guarantees for CPU, memory and I/O bandwidth on Linux, BSD and the Unixes. I just have to manage the performance myself, by changing the minimums.

    --
    davecb@spamcop.net
  17. Re:This obsession with everything in RAM needs to by Wootery · · Score: 2

    I believe they also used to use custom chips with extended instruction sets designed to interop well with their custom JVM. Not sure if they still do that.

    I could've sworn I'd read that they'd stopped with their hardware work, but I think I was wrong: Appendix A of this page gives the impression (though I can't see it explicitly stated) that they're still doing custom hardware, but their software will work on ordinary Intel/AMD chips as well.

    GC doesn't have to suck.

    Indeed. It's Sturgeon's Law, but I think the '90%' part might be too low in this case. Major interpreters/'VMs' - even the ones with optimised native-code compilation - have awful GCs. Up until quite recently, Mono was using the Boehm GC. The GCs in OCaml and D show no signs of improving any time soon.