Linux Needs Resource Management For Complex Workloads
storagedude writes: Resource management and allocation for complex workloads has been a need for some time in open systems, but no one has ever followed through on making open systems look and behave like an IBM mainframe, writes Henry Newman at Enterprise Storage Forum. Throwing more hardware at the problem is a costly solution that won't work forever, he notes.
Newman writes: "With next-generation technology like non-volatile memories and PCIe SSDs, there are going to be more resources in addition to the CPU that need to be scheduled to make sure everything fits in memory and does not overflow. I think the time has come for Linux – and likely other operating systems – to develop a more robust framework that can address the needs of future hardware and meet the requirements for scheduling resources. This framework is not going to be easy to develop, but it is needed by everything from databases and MapReduce to simple web queries."
I know you're afraid of the garbage collector, but it won't bite. I promise.
So then what should we be obsessed with? Light weight, shiny screens, and rounded corners?
http://xkcd.com/619/
next-generation is a word that should be forbidden
I know you're afraid of the garbage collector, but it won't bite. I promise.
Yes, it will. It's not common, but it happens - and when it happens, it's nasty. Pretty nasty.
But not so nasty as micromanaging the memory by myself, so I keep licking my wounds and moving on with it.
(but sometimes it would be nice to have fine control over it)
Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
That generation has been going on for a while storagedude. People have been scaling according to load to deal with it.
Why not map everything in RAM? These days even Windows gives every process 128 terabytes of address space. TERA BYTES.
Boobs.
That level of control probably belongs at the cluster management level. We need to do less in the OS, not more. For big data centers, images are loaded into virtual machines, network switches are configured to create a software defined network, connections are made between storage servers and compute nodes, and then the job runs. None of this is managed at the single-machine OS level.
With some VM system like Xen managing the hardware on each machine, the client OS can be minimal. It doesn't need drivers, users, accounts, file systems, etc. If you're running in an Amazon AWS instance, at least 90% of Linux is just dead weight. Job management runs on some other machine that's managing the server farm.
Is this not what Linux Cgroups is for?
From wikipedia (http://en.m.wikipedia.org/wiki/Cgroups):
cgroups (abbreviated from control groups) is a Linux kernel feature to limit, account, and isolate resource usage (CPU, memory, disk I/O, etc.) of process groups.
From what I understand, LXC is built on top of Cgroups.
I understand the article is talking about "mainframe" or "cloud" like build-outs but for the most part, what he is talking about is already coming together with Cgroups.
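As a rough illustration of the kind of limits cgroups expose, the sketch below builds the values one would write into a control group's interface files. It assumes the cgroup v2 layout (`cpu.max`, `memory.max`), which postdates this discussion; the helper name `cgroup_v2_limits` is hypothetical, and nothing is written to disk.

```python
def cgroup_v2_limits(cpu_percent, memory_bytes, period_us=100_000):
    """Return {interface file: value} for a CPU and a memory cap.

    Assumes cgroup v2: cpu.max holds "<quota_us> <period_us>" and
    memory.max holds a byte count. Hypothetical helper: it only
    formats strings and never touches /sys/fs/cgroup.
    """
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    quota_us = int(period_us * cpu_percent / 100)
    return {
        "cpu.max": f"{quota_us} {period_us}",
        "memory.max": str(memory_bytes),
    }

# A quarter of one CPU and 512 MiB of RAM for some group of tasks.
limits = cgroup_v2_limits(cpu_percent=25, memory_bytes=512 * 1024**2)
```

Applying these would mean writing each value into the matching file of a group directory and adding process IDs to its `cgroup.procs`, which needs suitable privileges.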
Load balancing clustering, JIT storage, cloud services, mainframe offloading, dedicated database servers, high-availability redundant networking, etc....
The whole world is a nail to the man with a hammer....
So who is paying his salary (or this trip)?
Garbage collection necessarily wastes memory by factor of 1.5 to 2.
The collection itself also slows down the program, and in some languages cannot even happen asynchronously.
Finally, the most important aspect for program performance is locality and memory layout, something you cannot even optimize for in a language where every object is a pointer to some memory on a garbage-collected heap.
KVM, Xen and other hypervisors make Linux systems look like IBM mainframes. The whole "Virtual Machine" hype where we have guest operating systems running on hypervisors is just like IBM's Z series.
I was promised a flying car. Where is my flying car?
This feature was introduced in Windows Vista, and as we all know, this is the best OS ever because of that. Can't wait until Linux becomes more like Vista.
I read the article and I can't tell if this is a real problem that is really affecting thousands of users and companies, or a fantasy that the author wrote up in 30 minutes after having a discussion with an old IBM engineer.
Sure, IBM has all this resource prioritization in mainframes because mainframes cost a lot of money. Nowadays, hardware is so cheap you don't have to do all that stuff.
If some young programmer undertook the challenge and created the framework, would anyone use it and test it? Will there be an actual need for something like this?
My point is: is this inside information about what is really going on at the cutting edge of Linux usage, or just some smoke being blown around for an obligatory write-up?
Mainframes have always looked massively expensive, so we made do with cheap commodity crap. And crappy it was. You can see it everywhere, from (lack of, or bolted on as an afterthought) management features, to single points of failure everywhere, to being cheaply made and so prone to breakage and very hard to diagnose. Most of us have never worked with anything else so have no idea that things could be massively better. Resource management in the OS is but a small thing lacking in comparison.
What's most amazing is that this status quo is gospel, that nobody saw fit to sit back and really think about the whole thing and perhaps start a project or two to try and do something about it. Instead we see marginal fiddling that really isn't innovating at all. From the poetteringware that's deliberately but unnecessarily breaking compatibility in the name of progress but hardly progressing at all, to a bright new "standard" in rack sizes, right smack dab between the previous two(!) existing standards in size while still managing to fail to seize the chance to go metric, with a lot of cheap more-of-the-same software and hardware in between. The larger theme in computing is that it's not progressing much at all. It's not even baby steps, it's fiddling, doodling, not going anywhere at all.
This really can be a user-visible problem.
For example, the scheduling of things like SSD trims really needs to be stepped up.
Right now you can get unexpected blocking behaviour, for up to a whole second.
And there's no way for user-land to see it's going to happen, or even really to know what level of storage it is going to be using.
Maybe this stuff wants to be done as cluster management, rather than as part of the core kernel; but from a user's point of view - it just needs to be done.
Cloud???
Isn't that a mainframe connected over the internet with dumbed down terminals which require little complexity because the real complexity is located at a central point?
To clarify, cloud services act as the modern equivalent of the classic mainframe, and only the communication channels between the core system and the terminals have changed.
What is it, like 2% share? I mean, it was cool when 1% used it but now it's just an old, desperate OS looking for something, ANYTHING, to keep it from dying completely.
Linux grew because when people wanted/needed something, they wrote it themselves. Companies helped with money/manpower because they got some benefits.
So, if there's something missing, then it's probably not needed, or the other solutions cover it well enough.
I guess that's why Azul hired all those smart people, to make that go away for good.
Ezekiel 23:20
Garbage collection necessarily wastes memory by factor of 1.5 to 2.
And manual memory management on a similar scale wastes CPU time. And the techniques that alleviate one also tend to help the other, or not?
Finally, the most important aspect for program performance is locality and memory layout, something you cannot even optimize for in a language where every object is a pointer to some memory on a garbage-collected heap.
There's not a dichotomy here. Oberon and Go are garbage collected without everything being a heap pointer.
Ezekiel 23:20
I thought the title wanted to talk about something revolutionary, so I read through the details.
What I discovered was that the title was bullshit, so were the concerns surrounding Linux's capabilities. Some of them make sense for general all-purpose computation, some of them don't. I don't see why anybody should take these proposals too seriously for kernel inclusions.
The portion on primary memory management is perfect. Hadoop does suffer from lack of cache-aware code; so far, only modified kernels have been in use with systems such as Azul's C4 based Virtual Machine.
The portion on user driven resource management (CPU/disk) is a very thorny issue. Most people don't use big monolithic computers, but provisioned, distributed systems. This leads to better separation of concerns and better diagnostics. For most people this may be a non-issue, and not worth creating complicated, entangled scheduler code for.
The portion on User Accounting generally does not make sense for most Linux machines in production today. Most people who favor lower latencies do not want context switches.
Linux is not a product, but a meta-product. It is up to the implementor to take a variety of components, put them together in a logical way, and configure the modules/userland to work with them correctly. If IBM feels that they want to bring in that prickly complexity to run Linux on a boilerplate expensive mainframe computer, it's their headache.
When you go cloudy, you can do the same things on a somewhat higher level. As in, when you go Google-sized, the allocation and management of resources with a granularity of a computing node probably doesn't bother you much, because you have tens or hundreds of thousands of them. Trying to solve these problems on the single system level might be a waste of time for many applications. This is more of a problem for on-site big iron. It's an interesting problem, and if solved, could be of use to many people, but it would be much less useful for cloud providers.
Ezekiel 23:20
..ought be enough for everyone. I mean, 2^64 could address all atoms in the solar system. How much porn do you expect to be able to store anyway?
i am running into exactly this problem on my current contract. here is the scenario:
* UDP traffic (an external requirement that cannot be influenced) comes in
* the UDP traffic contains multiple data packets (call them "jobs") each of which requires minimal decoding and processing
* each "job" must be farmed out to *multiple* scripts (for example, 15 is not unreasonable)
* the responses from each job running on each script must be collated then post-processed.
so there is a huge fan-out where jobs (approximately 60 bytes) are coming in at a rate of 1,000 to 2,000 per second; those are being multiplied up by a factor of 15 (to 15,000 to 30,000 per second, each taking very little time in and of themselves), and the responses - all 15 to 30 thousand - must be in-order before being post-processed.
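The in-order requirement described above can be sketched as a small reorder buffer. This is a minimal illustration, assuming each job carries a sequence number starting at 0 and every job gets exactly `fanout` responses; the class name `InOrderCollator` is hypothetical and not taken from the poster's code.

```python
class InOrderCollator:
    """Collect `fanout` responses per job and release jobs in sequence order."""

    def __init__(self, fanout):
        self.fanout = fanout
        self.pending = {}   # job sequence number -> responses received so far
        self.next_seq = 0   # next job allowed to leave the collator

    def add(self, seq, response):
        """Record one script's response; return the jobs now ready, in order."""
        self.pending.setdefault(seq, []).append(response)
        ready = []
        # Release a run of consecutive completed jobs starting at next_seq.
        while len(self.pending.get(self.next_seq, ())) == self.fanout:
            ready.append((self.next_seq, self.pending.pop(self.next_seq)))
            self.next_seq += 1
        return ready

collator = InOrderCollator(fanout=2)
early = collator.add(1, "a") + collator.add(1, "b")   # job 1 done, job 0 is not
late = collator.add(0, "c") + collator.add(0, "d")    # job 0 done, both released
```

A finished job is held back until every earlier job has finished, so one slow script stalls everything queued behind it. That is one reason the post-processing stage can become the bottleneck at 15,000 to 30,000 responses per second.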
so, the first implementation is in a single process, and we just about achieve the target of 1,000 jobs but only about 10 scripts per job.
anything _above_ that rate and the UDP buffers overflow and there is no way to know if the data has been dropped. the data is *not* repeated, and there is no back-communication channel.
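One common mitigation for bursts is to ask the kernel for a larger socket receive buffer. The sketch below shows the standard `SO_RCVBUF` socket option; the 4 MiB request is an arbitrary assumption, and this only buys headroom for short bursts rather than fixing sustained overload.

```python
import socket

def udp_socket_with_buffer(requested_bytes):
    """Create a UDP socket and request a larger receive buffer.

    Returns (socket, granted_bytes). The kernel may clamp the request
    (on Linux, to net.core.rmem_max) and reports its own bookkeeping
    size, so the granted value is read back rather than assumed.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested_bytes)
    granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
    return sock, granted

sock, granted = udp_socket_with_buffer(4 * 1024 * 1024)
sock.close()
```

On Linux, per-socket drop counts are also visible in the `drops` column of `/proc/net/udp`, which may help confirm whether overflow is actually happening.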
the second implementation uses a parallel dispatcher. i went through half a dozen different implementations.
the first ones used threads, semaphores through python's multiprocessing.Pipe implementation. the performance was beyond dreadful, it was deeply alarming. after a few seconds performance would drop to zero. strace investigations showed that at heavy load the OS call futex was maxed out near 100%.
next came replacement of multiprocessing.Pipe with unix socket pairs and threads with processes, so as to regain proper control over signals, sending of data and so on. early variants of that would run absolutely fine up to some arbitrary limit then performance would plummet to around 1% or less, sometimes remaining there and sometimes recovering.
next came replacement of select with epoll, and the addition of edge-triggered events. after considerable bug-fixing a reliable implementation was created. testing began, and the CPU load slowly cranked up towards the maximum possible across all 4 cores.
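For readers unfamiliar with edge-triggered epoll, here is a minimal Linux-only sketch of the pattern, not the poster's implementation. A datagram socket pair stands in for the dispatcher-to-worker channel, and the function name `drain_edge_triggered` is hypothetical.

```python
import select
import socket

def drain_edge_triggered(sock, epoll, timeout=1.0):
    """Wait for one edge-triggered readiness event, then read until EAGAIN.

    With EPOLLET the kernel reports a readiness transition only once, so
    the reader must keep calling recv() until it would block; otherwise
    queued data is left stranded until the next message arrives.
    """
    messages = []
    for fd, events in epoll.poll(timeout):
        if fd == sock.fileno() and events & select.EPOLLIN:
            while True:
                try:
                    data = sock.recv(64)
                except BlockingIOError:
                    break   # queue is empty; wait for the next edge
                if not data:
                    break   # nothing useful left to read
                messages.append(data)
    return messages

# A datagram socket pair stands in for the dispatcher -> worker channel.
dispatcher, worker = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
worker.setblocking(False)
epoll = select.epoll()
epoll.register(worker.fileno(), select.EPOLLIN | select.EPOLLET)

for job in (b"job-0", b"job-1", b"job-2"):
    dispatcher.send(job)

received = drain_edge_triggered(worker, epoll)
epoll.close()
dispatcher.close()
worker.close()
```

Forgetting the drain loop is a classic source of the stalls described above: the worker handles one job, goes back to `poll()`, and never hears about the jobs already queued.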
the performance metrics came out *WORSE* than the single-process variant. investigations began and showed a number of things:
1) even though it is 60 bytes per job, the pre-processing required to make the decision about which process to send the job to was so great that the dispatcher process was becoming severely overloaded
2) each process was spending approximately 5 to 10% of its time doing actual work and NINETY PERCENT of its time waiting in epoll for incoming work.
this is unlike any other "normal" client-server architecture i've ever seen before. it is much more like the mainframe "job processing" that the article describes, and the linux OS simply cannot cope.
i would have used POSIX shared memory Queues but the implementation sucks: it is not possible to identify the shared memory blocks after they have been created so that they may be deleted. i checked the linux kernel source: there is no "directory listing" function supplied and i have no idea how you would even mount the IPC subsystem in order to list what's been created, anyway.
i gave serious consideration to using the python LMDB bindings because they provide an easy API on top of memory-mapped shared memory with copy-on-write semantics. early attempts at that gave dreadful performance: i have not investigated fully why that is: it _should_ work extremely well because of the copy-on-write semantics.
we also gave serious consideration to just taking a file, memory-mapping it and then appending job data to it, then using the mmap'd file for spin-locking to indicate when the job is being processed.
after all of these crazy implementations i basically have absolutely no confidence in the linux kernel nor the GNU/Linux POSIX-compliant implementation of the OS on top - i have no confidence that it can handle the load.
so i would be very interested to hear from anyone who has had to design similar architectures, and how they dealt with it.
"Resource management and allocation for complex workloads has been a need for some time in open systems"
Not that mainstream closed systems like microsoft corporation's so-called "windows" product and apple's "macos" system had anything like this.
And in microsoft corporation's "windows" operating system it is even impossible to implement this, due to buggy system design and already existing tons of issues related to resource management that would require rewriting windows from scratch. Since microsoft corporation's "windows" is just a toy for (rather stupid) children, this will never happen.
Just a note for people misunderstanding open and closed systems.
Garbage collector with no overhead, hmm? Easy peasy with no satanic complexity I suppose. And of course no obnoxious corner cases. Equivalently in engineering, when your bridge won't stay up you just add a sky hook. Easy.
When all you have is a hammer, every problem starts to look like a thumb.
No need to be facetious. If computer system engineering had no obnoxious corner cases, half of the problems not associated with GC would disappear as well. It's not like you can magically solve everything anyway.
And yes, a garbage collector with zero overhead. Who would have thought? Well, pretty much anyone in the know, I guess.
Yeah - the sky is the limit!!!
Use your Microsoft cloud capabilities without hesitation....
This message was brought to you by your friendly NSA..
Switch to Go. 15 scripts to re-write is not the end of the world. Use goroutines. profit ?
Why yes, yes it is nothing more than a rehash of the old days where dumb terminals connected to a mainframe. Sometimes those dumb terminals were connected via terrestrial microwaves or phone lines. Now where did I put my 3270 and where did I put my modified termcap file for a vt220.
My karma is not a Chameleon.
On the contrary, if you can increase the performance of each node by 2x with 100,000 nodes, you've just saved 50,000 of them.
That's a pretty big cost saving.
The larger the installation, the more important resource management is. If you need to add more nodes, not only do you need to buy them, increase network capacity and power them, you also need to increase your cooling capacity and floor space. Your failure rate goes up too. The higher the failure rate, the more staff you need to replace things.
Weren't they added in Linux 0.01 around 1991?
I don't dispute the possible savings and their value on large scale, but in general, it seemed to me that these proposals (what TFA describes) covered inter-application interactions, and not intra-application performance management. That's what I had in mind. With application-dedicated nodes (in cloud systems), improving performance is still of paramount importance but you do that with better data structures, careful application design, basically using internal domain knowledge etc., not with some sort of app/OS generic resource allocation protocols. Or did I miss something?
Ezekiel 23:20
So, rounded corners then.
Parallel processing super computers are a cost effective way of managing complex resources. The new technologies mentioned in Mr. Newman's article will make these super computers all the more efficient.
There is a solution that does this; it's called a mainframe. They're hideously expensive: cooked a motherboard recently, 1.2 million; want a 10G network card, $20000. You can buy an awful lot of commodity hardware for much less, so that you have excess resources. Need a dedicated system for a database? Buy one and run the other applications on a shared resource; you'll still end up with spare change if you dump a mainframe contract. You can replace a mainframe with commodity items, you just need to plan for it. This kind of scheduling costs more than deploying a couple of dedicated components.
The last time that I looked, the number of cycles being performed on mainframes had been decreasing for over 25 years, i.e. there's not a great deal of market demand in this area and most of this market is with legacy systems.
The other litmus test is to look at how many successful IT companies that have developed in the last 20 years use a mainframe. I suspect that it is zero. Do Google, Facebook, Amazon etc. use mainframes?
Scheduling and resource control on systems is a bit like QoS: if you can buy fat pipes just buy fat pipes, it's a better solution and it makes all of the problems go away. Introduce scheduling and you'll be employing goons from now to eternity trying to sort out which application is king, and performance still sucks.
Really. Author is an idiot. He should actually read something that is not a documentation volume for his beloved IBM mainframe.
Linux has cgroups support, which allows partitioning a machine into multiple hierarchical containers. Memory and CPU partitioning works well, so it's easy to give only a certain percentage of CPU, RAM and/or swap to a specific set of tasks. Direct disk IO is getting in shape.
Lots of people are using cgroups in production on very large scales. There are still some gaps and inconsistencies around the edges (for example, buffered IO bandwidth can't be metered) but kernel developers are working on fixing them.
Microsoft's Virtual Address Space (Windows) page claims that it is 8 terabytes (with a special feature to allocate just a full 2 gigabyte chunk).
Are you being intentionally obtuse? It would seem so, but sometimes it's hard to tell on /.
Salut,
Jacques
> And manual memory management on a similar scale wastes CPU time.
No, it doesn't. Manual memory management could include no heap allocation at all, but let's suppose you mean RAII style. GC necessarily wastes more than this, and stores up that waste for some undefined point in the future. Why? Because if you have perfectly matched creation/deletion, it uses what it needs to. If you have a pooled RAII allocator for use in a specific call tree, it can benefit from the huge economy of scale of discarding an entire pool. Unlike GC, though, RAII memory management does not need to throw away the information it has at the point of discarding a reference, just to do it later.
Moore's Law speaks to computational horsepower per unit per cost. But even if the computational abilities do not continue to increase, the costs will keep coming down.
Hardware is cheap. It's not an elegant solution, but it's cheap. And getting cheaper.
Focus on the UX, because without that, who cares what your kernel can do? Machines are plenty powerful enough, what you want to do is get your OS into the hands of the most users possible .... right?
"Consensus" in science is _always_ a political construct.
And yes, a garbage collector with zero overhead. Who would have thought? Well, pretty much anyone in the know, I guess.
MARK / RELEASE from the Pascal days used to work pretty well - this is the lowest-overhead "garbage collector" possible.
It's impossible to have a Garbage Collector without some kind of overhead - all you can do is try to move the overhead to a place where it's not noticed.
There's no such thing as Free Lunch.
Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org
That page has a comment on the bottom indicating that the information is out-of-date, with a pointer to more recent information: "Windows 8.1 and Windows Server 2012 R2: 128 TB"
dom
I don't have hard data yet, but I'm finding that EL7 is much much faster than EL6 on the same hardware for the workloads I've tried so far.
I don't know that tuned is most responsible, but I can see that it's running and that's what it's supposed to do.
I realize that the kernel is better and perhaps XFS helps, but those alone seem insufficient to realize the difference.
Anyway, it's somewhat along the direction people are talking about, even if only minimally.
My God, it's Full of Source!
OUTSIDE_IP=$(dig +short my.ip @outsideip.net)
Ah yes, I missed that. Apparently my microsecond-long attention span wasn't good enough to read the full page properly.
And I promise to pull out in time, honey.
Just lay back and think of England.....
And when those clouds swell and rise to the heights too fast, there will be a corporate hail storm and thunder. Possibly even a business tornado.
Not sure what you're getting at, but the Azul collector is well known for pulling off apparently magical GC performance. They do it with a lot of very clever computer science that involves, amongst other things, modifications to the kernel. I believe they also used to use custom chips with extended instruction sets designed to interop well with their custom JVM. Not sure if they still do that. The result is that they can do things like GC a 20 gigabyte heap in a handful of milliseconds. GC doesn't have to suck.
... but no one has ever followed through on making open systems look and behave like an IBM mainframe, ...
But I'll need a punch-card station and reader, build out my server room with a glass service window, hire a disinterested, snarky guy to retrieve printouts ... Or have IBM mainframes changed since my college days back in the late '80s?
It must have been something you assimilated. . . .
back to the future.
The only thing mainframes have that Unix/Linux Resource Managers lack is "goal mode". I can't set a TPS target and have resources automatically allocated to stay at or above the target. I *can* create minimum guarantees for CPU, memory and I/O bandwidth on Linux, BSD and the Unixes. I just have to manage the performance myself, by changing the minimums.
davecb@spamcop.net
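The missing "goal mode" amounts to a feedback loop wrapped around the minimum guarantees. The sketch below is a deliberately naive illustration of that idea, not how any mainframe workload manager works; the function name, step size, 10% dead band and share limits are all assumptions.

```python
def adjust_cpu_minimum(current_share, measured_tps, target_tps,
                       step=0.05, floor=0.05, ceiling=0.95):
    """Return a new guaranteed CPU share nudged toward the TPS target.

    Raises the guarantee by `step` when throughput is below target,
    lowers it when throughput exceeds the target by more than 10%
    (to release capacity), and otherwise leaves it unchanged.
    """
    if measured_tps < target_tps:
        share = current_share + step
    elif measured_tps > 1.10 * target_tps:
        share = current_share - step
    else:
        share = current_share
    return round(min(ceiling, max(floor, share)), 2)

# Simulated measurements: two slow intervals, one on target, one well over.
share = 0.30
for tps in (800, 900, 1020, 1300):
    share = adjust_cpu_minimum(share, measured_tps=tps, target_tps=1000)
```

In practice the returned share would be written into whatever mechanism enforces the guarantee, and the loop would need damping to avoid oscillating between two workloads competing for the same CPU.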
Consider trying QNX, the message-passing real time OS, for this. This is a message passing problem, and Linux doesn't do message passing well. QNX has a scheduler optimized for message passing. You should be able to handle the UDP front end and fan-out without any problems. You can give the front-end process a higher priority than the other processes, which should let you get all the UDP packets into the fan-out program without losing any. That's what real-time OSs are for.
Trying to do anything high-performance with CPython's threads is hopeless. Watch this presentation on performance issues with Python's Global Interpreter Lock. Python has an internal scheduler, and it behaves very badly under load.
So each Python process should be single-thread. Have as many as you need, set up to get work via MsgReceive and reply by MsgReply. Don't set them up as "resource managers".
Python under QNX is being used by the robotics community, where real-time matters for some things, but not others.
QNX - great technology, marketing operation from hell.
or you do both.
All had hugely complex, sophisticated and mind bendingly expensive hardware with complex built in diagnostics ... 4/5 of the machine could die and the system continue.
There are now much cheaper ways of doing this, no 1965 360/30 are not calling
I believe they also used to use custom chips with extended instruction sets designed to interop well with their custom JVM. Not sure if they still do that.
I could've sworn I'd read that they'd stopped with their hardware work, but I think I was wrong: Appendix A of this page gives the impression (though I can't see it explicitly stated) that they're still doing custom hardware, but their software will work on ordinary Intel/AMD chips as well.
GC doesn't have to suck.
Indeed. It's Sturgeon's Law, but I think the '90%' part might be too low in this case. Major interpreters/'VMs' - even the ones with optimised native-code compilation - have awful GCs. Up until quite recently, Mono was using the Boehm GC. The GCs in OCaml and D show no signs of improving any time soon.
Earlier 64-bit AMD CPUs did not have a 64bit atomic compare-and-swap instruction, so Microsoft limited their OSes at those times to 8TB. If only Microsoft supported compiling for your arch. Stupid closed source OS.
If you got a 2x increase in single threaded performance on a 100k node cluster, you could probably get rid of quite a bit more than 50k nodes because of scaling issues.
If you're going to MARK/RELEASE why not malloc/free? Same goes for languages like Java - if you have to null a reference for it to get collected, how is that different from free() or delete? It's still a line of code you have to remember to put in your program at the right place.
I think this person is still mad that linux doesn't feed out accurate memory usage ever since COW pages were introduced, let alone multiple efficiency steps since then.
Not going to say that task management over a greater picture's a bad idea, but have to make it more coarse (per server, approximations) rather than fine if one is to still be able to effectively use many of Linux' performance improvements above IBM mainframe approaches. Mind, I've built a couple of systems like that for proprietary infrastructure.
Don't include JCL, for heaven's sake.
Table-ized A.I.
The problem is that the author is advocating adding a significant amount of logic (and cache footprint) to hot paths. This might have made a lot of sense if we were seeing a trend towards bigger single-image boxes (ie, mainframes). But the fact is that everything points to proliferation of microservices, sharded-distributed implementations, even just bigger boxes divided into VMs. Yes, big boxes still exist, but they're relatively special-purpose, and certainly shouldn't be dictating the direction of kernel/OS development.
If only Microsoft supported compiling for your arch.
...then what? You're using old Microsoft systems in areas where you need more than 8TB per process?
a) when was the last time you saw a single threaded node?
b) it was obviously an illustrative example. don't be a dick.
wow, who knew boobs could be so controversial
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Insightful (+1).
It is currently scored Normal (2).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Informative (+1).
It is currently scored Insightful (3).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Interesting (+1).
It is currently scored Insightful (4).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Overrated (-1).
It is currently scored Insightful (3).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Funny (+1).
It is currently scored Insightful (4).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Overrated (-1).
It is currently scored Insightful (3).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Offtopic (-1).
It is currently scored Insightful (2).
Re: This obsession with everything in RAM needs to, posted to Linux Needs Resource Management For Complex Workloads, has been moderated Funny (+1).
It is currently scored Funny (3).
in VMS (you know, that semi-mainframe OS invented by DEC and now owned by HP).
While Linux has many ways to manage resources, like cgroups, understanding of how to use resource management lags among developers and operations people. Understanding how an application or VM impacts performance of the host and how one application or VM interacts with the performance of another application or VM is a very complex subject.
A lot of good work is being done in this area, but general understanding in the industry is lagging. Most of the time the solution is to throw more hardware or upgraded hardware at the problem. More memory, more cores, upgrade to SSDs, more NICs.
To add my 2 cents, I think resource management has three layers in a virtualized cluster environment:
Cluster management: Deciding which hosts have the resources to execute the job.
Host management: Managing the resources on the actual hardware host
Instance management: Instrumenting VM or container based applications to accurately forecast and report their resource requirements.
The last one is the hard one. Nobody wants to run short of resources so they always ask for what they think are their maximum needs. Understanding the actual resource requirements of an application is very difficult. Writing an application to work within those resource allocations is also very difficult. Coordinating all of this on a cluster wide basis is even harder.
rlh100
Sorry for posting this as "Anonymous Coward" but the new Slashdot always drops my login information when I go to a specific article.
If you're going to MARK/RELEASE why not malloc/free? Same goes for languages like Java - if you have to null a reference for it to get collected, how is that different from free() or delete? It's still a line of code you have to remember to put in your program at the right place.
For two reasons:
1) It's easier to MARK the heap at the beginning of the task, use it like there's no tomorrow, and then just RELEASE everything at once at the end. (Nothing prevents you from deleting some pointers in the job to save memory.)
2) You avoid HEAP fragmentation, easing the memory management's life.
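The two points above can be seen in a toy model of MARK/RELEASE: a bump allocator whose top-of-heap pointer is saved and later restored. This is an illustrative sketch only, not how any particular Pascal runtime implemented it; the class `MarkReleaseHeap` is hypothetical.

```python
class MarkReleaseHeap:
    """Bump allocator over a fixed buffer with Pascal-style mark/release."""

    def __init__(self, size):
        self.buffer = bytearray(size)
        self.top = 0   # first free offset

    def allocate(self, nbytes):
        """Hand out the next `nbytes` as an (offset, length) pair."""
        if self.top + nbytes > len(self.buffer):
            raise MemoryError("heap exhausted")
        offset = self.top
        self.top += nbytes
        return offset, nbytes

    def mark(self):
        """Remember the current top of the heap."""
        return self.top

    def release(self, marker):
        """Discard everything allocated since `marker` in one step."""
        self.top = marker

heap = MarkReleaseHeap(1024)
keep = heap.allocate(100)     # long-lived data allocated before the task
marker = heap.mark()
for _ in range(10):
    heap.allocate(50)         # per-task scratch allocations
used_during_task = heap.top
heap.release(marker)          # all scratch space reclaimed at once
```

Releasing is a single assignment no matter how many allocations were made, and because space is always reclaimed from the top, the heap cannot fragment.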
Anyway, it appears to me that you missed the point. I was criticizing the pretense of a "no overhead garbage collector" from Azul.
Lisias@Earth.SolarSystem.OrionArm.MilkyWay.Local.Virgo.Universe.org