stripes · Slashdot Mirror

Re:yes, so far.. on "For Use on Free Operating Systems, Only!" · 2001-05-24 03:18 · Score: 3

[...] but might be in a future version of GPL which would allow people to release GPL software which did not provide the exemption for linking with the standard C library (GPL generally does not allow linking with non-free libraries). This would effectively allow people to write GPL software that wasn't portable to proprietary OS's since on these OS's it is always necessary to link with non-free libraries.

I don't think so. One would have to go to the trouble of getting a free libc like glibc or the BSD libc to run on the non-free OS, but once done they could use the more-viral-than-normal GPL on an otherwise non-free OS.

Or on a few where the libc is already free, like Mac OS X, where the libc is (as far as I know) part of the BSD licensed part but the OS as a whole is clearly non-free (in either sense of the word), there wouldn't even be trouble involved.

Re:Where is the bottleneck, really? on Benchmark Madness · 2001-05-24 00:23 · Score: 2

Um... b-trees are always balanced, it's one of their most basic properties.

Yes, but only because they force balancing work to be done on (some) inserts and deletes. Sometimes a lot of work. And that can require locking and blocking on those operations. They are still good data structures in many cases, but one must remember they get fast search times at the expense of possibly slow insert/delete times (not as slow as an insert on an unbalanced binary tree could be though, as that hits linear).
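To put a number on the "hits linear" case, here is a minimal sketch of a plain binary tree with no rebalancing at all. The names (Node, insert, cost_of_last_sorted_insert) are made up for illustration. Feeding it keys in sorted order turns it into a linked list, so the n-th insert has to walk past n-1 nodes, where a b-tree would pay a bounded amount of split work instead.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Plain (unbalanced) binary search tree node; illustrative only.
struct Node {
    int key;
    std::unique_ptr<Node> left, right;
    explicit Node(int k) : key(k) {}
};

// Insert with no rebalancing. Returns how many existing nodes were
// visited on the way down, i.e. the cost of this one insert.
std::size_t insert(std::unique_ptr<Node>& root, int key) {
    std::size_t visited = 0;
    std::unique_ptr<Node>* slot = &root;
    while (*slot) {
        ++visited;
        slot = (key < (*slot)->key) ? &(*slot)->left : &(*slot)->right;
    }
    *slot = std::make_unique<Node>(key);
    return visited;
}

// Insert keys 0..n-1 in sorted order and report the cost of the last one.
std::size_t cost_of_last_sorted_insert(int n) {
    assert(n > 0);
    std::unique_ptr<Node> root;
    std::size_t last = 0;
    for (int k = 0; k < n; ++k) last = insert(root, k);
    return last;
}
```

With 1000 sorted keys the last insert visits 999 nodes. Keep n modest when trying this, since the degenerate tree is also torn down recursively.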

Re:But Softupdates has the same benefit on Benchmark Madness · 2001-05-24 00:03 · Score: 2

As for the "lower cost(complexity)", i doubt that severely. Journalling isn't all that complex, either.

Journaling may not be that complex, but XFS sure is. The code size the Usenix paper quotes was over half the size of the BSD kernel I was using at the time.

XFS does give a number of other things. Hashed b-trees for file name lookups, multi volume file systems (with draining), GRIO (not in the GPLed version though).

I don't know about the size of ReiserFS, but it too offers more than just journaling: it has some size and speed performance bonuses on small files.

FFS's soft updates changes were fairly small, but not simple. The second round of them, adding file system checkpointing, is also pretty small, but again makes things harder to compare.

I like FFS + soft updates a whole lot. I think someone ought to port it to Linux. However if I had to run a Usenet News system XFS would look very interesting since it will do name lookups in huge directories much faster than FFS.

Re:This is the right thing to do... on Apple Releases - Doing Less, Faster, Is Better? · 2001-05-12 12:23 · Score: 2

I really have trouble coming up with a task that requires Photoshop 6 over 5.5

Other people's canned "actions" (aka macros). There is a pretty good one for reducing the Canon EOS-D30 high ISO noise that needs Photoshop 6 (or at least it claims it does - I don't have PS6).

As far as monolithic upgrades go, the companies need them in order to make their business model work.

That's why I actually kind of like the software subscription model, pay for support, bug fixes, and continuous updates. No big push to add cool sounding features to get people to upgrade, or making it look different, or make it have a non backwards compatible file format. No big push to get the product out on time, even if that means more bugs.

I'm not thrilled with the idea of the software no longer working if you fail to make a payment though.

Re:Faster is better, but... on Apple Releases - Doing Less, Faster, Is Better? · 2001-05-12 12:00 · Score: 2

seriously, I don't consider the spell checker not beeping as something being "broken".

It doesn't beep after every word, it beeps when you have corrected or ignored (but not "Find Next"ed over) all the words. It is useful because if you are fixing up a slashdot post (the spell checker is available from all Cocoa text editing widgets, including the web browser I use) with somebody else's misspelled text, you want to leave those words alone. The beep reminds you that you have gotten a full pass even if there are still underlined misspelled words left.

(at least I want to leave the quoted text uncorrected -- I don't like to muck with someone else's words, plus it may make me look subtly smarter by having the only correctly spelled half of the argument....)

Re:does it update itself or do you get them? on Apple Releases - Doing Less, Faster, Is Better? · 2001-05-12 05:49 · Score: 2

I guess the question I have is: is OS X forcing people to get the updates? Probably not. This means that you can choose whether you want to update every month or once every 3 months or whatever.

You can choose manually, and never get updates if you like. Or automatically, with sub choices of daily, weekly, and monthly. No three month choice though. You could try that with cron, but I think the updater might get pissy if it fires up with no GUI user around (when it finds an update it requires an admin password to install it).

I'll note that I have mine set to daily and have never seen it do an update by itself. It may only check on login (I tend to log in and stay that way, it's my laptop after all!). I'm pretty sure it had a whole 72 hours between 10.0.3's release and when I ran the updater manually.

I think that they are probably focusing on the issue of fixing one bug at a time and doing it right rather than trying to fix all the bugs at once and then doing it half assed like M$ would.

They seem to bundle a bunch of fixes together, and they have really vague descriptions. 10.0.3 only fixes "making sure you see all files in the Finder in really really big directories", but people have said it has sound fixes and SCSI stuff. It was less than 5M to download.

I don't use OS X but would you rather have to download one huge patch every 6 months to a year (4-15Meg) or would you rather have 6 to 12 patches that were maybe smaller (1 - 2Megs). So how big are the patches?

The first one was pretty big, and included bug fixes, and "new" features like ssh and sshd (which were in the Public Beta, but Uncle Sam didn't get paperwork in time to go on the CD). It also had a lot of printer issues (like support for a lot of modern low cost ink jets). It may have been as large as 14M, it took a while to load over the 144Kbit DSL. 10.0.2 was smaller, and had some fixes plus partial CD-writer support.

Sorry I am not a big fan of M$; as they keep making their software 'easier' to use they make it more bloated and 'all in one' and then stuff it at you. Why is it that when you install Windows nowadays you get IE and cannot get rid of it?

Apple may not be that bad, but they aren't releasing the tiny little fine grained patches you seem to want.

Faster is better, but... on Apple Releases - Doing Less, Faster, Is Better? · 2001-05-12 05:32 · Score: 3

I haven't seen anything 'broken' by the updates, and each time the system runs faster and smoother.

I have. My sound seems a little flaky, like the spell checker doesn't beep when it has corrected the last word. Other people have reported worse. However there are few of them, so I'm guessing it is mostly good.

I think Apple really needs an "undo update".

However I'm used to the fast small release cycle, and I like it.

Re:Thankyou on Dual Athlon Motherboards Creep Closer · 2001-05-11 01:40 · Score: 2

Er, "unclean"?? Just set make options and compiler flags in /etc/make.conf

What do you set in make.conf? Can you really set CFLAGS there without it getting stomped later?

I think it is unclean because you are setting CFLAGS, not C_OPT_FLAGS so something that doesn't want to be optimized gets your settings, as do things that have their own carefully worked out settings.

It is partly a limit of make, and partly a limit of how it is used ('tho I admit that the modern BSDs use make far far better than 4.3 ever did).

Re:Thankyou on Dual Athlon Motherboards Creep Closer · 2001-05-10 22:53 · Score: 3

You have restored my faith. My comments about compiler based performance increases are drawn from benchmarking done prior to the P4 release. I've utterly forgotten the details, except that P4 was slammed utterly in the initial benchmark, but after severe hand made adjustments from the intel techs, it blew the competition away. Evidently the compiler wasn't quite up to scratch at that point.

I don't think that was true. I know it got slammed in initial benchmarks, and then a few months later it was doing OK (but not significantly faster than the much slower AMDs). I think that has more to do with the motherboard chipset and memory systems improving than any compiler changes. Then again I haven't been watching closely, so I could be wrong.

Obviously designing a compiler like gcc requires making some trade-offs, given the way it is used. Redhat can't afford to compile rpms exclusively for x86, and they can't start separate distros for AMD and intel, realistically. As a result, compiler optimisations are a compromise.

The compiler itself has a fair number of CPU specific bits, but they are mostly enabled by flags. One set for "what exact instruction set should I use", and a second set for "what CPU should I tune for". So you can make sure your code will run on a 486 or better (but maybe not the 386), but have the instruction ordering and cost weightings for an AMD K7, or an Intel P4 (I think that is -mtune=CPU).

Redhat may not want to compile up different RPMs for each arch, but if it is important to you, you can do it yourself. I don't know how; on FreeBSD you can do some unclean things to the master Makefiles and pretty much everything will use the new settings. If RPMs are mostly source you should be able to do the same kind of thing.

I can tell you why the windows compiler doesn't optimize well: Apparently they haven't written code to optimise for P2 yet. Hopefully, they'll skip a generation or two and jump straight to P4.

That's still not a why :-) I have been told you can use the Intel compiler as a backend for MS Visual C++ if you own it, so I guess either that compiler is too slow for most Windows software authors to want to use, or too expensive for their managers to pay for. I wonder if they can use gcc as a backend...

Drag about alpha. I'd still recommend sparc for high-end over most other things, due to excellent bus speeds and huge MP support. Bus speeds count for a hell of a lot in some fields.

I expect the Alpha would do well there also. I don't think the Alpha is dead, there are still good people working on it. Just not as many. The next Alpha (which is late already, tape out being a significant fraction of a year late) looks nice. The one after it (which might come on time!) looks like a real killer.

Of course the other "next gen" CPUs look pretty killer too. The IBM POWER4 actually looks more impressive than the next-next-Alpha. I don't recall that being the case before (the POWER has held the lead, but mostly by having their new CPU coming out a few months ahead of the new Alpha, not a year or more!). Of course IBM seems to be playing it close to the vest, so we don't know if the POWER4 is on time.

Finally, it wasn't meant to be a troll. Seriously. I don't know what happened there.

It happens. I've had posts mistaken for personal attacks that weren't.

Re:Athlon is still a risk on Dual Athlon Motherboards Creep Closer · 2001-05-10 21:25 · Score: 5

Due to various changes to the super-scalar and caching features of pentium in the P4, AMD processors are a dangerous risk if you plan to use your machine for a particularly long time. As compilers are reworked to take advantage of the changes, AMD processors will perform considerably worse in comparison.

I doubt it. Most of the compiler related benefit is pretty small. The prefetch instruction for example gave about a 20% boost to the stream benchmark when shoved into an experimental version of gcc (this was the AMD prefetch). At that time it wasn't taken because it made some other benchmarks worse (the compiler wasn't smart enough to know when not to use them). That was about 18 months ago, so I'm not sure if they were improved and put in, or shelved for a post-3.0 release. Most other tweaks are smaller. SSE/3DNow would show a bigger improvement, but so far no compiler has done much with them; that is all hand coding (or on the PowerPC minor compiler assist, because Apple modified gcc to have AltiVec datatypes, but you still have to change the C/C++/Objective-C yourself).

Just as importantly gcc sees optimisations for both CPUs (and many others), not just the Intel version.

The Intel compiler (as far as I know) doesn't get AMD optimisations, but it also isn't all that widely used, despite being a very nice compiler. Most windows code isn't all that optimised, I'm not really sure why.

(note the superscalar changes seem like they would require a lot of compiler help, but ever since the PPro the x86 CPUs have mostly been out-of-order machines and don't need much compiler help in instruction ordering to get pretty close to top speed. So unlike the 21064 or Pentium 1 or SuperSPARC, rather than getting a 2x to 4x speed up for getting just the right ordering, the speed up is typically more like 10%, and that is when there are lots of cache hits!)

In short, if you want intel, buy intel. If you want performance, buy a different architecture, like alpha or sparc.

Actually if you want integer performance Intel pretty much has the SPECInt crown (at least last month comp.arch was abuzz because the Alpha had finally lost it, and was in danger of losing the SPECFP as well, but that's what happens when half your design team is lost and your new CPU gets to be 20% as late as Intel's Merced).

The SPARC isn't a performance leader, and hasn't been for a very long time. It does give you access to some great rackmountable hardware, a ROM monitor that is great for lights out management, and a lot of other things, but raw CPU speed isn't it.

I think the Alpha still wins in SPECFP, but if you can do with reduced accuracy non-IEEE FP the PowerPC or Intel or AMD may beat it. For I/O the S/390s seem like a better bet :-)

Don't waste your time on the low end if you are doing high end stuff, and don't blow your cash on dual proc boards that you aren't going to be able to take advantage of.

Pretty good advice.

Hell, there aren't any really good SMP OSs for intel anyway, except BeOS, and that suffers from other problems that keep it from achieving much popularity.

If you are CPU bound BSD/OS will do fine. If you have some I/O in there Linux and FreeBSD aren't too bad, but they could be a lot better. They are certainly better than Solaris was after the same number of years of effort. I'm not sure the BeOS kernel is actually any better than those OSes. The userland is better positioned because of the way they designed it, and they promote use of threads quite a bit.

P.S. yeah, I know it was probably a troll, but I had to reply :-)

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-30 22:13 · Score: 2

We would have gotten here even sooner if you hadn't been so uncivil as to sleaze all around the subject (and several others) instead of simply accepting that maybe the point about hidden costs was a valid one.

Maybe it wasn't sleaze, maybe I didn't realize it was true.

And I still don't think it is the least bit common.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-30 10:18 · Score: 2

If you think about it for a while, you'll realize that there are situations that the linker can't handle, and therefore there must be run-time patch-ups for at least those cases. You really should be more careful about using words like "never".

So when do they change the vtbl? I could believe the debugger might do it, but I don't know if it does. The debugger also writes other normally non-writable areas, so I don't think that makes a big difference.

I don't know any other times a statically linked C++ program would change the vtbl. I've been asserting that dynamically linked C++ programs aren't an interesting debate area because (a) I really haven't seen any, (b) the C code ends up doing the same fixups, (c) I don't really know when and how the fixups are done (they could be as each vtbl entry is used, or en masse), (d) many platforms that can dynamically link C can't do the same for C++, and (e) I figured it was irrelevant because in an apples to apples comparison the C code is just as bad.

And I suppose those are the only OSes that matter, eh?

No, but those are the only OSes in widespread use that I know how the dynamic linker works on. I also know how Multics did it, how Sprite did it, how SunOS 4.x did it, and a few research systems. I also know how the static linked shared objects on SCO and BSD/OS work.

I didn't want to say "this is how dynamic linking always works", so I went with "it works this way on the OSes I know about".

And do you suppose that mprotect is free? Or might this be one of those hidden costs whose existence you've been denying?

No, that is a lot of overhead. However I was explaining exactly what ld.so does for C code, not C++ vtbls. For example on many systems a libc.so call to malloc will do a JMP JMP (or indirect jump): even though malloc is in libc.so, the call goes through the dynamic link table just in case a "more important" .so defines malloc.

I don't know if these (C code!) JMP JMP sequences are better than an indirect jump. By extension I don't know if the same C++ shared object JMP JMPs are a good idea.
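A rough model of that "call through the link table" idea, in ordinary C++ rather than anything a real ld.so does. Every name here (alloc_slot, libc_alloc, override_alloc, bind_override) is hypothetical; the point is only that the call site jumps via a slot, so whoever fills the slot decides which definition wins.

```cpp
#include <cassert>
#include <cstddef>

using AllocFn = int (*)(std::size_t);

// Hypothetical stand-ins for two definitions of the same symbol.
int libc_alloc(std::size_t) { return 1; }      // the definition inside "libc.so"
int override_alloc(std::size_t) { return 2; }  // a "more important" .so's definition

// One slot of a made-up link table. Call sites never name libc_alloc directly.
AllocFn alloc_slot = libc_alloc;

// What a call site inside the library does: go through the slot.
int call_alloc(std::size_t n) {
    assert(alloc_slot != nullptr);
    return alloc_slot(n);
}

// What a loader would do once it finds an overriding definition.
void bind_override() { alloc_slot = override_alloc; }
```

Before bind_override() the call lands in libc_alloc, afterwards in override_alloc, and the call site itself never changes. Whether the slot holds a jump instruction or an address to jump through is exactly the JMP JMP vs. indirect jump question, which this sketch does not settle.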

You got that backwards. I view it as a kludge or not based on the effect it has, instead of assuming it's not a kludge and then trying to deny effects to back up my opinion. IMO if it slows down the system as a whole *or* if it makes code elsewhere significantly more complex to support it, it's a kludge.

Well, I do believe the static linked C++ JMP JMPs don't slow the system down as a whole, and I don't think they make anything more complex with the possible exception of the linker.

Yeah, and nobody ever got in any trouble by forgetting the difference between "really really common" and "universal for all time" right?

Sure they did, I even hesitated to bring it up the first time. I just think there is a lot of code that assumes 16 byte lines, and because the lines of CPUs are currently multiples of 8 bytes (almost always multiples of 16) they tend not to get in too much trouble.

I can think of cases where it would cause trouble. I can think of more cases where it would at the very least fail to be faster than code that ignores the cache line size. I assume it also pisses off CPU designers because there might be some win in designing 30 byte cache lines (or some other odd size), except all the code that assumes 16 byte lines screws it up (but code "just written" would run better).

The right thing would be to run with a #define, or a const int, I was however making a cynical comment that 16 would get chosen, not a pronouncement that 16 ought to be the One True Answer there.
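A minimal sketch of the "const int" approach, using constexpr as the modern spelling. The value 64 and the names kCacheLineBytes, PaddedCounter, and line_index are assumptions for illustration; the real line size depends on the CPU, which is the whole reason to keep it in one named constant.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// ASSUMED value for illustration; change it in this one place per target.
constexpr std::size_t kCacheLineBytes = 64;

// A counter padded out to a whole line, so neighbours in an array
// never end up sharing a line.
struct alignas(kCacheLineBytes) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) % kCacheLineBytes == 0,
              "padded type must fill whole lines");

// Round a byte count up to a whole number of lines.
constexpr std::size_t round_up_to_line(std::size_t bytes) {
    return (bytes + kCacheLineBytes - 1) / kCacheLineBytes * kCacheLineBytes;
}

// Which line (counting from address 0) a pointer falls in.
std::size_t line_index(const void* p) {
    assert(p != nullptr);
    return static_cast<std::size_t>(
        reinterpret_cast<std::uintptr_t>(p) / kCacheLineBytes);
}
```

Nothing else in the code mentions 16 or 64, so a port to a machine with a different line size touches one line of source.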

When my debating partner is obstinately straying from the rules of debate, I actually do feel they deserve a little slap on the wrist. The crux of this whole debate is your statement (in cid#635):
The hidden costs the other poster was talking about [me, in cid#595] don't exist

What annoys me is not that the statement was made, but that it wasn't retracted the first time it was refuted. Instead, I've had to put up with your topic changes, buzzword storms, squishy definitions, and all manner of other evasions. Frankly, I don't appreciate the extra work. I wouldn't treat you like an errant debate pupil if you'd stop acting like one.

Let's see, what were those costs I was denying existed?

Manual cache invalidation isn't cheap. It requires a lot more interlocking within the MMU than a typical instruction, so you pay a penalty every time you create an object.
Invalidating i-cache may blow away unrelated (but needed) instructions because of false sharing.
The object-creation code is now messier and more system-dependent.
Mixing instruction and data spaces precludes a whole class of VM-system optimization.

Those I still say don't exist, because I still say the vtbl is not changed at runtime.

Now I do admit that I hadn't thought about dynamically linked code the first time I made that statement. In fact I didn't think about it until about two posts ago. I still don't think dynamically linked code is relevant for reasons I stated at the top of the post.

If we do include the dynamically linked code then of the costs you listed originally, the same ones I dismissed out of hand, the "Invalidating i-cache may blow away unrelated (but needed) instructions" is the only relevant cost. It is not paid on each object creation. Depending on how exactly ld.so works it may be paid only when the .so is mapped the first time, or it may be paid when the first call on that vtbl is made, or when the first call through a specific entry of the vtbl is made. Even so the vtbl will be shared for all objects of the same class.

To be pissy, the costs you asserted, and I denied, don't exist, even for a dynamically linked object.

Being less pissy, one of the four costs you asserted exists in a radically reduced form. If you include dynamically linked code (and I hadn't argued at the time that one shouldn't, in large part because I hadn't thought about dynamically linked code at all) then you have enough of a point that I should have said "three of the four never ever happen, the last doesn't happen in practice, and even if it did it isn't per object create, but about as frequent as per class in a .so, or per virtual function per class in a .so, or maybe per .so".
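The "shared for all objects of the same class" part can be poked at directly, with a big caveat: the sketch below assumes the object starts with a pointer to its class's vtbl, which is how gcc and clang lay things out (Itanium C++ ABI) but is not promised by the C++ standard. Shape, Triangle, and peek_vptr are made-up names.

```cpp
#include <cassert>
#include <cstring>

struct Shape {
    virtual ~Shape() = default;
    virtual int sides() const { return 0; }
};

struct Triangle : Shape {
    int sides() const override { return 3; }
};

// ASSUMPTION: the first pointer-sized bytes of the object are its vtbl
// pointer (true for gcc/clang single inheritance, not guaranteed elsewhere).
const void* peek_vptr(const Shape& s) {
    const void* vptr = nullptr;
    std::memcpy(&vptr, static_cast<const void*>(&s), sizeof vptr);
    assert(vptr != nullptr);
    return vptr;
}
```

On such a compiler two Triangle objects report the same pointer and a plain Shape reports a different one, so creating another object only stores an existing address. It does not build or modify a table.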

But I didn't realize at the time that there was a small part of the original statement that was true. You were (or seemed) fixated on the vtbl being modified all the damn time, and I was fixated on denying it.

Maybe we would have gotten here sooner if you were civil, but I assure you we would not have gotten here later.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-30 04:05 · Score: 2

Ahhh, but it does. On an architecture designed around the "i-space modification is rare" assumption, writing to a vtbl *even* at object-creation or class-loading time incurs a substantial overhead in exception handling

The vtbl is not ever written to. On a unix system it is part of the ELF or a.out code area. The linker figures out how it looks, and it is not changed at runtime.

It might be different for a dynamically linked library, but the normal C linked libs frequently (but not always) work that way. BSD/OS does the double JMP trick (at least on the x86), using a ld.so that it got from FreeBSD, which uses a ld.so borrowed from, or inspired by the Linux version. So there is a good chance all three systems do it the same way.

Of course dynamically linked C++ code is quite rare. Or at least using C++ libs dynamically; it is common for C++ code to use dynamic C libs. This has a lot to do with deficiencies in the ABI, and in template generation code.

So for a static linked program the vtbl is never ever written. I'm not sure about dynamically linked ones, maybe I'll go check.

This is different from the modifications that must occur at image-load time (including DSO-load time) because those have distinct boundaries and the OS can treat pages differently during that period than afterward. Maybe if parts of the C++ runtime were integrated into the OS loader this could be handled more efficiently, but that's a heinous idea for other reasons.

For statically linked code it is the same as being done at image load time, because the vtbl is filled out at image load time (before that actually).

For dynamically linked code it is also not all that different (I assume) because ld.so does similar things for the C code, and it also has minimal OS support (the mprotect(2) call, make the stub tables non-executable+writable, change them, make them executable+read-only).
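The unlock, change, relock sequence can be sketched with the real POSIX calls mmap(2) and mprotect(2). This is not ld.so's code: the table here is plain data holding function pointers and is never made executable, and StubTable, make_table, and patch_slot are hypothetical names. It needs a POSIX system with MAP_ANONYMOUS.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/mman.h>
#include <unistd.h>

using Fn = int (*)();
int old_target() { return 1; }
int new_target() { return 2; }

// One page of function pointers, kept read-only except while patching.
struct StubTable {
    Fn* slots = nullptr;
    std::size_t bytes = 0;
};

StubTable make_table() {
    StubTable t;
    t.bytes = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* p = mmap(nullptr, t.bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(p != MAP_FAILED);
    t.slots = static_cast<Fn*>(p);
    t.slots[0] = old_target;
    int rc = mprotect(t.slots, t.bytes, PROT_READ);  // lock the table down
    assert(rc == 0);
    (void)rc;
    return t;
}

// Unlock, change one entry, lock again. False if either mprotect fails.
bool patch_slot(StubTable& t, std::size_t i, Fn target) {
    assert(i < t.bytes / sizeof(Fn));
    if (mprotect(t.slots, t.bytes, PROT_READ | PROT_WRITE) != 0) return false;
    t.slots[i] = target;
    return mprotect(t.slots, t.bytes, PROT_READ) == 0;
}
```

Each patch costs two system calls on a whole page, so a loader would want to batch its fixups between one unlock and one relock.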

Similarly, the whole point of the double-jump seems to be to abuse the BTB for performance. I call it abuse because every method pointer that's stuffed into the BTB is one less BTB entry that can be used for *real* branches.

That's an opinion based on whether you view it as a kludge or not. I view it as a real use of the BTB because I view the BTB's job as keeping the pipeline filled in the presence of branches.

Some CPUs allocate BTB entries for normal JMP and JSR instructions. The AMD 29k did that, the BTB held an address and the actual instruction. Modern CPUs don't tend to do that because the pipeline is too long to be happy with just one instruction. Some modern CPUs still keep entries for all branch/JMP/JSRs, but the BTB has an "internal pointer" to the i-cache line. This is rare because it is only useful on CPUs where the i-cache lookup takes more than one cycle (otherwise the internal pointer is no win).

CPUs like the HAL SPARC64 actually unfold i-space around control flow instructions, so the JMP-JMP would be replaced with the straight line code, no BTB would be allocated. I'm not sure about the P-IV's trace cache as I don't know if it is a true trace cache, or just shares the name.

In any event I think using the BTB entry to make the JMP JMP faster is a great thing, it avoids a pipeline stall. If that means some other branches can't fit, well at least it got used in some places.

However here I'm only arguing my opinion vs. yours. The right thing to try would be to convert from JMP JMP to an indirect JMP. If the indirect JMP is faster then you are right for that workload. If the JMP JMP is faster then I'm right for that workload. It's probably not all that hard to get gcc to produce either kind of code, so the big issue would be finding the right workload.

Any ideas?

very fast special-purpose cache; there's another cache - the L1 - right nearby that could also contain that same information

The BTB is actually included in the L1 cache on some CPUs. At least one SPARC, and I think the AMD K7. The down side is they limit the number of jumps that can have a BTB entry per cache line. I guess the other down side is they make the L1 cache take more transistors per line, and possibly reduce its size.

So you save yourself a cycle on the method dispatch (if repeated) by using the BTB instead of the L1, in return for which you create a nice fat pipeline bubble for someone else when they hit a branch that would have fit in the BTB if not for your shenanigans. That's not a win, it's just shifting the load.

Oh, it's more than a cycle in many CPUs. In fact of all the modern CPUs except the PowerAS it is a fair bit more than a cycle; if you read the PowerAS papers they make a big deal about it. I think the PowerAS is also known as the IBM North Star, it is the CPU in their more recent AS/400ish systems.

I do agree that if the BTB entry for the JMP JMP pushes out an entry that would see at least as much use it is just shifting the load. Or worse if it pushes out an entry that is used more. A good BTB replacement algorithm can reduce the chances of that happening. A very large BTB (like the ones tied directly to the L1 cache) will also reduce the chances.

On the other hand the BTB entry for the JMP JMP may push out a less often used entry, or no other entry at all. In those cases it is a win.

The question is, does it win more than it loses? The true answer will depend on the CPU and the benchmark. I expect it to be a win though.
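A toy model of the two cases, not a description of any real CPU: treat the BTB as a small fully associative table with LRU replacement and replay a loop of branch addresses through it. The functions btb_hits and loop_over are invented for this sketch, and real BTBs are indexed and replaced quite differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Replay branch addresses through a fully associative, LRU-replaced
// table of `capacity` entries and count how many lookups hit.
std::size_t btb_hits(std::size_t capacity, const std::vector<unsigned>& branches) {
    assert(capacity > 0);
    std::list<unsigned> lru;  // front = most recently used
    std::size_t hits = 0;
    for (unsigned addr : branches) {
        auto it = std::find(lru.begin(), lru.end(), addr);
        if (it != lru.end()) {
            ++hits;
            lru.erase(it);
        } else if (lru.size() == capacity) {
            lru.pop_back();  // evict the least recently used entry
        }
        lru.push_front(addr);
    }
    return hits;
}

// `rounds` passes over branch addresses 0..n-1, like a loop body with n branches.
std::vector<unsigned> loop_over(unsigned n, unsigned rounds) {
    std::vector<unsigned> stream;
    stream.reserve(static_cast<std::size_t>(n) * rounds);
    for (unsigned r = 0; r < rounds; ++r)
        for (unsigned a = 0; a < n; ++a) stream.push_back(a);
    return stream;
}
```

In this model four branches looping ten times in a four entry table hit 36 times out of 40. Add a fifth branch (say the JMP JMP) and plain LRU hits zero times, which is the "just shifting the load, or worse" case. Give the table five entries and it hits 45 out of 50, the "no other entry pushed out" case. So the answer really does hang on table size and replacement policy.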

I sincerely hope you're asking how false sharing applies to this particular situation, not what false sharing is, because if you meant the latter then you should be reading H&P instead of posting here.

Yes, I'm asking how it applies here. I read H and P in '92. I know there is a new edition, but getting it is pretty far down on my reading list. It did actually make it onto my bookshelf and there have been 4 moves between then and now though.

False sharing is an issue because a single cache line on a modern processor is likely to span multiple vtbl entries. Naive vtbl-patching code that does manual icache invalidation would therefore be likely to go through all that overhead multiple times.

For a statically linked C++ program there is no vtbl patching. None. For a dynamically linked one I don't expect the costs to be different from the C version which has a similar table.

Ick. The only alternative would be to have the vtbl-patching code be *deeply* aware of the local machine's cache line size (i.e. not just hidden in some memory-munging library routines). Also ick. That kind of machine-specificity needs a reason, and there just doesn't seem to be much of one so far.

Yeah but the assumption would probably be 16 bytes because that is a really really common number, and even if it happens to be only 50% or 25% of the cache line that is better than doing only a single address.

Of course that would be if there were any vtbl patching code, because there isn't any.
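The arithmetic behind that, as a small sketch. The function lines_spanned is a made-up helper that counts how many distinct cache lines a run of table entries covers, which is how many invalidates a per-line patcher would need instead of one per entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of distinct cache lines covered by `count` table entries of
// `entry_bytes` each, starting at byte address `base`.
std::size_t lines_spanned(std::uintptr_t base, std::size_t count,
                          std::size_t entry_bytes, std::size_t line_bytes) {
    assert(entry_bytes > 0 && line_bytes > 0);
    if (count == 0) return 0;
    std::uintptr_t first = base / line_bytes;
    std::uintptr_t last = (base + count * entry_bytes - 1) / line_bytes;
    return static_cast<std::size_t>(last - first + 1);
}
```

Eight 4 byte entries starting on a line boundary cover two 16 byte lines, so a patcher that guesses 16 does two invalidates instead of eight. If the real line is 64 bytes those two invalidates hit the same line, which wastes one but still beats eight. If the table starts 8 bytes into a line, the same eight entries straddle three 16 byte lines.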

The same as it has always been, Sparky: whether double jumps as an alternative to indirect jumps are a reasonable or sucky idea.

Independent of context?

If you know that the target won't be altered you can get a very different answer than if you assume the target will be mutable. Or even a different answer on mutable but almost never changed vs mutable and changed frequently.

The answer can also depend a lot on the CPU and other things, like is the main memory system high latency (like maybe remote on a NUMA) or low latency (like on a CPU with an integrated SDRAM or RDRAM controller).

I expect it is the correct thing with an immutable target on most but not all CPUs.

If you're having trouble making the connections between the issues we're discussing and that basic point, let me know and I'll dumb it down a little more for you.

If you don't want to debate, don't debate. There is no need to stoop to insulting your debating partner.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-29 21:58 · Score: 2

I'm sure we could have a very interesting discussion about the relative merits of double jumps vs. indirect jumps if you'd cooperate, because you seem to know more than most /.ers about how CPUs work. However, as long as you're going to deny that these systemwide costs exist at all - things like false sharing, extra interprocessor communication in an SMP system to do TLB shootdowns, pollution of the BTB when the regular L1 cache is damn near as good - then that's not going to happen.

Those things do exist (for the most part). C++ doesn't cause them though. C++ using the double jump doesn't make these problems any worse (except, arguably, "pollution of the BTB..."). If you are not interested in C++'s use of this feature and want to discuss modifiable i-space in general, I'm fine with that.

But could you let me know what the hell the topic is?

Now of the costs that you listed, what do you define "false sharing" as, and what does "pollution of the BTB when the regular L1 cache is damn near as good" mean?

then that's not going to happen

Well it won't happen unless we are both on the same topic. Care to let me know what the topic is?

How disappointing.

Pretty much, but you can change that.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-29 06:36 · Score: 2

No, it's the same topic because it impacts the same solution.

I thought the topic was whether the C++ double jump through a vtbl was better than a single indirect jump through the vtbl. C++ does not allow overloading of functions of a single object, it requires a new class for that. So C++ will not benefit from a writable vtbl.

If the topic is something else, please let me know so I can either argue my point, or agree with yours.

Please stop trying to redefine the topic to suit yourself.

Well, it was a reply to a C++ article, so I assumed we were discussing C++, and CPUs slipped in only in relation to that. I admit I may be talking about a different topic than you, though not to frustrate you, but because I was unaware you were discussing something else.

"Seldom" is not equal to "never", and we were talking about the assumption that i-space would *never* change because that's the only assumption that would make the proposed solution seem reasonable.

No CPU I know of makes that assumption. All of them allow an i-space change, because all of them need to allow programs to be loaded in. Given the rarity of the changes, many require something fairly expensive to be done after the change (like an i-cache flush, or on things like the MIPS a controlled flush of some of the lines).

We're not talking about self-modifying code, as much as you seem to be hoping that the taint associated with that phrase will stick to anyone who disagrees with you. We're talking about mutable data in i-space, and about the nasty hack of using double jumps with the intermediate target in i-space to "trick" CPUs and make method dispatch a cycle or two faster without considering the effect of such a hack on the rest of the system.

Ok, does that mean you think the vtbl is mutable? The vtbl is immutable. It is never changed. It is a constant. I don't even know any non-portable ways to change the vtbl.

If C++ itself was changed in a way that needed mutable vtbls, I expect the double jump through a vtbl would be changed back to an indirect jump and the vtbl moved to a writable area. But I'm not sure. It might be cheaper on some CPUs to do selective i-cache flushes (like on the CPUs that allow single cache lines to be invalidated).

But you're almost right. Making this particular hack work faster makes the rest of the system slower. That's exactly the point. Congratulations on finally getting it.

Are we talking about the way C++ uses the CPU? If so, it doesn't make anything slower, as it uses immutable i-space to hold immutable data (technically it is code, I guess, but immutable nonetheless).

If we are talking about the way CPUs make modifying i-space expensive, then I expect you are wrong, but it does depend on exactly what code you need to run. It is a big argument, and unless it is one you are interested in I'll leave it dormant.

If we are talking about something else please do let me know.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-29 01:05 · Score: 2

You're so full of crap. Yes, those hidden costs do exist and must be paid *even though* the contents of the table are in fact effectively constant. That's the problem. You still have to invalidate the i-cache, you still have to forego VM-level optimizations, etc. because the values *could* change.

Eh? The vtbl is stored in immutable code space (in most implementations); after all, the vtbl is immutable. The only i-cache invalidate is done when the OS maps the page in the first time. I'm unaware of what VM optimizations are being forgone.

You can bitch that the vtbl being immutable is a bad requirement, and that it prevents C++ from being as flexible as Ruby. But that is a different topic.

So I say there are no hidden costs in how C++ implements this trick. You can argue that there are hidden costs in how the CPU thinks about i-space, but that is a very different argument.

It does no such thing. Any CPU that makes such an assumption about the immutability of i-space could be considered broken.

So, with the sole exception of the x86, all modern CPUs are broken? They all assume that i-space is very seldom altered, and that altering it can be quite costly. In fact, even the modern x86 assumes that. The other CPUs require an i-cache invalidate; the x86 makes stores slower by snooping i-cache addresses, or makes the i-cache smaller, or both.

That seems a little silly, as i-space modification is rare; it wasn't even all that common when it was easy. The only place I can think of where it is all that useful is a low-level graphics system, but that kind of code can frequently be pre-expanded, or even more commonly hardware assist is used now anyway.

I have written self-modifying code (at least) three times in the last 20 years. I don't mind it being slightly harder now. Have you ever written any?

That breakage can be worked around by having the VM system play nasty tricks with making i-space pages read-only etc., but the cost of having to cover for the CPU's deficiencies like that is much greater than the benefit.

Or frequently at the "cost" of finding stray pointer usage much sooner.

Try looking at the problem from a *system* standpoint for a change, instead of a myopic "how can a CPU designer avoid work" standpoint.

From a system standpoint making self modifying code faster makes everything else slower. Is there really enough self modifying code to make that a good deal?

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-24 04:34 · Score: 2

I think the point is that the tables of jumps do not change, and thus the icache is not invalidated. The table being talked about is the vtab for a given class and it is a constant.

Exactly. The hidden costs the other poster was talking about don't exist.

This jmp-to-jmp stuff is a way to fool the CPU cache and predictor circuitry into assuming the location is constant, because it figures that jmp instructions are constant. It does seem kind of annoying that it is worth doubling the table size (and thus halving how much fits in the cache) in order to get around a mistaken programming assumption by the CPU designers.

Well, it doesn't so much fool the CPU into thinking the location is constant as to actually inform it that the location is constant.

I think it would be better for a CPU jump predictor to assume *everything*, whether in instruction, read only, or read/write data, is constant. Modern C++ code typically accesses a given location many orders of magnitude more times than it modifies it!

C++ isn't the only thing that runs on a CPU. It may well be a good idea to have an indirect jump that does allocate a BTB entry, or to have BTB entries allocated if the indirect address is on a read-only page (this may be hard to tell). Of course, one could only do this on a CPU that deals with the BTB entry being incorrect (most modern CPUs do; OOO machines can do it fairly trivially), otherwise you can get some odd problems.

Re:is there actually *that* much money in it? on Displaced Techies Find Sex Sells, And Pays · 2001-04-23 23:38 · Score: 2

That's because amateur porn is pretty popular right now. Who wants to see yet another airbrushed nude of Pamela Anderson when we can see Polaroids of that girl we ran into at the bar last night?

I thought am. porn was "normal looking" women, plain or no sets, no airbrush (or photoshop), single flash, flash shadow even. But not poor photography that makes the poor woman's skin look uniformly white. Ugh.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-23 23:07 · Score: 2

Pick ANY task.

Mmap a large binary file, treat it as an array of int, sort it. Treat it as an array of char, sort it. Use the language provided library to do both sorts.

Write a C and C++ program to accomplish the task.

Done. The C++ STL version is a few lines shorter than the C qsort version.

Pick any C and C++ compiler. You make the choice of which.

The gcc provided with BSD/OS 4.2 x86, or the SPARC version. For both languages.

The C code will finish first. Always.

Odd, the C++ program seems to have been eight times faster. Oh, look, now that the file is in the cache it is 14 times faster.

You simply cannot argue with facts.

What's the alternative, arguing without facts, as you have just done?
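For anyone who wants to try the comparison, here is a minimal sketch of the two sorts side by side. It is not the original test program: it sorts an in-memory array of int rather than an mmap'ed file, the function names are made up for illustration, and it makes no timing claim. The difference worth looking at is that qsort calls its comparator through a function pointer on every comparison, while std::sort is a template that can compare ints inline.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Comparator in the shape qsort requires. qsort reaches it through a
// function pointer, once per comparison.
int compare_ints(const void* a, const void* b) {
    const int x = *static_cast<const int*>(a);
    const int y = *static_cast<const int*>(b);
    return (x > y) - (x < y);  // avoids the overflow that "x - y" can hit
}

// C library route (hypothetical wrapper name).
void sort_with_qsort(int* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    std::qsort(data, count, sizeof(int), compare_ints);
}

// C++ library route (hypothetical wrapper name). The element type is known
// at compile time, so the comparison can be inlined into the sort loop.
void sort_with_std_sort(int* data, std::size_t count) {
    assert(data != nullptr || count == 0);
    std::sort(data, data + count);
}
```

To get numbers of your own, point both functions at copies of the same large array and time them separately; the ratio will depend on the compiler, the optimization level, and the CPU.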

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-23 22:43 · Score: 2

I truly don't understand this. C always can tell the pointer type, because it is static. Or are you thinking about equivalent code in C?

I was talking about the equivalent code. Yes, C always knows the pointer type, or more accurately C assumes it knows the pointer type, and if it is wrong, the programmer (or user) will pay.

The discussion is idiotic. Algorithms and design are what's crucial, not syntactic sugar. High-level languages just improve your efficiency by many orders of magnitude (whatever that is).

It is your right to think that. Personally I find it easier to bring people into the fold by convincing them to use the more powerful language, and its cheaper features.

Of course, the first time you look at code that finds the 95th percentile by doing a sort, and replace it with nth_element, converting an O(N log N) algo into an O(N) (on average) algo and saving days of runtime in a billing application, yes, they will buy the algo argument. One saving like that can make up for a lot of runtime inefficiency.

However, if you never get them to use C++, you will never get them to "just call nth_element". So I like to start by saying "Yo, if you micromanage C's runtime speed, you can do that in C++ too, and while you're in there try templates, they'll make that micromanagement way simpler, go check out the STL too....". That tends to convert more people.
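The percentile swap mentioned above can be sketched like this. The function names and the index rule (truncating pct/100 * (N-1)) are assumptions for illustration, not the billing code in question; the point is only that nth_element puts the k-th smallest value in position k without ordering everything else.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: index of the requested percentile in sorted order.
// The truncating rule here is an assumption; pick whatever rule your
// application defines.
std::size_t percentile_index(std::size_t count, double pct) {
    assert(count > 0);
    assert(pct >= 0.0 && pct <= 100.0);
    return static_cast<std::size_t>(pct / 100.0 * (count - 1));
}

// The slow way: sort everything, then index. O(N log N).
double percentile_by_sort(std::vector<double> samples, double pct) {
    const std::size_t k = percentile_index(samples.size(), pct);
    std::sort(samples.begin(), samples.end());
    return samples[k];
}

// The cheap way: partial ordering only. Linear on average.
// The vector is taken by value because nth_element reorders its input.
double percentile_by_nth_element(std::vector<double> samples, double pct) {
    const std::size_t k = percentile_index(samples.size(), pct);
    std::nth_element(samples.begin(), samples.begin() + k, samples.end());
    return samples[k];
}
```

Both functions return the same value for the same input; only the amount of work differs.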

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-23 22:27 · Score: 2

don't understand this. Is this because the processor assumes that any location you jump to is nonvolatile? It seems that an indirect jump could be "predicted" just as well by assuming the contents of the pointed-to memory location is the same as last time and this would require no more circuitry than the jump-to-jump predictor.

I'm not a CPU designer, but I hang out in comp.arch a lot. So take this with a grain of salt.

Some CPUs sniff writes to areas covered by the i-cache, and will do a lot of work when they are detected. Assuming BTB targets must be in the i-cache (true on some CPUs) that provides a good way to catch changing tables of jumps, but not tables of jump addresses.

No CPU I know of sniffs BTB source addresses. I think that is in part because BTBs became popular in micros well after self-modifying code became "evil".

Many CPUs (pretty much everything other than the x86 -- and other really old things like the 390) require an (i-)cache invalidation between modifying code and executing that code. The cache invalidation will also invalidate the BTB. So the CPU can feel free to use the BTB to optimize a JMP-JMP sequence, but not to optimize an indirect jump.

While I'm on the topic of CPUs, I think the POWER (incl. PowerPC) has dedicated branch registers, and doing an indirect branch through them is quite fast, at least if the CPU has time to prefetch the targets, or they are in the i-cache. Except on the PowerAS, where the branch registers aren't special, but the pipeline is so short (an amazing six cycles at 500+ MHz) that pipeline stalls are cheap enough that they don't do target prediction, only static branch prediction.

Could a CPU make indirect branches as fast as JMP JMP branches? Sure, but I think it would slow down all data stores, or all uses of the BTB, or both. It doesn't seem worth it with current language usage. Could C implement function pointers as JMP JMP? Sure, but that would make function pointers wider than normal pointers (or waste space in normal pointers).

Does this make C++'s virtual function calls faster than C's? That is going to depend a lot on the CPU and the usage patterns. I doubt they will be faster on a PowerPC, but they could be faster on an AMD K7, or the Intel P3 and P4, and they were for sure in one usage on the SPARC for me. Which is why I went down this rathole in the first place, to figure out how that could be.

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-23 22:05 · Score: 2

That double indirection is a loader artifact, not really part of the language.

It is independent of ld.so. Well, actually it might be done for similar reasons. It used to always be an indirect jump (like in C, but with the vtbl indirection), but has been changed to a jump through a jump because that runs faster on many platforms.

See the gcc C++ archives for details, or maybe you can find a comp.arch archive.

Re:Rationale Garbage Collection Explained -- sort on Next Generation C++ In The Works · 2001-04-23 21:57 · Score: 2

Properly implemented, GC has no impact on the class's interface, or even the lifetime of the objects.

Only reference counting can destroy objects as quickly as explicit memory management. Reference counting is also the slowest GC, and frequently not even accepted as GC because it breaks in the presence of even trivially circular objects. So what most people would think of as good GC will extend the lifetime of objects somewhat (except when it fixes a memory leak and radically reduces the lifetime of the object).

If your destructor is important, use of auto_ptr or other "smart" pointers will be needed even in a GCed language. (don't get me wrong, I like GC, I just don't like overselling it)
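The circular-object problem is easy to show with any reference-counted pointer. The sketch below uses std::shared_ptr and std::weak_ptr, which come from a later C++ standard than this discussion (boost had counted pointers of the same flavor); the Node type and function names are made up for illustration. Two nodes that hold counted references to each other are never destroyed, while the same shape with one weak link is torn down promptly.

```cpp
#include <cassert>
#include <memory>

// Hypothetical node type that counts how many instances are alive.
struct Node {
    static int live;
    std::shared_ptr<Node> strong_next;  // counted reference
    std::weak_ptr<Node> weak_next;      // uncounted reference
    Node() { ++live; }
    ~Node() { --live; }
};
int Node::live = 0;

// Builds a two-node cycle of counted references and reports how many
// nodes are still alive after every local pointer has gone away.
int nodes_leaked_by_strong_cycle() {
    const int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->strong_next = b;
        b->strong_next = a;          // closes the cycle
        assert(a.use_count() == 2);  // one local pointer, one from b
    }
    return Node::live - before;      // each node keeps the other alive
}

// Same shape, but the back link is weak, so no cycle of counts exists.
int nodes_leaked_by_weak_back_link() {
    const int before = Node::live;
    {
        auto a = std::make_shared<Node>();
        auto b = std::make_shared<Node>();
        a->strong_next = b;
        b->weak_next = a;            // does not bump a's count
        assert(a.use_count() == 1);
    }
    return Node::live - before;
}
```

The first function deliberately leaks its two nodes; that leak is the failure mode being described, and a tracing collector would not have it.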

Re:What about making it a little less bloated? on Next Generation C++ In The Works · 2001-04-23 11:00 · Score: 2

That's simply impossible. A C++ virtual function is an array index (into vtbl) followed by a call through a pointer

Nope. It is an array index followed by a jump to that location which contains a jump to the new location. Or at least that is a common way to do it (it does double the size of the vtbl). That tends to make branch predictors happier...

Now, other than that one little edge, I did kind of forget the extra indirection, which does make the C++ call a bit slower, but compared to the cost of the pipeline bubble from doing a jump through a pointer, or of having the BTB miss (10 to 20 cycles), the extra two for the indirect (assuming a cache hit) is pretty minor.

Oops. My bad.
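For readers who have not looked under the hood, the "array index into vtbl followed by a call through a pointer" description can be written out by hand. This is only a sketch of the plain pointer-table form: the Shape type, slot names, and functions are invented for illustration, and the jump-to-a-jump layout is a compiler code generation choice that cannot be expressed in portable C++.

```cpp
#include <cassert>

struct Shape;  // hypothetical object type

// Every "virtual method" in this sketch has the same signature.
typedef int (*Method)(const Shape*);

enum Slot { SLOT_AREA = 0, SLOT_SIDES = 1, SLOT_COUNT = 2 };

struct Shape {
    const Method* vtbl;  // points at a constant, per-class table
    int width;
    int height;
};

int rect_area(const Shape* s) { return s->width * s->height; }
int rect_sides(const Shape*) { return 4; }
int tri_area(const Shape* s) { return s->width * s->height / 2; }
int tri_sides(const Shape*) { return 3; }

// One table per "class". They are const and never written after load,
// which is the property the whole thread is arguing about.
const Method rect_vtbl[SLOT_COUNT] = { rect_area, rect_sides };
const Method tri_vtbl[SLOT_COUNT] = { tri_area, tri_sides };

// Dispatch: load the table pointer, index it, call through the result.
int call_virtual(const Shape* s, int slot) {
    assert(slot >= 0 && slot < SLOT_COUNT);
    return s->vtbl[slot](s);
}
```

A real compiler generates the tables and the dispatch for you; the extra load of the vtbl pointer is the "extra indirection" compared with a bare C function pointer.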

Re:No1 Wish: try/finally on Next Generation C++ In The Works · 2001-04-23 10:35 · Score: 2

auto_ptr is less than 100 lines of code. You can take the one from and template it up a bit more to use any allocate/free functions you like. You can even get some counted versions from boost.

finally may still be nice for some things, but following the "all resource allocation is object creation/all deallocation is destruction" design will eliminate 99% of those.
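A minimal sketch of that design, assuming a made-up Connection resource and a made-up Scoped template (this is not auto_ptr itself, just the same idea with a pluggable release function): the destructor does the cleanup, so it runs on normal return and when an exception unwinds the stack, which is the job finally would otherwise do.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical owner: holds a resource pointer and a release function,
// and calls the release function exactly once, in the destructor.
template <typename T, typename Release>
class Scoped {
public:
    Scoped(T* resource, Release release)
        : resource_(resource), release_(release) {}
    ~Scoped() { if (resource_) release_(resource_); }
    Scoped(const Scoped&) = delete;             // one owner only
    Scoped& operator=(const Scoped&) = delete;
    T* get() const { return resource_; }
private:
    T* resource_;
    Release release_;
};

// Hypothetical resource and cleanup routine, with a counter so the
// cleanup can be observed from outside.
struct Connection { bool open; };
int closed_count = 0;
void close_connection(Connection* c) { c->open = false; ++closed_count; }

// No try/finally needed: the guard closes the connection whether this
// function returns normally or throws.
void do_work(bool fail_midway) {
    Connection conn = { true };
    Scoped<Connection, void (*)(Connection*)> guard(&conn, close_connection);
    assert(guard.get()->open);
    if (fail_midway) throw std::runtime_error("failed midway");
}
```

Calling do_work(true) inside a try block and checking closed_count afterwards shows the cleanup ran even though the function never reached its closing brace.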

Slashdot Mirror

User: stripes

Comments · 1,586