Mr+Z · Slashdot Mirror

Re:Rotating Registers... on Intel's Itanium Processor Explained · 2000-12-03 16:24 · Score: 2

Write-After-Read hazards are particularly interesting in the case of software pipelined loops. First, some terminology: a value is live from its earliest definition to its last use. In the example above, x is live from the first statement until the second within the body of the loop. In a given loop, a value may be live for quite a long time. However, the initiation interval for the loop might be quite short. This can lead to problems, such as violated Write-After-Read hazards.

Ack, it's late and I'm tired, and I forgot to link this back to my introduction of loop-carried dependences. In this case, the way you avoid the violated W-A-R hazard is to introduce a new dependence known as an anti-dependence. An anti-dependence is a dependence on the use of data relative to its destruction; in contrast, a flow dependence is a dependence on the creation of data relative to its use.

In this example, an anti-dependence exists from g[i] = e + b to b = a[i] on the next iteration. This forms a cycle in the dependence graph, and gives us a much larger recurrence bound. This leads to an artificially high initiation interval and low performance.

We break this recurrence by inserting the moves I mentioned in the remainder of my post, or by using rotating registers. Sorry for my lameness there.

--Joe
--
Program Intellivision!

Yummy Intel Documentation Goodness on Intel's Itanium Processor Explained · 2000-12-03 16:02 · Score: 3

With all the wild speculation going on around here, I thought it might be worth throwing some actual links in here to real information.

Itanium Processor Family Home -- has links to all sorts of IA-64 material.
The IA-64 Architecture Specifications and Guides -- lots of good documentation links.
And don't forget Itanium[TM] Processor Microarchitecture Reference.

I haven't read all of these myself, but I have pored over the details that are most relevant to my work. :-)

Have fun.

--Joe
--
Program Intellivision!

It does do bitwise rotate on Intel's Itanium Processor Explained · 2000-12-03 15:46 · Score: 2

Actually, I just looked it up, and you're wrong. On Page 4-6 of IA-64 Application Developer's Architecture Guide, Rev 1.0, it says specifically, and I quote: (emphasis mine)

The shift right pair (shrp) instruction performs a 128-bit-input funnel shift. It extracts an arbitrary 64-bit field from a 128-bit field formed by concatenating two source general registers. The starting position is specified by an immediate. This can be used to accelerate the adjustment of unaligned data. A bit rotate operation can be performed by using shrp and specifying the same register for both operands.

So there.
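
For anyone who wants to see what that quoted paragraph means in practice, here's a rough C model of a funnel shift and the rotate you get from it. It's only a sketch of the idea, not the exact instruction semantics from the manual, and the names funnel_shift_right and rotate_right64 are made up for illustration.

```c
#include <stdint.h>

/*
 * Model of a 128-bit-input funnel shift: treat hi:lo as one 128-bit
 * value, shift it right by count bits, and keep the low 64 bits.
 * count is reduced to 0..63 here; that range is an assumption of this model.
 */
uint64_t funnel_shift_right(uint64_t hi, uint64_t lo, unsigned count)
{
    count &= 63;
    if (count == 0)
        return lo;              /* avoid an undefined 64-bit shift below */
    return (lo >> count) | (hi << (64 - count));
}

/* Rotate right: feed the same value in as both halves of the funnel. */
uint64_t rotate_right64(uint64_t x, unsigned count)
{
    return funnel_shift_right(x, x, count);
}
```

Passing the same value as both operands is the whole trick: the bits that fall off the bottom of the low half come back in from the bottom of the high half.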

--Joe
--
Program Intellivision!

Re:Rotating Registers... on Intel's Itanium Processor Explained · 2000-12-03 15:33 · Score: 1

Tee hee! Actually, it's kinda interesting, speaking of 4GB RAM, the next workstation I'm scheduled to get at work will have ~4GB RAM in it, and two happy UltraSPARC III CPUs in the ~800MHz range. Whee! (And to think those US IIIs came out of our fab just up the street! Whoo hoo!) The design jobs that run on my workstation really use that much RAM too. They're not my jobs though -- I'm a software guy. All our workstations are in a load-sharing queue, offering gobs of MIPS for crunching all of design's jobs day in and day out. I can kick the jobs off my node during the day as needed, though, which is nice, especially on my current workstation which only has a half-gig of RAM.

As for light, content-free entertainment, you can get that anytime by setting your threshold to -1. I find it rather amusing. :-)

--Joe
--
Program Intellivision!

Re:You havn't been paying much attention: on Intel's Itanium Processor Explained · 2000-12-03 14:51 · Score: 2

1. Many modern CPUs perform 'Predication', often called something speculitive execution insted. Processors such as P6, K7, and EV6 all perform this optimization.

This is somewhat true but not completely. Predication is a form of speculative execution, but is qualitatively different from the speculative execution that most CPUs do when they branch-predict. The problem is that these architectures don't really have a way to execute down both sides of a branch. To equal what predication provides, you'd actually need to be able to fetch down several code paths in parallel and know which instructions to discard. Icky. Predication allows that to happen in a single code path, because you can put both "if (cond)" and "if (!cond)" paths directly in parallel, or even better, "if (cond1)", "if (cond2)" ... "if (condN)".

Predication is very useful for eliminating short branches and flattening small switch-case statements into effectively straight-line code. It's a much, much, much more effective method for speculative execution than trying to fetch and execute down multiple code-paths.
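
Here's a small C sketch of what flattening a switch looks like. It only models the idea at the source level: the function names are hypothetical, and whether a given compiler turns the second version into predicated or conditional-move instructions is not guaranteed.

```c
/* Ordinary branching version of a tiny three-way switch. */
unsigned alu_switch(int op, unsigned a, unsigned b)
{
    switch (op) {
    case 0:  return a + b;
    case 1:  return a - b;
    case 2:  return a & b;
    default: return 0;
    }
}

/*
 * If-converted version: compute one predicate per case, evaluate every
 * arm unconditionally, and let each predicate decide whether its result
 * is kept. There are no branches left in the body.
 */
unsigned alu_predicated(int op, unsigned a, unsigned b)
{
    int p0 = (op == 0);
    int p1 = (op == 1);
    int p2 = (op == 2);
    unsigned result = 0;

    result = p0 ? a + b : result;   /* (p0) result = a + b */
    result = p1 ? a - b : result;   /* (p1) result = a - b */
    result = p2 ? a & b : result;   /* (p2) result = a & b */
    return result;
}
```

The operands are unsigned on purpose: every arm runs whether or not its predicate is true, so each one has to be safe to execute speculatively.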

--Joe
--
Program Intellivision!

Re:Some highlights... on Intel's Itanium Processor Explained · 2000-12-03 14:44 · Score: 1

Not that you could do rotates from C or anything...

--
Program Intellivision!

Re:I can't stand Java, but maybe that's just me... on Why Linux Lovers Jilt Java · 2000-12-03 14:33 · Score: 1

The preprocessor is supposed to strip comments before processing preprocessor directives, so that commented-out directives don't get processed. In ancient compilers, the comment was completely stripped in some compilers, leading to the following idiom for symbol concatenation:

#define CONCAT(x,y) x/**/y

As you can imagine, it may have worked, but it wasn't hugely portable. ANSI put its foot down and stated that comments are replaced by a single character of whitespace, and so the CONCAT(x,y) macro above won't work. ANSI, recognizing that something like CONCAT(x,y) might actually be useful, though, specified the ## operator:

#define CONCAT(x,y) x##y
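
A quick sketch of the ## form in use. CONCAT is the macro from above; CONCAT_EXPAND, PREFIX, and the counter_ names are hypothetical additions for illustration. The extra macro level is the usual trick for when the arguments are themselves macros, since ## pastes its operands before they get expanded.

```c
#define CONCAT(x, y) x##y

/* Extra level so that macro arguments are expanded before pasting. */
#define CONCAT_EXPAND(x, y) CONCAT(x, y)

#define PREFIX counter_

int CONCAT(counter_, a) = 1;        /* declares counter_a */
int CONCAT_EXPAND(PREFIX, b) = 2;   /* declares counter_b */

int sum_counters(void)
{
    return counter_a + counter_b;
}
```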

Now what does this have to do with C++ comments?

Well, nothing in C (or I believe early C++) says that the preprocessor need be aware of C++ style comments. I believe early Cfront-based C++ compilers just reused the existing C preprocessor. The result is that the C++ comment gets included as part of the macro. Wackiness ensues when the macro is used in the middle of a line:

#define MYCONST (42) // The answer!

x = MYCONST;
y = 69;

In that snippet, on older compilers whose C preprocessor is not C++ comment aware, the semicolon after MYCONST lands inside the comment, so you'll effectively get x = (42) y = 69; (spread over two lines, of course). Whee.

--Joe
--
Program Intellivision!

Rotating Registers... on Intel's Itanium Processor Explained · 2000-12-03 14:15 · Score: 5

Well, it seems Sharky glossed right over this one. They don't seem to get what rotating registers are for. They just make some vague statement about them working well for streaming things or something. *sigh*

One of the chief techniques that VLIW (and EPIC) processors will use to extract parallelism from looping code is Software Pipelining. This technique extracts parallelism across multiple loop iterations by scheduling them in parallel. The most popular form of software pipelining, Modulo Scheduling, offsets the loop iterations by a fixed interval known as the initiation interval.

The minimum possible initiation interval for a software pipelined loop is limited by two factors: The resource bound for the loop, and the recurrence bound for the loop. The resource bound is determined by counting up all the resources the loop uses and finding the minimum # of cycles (ignoring dependences) that you could pack everything into. The recurrence bound is a little trickier.

The recurrence bound is the bound imposed by loop-carried dependences in the loop. That is -- dependences that feed from one iteration of the loop into future iterations. For instance, in the following loop, there's a dependence from the result written to "z" on one iteration to the calculation of "x" on the next:

for (i = 0; i < N; i++) {
    x = z ^ 3;
    y = x + 42;
    z = y * 69;
}

On an architecture with infinite resources, this loop is still recurrence bound by the path from x to y to z, back to x. So, what does this have to do with rotating registers?
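
As an aside, the two bounds described above can be sketched in a few lines of C. This is a simplified model, not how any particular compiler does it: the struct and function names are hypothetical, the resource bound is taken per class of functional unit, and each recurrence is summarized by the total latency around its cycle and the number of iterations it spans.

```c
/* Round-up integer division for positive operands. */
static int ceil_div(int a, int b)
{
    return (a + b - 1) / b;
}

/* One class of functional unit and how heavily the loop body uses it. */
struct resource_use {
    int ops_per_iteration;
    int units;
};

/* One dependence cycle: total latency around it, iterations it spans. */
struct recurrence {
    int latency;
    int distance;
};

int resource_bound(const struct resource_use *res, int count)
{
    int bound = 1;
    for (int i = 0; i < count; i++) {
        int cycles = ceil_div(res[i].ops_per_iteration, res[i].units);
        if (cycles > bound)
            bound = cycles;
    }
    return bound;
}

int recurrence_bound(const struct recurrence *rec, int count)
{
    int bound = 1;
    for (int i = 0; i < count; i++) {
        int cycles = ceil_div(rec[i].latency, rec[i].distance);
        if (cycles > bound)
            bound = cycles;
    }
    return bound;
}

/* The minimum initiation interval is the larger of the two bounds. */
int min_initiation_interval(int res_bound, int rec_bound)
{
    return res_bound > rec_bound ? res_bound : rec_bound;
}
```

For the x/y/z loop, if you assume (purely for illustration) a 1-cycle XOR, a 1-cycle ADD, and a 2-cycle multiply, the cycle has latency 4 over a distance of 1 iteration, so the recurrence bound is 4 no matter how many ALUs you have.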

Well, so far, I've just described flow dependences. If you pick up a copy of Hennessy and Patterson's Computer Architecture: A Quantitative Approach, you'll see that this corresponds to "Read after Write" hazards -- meaning a later instruction reads a result written by an earlier instruction. There are two other sorts of hazards to watch out for: Write-After-Write (two instructions writing to the same place have to write in order), and Write-After-Read (a later instruction might clobber a value read by the current instruction).

Write-After-Read hazards are particularly interesting in the case of software pipelined loops. First, some terminology: a value is live from its earliest definition to its last use. In the example above, x is live from the first statement until the second within the body of the loop. In a given loop, a value may be live for quite a long time. However, the initiation interval for the loop might be quite short. This can lead to problems, such as violated Write-After-Read hazards.

Suppose we have the following code:

for (i = 0; i < N; i++) {
    b = a[i];
    c = b + t;
    d = c + u;
    e = d + v;
    g[i] = e + b;
}

Suppose we can fit all of this into a single cycle loop on our hardware because we can do four ADDs in parallel, plus the load and the store. Notice that the instructions in the middle are just dependent on each other, and on constants that are initialized outside the loop. Notice that the final instruction uses the second-to-last ADD's result as well as the value we loaded initially.

If we try to put this into a single-cycle loop, we'll have a problem, because we'll load multiple values into b before we even get to the calculation which finds g[i]. Oops. This is because the b = a[i] from a future iteration has moved up above an instruction from the current iteration which reads b--that is, we've violated a Write-After-Read hazard. In software-pipelining parlance, this is a "live-too-long" problem. The value of b is live across multiple iterations.
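
To make the breakage concrete, here's a toy C simulation of that single-cycle loop with b kept in one ordinary register. Everything about it is an assumption for illustration: one stage per cycle, unit latencies, a made-up function name, and stages evaluated last-to-first within a cycle to mimic hardware that reads its operands before anything writes.

```c
/*
 * Naive single-cycle software pipeline with b in one plain register.
 * In cycle k, iteration k loads b while iteration k-4 is only now
 * reaching its final add, so the b it reads has long been overwritten.
 */
void pipelined_one_register(const int *a, int *g, int n,
                            int t, int u, int v)
{
    int b = 0, c = 0, d = 0, e = 0;

    for (int cycle = 0; cycle < n + 4; cycle++) {
        int i;

        /* Later stages first: reads happen before this cycle's writes. */
        i = cycle - 4; if (i >= 0 && i < n) g[i] = e + b;  /* wrong b! */
        i = cycle - 3; if (i >= 0 && i < n) e = d + v;
        i = cycle - 2; if (i >= 0 && i < n) d = c + u;
        i = cycle - 1; if (i >= 0 && i < n) c = b + t;
        i = cycle;     if (i >= 0 && i < n) b = a[i];
    }
}
```

Note that c, d, and e come out fine, because each is consumed exactly one cycle after it's produced. Only b lives longer than the initiation interval, so only b gets clobbered: in this model g[0] ends up built from a[0] and a[3] instead of a[0] twice.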

In a device without rotating registers, you solve this problem by manually copying b to temporary registers. In C code, this might look like so:

for (i = 0; i < N; i++) {
    b = a[i];
    b1 = b;
    b2 = b1;
    b3 = b2;
    c = b + t;
    d = c + u;
    e = d + v;
    g[i] = e + b3;
}

Fine, except that can increase code size, and in some cases impact performance. (It is, however, the technique of choice on processors that implement a minimum of hardware, so as to save power and cost.) Rotating registers alleviate this by performing these copies implicitly whenever the loop branch is taken.
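
Here's a toy C model of how rotation does the renaming for you. It is a sketch under stated assumptions, not the real IA-64 mechanism in detail: the 8-entry file, the helper names rr and rotate, the one-stage-per-cycle schedule, and evaluating stages last-to-first (to mimic reading before writing within a cycle) are all made up for illustration.

```c
#define NUM_ROT 8           /* size of the toy rotating file (assumption) */

static int rot[NUM_ROT];    /* physical registers */
static int rrb;             /* rotating register base */

/* Map a logical rotating register number to its physical register. */
static int *rr(int logical)
{
    return &rot[(logical + rrb) % NUM_ROT];
}

/* Rotate: whatever was logical register r is now logical register r+1. */
static void rotate(void)
{
    rrb = (rrb + NUM_ROT - 1) % NUM_ROT;
}

/*
 * Single-cycle pipelined loop. Each iteration writes b to logical
 * register 0; one rotation later that same value is visible as logical
 * register 1, and four rotations later as logical register 4.
 */
void pipelined_rotating(const int *a, int *g, int n, int t, int u, int v)
{
    int c = 0, d = 0, e = 0;

    rrb = 0;
    for (int cycle = 0; cycle < n + 4; cycle++) {
        int i;

        /* Later stages first: reads happen before this cycle's writes. */
        i = cycle - 4; if (i >= 0 && i < n) g[i] = e + *rr(4);
        i = cycle - 3; if (i >= 0 && i < n) e = d + v;
        i = cycle - 2; if (i >= 0 && i < n) d = c + u;
        i = cycle - 1; if (i >= 0 && i < n) c = *rr(1) + t;
        i = cycle;     if (i >= 0 && i < n) *rr(0) = a[i];

        rotate();           /* happens when the loop branch is taken */
    }
}
```

No copy instructions appear anywhere in the loop body. The b1/b2/b3 chain from the manual version has turned into nothing more than a different register number at each stage.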

So there you have it. That's the scoop behind rotating register files.

--Joe
--
Program Intellivision!

Re:6.4 GFLOPS on Intel's Itanium Processor Explained · 2000-12-03 13:45 · Score: 1

Yeah, except the PSX2 doesn't quite put out enough heat to melt the DVDs you put in it...

--Joe
--
Program Intellivision!

Re:The most important thing on Intel's Itanium Processor Explained · 2000-12-03 13:43 · Score: 1

Ok, Pheon, are you sure of this? What if your L1 cache is on chip and super fast? Isn't this just as good as having a zillion registers?

Not really. Cache memory is never truly as fast as registers. The primary reason (in a parallel architecture) is porting. A register file has ports which connect it to all of the functional units. (A port is a connection from a memory cell to a device which reads or writes that memory.) Multiple functional units can all access the register file in parallel. In contrast, most memory is single ported. Multi-porting a memory either slows it down, or drastically limits its size. When you throw in the cache tag RAMs as well for an L1 cache, you further limit its size and speed, and add a layer of indirection that simply does not exist with registers. And that's just some of the hardware reasons why L1 will always be slower.

In the compiler, things get messy as well with memory operands, as now the compiler must disambiguate references to these operands to know which operations may be safely moved past each other. In languages such as C, you have the unfortunate problem that pointers can point just about anywhere. Many compilers are unwilling to consider pointer arguments as pointing to storage that's independent of even the function's local stack frame, and so you get artificial scheduling hazards which limit the parallelism that the compiler can expose in the code.
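
A small C illustration of the aliasing problem. The restrict qualifier here is C99, which is my assumption about the toolchain; the function names are hypothetical, and how much any given compiler actually exploits the promise will vary.

```c
/*
 * Without more information the compiler must assume dst might overlap
 * a or b, so a store to dst[i] may change a later a[i+1] or b[i+1].
 * That keeps it from moving later loads above earlier stores.
 */
void add_may_alias(int n, int *dst, const int *a, const int *b)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/*
 * With restrict the programmer promises the three arrays do not
 * overlap, which removes those artificial dependences and leaves the
 * scheduler free to overlap iterations.
 */
void add_no_alias(int n, int *restrict dst,
                  const int *restrict a, const int *restrict b)
{
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Both functions compute the same thing when the arrays really are separate. The difference is only in what the compiler is allowed to assume, and calling the second one with overlapping arrays is undefined behavior.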

Other fun reasons: You can't do register renaming on memory locations. You actually have to do memory allocation for memory (pushing values on a stack IS memory allocation). You take memory faults and cache misses at least occasionally for memory -- you NEVER do for registers.

So no, memory can never be as fast as registers. It can get close, but never quite 100%.

--Joe
--
Program Intellivision!

Re:"New" Architecture on Intel's Itanium Processor Explained · 2000-12-03 13:31 · Score: 1

It was supposed to be the first mass-market VLIW chip, though Transmeta beat them to it.

Erm, and how exactly do I program Transmeta's VLIW assembly language? What's that? I can't? Although Transmeta's TMxxxx family of CPUs uses a VLIW architecture presently, nothing constrains them to a particular VLIW instruction set or even to be VLIW on future parts, since the only interface they expose is their emulation of the x86 instruction set. They may as well have a little hamster running in his wheel in there -- as long as it cranks out x86, it won't change the instructions you program it with. (The hamster might not be as fast as the VLIW, though. ;-)

So, with that in mind, I'd say that Itanium will probably be the first mass market VLIW-like programming platform. Of course, TI's TMS320C6000 DSP was probably the first volume-shipping VLIW-on-a-chip, even if it wasn't targeted at the desktop market. The old Multiflow and Cydra machines of yore never quite got that small.

--Joe
--
Program Intellivision!

Re:Quote from JavaPro magazine on Why Linux Lovers Jilt Java · 2000-12-02 12:29 · Score: 1

I suppose you've never heard of Sun's MAJC, eh?

(FYI, MAJC == Microprocessor Architecture for Java Computing.)

--Joe
--
Program Intellivision!

Re:That is unfair on Why Linux Lovers Jilt Java · 2000-12-02 12:00 · Score: 1

Actually, troll, it's pretty common to include the name of the required runtime environment in the name, particularly if said runtime environment is likely to not be the default. Java may be a language, but it also implies the requirement of a JVM. (At least, if you don't have one of the (nonexistent) Java CPUs Sun tried to make.) As you may recall (or not -- you sound like you were born yesterday), when Windows was still fairly new and DOS was still the norm, Windows apps included "for Windows" as part of their name. Such as "MyLeetProgram for Windows 1.25".

Go away troll.

--Joe
--
Program Intellivision!

Re:coincidence? on IBM's OSS Code Morphing Code/or OSS vs. Transmeta · 2000-11-29 05:45 · Score: 1

Unlikely. IBM's been working on DAISY for a loooong time.

--Joe
--
Program Intellivision!

Re:Another way to do emulation on IBM's OSS Code Morphing Code/or OSS vs. Transmeta · 2000-11-29 05:39 · Score: 1

That works as long as you can identify all of the code cleanly. Particularly outside the UNIX world (think DOS / Win9x), it's commonplace to treat data as code and code as data. (Most UNIX programs just rely on the ELF/COFF file format and don't muck with code vs. data attributes, unless they're doing something icky like GNU C's trampolines which put executable code on the stack.... *blech*)

It's hard enough to write a reliable disassembler that doesn't fall over when it hits a jump table, callback, overlay, dynamic library or other indirect method for loading / invoking code. (At least, one that's reliable in the absence of a symbol table that highlights all of the valid entry points.) What makes you think you can reliably re-assemble a binary for a different target in such a setting?

--Joe
--
Program Intellivision!

Re:How will Amiga compete? on IBM's OSS Code Morphing Code/or OSS vs. Transmeta · 2000-11-28 16:43 · Score: 2

Not really. Amiga's just implementing a thin virtual machine layer, providing an "ideal assembly language" that provides more control than, say, C, but still provides sufficient abstraction that the code can be targeted to a wide range of CPUs easily. (This is in stark contrast to, say, the Java VM, which is comparatively quite heavy.) You can think of Amiga's virtual assembly as a "medium level language", if such a term exists.

DAISY translates other ISAs into its own native Tree-VLIW ISA. Rather than providing an abstract assembly language that gets targeted to a wide variety of CPUs, DAISY is doing the reverse: Take a wide variety of ISAs, and target them to this specialized CPU. Transmeta is similar, although they've chosen to focus primarily on x86 to get the biggest bang for their limited bucks.

--Joe
--
Program Intellivision!

Re:Rearranging Compiled Code for Optimization on IBM's OSS Code Morphing Code/or OSS vs. Transmeta · 2000-11-28 16:36 · Score: 2

Newer GCCs have something like this. Look up -fprofile-arcs -ftest-coverage in the GCC/gprof documentation. I haven't looked super closely at how well it works, but the documentation seems to hint that it's doing similar types of optimizations. Basically, it takes the profile information for each arc in the control flow graph, and uses that to decide how to lay out the basic blocks when it generates the code.

In my limited experimentation, I didn't see much of a difference (too small to measure) using these tricks, so either my code wasn't helped by it (too small?), or GCC was just going through the motions. YMMV.

--Joe
--
Program Intellivision!

Re:Seems like very cool tech on IBM's OSS Code Morphing Code/or OSS vs. Transmeta · 2000-11-28 15:14 · Score: 2

I seem to recall someone (perhaps IBM's old DAISY website) mentioning that the name DAISY was picked as an obscure reference back to that. Goes back to the old HAL == IBM << 1 thing...

--Joe
--
Program Intellivision!

Re:Interesting spin-off's... on IBM's OSS Code Morphing Code/or OSS vs. Transmeta · 2000-11-28 14:52 · Score: 5

BTW, Transmeta has been working on their stuff since 1995, so the technology mentioned in the 1997 paper doesn't strictly predate it.

I read about Daisy a few years back when I was studying VLIW scheduling techniques and whatnot. The DAISY VLIW is quite different than most VLIWs around. Their instruction word is built upon the ability to execute large numbers of "branches" in parallel every cycle. (As best as I can tell, these "branches" are actually closer to being composite predication conditions in many cases, which is why I put "branches" in quotes.) Their experimental physical implementation could execute something like 8 branches every cycle. Downright weird.

A more traditional VLIW uses predication to convert short branches into a simple "if (cond)" prefix on individual instructions. (This technique is known as if conversion.) Also, traditional VLIW instruction words are flat -- all N instructions in a VLIW bundle execute together in parallel, with no tree structure implicit in the encoding.

All that aside, the DAISY scheduling techniques sound pretty similar to trace scheduling, which was used on the old Multiflow VLIW machines. The actual process of converting PowerPC instructions to individual DAISY operations is mostly search and replace, and preserving program order is a matter of constructing proper dependences between the instructions.

Feel free to ask me questions if you're curious about this kind of stuff. It's my day job.

--Joe
--
Program Intellivision!

Re:Yes it is the exact term you would use. on Mutant Tetrachromat Females Found · 2000-11-28 07:55 · Score: 1

The poster a couple levels up talked only of letting this mutation spread into the populace, which implies simple reproduction, not genetic engineering. Animal husbandry and genetic engineering are two different things. If I pick a mate because I like some aspect of that mate and subsequently have offspring, is that wrong? Does it matter whether that criterion is "explicitly vital to the existance of the species"?

I find the concept of genetically engineered children to be repugnant. The thread I was on wasn't talking about that though. I didn't miss your point, but it was covered in a different thread that I just chose not to participate in.

--Joe
--
Program Intellivision!

Re:Argh! Read article first, then comment! on Mutant Tetrachromat Females Found · 2000-11-28 06:53 · Score: 1

If the Males can't see the difference, though, how does this improve desireability (unless it becomes one of those unconsciously perceived bits...)?

Hey, maybe she's into other women, and they decide to have kids by visiting the local sperm bank. I guess that'd be a case of two tetrachromats hooking up and passing their unique genes on, eh? You never know in today's world.

(And if that is the case, I say more power to them! I'm all for a little diversity.)

--Joe
--
Program Intellivision!

Re:Yes it is the exact term you would use. on Mutant Tetrachromat Females Found · 2000-11-28 06:35 · Score: 1

Actually, depending on how the four colors are distributed, the male children may not be completely color blind, just differently sensitive to color. For instance, suppose the mother has "Green" and "Red1" on one X chromosome, and "Green" and "Red2" on the other. Suppose "Red1" and "Red2" are sufficiently far apart in their response curves that the mother is a tetrachromat. Any sons that she has will still be trichromats.

The reason the researchers looked specifically for mothers of colorblind men is that it narrows the search. It means that one X chromosome has both "Red1" and "Red2" on it (or perhaps "Green1" and "Green2"), rather than spread across the two X chromosomes. It makes the search easier, but I see no reason why a tetrachromat must have the two nearly identical colors on the same X chromosome.

With this in mind, I wonder if women just generally have better color perception than men, since they'll tend to actually have up to five chroma channels, since the reds and greens in both X chromosomes aren't going to be identical to each other. The differences might not be large enough to really affect their vision deeply, but it might add a subtle touch. Thoughts? Maybe that's why my fiancee and I always argue about what color something is...

--Joe
--
Program Intellivision!

Re:Just wondering on Mutant Tetrachromat Females Found · 2000-11-28 05:57 · Score: 1

As you point out, it does no good in the framebuffer.

Actually, if your video card does real-time compositing with an external video source, that alpha channel is pretty darn handy.

--Joe
--
Program Intellivision!

Re:hm..a kid learning website development on EFF Makes Call For DMCA Help · 2000-11-27 01:38 · Score: 1

No, but possibly their trademark.
--
Program Intellivision!

Re:some more licensing ideas on EFF Makes Call For DMCA Help · 2000-11-27 01:31 · Score: 1

That's ok, we can sue them for infringement, since we didn't give them a license to do so. ;-)

--Joe
--
Program Intellivision!

User: Mr+Z

Comments · 3,254