Christopher+Thomas · Slashdot Mirror

Re:Registers and scheduling. on P4 - The Art Of Compromise · 2000-12-15 10:48 · Score: 2

However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.

[Emphasis added.]

There are more things to fetch, decode, schedule, execute and retire. What's good about that?

As clearly stated in my original post - nothing at all. However, I question how much of an _impact_ the bad side effects have in practice. It's non-negligible, but that leaves a lot of territory open.

It hurts. A lot. Check out some of our papers on the subject, especially this one, which contains many references to other work.

Done. I compliment you on your fascinating approaches to register use optimization. However, most of your works focus on how the program use and physical performance of a register file of a given size may be improved. The dependence of performance on register file size is only studied in one document ("The Need for Large Register Files in Integer Codes"), and the advantage of a relatively large register file (at least for the 64-vs-32 case) is found to be relatively modest (5%-20%).

A factor of two speed difference makes a processor unmarketable. A 20% speed difference doesn't (witness the holy war still going on between Intel and AMD proponents).

The effect of a small register file is undoubtedly more severe as size decreases, but I have yet to see evidence of truly earth-shattering performance impacts. Circumstantial evidence suggests that the effect is not earth-shattering (SPECmarks for high-end workstation chips fail to thoroughly trounce SPECmarks for x86 chips for comparable configurations, and the PowerPC architecture fails to blow x86 out of the water).

Most certainly, a larger register file is nice, and causes a speed improvement - but the effect of a small register file does not seem to be as devastating in practice as you appear to be suggesting above.

Neutron activation 101. on Chernobyl (Finally) Shuts Down · 2000-12-15 08:21 · Score: 2

Fusion produces helium. That's it. Helium. The same helium that they put in helium balloons.

*sigh*.

Once again:

Both fission *AND* fusion produce very intense neutron radiation. This transmutes atoms in anything nearby - i.e. the reactor itself - into radioactive isotopes.

So, your fusion reactor produces helium - which is well and good - but your thousand-tonne reactor vessel is low-grade radioactive waste by the time it wears out. This is a *BIG PROBLEM* - a *bigger* problem than primary waste would be for either type of reactor.

Yeah, the radiation in the core itself is hazardous to life (it is in fission plants, too) but you surround the core with water, or concrete or something and the radiation doesn't escape.

This same "neutron activation" of materials makes it extremely difficult to ensure that reactors - fission or fusion - are safe. Ever wonder *why* a fission reactor has three levels of coolant loops, instead of just running coolant directly through the core? It's because neutron radiation from the reactor breeds tritium and a few nice, unstable isotopes of oxygen and fluorine in the water. You don't want this leaking, and if it does leak between stages, you don't want the *next* stage to leak.

You _can_ build (reasonably) safe reactors, but you are vastly underestimating the difficulty, and completely overlooking a *BIG PROBLEM* with fusion. There's plenty of good research material on the web and in your local university's book store if you're interested in learning more about the subject.

Clearing up several misconceptions. on Chernobyl (Finally) Shuts Down · 2000-12-15 03:09 · Score: 2

True, fission power is certainly being phased out in favor of cleaner types of power production. You're telling me that the radioactive waste of a fission reactor is safe. It is far from safe, it's highly toxic. That's the main reason why fission is being phased out, and it's going to be decades before anything happens. Wind power is never going to cut it, nor will solar power. The way of the future is of course fission power. Once we have stable fission reactors we won't need anything else, and the by product, helium.

Your sources seem to have missed a few important pieces of information, here:

Nuclear waste is nasty, but there's only a *tiny* amount of it.

There's a nuclear plant sitting next to my home city that supplies all of the city's power and exports power to the US. It's been operating for many years. Its waste fits in a swimming pool inside the plant (water makes a nifty radiation shield).

By comparison, a coal plant with comparable capacity would have dumped several million tonnes of CO2 and SO2 into the atmosphere, with attendant acid rain and other problems (not to mention the environmental and health hazards from mining the coal in the first place and transporting it to the plant).

Fusion isn't much cleaner than fission.

You don't produce primary waste with fusion, but you do produce one hell of a lot of radiation. This makes your entire reactor vessel radioactive. For both fission and fusion reactors, this is a greater _volume_ of waste than that produced directly from spent fuel rods, and it has to be swapped out and maintained every decade or two.

If fission reactors are properly maintained, they're a wonderful power source. The problem is that a) the public is afraid of the word "fission", and b) nuclear plant maintenance tends to be under-funded.

Registers and scheduling. on P4 - The Art Of Compromise · 2000-12-15 00:41 · Score: 2

You raise several good points; however, it turns out that there are a few mitigating factors.

First, the lack of registers in the x86 architecture. Having a fast cache is great, but it's not as fast as a register, and it takes extra instructions to load and store

This is true, and greatly hampers things like loop unrolling on the x86. However, it turns out that register renaming prevents a lot of the stalling that you'd expect with so few registers (write-after-write hazards vanish). Thus, while the small number of registers does degrade performance, it doesn't degrade it catastrophically.

Second is the relatively finer granularity of the instructions available on a RISC architecture. Although there is some merit to making decisions based on information only available at runtime, that isn't a big factor with today's technology. What a modern x86 looks like is a microcode architecture with somewhat intelligent scheduling of the instructions. In most cases a compiler could do a better job.

Actually, since the Pentium Pro, x86 processors have been fundamentally RISC-ian. x86 instructions are decoded into "micro-ops" (Intel's term), which are essentially RISC instructions. These can be scheduled by the processor as effectively as RISC instructions.

The decoding adds latency, but that's what the P4's "trace cache" is for. Arguably, a compiler with access to the underlying RISC instruction set could do better scheduling, but in practice the gain is marginal (especially since most people don't seem to use really-good compilers). I also have a sneaking suspicion that basic blocks in most code are small enough to fit inside the processor's scheduler window, which means that the compiler probably _wouldn't_ do a better job in most cases than the hardware scheduler. Higher-level transformations like loop unrolling have benefit even if done at a CISC level.

In summary, I'm not sure there's a very big performance hit from the instruction granularity (just a silicon hit).

I am impressed with your knowledge of the subject, though.

The real limits, IMO. on P4 - The Art Of Compromise · 2000-12-14 11:54 · Score: 5

The real reason for the chip's inherent "performance losses" is the running-string that's slowly being pulled to its breaking point -- that is, the x86 architecture.

Actually, while inconvenient, the x86 architecture isn't as horrible a limit to performance as a lot of people seem to be assuming. The main problem is the extra latency in the decode stage, which lengthens the pipeline somewhat, but the P4's trace cache takes care of that.

The real problem with the P4 is that it has very weird optimization requirements (the whole "bundle" thing) and so needs a very smart compiler if code is to run quickly. Generally, even if compilers like this exist, they aren't used (remember the original Pentium?).

The other problem with the P4 is the long pipeline, which exacerbates stall problems.

As for architecture in general, heat issues are what's limiting clock speeds (for x86 and non-x86 processors alike). However, the main limit people are noticing is the limit to the number of instructions you can run in parallel. As long as you're executing only one thread, you're not going to be able to squeeze more operations per clock beyond a certain point. The "performance problem" isn't with clock speed - it's with people expecting new chips to do more, clock for clock, than old chips while running serial programs. This parallelism problem affects all chips - x86 and non-x86.

This is why the major manufacturers are starting to look at SMT chips (Simultaneous Multi-Threading) seriously. Running multiple threads in parallel on one chip doesn't take much extra hardware, and makes it *much* easier to schedule concurrent instructions and to keep on running when one instruction stalls (your "Instruction-Level Parallelism" goes up in proportion to the number of threads).
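To make the "ILP goes up in proportion to the number of threads" claim concrete, here is a minimal Python sketch. It assumes an idealized machine (unlimited issue width, every instruction takes one cycle) and measures ILP as instruction count divided by the longest dependency chain. The function names and the toy instruction streams are made up for illustration; real schedulers are limited by issue width, window size, and latencies.

```python
def critical_path(deps):
    """Length of the longest dependency chain.

    deps[i] lists the indices of earlier instructions that instruction i needs.
    """
    depth = []
    for needed in deps:
        depth.append(1 + max((depth[j] for j in needed), default=0))
    return max(depth, default=0)


def ilp(deps):
    """Average instructions per cycle on the idealized machine described above."""
    return len(deps) / critical_path(deps)


def merge_threads(*threads):
    """Combine independent threads into one dependency list, renumbering indices."""
    merged = []
    for thread in threads:
        base = len(merged)
        merged.extend([[j + base for j in needed] for needed in thread])
    return merged


# A purely serial chain: each instruction depends on the previous one.
chain = [[], [0], [1], [2]]

print(ilp(chain))                        # one thread
print(ilp(merge_threads(chain, chain)))  # two independent threads
```

In this toy model a fully serial thread has an ILP of 1, and two such threads scheduled together have an ILP of 2, since the threads share no dependencies.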

My more accurate article, posted last week. on A Well-Chilled 750GHz Feasible Within 5 Years · 2000-12-14 05:25 · Score: 2

I submitted a more accurate review that, among other things, didn't confuse data rate and clock frequency, to Slashdot and Kuro5hin last week. Kuro5hin accepted it. You can read it and reader responses at:

http://www.kuro5hin.org/?op=displaystory&sid=2000/12/10/0925/1544

Re:What makes a processor virtualization-friendly? on Ask Kevin Lawton About Plex86 · 2000-12-12 02:38 · Score: 2

Moderate the original posting down! This is a huge FAQ at the Plex86 website.

Not that I can find. One sentence in the "information" page mentions that it's hard, but gives no further details. If there's a giant auxiliary FAQ anywhere, it's well-hidden.

What makes a processor virtualization-friendly? on Ask Kevin Lawton About Plex86 · 2000-12-12 00:36 · Score: 4

What characteristics make a processor difficult to virtualize? What characteristics make it easy? I have more than a passing interest in this, as I'm currently a graduate student studying IC design.

The more detailed the answer, the better.

Re:Missing the Important Bits (Again) on Ogg Vorbis Update: Thomson Trouble · 2000-12-11 12:29 · Score: 2

(And please, let's not delude ourselves that the mythical Open Source Community will magically step in and finish the project: enthusiasm and spirit are no replacement for in-depth knowledge of signal processing.)

Actually, some of us do have in-depth knowledge of signal processing. I've been playing with my own CODEC projects in my (near-mythical) free time.

Most monitors are far too bright. on Coping With Computer Related Eye Strain? · 2000-12-10 00:58 · Score: 3

I find that most people (including myself) tend to set their monitors to be far too bright. As a general rule of thumb, try to keep the monitor's intensity no greater than the intensity of the frame surrounding the monitor. Too many people seem to like staring into a monitor that's as bright as a fluorescent light bank.

What's actually happening here. on The Reactionless Space Drive? · 2000-12-07 01:38 · Score: 2

What's actually happening here is just Lenz's Law - the conducting object is repelled by a changing magnetic field, and vice versa.

Nothing magical, and this _does_ require reaction mass - the conducting object.

Among other things, this is how coilguns work (not railguns; different animal).

Pipeline flush question. on Intel's Itanium Processor Explained · 2000-12-04 01:56 · Score: 2

This means no more pipeline flushes for missed branch prediction. None.

Ok, perhaps I need to re-read my textbooks, but I seem to have missed the part about branch mispredictions requiring a full pipeline flush. As far as I can tell, all that would actually happen is the speculated instructions being invalidated in-flight, with other instructions proceeding as normal. You still get a delay - it's the equivalent of a stall of as many cycles as it took to figure out which way the branch really went - but certainly not a full flush.

Is there some mechanism that I don't know about at work here, or have Sharky et al. just turned "stall" into "flush" because of miscommunication?
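For a feel of what a stall of that kind costs, here is a back-of-the-envelope Python sketch. It assumes each mispredicted branch costs a fixed number of cycles equal to the time needed to resolve the branch, and nothing else. The function name and the example numbers are illustrative assumptions, not measurements of any real chip.

```python
def effective_cpi(base_cpi, branch_fraction, mispredict_rate, resolve_cycles):
    """Cycles per instruction once misprediction stalls are added.

    base_cpi:        CPI with perfect branch prediction
    branch_fraction: fraction of instructions that are branches
    mispredict_rate: fraction of branches that are mispredicted
    resolve_cycles:  stall per misprediction (cycles to resolve the branch)
    """
    return base_cpi + branch_fraction * mispredict_rate * resolve_cycles


# Illustrative numbers only: 20% branches, 10% of them mispredicted.
print(effective_cpi(1.0, 0.2, 0.1, 10))  # shorter pipeline
print(effective_cpi(1.0, 0.2, 0.1, 20))  # longer pipeline, same predictor
```

The penalty scales with the number of cycles to resolve the branch, which is why a longer pipeline hurts more for the same predictor accuracy.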

Re:Critique of the Itanium. on Intel's Itanium Processor Explained · 2000-12-04 00:33 · Score: 2

No, the scheduler needs to stay in because it is not possible to do all things in software. Things like register renaming and micro-op scheduling. The instruction set doesn't support it so you do it in hardware.

Actually, register renaming doesn't require out of order execution. All you're doing is renaming the second register in a write-after-write or write-after-read situation to a different internal register name.
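A minimal Python sketch of that renaming step, under simplifying assumptions: an unlimited pool of internal registers, instructions written as (destination, sources) tuples, and a purely name-based pairwise hazard check. The register names, the `rename` and `hazards` helpers, and the four-instruction program are all hypothetical; real hardware has a finite physical register file and recycles entries at retirement.

```python
from itertools import count


def rename(instrs):
    """Give every write a fresh internal register; reads use the latest mapping."""
    fresh = count()
    mapping = {}
    renamed = []
    for dest, srcs in instrs:
        phys_srcs = []
        for s in srcs:
            # A source never written so far is a live-in value; give it a name too.
            if s not in mapping:
                mapping[s] = f"p{next(fresh)}"
            phys_srcs.append(mapping[s])
        # Sources are looked up before the destination is remapped.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], tuple(phys_srcs)))
    return renamed


def hazards(instrs):
    """Name-based (earlier, later) index pairs for RAW, WAR, and WAW conflicts."""
    raw, war, waw = set(), set(), set()
    for j, (dest_j, srcs_j) in enumerate(instrs):
        for i in range(j):
            dest_i, srcs_i = instrs[i]
            if dest_i in srcs_j:
                raw.add((i, j))
            if dest_j in srcs_i:
                war.add((i, j))
            if dest_j == dest_i:
                waw.add((i, j))
    return raw, war, waw


prog = [
    ("r1", ("r2", "r3")),  # 0: r1 = r2 op r3
    ("r4", ("r1",)),       # 1: r4 = f(r1)
    ("r1", ("r5", "r6")),  # 2: r1 = r5 op r6  (reuses the name r1)
    ("r7", ("r1",)),       # 3: r7 = f(r1)
]

print(hazards(prog))
print(hazards(rename(prog)))
```

After renaming, the write-after-write and write-after-read conflicts on r1 disappear, and only the true read-after-write dependences (0 to 1, and 2 to 3) remain. Nothing in the renaming step itself needs the instructions to execute out of order.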

You're right about micro-op scheduling, though. I was thinking about RISC processors, which already have more or less atomic instructions.

Fusion does produce radioactive waste. on Ozone Hole Will Heal, Say British Scientists · 2000-12-04 00:13 · Score: 2

The solution, as far as I'm concerned, is in nuclear fusion. It's the only power source which has little to no environmental impact

Actually, fusion does produce radioactive waste. The fusion reactor vessel itself becomes dangerously radioactive due to activation by neutrons produced in fusion. It doesn't produce any _primary_ waste from its fuel, but you still have a few thousand tonnes of reactor to swap out every couple of decades.

There have been fuel combinations proposed that aren't supposed to produce neutron radiation, but these are much more difficult to ignite and produce much less energy. Thus, I suspect it'll be good old D-T or (when practical) D-D in any fusion plants that are actually built.

You would also have a heat pollution problem from any power plant that produces more energy than the Earth receives from the sun. There are ways of piping this heat back out of the ecosystem, but it's picky and costly enough that we probably won't bother until the earth starts warming again. Right now things like CO2-induced global warming mask the effect (and we haven't industrialized the planet yet).

Critique of the Itanium. on Intel's Itanium Processor Explained · 2000-12-03 13:51 · Score: 2

A few aspects of this design strike me as either shady or overly optimistic:

It assumes a very good compiler, tuned specifically to its architecture.

Part of the reason that schedulers on modern chips are so complex is that good compilers are rare. If the compiler produced optimally-ordered code, you could dispense with out-of-order execution and save a huge amount of silicon and effort. In practice, however, this kind of code is rare, so the scheduler stays in.

Remember the P4 vs. Athlon saga; it turned out that _both_ chips were running far below optimum performance due to sub-optimal compilation. Even without SSE2 enabled, Intel's compiler was able to produce a very large increase. Intel has a history of writing very good compilers, so it's possible that they'll be able to handle optimization for the Itanium consistently, but the vast majority of software developers don't shell out for the Intel compiler. Thus, most Itanium code will be sub-optimal.

Limits to the amount of parallelism present.

This is the big problem with building really wide superscalar processors - it gets exponentially harder to extract parallelizable instructions from the serial program stream. The predication system helps Intel a lot here - by allowing them to pretend that they've predicted branches with certainty, thus optimizing them out and producing longer basic blocks - but it won't be magical. Beyond a certain point, which we're already starting to reach, it just stops being practical to try to issue more instructions in parallel from one instruction stream.

The caveat here is loops that repeat for a large number of iterations, known beforehand, without data dependencies between iterations. You can unroll these into reams of parallelizable instructions, and a large register set makes it much easier to do so. However, this turns out also to reach diminishing returns fairly quickly (play with -funroll-loops and -funroll-all-loops on a few test programs to see what I mean). Your processor bottlenecks on the (large) part of the program that isn't in an easily-scheduled tight loop.

128 integer and 128 fp registers.

Boy, will this increase context switch overhead. Part of the attraction to register renaming and a smaller visible register set is that you get much of the benefit of a larger register set without the context switching cost. Now, this can be taken too far (cf. x86), but I suspect that 256 registers will be enough to substantially influence performance if you're doing something that involves switching a lot to perform relatively short tasks (like many kernel service calls, many driver calls, interrupts to transfer blocks of network data, etc.).

In short, I think that this processor tries to be too clever for its own good; my prediction is that it will burn lots of power executing both sides of branches, and run at far below peak issue rate due to poor compilers used by most of industry and the limited ILP that exists in the programs being run.

That having been said, there are a few things about this architecture that I _do_ like. Predication is one; speculating both sides of a branch requires a lot more silicon, but allows certain optimizations that just wouldn't be possible by any other means. The large visible register file is also nice for loop unrolling and software pipelining compiler optimizations, though it does cause overhead on systems with a lot of context switching.

My money's still on SMT processors (simultaneous multithreading; one core and one scheduler executing many instruction streams (threads or processes), which gives you more ILP for free, as well as free interleaving when needed to mask latencies).

Re:Molecular weight of water. on NASA Has Found Evidence Of Oceans On Mars · 2000-12-03 07:45 · Score: 2

Are you sure mars had large amounts of nitrogen to begin with?

Certain? No. However, there is very strong empirical evidence - both Earth and Jupiter did. Both still have plenty. Fractioning effects would cause distribution to vary with distance from the sun, but they wouldn't just leave a gaping hole in Mars' orbit.

Another possibility is that it's all bound up in ammonium salts or bound up in nitrate rocks via mechanisms like the one you mentioned for water. I don't *remember* hearing about vast amounts of nitrates on Mars, but I'm not an expert on Martian geology, either.

Re:Molecular weight of water. on NASA Has Found Evidence Of Oceans On Mars · 2000-12-03 02:22 · Score: 2

Who is talking about molecular oxygen or carbon dioxide boiling off? I mentioned hydrogen nuclei (protons) blowing off with the solar wind.

You also mentioned that molecular water was too heavy to boil off readily, which turns out not to be the case. This is what I was responding to (by noting that nitrogen in some form _did_ boil off, indicating that molecular water could too).

Molecular weight of water. on NASA Has Found Evidence Of Oceans On Mars · 2000-12-03 01:00 · Score: 2

The water was probably split into loose hydrogen atoms (or protons) and oxygen by solar radiation in the upper atmosphere. The protons drifted off with the solar wind and the oxygen bound to metals in the planet's crust. Water vapour is very heavy, and probably wouldn't escape so easily.

Actually, water vapour is much lighter than molecular oxygen, molecular nitrogen, or carbon dioxide. The nitrogen, at least, wouldn't have bound that readily to metals, and so would have had to boil off. The lightest simple nitrogen compound is ammonia, which has about the same molecular weight as water; if ammonia could boil off, then it's likely that water vapour could too, if I understand correctly.
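The comparison is easy to check with a few lines of Python. This is just arithmetic on standard atomic masses (rounded to three decimals); the dictionary and function names are made up for the example.

```python
# Standard atomic masses in g/mol, rounded.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}

# Element counts for each molecule mentioned above.
MOLECULES = {
    "NH3": {"N": 1, "H": 3},
    "H2O": {"H": 2, "O": 1},
    "N2": {"N": 2},
    "O2": {"O": 2},
    "CO2": {"C": 1, "O": 2},
}


def molar_mass(composition):
    """Sum of atomic masses weighted by how many of each atom the molecule has."""
    return sum(ATOMIC_MASS[element] * n for element, n in composition.items())


for name, composition in sorted(MOLECULES.items(), key=lambda kv: molar_mass(kv[1])):
    print(f"{name:4s} {molar_mass(composition):7.3f} g/mol")
```

Ammonia comes out at roughly 17 g/mol and water at roughly 18 g/mol, against about 28 for molecular nitrogen, 32 for molecular oxygen, and 44 for carbon dioxide.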

Not that I'm disagreeing with your mechanism; I'm just pointing out that direct escape probably happened too. Your mechanism nicely explains why Mars doesn't have an atmosphere rich in hydrogen compounds (water, methane, ammonia).

Possible idea - diary hardcopies. on Linus Torvalds Announces Autobiography · 2000-11-28 00:06 · Score: 2

Here's an interesting thought - publish hardcopy editions of the web diaries of various notables.

While an autobiography might look a bit shaky, something like, oh, The Compiled Diary of Alan Cox might be taken more seriously by the geek crowd.

OTOH, Linus's autobiography will probably sell like hotcakes to the business crowd that's just heard about this "Linux" thing.

The benefits of interleaving. on From Rambus to DDR:Memory Explained · 2000-11-27 23:58 · Score: 2

I'm wondering if we could improve bandwidth and latency by going back to banked memory, perhaps interleaved.

You would definitely be able to increase bandwidth, as long as you kept signal paths clean (difficult, but not impossible). Latency gets iffy. In practice, the main benefit would be in SMP systems or SMT systems (simultaneous multithreading; two or more threads running concurrently on a chip, while sharing some or all of the pipeline hardware).

The reason is that interleaved memory lets you have several outstanding cache row loads in progress from different parts of the memory system. A cache miss will still most likely stall the thread that missed, but as long as other threads or other processors are accessing memory, processing can continue. A non-banked/interleaved system would have to wait for the cache row to be transferred before servicing other requests.
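A toy Python model of the effect, under loudly simplified assumptions: low-order interleaving on cache-line boundaries, one request issued per cycle in order, and each bank tied up for a fixed number of cycles per line. The 64-byte line, the 10-cycle bank busy time, and the function names are illustrative choices, not figures for any real memory system.

```python
def bank_of(addr, n_banks, line_bytes=64):
    """Low-order interleaving: consecutive cache lines map to consecutive banks."""
    return (addr // line_bytes) % n_banks


def finish_time(addrs, n_banks, busy_cycles=10, line_bytes=64):
    """Cycle at which the last line transfer completes.

    Requests are issued one per cycle in order; a request waits if its bank
    is still busy with an earlier transfer.
    """
    free_at = [0] * n_banks
    done = 0
    for cycle, addr in enumerate(addrs):
        bank = bank_of(addr, n_banks, line_bytes)
        start = max(cycle, free_at[bank])
        free_at[bank] = start + busy_cycles
        done = max(done, free_at[bank])
    return done


# Four line fills to consecutive lines, e.g. misses from different threads.
requests = [0, 64, 128, 192]

print(finish_time(requests, n_banks=1))  # single bank: transfers serialize
print(finish_time(requests, n_banks=4))  # four banks: transfers overlap
```

With one bank the four transfers run back to back; with four banks they overlap almost completely. The latency of any single miss is unchanged, which matches the point that the gain is mainly in bandwidth and in keeping other threads or processors fed.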

Porting nVidia's driver. on Nvidia's NV20 · 2000-11-26 23:24 · Score: 2

Other benefits come for other OSes (NetBSD, FreeBSD, etc.) for which nVidia will never write drivers.

Actually, I gather from other posts in this thread that the abstraction layer between the driver core and the OS's driver interface is open (or at least published). This should make porting fairly straightforward, even with most of the driver being a black box.

There's also the option of wrapping Linux drivers in their entirety to run under *BSD, though I don't know if *BSD's Linux support has been extended *that* far.

You'd need *full* card specs to fix a driver. on Nvidia's NV20 · 2000-11-26 08:49 · Score: 2

Stick with the more open 3dfx, or Matrox. With them, if it crashes, you can track the bug down and fix it!

You'd have a lot of trouble doing that, unless it was a silly problem like a memory leak (admittedly worth fixing).

I've worked for a couple of years with a well-known software company that does third-party driver development (well-known cards, well-known platforms). Debugging a driver even *with* the standard reference texts for the card is a royal pain. Doing it blind - say, for hardware bugs or restrictions that aren't documented - is so much trouble it's not funny. This eats a vast amount of time even for us. Trying to debug a driver while having to guess at restrictions/errata in a register spec without support documentation - or worse, having to reverse-engineer the spec from code - would be at best a vast undertaking and at worst impractical.

It can be done, but not nearly as easily as you seem to think, by several orders of magnitude.

You only need to recompile a couple of things. on Pentium 4 Re-evaluated, Again (Again) · 2000-11-25 12:59 · Score: 2

Like you can recompile all your windows applications, not.

You don't need to.

For 90% of users out there who need the processing power at all, the only thing that matters is the graphics driver, because it's games that are sucking up the CPU time. Graphics driver upgrades are released fairly regularly.

For the rest, it's MPEG CODECs. I'm sure if your favourite CODEC's site posted an update that ran 50% faster, you'd download it; thus, I don't think upgrading it will be a problem.

The (relative) handful of people doing heavy-duty image processing or rendering will likewise be upgrading to the next version of their software package at some point, which will contain SSE2 code.

The OS itself doesn't need a recompile. Neither do your office applications. Where is the vast pile of software that needs to be recompiled?

Particle beam? on New 8-Node PPC Cluster From Terra Soft · 2000-11-24 12:09 · Score: 1

Cool, a multiprocessor on wheels. Now all it needs is a few servo motors, a grasping arm, and a video camera.

Oh, and a high-intensity particle beam. And *missiles*!

Use an industrial CO2 laser instead of the particle beam. You can buy them off the shelf, and they don't have the atmospheric dispersion problem :).

Power Dissipation 101 on Tom's Hardware Retracts P4 Endorsement · 2000-11-24 01:43 · Score: 2

I don't think you understand how chips achieve low power. It has very little to do with the linewidth, and everything to do with power management.

Um, I've spent the last 5 years learning how to build chips. While power management is important, total area - absolute area, not number of transistors - is also directly tied to power dissipation.

The power dissipated is simply the clock speed times the square of the core voltage times the capacitance that is charged or discharged per clock.

If you optimize your chip so that only areas that are being used are clocked, you save power. This is what you were referring to.

If you lower your core voltage, you save power.

If you reduce the total area of the chip - by applying a linewidth shrink, for instance - you reduce the total capacitance (by a factor of 2, usually), and save power.

Thus, a linewidth shrink would most certainly allow an Athlon to run faster for the same power dissipation.
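The relation above is simple enough to play with in Python. This is a sketch of the formula exactly as stated (clock speed times voltage squared times switched capacitance); the specific clock, voltage, and capacitance values are invented for illustration and are not figures for an Athlon or a P4.

```python
def dynamic_power(clock_hz, core_volts, switched_cap_farads):
    """Power in watts: clock speed x core voltage squared x capacitance switched per clock."""
    return clock_hz * core_volts ** 2 * switched_cap_farads


def max_clock_for_power(power_watts, core_volts, switched_cap_farads):
    """Highest clock speed that stays within a given power budget."""
    return power_watts / (core_volts ** 2 * switched_cap_farads)


# Illustrative baseline: 1 GHz, 1.75 V, 20 nF switched per clock.
baseline = dynamic_power(1e9, 1.75, 20e-9)
print(baseline)

# Lower core voltage, everything else equal.
print(dynamic_power(1e9, 1.5, 20e-9))

# Halve the switched capacitance (the shrink case) and ask how fast
# the chip can be clocked inside the original power budget.
print(max_clock_for_power(baseline, 1.75, 10e-9))
```

Because power is linear in both clock speed and capacitance, halving the capacitance leaves room to double the clock at the same dissipation, while a voltage reduction pays off quadratically.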

Also, saying that "the P4 should dissipate more power because it's bigger" isn't strictly true - all of this applies to the size of the CORE, not the size of the CACHE. The cache can be optimized to dissipate pretty much the same amount of power no matter what its size, as in any given clock, you're only accessing one or two rows of it. It's the core that's changing state all the time.

The relative sizes of the Athlon _core_ vs. the P4 _core_ are what would be important for your argument.

User: Christopher+Thomas

Comments · 2,147