akuma(x86) · Slashdot Mirror

Re:Tom's Hardware reviewed a similar Xeon... on Intel Demos New P4 'Extreme Edition' · 2003-09-17 16:52 · Score: 1

I was trying to show a worst case analysis to show that even cache flushes don't matter as much as he/she would claim.

Re:Tom's Hardware reviewed a similar Xeon... on Intel Demos New P4 'Extreme Edition' · 2003-09-17 12:43 · Score: 2, Informative

Let's look at some numbers...

What is the standard time slice quantum for Windows and Linux typically? That is to say, what is the typical rate of context switches? If I recall correctly, it's on the order of 100 per second.

That's 1 context switch every 0.01 seconds. Let's suppose now that I have a typical P4 system with 6.4GB/sec of memory bandwidth. I can fill the entire 2MB cache (0.002GB) in roughly

0.002/6.4 seconds = 0.0003125 seconds

That's only 3% of the entire time slice quantum! That's assuming the thread will want all of the cache (which is unlikely for many apps).

So, yes you may be running 25 programs, but they only switch between each other 100 times a second. The cache re-fill only takes 3% of a minimum quantum, leaving 97% of the time left to read the filled data out of your caches (which is highly likely due to the principle of locality and LRU replacement policies).
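The arithmetic above can be checked with a short sketch. The 2MB cache, 6.4GB/sec bandwidth and 100 switches per second come from the comment; the function name and the decimal units (1GB = 1000MB) are assumptions made for illustration.

```python
def refill_fraction(cache_bytes, bandwidth_bytes_per_s, switches_per_s):
    """Return (refill_seconds, fraction_of_quantum) for refilling the whole cache."""
    refill_seconds = cache_bytes / bandwidth_bytes_per_s
    quantum_seconds = 1.0 / switches_per_s
    return refill_seconds, refill_seconds / quantum_seconds

# 2 MB cache, 6.4 GB/s memory bandwidth, 100 context switches per second
refill, fraction = refill_fraction(2e6, 6.4e9, 100)
print(f"refill time: {refill * 1e6:.1f} us, {fraction:.1%} of the quantum")
```

Raising `switches_per_s` shows how quickly the refill cost grows for the server-like case with many more context switches.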

Now, if you're running something server-like such as TPC-C with a bazillion threads, then you'll probably context switch a lot more, but the P4 isn't designed for transactional servers. This is why server systems have larger caches, multiple processors, and multiple SMT threads within each processor -- among many other reasons :)

Putting the DRAM on die is a nice idea, but it's not cost effective at this time. You only have a finite die size. The cost of the die increases exponentially with area, so there's a hard cap on how big you can make the chip. The area that you would have spent on the DRAM could be used more profitably on other chip features. Perhaps in the future when we get many more transistors, it may be feasible.

Re:Tom's Hardware reviewed a similar Xeon... on Intel Demos New P4 'Extreme Edition' · 2003-09-17 11:11 · Score: 5, Informative

Computer architecture 101.

Average memory access latency per memory access =
(L1_hit_rate * L1_hit_cycle_time) +
(L1_miss_L2_hit_rate * L2_hit_cycle_time) +
(L2_miss_L3_hit_rate * L3_hit_cycle_time) +
(L3_miss_rate * DRAM_latency)

80-95% of your accesses will hit the 8k L1 in typical applications. This is the vast majority of the accesses. The latency of this cache is TINY on a P4. Do the math for a 3.2GHz 3 cycle cache.

Given a curve of cache-size vs. latency and hit rates for all the cache sizes, the optimal hierarchy is a simple optimization problem. I can assure you that this equation has been solved and the optimal hierarchy has been chosen (given the other obvious constraints of die size and power).

Quadrupling the L1 will double the latency and kill your average access time, making your chip almost certainly slower.
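As a rough sketch of the formula above: the hit rates, the L2 latency and the DRAM latency below are made-up numbers for illustration, not measurements of a real P4. Only the 3-cycle L1 and the 3.2GHz clock come from the comment, and the "bigger L1" case simply assumes a slightly better hit rate at double the latency.

```python
def average_access_cycles(levels, dram_cycles):
    """levels: list of (fraction_of_all_accesses_served_here, hit_latency_cycles).
    Whatever fraction is not served by a cache level goes to DRAM."""
    served = sum(rate for rate, _ in levels)
    dram_rate = 1.0 - served
    return sum(rate * cycles for rate, cycles in levels) + dram_rate * dram_cycles

# Hypothetical numbers, for illustration only.
baseline = average_access_cycles([(0.90, 3), (0.08, 20)], dram_cycles=300)
# Assume a 4x larger L1 lifts its hit rate to 93% but doubles its latency to 6 cycles.
bigger_l1 = average_access_cycles([(0.93, 6), (0.05, 20)], dram_cycles=300)

clock_ghz = 3.2
print(f"baseline:  {baseline:.2f} cycles = {baseline / clock_ghz:.2f} ns")
print(f"bigger L1: {bigger_l1:.2f} cycles = {bigger_l1 / clock_ghz:.2f} ns")
```

With these assumed inputs the larger L1 loses, because the extra latency is paid on the vast majority of accesses while the hit-rate gain only helps a few percent of them.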

Bigger caches mean longer latencies. It's limited by the basic laws of physics. There's only so much distance you can traverse in a certain amount of time and larger caches have longer distances (meaning higher RC delays).

The reason we want larger outer level caches is because the DRAM_latency is enormous and has an impact on average access time. Hardware prefetching can also help to alleviate this problem. This solution is available on both Athlon and P4 chips and will only get better in the future because it is absolutely critical to hide this DRAM latency.

Ok, now to address the notion that more registers will improve performance...
You won't get as much performance out of more registers as you might think. First of all, when the compiler runs out of registers it spills the excess to the stack -- pushing it out with a store (spill) and reading it back in with a load (fill).

In modern processors (just about every chip out on the market), there is the concept of store buffers. Each store writes its data to a store buffer. Subsequent loads that require data from stores get their data by forwarding out of the store buffer. So -- the spilled store writes the buffer and the fill load reads the buffer -- all of this happening much faster than a memory access because it's just reading out a local on-chip buffer, so the load looks more like a fast register read. This architectural trick emulates the effect of having more registers, subject to the size of your store buffer. There are even more advanced architectural tricks you can play to completely eliminate the spill-fill pair from the critical path (look up memory-renaming in the literature).
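A toy model of the spill/fill path described above, purely as an illustration: the class name, the capacity and the drain-oldest policy are assumptions, and real store buffers are far more involved (ordering, partial overlaps, speculation).

```python
from collections import OrderedDict

class StoreBuffer:
    """Toy model: recent stores are held here and forwarded to later loads."""

    def __init__(self, capacity, memory):
        self.capacity = capacity
        self.memory = memory          # dict standing in for the cache/DRAM
        self.pending = OrderedDict()  # address -> value, oldest first

    def store(self, addr, value):
        self.pending.pop(addr, None)           # newest store to an address wins
        self.pending[addr] = value
        if len(self.pending) > self.capacity:  # drain the oldest store to memory
            old_addr, old_value = self.pending.popitem(last=False)
            self.memory[old_addr] = old_value

    def load(self, addr):
        if addr in self.pending:
            return self.pending[addr], "forwarded"
        return self.memory.get(addr, 0), "memory"

memory = {}
buf = StoreBuffer(capacity=2, memory=memory)
buf.store(0x1000, 42)      # spill a register to the stack
print(buf.load(0x1000))    # fill: served straight from the buffer
buf.store(0x1008, 7)
buf.store(0x1010, 9)       # buffer is full, so the 0x1000 store drains to memory
print(buf.load(0x1000))    # the same fill now has to go to memory
```

The second load shows the "subject to the size of your store buffer" caveat: once the spill has drained, the fill no longer looks like a register read.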

If you're worried about chip real estate, you should be very concerned that a 64-bit application's pointers will take up twice as much space, effectively making your caches and memory bandwidth appear smaller.

Re:64bit performance gains... on AMD64 Preview · 2003-09-06 19:52 · Score: 1

I'm not saying that the 64 bitness of the CPU is making it faster, I was pointing out the benefits of switching from 32 bit mode on AMD64 to 64 bit mode on AMD64. In this comparison, the 64 bit mode will be faster most of the time.

If you are assuming an IA-32 compiled binary, then running in 64 bit mode gets you no advantage since you don't use the 64 bitness with an IA-32 binary.

Ok, so you are assuming that the app that you will run has been compiled for both IA-32 and x86-64 and you are making the comparison of 2 different binaries run in 2 different modes on the same Athlon64...

If this is the case then it is not completely true that performance will always increase due to the pointer doubling issue. The caches will look smaller and the demand on memory bandwidth will be greater.
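To put a rough number on the pointer doubling issue, here is a sketch with a hypothetical tree node (two child pointers plus a 32-bit key) and an assumed 64-byte cache line. It ignores alignment padding, which would usually make the 64-bit node even larger.

```python
import struct

CACHE_LINE_BYTES = 64  # assumed line size

# Hypothetical node layout: left pointer, right pointer, 32-bit key.
node_32 = struct.calcsize("<IIi")  # 4-byte pointers
node_64 = struct.calcsize("<QQi")  # 8-byte pointers, no padding counted

def nodes_that_fit(capacity_bytes, node_bytes):
    """How many whole nodes fit in a cache (or cache line) of the given size."""
    return capacity_bytes // node_bytes

for label, size in (("32-bit", node_32), ("64-bit", node_64)):
    print(label, size, "bytes per node,",
          nodes_that_fit(CACHE_LINE_BYTES, size), "nodes per line,",
          nodes_that_fit(512 * 1024, size), "nodes in a 512K cache")
```

The same ratio applies to memory bandwidth: every line fetched from DRAM carries fewer useful nodes in 64-bit mode for pointer-heavy data.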

Now to the issue of increasing the number of architectural registers...

The advantage of more registers is significantly diminished in the presence of store-to-load forwarding that occurs in the processor hardware (in virtually all state of the art processors).

If you run out of architectural registers, the compiler will spill the excess to the stack. All of the registers that need to spill to memory are spilled to the stack via stores. In modern microprocessors, these stores write to a store forwarding buffer. Subsequent loads that need to read the stack will simply get their data from the forwarding buffer and not the memory system. These loads execute much faster because the data is in a local forwarding buffer and not a larger/slower cache, so they look more like register reads.

This is just one example of the many tricks that we computer architects play to get around ISA inconveniences.

There are even more elaborate tricks to get around this problem. You can completely eliminate both the spill-store and fill-load out of the critical path. If you are interested, do some research on "memory-renaming".

Re:64bit performance gains... on AMD64 Preview · 2003-09-06 11:10 · Score: 1

Those are good points. Let me now enumerate the disadvantages of 64 bits over 32 bits.

1) Your data caches will appear smaller (relative to 32 bit code) because your pointers are now twice as big in order to address that nice 64 bit space.

2) Since your ALUs need to be twice as wide, they can't go as fast. The ALU path in a microprocessor is very speed critical and is often what dictates the maximum frequency you can run at.

3) Since your datapath is twice as large, you will inevitably burn more power.

4) The way x86-64 is defined, it complicates the x86 decoder, thus reducing its speed/power-efficiency.

It is not clear that 64 bits by itself will universally boost performance. As with most engineering problems, there are advantages and disadvantages and all factors must be considered.

The Athlon64 will be a good, high performance product, but don't credit the performance to the 64-bitness of it. It will be good because of its microarchitecture (as opposed to its ISA).

ISAs are way way way overrated. They provide at best a 2nd order effect on performance.

Re:Great.... on Four Core Processor to Bring Tera Ops · 2003-08-29 06:21 · Score: 1

That's what prefetchers are for. If the memory pattern is predictable (many are), then you don't have to wait, the hardware will prefetch the data for you and have it ready to go when the core needs it.
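A minimal sketch of the idea, assuming the simplest possible predictor: if two consecutive accesses are the same distance apart, guess that the next one will be too. The class is hypothetical and is not how any particular chip implements its prefetcher.

```python
class StridePrefetcher:
    """Toy stride detector: once two consecutive accesses show the same
    non-zero stride, predict the next address."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = None

    def access(self, addr):
        """Record an access; return the address to prefetch, or None."""
        prediction = None
        if self.last_addr is not None:
            stride = addr - self.last_addr
            if stride != 0 and stride == self.last_stride:
                prediction = addr + stride
            self.last_stride = stride
        self.last_addr = addr
        return prediction

pf = StridePrefetcher()
for addr in [0, 64, 128, 192, 1000]:
    print(addr, "->", pf.access(addr))
```

The predictable part of the stream gets a prefetch issued ahead of the core; the jump to an unrelated address does not, which is the case where you still wait on DRAM.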

Re:Cool solution, but fixed the wrong problem on Silent Pump for Water-Cooled PCs · 2003-08-25 18:54 · Score: 1

You do realize that you're getting performance with that power as well, don't you? Performance isn't free. It costs energy. Sure, you could run a 486 at 500mW, but would you want to? It would stutter on a task as simple as mp3 playback.

Now, it may be argued that today's processors have more performance than you need. But that's not what the market is telling us. The market still pays a premium for higher performance parts (otherwise, why would Intel/AMD bother to make them?)

Another argument you could make is that today's CPUs aren't very power efficient (power/performance ratio is too high). If you compare 2 competitors like the top end AMD vs. the top end Intel CPU, you'll find that they dissipate about the same amount of power, so it could be that the both of them have work to do in improving efficiency.

But, if you want to do multiple HDTV MPEG-4 encode streams in real-time, you're going to have to spend the watts.

Believe me, if the market says that it wants lower power at the expense of performance, you bet your ass Intel and AMD will deliver it. There's far too much money at stake in the CPU business for something like this to go unnoticed.

Re:Country -vs- country rankings? on Top University Rankings for 2004 Released · 2003-08-23 08:20 · Score: 1

I am a Canadian. I graduated in 1996 from Waterloo, which is arguably one of the top 5 engineering schools in North America.

Several of my classmates and I had no problems getting jobs in the US.

Re:Perhaps it's time for more innovation? on Linux will have 20% desktop market share by 2008? · 2003-08-17 10:52 · Score: 1

The command line apps can be built for any OS. I use cygwin extensively on Windows.

So, what else differentiates Linux? Sure the kernel's memory management and process control may be superior in certain scenarios to Windows, but desktop users are not likely to care.

I think it's obvious to everybody that people buy computers for the apps. Linux needs more professional developers to create apps for it.

Professional developers won't create apps for an OS with such a small market.

But in order to grow the market, Linux needs the developers...

It's the classic catch-22. Linux needs enough 'critical-mass' applications to start the conversion. It's nowhere near there yet.

Re:whats the big deal on New Transmeta Chip: "Efficeon" · 2003-08-12 10:13 · Score: 1

That's the theory anyways. If dynamic compilation is so great, why aren't there any decent dynamic recompilers? Hint: try doing precise exceptions. What about self-modifying code (oh yes, it's out there - esp. in Java and other managed run time apps).

Also, why doesn't Transmeta get performance that is comparable to a similarly powered out of order machine like Banias? Why does IA64 suck? Couldn't a dynamic recompiler help it out?

There's too much overhead with a software based dynamic recompiler. The latency of getting the dynamic information and then making a change is way too long compared to hardware methods.

Things like instruction scheduling change far too quickly for dynamic software methods to be effective.

Using a trace cache, the hardware can "dynamically recompile" code for you at much reduced latency. Also, the hardware is privy to much more dynamic information, such as the state of branch predictors and other internal, non software visible structures. All of this can be done quickly, and at a granularity much finer than a software method.

Re:whats the big deal on New Transmeta Chip: "Efficeon" · 2003-08-12 09:57 · Score: 1

I have a minor nitpick :)

IBM did Daisy, not Intel.

Statically scheduled VLIW will almost never outperform a dynamically scheduled out-of-order machine. But, you can save tons of power :)

Re:I live in a SOCIALIST country... on Part Two: Technical Self-Employment For All · 2003-08-05 20:17 · Score: 1

You live in a social democratic country. There's a big difference :)

I went the other way...I moved from Canada to the US.

Re:Surely? on AMD, Transmeta Edge Up In Market Share · 2003-08-04 06:29 · Score: 1

Nowhere in the article do they mention if the market share numbers are for "unit volume" or "revenue share". There's a huge difference.

You could have 50% market share on a unit volume basis, but if all you're selling are money losing Durons, then that really doesn't help you much.
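A quick illustration of how far apart the two measures can be. The vendors, unit counts and prices below are invented for the example and are not real sales data.

```python
def shares(vendors):
    """vendors: name -> (units_sold, average_price).
    Returns name -> (unit_share, revenue_share)."""
    total_units = sum(units for units, _ in vendors.values())
    total_revenue = sum(units * price for units, price in vendors.values())
    return {
        name: (units / total_units, units * price / total_revenue)
        for name, (units, price) in vendors.items()
    }

# Hypothetical: both vendors ship the same number of chips, one at a third of the price.
result = shares({"A": (50, 60.0), "B": (50, 180.0)})
for name, (unit_share, revenue_share) in result.items():
    print(f"{name}: {unit_share:.0%} of units, {revenue_share:.0%} of revenue")
```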

AMD sells their product for substantially less than Intel. You heard right, they are hurting these days. They're bleeding money and have no product that is currently able to command the higher prices of a Pentium4. They've pretty much bet the company on Athlon64. If they can execute and deliver that product, then they can survive.

Innovation is key on The IT Market: Cyclical Downturn or New World Order? · 2003-07-15 18:50 · Score: 1

Once upon a time when computers were not so cheap and plentiful, programmers got paid more because their skills were in demand and the supply of programmers was scarce. Computers were expensive. Not everyone had them. You probably had to go to school to learn how to use one properly because documentation and programming tools were primitive by today's standards. Businesses could cut costs by having their information processed quickly and efficiently with computers - thus they created a large demand for these skilled workers.

What has happened with the proliferation of cheap and plentiful computers is that the supply of programmers has vastly increased - keeping up with demand and exceeding it in these recessionary times. The demand remains the same or greater - businesses still need to manage information with computers. However, it has become much easier for a person to gain access to a computer and learn how to program it. You can go to any bookstore in the country and pick out a few books and be well on your way to learning something useful. The computer is just a tool for getting something done.

Thus, the price one is willing to pay for the skill of operating this tool has dropped. Let me ask you honestly...do you really need to go to college for 4 years or more to get a CS degree to become a good programmer? High school kids with enough aptitude can do it.

In 1993, programmers were scarce and demand was high. In 2003, programming is an easily found skill.

So what is a geek to do? Well...do what we've always done. INNOVATE! Innovation is the primary driver of economic growth. This is the only true way to create wealth. We need to get back to the innovation that made silicon valley great. If you do something that is valuable that nobody else has done before, then you can be sure you won't be outsourced to India.

Re:Exactly on SGI Releases New Workstations · 2003-07-15 18:06 · Score: 1

I think your analogy is flawed.

Different bands have differentiating features that matter. Microprocessors are quickly becoming a homogeneous resource (through virtualization technology it doesn't matter what the underlying ISA is). They are a commodity. Easily replaced. Akin to DRAM. Computer vendors don't design custom DRAM anymore. I say this as someone who gets paid to design microprocessors.

Bands are not commodities. You cannot substitute one band for another easily.

In 5-10 years, it will become too expensive to support a microprocessor design division. Only Intel or some other large semiconductor player will be able to sustain this kind of investment. IBM certainly could do it. Sun has no chance.

Re:Exactly on SGI Releases New Workstations · 2003-07-14 18:50 · Score: 1

Then why do Sun and IBM continue to make microprocessors? Shouldn't they outsource to Intel and AMD? It would save them a ton of dollars in R&D. The top performing processors are already Intel or AMD silicon.

Re:G5 is really a full-blown workstation on NASA Benchmarks the New G5 Powermac · 2003-07-04 13:38 · Score: 1

The FP performance of just about any CPU crushes Sun. Sun CPUs suck.

As for the POWER4 performance, yeah you can get that kind of performance with 1.5MB of on-chip cache and 32MB of off-chip cache, which is what this particular POWER4 system has. The PPC970 only has 512K of on-chip cache and 0K of off-chip cache.

Also, performance does not scale 1:1 with frequency, especially for SpecFP - because as you scale your CPU core speed, the speed of memory remains constant and becomes a greater proportion of the run-time. So it is very unlikely that a 2GHz system would get a score of 1500.
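The scaling argument can be sketched with a two-term model: core work shrinks as frequency goes up, memory stall time does not. The workload numbers are hypothetical and say nothing about the actual SPECfp mix on these machines.

```python
def run_time_ns(core_cycles, freq_ghz, memory_ns):
    """Run time in ns: core work scales with frequency, memory stalls stay constant."""
    return core_cycles / freq_ghz + memory_ns

def speedup(core_cycles, memory_ns, old_ghz, new_ghz):
    """Ratio of old run time to new run time when only the core clock changes."""
    return (run_time_ns(core_cycles, old_ghz, memory_ns)
            / run_time_ns(core_cycles, new_ghz, memory_ns))

# Hypothetical workload: at 1.0 GHz, half the run time is core work, half memory stalls.
print(speedup(core_cycles=1000, memory_ns=1000, old_ghz=1.0, new_ghz=2.0))
# Same workload with no memory stalls at all, for comparison.
print(speedup(core_cycles=1000, memory_ns=0, old_ghz=1.0, new_ghz=2.0))
```

Under this model, doubling the clock only doubles performance when memory time is zero; the more bandwidth-bound the benchmark, the further the result falls short of 1:1.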

Re:Space should be left to corperations on Leave Outer Space to the Millionaires · 2003-07-02 13:51 · Score: 1

In many cases, the free market does the optimal thing for humanity. It allocates resources to their most productive ends. It delivers what people want. If the people don't want it, they won't pay for it (the item or service will not be profitable to produce). This is a "vote"-by-dollars system.

The point I was trying to make is that there are 2 parties in a market. The buyer and the seller, and if the both of them trade with each other to produce an undesirable result (pollution for example), then it's both of their faults.

In some cases, the market does not do the right thing for humanity. This is where we need governments to step in and intervene. This is a good thing. I would recommend reading "Everything for Sale: The Virtues and Limits of Markets" by Robert Kuttner for more examples of markets not working.

I agree that in some cases the market does not work, but in many cases it does.

I don't consider the space program to be one of those cases where the government needs to intervene. Yes, there were innovations and science that came out of the space program, but this was not the most efficient way to get those benefits. The problem with government programs is that they are inherently inefficient because they have very few of the corrective measures that you would get in a free market. Clearly, the space program did nothing for the Soviets.

Re:Space should be left to corperations on Leave Outer Space to the Millionaires · 2003-07-02 08:46 · Score: 1

So the cost to the consumer of checking on the producer of the goods and the increased cost incurred by the producer following environmentally responsible practices will not be tolerated by the consumer. If the consumer really were concerned about environmental safety as such, then it would be worth both his or her time and money to pay more and do these checks.

As it turns out, this increased time and money is not worth it to the consumer.

I'm not saying that we should move towards this total free market. I even suggested in my comment that the government steps in to enact laws and regulations.

Perhaps I also should have mentioned that this is a good thing.

Re:Space should be left to corperations on Leave Outer Space to the Millionaires · 2003-07-02 08:40 · Score: 1

I'm not suggesting that we move towards a totally free market. My comment even suggested that the government had to step in to intervene with laws and regulations.

My point is that the market (producer AND consumer) does not value environmental safety. If it did, we would have no need for the current government regulations.

Re:Space should be left to corperations on Leave Outer Space to the Millionaires · 2003-07-01 21:29 · Score: 1

If industrial waste really were a concern of the consumer, then the consumer would willingly pay more for a product that was environmentally safer.

As it turns out, the free market just doesn't value things like environmental safety, so the government intervenes with laws and regulations.

It's just as much the consumer's fault as it is the corporation's.

Re:How I'd improve bookmarks on Netscape Founder Says Web Browsing Innovation Dead · 2003-07-01 18:34 · Score: 1

Google will be the future innovator of the web browsing experience.

I think Google is in a unique position to take advantage of all of their usage data and their pool of badass PhDs to come up with some innovative content summarization and organization algorithms.

I predict that newer versions of the Google toolbar will bring some seriously good innovations to the web browser. Google can take over the world by releasing their own browser. I'll be the first in line to buy Google stock when it goes public. Nobody can copy them because nobody has created the kind of database that they have.

Re:Some features I would like to see on Netscape Founder Says Web Browsing Innovation Dead · 2003-07-01 18:20 · Score: 1

This is a great idea and I don't know why someone has not yet implemented it. It seems easy enough.

I too would like to see a page "prefetch" algorithm. It takes my human eyes a long time to read the web page compared to the download speed of the computer. Let it speculatively fetch pages ahead into a cache while I'm reading. If I demand a new page by clicking a link, immediately stop all prefetches and service the demand (First look in the cache to see if it's already there).
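Something along these lines, as a single-threaded sketch. `fetch` is a hypothetical callable (url -> page text) supplied by the caller, the class and method names are made up, and a real browser would do the prefetching in the background with proper cancellation.

```python
class PrefetchingBrowser:
    def __init__(self, fetch):
        self.fetch = fetch   # hypothetical callable: url -> page text
        self.cache = {}      # speculatively fetched pages
        self.queue = []      # links waiting to be prefetched

    def idle_tick(self):
        """Called while the user is reading: prefetch one queued link."""
        if self.queue:
            url = self.queue.pop(0)
            if url not in self.cache:
                self.cache[url] = self.fetch(url)

    def open(self, url, links_on_page):
        """Demand request: stop all prefetches, then look in the cache first."""
        self.queue.clear()
        page = self.cache.pop(url, None)
        if page is None:
            page = self.fetch(url)
        self.queue.extend(links_on_page)  # candidates for the next idle period
        return page

fetched = []
def fake_fetch(url):
    fetched.append(url)
    return "page:" + url

browser = PrefetchingBrowser(fake_fetch)
browser.open("a", ["b", "c"])
browser.idle_tick()            # prefetches "b" while the user reads "a"
print(browser.open("b", []))   # served from the cache, no new fetch
print(fetched)
```

The interesting design question is which links go into the queue and in what order; this sketch just takes them in page order.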

We already have a 'user-controlled' prefetch with tabbed browsing. While I read something in Phoenix, if I see a link I'd like to visit at a later point, I'll launch a new tab so I can read it later.

I'm sure there are lots of innovative auto-prefetch algorithms out there.

Re:Reasonable claims - IBM's Power4 vs Intel on Apple Hardware VP Defends Benchmarks · 2003-06-24 19:12 · Score: 1

All 128 MB are enabled for the POWER4 spec score. It's not so much that the latency is lower (it is), but it's that the bandwidth to this cache is tremendous. SPECfp is a bandwidth hog.

It would be interesting to benchmark the POWER4 with the L3 cache disabled. I think you'd find that the PPC970 would perform equal to or better at the same frequency.

Re:Reasonable claims - IBM's Power4 vs Intel on Apple Hardware VP Defends Benchmarks · 2003-06-24 19:09 · Score: 1

The G5/970 should do similarly or better than the Power4+ (since the G5/970 is running at 2.0GHz vs Power4+ 1.7GHz). One caveat is that the G5/970 has a smaller on-chip second-level cache (512kB vs 1.5MB), which will hurt its performance on some codes.

Don't forget about the Power4's 128MB!!! L3 cache. This is the only reason it performs better. Take out the cache, and you'll get something like the PPC970 performance.

Comments · 407