AMD Previews New Processor Extensions
An anonymous reader writes "It has been all over the news today: AMD announced the first of its Extensions for Software Parallelism, a series of x86 extensions to make parallel programming easier. The first are the so-called 'lightweight profiling extensions.' They would give software access to information about cache misses and retired instructions so data structures can be optimized for better performance. The specification is here (PDF). These extensions have a much wider applicability than just parallel programming — they could be used to accelerate Java, .Net, and dynamic optimizers." AMD gave no timeframe for when these proposed extensions would show up in silicon.
Looks like there isn't a whole lot there that you couldn't get using existing performance counters and a tool like oprofile....
-- Erich
Slashdot reader since 1997
and did away with the aging x86 instruction set and came up with something new.
Yeah, I know, Intel tried with Itanium.
These extensions could be useful, but speaking as someone from the target audience... I just don't care right now. No minor improvement (such as might be gained through these) is as important to me as seeing a viable alternative to Intel. Not because I'm an AMD fanboy, but because competition brings prices down and accelerates the release of faster chips. From what I hear now, we'll finally see Barcelona chips out on September 10th at -maybe- up to 2.3 GHz if you're one of the cherished few, but most retail ones will be 1.9 GHz. I haven't seen the (valid) numbers, so I can't say for sure, but I'm worried about how competitive this will be.
/Grumble
I realize that the software people and hardware people each have their own projects and largely independent schedules, and I figure this news might be timed to say, "Hey! Look at us! We're doing stuff!" It only serves to frustrate me that there still aren't any real numbers on Barcelona and that, on the whole, AMD seems to have dropped the ball.
There's very little difference between the instructions in the different modes. Most of the differences are in the memory management unit. Properly written 16-bit real mode code will still run in 16-bit protected mode. The only difference is how the segment portion of the pointer is interpreted.
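To make the segment point concrete, here is a rough sketch in Python (purely illustrative, not how any real MMU is written). The `descriptor_table` dictionary is a hypothetical stand-in for the GDT/LDT, and the limit check is simplified to the ordinary expand-up case.

```python
def real_mode_address(segment, offset):
    """Real mode: the segment value is shifted left 4 bits and added to the offset."""
    return (segment << 4) + offset


def protected_mode_address(selector, offset, descriptor_table):
    """Protected mode: the selector picks a descriptor that supplies base and limit.

    descriptor_table is a hypothetical dict {index: {"base": int, "limit": int}}
    standing in for the real descriptor tables.
    """
    index = selector >> 3  # bits 3..15 are the table index; bits 0..2 hold RPL and the table indicator
    descriptor = descriptor_table[index]
    if offset > descriptor["limit"]:
        # Real hardware would raise a protection fault here.
        raise ValueError("offset exceeds segment limit")
    return descriptor["base"] + offset


# The same 16-bit value in the segment register means two different things:
table = {2: {"base": 0x100000, "limit": 0xFFFF}}
print(hex(real_mode_address(0x0010, 0x0020)))              # segment used as an address
print(hex(protected_mode_address(0x0010, 0x0020, table)))  # segment used as a table selector
```

The offset arithmetic is identical in both functions; only the meaning of the segment value changes, which is why well-behaved 16-bit code can survive the switch.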
As for 16-bit vs. 32-bit modes, the instructions are mostly the same. A code segment is specified as being either 16 or 32 bit. That size is the default data size used by instructions within that segment. There is a "size override" prefix which, if found immediately before an instruction, tells the CPU that the following instruction should use the opposite of the default size.
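A minimal sketch of that rule, assuming the operand-size override is the 0x66 prefix byte and looking only at the first byte of the instruction (a real decoder scans every prefix byte, so treat this as a simplification). The function name is made up for illustration.

```python
OPERAND_SIZE_PREFIX = 0x66  # x86 operand-size override prefix


def effective_operand_size(segment_default_bits, instruction_bytes):
    """Return the operand size (16 or 32) an instruction would use.

    segment_default_bits: default size of the code segment, 16 or 32.
    instruction_bytes: raw instruction bytes; only the first byte is
    inspected here, which is a simplification of real prefix handling.
    """
    if segment_default_bits not in (16, 32):
        raise ValueError("this sketch only models 16- and 32-bit code segments")
    has_override = len(instruction_bytes) > 0 and instruction_bytes[0] == OPERAND_SIZE_PREFIX
    if has_override:
        # The prefix flips to the opposite of the segment default.
        return 32 if segment_default_bits == 16 else 16
    return segment_default_bits


# B8 is "mov eAX, imm"; with the prefix in a 32-bit segment it becomes "mov ax, imm16".
print(effective_operand_size(32, bytes([0xB8, 0x78, 0x56, 0x34, 0x12])))
print(effective_operand_size(32, bytes([0x66, 0xB8, 0x34, 0x12])))
```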
I don't remember the specifics, but 64 bit mode just continues along with the same ideas. There aren't many changes from 32 bit code to 64 bit.
It was at least 200 last time I read - and the source was an 80486 programming book. I think there's at least that many more in the different versions of SSE.
Has there in the past been an example of AMD adding new instructions and then Intel following along and adopting them? I know it works in the converse, but somehow I doubt Intel wants AMD taking the lead in extending its own ISA.
Part of the hardcore faithful who believed in Apple long before it was cool again to do so
I never quite understood why chip manufacturers added cores long after memory bandwidth had become a problem. Why not add specialized execution units and make the instruction set a bit fatter? It's not like arithmetic and logic operations are all that you can do with an int or a few ints. Same for floats (but even more operations).
It's a good start, but it isn't much. Parallel programming will still be a bitch.
I for one
think this
is good
news.
Please sign petition to restore sanity to our banking system!!!
http://financialpetition.org/
Probably enough to start dropping a few. The 16 bit instructions could be disposed of without anyone noticing for a start.
"Welcome to our world. We are the wasted youth. And we are the future too." Yes, I know these are stupid lyrics.
Also, I know from asm on SPARC that many op codes are really just variations of other ops (and/or pseudo ops). For instance, (I'm not sure of the x86 equivalent)
If I mod you up, it doesn't necessarily mean I agree with what you've said, sorry.
They can't get the chips to clock up nicely as a whole; an individual chip or a few dozen individuals can, but most of them are binning in the sub-2GHz category
Do you have a source for that, or is it just internet speculation?
they could be used to accelerate Java, .Net, and dynamic optimizers
So for CPU manufacturers, C++ is dead. Thanks for being so clear.
94.3% of all statistics are made up on the spot.
You can poo-poo Java all you want, but the reality is that it's made programming a lot easier for the "rest of us", especially in a world where cross platform compatibility is key.
If only it were so. Unfortunately, it's not. There's a distressing amount of 16-bit real-mode code being executed in between power-on and your OS kernel switching into 32 or 64 bit mode even on the most modern PC.
Performance counters could be used by JITs to generate more optimized code. I wonder which programming languages use JITs...
Yet another waste of silicon to 'accelerate' badly written software.
AND well-written software. What, you think you could write code that's just as fast without all the "hardware acceleration" being done for you, without using any instruction set extensions that have been added over the years? You are on crack.
Instead of devoting transistors to speed up the latest toy programming languages ('managed' code), why can't we just train programmers better?
And better profiling tools are contrary to this goal how exactly? And at what point do you tell your better-trained programmers that using those hardware acceleration features will make their code go faster?
Ahh..of course, because of java..don't bother learning HOW to optimize, let java do it FOR you...
Or let your C compiler do it for you. Whichever. There's a matter of degree, to be sure, but even still you're most likely wasting your time "optimizing" individual lines of C code since the compiler can probably do a better job and that's been the case for quite a while. The thing that will get you the most bang for buck is the same in C as it is in Java -- optimize your algorithms. Java can't do that for you, and neither can your C compiler.
The enemies of Democracy are
Java never made anything easier for anyone and you know it.
This is a joke. I am joking. Joke joke joke.
"They would give software access to information about cache misses..." Yeah that ought to help significantly with side-channel attacks against crypto software.
It isn't Intel's job to train programmers to do things right. That is the responsibility of the education system. Nothing stops the education system from still teaching proper programming and design skills.
this nation, under God, shall have a new birth of freedom. -- Lincoln, Gettysburg Address
Instead of devoting transistors to speed up the latest toy programming languages ('managed' code), why can't we just train programmers better?
Ahh..of course, because of java..don't bother learning HOW to optimize, let java do it FOR you...
I'm tempted to slam this as an uneducated rant, but since there's a little teeny kernel of truth in it, I'll let it slide.
The issue is not "badly written code". It's being able to run the same compiled code on a wide variety of hardware without recompiling it for every chip variant.
The huge drawback with all the RISC architectures (at least initially) was that each version of each chip had different numbers of functional units, different latencies for the functional units, different latencies to cache and memory, etc.
If you ever dealt with the MIPS or Sun compilers, they have a huge number of flags for hyper-optimizations on a variety of implementations of those architectures. The problem is that when you optimize it for one variant, it often makes it worse on other variants (because instructions that didn't collide in the instruction pipeline now do, as just one example..)
Now all of the modern architectures play the same games. Power/PowerPC, SPARC, Itanium, all of them. They all have multiple pipelines and execution units, massively parallel instruction issue, etc. Just like the X86.
And it's not because the programmers are idiots, but because that's the only way you could ever ship one binary that would run "optimally" on every implementation of that architecture.
PS. Java and C++ only make this worse because they are so dependent on such out-of-order massively-parallel execution (since they are so darn difficult to statically optimize).
The supreme irony of this is that for a while there, Java on X86 (Sun's implementation, no less!) ran rings around Java on SPARC (great strategy for pulling in customers for SPARC!). It's only with recent SPARC implementations (Niagara/Niagara 2), which play the same way as the X86s, that SPARC has finally caught up with and passed X86 again.
You can make up statistics to prove anything. 16% of all people know that.
Linux Zealots: Smarter than Mac Zealots, but still zealots.
The number depends on how you look at it. I made a table that lists every x86 instruction excluding prefixes a while ago and it came out to 57,839 instruction/parameter combinations. That doesn't factor in the specific values passed to the opcode, or in the registers, or the differences in behavior of the chip depending on mode, how memory protection is setup, out of order execution, or instruction prefixes.
The large number of combinations certainly makes validation a tremendous challenge.
There already are systems which do exactly that (optimise dynamically C programs), see http://arstechnica.com/reviews/1q00/dynamo/dynamo-1.html.
Of course HotSpot-like JIT'd languages are the "easiest" target and most likely give the biggest performance improvement. After all, HotSpot does partially (in SW) what the proposal does (in HW).
That must have been speculation or a SWAG from the poster to suggest it could be used to accelerate Java and/or .NET. There is nothing special about java or net that would allow this optimization.
Ok, sorry, wrong, and yes, wrong again...
The notes about .NET and JAVA come specifically from AMD themselves.
The reason it would benefit these environments is that they are processed on the fly, so the environment could make the 'adjustments' to the code at runtime instead of it being 'locked' as natively compiled code is.
This is level 101 understanding and logic here, not sure how you are missing this.
Only 3% of Slashdot users haven't heard that joke, and only 2% of those who have still think it's funny for the (on average) 36.4th time.
In a sense, the 16 bit instructions have been dropped, if only when running in 64-bit mode. Which is actually kind of annoying, because it means some of those old Windows 3 and DOS programs won't run without emulation.
There already are systems which do exactly that (optimise dynamically C programs)
I don't disagree with the notion that any natively compiled language could be scaled to take advantage of this. A good solution would be an OS-level scheduling mechanism for natively compiled applications that could make decisions based on the information the AMD instructions would be offering.
However, the reference you cite is more about basic instruction changing and not the dynamics of testing to see which threads are busy, which ones need more time, and where they can be shifted at runtime based on these needs. The Transmeta solution is a lot like a real-time, in-chip version of the old translation tricks that various products used at the software level, like the FX!32 used on the Alpha version of Windows NT.
You're missing the point here. They aren't talking about accelerating the frameworks (the JRE, the Common Language Runtime, any other program that was compiled to native code at or before install time), they are about accelerating the applications that are run using those frameworks. The reason is that both frameworks use JIT-compiled code (I believe old JVMs translated instructions individually, but these days I'm pretty sure the whole .class gets compiled to native code just before execution).
The advantage of these extensions is that they can make it possible to optimize the hell out of the JIT, producing code at run-time that takes full advantage of the available capabilities of the processor(s). Theoretically, this could make these JIT-compiled programs faster to run (though they will always incur a startup penalty) than non-JIT native code because the JIT compiler knows things about the environment in which it will be executing that would be specific to that machine at that time.
There's no place I could be, since I've found Serenity...
I see all fuss about programming. easy. don't what the is parallel It's I hereby propose that execution is in order for out of order speech.
tasks(723) drafts(105) languages(484) examples(29106)
Running code that is directed at one architecture or another was an issue for RISC. If you look at the x86 CISC machines, you'll have a lot less variance. When it comes to RISC vs. CISC, it's not so important to optimize for a specific architecture on CISC, simply because the CPU handles a lot of things that would otherwise fall to the programmer or the compiler. The variance between running a program on different CISC implementations is much smaller than it is for RISC architectures.
Yeah, but I couldn't find a way to get AMD to mail me a hard copy of their documentation (at least, not for free). If they do so, please correct me, as I haven't looked in quite a few months.
Profiling is useful for code produced by any language, and being able to profile without adding code (e.g., at the beginning of functions) means you get to see how the actual software runs, without doing things that affect caching etc. (for example, profiling code might push certain instructions onto a different cache line, skewing the results).
The revolution will not be televised... but it will have a page on Wikipedia
Funny. I've seen a $59 Brisbane core (1.9 out of the box) overclocked to 2.9 GHz with just air cooling, so I'm not sure why everyone insists AMD can't hit the 3GHz barrier, especially when AMD keeps displaying 3GHz Barcelonas.
There are three reasons to buy AMD right now.
1. Price, price and price. AMD knows Intel has the better fab, but AMD is selling super cheap. You can get a dual-core processor for half what Intel charges, and for the average user, it is more than enough. I'm running Oblivion at 30 FPS with a $59 processor, and I've barely overclocked it. The cheapest Intel dual-core proc was $120 when I bought my $59 proc. Most people have no idea that their proc these days often underclocks itself, and you rarely touch the full potential of your proc. Intel is faster, and no one doubts that today, but if you never see the speed benefit, why spend the extra dollars? On a performance per dollar basis, AMD wins hands down.
2. There is a mountain of evidence against Intel for anti-trust violations, and I try not to financially support evil. The EU is also coming down on Intel for anti-trust violations.
3. Even if the anti-trust suits both come through, AMD is near bankruptcy, and I prefer choice in the marketplace. I am terrified of the day when Intel has no competition pushing them and they can just sell what they want and whatever price they want.
http://blindscribblings.com - Tasty pop-culture in conceptual fashion.
"There is nothing special about java or net that would allow this optimization"
Sure there is. A profiler could quickly pick up on a function that's getting called many times from within a loop, and decide it could speed it up more by inlining it. Or, a bit of inline code that isn't being used often could be moved out of line, so the rest of the loop fits into a single cache line.
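Something like this toy sketch is what I have in mind for the first case. It assumes the profile arrives as a list of sampled (caller, callee) pairs; that format, the function name, and the threshold are all made up for illustration and have nothing to do with the actual record layout in AMD's spec.

```python
from collections import Counter


def pick_inline_candidates(call_samples, hot_threshold):
    """Return the call sites seen at least hot_threshold times, sorted.

    call_samples: iterable of (caller, callee) tuples, one per profile sample
    (a hypothetical format chosen for this sketch).
    """
    counts = Counter(call_samples)
    return sorted(site for site, seen in counts.items() if seen >= hot_threshold)


samples = [("render_loop", "dot_product")] * 5 + [("render_loop", "log_error")]
print(pick_inline_candidates(samples, hot_threshold=3))
```

A JIT can act on that list immediately, while a statically compiled binary would need a separate profile-and-rebuild cycle to get the same benefit.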
Java is a great concept with piss-poor execution.
Oddly enough, the same code can often be compiled cross-architecture and cross-platform quite easily with GCC, which produces a nice, fast executable native to each platform and architecture that uses a fraction of the start-up time and resources of Java.
I'm a crappy programmer, and even that is transparent to me.
Java and .Net are JIT compiled. C++ is a normal compiled language. I assume the extensions are helpful to JIT compilers because they would allow the compilers to recompile the code with different optimizations based on the data they get.
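As a rough illustration of that assumption, here is a toy "tiered" function in Python. A plain software call counter stands in for whatever data the hardware would report, and `TieredFunction`, `sum_loop`, and `sum_closed_form` are names invented for this sketch; real JITs are far more involved.

```python
class TieredFunction:
    """Run a baseline version until it looks hot, then switch to an optimized one."""

    def __init__(self, baseline, optimized, threshold):
        self.baseline = baseline
        self.optimized = optimized
        self.threshold = threshold
        self.calls = 0          # software counter standing in for profile data
        self.active = baseline  # version currently being executed

    def __call__(self, *args):
        self.calls += 1
        if self.calls == self.threshold:
            # "Recompile": from now on dispatch to the optimized version.
            self.active = self.optimized
        return self.active(*args)


def sum_loop(n):
    """Baseline: add up 0..n-1 one element at a time."""
    total = 0
    for i in range(n):
        total += i
    return total


def sum_closed_form(n):
    """Optimized: same result computed directly."""
    return n * (n - 1) // 2


hot_sum = TieredFunction(sum_loop, sum_closed_form, threshold=3)
for _ in range(4):
    hot_sum(10)
print(hot_sum.active.__name__)
```

A normally compiled C++ binary has no equivalent of that swap step, which is presumably why the announcement singles out the JIT-based runtimes.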
Centralization breaks the internet.
Entertainingly, PPC code is much larger than AMD64 code, prefix bytes or no.
A deep unwavering belief is a sure sign you're missing something...
It looks like I have a fan.
good times. I guess I'll have to start wearing pants now though.
I was reading the Great Microprocessors list and it says AMD already did that back in the K5 days. It had a mode where it can natively execute the RISC-like instructions. Nobody used it, so I don't know whether current gen AMD chips support it.
-- "This world is a comedy to those who think, a tragedy to those who feel."
Sony had a $10k PS2 called the PA that recorded exactly what happened to every cycle on the cpu, gpu etc. without changing the way the game ran. It was the most incredible thing, like you had been sitting in the dark for years and then suddenly someone turned on the lights.
Is it cache misses, dma contention, background threads, branch stalls or actual work? Optimizing on the PC just feels like groping around in the dark again.
--
thegirlorthecar.com - a dating game for guys
-- http://thegirlorthecar.com funny dating game for guys
You are right, the reference I gave is old.
I was not trying to disagree completely with the GP, as JIT'd languages are clearly the biggest winners. I just noted that "even" C can be improved dynamically, without compiler help. I would not call it a scheduling mechanism, rather a code morphing mechanism.
Besides, the proposed extensions (the PDF) do not so much help with thread scheduling as with cache coherency and finding hot spots. Garbage collection could, if it used the cache information, improve performance, perhaps a lot, just by moving the objects around. The HotSpot VM could use the, eh, hot spot information without counting in SW.
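A toy model of the "moving the objects around" idea, assuming 64-byte cache lines, fixed-size objects, and per-object access counts that a profiler has somehow produced. All names and numbers here are invented for illustration; a real collector would be constrained by object sizes, pinning, and much else.

```python
LINE_SIZE = 64  # assumed cache line size in bytes


def lines_touched(addresses):
    """Count the distinct cache lines covered by a set of object start addresses."""
    return len({address // LINE_SIZE for address in addresses})


def compact_hot_objects(objects, access_counts, object_size):
    """Return a new layout {name: address} with the most-accessed objects packed first.

    objects: names in their current memory order.
    access_counts: {name: count} from a (hypothetical) profile; missing names count as 0.
    """
    order = sorted(objects, key=lambda name: -access_counts.get(name, 0))
    return {name: position * object_size for position, name in enumerate(order)}


objects = ["a", "b", "c", "d", "e", "f", "g", "h"]
counts = {"a": 10, "c": 9, "e": 8, "g": 7}  # every other object is hot
hot = list(counts)

before = {name: position * 32 for position, name in enumerate(objects)}
after = compact_hot_objects(objects, counts, object_size=32)

print(lines_touched(before[name] for name in hot))  # hot objects spread out
print(lines_touched(after[name] for name in hot))   # hot objects packed together
```

With 32-byte objects, the four hot objects start out on four separate lines and end up sharing two, which is the kind of win the cache-miss data would let a collector find.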
Whole families have one or two computers but every member has their own phone. ARM has triumphed numerically. It doesn't try to compete with x86 but a future could exist in which many people have an extremely powerful ARM-based phone and rely on the internet a lot instead of having a PC.
This is all just my personal opinion.
There's a matter of degree, to be sure, but even still you're most likely wasting your time "optimizing" individual lines of C code since the compiler can probably do a better job and that's been the case for quite a while.
Terrible. If people give up on optimizing code (and on understanding why it works), the net result will always be a noticeable decrease in programming quality (a very common situation).
I know that you are aiming at premature optimization, and you are right on that one, but the notion that 'optimize code' == 'wasting time' is a perfect excuse not to learn how and why things work.
You have to know how to optimize code in order to decide when an optimization is detrimental, and once you know how to do it, you realize it's a matter of degree, as you rightly said.
What's in a sig?
releases ANOTHER newer faster processor two weeks later ... effectively kicking AMD in the groin AGAIN.
You're crazy if you think the education system teaches programmers how to write good code. They can't even teach math and English well. Good programmers are mentored by other programmers.
Yes, but then the 8051 probably outnumbers the x86 and the ARM. MIPS, ARM, Power, and even the 68k still exist in the embedded market. For example, Power is in all three of the new game consoles. ARMs are in a lot of the WAPs. I keep wondering if we will see a CPU the size of the latest AMD but containing 16 or more ARM cores. Sort of a T1 competitor.
See my blog http://ilovecookes.blogspot.com/ for light hearted technical information.
I think I'm missing some op-codes.
"Glory is fleeting, but obscurity is forever." - Napoleon Bonaparte
i don't mean that you'd be an idiot for being a mac person, but that x86 cpu particulars would slip your mind. :D
Please stop stalking me, bro.
Isn't this just exposing/documenting the CPU's internal debug features so that developers can use them?
If you look at the die shots of recent CPUs, you will see a big chunk of transistors marked DEBUG.
Like most /.ers, you have these weird categories of evil and non-evil companies.
ALL large companies are the same: the more successful, the more evil.
Why is this so?
While everyone professes to like the free market, businessmen hate the free market and love monopoly. In a free market you have to work harder for less; who in their right mind would actually like that?
So, the first thing a company does when it becomes big and successful is to use its power to dampen market forces in any way it can.
Now sometimes, when a company is really, really rich and successful, like Google or the old AT&T, they are so successful that they can hide their evilness behind total monopoly power. But as soon as their market position slips, they become evil.
Mark my words, you heard it here first: as soon as Google's profit starts to fall, and it is no longer a Wall Street darling, they will be right in there with MS and GM and whoever.
There is a multi-core ARM CPU under development. The idea is that multiple cores are the best way to keep increasing performance without increasing power consumption.
I don't think that it's anything astoundingly interesting by desktop standards but it will allow embedded devices to keep advancing. As usual, before your phone can handle it properly, there is probably going to be some software that needs a redesign if it's going to show a speed improvement.
If you ever dealt with the MIPS or Sun compilers, they have a huge number of flags for hyper-optimizations on a variety of implementations of those architectures
Sun's compiler has a huge number of flags for hyper-optimizations on a variety of implementations of X86 too. Near as I can tell, though, their impact on the vast majority of code is minimal. AMD and Intel can throw in all the new instructions they want, but they won't be meaningful for years -- if ever -- because code has to run on existing processors that don't implement those instructions.
Do you mind if I call Microsoft into that comitee? They are the ones holding x86 alive.
Rethinking email
The ARM core isn't slow by any stretch. I would bet that a good dual- or quad-core ARM would run all the software the average desktop needs. It would probably work just fine for most business systems. Since the ARM core is so small compared to, say, a Core 2 Duo or Athlon X2, I would bet that you could put 16 or more on a single die and then use HyperTransport for memory I/O. You would need to add something like SSE and maybe an FPU, but the end result could be very interesting for servers.
You must be crazy if that's what you got out of my message. I didn't say the education system currently teaches programmers how to write good code. I said nothing stops them from doing so, whether they know how to is a different issue.
I know Python, C, C++, VBScript and JavaScript. I did try to learn Java once; that didn't last long though. Relax man, it's a joke.
This is a joke. I am joking. Joke joke joke.
I guess the education system failed me on reading comprehension....LOL