64bit also vastly speeds up long and double math. It doesn't really apply to a browser, but if you were using 64bit integers to store currency amounts, you'd notice a huge speedup. Adding/subtracting from longs is one thing that SSE probably won't help. ;)
No speedup at all, for these reasons:
1) With 64-bit two's complement integer registers you can indeed speed up your 64-bit integer code, because you operate on 64-bit integers directly instead of chaining 32-bit results as on a 32-bit CPU. However, you're missing the point that most heavy computing, such as RSA big numbers, DES, AES, Blowfish, etc., doesn't use general purpose registers but vector SIMD opcodes (e.g. SSE*), already available in 32-bit mode (with 8 registers instead of 16, yes), which are faster than 64-bit integer operations.
2) Floating point ("double math") remains almost the same, but with 8 additional SSE registers.
3) Regarding "adding/subtracting from longs": in 32-bit mode, an SSE3 (or later) functional unit can execute *four* 32-bit operations per clock (fetching 128 bits of data at once), while the core can already execute 2 to 4 integer plus load/store instructions per clock (e.g. Core2Duo or K8), so it would still be faster even while chaining 32-bit results.
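For reference, "chaining 32-bit results" can be sketched in C as below. This is only an illustration, not a benchmark: the type `u64_pair` and the function names are made up for the example, and a real compiler targeting 32-bit x86 would normally emit an ADD/ADC pair for a plain 64-bit add on its own.

```c
#include <assert.h>
#include <stdint.h>

/* A 64-bit value as a 32-bit CPU sees it: two separate 32-bit halves.
 * (Hypothetical type, introduced only for this sketch.) */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* Helpers to move between the native and the split representation. */
static u64_pair split64(uint64_t v)
{
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

static uint64_t join64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}

/* 64-bit add built from 32-bit adds: low halves first, then the high
 * halves plus the carry (what ADD followed by ADC does on 32-bit x86). */
static u64_pair add64_chained(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;              /* wraps modulo 2^32 */
    uint32_t carry = (r.lo < a.lo);  /* 1 when the low add wrapped */
    r.hi = a.hi + b.hi + carry;
    return r;
}

/* The same add when a native 64-bit register is available. */
static uint64_t add64_native(uint64_t a, uint64_t b)
{
    return a + b;
}

/* Small self-check: both paths must agree, including across the carry. */
static void add64_examples(void)
{
    uint64_t a = 0x00000001FFFFFFFFULL;  /* low half is all ones */
    uint64_t b = 1;
    assert(join64(add64_chained(split64(a), split64(b))) == add64_native(a, b));
}
```

Calling `add64_examples()` from a `main` is enough to exercise it. The chained version costs two dependent adds instead of one, which is the whole extent of the 32-bit penalty for this operation.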
It is not just a hardware revision; it also implies cuts in software. Remember that Sony has removed the possibility of running Linux on the new PS3 "Slim" model by disabling the "Other OS" boot option, because of the cost of programming new drivers to virtualize the new I/O devices through the hypervisor.
Unofficial reply from Sarah Ewen, a Sony employee:
BY: sarahe
DATE: 2009-Aug-21 22:23
SUBJECT: RE: Why no Linux in PS3 Slim?
Hi aragon,
I'm sorry that you are frustrated by the lack of comment specifically regarding the withdrawal of support for OtherOS on the new PS3 slim.
The reasons are simple: The PS3 Slim is a major cost reduction involving many changes to hardware components in the PS3 design. In order to offer the OtherOS install, SCE would need to continue to maintain the OtherOS hypervisor drivers for any significant hardware changes - this costs SCE. One of our key objectives with the new model is to pass on cost savings to the consumer with a lower retail price. Unfortunately in this case the cost of OtherOS install did not fit with the wider objective to offer a lower cost PS3.
We'll see if we can get the official OtherOS page updated with something to this effect so that an official explanation is provided. Thank you for your comments.
Sarah.
Cons:
- The benefit of going from 8 to 16 general purpose registers is very small, and often counterproductive: the total number of "true registers" (the physical ones used for register renaming in OoOE) remains the same, so with twice the general purpose registers you halve the renaming register pool. That was especially noticeable in the first AMD64 CPUs, and *very* noticeable on Intel Pentium D CPUs (Pentium 4 with x64 support and other minor changes), which suffered from an insufficient register pool for OoOE operation in x64 mode. Newer CPUs, having a larger register pool, are less affected when executing x64 code.
- Memory and data cache waste: pointers take 64 bits, so unless you're doing your own memory management with 32-bit offsets instead of full 64-bit addresses, you're wasting more memory and, what is worse, using more data cache for the same purpose, with unnecessary CPU-RAM bus load (remember that OoOE implies data fetching! Imagine a contiguous vector of 32 64-bit pointers, taking 2048 bits instead of the 1024 bits it would take with 32-bit pointers).
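The pointer-size arithmetic above, and the "32-bit offsets" workaround, can be sketched in C like this. It is a minimal sketch under assumptions: the struct and function names are invented for the example, bytes are assumed to be 8 bits, and the offset scheme only works if everything lives inside one pool smaller than 4 GB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LINKS 32  /* same count as the 32-pointer vector in the text */

/* Links stored as native pointers: 8 bytes each on a typical 64-bit build. */
struct ptr_table {
    void *link[LINKS];
};

/* Same links stored as 32-bit offsets from a pool base: 4 bytes each,
 * regardless of whether the build is 32-bit or 64-bit. */
struct off_table {
    uint32_t link[LINKS];
};

/* Size in bits, assuming 8-bit bytes. */
static size_t bits_of(size_t bytes)
{
    return bytes * 8;
}

/* Turn a pointer inside the pool into a 32-bit offset from its base. */
static uint32_t to_offset(const char *base, const void *p)
{
    ptrdiff_t d = (const char *)p - base;
    assert(d >= 0 && (uint64_t)d <= UINT32_MAX);  /* must fit in 32 bits */
    return (uint32_t)d;
}

/* Turn a 32-bit offset back into a usable pointer. */
static void *from_offset(char *base, uint32_t off)
{
    return base + off;
}
```

On a 64-bit build `bits_of(sizeof(struct ptr_table))` gives 2048 and `bits_of(sizeof(struct off_table))` gives 1024, matching the figures above. The price is an extra add on every dereference, which is usually far cheaper than the cache lines saved.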
Pros:
However, for some things there is a true benefit: the number of registers for SSE operations has also been doubled, from 8 to 16. That pays off because of the nature of vector processing code, which is usually less prone to branch misprediction and has less register aliasing.
Corollary:
In my opinion a 64-bit operating system makes sense, but an application that doesn't need more than 2GB of RAM, and doesn't need an extra 10% speedup when running optimized SSE vector code, should be compiled in 32-bit mode.
Now, how do you want him, dead or alive?
Yes, they do OoOE, but not with the insane amount of register renaming of the x86 or PowerPC OoOE designs, nor with the same alternate execution depth. The ARM Cortex OoOE is well balanced for power; however, and this is just my opinion, it is completely unnecessary (you could put 3 in-order-execution cores instead of the 2 out-of-order-execution ones).
It is not the "x86 emulator", as it takes a tiny percentage of the die, and 90-95% of instructions are decoded to a single underlying RISC equivalent. Most power consumption comes from OoOE, huge pipelines, and huge caches. In my opinion OoOE processors are an aberration intended to maximize serial code performance by wasting 4 to 8x the resources: it is like having many processors executing future code paths "just in case" (misusing the instruction cache just to feed the OoOE branch prediction execution paths) while misusing the system bus by loading data for instructions that will be discarded 1 out of every 10 times (misusing the data cache by fetching data for instructions that will largely be discarded). So in an "advanced OoOE CPU" you're saturating the bus to compute worthless instructions. As an example, in the area of a P4 CPU you could have 8 to 16 MIPS or ARM in-order CPU cores, making much better use of the shared cache, with 4 to 8x more executed instructions per transistor and efficient system bus usage.
You're right. Keeping fixed prices in a free market is nonsense.
To some extent, the problems of long-distance (interstellar) commerce were addressed by Paul Krugman in the following essay: The Theory of Interstellar Trade (Paul Krugman, 1978) [PDF] (related Slashdot article here).
... inflation.
Well, it could also be said that 9 out of 10 "branded" laptops with VT-capable Core2Duo CPUs have that feature disabled by their BIOS. The point is not about CPUs lacking a feature, but about CPUs that have the feature being crippled at BIOS level. Example: Acer Aspire 2930 (I own one, with an Intel Core2Duo P7350, which supports VT, but it is disabled at BIOS level, with no option to enable it in the BIOS menu). It seems that there are hacks for enabling it (1), but they involve a BIOS reflash, which is, in my opinion, too much of a risk.
My Acer Aspire 2930 laptop (Intel Core2Duo CPU) has the VT extensions disabled at BIOS level. Don't buy this model, and beware of buying other models from Acer.
For sure I will not buy anything from Acer. In addition to the VT %$%$$%-ing, the laptop's VGA output is not properly shielded because of poor design, and produces a signal with a bit of flickering (to get a digital DVI output you additionally have to spend over 125 € on an "Easyport IV" docking station).
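The difference between "CPU lacks VT" and "CPU has VT but the BIOS locked it off" can be sketched in C as follows. This is only an illustration under assumptions: the caller is assumed to have already obtained the ECX value of CPUID leaf 1 and the value of the IA32_FEATURE_CONTROL MSR (0x3A, readable only with kernel privileges); `classify_vt` and the enum names are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Possible outcomes (hypothetical names, for this sketch only). */
enum vt_state {
    VT_UNSUPPORTED,  /* the CPU itself has no VMX */
    VT_LOCKED_OFF,   /* the CPU has VMX, firmware disabled and locked it */
    VT_USABLE        /* VMX is enabled, or can still be enabled by the OS */
};

#define CPUID1_ECX_VMX          (1u << 5)    /* CPUID.1:ECX bit 5 */
#define FEATCTL_LOCK            (1ull << 0)  /* MSR 0x3A bit 0 */
#define FEATCTL_VMX_OUTSIDE_SMX (1ull << 2)  /* MSR 0x3A bit 2 */

/* Classify VT availability from raw register values supplied by the caller. */
static enum vt_state classify_vt(uint32_t cpuid1_ecx, uint64_t feature_control)
{
    if (!(cpuid1_ecx & CPUID1_ECX_VMX))
        return VT_UNSUPPORTED;

    /* Once the lock bit is set the MSR cannot be rewritten until reset,
     * so a locked MSR with the enable bit clear means "crippled". */
    if ((feature_control & FEATCTL_LOCK) &&
        !(feature_control & FEATCTL_VMX_OUTSIDE_SMX))
        return VT_LOCKED_OFF;

    return VT_USABLE;
}

/* A few checked examples with hand-written register values. */
static void classify_vt_examples(void)
{
    assert(classify_vt(0, 0) == VT_UNSUPPORTED);
    assert(classify_vt(CPUID1_ECX_VMX, FEATCTL_LOCK) == VT_LOCKED_OFF);
    assert(classify_vt(CPUID1_ECX_VMX,
                       FEATCTL_LOCK | FEATCTL_VMX_OUTSIDE_SMX) == VT_USABLE);
}
```

The laptop described above would fall in the `VT_LOCKED_OFF` case: the feature bit is present, but the firmware sets the lock before the operating system gets a chance to enable anything.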
And what about the sales lost because of annoying the *customer*? Greedy idiots.
Magically no, but embedded device manufacturers would move quickly to provide: 1) "YouTube-resolution enough" Theora decoding in software on ARM SIMD, 2) hardware-accelerated Theora on GPU.
New features are adopted slowly on embedded devices; take the Flash player for browsers as an example. The change will come after demand, and Google could flip the situation at will: whichever way they choose, they hold the Ace of Spades.
Most newer ARM CPUs inside systems-on-chip include SIMD extensions, so while being less efficient than GPU-h264, it should be enough for decoding YouTube-sized Theora video. GPU-accelerated Theora is a matter of time, but demand has to come first.
It is not a problem for Google.
The problem is encoding, not decoding, as the decoding is done on third-party hardware (the end user's). Also, in the transcoding process, i.e. decoding from whatever format and encoding to h264/Theora, decoding is much faster than encoding (because of pattern matching and motion analysis). Anyway, bandwidth is the main problem, as uploaded video is reencoded *once* and played *many* times.
From 3 to 4 blades? Come on!
Why should it? Fortran is compiled, and so is C, and both are very simple and easily optimizable languages (GCC). Lisp can be compiled too, but because of its flexibility even compiled code implies higher overhead in parameter processing and more data cache thrashing, because of additional control structures requiring extra pointer usage.
In the C vs Fortran case you mention, the most time-consuming part is the complex-domain square root (the loop can be unrolled, and the integer multiply and the FP integer load can be pipelined, along with the FP multiply). Loop optimization, constant propagation, and strength reduction can be done by both Fortran and C compilers, so there is not much left to be done (needless to say, the complex 'sqrt' implementation is probably written in C).
Of month, you insensitive clod XD
P.S. thank you for the link, very cool demo.
... for XP64 and Vista64.
Here is my last tribute:
C:\Users\faragon>copy con hifolks.com
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz^Z
1 files copied.
C:\Users\faragon>debug hifolks.com
-a 100
187E:0100 jmp 112
187E:0102 db "Hello Slashdot!$"
187E:0112 mov ah,9
187E:0114 push cs
187E:0115 pop ds
187E:0116 mov dx,102
187E:0119 int 21
187E:011B int 20
187E:011D
-w
Writing 00048 bytes
-q
C:\Users\faragon>hifolks.com
Hello Slashdot!
C:\Users\faragon>
... is to surrender in order to accept buggy as hell plug-ins or memory leaks as "acceptable".
Current multithreaded Firefox is able to use multiple CPUs, so the only reason for splitting the tabs into independent processes is to surrender to mediocrity. How about improving QA, doing proper synchronization between components, and not allowing untested components to be used without showing a big warning at installation?
Fuck everything, give me the 5MP!
Yes, with a license there is no problem for Intel to produce IBM, ARM, NVidia or AMD chips.
To manufacture under a Marvell license, yes, but the IP is Marvell's property, except for IXP (Network Processors) and IOP (I/O Processors). That means that Intel should no longer be able to build "application" ARM CPUs (PXA*) without a license.
You missed that Intel sold its ARM division to Marvell in 2006 (1).
ARM is not breaking any "law".
Are you sure that it is not breaking the law?