Also, it seems to take a lot of select, shift, and shuffle instructions to make efficient use of the quadword (SIMD) instructions. With Xeon and Opteron, use of the quadword instructions seems to require far fewer additional cycles.
Those extra instructions are only required for scalar code. Vectorizable code does not generally need additional selects, shifts and shuffles, unless the compiler can't guarantee that the data is aligned. However, since all instructions are SIMD, for scalar code the compiler has to emit those additional instructions to hide the fact that loads, stores and everything else operate on whole quadwords. This makes scalar code slow and bloated, but not SIMD code.
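To make the scalar case concrete, here is a rough toy model in Python (not real SPU code). It assumes memory can only be touched in aligned 16-byte quadwords, so a single 4-byte store turns into load, insert, store. The function names are made up for illustration, and the address is assumed to be 4-byte aligned.

```python
# Toy model, not real SPU code: memory is only accessible in aligned
# 16-byte quadwords, so a scalar store becomes load + insert + store.
QUADWORD = 16


def load_quadword(memory: bytearray, addr: int) -> bytes:
    """Return the aligned quadword that contains addr."""
    base = addr & ~(QUADWORD - 1)
    return bytes(memory[base:base + QUADWORD])


def store_quadword(memory: bytearray, addr: int, quad: bytes) -> None:
    """Overwrite the aligned quadword that contains addr."""
    base = addr & ~(QUADWORD - 1)
    memory[base:base + QUADWORD] = quad


def store_scalar32(memory: bytearray, addr: int, value: int) -> None:
    """Emulate a 32-bit scalar store using only quadword accesses."""
    quad = bytearray(load_quadword(memory, addr))        # 1. load the enclosing quadword
    offset = addr & (QUADWORD - 1)                       # 2. byte offset inside it
    quad[offset:offset + 4] = value.to_bytes(4, "big")   # 3. insert the scalar (the "shuffle")
    store_quadword(memory, addr, bytes(quad))            # 4. write the whole quadword back


memory = bytearray(32)
store_scalar32(memory, 20, 0xDEADBEEF)
```

A scalar ISA does step 3 in a single store instruction. Here steps 1 to 4 are all separate work, which is where the extra code size and cycles come from.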
Well, my point was that you were not clearly separating what happens inside a chip and what happens between chips. For example, the EIB bandwidth figures you were mentioning are for the aggregate for the whole chip. Chip to chip communication has much less bandwidth.
About unified memory, PPU to PPU is coherent (plus load/store queues), but inside the chip the instructions are not running on the same memory space unless you do some tricks, and even then it is still not a unified memory architecture. This is one of the main benefits of this architecture: SPUs operate on local data, without the effects of false sharing, cache line ping-pong, etc.
I think that you are mixing chips and cores together.
And the Cell is designed for scalable multicore/chip parallelism.
The Cell is a heterogeneous multicore design with very good bandwidth between its cores, but that does not mean that it has been designed for scalable multichip parallelism. In fact there is a paper that shows that the bandwidth between chips is not that great.
Its main magic is its coherent, superfast "elements" bus, which retains coherency even at 1.6Tbps across multiple cores and chips.
Again, the Element Interconnect Bus has quite a lot of bandwidth, but it is only available between the cores of a single chip. Interchip communication must be performed through the IO port, which has much less bandwidth.
IBM has 4-core chips in pairs already deployed in public, and 128-core chips in the lab, where a massive new top-predator supercomputer is being built on the new architecture.
That's interesting. Could you provide a link to that information, please?
The Cell has builtin allocation facilities, so app code doesn't have to schedule or otherwise closely manage the fast SPEs, just send tasks to a generic pool.
The Cell does not have those facilities in hardware. All that is implemented in software.
Which SPEs just DMA into a unified memory model.
That is a bit confusing. The PPE operates on main memory, and main memory is accessible to the SPEs, but only through DMA operations. The SPEs operate on their own memory (Local Store in the literature). I consider that a non-unified model.
Nevertheless, this model can be altered by memory mapping the SPE Local Stores onto the memory of the PPE. But that still does not allow the SPEs to operate directly on main memory.
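A rough sketch of what I mean by non-unified, as a Python toy model. The class and method names (`ToySPE`, `dma_get`, `dma_put`, `increment_local`) are made up for illustration and are not the real Cell SDK interface. The only point is that the compute step can touch the Local Store and nothing else.

```python
class ToySPE:
    """Toy SPE: computes only on its own local store, moves data by explicit DMA."""

    LOCAL_STORE_SIZE = 256 * 1024  # 256 KB local store per SPE

    def __init__(self, main_memory: bytearray) -> None:
        self.main_memory = main_memory
        self.local_store = bytearray(self.LOCAL_STORE_SIZE)

    def dma_get(self, ls_addr: int, ea: int, size: int) -> None:
        """Copy size bytes from main memory (address ea) into the local store."""
        self.local_store[ls_addr:ls_addr + size] = self.main_memory[ea:ea + size]

    def dma_put(self, ls_addr: int, ea: int, size: int) -> None:
        """Copy size bytes from the local store back to main memory."""
        self.main_memory[ea:ea + size] = self.local_store[ls_addr:ls_addr + size]

    def increment_local(self, ls_addr: int, size: int) -> None:
        """The 'kernel': it reads and writes local store only, never main memory."""
        for i in range(ls_addr, ls_addr + size):
            self.local_store[i] = (self.local_store[i] + 1) & 0xFF


main_memory = bytearray(range(16))
spe = ToySPE(main_memory)
spe.dma_get(ls_addr=0, ea=0, size=16)   # explicit transfer in
spe.increment_local(0, 16)              # work on the local copy
snapshot_before_put = bytes(main_memory)
spe.dma_put(ls_addr=0, ea=0, size=16)   # explicit transfer out
```

Main memory only changes at the `dma_put` call. Until then the SPE has been working on a private copy, which is why there is no false sharing to worry about, and also why I would not call it a unified model.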
That kind of simplicity makes Cell programming harder than, say, PowerPC programming, but much easier than other parallel programming, without losing its speed. Once there are some basic libraries for programming "common" new parallel tasks on the Cell, it won't be considered any harder than it was to program x86 "Protected Mode", Extended vs Expanded Memory, word alignment, etc.
I think that, in general, programming for the Cell is much more complicated than programming for an SMP, and in some cases even than programming with MPI.
- There is very little storage on the SPE side, and it must be shared by the code, the data and the stack.
- The SPEs do not have memory protection on their Local Store, which means that smashing your data or code with the stack is not detected and handled automatically.
- The SPEs have a pure vector ISA, which forces the programmer to vectorize the SPE code in order to obtain good performance. In fact, having a pure vector ISA forces the compiler to emit lots of additional instructions (rotating and masking) for non-vectorized code (compared to scalar ISAs), making the LS space limitations tougher.
- The PPU, although multithreaded, is not as powerful as a traditional PPC (e.g. no OoO execution), which in practice means that you cannot spend too many cycles on scheduling work for the SPEs; otherwise your SPEs will be starved.
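To put some numbers on the storage point, here is a back-of-the-envelope budget in Python. The 256 KB Local Store size is the real figure. The code, buffer and stack sizes are invented purely for illustration, and the helper names are hypothetical.

```python
KIB = 1024
LOCAL_STORE_BYTES = 256 * KIB  # Local Store size of one SPE


def local_store_headroom(code_bytes: int, data_bytes: int, stack_bytes: int,
                         total: int = LOCAL_STORE_BYTES) -> int:
    """Bytes left over; a negative value means the program does not fit."""
    return total - (code_bytes + data_bytes + stack_bytes)


def fits(code_bytes: int, data_bytes: int, stack_bytes: int) -> bool:
    return local_store_headroom(code_bytes, data_bytes, stack_bytes) >= 0


# Illustrative numbers only: two 64 KiB buffers for double buffering,
# a 16 KiB stack, and 96 KiB of vectorized code.
data = 2 * 64 * KIB
stack = 16 * KIB
vectorized = local_store_headroom(96 * KIB, data, stack)

# Same program, assuming the scalar version needs 125 KiB of code
# because of the extra rotate/mask instructions.
scalar = local_store_headroom(125 * KIB, data, stack)
```

With these made-up sizes the vectorized build has 16 KiB to spare and the scalar build is 13 KiB over budget. Since there is no memory protection, nothing stops the stack from growing into that missing space at run time.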
Without the help of tools and libraries that hide those low-level details from the programmer, programming the Cell can be quite hard.
I think that programming any non-embedded processor should be simple, and for that reason libraries, compilers and other tools are going to be as important for the Cell processor as the compiler is for Itanium.
This is an all-in-one reply to several posts.
First of all, the comments in this post are my personal comments and not those of the parties involved (IBM, BSC and the Spanish Government).
The final destination has always been Barcelona, but they put the machine together in Madrid because the final building was not ready in time for running the Top500 benchmark. Even then, they didn't have enough time to set up all the nodes, so the Top500 result had to be obtained with fewer nodes than the fully assembled machine has. I believe that SGI had also submitted data for their machine before it was fully assembled, but sent the results of the full machine after the deadline and got them accepted. When the final building was ready in Barcelona, they moved the machine to its final destination.
Regarding the limits of scalability, I think that having such a configuration presents new challenges for the computer science researchers who work at the center. Having such a machine at our disposal will give us a very interesting opportunity to improve the scalability of our parallelization techniques.
Regarding the memory configuration, the login nodes have 4GB of RAM, and I believe the rest of the nodes have the same configuration.
And finally, the file systems are currently mounted using NFS, but it is expected that they will soon change to GPFS.
I also know bisexual people that do that, and homosexual people that laugh at the "jokes". But they also pretend that they are not gay, talk about their "girlfriends", and so on. I call them hypocrites. Being myself a gay man, they make me sick. I think that tons of gay bashing comes from gay people who are ashamed of themselves. That's tragic. The world hasn't changed as much as people want you to believe.
My mouse, on the other hand, makes a very audible *click* each time I use it, and while providing a pleasant tactile feedback, it keeps my girlfriend awake during my late-night work sessions.
OK, so it's a protocol, like say ftp or http, but different. So it seems, as per the bugzilla discussion, that the problem should be solved by creating a mozilla plugin to handle URLs written torrent://domain.name/localpath/file.torrent.
The real problem is that it doesn't use just one protocol, it uses two. The first is the traditional protocol (http, ftp, email, whatever) you use to download the dot torrent file that describes where to get the actual file. The second is the torrent protocol itself. Using a single URL for two different protocols is not very clean IMHO.
I can think of three solutions: a) eliminating the first protocol by putting the necessary data in the URL, e.g. torrent://server/enough_data_to_begin_the_transfer , b) assuming that a torrent://server/path/file.torrent URL will always be downloaded using http or some other fixed protocol, and c) letting a plugin or other application handle the dot torrent file.
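Option b) is simple enough to sketch in a few lines of Python. Keep in mind that torrent:// is only the hypothetical scheme from this discussion, and the function name and the choice of http as the fixed protocol are my assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def torrent_url_to_metadata_url(url: str, fixed_scheme: str = "http") -> str:
    """Map torrent://server/path/file.torrent to the URL of the dot torrent file.

    Only the scheme changes; host, path, query and fragment are kept as they are.
    """
    parts = urlsplit(url)
    if parts.scheme != "torrent":
        raise ValueError("not a torrent:// URL: " + url)
    return urlunsplit((fixed_scheme, parts.netloc, parts.path,
                       parts.query, parts.fragment))


metadata_url = torrent_url_to_metadata_url("torrent://server/path/file.torrent")
```

The browser would fetch `metadata_url` with the fixed protocol and hand the result to whatever speaks the real torrent protocol. The drawback is that the URL still hides two protocols; it only stops being ambiguous about the first one.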
Please, dear moderators, moderate the parent post up. I'm sick of seeing that all posts that are +5 are "Funny". Please, give a chance to interesting and insightful posts.
The moderation ability of a person is based on how his posts have been moderated. When most +5 posts are "Funny", we get moderators who only know what funny is, not what insightful or interesting is.
I realise that treating the contents of files using simple commands and virtual directories is very interesting, but it has very important consequences.
What happens to old applications? They should work as before with no changes. Imagine the result of creating a tar archive if it didn't behave as before.
For which file types will reiserfs4 be able to create virtual directories to access their content? Or, to put it another way, where do we stop?
How is it implemented, as kernel modules or as a userspace library? If implemented as kernel modules, then they will either bloat the kernel or waste a lot of time and fragment memory loading and unloading modules. If implemented as a library, why not implement it fully in userspace, independently of the chosen FS?
Is there a userspace alternative? For many cases the answer is yes. For example, the password file does not have to be a plain text file. It can be a db file if the system is configured accordingly.
IMHO, accessing the contents of files using virtual directories has no advantage over userspace solutions. If a file requires database-like functionality, it should be in a db or equivalent format. Implementing that functionality at the kernel level is just overkill and doesn't provide any benefit, just added complexity. What's missing (or what I am missing ;) are the proper tools to deal with those files with the ease of the example given before.
Re: Why aren't Oopses dumped to swap? (on Linux 3.0, Score: 1)
Oopses are not normally dumped to swap because swap may not be available or reliable in that state. In fact, you could produce more corruption. I think that the best solution is to leave things as they are, put a mark in some place in memory, reboot and let the boot code deal with it (assuming the system is in a sane state after the reboot).
You can do that. I think that the requirements are higher than you suspect. Look at this page:
The server software runs on UltraSPARC servers supported by the Solaris 2.6, Solaris 7, or Solaris 8 Operating Environments. The suggested server configuration for most installations includes at least two processors, about 25 active sessions per CPU, 20-40 Mbyte random access memory or more for each active session, and about 50-100 Mbyte of swap space per session.
IMHO it is not cost effective for most cases.
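To see where that opinion comes from, here is the arithmetic from the quoted figures as a small Python function: 25 sessions per CPU with a minimum of two CPUs, 20-40 Mbyte of RAM and 50-100 Mbyte of swap per session. The function name and the example of 100 sessions are mine.

```python
import math


def server_sizing(sessions: int) -> dict:
    """Rough sizing from the quoted guidelines; RAM and swap are (low, high) in Mbyte."""
    return {
        "cpus": max(2, math.ceil(sessions / 25)),
        "ram_mb": (20 * sessions, 40 * sessions),
        "swap_mb": (50 * sessions, 100 * sessions),
    }


sizing = server_sizing(100)
```

For 100 active sessions that gives 4 CPUs, 2000-4000 Mbyte of RAM and 5000-10000 Mbyte of swap, all of it on UltraSPARC hardware.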
I don't think that the load balancing is possible yet.
Sun Ray terminals are similar in concept to VNC terminals (if they existed). There is a server to which the terminals connect using a proprietary protocol. The X server runs on the server and the terminal is just a framebuffer with keyboard, mouse, USB sockets and audio. When you start a session you start an X server. When you switch terminals you disconnect from your X server and reconnect to it from another terminal. This method requires a dedicated *monster* server that has enough memory for each frame buffer, enough CPU power to draw into the frame buffers, and dedicated networks with enough bandwidth to the terminals.
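A quick estimate of the frame buffer part, in Python. The resolution, pixel size and number of terminals are assumptions I picked for the example, not Sun Ray specifications, and the function names are made up.

```python
def framebuffer_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Memory needed on the server for one terminal's frame buffer."""
    return width * height * bytes_per_pixel


def total_framebuffer_bytes(terminals: int, width: int, height: int,
                            bytes_per_pixel: int) -> int:
    """Memory needed for the frame buffers of all connected terminals."""
    return terminals * framebuffer_bytes(width, height, bytes_per_pixel)


# Assumed setup: 1280x1024 screens, 4 bytes per pixel, 50 terminals.
one_terminal = framebuffer_bytes(1280, 1024, 4)
all_terminals = total_framebuffer_bytes(50, 1280, 1024, 4)
```

That is 5 MiB per terminal and 250 MiB for 50 of them, before counting the memory of the X server and the applications in each session. I don't know how much the protocol saves by sending only the regions that changed, so I am not attempting a bandwidth estimate here.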
What would be really interesting is migrating applications from one X server to another transparently, just using the X protocol, without dedicating a server to the task or having to migrate the entire session.
Wouldn't it make more sense if he asked if it was Celsius or Kelvin?
But then, would you trust the summary or would you read the patent application to be sure?
I can already see which ISP will host their servers giving their clients optimum performance.
I think the full plan has been laid out by now.
DLL Dynamic Link Library
SO Shared Object
LIB Library
A Archive
You don't snore, do you?
BTW, I vote "between 0m4.000s and 0m3.001s".
I propose a new slashdot poll:
time globus-job-run machine
Bad idea. Do you want to give them a reason to sue slashdot for trademark infringement?
Am I the only one who understood that title as "Details on a crossing between an animal and a Japanese"?
Even if it is a bug in the document, the browser should never crash.
Don't use the password as the encryption key; just keep the encryption key in a file that is itself encrypted using your password.
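A minimal sketch of that idea in Python, using only the standard library. The key derivation uses PBKDF2, which is real. The XOR masking is only a toy stand-in for a proper key-wrapping cipher, because the standard library has no AES, so don't use this as is. All function names and the key file layout (salt, masked key, tag) are made up for the example.

```python
import hashlib
import hmac
import os


def _derive(password: str, salt: bytes) -> bytes:
    """Derive 64 bytes from the password: 32 to mask the key, 32 for the HMAC."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000, dklen=64)


def wrap_key(data_key: bytes, password: str) -> bytes:
    """Return the contents of the 'key file': salt + masked key + integrity tag."""
    if len(data_key) != 32:
        raise ValueError("this sketch only handles 32-byte keys")
    salt = os.urandom(16)
    material = _derive(password, salt)
    masked = bytes(a ^ b for a, b in zip(data_key, material[:32]))  # toy cipher
    tag = hmac.new(material[32:], salt + masked, hashlib.sha256).digest()
    return salt + masked + tag


def unwrap_key(blob: bytes, password: str) -> bytes:
    """Recover the data key, or raise ValueError on a wrong password."""
    salt, masked, tag = blob[:16], blob[16:48], blob[48:]
    material = _derive(password, salt)
    expected = hmac.new(material[32:], salt + masked, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong password or corrupted key file")
    return bytes(a ^ b for a, b in zip(masked, material[:32]))


def change_password(blob: bytes, old_password: str, new_password: str) -> bytes:
    """Re-wrap the same data key; the encrypted data itself is never touched."""
    return wrap_key(unwrap_key(blob, old_password), new_password)
```

The benefit shows up in `change_password`: changing the password only rewrites the small key file, not everything that was encrypted with the data key.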
Or how about the key between Fn and Alt? Yes, that's the infamous "diamond key".
Do you mean the meta key?
Please, remember that this is not segfault.org.
But neither one will get any design awards
Of course the x86 won't get a design award. The x86 wasn't created with extensibility in mind. It has been handicapped from the very beginning. It wasn't designed with the idea that it could use more registers in the future, or that it could end up using any register for any purpose. In contrast, the X system has been designed for extensibility, network transparency, multiuser systems and isolation from the kernel.
Extensibility allows adding functionality to the system. The common example is the Render extension, but that is just a small example. They could have created a widget set as an extension to reduce network traffic (not that it would be a good idea). The problem with extensions is not the extensions themselves but standardisation. A non-standardised extension is useless.
Network transparency allows you to use any machine (that uses X) from a single location. You can have a desktop with several apps from different machines. You can move to another location and use the same machine you used before.
Multiuser systems allow various users to be logged into the same machine at the same time without interfering with one another, except for some level of resource competition. This reduces the number of systems to be configured and maintained.
Isolation from the kernel allows the X server to run in a separate process. The implication is that if for any reason an operation causes a crash, it will only crash the X server and not the entire operating system.
I understand that these features are of no use to the average user of a computer. On the other hand, they are completely transparent to the user. I like its design very much. What problems do people find in X?
most people do need the ability to change resolution and color depth on their desktops easily
Why? I configured my screen resolution and depth when I installed the OS. Why should I need to change it again?
This is not intended to be a troll. I just don't get it.