Accessing global memory on GPUs is extremely slow, and there is a strict memory hierarchy that you have to adhere to in order to get any kind of performance.
It could be seen as being the same as on the CPU, except that the CPU will automatically cache data in fast memory for you.
Any problem where you need random access over a large amount of data is just not feasible on GPUs.
What makes you think it would be faster with the MIC?
You will be parallelizing, and each thread will only ever be able to use max_mem/N for its own processing. When you parallelize, you avoid sharing memory between threads. Your data set is split across the threads and synchronization is minimized. In an SMP/NUMA model, this is done transparently, simply by not accessing memory that other threads are working on. In other models, you have to explicitly send each thread the chunk of memory it will be working on (through DMA, the network, an in-memory FIFO or whatever), but that doesn't change anything from a conceptual point of view.
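To make the SMP case concrete, here is a minimal sketch of that kind of decomposition. The function name `chunked_sum` and the choice of a sum as the workload are mine, purely for illustration. Each thread only reads its own contiguous chunk and only writes its own result slot, so the only synchronization is the final join.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical example: sum `data` with `num_threads` workers.
// Each worker touches only its own chunk of the input and its own
// slot of `partial`, so no locks are needed; we only join at the end.
long long chunked_sum(const std::vector<int>& data, std::size_t num_threads) {
    assert(num_threads > 0);
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Chunk size rounded up so that every element is covered.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (std::size_t t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(t * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &partial, t, begin, end] {
            long long local = 0;  // accumulate locally, publish once
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;
        });
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

In a message-passing model, the `begin`/`end` range would become an explicit send of that chunk to the worker, but the decomposition itself stays the same.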
If your parallel decomposition is much more efficient when the data per thread is larger than 1 GB, then you cannot possibly run 64 threads set up like this on the MIC platform. There is often a minimum size required for a parallel primitive to be efficient, and if that minimum size is greater than max_mem/N then you have a problem. This is the limiting factor I'm talking about. 128 MB, however, is IMO quite large enough.
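The arithmetic can be written down as a quick check. This is only a sketch: the helper names are made up, and the 8 GB total is my assumption, derived from 64 threads times 128 MB rather than from any spec sheet.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint64_t MiB = 1024ull * 1024ull;

// Memory each thread can use when max_mem is split evenly over n threads.
inline std::uint64_t per_thread_budget(std::uint64_t max_mem, std::uint64_t n) {
    assert(n > 0);
    return max_mem / n;
}

// True if the minimum efficient chunk size fits within max_mem / n.
inline bool decomposition_fits(std::uint64_t max_mem, std::uint64_t n,
                               std::uint64_t min_chunk) {
    return min_chunk <= per_thread_budget(max_mem, n);
}
```

With an assumed total of 8192 MiB and 64 threads, the budget comes out to 128 MiB per thread, so a primitive that needs 128 MiB per thread fits and one that needs 1 GiB per thread does not.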
In fact this is a major advantage of MIC versus GPUs.
The advantage of MIC lies in ease of programming thanks to compatibility with existing tools and the more flexible programming model. Memory on GPUs is global as well, so I have no idea what you're talking about. There is also so-called "shared" memory (CUDA terminology, OpenCL is different) which is per block, but that's just some local scratch memory shared by a group of threads.
There is nothing nightmarish about the above
Please stop distorting what I'm saying. What is nightmarish is finding the optimal work distribution and scheduling of a heterogeneous or irregular system. Platforms like GPUs are only fit for regular problems. Most HPC applications written using OpenMP or MPI are regular as well. Whether the MIC will be able to enable good scalability of irregular problems remains to be seen, but the first applications will definitely be regular ones.
Yes, a core can access the memory of other cores. Your point being?
If you run all cores at the same time each with their own dataset, which is what you want to do in order to actually use the architecture properly, you'll have that limit for each thread.
No, I mean Linux server VMs running with either Xen or ESX. The fact that neither the host nor the guest runs in graphics mode has nothing to do with it. The host will still emulate a screen and graphics adapter for each guest.
How is that esoteric? A thread shouldn't require more than this even on a PC. That's also much more than the Cell allowed, which is a similar architecture.
Flops have always been a useless metric. If you want good metrics, look at the instruction reference with the speed in cycles of each instruction, its latency, its pipelining capabilities, the processor frequency, and cross it all with the number of cores and memory and cache interconnect specifications.
Flops are just a number that gives a value for a single dumb computation in the ideal case; real computations can be up to 100 times slower than that.
It's just that it's got everything integrated. You can get the same amount of debugging help with gdb and an appropriate front-end, and the fact that those tools are command-line is actually an advantage, since it's easier to integrate them in other environments.
Integrated solutions do not work well because they're inflexible and force you to use their way to do things instead of special dedicated tools that might be better at what they do, can be run on their own, or simply might be more powerful since they are more specialized. In particular, projects in Visual Studio do not scale at all, so it's impossible to do any sort of large-scale software development with it unless you split everything into mini-projects. The file manager is not practical, nor are the search functions, nor is the text editor itself. Its refactoring is severely limited to simplistic things, and its integrated support for team work (VCS) is a joke. About the compiler itself, its C++ language support is horrible, the compiler is riddled with bugs, is slow, and isn't even that good at optimizing as soon as you get out of the patterns for which it auto-detects things and cheats.
Nothing beats a good build system, a good text editor, a good standalone compiler, a good terminal and a good set of tools with adequate scripting.
It's much easier to find a good developer job than it is to create your own business and make it successful. (and I'm saying that as a CEO) I also wouldn't say being a software developer is a dead end when you compare the average salary of software developers to that of the whole population.
Development is only like Lego when everything is trivial. When you're doing development on things that are technically complex, it's more like rocket science.
Those are old versions of software. Also, unlike what you're saying, those libraries do have systems to prevent you from linking against an incompatible version.
I hired someone who did both Windows and Linux some time ago. Fired him after one month when I realized he was always looking for a GUI to do anything.
Seems like you simply want to make sure you parallelize on the L3 cache line boundary to avoid false sharing (same as with regular CPUs).
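For what it's worth, a minimal sketch of that on a regular CPU looks like this. The 64-byte line size is an assumption (common on x86, but check your target), and `PaddedCounter` and `same_cache_line` are names I made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Adjust for the actual hardware.
constexpr std::size_t kCacheLine = 64;

// Per-thread counter aligned and padded to a full cache line, so that
// adjacent array elements never land on the same line.
struct alignas(kCacheLine) PaddedCounter {
    long long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "line-aligned");

// True if the two addresses fall on the same (assumed) cache line.
inline bool same_cache_line(const void* a, const void* b) {
    const auto line_a = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    const auto line_b = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return line_a == line_b;
}
```

If each thread writes to its own `PaddedCounter` in an array, neighbouring threads no longer invalidate each other's lines. With a plain `long long` array, adjacent slots would typically share a line and ping-pong between cores.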
You can do that, but that will only reduce the amount of memory available to each core, not increase it...
You could also have something even less homogeneous, but that would be a nightmare to schedule.
To program the MIC you need to design your program so that each thread only requires 128 MB of RAM anyway...
Sorry, but no. VMs typically run with an emulated screen, typically redirected to a VNC server.
You'll have to explain to me how a web browser can disable basic OS functionality like print screen.
Maddo Scientisto!
How about an actually good, extensible, stable and scalable toolchain instead?
This is clearly apparent from all the videos of leaked models.
Time to counter-sue?
Visual Studio? Fantastic?
Newbie developers are so funny.
Why can't we see the exoskeleton in question?
Are you crazy? 25 ms is very fast for a TCP handshake.
Actually, Red Hat is one of the more GUI-oriented distributions....
You realize most software developers run Linux, and that software developers can easily be paid in the 100k range?
Why not try to do it in Cambridge? It's already a major technology cluster, better invest there than to try to recreate something from scratch...