First of all, there are very few general purpose applications that special purpose NVIDIA hardware running CUDA can do significantly better than a real general purpose CPU, and Intel intends to cut even that small gap down within a few product cycles. That's not strictly true. Off the top of my head: sorting, FFTs (or any other dense linear algebra) and crypto (both public key and symmetric) cover quite a lot of range. The only real issue for these applications is the large batch sizes necessary to overcome the latency. Some of this is inherent in warming up that many pipes, but most of it is shit drivers and slow buses.
The real question is what benefits will CUDA offer when the vector array moves closer to the processor? Most of the papers with the above applications used pre-CUDA hardware with all of the horrors of general-purpose coding running under OpenGL. A couple of the applications would already receive a significant boost from running in CUDA on modern hardware (primarily from latency reduction).
It doesn't surprise anyone that we are watching the second generation of FPU being folded into the processor. It wouldn't surprise me personally if ten years from now the individual floating-point EUs inside most chips had disappeared completely, leaving small integer / control pipes as a front-end to a massive vector array of FP units. There is more at stake than who can trace rays the quickest.
There is no real detail in the article so dredging my memory for how CUDA works... It probably is because they are stream processors - i.e. a pool of vector processors that are optimised for SIMD. The innovation was that the pool could be split into several chunks working on separate SIMD programs. Rather than threads there are programmable barriers to control the different groups and explicit memory locking to ensure the cache is partitioned between the different groups.
So to put it another way, the big threading "innovation" in CUDA is to not use threading, but instead to partition the memory and use low-level synchronisation primitives. Something that the supercomputing guys are well aware of, although they prefer to stick an MPI layer on top of it.
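For anyone who wants the partition-plus-barrier idea in concrete form, here is a rough sketch. It is not CUDA and not how the hardware does it: Python threads just stand in for the groups, and N_GROUPS, CHUNK and the group() function are names I made up for illustration. The point is only that each group writes to its own slice of memory without locks, and a barrier separates the write phase from the read phase.

```python
import threading

N_GROUPS = 4   # hypothetical number of groups the pool is split into
CHUNK = 8      # hypothetical number of elements owned by each group

data = list(range(N_GROUPS * CHUNK))   # shared array, partitioned by group
partial = [0] * N_GROUPS               # one private output slot per group
totals = [0] * N_GROUPS                # each group's view of the final sum
barrier = threading.Barrier(N_GROUPS)

def group(gid):
    # Phase 1: a group only touches its own partition, so no locks are needed.
    lo = gid * CHUNK
    partial[gid] = sum(data[lo:lo + CHUNK])
    # Barrier: nobody proceeds until every partition has been written.
    barrier.wait()
    # Phase 2: it is now safe to read what the other groups produced.
    totals[gid] = sum(partial)

workers = [threading.Thread(target=group, args=(g,)) for g in range(N_GROUPS)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(totals)
```

Every group ends up with the same total because the barrier guarantees all partial sums exist before anyone reads them.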
You must be new here. *Cough*... pot... kettle... black? You actually said this about America with a straight face:
The law runs the country. This is a nation of law, not a country of lawyers who are best paid by large content owners.
Actually there is a definition of how long a sample has to be to infringe copyright because the RIAA took various hip-hop stars to court claiming infringement on beats they'd lifted. I'm too lazy to google but I think it was about 5 seconds - which would make it about 150K @ 256Kb/s encoding. If the chunk size on the protocol was set to be lower than this then I'd love to see someone argue in court that it was fair use:)
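Checking my own back-of-envelope number: the 5 seconds is from memory and may well be wrong, but taking it at face value the arithmetic looks like this. The function name and the choice between 1000-bit and 1024-bit kilobits are my assumptions.

```python
def sample_bytes(seconds, kbit_per_s, kilo=1000):
    """Bytes needed to hold `seconds` of audio at a constant bitrate."""
    return seconds * kbit_per_s * kilo // 8

decimal_size = sample_bytes(5, 256)         # kilobit = 1000 bits
binary_size = sample_bytes(5, 256, 1024)    # kilobit = 1024 bits

print(decimal_size, binary_size)
```

Either way it comes out a little above the 150K I quoted, so a chunk size comfortably under that would still be below the threshold.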
Much like witty retorts in bars that you come up with the next day, I now realise that the final part of my comment should have read "kill him now, kill kill, kill -9"... ah, life's too short, at least on slashdot I can post that next time the discussion comes around:)
I would say so, although it is more efficient to use binary than any other base so I don't think it will come up. When I mentioned self-referential data I meant things that contained addresses of other pieces of information. Whenever that is the case, the fact that the address is being stored in binary means that it makes sense to use units based on powers of two. But what I'm arguing is just a variant of the old memory addressing argument.
The basic argument is that when storing anything it makes sense for the units to be "round" numbers (in the everyday sense), and where those addresses are being manipulated in binary, base-2 units make sense. Of course, not everyone stores and manipulates the same kind of data so YMMV.
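A small sketch of what I mean by "round" units for addresses. The 4096-byte page and the 4000-byte "decimal page" are just assumed example sizes, and the function names are mine.

```python
PAGE_BITS = 12
PAGE_SIZE = 1 << PAGE_BITS   # 4096 bytes, a power of two

def split_binary(addr):
    # With a power-of-two unit, the unit number and the offset are just
    # the high and low bits of the address: a shift and a mask.
    return addr >> PAGE_BITS, addr & (PAGE_SIZE - 1)

def split_decimal(addr, unit=4000):
    # With a base-10 unit the same split needs a real division.
    return divmod(addr, unit)

addr = 0x12345
print(split_binary(addr))
print(split_decimal(addr))
```

In hex the binary split is visible by eye (0x12 and 0x345), which is the whole attraction when the data is full of addresses.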
These are not the processes you are looking for...
He is referring to processes as a language mechanism for encapsulating state and communication. Not as in O/S level encapsulation of state.
Language-based processes are very light-weight. Look at Erlang, any of the Occam spin-offs or any of the other theoretical...
hmm, 1960s? That sounds very early for CCS or CSP. Oh dear, I think you're right, he is talking about forking processes for each unit of concurrency. Shoot him, shoot him now
Cue posts on block sizes, sector sizes, which are still not relevant to the number of bytes in a file. Not quite:)
You're confusing two things in your argument there: the amount of information, and the shape of the information. Ah, but then I don't really see any difference between the two, so it makes it easier to confuse:) But then I'm used to dealing with information that is self-referential
Keep your opinion if you want, but your line of argument is entirely redundant. This is not a usability issue and I think that you know that. The end user does not really care if sizes are displayed in base 2 prefixes or base 10. It doesn't matter.
A file size is some number. If this other file is about 10x larger then the number will be about 10x larger. That's about as concrete as it gets from a user interface point of view.
For a more technical user it does matter, because they may have issues related to how many addressing units (blocks) the file spans. So the choice of which is used is irrelevant to the vast majority of users from a usability perspective, but vital to a certain segment of them.
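The sort of calculation the technical user cares about, as a sketch. The 4096-byte block size is an assumed default, not something from the discussion.

```python
def blocks_spanned(size_bytes, block_size=4096):
    """Number of whole blocks needed to hold a file (ceiling division)."""
    return (size_bytes + block_size - 1) // block_size

print(blocks_spanned(1_000_000))   # a "decimal" megabyte
print(blocks_spanned(1 << 20))     # a "binary" megabyte
```

The binary megabyte fills its blocks exactly; the decimal one leaves the last block partly empty.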
Of course the choice of binary is not arbitrary and there is the basic choice of whether most convenient sizes should be integer (with base-2 prefixes) or non-integer (with base-10). I'm sure that making most storage sizes integer makes the system simpler, but hey maybe you want to argue differently.
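To make the integer versus non-integer point concrete, take a typical power-of-two memory size and express it both ways. The 512 figure is only an example and the helper names are mine.

```python
def in_binary_mega(n_bytes):
    return n_bytes / 2**20    # base-2 prefix: units of 1,048,576 bytes

def in_decimal_mega(n_bytes):
    return n_bytes / 10**6    # base-10 prefix: units of 1,000,000 bytes

size = 512 * 2**20            # a size that falls out of binary addressing
print(in_binary_mega(size))   # a whole number
print(in_decimal_mega(size))  # not a whole number
```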
Most quantities that we measure are base-neutral so we default to base-10 because it is the standard counting system. But when we measure storage we are talking about a volume of information. And information in digital form is inherently binary, both when stored, and when manipulated.
So the only base that it makes sense to talk about amounts-of-information in is binary. Hence decades of engineers using the correct, i.e. most logical, measurements.
Now on a tangent, but if I think (way back) to my school days I seem to remember being taught kB, mB and gB. The idea being that the lower case prefix would prevent confusion with SI prefixes. But I'm way too lazy to look for some sort of citation for that, and yes, only engineers would think that reduces confusion.
I know, I've followed their research. But unfortunately the fact that the network is static does not (in their case) give static routing costs. It can be as low as one cycle - but you cannot rely upon it.
Very lucid description. The other problem with the design is that you don't get what you expect; using a simple 4-way grid should give predictable latency costs between nodes. Unfortunately their routing algorithm is non-predictable so you can't statically schedule threads at compile-time to feed each other, it all has to use dynamic control-flow. Shame really.
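What "predictable" would look like, as a sketch under assumptions that are mine and not a description of their chip: dimension-ordered routing on a 2D grid with a fixed cost per hop. Under those assumptions the latency between two tiles is just the Manhattan distance, which a compiler could plug straight into a static schedule.

```python
def hops(src, dst):
    """Hop count between two (x, y) tiles with dimension-ordered routing."""
    (x0, y0), (x1, y1) = src, dst
    return abs(x1 - x0) + abs(y1 - y0)

def latency_cycles(src, dst, cycles_per_hop=1):
    # cycles_per_hop is a hypothetical fixed cost; the complaint above is
    # precisely that the real network does not guarantee one.
    return hops(src, dst) * cycles_per_hop

print(latency_cycles((0, 0), (7, 7)))   # corner to corner on an 8x8 grid
print(latency_cycles((3, 3), (3, 4)))   # nearest neighbour
```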
If you liked the transputer then you should look at its other descendant that is in the process of coming to market. There isn't a wealth of public information yet although they are in the process of releasing dev-tools and simulators. The largest chip only has 4 tiles, instead of the Tilera's 64, but it is aimed at the low power market. They should scale it up to similar levels without the silly amount of power the Tilera draws.
The simple answer is you use a memory hierarchy same as people do now. The L2 cache on a CPU is large enough to contain the working set for most problems. The working set for GPU-type problems tends to be accessed differently. You need some sort of caching for data but for lots of the memory you access it will be a really large pretty sequential stream. The memory locking in CUDA reflects this.
So going back to your comment about memory mismatch. Some of your cores in a hybrid would have large L2 caches like a conventional CPU. Some of your cores would have almost no L2 cache but would share a really large pool of L3 (probably the same 1/2 GB of DDR3) and the rest would be system memory. If the large L3 pool is in use then the CPU-type cores wouldn't see any benefit from this layer, but when the GPU parts are idle this would be a large speed boost for the CPU-type parts.
It can do vast amounts of linear algebra really quickly. That makes it useful for a lot of applications if you decrease the latency between the processor and the vector pipelines.
Sharing one bus would hamper bandwidth per core (or parallelism as you've phrased it) - but look at the memory interface designs in mini-computers/mainframes over the past ten years for some guesses on how that will end up. Probably splitting the single bus into many point-to-point links, or at least that is where AMD's money was.
Maybe he believed in some sort of managed automatic blood-collection, in which case we certainly don't want his code in the kernel. Could lead to some sort of Singularity
I know of a certain transatlantic link that would fail once a day (turned out to be a missing free that caused the heap to become exhausted). The customer screamed that every 30s reboot cost them $50,000. The bug went unfixed for nine months because it couldn't be replicated in a test environment, only on their live link and for some reason they wouldn't let us debug it there.
Once a day their CEO called ours and shouted for five minutes about the 50 grand that they'd just lost.
True, but I've been standing in switch rooms watching operators manually kill those circuits because they wanted to reboot a box. 5x 9s doesn't mean perfect service, and if anyone complained about it they were told that a ms interruption once every few months was in their SLA. By the time they reconnected they went through another box so how were they to know it was any longer than that.
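For scale, here is what five nines actually allows per year. This is just the standard availability arithmetic; the function name and the 365.25-day year are my choices.

```python
def downtime_seconds_per_year(nines):
    """Allowed downtime per year for an availability of `nines` nines."""
    unavailability = 10 ** -nines          # 5 nines -> 0.00001
    return unavailability * 365.25 * 24 * 3600

budget = downtime_seconds_per_year(5)
print(round(budget, 1), "seconds per year")
print(round(budget / 60, 2), "minutes per year")
```

That is a bit over five minutes a year, so a millisecond blip every few months barely dents the budget, and a much longer one would still fit.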
Well if you want to know what they look like... I can't vouch for how accurate these images are. I can see that they are either the largest clerical fuckup of all time, or a great hoax.
When I was travelling through Madrid airport in the summer of 2003 there was a series of display cases with every Lockheed Martin aircraft ever made. Gorgeous little wooden carvings. When I saw this beauty I nearly dropped from shock. Then I walked backwards on the travelator to snap the pic - hence the horrible blur. There is also a closeup.
Either somebody in the marketing department made a career-ending mistake, or someone in the modelling department had some fun with the Spanish public. There should be enough plane nuts on these here threads to decide...
Yes. They're trying to claim damages from you because other people are distributing the file. I can't see anything wrong with that argument at all...
One of the most terrible attempts to deliberately Godwin a thread ever. You have to at least try and make it sound relevant
Pour quoi? It's said to be like wiping your ass with silk
OK, I'll take that challenge.
Not that I disagree with your point - but can you name a single company that puts its customers' interests over those of its shareholders?
So... you understood his point?
Are you sure you wouldn't prefer a nice game of chess?
It's good to hear that people are still actively trying to hasten Judgment Day