Eventual consistency means that the computer eventually computes the right answer if it's quiescent long enough. Intermediate values, though, are an approximation, which is often good enough.
One example that Paul McKenney gives is a distributed counter built out of per-CPU counters, with CPU-to-CPU events saying how much to update the total by. (Let's assume positive counts only.)
Each CPU will see update events from other CPUs in different orders, each saying how much to update the count by. All CPUs will eventually see all updates. So, the total seen by any given CPU might differ from the true total in the short run (and may not even be a technically valid total given the original source of events, since events get reordered), but eventually all of the counters will converge on the same total if updates stop pouring in. Also, the totals are still locally monotonic.
If you require all CPUs to see the same sequence of updates to the count, then you have to take locks and serialize memory accesses, which on a manycore system is an expensive operation that simply doesn't scale well. But if you relax the constraint to "eventual consistency" and "monotonic updates", then each core can keep its own local approximation that isn't too far from the real value, knowing that each core is no further from the true value than the backlog of events yet to arrive.
That's an extremely reasonable model for many types of data.
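Here's a toy simulation of that counter scheme in Python, just to make the idea concrete. It's a sketch under my own assumptions, not anyone's real per-CPU counter code: the function name `simulate`, the event format, and the random delivery order are all made up for illustration, and every update is issued up front so the "true total" is fixed.

```python
import random


def simulate(num_cpus=4, updates_per_cpu=5, seed=1):
    """Deliver the same positive updates to every CPU, each in its own order."""
    rng = random.Random(seed)

    # Every CPU issues some positive increments (hypothetical event format).
    events = [(cpu, rng.randint(1, 9))
              for cpu in range(num_cpus)
              for _ in range(updates_per_cpu)]
    true_total = sum(amount for _, amount in events)

    # Each CPU receives all events, but shuffled into a different order.
    inboxes = []
    for _ in range(num_cpus):
        order = events[:]
        rng.shuffle(order)
        inboxes.append(order)

    local = [0] * num_cpus                    # each CPU's running approximation
    history = [[0] for _ in range(num_cpus)]  # what each CPU saw over time

    # Interleave deliveries across CPUs at random until every inbox is empty.
    pending = [cpu for cpu in range(num_cpus) for _ in range(len(events))]
    rng.shuffle(pending)
    for cpu in pending:
        _, amount = inboxes[cpu].pop()
        local[cpu] += amount
        history[cpu].append(local[cpu])
        # The error is exactly the backlog this CPU has yet to see.
        backlog = sum(a for _, a in inboxes[cpu])
        assert true_total - local[cpu] == backlog

    return true_total, local, history


true_total, local, history = simulate()
print("true total:", true_total, "local totals after quiescence:", local)
```

Mid-run, the per-CPU totals disagree with one another; once the inboxes drain they all match, and no CPU ever sees its own total go down.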
Go buy some cheap hosting and set up a WordPress blog then. You can have that nice, shiny island on the Internet all to yourself just like you like it. Tell all your friends to subscribe to its RSS feed.
ROFL. Not all of us have low enough UIDs to remember Katz. But you and I do.
Yep, I'm with you here.
I did a stint with 132 columns a while back (back when I was running Linux on a machine that wasn't quite powerful enough to run X11), and I found the extra horizontal space to be mostly wasted. If I did start putting comments or code over there, it'd often get "lost".
Now, I do use a wide format for certain debugging applications. It also works well for spreadsheets. But for source code? I've definitely got an 80 column mind, despite never having used punch cards or paper tape. I went from 28 columns to 40 to 80 as I learned programming in the 1980s. 80 columns is a very comfortable width. 132 pushed it too far.
80 columns actually corresponds pretty well to the amount of text you'd have on a typewritten page on standard letter size paper (or A4, if you prefer). You get roughly 65 to 78 characters across (assuming 1" margins on letter paper) depending on whether you're 10CPI or 12CPI, and that's roughly how wide programs are when written within 80 column boundaries.
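The arithmetic is easy to check. A quick Python sketch (the function name `chars_per_line` is mine, and I'm assuming 8.5" letter paper and 210 mm A4): 1" margins on letter paper work out to 65 and 78 characters, while a 6" line (1.25" margins) gives 60 and 72.

```python
LETTER_WIDTH_IN = 8.5
A4_WIDTH_IN = 210 / 25.4  # 210 mm expressed in inches


def chars_per_line(paper_width_in, margin_in, cpi):
    """Whole characters that fit between equal left and right margins."""
    line_width_in = paper_width_in - 2 * margin_in
    # The tiny epsilon guards against floating-point fuzz before truncating.
    return int(line_width_in * cpi + 1e-9)


for label, width, margin in [
    ("letter, 1 inch margins", LETTER_WIDTH_IN, 1.0),
    ("letter, 1.25 inch margins", LETTER_WIDTH_IN, 1.25),
    ("A4, 1 inch margins", A4_WIDTH_IN, 1.0),
]:
    print(label, chars_per_line(width, margin, 10), chars_per_line(width, margin, 12))
```

Either way, the result lands a bit under 80 columns.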
Indeed!
A few years back, I was implementing Leslie Lamport's Bakery Algorithm. (To be sure, you should look up his original paper, not the bastardizations you sometimes find in textbooks. That paper and more are available here.)
In my implementation, I wanted to SIMD-ize one of the steps to make it more efficient. I thought the transformation was valid, but wasn't certain, so I emailed Dr. Lamport. I was pleasantly surprised when Leslie actually replied to my email.
And yes, the transformation was valid. *whew* Our multiprocessor DSP software got a little faster that day.
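For anyone who hasn't seen it, here's a minimal sketch of the bakery algorithm in Python. To be clear about what this is and isn't: it's not the DSP code, it doesn't show the SIMD transformation, and the class and function names (`BakeryLock`, `run_demo`) are made up for illustration. It also leans on the CPython interpreter making individual list reads and writes behave sequentially; on real multiprocessor hardware you'd need to think about memory ordering.

```python
import threading
import time


class BakeryLock:
    """Mutual exclusion for n threads using only plain reads and writes."""

    def __init__(self, n):
        self.n = n
        self.choosing = [False] * n  # True while thread i is picking a ticket
        self.number = [0] * n        # 0 means "not waiting"

    def acquire(self, i):
        # Take a ticket one larger than any ticket currently visible.
        self.choosing[i] = True
        self.number[i] = 1 + max(self.number)
        self.choosing[i] = False
        for j in range(self.n):
            # Wait until thread j has finished picking its ticket.
            while self.choosing[j]:
                time.sleep(0)  # yield so other threads can make progress
            # Wait while j holds an earlier ticket (ties broken by thread id).
            while self.number[j] != 0 and \
                    (self.number[j], j) < (self.number[i], i):
                time.sleep(0)

    def release(self, i):
        self.number[i] = 0


def run_demo(num_threads=3, iterations=100):
    lock = BakeryLock(num_threads)
    shared = {"count": 0}

    def worker(i):
        for _ in range(iterations):
            lock.acquire(i)
            # Deliberately racy read-yield-write; only safe under the lock.
            current = shared["count"]
            time.sleep(0)
            shared["count"] = current + 1
            lock.release(i)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["count"]


print(run_demo())
```

If the lock works, the final count equals threads times iterations even though the increment itself is deliberately split into a read and a later write.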
Anyway, there's some fascinating stuff on his page full of papers. The link again: http://research.microsoft.com/en-us/um/people/lamport/pubs/pubs.html
I think part of the problem is that the axes aren't linear. If you know the problem you're trying to tackle a priori, you can tackle it with multiple orders of magnitude greater efficiency. For a fully specified, unchanging problem, I'd expect 3 orders of magnitude or better in most spaces, because you'd build exactly the hardware you need, and strip away all the hardware that supports unneeded programmability: you build a hardwired ASIC. Even in the programmable space, spending a bit of effort matching your problem to your processor can bring huge gains in efficiency, at least 5x. Also, consider that efficiency isn't just run time, but rather a function of power, performance, and cost.
The algorithms that run on a hearing aid would drain the hearing aid's battery before they were even fully loaded if you tried to run them on a typical desktop processor. But, they're baked down to a hyperefficient DSP or ASIC that's tuned specifically for the problem.
You cite a SPEC benchmark that runs faster on an A7 than an A15. Is that in clocks or wall-clock time? I suspect it's dominated by pointer dereferences, such as a linked list traversal. Load-to-use latency (which isn't a function of cache organization, but rather pipeline depth) becomes a dominant term for those workloads.
Backing up a bit: My problem with your thesis is that you assume there's a "best GPP" and then seek to prove there's no one processor that could possibly be that on the basis that across random applications, the winner varies. Your argument seems to be, at the limit: "if you don't tell me your application ahead of time, I can't pick a best processor, so therefore there's no general purpose processors."
It's the other way around. There's a cluster of processors that are OK at a range of random tasks. They're distinguished from special-purpose processors by the fact that the special-purpose processor performs at least 5x (and likely orders of magnitude in some cases) better than the average for the cluster. That's true even if some of the processors in that cluster are 2x more efficient than the others. A processor is a GPP if there are few or no problems for which it's orders of magnitude more efficient than its cohort. 2x is nothing to sneeze at, but a specialized processor should reach much higher. 5x at a minimum.
And please note I'm mentioning efficiency. It's not raw cycles or even wall clock time. Maybe a better measure is "energy per function", or "energy per function per dollar." (Although the latter is a bit dubious, as you buy the hardware once, but you use it many, many times. Lifetime costs are best approximated by energy costs over the lifetime of the device, if you're doing significant compute.)
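To make the 5x rule of thumb concrete, here's a rough Python sketch that compares "energy per function" against a cohort average. The wattages and timings are numbers I made up purely for illustration (they're not measurements of any real part), and the helper names are hypothetical too.

```python
def energy_per_function(power_watts, seconds_per_function):
    """Joules spent to complete one unit of work."""
    return power_watts * seconds_per_function


def efficiency_ratio(candidate_joules, cohort_joules):
    """How many times less energy the candidate uses than the cohort average."""
    cohort_average = sum(cohort_joules) / len(cohort_joules)
    return cohort_average / candidate_joules


def looks_specialized(ratio, threshold=5.0):
    """Rule of thumb from above: 5x better than the cohort, at a minimum."""
    return ratio >= threshold


# Made-up cohort of general-purpose parts: (watts, seconds per function).
cohort = [energy_per_function(10.0, 0.002),
          energy_per_function(20.0, 0.0015),
          energy_per_function(5.0, 0.005)]

# Made-up tuned DSP: slower than some of them, but far lower power.
dsp = energy_per_function(0.5, 0.004)

dsp_ratio = efficiency_ratio(dsp, cohort)
gpp_ratio = efficiency_ratio(cohort[0], cohort)
print(round(dsp_ratio, 2), looks_specialized(dsp_ratio))
print(round(gpp_ratio, 2), looks_specialized(gpp_ratio))
```

The best part in the cohort beats the average by a modest factor and stays "general purpose"; the tuned part clears the 5x bar even though it isn't the fastest in wall-clock terms.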
You mention GPUs. Sure, GPUs provide cheap FLOPs, and they can even start to run arbitrary C programs. But what percentage of those FLOPs get utilized when running random programs? You might get a 4x speedup offloading some algorithm to your video card, but is that a win when your video card's raw compute power is 100x your host CPU's? Would you buy a Windows machine powered only by a GPU, running everything from your statistical regression to your web browser?
(I may exaggerate, but only slightly.)
To me, "general purpose" means, "I run the compiler, and for the most part, I get what I get. If there's some hotspots, maybe I can tune for this specific architecture. Most of the time, I don't worry." Specialized means "by selecting this processor for this task, I know up front I need to spend time optimizing the implementation of the task to this processor."
Perhaps the qualm is that really that's more a function of the application than the processor. OK. I can buy that. But, when you look across the space of processors that get deployed in that way, you'll see that most processors end up on one side or the other of that line, and few are on the fence. You find very few DSPs and GPUs asked to run Linux or Windows kernels and applications (the core code, not the stuff they compile to be offloaded, say, in a shader language). You find some number of x86s asked to run signal processing applications, but only where they can afford the cooling.
Due to self-interference, the light bumps into itself on the way out, and subsequently can't get out. At least at my limited level of understanding, it's the wave-like nature of light at play here.
I imagine at some point, the trapped photons all get absorbed and the original energy dissipates as heat.
In the future, instead of complaining about letting the blue smoke out, we'll complain about letting the blue light out.
FWIW, I technically had some C++ before Perl, but not enough that I count it.
Re: Perl: I'm much the same way with Perl. I use Perl for lots of quickie projects. Great for anything I'm only going to spend at most a couple days developing.
Perl's also great for much larger projects, where runtime performance isn't absolutely critical but flexibility and development ease are paramount. We actually have some fairly significant projects at work that are written in Perl.
One of them embeds Perl in another language as a metaprogramming language. I couldn't imagine trying to write that in C++ without a dedicated team. But, a set of hardware designers are effectively maintaining the tool in the background thanks to it being written in Perl.
I imagine the result resembles INTERCAL.
As for which language is your gateway language, it probably depends on what era you started programming in, too. My path was Microsoft BASIC => Assembly => Turbo Pascal => C => Perl => C++11, with some shell scripting and other goodies around the fringes. I've probably written more C than anything, but C++11 rules my future. Turbo Pascal was my short-lived gateway to C, where I then spent most of my career. I wrote some truly regrettable neophyte-programmer code in C there at the beginning, so really C was where I grew from a college-aged hacker to someone who can actually program. Now guess how old I am. ;-)
I guess for an analysis like this, you really need to limit yourself to people who consider themselves competent programmers. Those VB macro whizzes in accounting likely consider themselves accountants, not programmers. Likewise for the physicist with a pile of creaky MATLAB models.
BTW, I have to agree with you 100% on make and bash. I consider myself above average on make as compared to my coworkers, but that's an extremely low bar. And while I've done some crazy stuff in bash in the past, these days I'll hop over to Perl for anything more than 10 to 20 lines, especially if I find too much 'sed' showing up or find myself wanting an actual data structure.
I finally brought up the PDF. It appears the authors consider C++ weakly typed because it allows type-casting between, say, pointers and integers.
While this is strictly true, I find myself avoiding such things whenever possible. Main exception: When talking directly to hardware, it's often quite necessary to treat pointers as integers and vice versa.
I guess to fairly evaluate a language like C++, you need to categorize programs based on how the language was used in the program. If you stick to standard containers and standard algorithms, eschewing casting magic except as needed (and using runtime-checked casts in the few places they are needed), your program is very different than one that, say, uses union-magic and type punning and so on every chance it gets. (I've written both types of programs... again, FORTRAN in any language.)
One of my more recent projects was written ground-up in C++11. It relies on type safety, standard containers, standard algorithms, smart pointers (shared_ptr, unique_ptr) fairly heavily. It's been quite a different experience to program vs. my years of C programming. Way fewer dangling pointers, use-after-free errors, off-by-one looping errors, etc. But, the paper lumps both languages into the same bucket. That hardly seems fair.
BTW, this ACM Queue article was linked from the blog post I linked above. It's another good, somewhat relevant read, IMHO. It makes largely the same point, though: It's more the programmer than the language.
I wonder if you can do an analysis of code bases across languages for the same team? I regularly write significant amounts of C++ (these days, C++11), Perl and assembly language. Those are three rather different languages, with strong, weak and largely non-existent type systems, respectively.
Of course, all three languages also open themselves to a wide range of programming styles, and I imagine if you picked any other set of languages you could make a similar statement. But if you measure the same programmers programming across them (assuming a reasonably high level of proficiency in all of them), then perhaps you can determine what portion of the effect is due to the programmer vs. due to the language.
After all, Real Programmers can write FORTRAN in any language.
Remove the / from the end of the link and it works. Annoying.
And that helps this how? The counterfeits aren't mask copies, but rather microcontrollers programmed to emulate FTDI chips. This isn't a "ghost shift" problem like many counterfeit consumer goods.
Vigilantism, in other words. It may fulfill a sense of personal justice, but that doesn't make it legal.
As a matter of fact, I'm releasing a product soon that falls exactly into this category. I specifically chose to keep the VID/PID unmodified, as the product has a mode where it needs to look like an ordinary serial port.
Sure, I could get modified OSX, Windows and Linux drivers to teach it about a new PID so that ordinary comms software works, but it's far simpler and less risky for a small guy like me (we're talking a few hundred units total) to just go with the flow and make it work with the generic drivers everyone likely already has.
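You're technically correct that the chip hasn't been physically damaged. However, it's effectively dead, and FTDI's EULA revision makes it clear that they intend to render non-functional any clones they detect:
Use of the Software as a driver for, or installation of the Software onto, a component that is not a Genuine FTDI Component, including without limitation counterfeit components, MAY IRRETRIEVABLY DAMAGE THAT COMPONENT.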
I recently built a big pile of boards with an FTDI USB chip on them, as part of my retrogaming hobby. I bought from a reputable source, I think. But if it turns out that I got an illegitimate batch of FTDI chips, I now own a pile of bricks until I pay to get them reworked. I don't know yet, since I haven't tested them in Windows with the latest driver.
Counterfeiting harms the original producer of the chips, and this extends the harm to OEMs that use the chips (who may or may not be innocent), and their customers (who most certainly are innocent).
I can't see how this is a good thing for anyone.
As someone said recently at work: "Deposits to the 'trust bank' are always small. Withdrawals are always large." In other words, it takes years to build trust, but you can obliterate it in seconds. FTDI may have done just that.
Ok, so I wasn't the only one.
Ah, apparently he auctioned off his fully commented disassembly
Carl Mueller, Jr. discovered these when he reverse engineered Donkey Kong a few years ago. He implemented them in his clone for the Intellivision, also. I believe he had blog posts about these, but I can't find them.
Seems to be an exclusive or. "Ron Paul" or "somebody sane", which implies Ron Paul isn't in the sane set.
My work laptop has 4GB of RAM on it and Windows 7 and it runs just fine. The only thing that slows it down is when the corporate-mandated management scripts run and start pegging the hard drive with virus scans, audits and the like. More RAM wouldn't help that. Switching to an SSD did.
According to Resource Monitor, I'm using about 3GB, with 850MB of that used as cache. A bit over 1GB of that is Firefox.
So, yeah, I could see 1GB really sucking when used with a modern web browser and many tabs open (like I do). 4GB, though, hasn't really held me back much.
Apparently, at least part of Vista's memory woes stemmed from the poorly tuned "SuperFetch" feature, which would aggressively try to pre-cache data in RAM. Its appetite was apparently too large. It apparently also didn't manage its disk buffers very well. (This is all third or fourth hand knowledge and so could be shaky. I've never run Vista myself. If someone has more details, pipe up!)