And you think printf() and strtol() are major bottlenecks worth dedicated silicon area why?
Modern CPUs already have many accelerators for high-end functions, such as numerical computations, cryptography, and the all-important memcpy. (Memory copies are a traditional bottleneck, and general enough that they can be easily offloaded.) They come in two forms: specialized SIMD/vector instruction sets, and dedicated blocks for high-level functions that take multiple microseconds. An example of the former is the SIMD-oriented AVX instruction set found on modern x86 chips. As an example of the latter, chips aimed at high-end signal processing often have discrete blocks such as FFT accelerators. Others aimed at network tasks (especially deep packet inspection) have regular expression engines.
The problem with accelerator blocks is that they do take up area. And if they're powered up, they leak. Leakage current is a significant factor in modern designs. To get faster transistors, you need to drive their threshold voltage down. As you lower the threshold voltage, their leakage current goes up exponentially. So, that circuit better be bringing a lot of bang for the buck if it's going to be sitting there taking up space and leaking.
Another issue with dedicating area to fixed functions is the impact it has on distance between functions on the die. In the Old Days, you could get anywhere on the die in a single clock cycle. With modern designs and modern clock rates, cross-die communication is slow, taking many, many cycles. So, when you plop down your custom accelerator, you have to figure out where to put it. Do you put it right in the middle of the rest of the computational units, slowing down the communication between their functions (either lowering clock rate or increasing cycle counts), or do you put it on the other side of the cache, meaning it takes several cycles to send it a request and several cycles to see the result?
This is why many custom accelerator blocks out there today focus on meaty workloads. A large FFT still takes a good bit of time to execute, and there's usually other work the main CPU can do while it executes. Thus, the communication overhead doesn't tank your performance. printf(), on the other hand, generally shows up right in the middle of a bunch of other serial steps. You can't overlap that with anything. Hauling off to a printf() accelerator block generally would make zero sense. If you're really spending that much time in printf(), you're better off rewriting the code to use a less general facility.
A final issue with dedicated hardware is that you can't patch it. Someone finds a bug in your printf() and you're back to using a library version. I could go on, but I think I've made my point.
That's true for active power (V^2/R). For leakage power, it's even worse. That looks closer to exponential. I've seen chips for which leakage accounted for close to half the power budget.
Supposedly FinFET/Tri-gate will help dramatically with leakage. We'll see.
I only enter mine into two, but both are crap.
The crappier of the two, amazingly, was not written in-house. We apparently bought that turd! (A site-customized/bastardized version of SumTotal Unified Workforce Interface (assuming I found a link to the correct turd), in case you're curious. It's non-modular!)
Could be. Most of the links I run into trouble with are from generic news aggregators. It isn't like I'm surfing the underworld or anything. I guess everybody's out for a buck these days, though.
Acronyms Can Really be Obnoxious Names, You Mean?
Maybe it's AT&T's network, or maybe my phone (BB10), but the videos often don't load quickly enough for me to notice them until it's too late. My only hint is that the browser gets strangely unresponsive, and then 5 seconds later it pops over to full screen nonsense.
I've been known to just kill the browser app outright when that happens, as it's quicker than trying to get the video player to quit.
I freely admit that some of the trouble may be phone specific. Still, auto-play videos suck.
There are a couple of popular news sites that seem to have moved to HTML5 videos that don't need a Flash plugin. I don't know how to block their videos on my phone. Turning off Flash doesn't help, since it isn't involved.
The browser does have a switch between 'mobile' mode, which gives me a turn-of-the-century web browsing experience (not what I want), and 'desktop' mode, which is usually (but not always) much better.
Unfortunately, there isn't a way to determine a site is sleazy prior to clicking on a link.
Oh, and give me a way to say "Never play a video under any circumstances, unless I explicitly say 'play this video.'" KTHXBYE.
I'm not on Verizon, nor am I on an unlimited plan. Still, I seem to hit my bandwidth cap more regularly these days. What seems to kill my utilization these days are websites with auto-play videos that I can't kill simply by blocking Flash.
What's really annoying is that the videos load in the background, and on a few occasions, have started playing after I've already locked the display and set my phone down. I only notice them because my phone starts making noise (when I don't have it set to 'silent'). It kills my battery and eats the bits I paid for on the assumption I'd be using them for things I actually wanted.
I honestly don't have a problem with throttling actual abusers. But, modern website design seems to make "abusers" out of more of us than there otherwise would be.
For the unlimited crowd, perhaps there should be tiers there, also. How about two levels? The lower tier would be "no overage fees" unlimited, meaning you don't get random dings for going over arbitrary caps, but you might get throttled occasionally. Rather than a hard cap, there's a soft limit. The upper tier would be "no limits, no throttling," meaning you could stream all the video and download all the torrents you want, but you pay a significantly higher fee for it. I'd happily sign up for the former service just to avoid the fees associated with the occasional data-heavy month. Folks who want to treat their phone as a cable-less cable modem can pay a few bucks more to avoid the throttle.
I think the problem currently is that 95%+ fall into the first group, and the remaining 5% or fewer really need a different class of service. The current "unlimited" label doesn't really make a sufficient distinction between the two.
Of course, the cynical would point out that such a tiering system would open itself to a whole new brand of marketing abuse...
Isn't that what A/B testing is all about?
Well, VIM and a bunch of XTerms.
Also, there's a semantic looseness as well that bothers me. The proposed solution doesn't really require changing the speed of light in a vacuum. Rather, it points out that photons will undergo certain interactions which mean that light as a bulk phenomenon will appear to go slower than the maximum speed light can travel in a vacuum because of those other interactions.
When computing relativistic effects, such as Lorentz contraction, etc., the upper speed (not including all those interactions) still remains the limit, at least as I understand it.
Rhino Records has reissued several on CDs. Also, Firesign Theatre has put out a few new albums in the last ~15 years, including Give Me Immortality or Give Me Death, Boom Dot Bust, and The Bride of Firesign.
"Hey Paolo! He broke the President!"
I remember many years ago reading an article (probably in Wired; these days, it'd be a blog post) where someone described walking around EPCOT Center while listening to this exact album. Sounds like quite a trip, really.
And then there's this article from several years ago that's also fitting. Apparently Disney was working on their version of the Holy-Grams too.
Firesign Theatre was definitely excellent stuff. "I'm Arty Choke, and we're just a joke. So it's back to the shadows again..."
Oy, is that how they're selling it? As if none of these features existed before Apple did it?
"Swift code is transformed into optimized native code"
As is any other language that passes through an optimizing compiler that outputs native code. They crowed about a 30% speedup above, which in my experience is sometimes achievable just by tweaking your compiler flags.
Hasn't made mine hurt.
You realize, of course, that a 26" 1920x1080 monitor is only 85 DPI, so the same font size (in pixels) on a 26" 1920x1080 monitor would actually look about 40% larger. And, you'd get more text on the screen to boot.
1280x720 at 120 DPI makes for a small screen: 10.7" x 6", which is approx 12" diagonal. Do you do all your coding on a subnotebook or MacBook Air or something?
Well, they don't magically get cheaper to build just by building more. They get cheaper to build as the manufacturer refines the process, improves the technology, and scales the production lines to amortize the fixed costs of a production facility over a larger number of vehicles. That is, it takes work to make them cheaper, above and beyond just making more.
As long as there's sufficient demand, producers will have enough reason to scale up the production and work to bring the production cost down. Eventually, if all goes well, this begins a virtuous cycle where decreased price increases demand, and increased demand drives further cost reduction and innovation.
This works great if there's enough demand to kick-start the process. Unfortunately, the price of EVs today is too high to drive sufficient demand. Hence the carrot-and-stick incentives to try to jumpstart the virtuous cycle. On the carrot side are tax breaks and government subsidies / loan guarantees. On the stick side are fleet-wide fuel economy standards, price caps and quotas.
Right now, it seems as if most traditional auto manufacturers treat their electric cars either as halo cars, or as tasks they're required to do by law/regulation/whatever but would rather not. I doubt anyone at GM is staking the quarterly numbers on Chevy Volt sales, for example, but it doesn't stop them advertising it. The only competition at this point, though, is positioning, posturing and establishing a brand. That is, competition on the marketing front. The market's still too small to have meaningful competition driving the product development. At least, that's how it seems to me.
Eventually they'll figure out how to bring the costs down. Meanwhile, the early adopters hopefully help build interest and therefore demand in the future. When that happens, I'd expect the real competition to start. You'll see Toyota or GM or someone get into the mega-battery business, like Tesla is currently. Or some other major, bold move like that.
In the meantime, the carrot-and-stick will push both the supply and demand curves to the right, elevating the total units shipped to a modest number until the market can sustain itself.
I did give a proviso for run-time alias checks in my comment above. Our compiler will generate a run-time check for that as well, with a small codesize and runtime cycle penalty. The FORTRAN equivalent doesn't need the alias check.
I'd expect ICC to be very aggressive, given that Intel has one of (if not the) largest paid, full-time compiler teams in the world.
So how about all of FORTRAN's other nifty features, such as array slices? To get the same functionality in C / C++, you have to put explicit strides and bounds everywhere, and sometimes checks to reverse loop directions. In FORTRAN, you can write things like "A(1:100,1:200:2) = B(101:300:2,51:150)", and the compiler is free to choose the best way to do it.
In C / C++, you leave it to the programmer to dictate the loop explicitly and hope the compiler can figure out what you're doing. In my experience, real world programmers get unusually creative with this task, creating awful code. If you write clearly enough and pay enough attention to compiler vectorization reports or other feedback, maybe the compiler + user eventually figures it out. Realistically, most programmers aren't that sophisticated. And even among the ones who are, not all have the time or inclination.
Looking back to my array slice example: Now take those slices across function call boundaries in both languages and see how much work the programmer has to do in each language...
My point is, the more work the programmer has to do to help the compiler succeed, the more evidence it's a poor fit for the problem domain. FORTRAN can make it easier for compiler writers because they start with a higher level specification of what the programmer is trying to achieve. FORTRAN also makes it easier for programmers because they stop at that higher level specification of what they're trying to achieve.
I'd argue that if you're programming in processor-specific intrinsics, you're not really programming in C++ any more. Standard C and C++ semantics for pointers and arrays really get in the way of autovectorization. So, you have to go to language extensions and kludges (like C99's restrict keyword) to throw the compiler a bone.
Sure, tricks such as whole-program analysis and run-time alias tests can help the compiler find the guarantees it needs to have in order to vectorize. The fact of the matter (and I heard this straight from the mouths of my employer's vectorizing compiler team members) is that stock FORTRAN is simply much friendlier than stock C/C++ for this due to those semantic differences.
Our compiler will autovectorize C code if you pass it enough hints such as minimum loop trip counts, pointer alignment, pointer aliasing guarantees (a.k.a. restrict) and so forth. Even then, there are limits to what it can do. We offer processor-specific intrinsics so you can vectorize the code yourself.
Once you start coding in vector intrinsics, you're taking the vectorization out of the compiler's hands and doing it yourself. Each of those intrinsics usually maps directly to an instruction or small sequence of instructions, so there's little left for the compiler to figure out. The compiler then just schedules and register-allocates the code, and handles the non-vector bits around the edges. Sure, you still compile with the C++ compiler, but the C++ compiler is no longer providing the vectorization: You are.
I wasn't actually thinking of DoS'ing, but I guess that's actually a valid concern. If a particular write pattern could crap a server, then you may have to worry about a user doing that to your server. I was just putting my "DV engineer" hat on, and trying to think of how I'd break an SSD in the minimum number of writes. It's the kind of analysis I'd hope the engineers that come up with lifetime specs use to give a bulletproof lifetime spec. For example, X years at YY MB/day even if you're writing like an a**hole. ;-)
I don't have a formalized attack against any particular drive, manufacturer or filesystem.
For a multi-user system, just a thought: Could you address it with quotas? If a given user can't write to more than X% of the filesystem, you can bound the "badness" of their behavior.
I'm not challenging the 30 day number, to be sure.
It's not entirely true that write amplification won't appear to speed up the rate at which an SSD erases sectors. SSDs generally have multiple independent flash banks, and each can process an erasure independently of the others. To maximize your erasure rate, you need a pattern of writes that triggers erasures across all banks as often as possible. Each bank will split its time among receiving data to write, committing write data to flash cells, and erasing flash cells. (My assumption is that a given bank can only be doing one of these operations at a time, which was certainly true for the flash devices I programmed.)
Consider a host sending a stream of writes as fast as it can send them. The writes will land on the drive as fast as the SSD controller can process them and direct them to flash cells. If there are any bottlenecks in that path, such as generating ECC codes and allocating physical blocks in the FTL, they will slow down the part of the duty cycle devoted to receiving and committing write data.
A "friendly" write stream would minimize the number of GC cycles the SSD performs, and thus the amount of write amplification that occurs. Thus, the total number of writes to the SSD media is at most slightly larger than what the PC sends, and the "receive-write" portion of the "receive-write-erase" cycle gets lengthened by whatever bottlenecks might be in the PC-controller-flash path. A "hostile" write stream triggers a larger number of GC cycles to migrate sectors. It seems reasonable to me that an on-board chip-to-chip block migration might be quite a bit faster than receiving data from the PC. For one thing, you don't necessarily need to recompute ECC. The block transfer itself could be handled by a dedicated DMA-like controller transferring between independent banks in parallel with other activity. So, generating more write data locally to the SSD could reduce the time spent in the receive-write portion of the receive-write-erase cycle, so you can spend a greater percentage of your time erasing as opposed to receiving or writing.
It seems a little counter-intuitive, but it's in some ways similar to getting a super-linear speedup on an SMP system, which is indeed possible with the right workload. How? By keeping more of the traffic local.
The main effect of write amplification, though, is on the SSD wear specs themselves, as I said. They're stated in terms of days/months/years of writes at a particular average write rate. So really, when you multiply that out, they're specified in terms of total writes from the PC. There's at least one flash endurance experiment out there showing that drives often exceed their rated maximum total writes by very large factors. One reason for that, I suspect, is that the testers aren't sending challenging enough write patterns to the drive to trigger worst case (in terms of bytes written, not wall-clock time) failure rates.
OK, I see that SandForce has on-the-fly compression tech, which I imagine would help more reasonable workloads. (Although, if your workload involves a lot of compressed video or images, that compression tech won't buy you anything.)
The point of my thought experiment, though, was how I would construct a maximally bad workload, and it's pretty easy to nullify compression with incompressible data.