I can't see which prior posts you're pulling that from, but it is out by at least a couple of orders of magnitude.
Here is a very old analysis of the process creation overhead (I can't find anything more modern after a bit of googling). By old, I'll point out that they call it Windows NT, and do their experiments on a P200-MMX. Even then, process creation took about 100th of a second. Remember that processors are now roughly 20-50x faster, depending on how much the architectural improvements affect a particular benchmark. The work in spawning a process is roughly constant, although Windows may be doing it more efficiently now. I'd guess an upper-bound of 0.2ms on a modern machine.
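The scaling argument can be written out as a back-of-envelope sketch. The 10ms baseline and the 20-50x range are the figures from this comment; the assumption that spawn time shrinks in direct proportion to processor speed is a simplification, and the function name is made up for illustration.

```python
def scaled_spawn_time_ms(baseline_ms, speedup):
    """Assume the work per spawn is constant, so time shrinks with speedup."""
    return baseline_ms / speedup

baseline_ms = 10.0  # "about 100th of a second" on the P200-MMX

# Evaluate both ends of the quoted 20-50x range.
estimates = {s: scaled_spawn_time_ms(baseline_ms, s) for s in (20, 50)}
for speedup, ms in estimates.items():
    print(f"{speedup}x faster -> {ms:.1f} ms per process")
```

Both ends of the range come out well under a millisecond; the 0.2ms guess corresponds to the 50x end.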
Ah that makes sense now. Language can be quite subtle at times. Thanks for the example.
In ten years of parallel processing research that is the first time that I've seen someone draw that distinction between faster and quicker.
It must be a .... very localised distinction that you're aware of.
It's strange that you appear to have counted to five without using the number two. Hint: read what the gp wrote under point two. Read what you wrote as point five. Understand the similarity. Consider the oxygen that you have stolen.
Compound interest all the way baby!
Yeah but at least you get local rate up there...
What about VisiCalc? You're not saying it was really written by John Titor in a previous trip, are you?
It is a lovely feature. When I was writing up my thesis a few years ago I found it to be completely invaluable. It meant that I could line up the rendered output of a thesis page on the screen so that the margins were cropped, and the view was zoomed in on the actual text. When I hit forward / backward the viewer would skip to the exact same point on the adjacent page. It saved a lot of time, and it's a feature that is missing from a lot of document viewers *cough* Adobe Acroread I'm looking at you.
Of course I'm talking about using ghostview which has supported this feature for decades. But I'm sure Microsoft don't believe it's a real patent either. They're just playing the numbers game.
Damn. This is going to undo my mods in this discussion.
Unfortunately your argument has a hole in it at this point. I was just about to mod your earlier posts insightful but I thought I'd correct you instead. If you write your JIT compiler in C then it takes the JavaScript as input and outputs native code. This glosses over the interactive nature of the JIT compiler but is largely true. The compiler does not execute code in the language that it is written in. It executes code in the language that it is emitting. The language being emitted is the native assembly language. So you do not have a C program that does the same thing at the same speed. You have a C program that generates code, which when run does the same thing. For this reason the output of the JIT can be faster than C, even if the JIT is written in C.
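A toy illustration of the distinction between the generator and the code it generates. This is only a sketch: it is Python emitting Python rather than C emitting native code, the polynomial task and all the names are invented for the example, and it makes no claim about speed. The point is just that the work is done by the emitted code, not by the program that emitted it.

```python
def interpret_poly(coeffs, x):
    """Generic evaluator: walks the coefficient list on every call."""
    result = 0
    for c in coeffs:
        result = result * x + c
    return result

def generate_poly(coeffs):
    """Emit source specialised to these coefficients, then compile it."""
    expr = "0"
    for c in coeffs:
        expr = f"({expr}) * x + {c!r}"
    source = f"def poly(x):\n    return {expr}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["poly"], source

coeffs = [3, 0, -2, 7]  # 3x^3 - 2x + 7
poly, source = generate_poly(coeffs)

# The generated function computes the same thing as the generic one,
# but contains no loop and no coefficient list: those were consumed
# by the generator, not by the code it produced.
assert poly(5) == interpret_poly(coeffs, 5)
print(source)
```

The generator only runs once. Everything after that executes the emitted function, which is why the quality of the emitted code matters more than the language the generator was written in.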
C++ is not a multi-paradigm language. You are confusing the meaning of procedural and imperative. All OO languages are imperative (based on state mutations). The organisation of those state mutations is OO rather than procedural like C. This is quite a weak difference. They are both within the same imperative paradigm: the difference is merely how code is organised on the medium scale, and how namespaces are split.
It is often argued that the templates in C++ form a functional language. This is partially true. Template programming can only be performed at compile-time, not at run-time so there are many programs that cannot be written this way. So it is not possible to write C++ programs in a functional style - it is only true that the type system can be abused to write some programs in a style that looks a bit functional.
It's guaranteed that I'll only see a sublime work of genius like this on a day without mod-points. Keep up the good work.
It seems strange that they've chosen a model who never blinks. It kind of ruins the effect somewhat. The other comments about the mouth being slightly off and missing the face seem more noticeable on the video.
The talking head has a (very) slightly different shape of face to the actress. I wonder if their approach is having difficulty mapping from one face shape to another?
My hobby:
I like to post an xkcd link into every story I come across...
It's been quite a few years (err, back in 2001 if memory serves) since I've used Microsoft products. Back then *only* two or three hours of downtime per quarter would have been a dream. How are they for reliability these days? I hear the OS side stays up a lot better than Win2000 used to. What about the Office suite? Does it still crash every couple of hours and hose work?
This is not strictly true. Copyright controls the distribution of copies, nothing more, nothing less. If you buy a book, then you have the right to read it because you can do whatever you want with your own property. Copyright doesn't come into it.
If companies are assuming that loading software into RAM allows them to impose a license with terms on the user then a) they've never had legal advice, and b) they're stupid. Copyright only restricts your ability to distribute copies, not to make them. If I want, I can buy a copyrighted book and make hundreds of copies at home. If I ever try to sell them, give them away or otherwise distribute them then I breach copyright. But the making of copies is fine. (In real life convincing a judge / jury that the printing press was for personal use would be difficult).
While copyright doesn't enter into the resale of OS-X I would be very surprised if Apple win a case based on provisions in their EULA. All EULAs are a bluff; there is no legal basis for them to be binding. Another poster further up made the interesting comment that Apple are suing over redistribution of software patches, rather than the original OS. That would be an interesting angle, and one for which copyright law would be applicable.
Yet it does explain the entire "raw" food movement
The GPU implementation barely worked ... and it took about 12 months to get it into that state. The original idea was simple enough: can we do multi-precision integer algorithms on a GPU? We went for the 7800GTX; theoretically it should have been 35x quicker than the CPU reference. In practice almost everything went against us:
1. Driver issues - to make the approach work we needed to generate hundreds of specialised shaders. There is a huge latency issue in the driver for doing the context switch that doesn't relate to the costs in hardware.
2. Unit sizes - the shader hardware hides its huge startup times pretty well when you are doing millions of iterations of a single small shader. We were shading hundreds of quads with each shader then switching. In combination with point (1) this kills performance.
3. Program length penalties. Nvidia have really optimised for tiny shaders. Longer programs have a super-linear slowdown.
4. The Cg compiler was shit. You can't turn off unsafe floating point optimisations that break integer code...
The list goes on, but you get the idea. In the end we managed to squeak past the CPU implementation and go about 2x as quick. Only for our reviewers to complain that they would have written better CPU assembly... Perhaps things have changed with CUDA, but Nvidia certainly needed to improve their game on the 7 series. The final issue would have been the card memory latency but we spent all our time just trying to win on the compute cost. For exponentiation with large key sizes there is a lot more computing than memory access.
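To show why exponentiation is dominated by compute, here is a minimal square-and-multiply sketch that counts the modular multiplications. It is not the shader code from the project: it leans on Python's built-in big integers, and the function name and the small test values are chosen purely for illustration.

```python
def modexp(base, exponent, modulus):
    """Left-to-right square-and-multiply.

    Returns (result, number of modular multiplications performed).
    """
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:
        # Square once for every bit of the exponent.
        result = (result * result) % modulus
        mults += 1
        if bit == "1":
            # Multiply by the base only for the set bits.
            result = (result * base) % modulus
            mults += 1
    return result, mults

value, mults = modexp(4, 13, 497)
print(value, mults)
```

An n-bit exponent costs between n and 2n big multiplications, and each of those does roughly quadratic work in the operand length (with schoolbook multiplication) while only reading a linear amount of data. That ratio is why memory latency was the last thing on the list.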
You noticed that my requirements don't line up with my budget :) Ideally I would take robust to mean mirroring, at different colos, each on a different backbone. When I was costing up what I was prepared to spend that dropped to RAID-5 on a single box. Such are the constraints of my wallet :) The $50/250GB/yr estimate was based on similar costs to what you said, obviously with a factor of two from avoiding replication. The profit margin only considered the running (bandwidth/colo) costs, and I think it took 18-24 months to break even on hardware. That was when I decided it wasn't worth the effort for me, although I'm still interested in paying someone else for doing the work.
I'll keep an eye on your website, because if you go into that market with prices in that region you have no competition at all. The current wave of backup providers / media sharers give tiny chunks of storage for at least 10x the cost. You've picked a good market to get into. My personal guess is that 5 years from now everyone will have some sort of "cloud storage", to steal last year's buzzword. The main driver is that people are switching to laptops from desktops, and nobody wants to carry around that kind of storage. Fast broadband + fast office connections make it practical to carry a virtual disk wherever you go. The privacy angle is key for a lot of people, but if that works I think it is a service that has monetary value.
I could guess what your plans are just by considering what I would do: dump the RAID hardware and move to JBOD. Implement the replication and redundancy in software by going down the same route as Google did with GFS. Slash the hardware costs as much as possible and invest in the service that you can run over the top of a bunch of cheap machines spread through different colo sites.
Although the theory is nice it assumes that you shut down the Nano after it finishes the task. Because the idle drain on the Nano is so close to the load drain on the Atom the theory breaks down. If the Nano machine sits idling for the remaining time, that is still power consumption they did not include in the comparison.
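A sketch of the accounting being argued about. Every number below is hypothetical and picked only to reproduce the shape of the argument (a Nano idle draw close to the Atom load draw); none of them are measurements from the article, and the function name is invented.

```python
def energy_joules(load_watts, idle_watts, busy_seconds, window_seconds):
    """Energy over a fixed window: busy at load power, then idle."""
    busy = load_watts * busy_seconds
    idle = idle_watts * (window_seconds - busy_seconds)
    return busy + idle

WINDOW = 100  # seconds; the Atom needs the whole window (hypothetical)

# Hypothetical Nano: 20W under load, 11W idle, finishes in 50s.
nano_task_only = energy_joules(20, 11, 50, 50)
nano_with_idle = energy_joules(20, 11, 50, WINDOW)

# Hypothetical Atom: 12W under load, busy for the full window.
atom_total = energy_joules(12, 8, WINDOW, WINDOW)

print(nano_task_only, nano_with_idle, atom_total)
```

With these made-up figures the Nano wins if you stop the clock when the task ends, and loses once the idle time is charged to it as well.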
Of course I'm still trying to get my head round the concept that I should buy Via for performance and Intel for power consumption...
I can see that we've been arguing at cross purposes somewhat. My original claim that the Padlock implementation was shit was not an indication that hardware acceleration for crypto is a bad idea. Perhaps I should mention that I work as a post-doc in a crypto group, I published a paper last year on accelerating RSA on graphics cards, and there are guys sitting a few desks away who design instruction set extensions.
I know they're a good idea, and I agree about your target application being an interesting one. In fact I would say that it will become a vital one within five years. My point was only that Padlock could be much, much better than it is. And yes, your analysis is correct that the relative size of the design team at Via vs the one at Intel is responsible for this.
Your business looks interesting. I was looking into doing something similar last year: essentially I want to store 2-3TB of data somewhere. I want it to be robust and highly available, with fast download speeds and I don't want to pay the earth for it. I would prefer if the storage was an encrypted block device at the far end that I can mount and decrypt locally.
When I ran the numbers it looked like I could turn a 100% profit over my running costs, offering 250GB slices for (and my memory may be way off here) $50 a year. Your site looks interesting, but would you offer a service like the above for a reasonable price?
The figures that I found googling were about 45MB/sec for openssl. If it hits 511MB/sec then yes that is much more impressive, but that is ten times higher than the top few results in google suggested. The unpublished figures that I mentioned will be released in a few months: they're not mine but I can't really steal another researcher's thunder, as it were. Even ignoring those results there are published results for Crypto++ that show 20 clock cycles per byte for AES. That's 150MB/sec per core.
Maybe I'm missing something in your figures but how is 511MB/sec a 6-12x speedup over Core2? You can't just multiply out the clock-speed because the length of the longest path in the AES circuit won't allow the circuit to scale to 3GHz.
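The arithmetic behind those figures, as a short sketch. The 20 cycles per byte and 511MB/sec come from this comment; the 3GHz clock is the Core2 speed used elsewhere in the thread and is an assumption here, and MB is taken as 10^6 bytes.

```python
def throughput_mb_per_s(clock_hz, cycles_per_byte):
    """Bytes per second = cycles per second / cycles per byte."""
    return clock_hz / cycles_per_byte / 1e6

core2_single_core = throughput_mb_per_s(3e9, 20)
padlock_claimed = 511.0

print(f"Core2, one core: {core2_single_core:.0f} MB/s")
print(f"Padlock / Core2: {padlock_claimed / core2_single_core:.1f}x")
```

On those numbers the ratio against a single core is a little over 3x, which is where the question about a 6-12x speedup comes from.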
The argument is quite simple: a "good" result for special purpose hardware is at least 10x faster than a general purpose circuit. This is roughly the order of speed-up in software specialisation, and for a lot of problems the hardware speedup is much higher.
The Via can produce 2x the throughput of a single core on the Core2. Applying a block cypher is inherently parallel so we can assume that using both cores will match the performance.
Sure, your other arguments are nice, like the Via uses less power etc etc. But the point remains: for a processor that incorporates a special-purpose hardware circuit just to do AES the performance is lacklustre compared to the fastest general purpose architecture.
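Yikes, ranting, raving and selective quoting. You do go for the whole troll, don't you? I'm not surprised you didn't understand my point, as you quote all of my post but the part that explained it:
"Special purpose hardware (like Padlock) is always more efficient than executing a program on general purpose hardware....overhead has been removed and the execution has been optimised for that specific case"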
So no, it is not pathetic that a 3GHz general purpose processor can match the special purpose extensions on the C7. Given that the achievable speedup is much larger than the ratio in clock speeds (let alone the extra work the Core2 is doing) it shows that the VIA performance is shit.
Ah, I was talking about the 2Gb/s claim on Via's page. I would expect that to be 250MB/s of throughput for AES. The overhead of openssl is another matter although I would expect it to be less of a factor on a Core2. The single core of a Core2 should exceed 250MB/s, although not by much.
A single core on a 3GHz Core2 can match the performance of Padlock. I can't provide a link as the figures are unpublished but it's not particularly hard to work out how.
Offloading is a good idea for any heavily used operation. Special purpose hardware (like Padlock) is always more efficient than executing a program on general purpose hardware. There is nothing magical about this - overhead has been removed and the execution has been optimised for that specific case.
The fact that the Core2 can keep up says volumes about the poor implementation of the C7.
This has no impact on cryptography whatsoever. Symmetric encryption is not a problem that quantum computing is known to help with much: the best generic attack, Grover's search, gives only a quadratic speedup, which a longer key cancels out. A *large* QC would affect the use of public key algorithms as both factoring and discrete logs can be sped up.
However:
1. 28 is not a large number. Current asymmetric key sizes would take thousands of qubits.
2. This is not a "quantum computer". Shor's algorithm requires entangled qubits that stay coherent during the length of the algorithm. The 28 qubits in this system are not entangled so it is useless for the (almost) only proven quantum algorithm.
Summary:
Lots of hype, no practical benefits.
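Point 1 can be given a rough number. The sketch below uses the 2n + 3 qubit count from one published circuit for Shor's algorithm (Beauregard's) as the estimate. That is a count of ideal logical qubits, it ignores error correction entirely, and the choice of key sizes is an assumption about what "current" means.

```python
def shor_logical_qubits(modulus_bits):
    """Logical qubits to factor an n-bit modulus, using the 2n + 3 circuit."""
    return 2 * modulus_bits + 3

AVAILABLE = 28  # qubits claimed for the system in the story

needed = {bits: shor_logical_qubits(bits) for bits in (1024, 2048)}
for bits, qubits in needed.items():
    print(f"RSA-{bits}: about {qubits} qubits, "
          f"{qubits / AVAILABLE:.0f}x what is claimed here")
```

Even on this optimistic count the gap is a couple of orders of magnitude, before asking whether the qubits are entangled at all.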
How is their study either unethical, or illegal as you have claimed? Ignoring your hypothetical marijuana study as completely irrelevant you seem to have missed the key points in what they did.
They did not run a "wiretap" as claimed. They monitored the traffic at a Tor node that they controlled. People willingly sent them the information that was supposed to be private.
Their study is a scientific investigation into whether the privacy claims of Tor can be sustained. They cannot - the system is open to abuse. This is an entirely ethical study into the claims made by Tor, and furthermore this is exactly how good empirical science should work.