Not only will you get very poor scaling (because of Amdahl's law) unless all you do is call quicksort, but you can also get worse performance if your libraries make bad decisions about when to run in parallel and when not to.
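To put a number on the Amdahl's law point, here is a minimal sketch. The function name `amdahl_speedup` and the example fraction are my own illustration, not taken from any particular library: if a fraction p of the run time can be parallelized over n workers, the best possible speedup is 1 / ((1 - p) + p / n).

```c
#include <assert.h>

/* Amdahl's law: upper bound on speedup with n workers when a fraction p
   of the serial run time can be parallelized (0 <= p <= 1, n >= 1).
   amdahl_speedup is a hypothetical helper name used for illustration. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    /* The serial part (1 - p) stays fixed; only the p part shrinks with n. */
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.9, ten workers give only about 5.3x, and no number of workers gets past 10x. That is why parallelizing just the library calls tends to scale so badly.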
To get reasonable scaling, you need to parallelize at the highest level possible and design accordingly. I work a lot with an application called WRF, which has _hybrid_ MPI+OpenMP parallelization: a 2D domain decomposition parallelized with MPI, plus tiling inside each subdomain with OpenMP. This makes its computational part extremely scalable.
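Roughly, the OpenMP half of that hybrid scheme looks like the sketch below. This is not WRF's actual code (WRF is written in Fortran), and the function name `update_subdomain`, the tile size parameter, and the dummy per-cell update are all made up for illustration. The MPI side is left out: assume each rank already owns one `nx` by `ny` subdomain.

```c
/* One MPI rank's subdomain, stored row-major as nx * ny doubles.
   The rows are split into tiles and OpenMP threads share the tiles.
   Hypothetical sketch: the per-cell update is a placeholder for real physics. */
void update_subdomain(double *field, int nx, int ny, int tile)
{
    if (tile < 1)
        tile = 1;
    int ntiles = (ny + tile - 1) / tile; /* number of row tiles, rounded up */

    /* Parallelize over whole tiles, not over the innermost loop.
       If OpenMP is not enabled, the pragma is ignored and this runs serially. */
    #pragma omp parallel for schedule(static)
    for (int t = 0; t < ntiles; ++t) {
        int j0 = t * tile;
        int j1 = (j0 + tile < ny) ? j0 + tile : ny;
        for (int j = j0; j < j1; ++j)
            for (int i = 0; i < nx; ++i)
                field[j * nx + i] = 0.5 * field[j * nx + i] + 1.0;
    }
}
```

The point of the design is that each thread gets a big chunk of independent work per parallel region, so the threading overhead is paid once per time step and not once per small library call.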
Well, SSE matters a lot in HPC code. The Intel compiler has a quite decent vectorizer, which helps a lot. And writing vectorizable code is actually not that hard.
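As an example of what "vectorizable" tends to mean in practice, here is a minimal sketch of a loop most auto-vectorizers can handle: unit stride, no branches, no dependence between iterations, and `restrict` to promise the arrays do not overlap. The `saxpy` name follows the usual BLAS convention, but this is a hand-written illustration and not a library call. Whether a particular compiler actually emits SSE for it depends on the flags and the target.

```c
#include <stddef.h>

/* y = a * x + y over n elements.
   restrict tells the compiler x and y never alias, which removes the
   main obstacle to vectorizing a simple loop like this one. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Things that usually break vectorization are the opposite of the above: pointer aliasing the compiler cannot rule out, strided or indirect indexing, function calls inside the loop, and iterations that depend on earlier ones.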
Does anybody know how well this corresponds to the actual history?
Sure, just use pkill -15 kcalc