You may want to try adding the option -Qipo (if you have not done so already), since this can help C++ code that makes heavy use of templates (and MSVC does a -ip equivalent under -O2). If you are willing to contact me personally and provide me with some more details, I may be able to help you improve the performance of your application(s).
I regret to see that you are again trying to change the subject by presenting the results of an application without vectorizable loops (and at the very least you could have provided some more information on the compiler switches used and the target architecture). If you call someone's work a "joke", but are subsequently not willing to back up your claim due to time constraints, then I can only hope that other readers will take your claim for what it really was...
If you ever have real issues with vectorization, please do not hesitate to contact me directly.
Aart
Seriously, please try one of our latest versions! Version 7.0 has no problem with vectorizing this loop.
[C:/cmplr/temp] icl /Fa /QxW sl2.cpp
sl2.cpp(4) : (col. 1) remark: LOOP WAS VECTORIZED.
Due to the large constant, the compiler (in combination with some alignment optimizations) is also able to see that a so-called streaming store is useful to minimize cache pollution: ...
Back: movntps XMMWORD PTR [ebp+edx*4], xmm0
movntps XMMWORD PTR [ebp+edx*4+16], xmm0
add edx, 8
cmp edx, ecx
jb .B1.8 ...
How much speedup is actually obtained due to vectorization heavily depends on the context in which this fragment is used in your application.
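For readers following along, here is a rough sketch of what the listing above amounts to. The loop in sl2.cpp is not shown in this thread, so `fill_scalar`, `fill_streaming`, and the size `N` are hypothetical stand-ins, not the original code. The second function is a hand-written approximation, using SSE intrinsics, of the two `movntps` stores per iteration and the step of 8 seen in the generated assembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_set1_ps, _mm_stream_ps, _mm_sfence

// Assumed array size (not from the thread); must be a multiple of 8 here.
constexpr std::size_t N = 1 << 20;

// Hypothetical stand-in for the loop in sl2.cpp: a plain initialization
// loop with a large, constant trip count.
void fill_scalar(float* dst, float value) {
    for (std::size_t i = 0; i < N; ++i) {
        dst[i] = value;
    }
}

// Hand-written approximation of the vectorized code: two non-temporal
// 16-byte stores per iteration, index advanced by 8 floats.
void fill_streaming(float* dst, float value) {
    // movntps requires a 16-byte aligned destination.
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    const __m128 v = _mm_set1_ps(value);  // broadcast value to 4 lanes
    for (std::size_t i = 0; i < N; i += 8) {
        _mm_stream_ps(dst + i, v);      // bypasses the cache
        _mm_stream_ps(dst + i + 4, v);
    }
    _mm_sfence();  // order the streaming stores before later stores
}
```

Because streaming stores bypass the cache, they tend to pay off when the array is not read again soon. If the surrounding code reads the data right after writing it, the benefit may shrink or disappear, which is one way the context matters.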
Since you called Intel's vectorization a "joke", I was specifically interested in examples where the Intel compiler fails to vectorize loops.
...
Incidentally, your example contains one initialization loop that is nicely vectorized (so I rest my case):
=> icl sl.cpp
sl.cpp(28) : (col. 1) remark: LOOP WAS VECTORIZED.
I realize that this does not address your performance concerns (which we can discuss offline), but your example did not provide what was requested.
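To make the distinction concrete, here is a minimal sketch of a loop that a vectorizer can usually handle next to one it generally cannot. Neither function comes from the code discussed in this thread; `add_arrays` and `running_sum` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>

// Independent iterations: a[i] depends only on b[i] and c[i], so several
// elements can be computed at once. This assumes the compiler can rule
// out (or check at run time) that a overlaps b or c.
void add_arrays(float* a, const float* b, const float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = b[i] + c[i];
    }
}

// Loop-carried dependence: a[i] needs the a[i-1] written by the previous
// iteration, so the iterations cannot simply run side by side.
void running_sum(float* a, const float* b, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        a[i] = a[i - 1] + b[i];
    }
}
```

A loop shaped like the first one that still goes unvectorized would be the kind of example requested here. An application made up mostly of loops like the second says little about the vectorizer either way.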
No, it is "a", not "the", although seeing Intel's vectorization being called a "joke" became rather personal :-) As for any future plans, I am the wrong person to ask. I am just vectorizing loops here....
Would you mind sharing some examples of code where the Intel compiler misses obvious opportunities for vectorization? I find your claim rather strong, especially considering that you have not even tried version 6.0 yet (and version 7.0 is already out now). A recent article with programming guidelines for vectorizing compilers that you may find useful can be found at: http://www.cuj.com/articles/2003/0302/0302c/0302c.htm?topic=articles
Privately I maintain a web page with some more in-depth information on vectorization for SSE/SSE2. See: http://www.aartbik.com
--
Aart Bik, Senior Staff Engineer, Intel Corporation
email: aart.bik@intel.com