6.0 C++ SSE header
In February 2009 Krzysztof Jakubowski posted a complete C++ wrapper over the 32-bit floating-point and 32-bit integer SSE C intrinsics on a ray tracing Internet forum called ompf. It is called veclib and it is licensed under the very permissive zlib license. From my experiments with that implementation this guide was born.
Generally, when writing vector code in vector notation in C++, one can in my experience do so without any overhead by using expression templates. Expression templates avoid the unnecessary temporaries that naive operator-overloaded C++ implementations produce. One such implementation is a BSD-style licensed piece of code called cvalarray.
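To make the expression-template idea concrete, here is a minimal self-contained sketch (illustrative only; this is not cvalarray's actual implementation). The sum a + b + c builds a lightweight expression object instead of intermediate arrays, and the assignment evaluates the whole expression in a single element-wise loop:

```cpp
#include <cstddef>

// Minimal expression-template sketch (illustrative; not cvalarray's code).
// operator+ returns a lightweight Sum expression instead of computing a
// temporary array; the assignment operator then evaluates the whole
// expression tree in one element-wise loop.
template <class L, class R>
struct Sum {
    const L& l;
    const R& r;
    Sum(const L& l, const R& r) : l(l), r(r) {}
    float operator[](std::size_t i) const { return l[i] + r[i]; }
};

template <std::size_t N>
struct Vec {
    float data[N];
    float  operator[](std::size_t i) const { return data[i]; }
    float& operator[](std::size_t i)       { return data[i]; }
    template <class E>
    Vec& operator=(const E& e) {   // single evaluation loop, no temporaries
        for (std::size_t i = 0; i < N; ++i) data[i] = e[i];
        return *this;
    }
};

template <std::size_t N>
Sum<Vec<N>, Vec<N> > operator+(const Vec<N>& a, const Vec<N>& b) {
    return Sum<Vec<N>, Vec<N> >(a, b);
}

template <class L, class R, std::size_t N>
Sum<Sum<L, R>, Vec<N> > operator+(const Sum<L, R>& a, const Vec<N>& b) {
    return Sum<Sum<L, R>, Vec<N> >(a, b);
}
```

With these definitions d = a + b + c; compiles down to one loop over the elements, which is exactly what a hand-written scalar loop would do.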
What I advocate is using these two implementations together to get, as far as I know, the fastest and neatest freely available C++ implementation of basic 3D vectors and SSE types. Of course, for some specific higher level tasks there are better options, like Eigen for linear algebra. To get better notation than what veclib uses, I took the GLSL specification as a model and came up with the following for writing SSE code:
|C++ vector types with SSE for single precision (32-bit)|

vec2, vec3, vec4          floating point vector
vec2b, vec3b, vec4b       mask type for floating point vector
ivec2, ivec3, ivec4       signed integer vector
ivec2b, ivec3b, ivec4b    mask type for signed integer vector
mat4x2, mat4x3, mat4x4    4 column floating point matrix (SSE)
Any of these can also be prefixed with "p" for double precision. The following full header file implements part of these types:
#include "cvalarray.hpp"
#include "veclib.h"

typedef veclib::f32x4  vec4;
typedef veclib::f32x4b vec4b;
typedef veclib::i32x4  ivec4;
typedef veclib::i32x4b ivec4b;

typedef n_std::cvalarray<vec4, 2> mat4x2;
typedef n_std::cvalarray<vec4, 3> mat4x3;
typedef n_std::cvalarray<vec4, 4> mat4x4;

typedef n_std::cvalarray<float, 2>  vec2;
typedef n_std::cvalarray<float, 3>  vec3;
typedef n_std::cvalarray<double, 2> pvec2;
typedef n_std::cvalarray<double, 3> pvec3;
The idea here is that each of these has the exact same user interface. Brackets are used for element access, and arithmetic with operators works as it should. Easy to remember.
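To see why the operators cost nothing for basic arithmetic, here is a sketch of what sits behind a statement like vec4 r = a + b; (a hypothetical stand-in type for illustration, not veclib's actual source). The C++ operator is a one-line wrapper around the _mm_add_ps intrinsic, which the compiler turns into a single addps instruction after inlining:

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Hypothetical stand-in for veclib::f32x4, to show what the wrapper
// amounts to; veclib's real type offers much more.
struct f32x4_sketch {
    __m128 m;
    f32x4_sketch(float x, float y, float z, float w)
        : m(_mm_set_ps(w, z, y, x)) {}   // _mm_set_ps takes arguments high to low
    explicit f32x4_sketch(__m128 v) : m(v) {}
    float operator[](int i) const {      // element access for inspection
        float tmp[4];
        _mm_storeu_ps(tmp, m);
        return tmp[i];
    }
};

// The whole "vector add": one intrinsic, one instruction after inlining.
inline f32x4_sketch operator+(const f32x4_sketch& a, const f32x4_sketch& b) {
    return f32x4_sketch(_mm_add_ps(a.m, b.m));
}
```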
Documentation of veclib's interface can be found in veclib.h. For how the library can and should be used, refer to the algorithm examples.
Strengths / weaknesses of this approach
Code is easy to read and to type. Converting from scalar to vectorized form, or the other way around, can at best be a copy-paste with small modifications.
Proven real world viability and efficiency by this very guide.
Code is not tied to any particular low level implementation, including even the intrinsics or veclib. A few special functions, like a vectorized square root and a conditional move, do not, however, satisfy this criterion.
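As an example of such a special function, the conditional move has no direct C++ operator; on SSE it is the classic compare-and-mask idiom. A sketch with raw intrinsics (pre-SSE4.1; on newer CPUs blendvps does this in one instruction):

```cpp
#include <xmmintrin.h>  // SSE1 intrinsics

// Per-lane "mask ? a : b". The mask comes from a comparison intrinsic and
// holds all-ones or all-zeros in each 32-bit lane.
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

// Example use: a branchless per-lane minimum built from the idiom.
static inline __m128 min_ps_sketch(__m128 a, __m128 b) {
    return select_ps(_mm_cmplt_ps(a, b), a, b);  // a < b ? a : b, per lane
}
```

(For minimum specifically SSE already has _mm_min_ps; the idiom matters for conditions that have no dedicated instruction.)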
The biggest weakness is the lack of packed implementations of the basic mathematical functions that SSE does not provide at the hardware level. These include the trigonometric functions, the exponential function and basically everything except basic arithmetic and a square root. Free implementations of some of these on top of the intrinsics are available online. Maybe AMD's open source SSEPlus project could be of service here. Intel's commercial Math Kernel Library has many of them implemented.

One also has to be very careful when coding at this high level to not do just anything that is possible. Code can still give correct output but degrade or kill performance. Behind the C++ operators are the C intrinsics and SSE types, and a compiler does not produce optimal code for all usages. The algorithm examples in this guide document the possible severity of this problem and help to learn how to avoid it.
Veclib selects the correct implementations of the high level operations based on the version of SSE selected by compiler parameters. For GCC this is done with "-msseX", where X is the version, or "-march=native" for the best supported version. However, this means a single binary will support only the selected version of SSE. Ideally a single binary would select the SSE path at run time based on the detected CPU. Perhaps several binaries compiled for the different versions of SSE could be combined with CPU detection to give an application such support.
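One possible shape for such run-time selection is sketched below, using the GCC/Clang-specific __builtin_cpu_supports extension; the kernel names and stub bodies are hypothetical. The idea is to compile the hot kernel once per SSE level into differently named functions, then pick one through a function pointer at startup:

```cpp
// Hypothetical kernels. In a real build each would live in its own
// translation unit compiled with the matching -msse flag; here they are
// identical stubs so the sketch is self-contained.
static void transform_sse2(float* dst, const float* src, int n) {
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;
}
static void transform_sse41(float* dst, const float* src, int n) {
    for (int i = 0; i < n; ++i) dst[i] = src[i] * 2.0f;
}

typedef void (*transform_fn)(float* dst, const float* src, int n);

// Run-time dispatch: query the CPU and return the best supported path.
static transform_fn pick_transform() {
    __builtin_cpu_init();                 // required before the queries (GCC)
    if (__builtin_cpu_supports("sse4.1"))
        return transform_sse41;
    return transform_sse2;                // SSE2 is the baseline on x86-64
}
```

The query runs once; afterwards every call goes through the function pointer with no further branching on the CPU type.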
Veclib is unlikely to see much development in the future, as the author (Krzysztof Jakubowski) has stated he has discontinued the project that included veclib. This means a question mark over support for the current SSE4 extensions or the upcoming AVX. However, SSE3/SSSE3/SSE4 provided little extra over SSE2 for most floating point usage. Veclib is also just a thin wrapper over the intrinsics to begin with, and so is easy to modify. Furthermore, Intel states in its May 2008 AVX document: "Most apps written with intrinsics need only recompile".