7.0 Conclusions

All the tested algorithms gave plausible speedups when turned into SSE versions. However, four times the speed improvement over a scalar version isn't a conclusive proof that SSE is giving the kind of benefit it should. For example, the scalar implementation could just be plain inefficient.

Nevertheless, based on the benchmarks I have done, I argue that veclib is generating genuinely competitive SSE code when used properly. SSE can be coded like this in theory by anyone without knowing the name of a single SSE instruction or intrinsic function. All one needs to know is that arithmetic and comparisons work as they ought to, when using the 128-bit type vec4. This opens up way to use SSE for people who don't care about what SSE instructions are available or what are they named like. That means the vast majority of programmers and also practically all scientists coding programs for research in fields outside real-time graphics.

A sum up of speed improvements is presented in the following table:

 Speed benefits gained with SSE
Vector normalization 4.0x - 3.4x at large N of vectors
N-particle dynamics Approaching ~4.0x as N grows
Ray sphere intersection
3.1x - 2.6x (small N / large N)
Cone quad intersection
0.95x - 2.0x depending on the scenario

SSE for scientific computing

The problem with floating point SSE is that single precision is often not enough. Or even if it was it takes some very special considerations how and in what domain it should be used. Whereas double precision can often be used to avoid such numerical issues. Nothing can fix this except future possible version of SSE with higher SIMD width than two for double precision calculations.

Secondly, it takes a lot of time and thinking to vectorize most algorithms and that time is often better spent elsewhere. However, this might not be an issue for some larger projects.