3.0 Effective use of SSE

SSE versions from 2.0 up to 4.2, the latest at the time of writing, can process four single precision (32-bit) or two double precision (64-bit) floating point numbers at a time in a vectorized manner. If this precision is not enough, SSE will be of no use. Furthermore, for double precision floating point data the realistic potential speedup is less than 2x to begin with, since only two values fit into one register.
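As a minimal sketch of the register widths described above, the following C functions add four floats and two doubles with one packed instruction each. The function names add4_floats and add2_doubles are made up for illustration; the intrinsics are the standard ones from emmintrin.h, and unaligned loads are used so the sketch works with any pointer.

```c
#include <emmintrin.h>  /* SSE2 intrinsics; also provides the SSE (float) ones */

/* Adds four packed single precision values: out[i] = a[i] + b[i], i = 0..3. */
void add4_floats(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* four 32-bit floats in one register */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* one packed add for all four lanes */
}

/* Adds two packed double precision values: out[i] = a[i] + b[i], i = 0..1. */
void add2_doubles(const double *a, const double *b, double *out)
{
    __m128d va = _mm_loadu_pd(a);            /* only two 64-bit doubles fit */
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}
```

The same 128-bit register does half as much work per instruction in the double precision case, which is where the lower speedup ceiling comes from.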

Algorithms can be simply unsuitable for SIMD processing. The fewer single instruction, multiple data parallel parts there are, the less speedup there will be. The processing needs to be mostly coherent.
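To illustrate why coherence matters, here is a hedged sketch of what a per-element branch turns into with SSE. All four lanes execute the same instructions, so both sides of the branch are computed for every lane and the results are merged with a comparison mask. The function name select4 and the particular branch are invented for this example.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* For each of four lanes: out = (x < 0) ? x * 2 : x + 1.
   Both branches are evaluated for all lanes, then merged with a mask. */
void select4(const float *x, float *out)
{
    __m128 vx   = _mm_loadu_ps(x);
    __m128 mask = _mm_cmplt_ps(vx, _mm_setzero_ps());  /* all ones where x < 0 */
    __m128 neg  = _mm_mul_ps(vx, _mm_set1_ps(2.0f));   /* "then" side, all lanes */
    __m128 pos  = _mm_add_ps(vx, _mm_set1_ps(1.0f));   /* "else" side, all lanes */
    /* (mask & neg) | (~mask & pos) picks the right side per lane. */
    __m128 r = _mm_or_ps(_mm_and_ps(mask, neg), _mm_andnot_ps(mask, pos));
    _mm_storeu_ps(out, r);
}
```

The more the lanes diverge, the more of this computed work is thrown away, which is why incoherent algorithms gain little.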

In most cases, only algorithms that are expensive enough, or run often enough, to be significant in total application run time should be vectorized, because vectorization takes additional work. Several examples of good or known candidates for vectorization from different fields of science follow:

Graphics
Rasterization of triangles, ray tracing of coherent rays, shading pixels with a coherent shader.
Physics
N-body dynamics like gravitation between mass points, motion of atoms in potentials or particle dynamics.
Mathematics
Matrix operations in general and linear algebra, analytical geometry like intersection tests between primitives.

Generally, of course, anything that can be done in parallel, mostly coherently, with the same set of instructions is a candidate.

Data storage and byte boundary alignment

Intel's and AMD's processors transfer data between memory and registers faster if the data is aligned to 16-byte boundaries. While the compiler will take care of this alignment when the basic 128-bit type is used, it means that data optimally has to be stored in memory in sets of four 32-bit floating point values. This is one more hurdle to deal with when using SSE. If data is not stored in this fashion, then more costly unaligned or scalar memory moves are needed instead of packed 128-bit aligned moves.

The Intel optimization manual says: "Data must be 16-byte aligned when loading to and storing from the 128-bit XMM registers used by SSE/SSE2/SSE3/SSSE3. This must be done to avoid severe performance penalties."
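The difference between aligned and unaligned access can be sketched in C as follows. The function names is_aligned16, scale_aligned and scale_unaligned are hypothetical; the sketch assumes the element count is a multiple of four, and that the buffer passed to scale_aligned really is 16-byte aligned (for example, allocated with _mm_malloc(size, 16) and released with _mm_free).

```c
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE intrinsics, also declares _mm_malloc/_mm_free */

/* Returns nonzero if p lies on a 16-byte boundary. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p & 15u) == 0;
}

/* Multiplies n floats in place by s.
   Assumes data is 16-byte aligned and n is a multiple of 4. */
void scale_aligned(float *data, size_t n, float s)
{
    __m128 vs = _mm_set1_ps(s);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(data + i);           /* aligned packed load */
        _mm_store_ps(data + i, _mm_mul_ps(v, vs));  /* aligned packed store */
    }
}

/* Same operation without the alignment requirement (n still a multiple of 4). */
void scale_unaligned(float *data, size_t n, float s)
{
    __m128 vs = _mm_set1_ps(s);
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);           /* unaligned packed load */
        _mm_storeu_ps(data + i, _mm_mul_ps(v, vs));  /* unaligned packed store */
    }
}
```

Passing a misaligned pointer to scale_aligned faults at run time, so a check such as is_aligned16 is useful in debug builds.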

Long code blocks and register usage

Effective SSE code minimizes the movement of data between the memory subsystem and the CPU registers. The same is true of scalar code; however, the benefit is higher with SSE. Code blocks should be as long as possible, with the data loaded into SSE registers only once and the results moved back into memory only once. Data should be stored to memory only when it is no longer needed in a code block. What I mean by a code block is a pathway in code that has no boundaries that cannot be eliminated by a compiler. An example of a boundary would be a function call that the compiler cannot inline.

While the compiler will be of great help with this optimization by inlining functions, algorithms and program structure have to be designed with this in mind. SSE versions 2.0 through 4.2 have a total of eight 128-bit registers available in 32-bit mode and sixteen in 64-bit mode. The latter can hold a total of 64 single precision floating point values in registers.
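A minimal sketch of such a code block, assuming an invented example computation (a quadratic polynomial evaluated over an array): each group of four values is loaded once, passed through several operations while staying in registers, and stored once. The helper mul_add is hypothetical; it takes and returns __m128 by value and is marked static inline so the compiler can inline it and avoid a function call boundary.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/* Returns a * b + c per lane. Small enough for the compiler to inline,
   so the values never have to leave the XMM registers. */
static inline __m128 mul_add(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}

/* y[i] = (a * x[i] + b) * x[i] + c for i in [0, n); n must be a multiple of 4. */
void poly2(const float *x, float *y, size_t n, float a, float b, float c)
{
    const __m128 va = _mm_set1_ps(a);
    const __m128 vb = _mm_set1_ps(b);
    const __m128 vc = _mm_set1_ps(c);
    for (size_t i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);  /* the only load of this data */
        __m128 t  = mul_add(va, vx, vb);  /* intermediate stays in a register */
        t = mul_add(t, vx, vc);
        _mm_storeu_ps(y + i, t);          /* the only store of the result */
    }
}
```

Writing the intermediate t to an array between the two steps, or calling a helper the compiler cannot inline, would add memory traffic without changing the result.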