In my limited spare time, I’ve been trying to teach myself some basic
SIMD programming in order to optimize the 3D
math library I’ve been working on. Whilst at EA, we had the idea of having an “FPU”
(floating-point unit) math library alongside a “VPU” (vector
processing unit) optimized class. I decided to do the same thing in my math
library for a few reasons:
- It allows me to write a simpler, much more basic math library that I could
use in unit tests to validate the VPU version
- It gives me a way to accurately benchmark my SIMD optimized code path
- Finally, there are cases where it’s actually better to use the floating-point
class version, which I’ll briefly explain later
As I wanted to guarantee backwards compatibility, I decided to stick to SSE2
instructions and nothing fancier, which means it’ll work on every 64-bit
desktop processor* and should hopefully port quite nicely to
NEON on ARM
(more on that in a future post!)
Rather than go through everything, I’ll instead just explain what I did for dot
and cross products, which will show most of the simple ways of doing things.
Introduction to SIMD SSE
I won’t go into huge detail here as there are plenty of better resources for
this, but the basic principle of SSE is to perform the same operation on
multiple values (four, to be exact) with one instruction. This is done by
loading data into special registers which are 128 bits wide, and then using
special instructions to operate on all 4 values at the same time. This works
really well for 3D operations, where more often than not you’ll want to perform
the same action on the X, Y, Z and W values. The key is then to keep the
results of these operations in these 4-wide registers as much as possible, to
ensure you don’t incur the cost of moving them back out.
The one caveat to all this is getting the data into these special registers in
the first place. For SSE2, nearly all of the load operations, such as
_mm_load_ps,
require the source data to be 16-byte aligned. There are non-aligned variants
(such as _mm_loadu_ps), but traditionally these were very slow on various
processors, so if you were doing a lot of them you’d cause some serious
slowdowns. On newer processors and newer instruction sets this is not so much
of a problem, but it’s still good to know.
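To make the difference concrete, here’s a small sketch of the two load
flavours (the array names are purely illustrative, and alignas assumes a
C++11 compiler):

#include <xmmintrin.h>

alignas(16) float lafAligned[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
float lafUnaligned[4] = { 1.0f, 2.0f, 3.0f, 4.0f };

__m128 lvFast = _mm_load_ps( lafAligned );    // requires 16-byte alignment, crashes otherwise
__m128 lvSlow = _mm_loadu_ps( lafUnaligned ); // works on any address, historically slower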
To check for alignment, I wrote a simple macro that tests any reference object
to ensure the memory is appropriately aligned before calling any intrinsic
functions. This is obviously only for debug builds and compiles out of our
optimized code.
#define M_CHECK_ALIGNMENT(lObject, luAlignment) \
{ \
    assert( (luAlignment) != 0 ); /* alignment must be non-zero */ \
    assert( ( (luAlignment) & ( (luAlignment) - 1 ) ) == 0 ); /* and a power of two */ \
    assert( ( reinterpret_cast<intptr_t>( &(lObject) ) & ( (luAlignment) - 1 ) ) == 0 ); /* address must be a multiple of the alignment */ \
}
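Here’s a quick hedged example of the macro guarding an aligned load (the
variable names are purely illustrative):

alignas(16) float lafData[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
M_CHECK_ALIGNMENT( lafData[0], 16 );    // passes: the array is explicitly 16-byte aligned
__m128 lvData = _mm_load_ps( lafData ); // now safe to use the aligned load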
Simple Operations
Before showing how I put together a dot product function, let’s do a simple
example: multiplying. For this I’ll just be multiplying two sets of 4 floats
together, which is exactly what I do in my Vector4 class. Below is a simple
example of loading 8 floats into two intrinsic registers, performing a multiply
operation and then printing the result.
alignas(16) static float lafResult[] = { 0.0f, 0.0f, 0.0f, 0.0f }; // must be 16-byte aligned for _mm_store_ps
__m128 lvfA = _mm_set_ps( 1.0f, 2.0f, 3.0f, 4.0f );
__m128 lvfB = _mm_set_ps( 2.0f, 3.0f, 4.0f, 5.0f );
__m128 lvfA_mult_B = _mm_mul_ps( lvfA, lvfB );
_mm_store_ps( lafResult, lvfA_mult_B );
printf( "%f, %f, %f, %f\n", lafResult[0], lafResult[1], lafResult[2], lafResult[3] );
Pretty simple huh?! An interesting point to note is that this snippet of code
actually prints the result in reverse order: “20.0, 12.0, 6.0, 2.0”. This is
because _mm_set_ps takes its arguments from the highest channel down to the
lowest, while _mm_store_ps writes the lowest channel to the start of the array,
so when we store the result back into a float array the order appears reversed.
(If you prefer the natural order, _mm_setr_ps takes its arguments the other way
round.) It’s also important to note that the float array we store the result
into must be 16-byte aligned (hence the alignas in the snippet above),
otherwise it would crash. One way to ensure your Vector class is 16-byte
aligned is by using the
__attribute__ ((aligned (x)))
attribute in the class declaration (or alignas(x) in C++11, or
__declspec(align(x)) on MSVC). This is actually pretty annoying, as it means
that vector types smaller than 16 bytes (so Vector2 and Vector3) will waste
space between instances of the class. This is why it’s sometimes actually
faster to use the FPU variants of the vector class: they pack tighter, meaning
fewer cache misses. Interesting fact: whilst at EA working on the AI systems in
Need for Speed, we found it was actually quicker to use the FPU vectors when
running things like the A* algorithm, as the cache-miss overhead on the PS3 and
X360 processors cancelled out any benefit from using the VPU variants.
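For reference, here’s a minimal sketch of what that alignment looks like on a
class, assuming a C++11 compiler (the class layout is illustrative, not my
actual library code):

#include <xmmintrin.h>

// The alignas keyword guarantees every instance starts on a 16-byte
// boundary, which is what the aligned load/store intrinsics require.
class alignas(16) Vector4
{
public:
    __m128 mvValues; // all four channels held in one intrinsic register
};

static_assert( alignof(Vector4) == 16, "Vector4 must be 16-byte aligned" );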
The Dot Product shuffle
Now that we’ve got a rough idea of how SSE coding is done, let’s see how we can
apply it to a dot product function. First, let’s write out exactly what a dot
product using floats would look like:
inline float32_t Dot( const Vector4& lLhs, const Vector4& lRhs )
{
    return lLhs.GetX() * lRhs.GetX() +
           lLhs.GetY() * lRhs.GetY() +
           lLhs.GetZ() * lRhs.GetZ() +
           lLhs.GetW() * lRhs.GetW();
}
As you can hopefully see, fundamentally this is just multiplying two Vector4s
together and then summing the individual results to get the scalar result.
We’ve already shown how to multiply two sets of 4 floats together; the
difficulty comes when adding the results together. One way to do this would be
to move the results of the multiply into 4 float variables and then add them
together, however this would cost more instructions and involve moving to and
from float registers, which would negate the benefit of doing all 4
multiplications at the same time. Introducing the shuffle instruction! This
function allows you to take two intrinsic variables and move two channels from
each of them into a new variable. Here’s a simple example:
__m128 lvfA = _mm_set_ps( 1.0f, 2.0f, 3.0f, 4.0f );
__m128 lvfB = _mm_set_ps( 5.0f, 6.0f, 7.0f, 8.0f );
__m128 lvfResult = _mm_shuffle_ps( lvfA, lvfB, _MM_SHUFFLE( 0, 1, 2, 3 ) );
// Note: m128_f32 is MSVC-specific. Prints "1.000000, 2.000000, 7.000000, 8.000000".
printf( "%f, %f, %f, %f\n", lvfResult.m128_f32[0], lvfResult.m128_f32[1],
        lvfResult.m128_f32[2], lvfResult.m128_f32[3] );
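To make sense of that mask, it helps to look at how the _MM_SHUFFLE helper
macro packs its four arguments (this is its standard definition from the
intrinsics headers):

// Each argument is a 2-bit channel selector. Note the arguments are given
// highest channel first, so the last argument selects channel 0.
#define _MM_SHUFFLE( fp3, fp2, fp1, fp0 ) \
    ( ((fp3) << 6) | ((fp2) << 4) | ((fp1) << 2) | (fp0) )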
So the shuffle command’s 3rd argument is an 8-bit immediate that defines which
channel to take for each slot of the result: the lower two channels of the
result come from the first intrinsic variable and the upper two from the
second. Using this technique, we can move each channel around, add the results
together and end up with the final sum. Let’s see what that looks like:
typedef __m128 FloatIntrinsic;

inline FloatIntrinsic
DotV4( const FloatIntrinsic& lLhs, const FloatIntrinsic& lRhs )
{
    M_CHECK_ALIGNMENT( lLhs, sizeof(FloatIntrinsic) );
    M_CHECK_ALIGNMENT( lRhs, sizeof(FloatIntrinsic) );
    FloatIntrinsic lvMult  = _mm_mul_ps( lLhs, lRhs );
    FloatIntrinsic lvTemp  = _mm_shuffle_ps( lvMult, lvMult, _MM_SHUFFLE( 0, 1, 2, 3 ) ); // W, Z, Y, X
    lvTemp = _mm_add_ps( lvTemp, lvMult ); // (x+w), (y+z), (z+y), (w+x)
    FloatIntrinsic lvTemp2 = _mm_shuffle_ps( lvTemp, lvTemp, _MM_SHUFFLE( 2, 3, 0, 1 ) );
    return _mm_add_ps( lvTemp, lvTemp2 ); // (x+w+z+y), (y+z+w+x), (z+y+x+w), (w+x+y+z)
}
As the comments suggest, what we’re doing is shuffling the result of the
multiply so it’s reversed, adding this to the original result, then shuffling
that sum again and adding the two together. This produces a result whereby all
the channels are set to the sum of the multiplications, which is very useful if
we later want to scale a given vector equally, and it keeps the result in an
intrinsic register, which is good! For good measure, here’s what the assembly
looks like, which is pretty small:
000000013F8FD0E8 mulps xmm0,xmmword ptr [13F940320h]
000000013F8FD0EF movaps xmm1,xmm0
000000013F8FD0F2 shufps xmm1,xmm0,1Bh
000000013F8FD0F6 addps xmm1,xmm0
000000013F8FD0F9 movaps xmm0,xmm1
000000013F8FD0FC shufps xmm0,xmm1,4Eh
000000013F8FD100 addps xmm0,xmm1
000000013F8FD103 movaps xmmword ptr [rsp+50h],xmm0
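If you do eventually need the dot product as a plain float (say, for a
comparison), _mm_cvtss_f32 reads the lowest channel back out. A small hedged
example, with purely illustrative input values:

__m128 lvA = _mm_set_ps( 0.0f, 0.0f, 0.0f, 1.0f ); // arguments given W, Z, Y, X
__m128 lvB = _mm_set_ps( 0.0f, 0.0f, 0.0f, 2.0f );
__m128 lvDot = DotV4( lvA, lvB );      // every channel now holds the dot product
float lfDot = _mm_cvtss_f32( lvDot );  // read channel 0 as a scalar (2.0f here)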
This code could also be used for a Vector3 dot product as well, however you
need to guarantee that the “W” component is always zero, otherwise you’ll
silently get wrong results. A common place this might happen is when converting
from a Vec4 to a Vec3, or (where I found it an issue) when trying to apply a
dot product to a quaternion axis. However, it’s easy enough to write a
dedicated DotV3 function, shown below:
inline FloatIntrinsic
DotV3( const FloatIntrinsic& lLhs, const FloatIntrinsic& lRhs )
{
    M_CHECK_ALIGNMENT( lLhs, sizeof(FloatIntrinsic) );
    M_CHECK_ALIGNMENT( lRhs, sizeof(FloatIntrinsic) );
    FloatIntrinsic lvMult  = _mm_mul_ps( lLhs, lRhs );
    FloatIntrinsic lvTemp  = _mm_shuffle_ps( lvMult, lvMult, _MM_SHUFFLE( 0, 0, 0, 1 ) ); // move Y into channel 0
    FloatIntrinsic lvTemp2 = _mm_shuffle_ps( lvMult, lvMult, _MM_SHUFFLE( 0, 0, 0, 2 ) ); // move Z into channel 0
    FloatIntrinsic lvSum   = _mm_add_ps( lvMult, _mm_add_ps( lvTemp, lvTemp2 ) ); // channel 0 = x + y + z products
    return _mm_shuffle_ps( lvSum, lvSum, _MM_SHUFFLE( 0, 0, 0, 0 ) ); // splat channel 0 across all channels
}
In this example you could avoid the last shuffle and just use the first channel
of the intrinsic. However, as I said before, it’s very beneficial to have the
same value in all the channels, as you can then use it directly in other
operations (this is called “splatting” the result across all channels). Here’s
the assembly for this as well, for good measure:
000000013FC5F2A8 mulps xmm2,xmmword ptr [13FCB0580h]
000000013FC5F2AF movaps xmm0,xmm2
000000013FC5F2B2 shufps xmm0,xmm2,1
000000013FC5F2B6 movaps xmm1,xmm2
000000013FC5F2B9 shufps xmm1,xmm2,2
000000013FC5F2BD addps xmm1,xmm0
000000013FC5F2C0 addps xmm1,xmm2
000000013FC5F2C3 shufps xmm1,xmm1,0
000000013FC5F2C7 movaps xmmword ptr [rsp+50h],xmm1
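As an aside, if you can relax the SSE2-only constraint, SSE4.1 added a
dedicated dot product instruction that does all of this in one go. A hedged
sketch, reusing the lvA and lvB values from the earlier example:

#include <smmintrin.h> // SSE4.1

// Mask 0x7F: the high nibble selects the X, Y and Z channels as inputs
// (ignoring W), and the low nibble broadcasts the result to all four
// channels, giving the same splatted result as DotV3.
__m128 lvDot3 = _mm_dp_ps( lvA, lvB, 0x7F );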
Now for the Cross Product
Phew! That wasn’t too bad, was it?! OK, maybe it was, but the good news is that
using these techniques, you can do the cross product in a very similar way.
Again, let’s write a simple cross product function using floats:
inline Vector3 Cross( const Vector3& lLhs, const Vector3& lRhs )
{
    Vector3 lvResult;
    lvResult.GetX() = lLhs.GetY() * lRhs.GetZ() - lLhs.GetZ() * lRhs.GetY();
    lvResult.GetY() = lLhs.GetZ() * lRhs.GetX() - lLhs.GetX() * lRhs.GetZ();
    lvResult.GetZ() = lLhs.GetX() * lRhs.GetY() - lLhs.GetY() * lRhs.GetX();
    return lvResult;
}
Again, I’ve structured this in a way that makes it obvious that we have 3
operations that need to be performed. Unlike the dot product, however, we first
need to shuffle our two vectors into the right order before multiplying and
subtracting. Let’s see what that looks like:
M_FORCE_INLINE FloatIntrinsic
CrossV3( const FloatIntrinsic& lLhs, const FloatIntrinsic& lRhs )
{
    M_CHECK_ALIGNMENT( lLhs, sizeof(FloatIntrinsic) );
    M_CHECK_ALIGNMENT( lRhs, sizeof(FloatIntrinsic) );
    const uint32_t YZXMask = _MM_SHUFFLE( 0, 0, 2, 1 ); // reorder channels to Y, Z, X, X
    const uint32_t ZXYMask = _MM_SHUFFLE( 0, 1, 0, 2 ); // reorder channels to Z, X, Y, X
    FloatIntrinsic lvTemp1 = _mm_shuffle_ps( lLhs, lLhs, YZXMask );
    FloatIntrinsic lvTemp2 = _mm_shuffle_ps( lRhs, lRhs, ZXYMask );
    FloatIntrinsic lvMult  = _mm_mul_ps( lvTemp1, lvTemp2 ); // (y1*z2), (z1*x2), (x1*y2), (x1*x2)
    lvTemp1 = _mm_shuffle_ps( lLhs, lLhs, ZXYMask );
    lvTemp2 = _mm_shuffle_ps( lRhs, lRhs, YZXMask );
    FloatIntrinsic lvMult2 = _mm_mul_ps( lvTemp1, lvTemp2 ); // (z1*y2), (x1*z2), (y1*x2), (x1*x2)
    return _mm_sub_ps( lvMult, lvMult2 );
}
Pretty straightforward, right?! The cool thing about this is that because we’re
computing x1*x2 - x1*x2 in the W channel, it will always come out as zero no
matter what was in there, which is very handy when using the result in some
other operation. Let’s see what this looks like as assembly:
000000013F34EDAA movaps xmm6,xmm2
000000013F34EDAD shufps xmm6,xmm2,12h
000000013F34EDB1 movaps xmm0,xmm1
000000013F34EDB4 shufps xmm0,xmm1,9
000000013F34EDB8 mulps xmm6,xmm0
000000013F34EDBB shufps xmm1,xmm1,12h
000000013F34EDBF shufps xmm2,xmm2,9
000000013F34EDC3 mulps xmm2,xmm1
000000013F34EDC6 subps xmm6,xmm2
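To round things off, here’s a small hedged usage sketch tying the pieces
together (the unit vectors are purely illustrative):

alignas(16) float lafX[4] = { 1.0f, 0.0f, 0.0f, 0.0f };
alignas(16) float lafY[4] = { 0.0f, 1.0f, 0.0f, 0.0f };
alignas(16) float lafZ[4] = { 0.0f, 0.0f, 0.0f, 0.0f };

FloatIntrinsic lvX = _mm_load_ps( lafX );
FloatIntrinsic lvY = _mm_load_ps( lafY );
_mm_store_ps( lafZ, CrossV3( lvX, lvY ) ); // X cross Y = Z: lafZ is now { 0, 0, 1, 0 }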
Summary
So that’s a quick whistle-stop tour of writing SSE instructions, showing how to
create dot and cross product functions. I’ll try to add some performance tests
comparing the two functions above to their FPU equivalents in a later post.
* Strictly speaking, SSE2 is part of the x86-64 baseline, so every 64-bit
desktop processor supports it; it’s the later extensions (SSE3 and beyond) that
you can’t take for granted on the earliest 64-bit chips.