Fast dot product using SSE/AVX intrinsics

I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I tried several things, but most examples online use an array of floats while our data structure is different.

We use structs which are 16 byte aligned. Code excerpt (simplified):

struct float3 {
    float x, y, z, w; // 4th component unused here

struct float4 {
    float x, y, z, w;

In previous tests (using SSE4 dot product intrinsic or FMA) I could not get a speedup, compared to using the following regular c++ code.

float dot(const float3 a, const float3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;

Tests were done with gcc and clang on Intel Ivy Bridge / Haswell. It seems that the time spend to load the data into the SIMD registers and pulling them out again kills alls the benefits.

I would appreciate some help and ideas, how the dot product can be efficiently calculated using our float3/4 data structures. SSE4, AVX or even AVX2 is fine.

Thanks in advance.

Source: gcc

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.