I am performing a scattered read of 8-bit data from a file (De-Interleaving a 64 channel wave file). I am then combining them to be a single stream of bytes. The problem I m having is with my re-construction of the data to write out.
Basically I m reading in 16 bytes and then building them into a single __m128i variable and then using _mm_stream_ps to write the value back out to memory. However I have some odd performance results.
In my first scheme I use the _mm_set_epi8 intrinsic to set my __m128i as follows:
const __m128i packedSamples = _mm_set_epi8( sample15, sample14, sample13, sample12, sample11, sample10, sample9, sample8,
sample7, sample6, sample5, sample4, sample3, sample2, sample1, sample0 );
Basically I leave it all up to the compiler to decide how to optimise it to give best performance. This gives WORST performance. MY test runs in ~0.195 seconds.
Second I tried to merge down by using 4 _mm_set_epi32 instructions and then packing them down:
const __m128i samples0 = _mm_set_epi32( sample3, sample2, sample1, sample0 );
const __m128i samples1 = _mm_set_epi32( sample7, sample6, sample5, sample4 );
const __m128i samples2 = _mm_set_epi32( sample11, sample10, sample9, sample8 );
const __m128i samples3 = _mm_set_epi32( sample15, sample14, sample13, sample12 );
const __m128i packedSamples0 = _mm_packs_epi32( samples0, samples1 );
const __m128i packedSamples1 = _mm_packs_epi32( samples2, samples3 );
const __m128i packedSamples = _mm_packus_epi16( packedSamples0, packedSamples1 );
This does improve performance somewhat. My test now runs in ~0.15 seconds. Seems counter-intuitive that performance would improve by doing this as I assume this is exactly what _mm_set_epi8 is doing anyway ...
My final attempt was to use a bit of code I have from making four CCs the old fashioned way (with shifts and ors) and then putting them in an __m128i using a single _mm_set_epi32.
const GCui32 samples0 = MakeFourCC( sample0, sample1, sample2, sample3 );
const GCui32 samples1 = MakeFourCC( sample4, sample5, sample6, sample7 );
const GCui32 samples2 = MakeFourCC( sample8, sample9, sample10, sample11 );
const GCui32 samples3 = MakeFourCC( sample12, sample13, sample14, sample15 );
const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );
This gives even BETTER performance. Taking ~0.135 seconds to run my test. I m really starting to get confused.
So I tried a simple read byte write byte system and that is ever-so-slightly faster than even the last method.
So what is going on? This all seems counter-intuitive to me.
I ve considered the idea that the delays are occuring on the _mm_stream_ps because I m supplying data too quickly but then I would to get exactly the same results out whatever I do. Is it possible that the first 2 methods mean that the 16 loads can t get distributed through the loop to hide latency? If so why is this? Surely an intrinsic allows the compiler to make optimisations as and where it pleases .. i thought that was the whole point ... Also surely performing 16 reads and 16 writes will be much slower than 16 reads and 1 write with a bunch of SSE juggling instructions ... After all its the reads and writes that are the slow bit!
Anyone with any ideas whats going on will be much appreciated! :D
Edit: Further to the comment below I stopped pre-loading the bytes as constants and changedit to this:
const __m128i samples0 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i samples1 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i samples2 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i samples3 = _mm_set_epi32( *(pSamples + channelStep3), *(pSamples + channelStep2), *(pSamples + channelStep1), *(pSamples + channelStep0) );
pSamples += channelStep4;
const __m128i packedSamples0 = _mm_packs_epi32( samples0, samples1 );
const __m128i packedSamples1 = _mm_packs_epi32( samples2, samples3 );
const __m128i packedSamples = _mm_packus_epi16( packedSamples0, packedSamples1 );
and this improved performance to ~0.143 seconds. Sitll not as good as the straight C implementation ...
Edit Again: The best performance I m getting thus far is
// Load the samples.
const GCui8 sample0 = *(pSamples + channelStep0);
const GCui8 sample1 = *(pSamples + channelStep1);
const GCui8 sample2 = *(pSamples + channelStep2);
const GCui8 sample3 = *(pSamples + channelStep3);
const GCui32 samples0 = Build32( sample0, sample1, sample2, sample3 );
pSamples += channelStep4;
const GCui8 sample4 = *(pSamples + channelStep0);
const GCui8 sample5 = *(pSamples + channelStep1);
const GCui8 sample6 = *(pSamples + channelStep2);
const GCui8 sample7 = *(pSamples + channelStep3);
const GCui32 samples1 = Build32( sample4, sample5, sample6, sample7 );
pSamples += channelStep4;
// Load the samples.
const GCui8 sample8 = *(pSamples + channelStep0);
const GCui8 sample9 = *(pSamples + channelStep1);
const GCui8 sample10 = *(pSamples + channelStep2);
const GCui8 sample11 = *(pSamples + channelStep3);
const GCui32 samples2 = Build32( sample8, sample9, sample10, sample11 );
pSamples += channelStep4;
const GCui8 sample12 = *(pSamples + channelStep0);
const GCui8 sample13 = *(pSamples + channelStep1);
const GCui8 sample14 = *(pSamples + channelStep2);
const GCui8 sample15 = *(pSamples + channelStep3);
const GCui32 samples3 = Build32( sample12, sample13, sample14, sample15 );
pSamples += channelStep4;
const __m128i packedSamples = _mm_set_epi32( samples3, samples2, samples1, samples0 );
_mm_stream_ps( pWrite + 0, *(__m128*)&packedSamples );
This gives me processing in ~0.095 seconds which is considerably better. I don t appear to be able to get close with SSE though ... I m still confused by that but .. ho hum.