(I m a newbie to SSE/asm, apologies if this is obvious or redundant)
Is there a better way to transpose 8 SSE registers containing 16-bit values than performing 24 unpck[lh]ps and 8/16+ shuffles and using 8 extra registers? (Note using up to SSSE 3 instructions, Intel Merom, aka lacking BLEND* from SSE4.)
Say you have registers v[0-7] and use t0-t7 as aux registers. In pseudo intrinsics code:
/* Phase 1: process lower parts of the registers */
/* Level 1: work first part of the vectors */
/* v[0] A0 A1 A2 A3 A4 A5 A6 A7
** v[1] B0 B1 B2 B3 B4 B5 B6 B7
** v[2] C0 C1 C2 C3 C4 C5 C6 C7
** v[3] D0 D1 D2 D3 D4 D5 D6 D7
** v[4] E0 E1 E2 E3 E4 E5 E6 E7
** v[5] F0 F1 F2 F3 F4 F5 F6 F7
** v[6] G0 G1 G2 G3 G4 G5 G6 G7
** v[7] H0 H1 H2 H3 H4 H5 H6 H7 */
t0 = unpcklps (v[0], v[1]); /* Extract first half interleaving */
t1 = unpcklps (v[2], v[3]); /* Extract first half interleaving */
t2 = unpcklps (v[4], v[5]); /* Extract first half interleaving */
t3 = unpcklps (v[6], v[7]); /* Extract first half interleaving */
t0 = pshufhw (t0, 0xD8); /* Flip middle 2 high */
t0 = pshuflw (t0, 0xD8); /* Flip middle 2 low */
t1 = pshufhw (t1, 0xD8); /* Flip middle 2 high */
t1 = pshuflw (t1, 0xD8); /* Flip middle 2 low */
t2 = pshufhw (t2, 0xD8); /* Flip middle 2 high */
t2 = pshuflw (t2, 0xD8); /* Flip middle 2 low */
t3 = pshufhw (t3, 0xD8); /* Flip middle 2 high */
t3 = pshuflw (t3, 0xD8); /* Flip middle 2 low */
/* t0 A0 B0 A1 B1 A2 B2 A3 B3 (A B - 0 1 2 3)
** t1 C0 D0 C1 D1 C2 D2 C3 D3 (C D - 0 1 2 3)
** t2 E0 F0 E1 F1 E2 F2 E3 F3 (E F - 0 1 2 3)
** t3 G0 H0 G1 H1 G2 H2 G3 H3 (G H - 0 1 2 3) */
/* L2 */
t4 = unpcklps (t0, t1);
t5 = unpcklps (t2, t3);
t6 = unpckhps (t0, t1);
t7 = unpckhps (t2, t3);
/* t4 A0 B0 C0 D0 A1 B1 C1 D1 (A B C D - 0 1)
** t5 E0 F0 G0 H0 E1 F1 G1 H1 (E F G H - 0 1)
** t6 A2 B2 C2 D2 A3 B3 C3 D3 (A B C D - 2 3)
** t7 E2 F2 G2 H2 E3 F3 G3 H3 (E F G H - 2 3) */
/* Phase 2: same with higher parts of the registers */
/* A A0 A1 A2 A3 A4 A5 A6 A7
** B B0 B1 B2 B3 B4 B5 B6 B7
** C C0 C1 C2 C3 C4 C5 C6 C7
** D D0 D1 D2 D3 D4 D5 D6 D7
** E E0 E1 E2 E3 E4 E5 E6 E7
** F F0 F1 F2 F3 F4 F5 F6 F7
** G G0 G1 G2 G3 G4 G5 G6 G7
** H H0 H1 H2 H3 H4 H5 H6 H7 */
t0 = unpckhps (v[0], v[1]);
t0 = pshufhw (t0, 0xD8); /* Flip middle 2 high */
t0 = pshuflw (t0, 0xD8); /* Flip middle 2 low */
t1 = unpckhps (v[2], v[3]);
t1 = pshufhw (t1, 0xD8); /* Flip middle 2 high */
t1 = pshuflw (t1, 0xD8); /* Flip middle 2 low */
t2 = unpckhps (v[4], v[5]);
t2 = pshufhw (t2, 0xD8); /* Flip middle 2 high */
t2 = pshuflw (t2, 0xD8); /* Flip middle 2 low */
t3 = unpckhps (v[6], v[7]);
t3 = pshufhw (t3, 0xD8); /* Flip middle 2 high */
t3 = pshuflw (t3, 0xD8); /* Flip middle 2 low */
/* t0 A4 B4 A5 B5 A6 B6 A7 B7 (A B - 4 5 6 7)
** t1 C4 D4 C5 D5 C6 D6 C7 D7 (C D - 4 5 6 7)
** t2 E4 F4 E5 F5 E6 F6 E7 F7 (E F - 4 5 6 7)
** t3 G4 H4 G5 H5 G6 H6 G7 H7 (G H - 4 5 6 7) */
/* Back to first part, v[0-3] can be re-written now */
/* L3 */
v[0] = unpcklpd (t4, t5);
v[1] = unpckhpd (t4, t5);
v[2] = unpcklpd (t6, t7);
v[3] = unpckhpd (t6, t7);
/* v[0] = A0 B0 C0 D0 E0 F0 G0 H0
** v[1] = A1 B1 C1 D1 E1 F1 G1 H1
** v[2] = A2 B2 C2 D2 E2 F2 G2 H2
** v[3] = A3 B3 C3 D3 E3 F3 G3 H3 */
/* Back to second part, t[4-7] can be re-written now... */
/* L2 */
t4 = unpcklps (t0, t1);
t5 = unpcklps (t2, t3);
t6 = unpckhps (t0, t1);
t7 = unpckhps (t2, t3);
/* t4 A4 B4 C4 D4 A5 B5 C5 D5 (A B C D - 4 5)
** t5 E4 F4 G4 H4 E5 F5 G5 H5 (E F G H - 4 5)
** t6 A6 B6 C6 D6 A7 B7 C7 D7 (A B C D - 6 7)
** t7 E6 F6 G6 H6 E7 F7 G7 H7 (E F G H - 6 7) */
/* L3 */
v[4] = unpcklpd (t4, t5);
v[5] = unpckhpd (t4, t5);
v[6] = unpcklpd (t6, t7);
v[7] = unpckhpd (t6, t7);
/* v[4] = A4 B4 C4 D4 E4 F4 G4 H4
** v[5] = A5 B5 C5 D5 E5 F5 G5 H5
** v[6] = A6 B6 C6 D6 E6 F6 G6 H6
** v[7] = A7 B7 C7 D7 E7 F7 G7 H7 */
Each unpck* takes 3 cycles of latency, or 2 for reciprocal throughput (reported by Agner.) This is killing big part of the performance gains from using SSE (on this code) because this register dance takes almost one cycle per element. I tried to understand x264 s asm file for x86 transpose but failed understanding the macros.
Thanks!