This is specifically related to ARM Neon SIMD coding. I am using ARM Neon instrinsics for certain module in a video decoder. I have a vectorized data as follows:
There are four 32 bit elements in a Neon register - say, Q0 - which is of size 128 bit.
3B 3A 1B 1A
There are another four, 32 bit elements in other Neon register say Q1 which is of size 128 bit.
3D 3C 1D 1C
I want the final data to be in order as shown below:
1D 1C 1B 1A
3D 3C 3B 3A
What Neon instrinsics can achieve the desired data order?