English 中文(简体)
哪一种最快的办法是在借方的8x8块区划线?
原标题:What is the fastest way to transpose the bits in an 8x8 block on bits?

I m not sure the exact term for what I m trying to do. I have an 8x8 block of bits stored in 8 bytes, each byte stores one row. When I m finished, I d like each byte to store one column.

For example, when I m finished:

Byte0out = Byte0inBit0 + Bit0inByte1 + Bit0inByte2 + Bit0inByte3 + ...
Byte1out = Bit1inByte0 + Bit1inByte1 + Bit1inByte2 + Bit1inByte3 + ...

<>最容易<>的>方式是什么? 这将由一家dsPIC微型控制器操作。

最佳回答

This code is cribbed directly from "Hacker s Delight" - Figure 7-2 Transposing an 8x8-bit matrix, I take no credit for it:

void transpose8(unsigned char A[8], int m, int n, 
                unsigned char B[8]) {
   unsigned x, y, t; 

   // Load the array and pack it into x and y. 

   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m]; 
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m]; 

   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7); 
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7); 

   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14); 
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14); 

   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F); 
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F); 
   x = t; 

   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x; 
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y; 
}

I didn t check if this rotates in the direction you need, if not you might need to adjust the code.

Also, keep in mind datatypes & sizes - int & unsigned (int) might not be 32 bits on your platform.

BTW, I suspect the book (Hacker s Delight) is essential for the kind of work you re doing... check it out, lots of great stuff in there.

问题回答

If you are looking for the simplest solution:

/* not tested, not even compiled */

char bytes_in[8];
char bytes_out[8];

/* please fill bytes_in[] here with some pixel-crap */

memset(bytes_out, 0, 8);
for(int i = 0; i < 8; i++) {
    for(int j = 0; j < 8; j++) {
        bytes_out[i] = (bytes_out[i] << 1) | ((bytes_in[j] >> (7 - i)) & 0x01);
    }
}

If your are looking for the fastest solution:

如何利用SSE2.在组装上一个比值矩阵。

This sounds a lot like a so-called "Chunky to planar" routine used on displays that use bitplanes. The following link uses MC68K assembler for its code, but provides a nice overview of the problem (assuming I understood the question correctly):

http://membres.multimania.fr/amycoders/sources/c2ptut.html

Lisp prototype:

(declaim (optimize (speed 3) (safety 0)))
(defun bit-transpose (a)
  (declare (type (simple-array unsigned-byte 1) a))
  (let ((b (make-array 8 :element-type  (unsigned-byte 8))))
    (dotimes (j 8)
      (dotimes (i 8)
    (setf (ldb (byte 1 i) (aref b j))
          (ldb (byte 1 j) (aref a i)))))
    b))

这是你如何操作守则:

#+nil
(bit-transpose (make-array 8 :element-type  unsigned-byte
               :initial-contents  (1 2 3 4 5 6 7 8)))
;; => #(85 102 120 128 0 0 0 0)

我有时会解散法典,以检查安全职能没有不必要的呼吁。

#+nil
(disassemble # bit-transpose)

This is a benchmark. Run the function often enough to process a (binary) HDTV image.

#+nil
(time 
 (let ((a (make-array 8 :element-type  unsigned-byte
              :initial-contents  (1 2 3 4 5 6 7 8)))
       (b (make-array 8 :element-type  unsigned-byte
              :initial-contents  (1 2 3 4 5 6 7 8))))
   (dotimes (i (* (/ 1920 8) (/ 1080 8)))
     (bit-transpose a))))

这只花了51米。 请注意,我会把很多东西混在一起,因为这一职能分配了全部时间的新的8个星群。 我确信,在C执行《公约》的情况会更加严重。

Evaluation took:
  0.051 seconds of real time
  0.052004 seconds of total run time (0.052004 user, 0.000000 system)
  101.96% CPU
  122,179,503 processor cycles
  1,048,576 bytes consed

这里还有一些测试案例:

#+nil
(loop for j below 12 collect
  (let ((l (loop for i below 8 collect (random 255))))
    (list l (bit-transpose (make-array 8 :element-type  unsigned-byte
                :initial-contents l)))))
;; => (((111 97 195 202 47 124 113 164) #(87 29 177 57 96 243 111 140))
;;     ((180 192 70 173 167 41 30 127) #(184 212 221 232 193 185 134 27))
;;     ((244 86 149 57 191 65 129 178) #(124 146 23 24 159 153 35 213))
;;     ((227 244 139 35 38 65 214 64) #(45 93 82 4 66 27 227 71))
;;     ((207 62 236 89 50 64 157 120) #(73 19 71 207 218 150 173 69))
;;     ((89 211 149 140 233 72 193 192) #(87 2 12 57 7 16 243 222))
;;     ((97 144 19 13 135 198 238 33) #(157 116 120 72 6 193 97 114))
;;     ((145 119 3 85 41 202 79 134) #(95 230 202 112 11 18 106 161))
;;     ((42 153 67 166 175 190 114 21) #(150 125 184 51 226 121 68 58))
;;     ((58 232 38 210 137 254 19 112) #(80 109 36 51 233 167 170 58))
;;     ((27 245 1 197 208 221 21 101) #(239 1 234 33 115 130 186 58))
;;     ((66 204 110 232 46 67 37 34) #(96 181 86 30 0 220 47 10)))

Now I really want to see how my code compares to Andrejs Cainikovs C solution (Edit: I think its wrong):

#include <string.h>

unsigned char bytes_in[8]={1,2,3,4,5,6,7,8};
unsigned char bytes_out[8];

/* please fill bytes_in[] here with some pixel-crap */
void bit_transpose(){
  memset(bytes_out, 0, 8);
  int i,j;
  for(i = 0; i < 8; i++)
    for(j = 0; j < 8; j++) 
      bytes_out[i] = (bytes_out[i] << 1) | ((bytes_in[j] >> (7 - i)) & 0x01);
}

int
main()
{
  int j,i;
  for(j=0;j<100;j++)
    for(i=0;i<(1920/8*1080/8);i++)
      bit_transpose();
  return 0;
}

基准:

wg@hp:~/0803/so$ gcc -O3 trans.c
wg@hp:~/0803/so$ time ./a.out 

real    0m0.249s
user    0m0.232s
sys     0m0.000s

每一种照相机都带有2.5米。 这远远快于我的无选择的利什。

不幸的是,《刑法》没有像我这样带来同样的结果:

#include <stdio.h>
int
main()
{
  int j,i;
  bit_transpose();
  for(i=0;i<8;i++)
    printf("%d ",(int)bytes_out[i]);
  return 0;
}
wg@hp:~/0803/so$ ./a.out 
0 0 0 0 1 30 102 170 

这类似于。 借机问题栏,可以通过将这种投入作为64轨道分类的8个字塔来有效解决。 如果借方0是最不重要的,则0号星号是阵列中的第一个星号,那么我假定你想做以下工作:

                              Column 7 becomes...
                              ↓
[ b07 b06 b05 b04 b03 b02 b01 b00   [ b70 b60 b50 b40 b30 b20 b10 b00 ← row 0
  b17 b16 b15 b14 b13 b12 b11 b10     b71 b61 b51 b41 b31 b21 b11 b01
  b27 b26 b25 b24 b23 b22 b21 b20     b72 b62 b52 b42 b32 b22 b12 b02
  b37 b36 b35 b34 b33 b32 b31 b30  →  b73 b63 b53 b43 b33 b23 b13 b03
  b47 b46 b45 b44 b43 b42 b41 b40  →  b74 b64 b54 b44 b34 b24 b14 b04
  b57 b56 b55 b54 b53 b52 b51 b50     b75 b65 b55 b45 b35 b25 b15 b05
  b67 b66 b65 b64 b63 b62 b61 b60     b76 b66 b56 b46 b36 b26 b16 b06
  b77 b76 b75 b74 b73 b72 b71 b70 ]   b77 b67 b57 b47 b37 b27 b17 b07 ]

缩略语 这样,左上列一栏的轮值就只是把所有最重要的轨道按反向单列,而其他各栏也可轮流使用。

To do that we mask out all the last 7 columns and read the array as an uint64_t. The result is

0b h0000000 g0000000 f0000000 e0000000 d0000000 c0000000 b0000000 a0000000
   ↑        ↑        ↑        ↑        ↑        ↑        ↑        ↑
   b77      b67      b57      b47      b37      b27      b17      b07

几乎在终端用户中,abcdefgh分别为707至77。 现在,我们只需要将这一数值乘以

uint8_t transpose_column(uint64_t matrix, unsigned col)
{
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic       = 0x0002040810204081ull;
    
    return ((matrix << col) & column_mask) * magic >> 56;
}

uint64_t block8x8;
memcpy(&block8x8, bytes_in, sizeof(block8x8));
#if __BYTE_ORDER == __BIG_ENDIAN
block8x8 = swap_bytes(block8x8);
#endif

for (unsigned i = 0; i < 8; i++)
    byte_out[i] = transpose_column(block8x8, 7 - i);

由于你将8个星阵列作为<代码>uint64_t<>/code>处理,你可能需要适当调整阵列,以取得更好的业绩,因为只有单一记忆负荷需要这样做。


在AVX2英特尔,在http://www.felixcloutier.com/x86/PEXT.html”上,“rel=“nofollow noreferer”>PDEP指令(通过_pext_u64):

data[i] = _pext_u64(matrix, column_mask << (7 - col));

但不幸的是,正如你所期望的那样,这在dsPIC赢得了一定的工作。

可在

If you wanted an optimized solution you would use the SSE extensions in x86.

You d need to use 4 of these SIMD opcodes.

  • MOVQ - move 8 bytes
  • PSLLW - packed shift left logical words
  • PMOVMSKB - packed move mask byte

2个普通x86码

  • LEA - load effective address
  • MOV - move
byte[] m = byte[8]; //input
byte[] o = byte[8]; //output
LEA ecx, [o]
// ecx = the address of the output array/matrix
MOVQ xmm0, [m]
// xmm0 = 0|0|0|0|0|0|0|0|m[7]|m[6]|m[5]|m[4]|m[3]|m[2]|m[1]|m[0]
PMOVMSKB eax, xmm0
// eax = m[7][7]...m[0][7] the high bit of each byte
MOV [ecx+7], al
// o[7] is now the last column
PSLLW xmm0, 1
// shift 1 bit to the left
PMOVMSKB eax, xmm0
MOV [ecx+6], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+5], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+4], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+3], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+2], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx+1], al
PSLLW xmm0, 1
PMOVMSKB eax, xmm0
MOV [ecx], al

25 x86 opcodes/instructions than the acked for loop Solutions with 64 iterations.





相关问题
Fastest method for running a binary search on a file in C?

For example, let s say I want to find a particular word or number in a file. The contents are in sorted order (obviously). Since I want to run a binary search on the file, it seems like a real waste ...

Print possible strings created from a Number

Given a 10 digit Telephone Number, we have to print all possible strings created from that. The mapping of the numbers is the one as exactly on a phone s keypad. i.e. for 1,0-> No Letter for 2->...

Tips for debugging a made-for-linux application on windows?

I m trying to find the source of a bug I have found in an open-source application. I have managed to get a build up and running on my Windows machine, but I m having trouble finding the spot in the ...

Trying to split by two delimiters and it doesn t work - C

I wrote below code to readin line by line from stdin ex. city=Boston;city=New York;city=Chicago and then split each line by ; delimiter and print each record. Then in yet another loop I try to ...

Good, free, easy-to-use C graphics libraries? [closed]

I was wondering if there were any good free graphics libraries for C that are easy to use? It s for plotting 2d and 3d graphs and then saving to a file. It s on a Linux system and there s no gnuplot ...

Encoding, decoding an integer to a char array

Please note that this is not homework and i did search before starting this new thread. I got Store an int in a char array? I was looking for an answer but didn t get any satisfactory answer in the ...