niedziela, 9 marca 2014

GCC - asm goto

Starting from GCC 4.5 the asm statement has new form: asm goto. Programmer can use any labels from C/C++ program, however output from this block is not allowed.

Using asm block in old form required more instructions and additional storage:
 uint32_t bit_set;
 asm (
  "bt %2, %%eax  \n"
  "setc %%al  \n"
  "movzx %%al, %%eax \n"
  : "=a" (bit_set)
  : "a" (number), "r" (bit)
  : "cc"
 );

 if (bit_set)
  goto has_bit;
Above code is compiled to:
 80483f6:       0f a3 d8                bt     %ebx,%eax
 80483f9:       0f 92 c0                setb   %al
 80483fc:       0f b6 c0                movzbl %al,%eax
 80483ff:       85 c0                   test   %eax,%eax
 8048401:       74 16                   je     8048419 
With asm goto the same task could be writting shorter and easier:
 asm goto (
  "bt %1, %0 \n"
  "jc %l[has_bit] \n"

  : /* no output */
  : "r" (number), "r" (bit)
  : "cc"
  : has_bit // name of label
 );
Complete demo is available. See also GCC documentation about Extended Asm.

Integer log 10 of unsigned integer - SIMD version

Fast calculate ceil(log10(x)) of some unsigned number is described on Bit Twiddling Hacks, this text show the SIMD solution for 32-bit numbers.

Algorithm:

1. populate value in XMM registers. Since maximum value of this function is 10 we need three registers.
movd   %eax, %xmm0          // xmm0 = packed_dword(0, 0, 0, x)
pshufd $0, %xmm0, %xmm0 \n" // xmm0 = packed_dword(x, x, x, x)
movapd %xmm0, %xmm1
movapd %xmm0, %xmm2
2. compare these numbers with sequence of powers of 10.
// powers_a = packed_dword(10^1 - 1, 10^2 - 1, 10^3 - 1, 10^4 - 1)
// powers_c = packed_dword(10^5 - 1, 10^6 - 1, 10^7 - 1, 10^8 - 1)
// powers_c = packed_dword(10^9 - 1, 0, 0, 0)
pcmpgtd powers_a, %xmm0
pcmpgtd powers_b, %xmm1
pcmpgtd powers_c, %xmm2
result of comparisons are: 0 (false) or -1 (true), for example:
xmm0 = packed_dword(-1, -1, -1, -1)
xmm1 = packed_dword( 0, 0, -1, -1)
xmm2 = packed_dword( 0, 0, 0, 0)
3. calculate sum of all dwords
psrld $31, %xmm0       // xmm0 = packed_dword( 1, 1, 1, 1) - convert -1 to 1
psubd %xmm1, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)
psubd %xmm2, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)

// convert packed_dword to packed_word
pxor %xmm1, %xmm1
packssdw %xmm1, %xmm0 // xmm0 = packed_word(0, 0, 0, 0, 1, 1, 2, 2)

// max value of word in xmm0 is 3, so higher
// bytes are always zero
psadbw %xmm1, %xmm0   // xmm0 = packded_qword(0, 6)
4. save result, i.e. the lowest dword
movd %xmm0, %eax      // eax = 6

Sample program is available.

Mask for zero/non-zero bytes

The description of Determine if a word has a zero byte from "Bit Twiddling Hacks" says about haszero(v): "the result is the high bits set where the bytes in v were zero".

Unfortunately this is not true. High bits are also set for ones followed zeros, i.e. haszero(0xff010100) = 0x00808080. Of course the result is still valid (non-zero if there were any zero byte), but if we want to iterate over all zeros or find last zero index, this could be a problem.

It's possible to create exact mask:
uint32_t nonzeromask(const uint32_t v) {
    // MSB are set if any of 7 lowest bits are set
    const uint32_t nonzero_7bit = (v & 0x7f7f7f7f) + 0x7f7f7f7f;

    return (v | nonzero_7bit) & 0x80808080;
}

uint32_t zeromask(const uint32_t v) {
     // negate MSBs
     return (nonzeromask(v)) ^ 0x80808080;
}

Function nonzeromask requires 4 simple instructions, and zeromask one additional xor.