Sunday, March 22, 2015

SIMD: detecting a bit pattern

The problem: there are 64-bit values with some data bits and some metadata bits; the metadata includes a k-bit field describing a "type" (k > 0). The type field is located in the lower 32 bits.

The procedure processes two "types": one denoted with code 3, the other with code 5. When all items are of type 3, we can use a fast AVX2 path; if there are any items of type 5, we have to call an additional function (a virtual method, to be precise). Read more ...
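The excerpt doesn't show the detection code itself; below is a minimal sketch of the idea, assuming a hypothetical layout where the type field occupies the lowest k bits of each item: mask the field out, compare four items against the code 3 at once, and inspect the resulting byte mask.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Returns 1 when every item is of type 3 (sketch: n is assumed to be
   a multiple of 4 and k less than 32). */
int all_items_type3(const uint64_t* items, size_t n, unsigned k) {
    const __m256i field = _mm256_set1_epi64x(((uint64_t)1 << k) - 1);
    const __m256i type3 = _mm256_set1_epi64x(3);
    for (size_t i = 0; i < n; i += 4) {
        const __m256i v  = _mm256_loadu_si256((const __m256i*)(items + i));
        const __m256i t  = _mm256_and_si256(v, field);   // extract the type field
        const __m256i eq = _mm256_cmpeq_epi64(t, type3); // all-ones where type == 3
        if (_mm256_movemask_epi8(eq) != -1)              // some item is not type 3
            return 0;
    }
    return 1;
}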

Compiler warnings are your future errors

Months ago I was asked to upgrade GCC from version 4.7 to 4.9 and also to clean up configure scripts. Not a very exciting task, merely a time-consuming one. Oh, we were hit by a bug in libstdc++, but a simple patch fixed the problem. A few weeks later I was asked to change the GCC switch from -std=c++11 to -std=c++14 -- the easiest task in the world. I had to modify a single script, run configure, type make, then run the tests... everything was OK. Quite boring so far. Read more ...

AVX512: ternary functions evaluation

Intel's version of SIMD offers the following two-argument (binary) boolean functions: and, or, xor, and andnot. There isn't a single-argument not; this function can be expressed as xor reg, ones, but this requires an additional, pre-set register.
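In terms of intrinsics the trick looks like this (a sketch; the all-ones constant still has to live in a register):

#include <emmintrin.h>

// not(x) = x xor all-ones; there is no dedicated SSE "not" instruction.
__m128i bitwise_not(const __m128i x) {
    return _mm_xor_si128(x, _mm_set1_epi32(-1));
}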

AVX512F will come with a very interesting instruction called vpternlog. Read more ...
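For a taste of what vpternlog can do, here is a sketch using the AVX512F intrinsic _mm512_ternarylogic_epi32: the immediate is an 8-bit truth table indexed by the bit triple (a, b, c), so a single instruction can evaluate any ternary boolean function.

#include <immintrin.h>

// The truth table 0xca encodes (a and b) or (not a and c), i.e. a bitwise
// select: take bits of b where a is set, bits of c elsewhere.
__m512i bitwise_select(const __m512i a, const __m512i b, const __m512i c) {
    return _mm512_ternarylogic_epi32(a, b, c, 0xca);
}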

Saturday, March 21, 2015

Not everything in AVX2 is 256-bit

AVX2 added support for 256-bit arguments to many operations on packed integers, although not all. Some instructions accept 256-bit registers, but operate on 128-bit lanes rather than on the whole register.

There are three major groups of instructions: packing (narrowing conversion), unpacking (interleave) and permutations; below is a full list of instructions (with intrinsics):
  • vpalignr (_mm256_alignr_epi8)
  • vpslldq (_mm256_bslli_epi128)
  • vpsrldq (_mm256_bsrli_epi128)
  • vmpsadbw (_mm256_mpsadbw_epu8)
  • vpacksswb (_mm256_packs_epi16)
  • vpackssdw (_mm256_packs_epi32)
  • vpackuswb (_mm256_packus_epi16)
  • vpackusdw (_mm256_packus_epi32)
  • vperm2i128 (_mm256_permute2x128_si256)
  • vpermq (_mm256_permute4x64_epi64)
  • vpermpd (_mm256_permute4x64_pd)
  • vpshufd (_mm256_shuffle_epi32)
  • vpshufb (_mm256_shuffle_epi8)
  • vpshufhw (_mm256_shufflehi_epi16)
  • vpshuflw (_mm256_shufflelo_epi16)
  • vpslldq (_mm256_slli_si256)
  • vpsrldq (_mm256_srli_si256)
  • vpunpckhwd (_mm256_unpackhi_epi16)
  • vpunpckhdq (_mm256_unpackhi_epi32)
  • vpunpckhqdq (_mm256_unpackhi_epi64)
  • vpunpckhbw (_mm256_unpackhi_epi8)
  • vpunpcklwd (_mm256_unpacklo_epi16)
  • vpunpckldq (_mm256_unpacklo_epi32)
  • vpunpcklqdq (_mm256_unpacklo_epi64)
  • vpunpcklbw (_mm256_unpacklo_epi8)
For me the most surprising are the packing instructions (vpack*), as they require additional shuffling (before or after the instruction) if we want to keep the order of values. In some cases the order is crucial.
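For example (a sketch): vpackuswb packs each 128-bit lane independently, so the result comes out as [lo(a), lo(b), hi(a), hi(b)] in 64-bit chunks, and a cross-lane vpermq restores the natural order.

#include <immintrin.h>

// Pack signed 16-bit values from a and b into unsigned bytes, preserving
// the order: all bytes from a first, then all bytes from b.
__m256i packus_epi16_ordered(const __m256i a, const __m256i b) {
    const __m256i packed = _mm256_packus_epi16(a, b); // per lane: [a0, b0, a1, b1]
    return _mm256_permute4x64_epi64(packed, 0xd8);    // reorder to [a0, a1, b0, b1]
}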

SSE: Generating mask where n leading (trailing) bytes are set

Informal specification:

__m128i mask_lower(const unsigned n) {

    assert(n <= 16);
    switch (n) {
        case 0:  return {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 1:  return {0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 2:  return {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        // ...
        case 15: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00};
        case 16: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    }
}

__m128i mask_higher(const unsigned n) {

    assert(n <= 16);
    return ~mask_lower(16 - n);
}
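One way to actually implement this (a sketch, not necessarily the method from the full article) is an unaligned load from a sliding 16-byte window over a constant 32-byte table:

#include <emmintrin.h>
#include <stdint.h>
#include <assert.h>

static const uint8_t ones_then_zeros[32] = {
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
};

// Shifting the window left by n bytes puts exactly n 0xff bytes
// at the beginning of the loaded vector.
__m128i mask_lower(const unsigned n) {
    assert(n <= 16);
    return _mm_loadu_si128((const __m128i*)(ones_then_zeros + 16 - n));
}

// The complement of a (16 - n)-byte lower mask has the n highest bytes set.
__m128i mask_higher(const unsigned n) {
    assert(n <= 16);
    return _mm_xor_si128(mask_lower(16 - n), _mm_set1_epi8(-1));
}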

Read more ...