sobota, 21 marca 2015

Not everything in AVX2 is 256-bit

AVX2 has added support for 256-bit arguments for many operations on packed integers, although not all. Some instructions accept the 256-bit registers, but operates on 128-bit lanes rather whole register.

There are three major groups of instructions: packing (narrowing conversion), unpacking (interleave) and permutations; below is a full list of instructions (with intrinsics):
  • valignr (_mm256_alignr_epi8)
  • vpslldq (_mm256_bslli_epi128)
  • vpsrldq (_mm256_bsrli_epi128)
  • vmpsadbw (_mm256_mpsadbw_epu8)
  • vpacksswb (_mm256_packs_epi16)
  • vpackssdw (_mm256_packs_epi32)
  • vpackuswb (_mm256_packus_epi16)
  • vpackusdw (_mm256_packus_epi32)
  • vperm2i128 (_mm256_permute2x128_si256)
  • vpermq (_mm256_permute4x64_epi64)
  • vpermpd (_mm256_permute4x64_pd)
  • vpshufd (_mm256_shuffle_epi32)
  • vpshufb (_mm256_shuffle_epi8)
  • vpshufhw (_mm256_shufflehi_epi16)
  • vpshuflw (_mm256_shufflelo_epi16)
  • vpslldq (_mm256_slli_si256)
  • vpsrldq (_mm256_srli_si256)
  • vpunpckhwd (_mm256_unpackhi_epi16)
  • vpunpckhdq (_mm256_unpackhi_epi32)
  • vpunpckhqdq (_mm256_unpackhi_epi64)
  • vpunpckhbw (_mm256_unpackhi_epi8)
  • vpunpcklwd (_mm256_unpacklo_epi16)
  • vpunpckldq (_mm256_unpacklo_epi32)
  • vpunpcklqdq (_mm256_unpacklo_epi64)
  • vpunpcklbw (_mm256_unpacklo_epi8)
For me the most surprising are packing instructions (vpack*) as they require additional shuffling (after or before the instruction) if we want to keep order of values. In some cases the order is crucial.

SSE: Generating mask where n leading (trailing) bytes are set

Informal specification:

__m128i mask_lower(const unsigned n) {
    
    assert(n < 16);
    switch (n) {
        case 0: return {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 1: return {0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 2: return {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        // ...
        case 14: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00};
        case 15: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    }
}
 
__m128i mask_higher(const unsigned n) {
    
    assert(n < 16);
    return ~mask_lower(15 - n);
}

Read more ...