wtorek, 29 grudnia 2015

Fast conversion of floating-point values to string

The conversion to string could be 15 times faster than sprintf. Read more...

niedziela, 27 grudnia 2015

Base64 encoding — implementation study

Although base64 encoding is a very basic algorithm, it could be sped up a little (25% sounds good?) Read more...

sobota, 26 grudnia 2015

Benefits from the obsession

Everything has started few years ago when I found John Regher's blog. If you don't know the blog I highly recommend it. Among other things (I like the photos!) the author studies bugs in compilers, undefined behaviours and similar things. The word "overflow" appears quite often in his posts due to the great number of errors related to improper use of the integer arithmetic. Well, I don't know when the obsession has exactly started, but recently I realized that I am alert of all integer operations in my programs. Read more...

sobota, 28 listopada 2015

Implicit conversion - the enemy

I wrote:

    result += string_utils::pad_left(string, '0');

I forget that pad_left signature is string, int, char and the char parameter has a default value. My mistake, without doubts.

This is another example of dark sides of the implicit conversions. C++ converts between characters and integers seamlessly. These two beast are distinct in the nature. Of course characters are represented by the numbers, however it's an implementation detail.

One can say: you made a mistake and now blame the language. No, I blame language's design. I'm afraid that we end up with something like Integer and int to overcome such problems.

Lesson learned: never use default parameters in public API (surprise!)

niedziela, 22 listopada 2015

Another C++ nasty feature

I'm fond of C++ weirdness, really. This language is full of traps, and it shocks me once in a while.

Let's look at this piece of code, a part of a larger module:

void validate_date() {

    // ...

    boost::optional<unsigned> clock_hour;
    boost::optional<unsigned> am_pm_clock;
    // ... fill these fields

    if (some sanity check failed) {
        report_error("user has entered wrong time: %d %s",
            *am_pm_clock ? "AM" : "PM");

We would expect that in case of an error following line will be reported: "user has entered wrong time: 123 PM". Obvious. But please look closer at the code, do you see any mistake? There is one... dirty... hard to notice. I'll give you a minute.

So, the mistake is lack of comma between expressions *clock_hour and *am_pm_clock. However, the code is valid! It compiles! And it took me a little longer than a minute to understand what happened. Explanation is:
  • *clock_hour evaluates to expression of type unsigned;
  • then compiler sees * - a multiplication operator;
  • so checks if multiplication of unsigned (on the left side) with boost::optional<unsigned> (on the right side) is possible;
  • it is, because boost::optional<T> has conversion operator to type T.
We can rewrite the whole expression, now it should be clear:

    ((*clock_hour) * unsigned(am_pm_clock)) ? "AM" : "PM"

In result method is called with a single parameter of type cont char*.

It's bizarre, it's terrible. A language should help a programmer. In my opinion implicit conversions is the worst feature of C++.

niedziela, 15 listopada 2015

Short report from code::dive 2015

Few days ago I attended code::dive 2015, an IT conference in Wrocław, Poland. It was a one-day conference with a great number of presentations. There were four stages and five sessions, in total 20 talks. Impressive number! But an attender had to choose his own path of just five lectures. I think the decision was pretty difficult. Sometimes less is better. Read more

środa, 15 lipca 2015

C++ magick

A programmer wrote:

class container;

class IndexOutOfBounds {
    IndexOutOfBounds(const std::string& msg);

void container::remove(int index) {

    if (index < 0 || index >= size()) {
        throw new IndexOutOfBounds("Invalid index: " + index);

    // the rest of method

Do you see the mistake? Programmer assumed that expression "Invalid index: " + index evaluates to std::string("Invalid index: <some number>").

In fact type of expression "Invalid index: " is char[15], so char[15] + integer results in --- more or less --- char*. For index in range [0, 15] exception will carry tail of the message; for example when index=10 then it will be "dex: ". But for indexes larger than 15 and less than 0 program likely crash.

This is why I hate C++, the language has many dark corners, stupid conventions, implicit conversion, not to mention UB ("just" 150 UB, if you're curious).

sobota, 20 czerwca 2015

Implementation of BT-trees

Great paper by Lars F. Bonnichsen, Christian W. Probst, Sven Karlsson:
This document presents the full implementation details of BT-trees, a highly efficient ordered map, and an evaluation which compares BT-trees with unordered maps. BT- trees are often much faster than other ordered maps, and have comparable performance to unordered map implementations. However, in benchmarks which favor unordered maps, BT-trees are not faster than the fastest unordered map implementations we know of.

Boolean function for the rescue

The problem is defined as follows: a set of features is saved using bit-sets (usually large), and there is a list/map/whatever of sets containing features of different objects. We have to find which features are unique. Read more...

środa, 10 czerwca 2015

Big progress in verification

Formal verification is not easy task, for example ComCert compiler is able to verify, that optimizations haven't modified semantic of a program. Paper Verified correctness and security of OpenSSL HMAC describes verification of the whole "stack":
We have proved, with machine-checked proofs in Coq, that an OpenSSL implementation of HMAC with SHA-256 correctly implements its FIPS functional specification and that its functional specification guarantees the expected cryptographic properties. This is the first machine-checked cryptographic proof that combines a source-program implementation proof, a compiler-correctness proof, and a cryptographic-security proof, with no gaps at the specification interfaces.

piątek, 22 maja 2015

Fast exact summation using small and large superaccumulators

Interesting article by Radford M. Neal:
I present two new methods for exactly summing a set of floating-point numbers, and then correctly rounding to the nearest floating-point number. Higher accuracy than simple summation (rounding after each addition) is important in many applications, such as finding the sample mean of data.

środa, 20 maja 2015

Optimizing Dijkstra for real-world performance

Another interesting paper:

Our experimental results currently put our prototype implementation at about twice as fast as the Boost implementation of the algorithm on both real-world and generated large graphs. Furthermore, this preliminary implementation was written in only a few weeks, by a single programmer. The fact that such an early prototype compares favorably against Boost, a well-known open source library developed by expert programmers, gives us reason to believe our design for the queue is indeed better suited to the problem at hand, and the favorable time measurements are not a product of any specific implementation technique we employed.

wtorek, 21 kwietnia 2015

The Influence of Malloc Placement on TSX Hardware Transactional Memory

Interesting paper:

We show that the placement policies of dynamic storage allocators -- such as those found in common "malloc" implementations -- can influence the L1 conflict miss rate in the L1. Conflict misses -- sometimes called mapping misses -- arise because of less than ideal associativity and represent imbalanced distribution of active memory blocks over the set of available L1 indices. Under transactional execution conflict misses may manifest as aborts, representing wasted or futile effort instead of a simple stall as would occur in normal execution mode.

niedziela, 19 kwietnia 2015

Conversion numbers to binary ASCII representation - new method

Recently I've checked different methods to convert numbers to binary representation, including use of new PDEP instruction from BMI2 extension.

Today I've updated the article with new SWAR version 2, a tricky use of multiplication. The method is not faster, but I like the approach---in certain conditions multiplication can be seen as multi-shift/bit-or instruction. I've already use multiplication in this way to emulate instruction pmovmskb.

poniedziałek, 13 kwietnia 2015

czwartek, 9 kwietnia 2015

Github repositories

I've put source code for my two articles at github:
Repositories contain original code, read: C99, 32-bit for GCC with inline assembly and also new programs in C++11 using intrinsics, tested in 64-bit environment.

BTW the article about popcount has gained popularity, and I hope another crazy idea about hacking MPSADBW will spread all over the world.

SIMD-ized searching in unique constant dictionary

The problem: there is a ordered dictionary containing only unique keys. Dictionary is read only, and keys are 32-bit (SSE) or 64-bit (AVX2). Read more

niedziela, 22 marca 2015

SIMD: detecting a bit pattern

The problem: there are 64-bit values with some data bits and some metadata bits; metadata includes a k-bit field describing a "type" (k >= 0). Type field is located in a lower 32-bits.

Procedure processes two "types", one denoted with code 3 and another with 5. When all items are of type 3 then we can use a fast AVX2 path, if there are some types 5, we have to call an additional function (a virtual method, to be precise). Read more ...

Compiler warnings are your future errors

Months ago I was asked to upgrade GCC from version 4.7 to 4.9 and also cleanup configure scripts. Not very exciting, merely time consuming task. Oh, we were hit by a bug in libstdc++, but simple patch have fixed the problem. Few weeks later I was asked to change GCC switch from -std=c++11 to -std=c++14 -- the easiest task in the world. I had to modify single script, run configure, type make, then run tests... everything was OK. Quite boring so far. Read more ...

AVX512: ternary functions evaluation

Intel's version of SIMD offers following 2-argument (binary) boolean functions: and, or, xor, and not. There isn't a single argument not, this function can be expressed with xor reg, ones, however this require additional, pre-set register.

AVX512F will come with very interesting instruction called vpternlog. Read more ...

sobota, 21 marca 2015

Not everything in AVX2 is 256-bit

AVX2 has added support for 256-bit arguments for many operations on packed integers, although not all. Some instructions accept the 256-bit registers, but operates on 128-bit lanes rather whole register.

There are three major groups of instructions: packing (narrowing conversion), unpacking (interleave) and permutations; below is a full list of instructions (with intrinsics):
  • valignr (_mm256_alignr_epi8)
  • vpslldq (_mm256_bslli_epi128)
  • vpsrldq (_mm256_bsrli_epi128)
  • vmpsadbw (_mm256_mpsadbw_epu8)
  • vpacksswb (_mm256_packs_epi16)
  • vpackssdw (_mm256_packs_epi32)
  • vpackuswb (_mm256_packus_epi16)
  • vpackusdw (_mm256_packus_epi32)
  • vperm2i128 (_mm256_permute2x128_si256)
  • vpermq (_mm256_permute4x64_epi64)
  • vpermpd (_mm256_permute4x64_pd)
  • vpshufd (_mm256_shuffle_epi32)
  • vpshufb (_mm256_shuffle_epi8)
  • vpshufhw (_mm256_shufflehi_epi16)
  • vpshuflw (_mm256_shufflelo_epi16)
  • vpslldq (_mm256_slli_si256)
  • vpsrldq (_mm256_srli_si256)
  • vpunpckhwd (_mm256_unpackhi_epi16)
  • vpunpckhdq (_mm256_unpackhi_epi32)
  • vpunpckhqdq (_mm256_unpackhi_epi64)
  • vpunpckhbw (_mm256_unpackhi_epi8)
  • vpunpcklwd (_mm256_unpacklo_epi16)
  • vpunpckldq (_mm256_unpacklo_epi32)
  • vpunpcklqdq (_mm256_unpacklo_epi64)
  • vpunpcklbw (_mm256_unpacklo_epi8)
For me the most surprising are packing instructions (vpack*) as they require additional shuffling (after or before the instruction) if we want to keep order of values. In some cases the order is crucial.

SSE: Generating mask where n leading (trailing) bytes are set

Informal specification:

__m128i mask_lower(const unsigned n) {
    assert(n < 16);
    switch (n) {
        case 0: return {0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 1: return {0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        case 2: return {0xff, 0xff, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00};
        // ...
        case 14: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0x00};
        case 15: return {0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
__m128i mask_higher(const unsigned n) {
    assert(n < 16);
    return ~mask_lower(15 - n);

Read more ...