wtorek, 23 grudnia 2014

Flipping Bits in Memory Without Accessing Them

Quote from paper's abstract:
[...] as DRAM process technology scales down to smaller dimensions, it becomes more difficult to prevent DRAM cells from electrically interacting with each other. In this paper, we expose the vulnerability of commodity DRAM chips to disturbance errors. By reading from the same address in DRAM, we show that it is possible to corrupt data in nearby addresses.
This is must-read paper.

niedziela, 30 listopada 2014

Conversion int and double to string - comparison

Milo Yip has compared different itoa and dtoa implementations on Core i7, including my itoa algorithm 2, that use SSE2 instructions.

Results for itoa are interesting: SSE2 version is not as good as it seemed to be. Tricky branchlut algorithm is only 10% slower, moreover is perfectly portable. One obvious drawback of this method is using lookup-table - in real environment where is a big pressure on cache, memory access could be a bottleneck.

sobota, 22 listopada 2014

Simple Testing Can Prevent Most Critical Failures

I recommend very interesting paper. Authors studied many errors in complicated distributed systems, like Cassandra, and found that majority of failures are caused by trivial errors (some of them can be detected even in unit tests). Here is very interesting quote:

almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software.

In my opinion causes of errors spotted in the study may apply to any kind of software.

niedziela, 16 listopada 2014

Speeding up searching in linked list

Sounds crazy, but it's possible in some cases. Here are experiments results - 3 times faster isn't so bad.

list                          :      0.780s, speedup  1.00
array list (4)                :      0.703s, speedup  1.11
array list (8)                :      0.515s, speedup  1.51
SIMD array list (4)           :      0.365s, speedup  2.14
SIMD array list (8)           :      0.258s, speedup  3.03

piątek, 14 listopada 2014

MSVC 2013 Release code

Today I had a quite long session with debugger and release code, you know: no debugger symbols and optimized code. I spotted two pieces of assembly that forced me to check if compilation mode was really set to release.

00000000000000D9  mov         byte ptr [rcx+1Fh],dl 
00000000000000DC  mov         byte ptr [rdx+rcx],0 
00000000000000E0  movzx       edx,byte ptr [rcx+1Fh] 
00000000000000E4  movzx       eax,dl 

First dl is stored in memory. It's ok. Then edx is used as an offset. Also ok. Seems that compiler knows that highest bits of edx are zeros i.e. edx = dl. Not so fast - edx is reloaded with it's original value and then eax is populated with the same value. These two movzx could be replaced with single mov eax, edx.

And another:

00000000000000BF  movaps      xmm6,xmmword ptr [foo] 
00000000000000C4  movdqa      xmmword ptr [foo],xmm6 

Yes, load & store the same value. Completely useless! Moreover, xmm6 isn't used in following code. (It's worth noting that load is done by an FP-unit and store by integer unit, inter-unit transfers cost one additional cycle on some processor.)

Above instruction sequences were produced by the newest compiler from Microsoft (MSVC 2013, update 3).

piątek, 26 września 2014

Fork bomb in Windows

It's really simple. Just install Cygwin, write fork bomb (even by accident) and finally run in bash - works fine. I had to restart my machine with restart button.

Intepolation search revisited

Interpolation search revisited -- theoretical complexity of interpolation search is very promising, but...

niedziela, 21 września 2014

Conversion number to hexadecimal representation

Conversion numbers to hexadecimal representation - SWAR, plain SSE, and draft of BMI2 implementation.

Article SSSE3: printing hex values describes the same topic but is limited to exploit PSHUFB.

czwartek, 18 września 2014

String literals are weird (at least in C/C++)

Simple quiz: what is the length of this string "\xbadcafe"?
  1. 5 letters
  2. 4 letters
  3. 2 letters
  4. 1 letter
The correct answer is d, i.e. single letter. In C/C++ escape sequence "\x..." consumes all hexadecimal digits.

czwartek, 11 września 2014

Python: rename file in Windows

Observation: function os.remove under Windows doesn't allow to overwrite an existing file. Python from Cygwin works properly.

Here is a workaround:

def rename_file(old, new):
    if sys.platform != 'win32':
        os.rename(new, old)
    else:
        os.remove(old)
        os.rename(new, old)

Conversion numbers to binary representation

New article Conversion numbers to binary representation — SIMD & SWAR versions.

Few years ago I've described an MMX variant of SIMD algorithm. But the text was in Polish, so audience was limited.

sobota, 22 marca 2014

C++ bitset vs array

C++ bitset conserves a memory, but at cost of speed access. Bitset must be slower than set represented as a plain old array, at least when sets are small (say few hundred elements).

Lets look at this simple functions:

// set_test.cpp 
#include <stdint.h>
#include <bitset>

const int size = 128;

typedef uint8_t byte_set[size];

bool any_in_byteset(uint8_t* data, size_t size, byte_set set) {
 for (auto i=0u; i < size; i++)
  if (set[data[i]])
   return true;
 
 return false;
}


typedef std::bitset<size> bit_set;

bool any_in_bitset(uint8_t* data, size_t size, bit_set set) {
 for (auto i=0u; i < size; i++)
  if (set[data[i]])
   return true;
 
 return false;
}

The file was compiled with g++ -std=c++11 -O3 set_test.cpp; Assembly code of the core of any_in_byteset:

  28: 0f b6 10              movzbl (%eax),%edx
  2b: 83 c0 01              add    $0x1,%eax
  2e: 80 3c 11 00           cmpb   $0x0,(%ecx,%edx,1)
  32: 75 0c                 jne    40
  34: 39 d8                 cmp    %ebx,%eax
  36: 75 f0                 jne    28
Statement if (set[data[i]]) return true are lines 28, 2e and 32, i.e.: load from memory, compare and jump. Instructions 2b, 34 and 36 handles the for loop.

Now, look at assembly code of any_in_bitset:

  5f: 0f b6 13              movzbl (%ebx),%edx
  62: b8 01 00 00 00        mov    $0x1,%eax
  67: 89 d1                 mov    %edx,%ecx
  69: 83 e1 1f              and    $0x1f,%ecx
  6c: c1 ea 05              shr    $0x5,%edx
  6f: d3 e0                 shl    %cl,%eax
  71: 85 44 94 18           test   %eax,0x18(%esp,%edx,4)
  75: 75 39                 jne    b0

All these instructions implements the if statement! Again we have load from memory (5f), but checking which bit is set require much more work. Input (edx) is split to lower part --- i.e. bit number (67, 6c) and higher part --- i.e. word index (6c). Last step is to check if a bit is set in a word --- GCC used variable shift left (6f), but x86 has BT instruction, so in a perfect code we would have 2 instructions less.

However, as we see simple access in the bitset is much more complicated than simple memory fetch from byteset. For short sets memory fetches are well cached and smaller number of instruction improves performance. For really large set cache misses would kill performance, then bitset is much better choice.

środa, 19 marca 2014

Is const-correctness paranoia good?

Yes, definitely. Lets see this simple example:
$ cat test.cpp
int test(int x) {
 if (x = 1)
  return 42;
 else
  return 0;
}
$ g++ -c test.cpp
$ g++ -c -Wall test.cpp
int test(int x) {
 if (x = 1)
  return 42;
 else
  return 0;
}
Only when we turn on warnings, compiler tell us about a possible error. Making the parameter const shows us error:
$ cat test2.cpp
int test(int x) {
 if (x = 1)
  return 42;
 else
  return 0;
}
$ g++ -c test.cpp
test2.cpp: In function ‘int test(int)’:
test2.cpp:2:8: error: assignment of read-only parameter ‘x’
  if (x = 1)
        ^
All input parameters should be const, all write-once variables serving as a parameters for some computations should be also const.

Quick and dirty ad-hoc git hosting

Recently I needed to synchronize my local repository with a remote machine, just for full backup. It's really simple if you have standard Linux tools (Cygwin works too, of course).

1. in a working directory run:
$ pwd
/home/foo/project
$ git update-server-info

2. in a parent directory start HTTP server:
$ cd ..
$ pwd
/home/foo
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...


3. on a remote machine clone/pull/whatever:
$ git clone http://your_ip:8000/project/.git
Step 1 have to be executed manually when local repository has changed.

niedziela, 16 marca 2014

Scalar version of SSE move mask instruction

SSE instruction PMOVMSKB gathers all most significant bits from bytes and stores them as a single 16-bit value; similar action is performed by MOVMSKPD and MOVMSKPS.

Such operation could be easily done using scalar multiplication. Read more ...

Asymmetric numeral systems

wtorek, 11 marca 2014

C++ standard inaccuracy

First we read:
21.4.1 basic_string general requirements [string.require]
[...]
3 No erase() or pop_back() member function shall throw any exceptions.
... few pages later:

21.4.6.5 basic_string::erase [string::erase]
basic_string<charT,traits,
Allocator>& erase(size_type pos = 0, size_type n = npos);
[...]
2 Throws: out_of_range if pos > size().

SIMD-friendly Rabin-Karp modification

Rabin-Karp algorithm uses a weak hash function to locate possible substring positions. This modification uses merely equality of first and last char of searched substring, however equality of chars can be done very fast in parallel, even without SIMD instruction. Read more ...

niedziela, 9 marca 2014

GCC - asm goto

Starting from GCC 4.5 the asm statement has new form: asm goto. Programmer can use any labels from C/C++ program, however output from this block is not allowed.

Using asm block in old form required more instructions and additional storage:
 uint32_t bit_set;
 asm (
  "bt %2, %%eax  \n"
  "setc %%al  \n"
  "movzx %%al, %%eax \n"
  : "=a" (bit_set)
  : "a" (number), "r" (bit)
  : "cc"
 );

 if (bit_set)
  goto has_bit;
Above code is compiled to:
 80483f6:       0f a3 d8                bt     %ebx,%eax
 80483f9:       0f 92 c0                setb   %al
 80483fc:       0f b6 c0                movzbl %al,%eax
 80483ff:       85 c0                   test   %eax,%eax
 8048401:       74 16                   je     8048419 
With asm goto the same task could be writting shorter and easier:
 asm goto (
  "bt %1, %0 \n"
  "jc %l[has_bit] \n"

  : /* no output */
  : "r" (number), "r" (bit)
  : "cc"
  : has_bit // name of label
 );
Complete demo is available. See also GCC documentation about Extended Asm.

Integer log 10 of unsigned integer - SIMD version

Fast calculate ceil(log10(x)) of some unsigned number is described on Bit Twiddling Hacks, this text show the SIMD solution for 32-bit numbers.

Algorithm:

1. populate value in XMM registers. Since maximum value of this function is 10 we need three registers.
movd   %eax, %xmm0          // xmm0 = packed_dword(0, 0, 0, x)
pshufd $0, %xmm0, %xmm0 \n" // xmm0 = packed_dword(x, x, x, x)
movapd %xmm0, %xmm1
movapd %xmm0, %xmm2
2. compare these numbers with sequence of powers of 10.
// powers_a = packed_dword(10^1 - 1, 10^2 - 1, 10^3 - 1, 10^4 - 1)
// powers_c = packed_dword(10^5 - 1, 10^6 - 1, 10^7 - 1, 10^8 - 1)
// powers_c = packed_dword(10^9 - 1, 0, 0, 0)
pcmpgtd powers_a, %xmm0
pcmpgtd powers_b, %xmm1
pcmpgtd powers_c, %xmm2
result of comparisons are: 0 (false) or -1 (true), for example:
xmm0 = packed_dword(-1, -1, -1, -1)
xmm1 = packed_dword( 0, 0, -1, -1)
xmm2 = packed_dword( 0, 0, 0, 0)
3. calculate sum of all dwords
psrld $31, %xmm0       // xmm0 = packed_dword( 1, 1, 1, 1) - convert -1 to 1
psubd %xmm1, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)
psubd %xmm2, %xmm0     // xmm0 = packed_dword( 1, 1, 2, 2)

// convert packed_dword to packed_word
pxor %xmm1, %xmm1
packssdw %xmm1, %xmm0 // xmm0 = packed_word(0, 0, 0, 0, 1, 1, 2, 2)

// max value of word in xmm0 is 3, so higher
// bytes are always zero
psadbw %xmm1, %xmm0   // xmm0 = packded_qword(0, 6)
4. save result, i.e. the lowest dword
movd %xmm0, %eax      // eax = 6

Sample program is available.

Mask for zero/non-zero bytes

The description of Determine if a word has a zero byte from "Bit Twiddling Hacks" says about haszero(v): "the result is the high bits set where the bytes in v were zero".

Unfortunately this is not true. High bits are also set for ones followed zeros, i.e. haszero(0xff010100) = 0x00808080. Of course the result is still valid (non-zero if there were any zero byte), but if we want to iterate over all zeros or find last zero index, this could be a problem.

It's possible to create exact mask:
uint32_t nonzeromask(const uint32_t v) {
    // MSB are set if any of 7 lowest bits are set
    const uint32_t nonzero_7bit = (v & 0x7f7f7f7f) + 0x7f7f7f7f;

    return (v | nonzero_7bit) & 0x80808080;
}

uint32_t zeromask(const uint32_t v) {
     // negate MSBs
     return (nonzeromask(v)) ^ 0x80808080;
}

Function nonzeromask requires 4 simple instructions, and zeromask one additional xor.

poniedziałek, 3 marca 2014

Slow-paths in GNU libc strstr

I've observed that some patterns issued to strstr cause significant slowdown.

Sample program kill-strstr.c executes strstr(data, pattern), where data is a large string (16MB) filled with character ?; patterns are read from command line.

On my machine following times were recorded:

1. searching string 'johndoe'...
        time: 0.032
2. searching string '??????????????????a'...
        time: 0.050
3. searching string '??????????????????????????????a'...
        time: 0.049
4. searching string '???????????????????????????????a'...
        time: 0.274
5. searching string '??????????????????????????????a?'...
        time: 0.356
6. searching string '??????????????????????????????a??????????????????????????????'...
        time: 0.396
  • Slowdown is visible in case 4 (5 times slower than pattern 3). Pattern has 32 characters, and contains '?', except last char.
  • Even bigger slowdown occurs in case 5 (7 times slower). This pattern also contains 32 chars, but position of the single letter 'a' is last but one.
  • Similar slowdown occurs in case 5 (nearly 8 times slower). In this pattern single letter 'a' is surrounded by 30 '?'.

niedziela, 26 stycznia 2014

Penalties of errors in SSE floating point calculations

SSE provides not widely known control register, called MXCSR. This register plays three roles:
  1. controls calculations:
    • flag "flush to zero" (described later)
    • flag "denormals are zeros" (described later)
    • rounding mode (not covered in this text)
  2. allow to mask/unmask floating-point exceptions
  3. save information about floating-point errors - these flags are sticky, i.e. the programmer is responsible for clearing them.
Floating point errors may cause significant slowdown, some of the flags can silence errors makes a program faster. Read more ...

środa, 1 stycznia 2014

x86 - ISA where 80% of instructons are unimportant


Few years ago I counted instructions from many Linux binaries --- 2014 is good year to repeated this experiment and see that nothing has changed.

I use 32-bit Debian, my installation has been updated few months ago. All files from /usr/bin and all *.so files from /usr/lib was disassembled with objdump (5050 files were processed). Instructions were grouped simply by mnemonic name, taking into account all addressing and encoding modes would be overkill. I've published script that does the job.

Short summary

  • Number of distinct decoded instructions is around 650. Since objdump use AT&T syntax, same opcode is seen under different mnemonics, for example mov is saved as movw, movb, movl depending on argument size.
  • Total number of x86 instructions is around 750. Read: one hundred instructions never appeared in the binaries.
  • There are 81 instructions used just once. For example quite useful CMPPD.
  • There are 22 instructions used twice. For example MFENCE --- no one aware of memory ordering?
  • There are 15 instructions used three times. For example BTC, but bit manipulating operations are useless.
  • 81 plus 22 plus 15 is 118. Another hundred of useless stuff.
Lets look at top 15 rows from detailed results, i.e. instruction with frequency grater than 1%:
  • The total count of these instructions is 87.84% of all instructions (almost all, isn't it?).
  • The most frequently used instruction is data transfer (mov/movl) --- 42%
  • Control flow instructions (call/ret/jmp) --- 13%.
  • Conditions (cmp/test/condition jumps: je/jne) --- 10%.
  • Basic arithmetic (add/sub/lea) --- 12%
  • Simple stack operations (push/pop) --- 6%

Very interesting observation is that conditions are mostly based on je/jne, i.e. jump if zero/jump if not zero.

First FPU instruction appear at 28-th position. First integer SSE appear at 167-th position. First SSE instruction operating on packed floats appear at 315-th position.

Detailed results

Whole table as txt file.

instructioncount%
mov593409837.63%
call14143558.97%
lea10715016.79%
movl7606774.82%
push6559214.16%
jmp6115403.88%
add5605173.55%
je4902503.11%
test4758993.02%
pop4416082.80%
sub3662282.32%
cmp3263792.07%
jne2641101.67%
nop2423561.54%
ret2385691.51%
xor1481940.94%
movzbl1227300.78%
and888630.56%
xchg668850.42%
cmpl649070.41%
movzwl645890.41%
movb572470.36%
or521380.33%
shl509080.32%
cmpb501520.32%
jle410830.26%
leave399230.25%
fldl374280.24%
fstpl373680.24%
shr365030.23%
jbe328660.21%
ja323330.21%
sar309170.20%
flds296720.19%
subl276360.18%
setne276260.18%
testb274200.17%
addl259060.16%
imul255690.16%
jg247960.16%
fstp243490.15%
fxch234640.15%
js215500.14%
fstps212480.13%
sbb166070.11%
inc162000.10%
lock160490.10%
jae148250.09%
sahf147650.09%
dec142760.09%
fnstsw140260.09%
sete139020.09%
movw138950.09%
adc136400.09%
jb124670.08%
jl117000.07%
repz111780.07%
fldcw111100.07%
jge110190.07%
movswl108160.07%
fildl88520.06%
cmpw76010.05%
jns74900.05%
fldz73310.05%
fmul72290.05%
out72030.05%
not70280.04%
movsbl67200.04%
in65030.04%
fld63090.04%
faddp62540.04%
fstl57600.04%
fucom57530.04%
neg57250.04%
fucompp53540.03%
rep50590.03%
fmuls50390.03%
pushl44300.03%
jp44240.03%
fnstcw44000.03%
fld141760.03%
fmulp41330.03%
orl39270.02%
fadds37890.02%
movq37790.02%
fistpl37090.02%
cltd35970.02%
fmull33130.02%
stos32980.02%
lret31830.02%
scas31030.02%
lods30660.02%
cwtl30640.02%
fadd28520.02%
fucomp26780.02%
orb24810.02%
fildll24180.02%
andl23790.02%
setb23370.01%
andb22630.01%
552 rows more...