niedziela, 10 października 2010

PIC Language

Yesterday I made my first diagram with PIC (see Wikipedia) and... I felt in love. Language is just perfect, intuitive and powerful. Try it and you will never use any popular drawing software.

wtorek, 28 września 2010

Xorg VESA driver - higher resolutions

Default configuration of VESA driver allows resolutions 640x480 and 800x600.
It is quite simple to get higher resolutions - in section Monitor of /etc/X11/xorg.conf we have to set horizontal and vertical refresh rates. By default these values are set to lowest, safe area.

Here are my settings:

Section "Monitor"
        Identifier   "Monitor0"
        VendorName   "Monitor Vendor"
        ModelName    "Monitor Model"

        HorizSync    42.0 - 80.0
        VertRefresh  55.0 - 100.0 
EndSection

X.org is able to work at 1280x1024 with vertical refresh rate 75Hz. HTH

wtorek, 24 sierpnia 2010

PostgrSQL: printf in PL/pgSQL

PostgreSQL wiki has entry about sprintf - is is quite simple approach (and isn't marked as immutable), the main drawback is iterating over all chars of format string. Here is a version use strpos to locate % in format string, and it's faster around 2 times:

CREATE OR REPLACE FUNCTION printf2(fmt text, variadic args anyarray) RETURNS text
LANGUAGE plpgsql IMMUTABLE AS $$
   DECLARE
      argcnt  int  := 1;
      head    text := '';     -- result
      tail    text := fmt;    -- unprocessed part
      k       int;
   BEGIN
      LOOP
         k := strpos(tail, '%');
         IF k = 0 THEN
            -- no more '%'
            head := head || tail;
            EXIT;
         ELSE
            IF substring(tail, k+1, 1) = '%' THEN
               -- escape sequence '%%'
               head := head || substring(tail, 1, k);
               tail := substring(tail, k+2);
            ELSE
               -- insert argument
               head := head || substring(tail, 1, k-1) || COALESCE(args[argcnt]::text, '');
               tail := substring(tail, k+1);
               argcnt := argcnt + 1;
            END IF;
         END IF;
      END LOOP;
   RETURN head;
END;
$$;

sobota, 17 lipca 2010

SSSE3 population count vs hardware

Peter Kankowski compared speed of SSE4.2 instructions crc32 and popcnt against software implementations. Hardware CRC32 is significantly faster, but population count is slightly slower than my SSSE3 popcount!

niedziela, 9 maja 2010

Branchless set mask if value greater or how to print hex values

Suppose we need to get mask when nonnegative argument is greater then some constant value; in other words, we want to evaluate following expression:

if x > const_n then
   mask := 0xffffffff;
else
   mask := 0x00000000;

Portable branchless solution:
  • choose magic number M := (1 << (k-1)) - 1 - n, where k is a bit position, for example 31 if we operate on 32-bit words
  • calculate R := x + M
  • k-th bit of R is set if x > n
  • fill mask with this bit - see note Fill word with selected bit

The key to understand this trick is binary form of M: 0111..1111zzzz, where z is 0 or 1 depending on n value. When x is greater then n, then x + M has form 1000..000zzzz, because carry bit propagate through series of ones to k-th position of result.

Real world example - branchless converting hex digit to ASCII (M=0x7ffffff6 for k=31 and n=9).

; input:    eax - hex digit
; output:   eax - ASCII letter (0-9, A-F or a-f)
; destroys: ebx

        andl 0xf, %eax
        leal 0x7ffffff6(%eax), %ebx     ; MSB(ebx)=1 when eax >= 10
        sarl $31, %ebx                  ; ebx - mask
        andl  $7, %ebx                  ; ebx = 7 when eax >= 10 (for A-F letters)
        ;andl $39, %ebx                 ; ebx = 39 when eax >= 10 (for a-f letters)
        leal '0'(%eax, %ebx), %eax      ; eax = '0' + eax + ebx => ASCII letter

It is also possible to convert 4 hex digits in parallel using similar algorithm, but input data have to be correctly prepared. Moreover generating mask requires 3 instructon and one extra register (in scalar version just one arithmetic shift). I guess it wont be fast on x86, maybe this approach would be good for SIMD code, where similar code transforms more bytes at once.

; input: eax - four hex digits in form [0a0b0c0d]
; output: eax - four ascii letters
; destroys: ebx, ecx

        leal 0x76767676(%eax), %ebx        ; MSB of each byte is set when corresponding eax byte is >= 10
                                           ; (here: 0x7f - 9 = 0x76)
        andl $0x80808080, %ebx
        movl %ebx, %ecx
        shrl    $7, %ebx
        subl %ebx, %ecx                    ; ecx - byte-wise mask
        ;andl $0x07070707, %ecx            ; for ASCII letters A-F
        andl $0x27272727, %ecx             ; for ASCII letters a-f
        leal 0x30303030(%eax, %ecx), %eax  ; ecx - four ascii letters


See also: SSSE3: printing hex values (weird use of PSHUFB instruction)

sobota, 1 maja 2010

Speedup reversing table of bytes

With help of BSWAP instruction or SSE instructions (PSHUFD, PSHUFLW, PSHUFHW) or SSSE3 instruction (PSHUFB) reversing table can be faster. Speedup depends on three factors:
  • table size: larger=faster
  • table address: aligned=faster/much faster (15.5 speedup - possible! see chart)
  • CPU type

Read full article

niedziela, 11 kwietnia 2010

Determining if an integer is a power of 2

Method from Bit Twiddling Hacks: (x != 0) && (x & (x-1) == 0). GCC compiles this to following code:
; input/ouput: eax
; destroys: ebx

        test    %eax,  %eax     ; x == 0?
        jz      1f

        leal -1(%eax), %ebx     ; ebx := x-1
        test    %eax,  %ebx     ; ZF  := (eax & ebx == 0)

        setz     %al
        movzx    %al, %eax       ; eax := ZF
1:

We can use also BSF and BSR instructions, that determines position of first and last bit=1. If number is power of 2, then just one bit is set, and thus these positions are equal. BSx sets also ZF flag if input value is zero.

; input/output: eax
; destroys: ebx, edx

        bsf     %eax, %ebx      ; ebx := LSB's position if eax != 0, ZF = 1 if eax = 0
        jz      1f
        bsr     %eax, %edx      ; edx := MSB's position

        cmp     %ebx, %edx      ; ZF  := (ebx = edx)

        setz    %al
        movzx   %al, %eax       ; eax := ZF
1:

czwartek, 8 kwietnia 2010

Brenchless conditional exchange

Suppose we have to exchange (or just move) two registers A and B:
  1. C := A xor B
  2. C := 0 if condition is not true
  3. A := A xor C
  4. B := B xor C
If C is 0, then  A and B left unchanged, else A and B are swapped. If only conditional move from B to A is needed, then step 4th have to be skipped.

Here is a sample x86 code, where condition is value of CF:
sbb edx, edx ; part of step 2. - edx = 0xffffff if CF=1, 0x000000 otherwise
mov ecx, eax
xor ecx, ebx ; step 1
and ecx, edx ; completed step 2. - now C is 0 or (A xor B)
xor eax, ecx ; step 3
xor ebx, ecx ; step 4

Branchless moves are possible in Pentium Pro and higher with instructions cmovcc.

See also XOR linked list.

sobota, 3 kwietnia 2010

STL: map with string as key - access speedup

The idea is quite simple: we do not have a single stl::map<string, something>, but vector of maps, indexed with O(1) time - each map store keys sharing certain properties. Drawback: additional memory.

I've tested following grouping schemes:
  1. length of string,
  2. first letter of string (one level trie),
  3. both length and first letter.
Third is the fastest - around 60% faster then plain std::map from gcc (red-black tree).

Tests: my program read plain text (I've used The Illiad from http://gutenberg.org), text is splitted into words (~190000) and then each words is inserted into dictonary (~28000 distinct words); then the same words are searched in dictionaries. Table below summarizes results on my computer (gcc 4.3.4 from cygwin).

+-------------+--------------------------------+--------------------------------+
|             |          running time [ms]     |            speedup [%]         |
| data struct +----------+----------+----------+----------+----------+----------+
|             |    min   |   avg    |   max    |   min    |   avg    |   max    |
+-------------+----------+----------+----------+----------+----------+----------+
|                                          inserting                            |
+-------------+----------+----------+----------+----------+----------+----------+
|  std::map   |    269   |   287    |   355    |   100    |   100    |   100    |
+-------------+----------+----------+----------+----------+----------+----------+
| first char  |    218   |   241    |   395    |    81    |    84    |   111    |
+-------------+----------+----------+----------+----------+----------+----------+
|   length    |    218   |   240    |   345    |    81    |    84    |    97    |
+-------------+----------+----------+----------+----------+----------+----------+
| len./char   |    165   |   172    |   207    |    61    |    60    |    58    |
+-------------+----------+----------+----------+----------+----------+----------+
|                                            finding                            |
+-------------+----------+----------+----------+----------+----------+----------+
|  std::map   |    295   |   322    |   483    |   100    |   100    |   100    |
+-------------+----------+----------+----------+----------+----------+----------+
| first char  |    243   |   263    |   460    |    82    |    82    |    95    |
+-------------+----------+----------+----------+----------+----------+----------+
|   length    |    238   |   248    |   292    |    80    |    77    |    60    |
+-------------+----------+----------+----------+----------+----------+----------+
| len./char   |    184   |   190    |   241    |    62    |    60    |    50    |
+-------------+----------+----------+----------+----------+----------+----------+

Download test program.

czwartek, 1 kwietnia 2010

Branchless signum

Problem: calculate value of sign(x):
  • -1 when x < 0
  • 0 when x = 0,
  • +1 when x > 0.
My solution do not involve any hardware specific things like ALU flags nor special instructions - just plain AND, OR, shifts.
 ; input: eax = X

 movl %eax, %ebx
 sarl $31, %eax  // eax = -1 if X less then zero, 0 otherwise

 andl $0x7fffffff, %ebx
 addl $0x7fffffff, %ebx // MSB is set if any lower bits were set
 shrl $31, $ebx  // eax = +1 if X greater then zero, 0 otherwise

 orl %ebx, %eax  // eax = result

C99 implementation:
int32_t sign(int32_t x) {
 int32_t y;
 y = (x & 0x7fffffff) + 0x7fffffff;
 return (x >> 31) | ((uint32_t)y >> 31);
}

Fill word with selected bit

This is continuation of subproblem from previous post: we have a word (byte, dword, whatever) and want to fill it with selected bit.

1. The most general algorithm:
  1. mask bit
    [10111010] => [00010000]
  2. clone word
    a=[00010000], b=[00010000]
  3. shift bit in first word to MSB, and to LSB in second word
    a=[10000000], b=[00000001]
  4. subtract c = a - b
    c=[01111111]
  5. add missing MSB c = c OR a
    c=[11111111]
2. If arithmetic shifts are supported:
  1. shift bit to MSB
    a=[10000000]
  2. arithmetic shift right
    a=[11111111]
3. On processor 386+ we can copy selected bit from reg to carry flag (CF) with bt reg, reg and then clone CF with sbb reg, reg.
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>

uint32_t fill1(uint32_t a, int bit) {
uint32_t b;

b = a = a & (1 << bit);

a <<= 31 - bit;
b >>= bit;

return (a - b) | a;
}

uint32_t fill2(uint32_t a, int bit) {
a <<= 31 - bit;
return (int32_t)(a) >> 31;
}

uint32_t fill386(uint32_t a, int bit) {
uint32_t result;             
__asm__ __volatile__ (
"bt  %1, %0\n"
"sbb %0, %0\n"
: "=r" (result)
: "r" (bit), "0" (a)
);
return result;
}

int main(int argc, char* argv[]) {
uint32_t x, i;
if (argc > 1)
x = (unsigned long)strtol(argv[1], NULL, 0);

printf("input = %08x\n", x);
for (i=0; i < 32; i++)
printf("bit %2d: fill1 = %08x, fill2 = %08x, fill386 = %08x\n",
i,
fill1(x, i),
fill2(x, i),
fill386(x, i)
);

return EXIT_SUCCESS;
}

środa, 31 marca 2010

Transpose bits in byte using SIMD instructions

Method presented here allows to get any bit permutation, transposition is just one of possible operations. Lookup-based approach would be faster, but algorithm is worth to (re)show.

Algorithm outline for 8-byte vector (with SSE instruction it is possible to get 2 operations in parallel):
  1. fill vector with given byte
    [11010001] =>
    [11010001|11010001|11010001|11010001|11010001|11010001|11010001|11010001]
  2. leave one bit per byte
    [10000000|01000000|00000000|00010000|00000000|00000000|00000000|00000001]
  3. perform desired transposition ("move" bits around)
    [00000001|000000010|00000000|00001000|00000000|00000000|00000000|10000000]
  4. perform horizontal OR of all bytes
    [10001011]
Here is my old MMX code (polish text); below SSE/SSE5 implementation details.

Ad 1. Series of punpcklbw/punpcklwb/shufps or pshufb if CPU supports SSSE3.
# 1.1
movd %eax, %xmm0
shufps $0x00, %xmm0, %xmm0
punpcklbw %xmm0, %xmm0
punpcklwd %xmm0, %xmm0

# 1.2
pxor %xmm1, %xmm1
movd %eax, %xmm0
pshufb %xmm1, %xmm0
Ad 2. Simple pand with mask packed_qword(0x8040201008040201).
pand  MASK1, %xmm0
Ad 3. If plain SSE instructions are supported this step require some work. First each bit is populated to fill whole byte (using pcmpeq - we get negated result), then mask bits on desired positons.

SSE5 has powerful instruction protb that can do perform rotation of each byte with independent amount - so in this case just one instruction is needed.
# SSE
pcmpeq %xmm1, %xmm0
pandn MASK2, %xmm0

# SSE5
protb ROT, %xmm0, %xmm0
Ad 4. Since bits are placed on distinct positions, we can use instruction psadbw, that calculate horizontal sums of bytewide differences from two registers (separately for low and high registers half). If one register is full of zeros, we get sum of bytes from other register.
psadbw  %xmm1, %xmm0
movd %xmm0, %eax
Depending on instruction set three (SSE) or two (SSE5) additional tables are needed.

Opis biblioteki pthreads na Wikibooks

W ramach głębszego poznawania biblioteki pthreads utworzyłem w serwisie wikibooks opis POSIX Threads, zawierający bardzo podstawowe informacje o większości funkcji tej biblioteki oraz nieco o rozszerzeniach Linuxowej implementacji. Integralną częścią są przykładowe programy.

(Wikibooks niestety nie zyskał takiej popularności, jak Wikipedia, a to całkiem wygodna platforma do tworzenia różnego rodzaju podręczników).

wtorek, 30 marca 2010

PostgreSQL: get selected rows with given order

Suppose that database stores some kind of dictionary and user picks some items, but wants to keep order. For example dictionary has entries with id=0..10, and user picked 9, 2, 4 and 0. This simple query does the job (query splitted):

foo = SELECT (ARRAY[9,2,4,0])[i] AS index, i AS ord FROM  generate_series(1, 4) AS i;
SELECT * FROM dictionary INNER JOIN (foo) ON dictionary.id=foo.index ORDER BY foo.ord