L5. Assembly programming techniques

Clearing and testing the register content - x86

Clearing
xor eax, eax
binary code shorter than mov eax, 0
special handling in modern processors
Testing - zero and sign check
test eax, eax js negative jz zero
new processors treat there sequences in a special way (fusion)

Multiplication and division by powers of 2

Use left shift lsh for multiplication
Use right shift rsh for division
arithmetic right shift for division of signed numbers
Use logical AND with 2n1 for modulo

Arithmetics with LEA !

3-argument addition with constant
3-argument multiplication by 2, 4, 8 (same as left shift by 1, 2, 3)
2-/3-argument multiplication by 3, 5, 9
3-argument shift with addition/logical OR
if ones in arguments don't overlap, addition is equivalent to OR

Procedures

Procedure is a fragment of code intended to be used by many parts of a program
Procedure call and return is time-consuming
Call - branch and return trace storing
Return - branch with variable target address
Do not use procedures if

Loop organization

If possible, a loop should end with a conditional branch closing the loop
minimizes the number of control instructions
If the number of ieerations is known in advance, loop should end with 'decrement counteer and branch if not zero'
Pther methods may be used to reduce the numebr of variables modified within the loop
In x86. string instructions may be used

Using flags

Flags are affected by all basic arithmetic and logical instructions
x86 single-argument instructions affect some of the flags
INC, DEC do not update CF
CF value may be preserved for the next pass of a loop
to check if the variable was decremented to negative value, use JS/JSN not JB/JAE
In some cases, CF value previously set by 2-argument instruction may be combined with other flags set by INC/DEC to obtain logical OR/AND of two conditions produced by two instructions

Extended precision arithmetics

Carry flag may be used for propagating carry/borrow between partial results of addition/subtraction or shifts

Reducing branches

Branched may introduce significant delays, yp to 100 instruction time slots.
Branch cannot be easily predicted if:

Bit order reversal

Frequently needed - graphics, data transfer/conversion
Some of the newer architectures have the BITREV instruction

Two solutions

Trivial

  1. bit by bit copy using shifts or rotations in two directions
    it's slow O(n)=n2
  2. swapping 2k-bit groups within 2k+1 groups
    O(n)=log2n
    step: x = ((x & m) << k | (x & ~m) >> k

01101010 -> 01010110
10010101
10011010
10101001
0 1|1 0|1 0|1 0
x x x x (swap bits in every 2 bits)
1_0|0_1|0_1|0_1
\ / \ /
x x (swap pairs of 2 bits in every nibble)
/ \ /
0_1_1_0|0_1_0_1
\ /
\ /
\ /
\ /
x (swap nibbles in a byte)
/
/
/
/
0 1 0 1 0 1 1 0
~
1 0 1 0 1 0 0 1

Popularity count