L5. Assembly programming techniques

Clearing and testing the register content - x86

Clearing
xor eax, eax
binary code shorter than mov eax, 0
special handling in modern processors
Testing - zero and sign check
test eax, eax js negative jz zero
new processors treat there sequences in a special way (fusion)

Multiplication and division by powers of 2

Use left shift lsh for multiplication
Use right shift rsh for division
arithmetic right shift for division of signed numbers
Use logical AND with $2^{n} - 1$ for modulo

Arithmetics with LEA !

3-argument addition with constant
3-argument multiplication by 2, 4, 8 (same as left shift by 1, 2, 3)
2-/3-argument multiplication by 3, 5, 9
3-argument shift with addition/logical OR
if ones in arguments don't overlap, addition is equivalent to OR

Procedures

Procedure is a fragment of code intended to be used by many parts of a program
Procedure call and return is time-consuming
Call - branch and return trace storing
Return - branch with variable target address
Do not use procedures if

it is called from one place in the program
insert the bofy of the routine instead for its invocation
in C - use inline attribute to suggest the compiler to do the same
procesure body execytes in time similar to call and return
Use registers to pass arguments in assembly

Loop organization

If possible, a loop should end with a conditional branch closing the loop
minimizes the number of control instructions
If the number of ieerations is known in advance, loop should end with 'decrement counteer and branch if not zero'
Pther methods may be used to reduce the numebr of variables modified within the loop
In x86. string instructions may be used

Using flags

Flags are affected by all basic arithmetic and logical instructions
x86 single-argument instructions affect some of the flags
INC, DEC do not update CF
CF value may be preserved for the next pass of a loop
to check if the variable was decremented to negative value, use JS/JSN not JB/JAE
In some cases, CF value previously set by 2-argument instruction may be combined with other flags set by INC/DEC to obtain logical OR/AND of two conditions produced by two instructions

Extended precision arithmetics

Carry flag may be used for propagating carry/borrow between partial results of addition/subtraction or shifts

Reducing branches

Branched may introduce significant delays, yp to 100 instruction time slots.
Branch cannot be easily predicted if:

it depends on data value
has variable target address
appeaars close to other branch instructions
Removal of a single branch instruction may positively affect the prediction of neighboring branches
Conclusion - minimize the number of branches

Bit order reversal

Frequently needed - graphics, data transfer/conversion
Some of the newer architectures have the BITREV instruction

Two solutions

Trivial

bit by bit copy using shifts or rotations in two directions
it's slow $O (n) = n^{2}$
swapping $2^{k}$ -bit groups within $2^{k + 1}$ groups
$O (n) = l o g_{2} n$
step: x = ((x & m) << k | (x & ~m) >> k

01101010 -> 01010110
10010101
10011010
10101001
0 1|1 0|1 0|1 0
x x x x (swap bits in every 2 bits)
1_0|0_1|0_1|0_1
\ / \ /
x x (swap pairs of 2 bits in every nibble)
/ \ /
0_1_1_0|0_1_0_1
\ /
\ /
\ /
\ /
x (swap nibbles in a byte)
/
/
/
/
0 1 0 1 0 1 1 0
~
1 0 1 0 1 0 0 1