Chapter 7

This chapter focuses on x86 binary shifts and rotations. Bit manipulation is a important and powerful technique, frequently used in computer graphics, data encryption and hardware manipulation. Those instructions are rarely, or partially implemented in HLL, and usually platform-dependent.
Throughout the chapter, when discussing arithmetic operations, we assume the operands are integers.
HLL do not support arithmetic operations on arbitrary-length integers, whereas x86 assembly allows addition and subtraction of virtually any size.

Shift and Rotate Instructions

Extending the bitwise instructions from chapter 6, shift instructions are used to move bits right and left inside the operand. x86 CPUs instruction sets are particuralrly rich in them, all affecting OF and CF flags
Available shift and rotate instructions

Mnemonic	Description
`SHL`	Shift left
`SHR`	Shift right
`SAL`	Shift arithmetic left
`SAR`	Shift arithmetic right
`ROL`	Rotate left
`ROR`	Rotate right
`RCL`	Rotate carry left
`RCR`	Rotate carry right
`SHLD`	Double-precision shift left
`SHRD`	Double-precision shift right

Logical Shifts and Arithmetic Shifts

There are two ways to shift oeprand's bits:
First, logical shift, which moves the bits by a provided number, and fills the 'new/empty' bits with zeros. When shifting right, the LSb is copied into the CF

Second type of shift is called arithmetic shift. The 'new/empty' bit is filled with the copy of the shifted bit. Shifting right, the sign bit is copied.

SHL Instruction

The SHL (shift left) instruction performs a logical left shift, filling the lowest bit with 0. The highest bit is moved into the CF, with the previous CF value being discarded.

The instruction format: $$ \text{SHL destination, count} $$
x86 processors allow the count to be imm8 or CL register (values 0-255)
The same format applies to all of the other shifts. (excl. SHRD, SHRL)

mov bl, 8Fh    ; BL = 10001111b
shl bl, 1      ; BL = 00011110b , CF = 1
shl bl, 3      ; BL = 11110000b , CF = 0

Bitwise Multiplication

By shifting bits to the left, the operation is equal to multiplying the number by $2^{n}$ with $n$ being the number of shifts.
For example, $5 ≪ 3 = 5 \cdot 2^{3} = 5 \cdot 8 = 40$

\begin{aligned} 5_{10} & = 00000101_{2} \\ 40_{10} & = 001010 00_{2} \end{aligned}

SHR Instruction

The SHR (shift right) instruction performs a logical right shift, replacing the highest bit with 0. The lowest bit is copied into CF
The usage is exactly the same to SFL

Bitwise Division

Invertly to #Bitwise Multiplication, SHR performs a division of an operand by $2^{n}$ , with $n$ being the number of shifts.
For example, $68 ≫ 2 = 68 \div 2^{2} = 68 \div 4 = 17$

\begin{aligned} 68_{10} & = 00100100_{2} \\ 17_{10} & = 00 001001_{2} \end{aligned}

SAL and SAR Instructions

The SAL (shift arithmetic left) instruction performs the same operation as SHL (same opcode). For each shift count, the lowest bit gets filled with 0. The highest bit is moved into CF.

The SAR (shift arithmetic right) instructions shifts the bits to the right, and fills the highest bits with a copy of the MSb (sign bit).

Signed Division

SAR allows for signed division, which preserves the sign bit.

	mov al, 0F0h  ; AL = 11110000b (-16)
	sar al, 2     ; AL = 11111100b (-4)

Sign-Extend AX into EAX

If AX contains a signed integer, to extend its sign into EAX, SHL and SAR can be used

	mov eax, -128   ; EAX = ????FF80h
	shl eax, 16     ; EAX = FF800000h
	sar eax, 16     ; EAX = FFFFFF80h

ROL Instruction

Bitwise rotation occurs when bits are moved like in a circle - bit moved out from one end, gets copied into the other end. There exists another type of rotation, using CF as an intermediate point for shifted bits.

The ROL (rotate left) instruction shifts each bit to the left, copies MSb into CF and LSb.

None of the operand's bits are discarded.

Multiple Rotations

When multiple rotations are performed (either by calling ROL multiple times, or by making the rotation count greater than 1) the CF contains the last bit rotated out of the MSb position.

Exchanging Groups of Bits

ROL can be used to exchange the upper and lower halves of an operand

	mov al, 26h   ; AL = 00100110b
	rol al, 4     ; AL = 01100010b = 62h

When rotating a multibyte integer by four bits, it results in rotating each hex digit by one position. When rotating a word 4 times by 4 bytes, the final result will be equal to the original value

	mov ax, 6A4Bh  ;
	rol ax, 4      ; AX = A4B6h
	rol ax, 4      ; AX = 4B6Ah
	rol ax, 4      ; AX = B6A4h
	rol ax, 4      ; AX = 6A4Bh

ROR Instruction

The ROR (rotate right) instruction shifts all bits to the right, and copies LSb into MSb and CF.

Multiple Rotations

When rotating more than once, the CF contains the last bit moved out of LSb.

RCL and RCR Instructions

RCL Instruction

The RCL (rotate carry left) instruction shifts each bit to the left, copies CF into LSb, and MSb into CF.
CF can be imagined as an extra bit added to the high end of the operand performing a rotate left operation.

	clc          ; CF = 0
	mov bl, 88h  ; CF = 0, BL = 10001000b
	rcl bl, 1    ; CF = 1, BL = 00010000b
	rcl bl, 1    ; CF = 0, BL = 00100001b

Recover a Bit from the Carry Flag

RCL can be used to recoved a bit that was previously shifted into CF. The code below tests the lowest bit of testval by shifting it into the CF. If bit is set, jump is taken, if 0 - RCL restores the original value

.data
	testVal BYTE 01101010b
.code
	shr testVal, 1   ; shift LSB into CF
	jc exit          ; jump if CF set
	rcl testVal, 1   ; else restore the testVal

RCR Instruction

The RCR (rotate carry right) instruction shifts each bit to the right, copies CF into MSb, and copies LSb into CF.
Similar to RCL, visualizing the rotation as using CF as another bit is the easiest to understand

Signed Overflow

Signed overflow occurrs when result of signed operation generates a result that exceeds the capacity of the destination. OF is set if shifting/rotating a signed integer by one position generates a value outside the signed integer range of the destination operand (number is reversed). In the following, positive $+ 127$ is rotated, and registed becomes negative $- 2$

	mov al, +127    ; AL = 01111111b
	rol al, 1       ; AL = 11111110b, OF = 1

Similarly, when shifting $- 128$ one position to the right, the result is $+ 64$ and overflow flag gets set

	mov al, -128
	shr al, 1

The OF value is undefined if the shift/rotation count is greated than 1

SHLD/SHRD Instructions

The SHLD (shift left double) instruction shifts a destination operand by a number of given bits. The new/empty bits are filled by the MSbs of the source operand.
Source operand is not affected. SF, ZF, AF, PF, and CF are affected
Format:

SHLD, dest, source, count

The SHRD (shift right double) instruction shifts the destination operand by a given number of bits, and fills the opened bits with the LSbs of the source operand.

SHRD, dest, source, count

For both SHLD and SHRD, the destination oeprand must be register or memory, and source must be a register. The count can be CL or imm8

Shift and Rotation Applications

There are specific situations where bitwise shifts and rotations are needed, and assembly is a good tool for that. This part will focus on example applications

Shifting Multiple Doublewords

By dividing an extended-precision integer into array of bytes, words, or doublewords, we can shift it. Array elements are stored typically in little-endian order/array, where low-order byte is placed at the array's starting address. Then, working up from the bytes to the high-order byte, next bytes are stored. Instead of storing the array in bytes, it can be stored in words/doublewords, but that will not affect the memory placement, as words/doublewords are also saved nin the little-endian order.

To shift an array of bytes 1 bit to the right:

shift the highest byte at [esi+1] to the right, automatically copying its lowest bit into CF
Rotate value at [esi+1] to the right, filling the highest bit with the value of the CF and shifting the LSb into the CF
Repeat he same for value at [esi]

.data
	ArraySize = 3
	array BYTE ArraySize DUP(99h)   ; 1001b pattern in each nybble
.code
main PROC
	mov esi, 0
	shr array[esi+2], 1  ; high byte
	rcr array[esi+1], 1  ; middle byte, use CF from the previous shift
	rcr array[esi], 1    ; low byte, use CF from the previous rotation

Multiplication by Shifting Bits

To squeeze every performance advantage from integer multiplication, we can use bit shifting instead of MUL instruction. The SHL instruction performs unsigned multiplication when the multiplier is a power of 2. Below is an example of unsigned multiplication.
This technique uses the fact that any number can be represented as the sum of powers of two

\begin{aligned} E A X \times 36 & = E A X \times (2^{5} + 2^{2}) \\ = E A X \times (32 + 4) \\ = (E A X \times 32) + (E A X \times 4) \end{aligned}

The following shows multiplication $123 \times 36 = 4428$

\begin{array}{rrll} 01111111 & 123 \\ \times & 00100100 & 36 \\ 01111011 & 123 & SHL 2 \\ + & 01111011 & 123 & SHL 5 \\ 0001000101001100 & 4428 \end{array}

Which can be turned into following code:

	mov eax, 123
	mov ebx, eax
	shl eax, 5
	shl ebx, 2
	add eax, ebx

Displaying Binary Bits

Converting a binary integer into ASCII binary string, allowing for display is a common task. The SHL instruction is useful for that, as it copies the MSb of an operand into CF during every shift left. Following is the simple implementation

; Converts 32-bit binary integer into binary
; Receives: EAX = binary integer, ESI = buffer pointer
; Returns: buffer filled with ASCII binary digits
BinToAsc PROC
	push ecx
	push esi
	mov ecx, 32                 ; number of bits in EAX
L1:
	shl eax, 1                  ; shift MSb into CF
	mov BYTE PTR [esi], '0'     ; choose 0 as default digit
	jnc L2                      ; if CF not set, repeat loop
	mov BYTE PTR [esi], '1'     ; if CF set, move 1 into buffer
L2:
	inc esi                     ; next buffer position
	loop L1                     ; loop
	
	pop esi
	pop ecx
	ret
BinToAsc ENDP

Multiplication and Division Instrustions

Unsigned Integer Multiplication (MUL)

In 32-bit mode, MUL has three formats: multiplying 8-bit operand by AL, 16-bit operand by AX, or 32-bit operand by EAX.
The multiplier and multiplicand must always be the same size, and the product is twice their size

\begin{aligned} MUL reg/mem8 \\ MUL reg/mem16 \\ MUL reg/mem32 \end{aligned}

The single operand is the multiplier.
Because the destination operand is twice the size, overflow cannot occur
CF and OF are set if upper half of product is not equal zero

When AX is multiplied by 16-bit, the product is stored in DX:AX (high 16 bits are in DX, low 16 bits are in AX)
CF is set if DX is not equal to zero, which informs that products doesn't fit into implied destination (AX)

Multiplicand	Multiplier	Product
AL	reg/mem8	AX
AX	reg/mem16	DX:AX
EAX	reg/mem32	EDX:EAX

In 64-bit mode, the 64-bit operands are available, allowing for 128-bit product stored in RDX:RAX

Examples

CF is not set because the AH is equal zero

	mov al, 5h
	mov bl, 10h
	mul bl      ; AX = 0050h, CF=0

CF is set because the upper half (DX) is not equal zero

.data
	val1 WORD 2000h
	val2 WORD 0100h
.code
	mov ax, val1  ; AX = 2000h
	mul val2      ; DX:AX = 00200000h, CF = 1
			      ; DX = 0020  AX = 0000h

Signed Integer Multiplication (IMUL)

The signed multiply IMUL instruction performs the signed multiplication performing the sign of the product. This is done by sign extending the highest bit of the lower half of the product into the opper bits of the products. The x86 instruction supports three formats: one, two, and three operands

Using IMUL in 64-bit mode is exactly the same, with the additional RDX:RAX result available

Single-Operand Formats

The one-operand store the product in AX, DX:AX, or EDX:EAX
Both operands must be the same size

\begin{aligned} IMUL reg/mem8 \\ IMUL reg/mem16 \\ IMUL reg/mem32 \end{aligned}

Similar to MUL, the storage size makes overflow impossible
CF and OF behave the same - set if upper half is not zero

Example

	mov al, -4
	mov bl, 4
	imul bl    ; AX = FFF0h, OF = 0

OF is clear, because AH is the sign-extension of AL

Two-Operand Formats (32-bit Mode)

The result is stored in the first operand, which must be a register
The second operand (multiplier) can be a register, memory, or immediate

\begin{aligned} IMUL reg16, reg/mem16 \\ IMUL reg16, imm8 \\ IMUL reg16, imm16 \\ IMUL reg32, reg/mem32 \\ IMUL reg32, imm8 \\ IMUL reg32, imm32 \end{aligned}

Two-operand formats truncate the product to the length of destination. If significant digits are lost, OF and CF are set

Always check their value after multiplication!

Three-Operand Format

Three-operand format stores product in first operand, which is a result of multiplying the second by the third operand

\begin{aligned} IMUL reg16, reg/mem16, imm8 \\ IMUL reg16, reg/mem16, imm16 \\ IMUL reg32, reg/mem32, imm8 \\ IMUL reg32, reg/mem32, imm32 \end{aligned}

Unsigned Integer Division (DIV)

There are three formats:

\begin{aligned} DIV reg/mem8 \\ DIV reg/mem16 \\ DIV reg/mem32 \end{aligned}

Dividend	Divisor	Quotient	Remainder
AX	reg/mem8	AL	AH
DX:AX	reg/mem16	AX	DX
EDX:EAX	reg/mem32	EAX	EDX

In 64-bit mode, DIV uses RDX:RAX as divident and permits the divisor to be 64-bit register or memory.

Example

	mov ax, 0083h  ; dividend
	mov bl, 2      ; divisor
	div bl         ; AL = 41h, AH = 01h

Signed Integer Division (IDIV)

The signed division is nearly identical to the unsigned one, with one important difference: dividend must be sign-extended before the division takes place.
Example why:

.data
	wordVal SWORD -101  ; FF98h
.code
	mov dx, 0
	mov ax, wordVal     ; DX:AX = 0000FF98h
	mov bx, 2
	idiv bx

After moving the wordVal into AX, the DX:AX doesn't hold corrent -101, but +65435. To fix that, CWD (convert word to doubleword) must be used to sign-extend AX into DX:AX

.data
	wordVal SWORD -101  ; FF98h
.code
	mov dx, 0
	mov ax, wordVal     ; DX:AX = 0000FF98h
	cwd                 ; DX:AX = FFFFFF98h (-101)
	mov bx, 2
	idiv bx             ; AX = FFCEh(-50)

Sign Extension Instructions

There are three sign extension instructions in the Intel architecture:

CBW - convert by to word
CWD - convert word to doubleword
CDQ - convert doubleword to quadword

CBW sign-extends AL to AX
CWD sign-extends AX to DX:AX
CDQ sign-extends EAX to EDX:EAX

IDIV Instruction

IDIV (signed divide) performs signed integer division using the same operands as #Unsigned Integer Division (DIV)
Before the execution, the divident must be sign-extended
The remainded always has the same sign as the dividend

Divide Overflow

When operation produces a quotient that will not fit into the destination operand, a divide overflow occurs. This causes processor exception and halts the program. The following code will generate such error:

	mov ax, 1000h
	mov bl, 10h
	div bl       ; AL cannot hold 100h

A similar error will appear when dividing by zero

Extended Addition and Subtraction

Extended precision addition and subtraction is a technique of adding and subtracting numbers having an almost unlimited size. In C++ there is no way of adding 1024-bit integers, but in assembly using ADC (add with carry) and SBB(subtract with borrow), we can do that

ADC Instruction

The ADC (add with carry) adds both source and CF to a destination
The instruction formats are the same as for ADD instruction, and operands must be the same size

Extended Addition Example

Example below adds two extended integers of the same size. It loops through the integers as if they were parallel arrays.
The procedure receives two pointer in ESI and EDI that point to the integers to be added. EBX points to the result buffer, with the precondition that the buffer must be one byte longer than two integers. The length of the longest integer is passed in ECX. The numbers must be stored in little-endian order, with the lowest order byte at each array's starting offset.

Extended_Add PROC
	pushad
	clc                   ; clear carry flag
	
L1: mov al, [esi]         ; get first integer
	adc al, [edi]         ; add second integer
	pushfd                ; save carry flag
	mov [ebx], al         ; store partial sum
	inc esi               ; advance all three pointers
	inc edi
	inc ebx               
	popfd                 ; restore carry flag
	loop L1               ; repeat loop
	
	mov BYTE PTR [ebx], 0 ; clear high byte
	adc BYTE PTR [ebx], 0 ; add any leftover carry
	popad
	ret

The flags are pushed onto the stack in line 7 and restored in line 12, because ADD instructions on 9-11 might set/unset the CF

LOOP never modifies CPU status flags

SBB Instruction

The SBB (subtract with borrow) instruction subtracts both source and CF from the destination operand. The formats are the same as for ADC
The following code performs a 64-bit subtraction using 32-bit operands

	mov edx, 7
	mov eax, 1    ; EDX:EAX = 0000000700000001h
	sub eax, 2    ; EDX:EAX = 00000007FFFFFFFFh, CF = 1
	sbb edx, 0    ; EDX:EAX = 00000006FFFFFFFFh

ASCII and Unpacked Decimal Arithmetic

The CPU calculates in binary, but is also able to perform arithmetic on ASCII decimal strings
For the below UI input:

Enter first number:  3402
Enter second number: 1256
The sum is:          4658

There are two ways of calculating the sum:

Convert both inputs to binary, add binary values, convert sum from binary to ASCII digit string
Add digit strings directly, by adding each pair of ASCII digits (3+1, 4+2, 0+5, 2+6). The sum is already an ASCII digit string so can be displayed on the screen
The second option uses specialized instructions that adjust the sum after adding each pair of ASCII digits

Instruction	Comment
`AAA`	ASCII adjust after addition
`AAS`	ASCII adjust after subtraction
`AAM`	ASCII adjust after multiplication
`AAD`	ASCII adjust before division

ASCII Decimal and Unpacked Decimal

The high 4 bits of an unpacked decimal integer are always zeros, whereas the same bits in ASCII decimal number are equal to 0011b. In both cases, both types of integers store one digit per byte. The following shows how 3402 would be stored

\begin{aligned} ASCII format: & 33 34 30 32 \\ Unpacked: & 03 04 00 02 \end{aligned}

ASCII arithmetic executes slower than binary arithmetic, but has two advantages:

No need to convert from string to binary before calculations
Decimal point permits operations on real numbers without danger of roundoff errors in floating-point numbers

ASCII addition and subtraction works on both ASCII format and unpacked decimal format, but multiplication and division works only on unpacked decimal format.

AAA Instruction

The AAA (ASCII adjust after addition) adjust the binary result of ADD or ADC instruction. If AL contains binary value procudes by adding two ASCII digits, AAA converts AL to two unpacked decimal digits and stores them in AH and AL in unpacked format. After that, AH and AL can be easily converted to ASCII by OR with 30h

	xor ax, ax
	mov al, '8'   ; AX = 0038h
	add al, '0'   ; AX = 006Ah
	aaa           ; AX = 0100h
	or ax, 3030h  ; AX = 3130h = '10'

AAS Insruction

In 32-bit mode, the AAS (ASCII adjust after subtraction) follows a SUb or SBB instruction that has subtracted one unpacked decimal value from another and stored the result in AL. It makes result in AL consistent with ASCII digit representation. Adjustment is only necessary if the result of subtraction is negative

	xor ax, ax
	mov al, '8'    ; AX = 0038h
	sub al, '9'    ; AX = 00FFh
	aas            ; AX = FF09h
	pushf          ; save the CF
	or al, 30h     ; AX = FF39h
	popf           ; restore CF

AAM Instruction

AAM instruction (ASCII adjust after multiplication) converts binary product of MUL to an unpaced decimal. The multiplication can only use unpacked decimals (5 == 05h, not 35h)

	mov bl, 05h
	mov al, 06h
	mul bl       ; AX = 001Eh
	aam          ; AX = 0300h

AAD Instruction

AAD instruction (ASCII adjust before division) converts unpacked decimal divident in AX to binary representation in preparation for executing DIV instruction.

	mov ax, 0307h    ; dividend UD '37'
	aad              ; AX = 0025h
	mov bl, 5        ; divisor
	div bl           ; AX = 0207h

Packed Decimal Arithmetic

Packed Binary-Coded Decimal Integers (BCD Integers) store two decimal digits per byte. For simplification, only unsigned BCD will be used. The numbers are stored in little-endian order, with each digit represented by 4 bits.

	bcd1 QWORD 2345673928737285h ; 2,345,673,928,737,285 decimal
	bcd2 DWORD 12345678h         ; 12,345,678 decimal
	bcd3 DWORD 08723654h         ; 8,723,654 decimal
	bcd4 WORD 9345h              ; 9,345 decimal
	bcd5 WORD 0237h              ; 237 decimal
	bcd6 BYTE 34h                ; 34 decimal

There are two main benefits from this:

Almost any number of significant digits are possible
Conversion of BCD to ASCII is relatively simple

DAA (decimal adjust after addition) and DAS (decimal adjust after subtraction) are two instructions used to adjust the result of operation on packed decimals. No instructions exist for multiplication and division. In those cases, numbers must be unpacked, multiplied or divided, and repacked

DAA Instruction

DAA (decimal adjust after addition) instruction converts a binary sum from ADD or ADC in AL to packed decimal format.

	mov al, 35h
	add al, 48h    ; AL = 7Dh
	daa            ; AL = 83h = '83'

DAS Instruction

DAS (decimal adjust after subtraction) converts result of subtraction by SUB or SBB instruction to in AL into BCD format

	mov bl, 48h
	mov al, 85h
	sub al, bl    ; AL = 3Dh
	das           ; AL = 37h = '37'