Chapter 3

Basic Assembly Language Elements

Example Code

.data              ; data segment
	sum DWORD 0    ; 32-bit variable initialized to 0
.code              ; code segment
main PROC          ; declaring starting point of 'main' procedure
	mov eax, 6
	add eax, 5
	mov sum, eax
	INVOKE ExitProcess, 0    ; exit process call with code 0
main ENDP          ; endpoint of 'main' procedure

Code Segments

Variables

Integer Literals

Also known as integer constant

[{+|-}] digits [ radix ]

Radix

If a number is written as-is, it's assumed to be decimal. To make the assembler read a number in another numeral system, a radix suffix has to be added at the end

System Radix
Decimal d/t
Binary b/y
Hexadecimal h
Octal q/o
Encoded real r

If a hexadecimal literal begins with a letter, it must be prefixed with 0 so that the assembler doesn't read it as an identifier
So A9h should be written as 0A9h
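
As a small sketch of the radix suffixes above, the data-only module below stores the value 26 in several notations and then the 0A9h case. The variable names are made up for illustration, and the .386/.model header is assumed only so that the fragment assembles on its own.

```asm
.386
.model flat, stdcall

.data
	decVal  BYTE 26        ; no radix: decimal
	decVal2 BYTE 26d       ; explicit decimal radix
	binVal  BYTE 11010b    ; binary, also 26
	octVal  BYTE 32q       ; octal, also 26
	hexVal  BYTE 1Ah       ; hexadecimal, also 26
	hexA9   BYTE 0A9h      ; leading 0 because the literal starts with a letter
END
```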

Constant Integer Expressions

A constant integer expression is a mathematical expression made of integer literals and arithmetic operators. Its final value has to fit in 32 bits
Operators are evaluated according to their precedence level (1 is highest, 4 is lowest)

Operator Symbol Level
Parentheses () 1
Unary plus, minus +, - 2
Multiply, divide *, / 3
Modulus MOD 3
Add, subtract +, - 4
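
A minimal sketch of how the precedence levels play out, with the expected value of each expression in the comment. The variable names are hypothetical, and the header lines are assumed only to make the fragment assemble by itself.

```asm
.386
.model flat, stdcall

.data
	val1 DWORD  4 + 5 * 2      ; multiply first: 4 + 10 = 14
	val2 DWORD  (4 + 5) * 2    ; parentheses first: 9 * 2 = 18
	val3 DWORD  12 - 1 MOD 5   ; MOD first: 12 - 1 = 11
	val4 SDWORD -5 + 2         ; unary minus first: -3
END
```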

Real Number Literals

Also known as floating-point literals
Written in decimal reals or encoded (hex) reals
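An optional sign comes first, followed by an integer and a decimal point
The fraction digits after the decimal point and the exponent are both optional

[{+|-}] integer.[integer][exponent]

where exponent is E[{+|-}] integer

At least one digit and a decimal point are required (digits after the point are not)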

An encoded real is the hex representation of a floating-point number's binary encoding, written with the r radix
+1.0 = 00111111100000000000000000000000 can be encoded as 3F800000r
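
The following sketch declares a few real number literals in both forms. The REAL4 and REAL8 directives and the variable names are assumptions made for illustration; the encoded real uses the r radix from the radix table.

```asm
.386
.model flat, stdcall

.data
	r1  REAL4 2.           ; digits after the decimal point are optional
	r2  REAL8 +3.0         ; optional sign
	r3  REAL4 -44.2E+05    ; exponent with a sign
	r4  REAL8 26.E5        ; exponent directly after the decimal point
	one REAL4 3F800000r    ; encoded real for +1.0
END
```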

Character Literal

A single character can be enclosed in single or double quotes - 'a', "B"
The assembler stores the character's integer ASCII value, so 'A' = 65 = 41h

String Literal

Strings can be enclosed in single or double quotes - 'Hello', "World"
Quotes can be part of a string like this:
"We can't do this"
'He said "Hello" to her'
In other words, a quote character can be embedded by enclosing the string in the other kind of quote
Strings are stored as a sequence of ASCII values - 'Hello' = 48h, 65h, 6Ch, 6Ch, 6Fh

Reserved Words

Reserved words have special meaning in assembly syntax. They are case-insensitive

Identifiers

Programmer-chosen name for variable, constant, procedure, code label
Rules:

Identifiers can be made case-sensitive by adding -Cp to assembler flags

Directives

Commands embedded in source code, read and acted upon by assembler
Do not execute at runtime
Can define variables, macros, and procedures, and assign names to segments
NOT case-sensitive
There are some system-specific directives not recognized by other assemblers

Defining Program Segments

Instructions

Statement that becomes executable after assembly
Translated by assembler into machine language bytes
Contains 4 parts:

[ label: ] mnemonic [ operand 1, operand 2, ... ][ ; comment ]

Label

Place marker for instruction and data
There are two types of labels - data and code
Label before instruction implies instruction address
Label before variable implies variable address

Data Label

Identifies location of variable, making referencing easier

count DWORD 100                  ; can be a singular variable
array BYTE  10h, 20h, 30h, 40h   ; or an array

Code Label

A label in the code section must end with a colon (:)
Code labels are used for jumping and looping instructions
The below example creates a label target and jumps to it using JMP instruction, transferring to location marked by the label

target:
	mov eax, 5
	...
	JMP target

A code label can share a line with the instruction that follows it or sit on a line of its own

L1: mov eax, 6
L2:
	mov ebx, 7

Instruction Mnemonics

Short word defining instruction
Examples: mov, add, mul, sub, jmp, call

Operands

Input/output value of an instruction. An instruction takes between 0 and 3 operands. They can be: register, memory operand, integer expression, I/O port
Memory operand - references a memory location either by a variable name (direct) or by a register in brackets (indirect)
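
The operand kinds above can be sketched in a short program. The variable count and the use of the OFFSET operator (which yields a variable's address) are assumptions added for illustration; I/O port operands are left out.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
	count DWORD 100

.code
main PROC
	mov eax, 25            ; integer expression (immediate) operand
	mov ebx, eax           ; register operand
	mov eax, count         ; memory operand, direct: variable name
	mov esi, OFFSET count  ; ESI = address of count
	mov eax, [esi]         ; memory operand, indirect: register in brackets
	INVOKE ExitProcess, 0
main ENDP
END main
```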

Comments

Allow programmer to communicate about the code
The normal, inline comment starts with semicolon ;
And the comment block starts using COMMENT #, where # is any symbol that doesn't appear in the comment block

	mov eax, 5    ; this is a valid in-line comment

COMMENT !
	this is a valid block comment
!

COMMENT $
	as well as this!
	but it cannot have '!' as the starting symbol, because it's used in the line above
$

NOP Instruction

The most basic, safe, and useless instruction is NOP (no operation). It takes up 1 byte and does nothing. It is sometimes used to align code to an address boundary or to pad around WAR/WAW pipeline hazards

Full Code Example

The full, complete code from the beginning of this chapter should look something like this

.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.code
main PROC
	mov eax, 5
	add eax, 6
	
	INVOKE ExitProcess, 0
main ENDP
END main

The 4th line holds the function prototype ExitProcess PROTO, dwExitCode:DWORD. It uses the PROTO directive to declare a prototype for a function named ExitProcess that takes one parameter, dwExitCode, of type DWORD

Assembler Directives

.386

Identifies program as 32-bit, which can access 32-bit registers and addresses

.model

Our code passes two settings to .model: flat (the memory model) and stdcall (the calling convention used for procedures)

.stack

Sets the number of bytes of memory to reserve for the program's runtime stack. 4096 is a typical value, as it corresponds to the size of a memory page
Stack is used by all modern programs to call subroutines, hold and pass parameters and caller addresses
Additionally, we can use stack to hold local variables declared inside a function.

.code

Marks beginning of the code area, containing executable instructions. Usually the next line is the declaration of the program's entry point, typically main, using PROC directive.

Entry Point

Location of the very first instruction the program will execute

The ENDP directive marks the end of a procedure. It must use the same name as the one used in the PROC declaration
Any lines after END directive will be ignored by the assembler

.data

Holds declarations of editable variables. Variables are placed in memory sequentially, one after another, so for example

.data
	count WORD 10h
	array BYTE 10h, 20h, 30h, 40h
	flag  BYTE 1

will look in memory like so:

$$ \underbrace{10 , 00}_{\text{Little Endian!}} , 10 , 20 , 30 , 40 , 01 $$

.const

Holds declarations of constants (read-only data). They are laid out in memory exactly the same way as in .data

Assembling, Linking, and Running Programs

Assembly source code cannot be executed directly by the computer. It has to go through a couple of steps before the processor can understand it, the most important being translation, or assembly. An assembler works in a similar fashion to a compiler, which is used for HLLs like C++ or Java.

Step 1 - program is written using a text editor in a text source file
Step 2 - Assembler reads the source file and produces an object file - a machine-language translation. Optionally, it can also produce a listing file. It checks the program for errors and reports any it finds.
Step 3 - Object file is read by the linker which checks for calls to procedures in a link library. If found, linker copies the required procedures, combines them with the object file and creates the executable file.
Step 4 - OS loader utility reads executable file into memory and points CPU to the program's entry point.

Listing File

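Listing file contains a copy of the program's source code with line numbers, the relative address of each instruction, and the machine code bytes of each instruction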

Example listing file line (first line in .code from AddTwo):
11: 00000000 B8 00000005 mov eax, 5
11 - line number
00000000 - relative address of the instruction
B8 00000005 - machine code instruction
B8 is called operation code (opcode) representing specific machine instruction to move imm32 integer into EAX

In the listing file we can notice that the INVOKE directive from our source code was expanded into PUSH and CALL instructions

Defining Data

Intrinsic Data Types

Assembler understands a basic set of intrinsic data types which describe a variable's size (byte, word, doubleword, and so on), whether it is signed or unsigned, and whether it is an integer or a real
There's an overlap in sizes: for example SDWORD, DWORD, and REAL4 are all 32 bits long. For the integer types the processor doesn't care which one is used, but the programmer might declare a variable as signed for ease of reading.

Data Definition Statement

Sets aside storage in memory for a variable with an optional name (label). Data definition statements create variables based on intrinsic data types.
The syntax is as follows:

[ name ] directive initializer [ , initializer ] ...

The following intrinsic data types are available

Type Usage
BYTE 8-bit unsigned integer
SBYTE 8-bit signed integer
WORD 16-bit unsigned integer
SWORD 16-bit signed integer
DWORD 32-bit unsigned integer
SDWORD 32-bit signed integer
FWORD 48-bit integer (far pointer in protected mode)
QWORD 64-bit integer
TBYTE 80-bit (10-byte) integer. T means ten-byte
REAL4 32-bit (4-byte) IEEE short real
REAL8 64-bit (8-byte) IEEE long real
REAL10 80-bit (10-byte) IEEE extended real

Example data definition statement - count DWORD 12345678

Name

Optional, assigned to variable, must conform to the rules of identifiers

Directive

Directive in a data definition statement can be BYTE, WORD, and so on, or any of the legacy types from the table below:

Directive Usage
DB 8-bit integer
DW 16-bit integer
DD 32-bit integer or real
DQ 64-bit integer or real
DT 80-bit (10-byte) integer

Initializer

At least one initializer is required (it may be zero) to assign a starting value to a variable
Additional initializers, if any, are separated by commas
For integer data types, the initializer is an integer literal or expression that fits the size of the variable's type
An uninitialized variable (it keeps whatever value was in that memory before) can be created by using the ? symbol as the initializer
Assembler converts all initializers to binary format

Declaring Data

Defining BYTE and SBYTE Data

BYTE and SBYTE directives allocate storage for one or more values, each 1 byte (8 bits) in size. Examples:

	value1 BYTE 'A'    ; character literal
	value2 BYTE 0      ; smallest unsigned byte
	value3 BYTE 255    ; largest unsigned byte
	value4 SBYTE -128  ; smallest signed byte
	value5 SBYTE +127  ; largest signed byte
	value6 BYTE ?      ; uninitialized byte

Multiple Initializers

If multiple initializers are used in the same data definition, its label refers only to the offset of the first initializer.
For the following declaration: list BYTE 10h, 20h, 30h, 40h the memory layout will look like this (memory offset above the value): $$ \overset{0000}{10} , \overset{0001}{20} , \overset{0002}{30} , \overset{0003}{40} $$

Not all data definitions require labels. We can continue the array of bytes list by defining additional bytes on the next lines

list BYTE 11, 12, 13, 14
     BYTE 15, 16, 17, 18
     BYTE 19, 20, 21, 22

list label only refers to the first initializer (11), but the next ones can still be accessed with [list + idx]
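
A minimal sketch of reading individual bytes through the single list label. The register choices are arbitrary; the expected values in the comments follow from the byte offsets within the three lines.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
	list BYTE 11, 12, 13, 14
	     BYTE 15, 16, 17, 18
	     BYTE 19, 20, 21, 22

.code
main PROC
	mov al, list           ; AL = 11 (first initializer, offset 0)
	mov bl, [list + 5]     ; BL = 16 (offset 5, on the second line)
	mov cl, [list + 11]    ; CL = 22 (offset 11, last byte)
	INVOKE ExitProcess, 0
main ENDP
END main
```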

Within one data definition, initializers can use different radixes and literal types.

Defining Strings

To define a string, enclose it in single or double quotation marks. The most common string type is the null-terminated string, which ends with a 00h byte. Example of a string declaration:
message1 BYTE "This is an example string", 0
Each character uses a byte of storage.
Because a list can be continued on following lines, we can easily split the string across multiple lines without losing its contents.

someText BYTE "Lorem ipsum dolor "
	     BYTE "Sally is the best!", 0

This example would produce 'Lorem ipsum dolor Sally is the best!' when written

DUP Operator

DUP operator allocates storage for multiple copies of a data item; the count is given by an integer expression.
Useful for allocating memory for a string input or array.

inputString BYTE 127 DUP(0)   ; allocates 127 bytes initialized to 0
arrayInput  BYTE 20 DUP(?)    ; allocates 20 bytes of uninitialized data
repeatText  BYTE 4 DUP("Text") ; allocates 16 bytes with 'Text' repeated 4
                               ; times - 'TextTextTextText'

If the DUP operator is used with any other type (WORD, DWORD, etc), it will allocate that many elements of the type, so DWORD 4 DUP(0) will allocate 16 bytes initialized to 0, as 4 × 4 bytes = 16 bytes
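
As a rough check of the byte counts, the sketch below uses the SIZEOF operator, which is not covered in these notes and is assumed here to return the total number of bytes a variable occupies. The array names are hypothetical.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
	wordArr  WORD  5 DUP(?)    ; 5 x 2 bytes = 10 bytes
	dwordArr DWORD 4 DUP(0)    ; 4 x 4 bytes = 16 bytes

.code
main PROC
	mov eax, SIZEOF wordArr    ; EAX = 10
	mov ebx, SIZEOF dwordArr   ; EBX = 16
	INVOKE ExitProcess, 0
main ENDP
END main
```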

Packed BCD (TBYTE) Data

Packed BCD integers are stored in 10-byte packages on Intel architecture.
Each byte except the most significant one contains two decimal digits. The most significant byte holds the sign - 80h means negative, 00h means positive.
MASM uses the TBYTE type to declare BCD integers, but it cannot translate decimal integers to BCD, so they have to be declared in hex.

intVal TBYTE 80000000000000001234h ; valid, equal to -1234
intVal TBYTE -1234                 ; invalid

If we write -1234 in the declaration, MASM reads it as binary integer rather than packed BCD integer.
To encode a real number as packed BCD, we first have to load it onto the floating-point stack and then pop it, converting to BCD. This procedure rounds the value to the nearest integer.

.data
	posVal REAL8 1.5
	bcdVal TBYTE ?
.code
	fld posVal     ; push onto FP stack
	fbstp bcdVal   ; pop from the FP stack and convert to BCD

Defining Floating-Point Types

REAL4 defines 4-byte single-precision FP variable
REAL8 defines 8-byte double-precision FP variable
REAL10 defines 10-byte extended-precision FP variable
Each requires one or more real constant initializers.
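
A short sketch of floating-point declarations, one per directive plus an array built with DUP. The variable names and values are illustrative only.

```asm
.386
.model flat, stdcall

.data
	rVal1      REAL4  -1.2            ; 4-byte single precision
	rVal2      REAL8  3.2E-260        ; 8-byte double precision
	rVal3      REAL10 4.6E+4096       ; 10-byte extended precision
	shortArray REAL4  20 DUP(0.0)     ; 20 single-precision values, 80 bytes
END
```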

Declaring Uninitialized Data

We can use the .data? directive to declare uninitialized data
When defining large blocks of uninitialized data, .data? reduces the size of the compiled program.
For example

.data
	smallArray DWORD 10 DUP(0)   ; 40 bytes
.data?
	largeArray DWORD 5000 DUP(?) ; 20000 bytes, not initialized
============================================================================
.data
	smallArray DWORD 10 DUP(0)   ; 40 bytes
	largeArray DWORD 5000 DUP(0) ; 20000 bytes, initialized

The compiled program will be 20000 bytes bigger in the second case.

Mixing Code and Data

Assembler allows for mixing of .code and .data directives freely.
We can write code like:

.data
	val1 WORD 1234h
.code
	mov ax, val1
.data
	val2 WORD 5678h
.code
	add ax, val2

In the end, the assembler will put all of the .data declarations in one place in the final file.
This might create a better logical structure of the code, but also make it more complicated to read.

Little-Endian Data

x86 processors store and retrieve data from memory in little-endian order (low to high). The least significant byte (LSB) is stored at the lowest (first) memory address.
For an initialization of number 12345678h, the memory will look like $$ \overset{0000}{78} , \overset{0001}{56} , \overset{0002}{34} , \overset{0003}{12} $$
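
The byte order can be observed by reading the variable one byte at a time. This sketch assumes the BYTE PTR operator (not covered in these notes) to override the DWORD size of the hypothetical variable val.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
	val DWORD 12345678h

.code
main PROC
	mov al, BYTE PTR val         ; AL = 78h (offset 0, least significant byte)
	mov bl, BYTE PTR [val + 1]   ; BL = 56h
	mov cl, BYTE PTR [val + 2]   ; CL = 34h
	mov dl, BYTE PTR [val + 3]   ; DL = 12h (offset 3, most significant byte)
	INVOKE ExitProcess, 0
main ENDP
END main
```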

Big-Endian

Some computers use big-endian order, which stores the data in a more human-readable way: the bytes in memory appear in the same order as in the code $$ \overset{0000}{12} , \overset{0001}{34} , \overset{0002}{56} , \overset{0003}{78} $$

Symbolic Constants

A symbolic constant is created by associating an identifier (symbol) with an integer expression or some text. Symbolic constants do not reserve storage and cannot change value at runtime. The assembler scans the code and substitutes each symbol with its associated expression.

Equal-Sign Symbol

Equal-sign directive associates symbol with an integer expression. $$ name = expression$$
Typically, expression is a 32-bit integer value.
This directive is useful for defining global values that are used in multiple places throughout the code and might change in future development, like COUNT, N etc.
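
A minimal sketch of an equal-sign symbol in use, including a redefinition partway through the source. The assembler substitutes whichever value is current at each point; the symbol name COUNT is only an example.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

COUNT = 5                  ; first definition

.code
main PROC
	mov al, COUNT          ; assembled as: mov al, 5
COUNT = 10                 ; redefinition of the same symbol
	mov al, COUNT          ; assembled as: mov al, 10
	INVOKE ExitProcess, 0
main ENDP
END main
```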

Current Location Counter

One of the most important symbols is $, called the current location counter. Defining a variable like selfPtr DWORD $ will initialize it to its own memory address.

Calculating Size of String or Array

It is useful for getting the size of strings/lists, as we can do something like

stringText BYTE "Hello, World!", 0
stringSize = ($ - stringText)     ; gets current address using $ and
								  ; subtracts the address of stringText
								  ; getting number of bytes (!)

To get number of elements in WORD list, we'd have to divide the expression by 2, for DWORD - by 4.
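
The division by element size could look like the sketch below. The list names and contents are made up; each size symbol has to be defined immediately after its list so that $ still points just past the last element.

```asm
.386
.model flat, stdcall

.data
	wordList   WORD 1000h, 2000h, 3000h, 4000h
	wordCount  = ($ - wordList) / 2     ; 8 bytes / 2 = 4 elements
	dwordList  DWORD 10000000h, 20000000h, 30000000h
	dwordCount = ($ - dwordList) / 4    ; 12 bytes / 4 = 3 elements
END
```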

Keyboard Definitions

Programs often define symbols that identify commonly used numeric keyboard codes. For example, 27 is ASCII code for ESC key
Esc_key = 27

EQU Directive

EQU directive associates symbolic name with an integer expression or some arbitrary text

name EQU expression
name EQU symbol
name EQU <text>

In the first format, it must be a valid integer expression
In the second format, symbol must be an already existing name, previously defined using = or EQU.
In the third format, any text may appear within the <...> brackets.

EQU might be useful with defining values that don't evaluate to an integer, like PI EQU <3.1416>

Unlike a symbol defined with the equal-sign directive, a symbol defined with EQU cannot be redefined in the same source file.
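
The difference between the integer and text formats can be sketched as follows. The symbol and variable names are illustrative; the comments show what the assembler sees after substitution.

```asm
.386
.model flat, stdcall

matrix1  EQU 10 * 10                              ; integer expression, evaluated to 100
matrix2  EQU <10 * 10>                            ; text, substituted as written
pressKey EQU <"Press any key to continue...", 0>  ; text holding a string and its terminator

.data
	M1     WORD matrix1     ; assembled as: M1 WORD 100
	M2     WORD matrix2     ; assembled as: M2 WORD 10 * 10
	prompt BYTE pressKey    ; assembled as: prompt BYTE "Press any key to continue...", 0
END
```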

TEXTEQU Directive

Similar to EQU, TEXTEQU creates what is known as a text macro. There are three formats:

name TEXTEQU <text>
name TEXTEQU textmacro
name TEXTEQU %constExpr

The first format assigns literal text, the second assigns the contents of an existing text macro, and the third assigns the value of a constant integer expression
Text macros can build on each other. In the example below, count is set to the value of an expression using rowSize, then move is defined as mov, and finally setupAL is built from move and count.

rowSize = 5
count   TEXTEQU %(rowSize * 2)
move    TEXTEQU <mov>
setupAL TEXTEQU <move al, count>
====
	; this can be used as follows:
setupAL
	; which will be assembled into
mov al, 10

Symbol defined by TEXTEQU can be redefined at any time