Chapter 3
Basic Assembly Language Elements
Example Code
.data ; data segment
sum DWORD 0 ; 32-bit variable initialized to 0
.code ; code segment
main PROC ; declaring starting point of 'main' procedure
mov eax, 6
add eax, 5
mov sum, eax
INVOKE ExitProcess, 0 ; exit process call with code 0
main ENDP ; endpoint of 'main' procedure
Code Segments
.code, .data, .stack
Variables
Integer Literals
Also known as integer constant
Radix
A number written as-is is assumed to be decimal. To write a number in another numeral system, append a radix suffix:
| System | Radix |
|---|---|
| Decimal | d/t |
| Binary | b/y |
| Hexadecimal | h |
| Octal | q/o |
| Encoded real | r |
If a hexadecimal literal starts with a letter, it must be prefixed with 0 to prevent the assembler from reading it as an identifier
So 0A5h is a valid hexadecimal literal, while A5h would be treated as an identifier
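As a minimal sketch of the radix suffixes above, the program below defines the same value (26) in several numeral systems. The variable names are made up for illustration; the program skeleton follows the full example later in these notes.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
; hypothetical names, chosen only for this illustration
decVal   DWORD 26        ; no suffix: decimal by default
decVal2  DWORD 26d       ; explicit decimal suffix
binVal   DWORD 11010b    ; binary
octVal   DWORD 32q       ; octal (32o is equivalent)
hexVal   DWORD 1Ah       ; hexadecimal, starts with a digit
hexVal2  DWORD 0A5h      ; leading 0 needed: the literal starts with a letter

.code
main PROC
    mov eax, hexVal      ; EAX = 26, same as any of the first five variables
    INVOKE ExitProcess, 0
main ENDP
END main
```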
Constant Integer Expressions
A constant integer expression is a mathematical expression built from integer literals and arithmetic operators. The final value has to fit in 32 bits
Operators are applied in order of precedence, from level 1 (highest) to level 4 (lowest)
| Operator | Symbol | Level |
|---|---|---|
| Parentheses | ( ) | 1 |
| Unary plus, minus | +, - | 2 |
| Multiply, divide | *, / | 3 |
| Modulus | MOD | 3 |
| Add, subtract | +, - | 4 |
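The precedence levels can be checked with a few small expressions. The sketch below uses hypothetical variable names; each expression is evaluated by the assembler, so only the resulting constant ends up in memory.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
; hypothetical names; the comments give the value the assembler computes
val1 DWORD  16 / 5               ; 3   (integer division)
val2 SDWORD -(3 + 4) * (6 - 1)   ; -35 (parentheses, then unary minus, then multiply)
val3 SDWORD -3 + 4 * 6 - 1       ; 20  (multiply before add and subtract)
val4 DWORD  25 MOD 3             ; 1   (remainder)

.code
main PROC
    mov eax, val3                ; EAX = 20
    INVOKE ExitProcess, 0
main ENDP
END main
```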
Real Number Literals
Also known as floating-point literals
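Written as decimal reals or encoded (hexadecimal) reals
A decimal real has the form [sign]integer.[integer][exponent]
The sign in front is optional, and so are the fraction digits and the exponent after the decimal point
At least one digit and the decimal point are required (digits after the point are not)
An encoded real is the hexadecimal representation of the number's floating-point encoding, written with the r radix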
Character Literal
Single character can be enclosed in single or double quotes - 'a', "B"
The assembler stores the character's integer ASCII value, so 'A' is stored as 65 (41h)
String Literal
Strings can be enclosed in single or double quotes - 'Hello', "World"
Quotes can be part of a string like this:
"We can't do this"
'He said "Hello" to her'
That is, enclose the string in the opposite kind of quote
Strings are stored as a sequence of ASCII values - "ABC" is stored as the bytes 41h, 42h, 43h
Reserved Words
Reserved words have special meaning in assembly syntax. They are case-insensitive
- Instruction mnemonics - mov, ADD
- Register names - EAX, bl
- Assembler directives - Invoke, ENDP
- Variable/Operand attributes - BYTE, dword
- Operators in const expressions
- Defined symbols that return predefined integer values
Identifiers
Programmer-chosen name for variable, constant, procedure, code label
Rules:
- 1-247 characters
- not case-sensitive
- first character must be a letter, _, @, ?, or $; it cannot be a digit
- cannot be an assembler reserved word
It is best to avoid a leading _ or @, as the assembler uses them
Identifiers can be made case-sensitive by adding the -Cp option to the assembler flags
Directives
Commands embedded in source code, read and acted upon by assembler
Do not execute at runtime
They can define variables, macros, and procedures, and assign names to segments
NOT case-sensitive
There are some system-specific directives not recognized by other assemblers
Defining Program Segments
- .data - defines variables
- .code - contains executable instructions
- .stack - sets the size of the program's runtime stack
Instructions
Statement that becomes executable after assembly
Translated by assembler into machine language bytes
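Contains 4 parts: label (optional), instruction mnemonic (required), operand(s) (usually required), and comment (optional)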
Label
Place marker for instruction and data
There are two types of labels - data and code
Label before instruction implies instruction address
Label before variable implies variable address
Data Label
Identifies location of variable, making referencing easier
count DWORD 100 ; can be a singular variable
array BYTE 10h, 20h, 30h, 40h ; or an array
Code Label
A label in the code section must end with a colon :
Code labels are used as targets for jumping and looping instructions
The example below creates a label named target and jumps to it with the JMP instruction, transferring control to the location marked by the label
target:
mov eax, 5
...
JMP target
A code label can share a line with the instruction that follows it, or sit on a line of its own
L1: mov eax, 6
L2:
mov ebx, 7
Instruction Mnemonics
Short word defining instruction
Examples: mov, add, mul, sub, jmp, call
Operands
Input/output values of an instruction. Instructions take between zero and three operands. An operand can be a register, a memory operand, an integer expression, or an I/O port
Memory operand - references a memory location either by a variable name (direct) or by a register in brackets (indirect)
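The operand kinds and counts described above might look like this in practice. This is only a sketch: the variable count and the choice of instructions are assumptions made for illustration, and OFFSET (which yields a variable's address) is used here only to set up the indirect access.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
count DWORD 100                  ; hypothetical variable for the example

.code
main PROC
    stc                          ; zero operands: set the carry flag
    inc eax                      ; one operand: a register
    mov eax, 5                   ; two operands: register, integer expression
    mov ebx, eax                 ; two operands: register, register
    mov count, ebx               ; memory operand named by a variable (direct)
    mov esi, OFFSET count        ; ESI = address of count
    mov ecx, [esi]               ; memory operand via a register in brackets (indirect)
    imul eax, ebx, 5             ; three operands: EAX = EBX * 5
    INVOKE ExitProcess, 0
main ENDP
END main
```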
Comments
Allow programmer to communicate about the code
The normal, inline comment starts with semicolon ;
And the comment block starts using COMMENT #, where # is any symbol that doesn't appear in the comment block
mov eax, 5 ; this is a valid in-line comment
COMMENT !
this is a valid block comment
!
COMMENT $
as well as this!
but it cannot have '!' as the starting symbol, because it's used in the line above
$
NOP Instruction
The most basic, safe, and useless instruction is NOP (no operation). It takes up 1 byte and does nothing. It is sometimes used to align code to an address boundary or to fight WAR/WAW hazards
Full Code Example
The full, complete code from the beginning of this chapter should look something like this
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD
.code
main PROC
mov eax, 5
add eax, 6
INVOKE ExitProcess, 0
main ENDP
END main
The 4th line, ExitProcess PROTO, dwExitCode:DWORD, is a function prototype. A prototype consists of the function name (ExitProcess), the PROTO directive, a comma, and a list of input parameters, here a single DWORD named dwExitCode
Assembler Directives
.386
Identifies program as 32-bit, which can access 32-bit registers and addresses
.model
Our code includes two model conventions
- flat - always used in 32-bit programs; selects the flat memory model associated with protected mode
- stdcall - tells the assembler how to manage the runtime stack when procedures are called
The .model directive must appear before .stack, .code, and .data
.stack
Sets the number of bytes of memory to reserve for the program's runtime stack. 4096 is a typical value, as it corresponds to the size of a memory page
The stack is used by all modern programs when calling subroutines, to hold passed parameters and return addresses
Additionally, the stack can hold local variables declared inside a function.
.code
Marks the beginning of the code area, which contains executable instructions. Usually the next line declares the program's entry point, typically main, using the PROC directive.
The entry point is the location of the very first instruction the program will execute
The ENDP directive marks the end of a procedure. It must use the same name as the one used in the declaration
The END directive marks the last line of the program; any lines after it are ignored by the assembler
.data
Holds declarations of editable variables. Variables are placed in memory sequentially, one after another, so for example
.data
count WORD 10h
array BYTE 10h, 20h, 30h, 40h
flag BYTE 1
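will look in memory like this (offset above each byte, assuming the segment starts at offset 0; count is stored low byte first): $$ \overset{0000}{10} , \overset{0001}{00} , \overset{0002}{10} , \overset{0003}{20} , \overset{0004}{30} , \overset{0005}{40} , \overset{0006}{01} $$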
.const
Holds declarations of constants (read-only data). They are laid out in memory exactly the same way as in .data
Assembling, Linking, and Running Programs
Assembly source code cannot be executed directly by the computer. It has to go through a couple of steps before the processor can understand it, the most important being translation, or assembly. An assembler works in a similar fashion to a compiler, which is used for HLLs like C++ or Java.
Assemble-Link-Execute Cycle
Step 1 - program is written using a text editor in a text source file
Step 2 - The assembler reads the source file and produces an object file, a machine-language translation of the program. Optionally, it can also produce a listing file. It checks the program for errors and reports any that occur.
Step 3 - Object file is read by the linker which checks for calls to procedures in a link library. If found, linker copies the required procedures, combines them with the object file and creates the executable file.
Step 4 - OS loader utility reads executable file into memory and points CPU to the program's entry point.
Listing File
Listing file contains a copy of the program's source code with:
- line numbers
- numeric address of each instruction (relative, starts at 0x00000000)
- machine code bytes of each instruction (in hex)
- symbol table - contains names of all program identifiers and segments
It's useful to get detailed information about the program
Example listing file line (first line in .code from AddTwo):
11: 00000000 B8 00000005 mov eax, 5
11 - line number
00000000 - relative address of the instruction
B8 00000005 - machine code instruction
B8 is called operation code (opcode) representing specific machine instruction to move imm32 integer into EAX
Note that in the listing file the INVOKE directive from our source code is expanded into PUSH and CALL instructions
Defining Data
Intrinsic Data Types
The assembler understands a basic set of intrinsic data types, which describe the size of a type (byte, word, doubleword, and so on), whether it is signed or unsigned, and whether it is an integer or a real
Sizes overlap: SDWORD, DWORD, and REAL4, for example, all occupy 32 bits. The processor does not care which one was used, but declaring a variable as signed or real makes the code easier to read.
Data Definition Statement
Sets aside storage in memory for a variable, with an optional name (label). Data definition statements create variables based on intrinsic data types.
The syntax is as follows: [name] directive initializer [, initializer]...
The intrinsic data types are as follows
| Type | Usage |
|---|---|
| BYTE | 8-bit unsigned integer |
| SBYTE | 8-bit signed integer |
| WORD | 16-bit unsigned integer |
| SWORD | 16-bit signed integer |
| DWORD | 32-bit unsigned integer |
| SDWORD | 32-bit signed integer |
| FWORD | 48-bit integer (far pointer in protected mode) |
| QWORD | 64-bit integer |
| TBYTE | 80-bit (10-byte) integer. T means ten-byte |
| REAL4 | 32-bit (4-byte) IEEE short real |
| REAL8 | 64-bit (8-byte) IEEE long real |
| REAL10 | 80-bit (10-byte) IEEE extended real |
Example data definition statement - count DWORD 12345678
Name
Optional, assigned to variable, must conform to the rules of identifiers
Directive
Directive in a data definition statement can be BYTE, WORD, and so on, or any of the legacy types from the table below:
| Directive | Usage |
|---|---|
| DB | 8-bit integer |
| DW | 16-bit integer |
| DD | 32-bit integer or real |
| DQ | 64-bit integer or real |
| DT | 80-bit (10 byte) integer |
Initializer
At least one initializer is required (might be zero), in order to assign a starting or initial value to a variable
Additional initializers, if any, are separated by commas
For integer data types, initializer is an integer literal or expression matching the size of the variable's type
An uninitialized variable can be created by using ? as the initializer; its starting value is undefined and should not be relied on
Assembler converts all the initializers to the binary format
Declaring Data
Defining BYTE and SBYTE Data
BYTE and SBYTE directives allocate storage for one or more values, each of size 1 byte (8 bits). Examples
value1 BYTE 'A' ; character literal
value2 BYTE 0 ; smallest unsigned byte
value3 BYTE 255 ; largest unsigned byte
value4 SBYTE -128 ; smallest signed byte
value5 SBYTE +127 ; largest signed byte
value6 BYTE ? ; uninitialized byte
Multiple Initializers
If multiple initializers are used in the same data definition, its label refers only to the offset of the first initializer.
For the declaration list BYTE 10h, 20h, 30h, 40h, the memory layout looks like this (memory offset above each value): $$ \overset{0000}{10} , \overset{0001}{20} , \overset{0002}{30} , \overset{0003}{40} $$
Not all data definitions require labels. We can continue the array of bytes list by defining additional bytes on the next lines
list BYTE 11, 12, 13, 14
BYTE 15, 16, 17, 18
BYTE 19, 20, 21, 22
list label only refers to the first initializer (11), but the next ones can still be accessed with [list + idx]
Within one data definition, initializers can use different radixes and literal types.
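As a small sketch of mixing radixes and literal types, the two hypothetical lists below hold exactly the same four bytes, written in different ways.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
; hypothetical names; both lists contain the bytes 0Ah, 20h, 41h, 22h
list1 BYTE 10, 32, 41h, 00100010b    ; decimal, decimal, hex, binary
list2 BYTE 0Ah, 20h, 'A', 22h        ; hex, hex, character literal, hex

.code
main PROC
    mov al, [list1 + 2]              ; AL = 41h
    mov bl, [list2 + 2]              ; BL = 41h, the ASCII code of 'A'
    INVOKE ExitProcess, 0
main ENDP
END main
```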
Defining Strings
To define a string, enclose it in single or double quotation marks. The most common string type is the null-terminated string, which ends with a null byte (value 0)
message1 BYTE "This is an example string", 0
Each character uses a byte of storage.
Using the list continuation shown earlier, we can easily split a string across multiple lines without losing its contents.
someText BYTE "Lorem ipsum dolor "
BYTE "Sally is the best!", 0
This example would produce 'Lorem ipsum dolor Sally is the best!' when written
DUP Operator
The DUP operator allocates storage for multiple data items, with an integer expression giving the count.
Useful for allocating memory for a string input or array.
inputString BYTE 127 DUP(0) ; allocates 127 bytes initialized to 0
arrayInput BYTE 20 DUP(?) ; allocates 20 bytes of uninitialized data
repeatText BYTE 4 DUP("Text") ; allocates 16 bytes with 'Text' repeated 4 times - 'TextTextTextText'
If the DUP operator is used with any other type (WORD, DWORD, etc.), it allocates that many items of the type, so DWORD 4 DUP(0) allocates 16 bytes initialized to 0, as each DWORD occupies 4 bytes
Packed BCD (TBYTE) Data
Packed BCD integers are stored in 10-byte packages on Intel architecture.
Each byte except the highest contains two decimal digits. The highest byte holds the sign: 80h means negative, 00h means positive.
MASM uses TBYTE type to declare BCD integers, but cannot translate decimal integers to BCD, they have to be declared in hex.
intVal TBYTE 80000000000000001234h ; valid, equal to -1234
intVal TBYTE -1234 ; invalid
If we write -1234 in the declaration, MASM reads it as binary integer rather than packed BCD integer.
To encode a real number as packed BCD, first load it onto the floating-point stack with FLD, then pop it with FBSTP, which converts it to packed BCD and rounds it to the nearest integer.
.data
posVal REAL8 1.5
bcdVal TBYTE ?
.code
fld posVal ; push onto FP stack
fbstp bcdVal ; pop from the FP stack and convert to BCD
Defining Floating-Point Types
REAL4 defines 4-byte single-precision FP variable
REAL8 defines 8-byte double-precision FP variable
REAL10 defines 10-byte extended-precision FP variable
Each requires one or more real constant initializers.
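A few floating-point definitions, as a minimal sketch with hypothetical names. The exponents are chosen to fit each type's range; the last line combines a real initializer with DUP.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
; hypothetical names, for illustration only
rVal1      REAL4  -1.2            ; 4-byte single precision
rVal2      REAL8  3.2E-260        ; 8-byte double precision
rVal3      REAL10 4.6E+4096       ; 10-byte extended precision
shortArray REAL4  20 DUP(0.0)     ; 20 single-precision reals, 80 bytes in total

.code
main PROC
    INVOKE ExitProcess, 0         ; the definitions above only need to assemble
main ENDP
END main
```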
Declaring Uninitialized Data
We can use the .data? directive to declare uninitialized data
When defining large blocks of uninitialized data, .data? reduces the size of the compiled program.
For example
.data
smallArray DWORD 10 DUP(0) ; 40 bytes
.data?
largeArray DWORD 5000 DUP(?) ; 20000 bytes, not initialized
============================================================================
.data
smallArray DWORD 10 DUP(0) ; 40 bytes
largeArray DWORD 5000 DUP(0) ; 20000 bytes, initialized
The compiled program will be 20000 bytes bigger in the second case.
Mixing Code and Data
Assembler allows for mixing of .code and .data directives freely.
We can write code like:
.data
val1 WORD 1234h
.code
mov ax, val1
.data
val2 WORD 5678h
.code
add ax, val2
In the end, the assembler puts all of the .data declarations in one place in the final file.
This might create a better logical structure of the code, but also make it more complicated to read.
Little-Endian Data
x86 processors store and retrieve data from memory in little-endian order (low to high). The LSB is stored at the lowest (first) memory address.
For an initialization of number 12345678h, the memory will look like $$ \overset{0000}{78} , \overset{0001}{56} , \overset{0002}{34} , \overset{0003}{12} $$
Big-Endian
Some computers use big-endian order, which stores data in a more human-readable way: the bytes in memory appear in the same order as in the source code $$ \overset{0000}{12} , \overset{0001}{34} , \overset{0002}{56} , \overset{0003}{78} $$
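The little-endian layout on x86 can be observed by reading a doubleword back in smaller pieces. The sketch below assumes a hypothetical variable val and uses the PTR operator to override the operand size, which these notes do not otherwise cover.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
val DWORD 12345678h               ; stored in memory as 78 56 34 12

.code
main PROC
    mov al, BYTE PTR val          ; AL = 78h, the byte at the lowest address
    mov bl, BYTE PTR [val + 3]    ; BL = 12h, the byte at the highest address
    mov cx, WORD PTR val          ; CX = 5678h, the low word
    INVOKE ExitProcess, 0
main ENDP
END main
```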
Symbolic Constants
A symbolic constant is created by associating an identifier (symbol) with an integer expression or some text. Symbolic constants do not reserve storage and cannot change value at runtime. The assembler scans the code and substitutes each symbol with its associated expression.
Equal-Sign Symbol
Equal-sign directive associates symbol with an integer expression. $$ name = expression$$
Typically, expression is a 32-bit integer value.
This directive is useful for defining global values that are used in multiple places throughout the code and might change in future development, like COUNT, N, etc.
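A minimal sketch of the equal-sign directive, assuming a hypothetical symbol COUNT and array name. The symbol takes no storage; the assembler substitutes its current value wherever it appears, and the symbol can be redefined later in the source.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

COUNT = 500                       ; hypothetical symbol, no storage reserved

.data
array DWORD COUNT DUP(0)          ; same as DWORD 500 DUP(0)

.code
main PROC
    mov eax, COUNT                ; assembled as mov eax, 500
COUNT = 10                        ; redefinition is allowed with =
    mov ebx, COUNT                ; assembled as mov ebx, 10
    INVOKE ExitProcess, 0
main ENDP
END main
```

Changing the first definition is enough to resize the array and update every later use of the symbol.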
Current Location Counter
One of the most important symbols is $, the current location counter, which stands for the offset of the current statement. For example, selfPtr DWORD $ initializes selfPtr with its own offset.
Calculating Size of String or Array
It is useful for getting the size of strings/lists, as we can do something like
stringText BYTE "Hello, World!", 0
stringSize = ($ - stringText) ; gets current address using $ and
; subtracts the address of stringText
; getting number of bytes (!)
To get number of elements in WORD list, we'd have to divide the expression by 2, for DWORD - by 4.
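The division by element size could be sketched as follows. All names are hypothetical, and each size symbol must come immediately after the data it measures, since $ is the offset at that point.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

.data
; hypothetical names, for illustration only
stringText BYTE "Hello, World!", 0
stringSize = ($ - stringText)            ; 14 bytes, including the null byte
wordList   WORD 1000h, 2000h, 3000h, 4000h
wordCount  = ($ - wordList) / 2          ; 8 bytes / 2 = 4 elements
dwordList  DWORD 10, 20, 30
dwordCount = ($ - dwordList) / 4         ; 12 bytes / 4 = 3 elements

.code
main PROC
    mov ecx, wordCount                   ; ECX = 4
    mov edx, dwordCount                  ; EDX = 3
    INVOKE ExitProcess, 0
main ENDP
END main
```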
Keyboard Definitions
Programs often define symbols that identify commonly used numeric keyboard codes. For example, 27 is ASCII code for ESC key
Esc_key = 27
EQU Directive
EQU directive associates symbolic name with an integer expression or some arbitrary text
name EQU expression
name EQU symbol
name EQU <text>
In the first format, it must be a valid integer expression
In the second format, symbol must be an already existing name, previously defined using = or EQU.
In the third format, any text may appear within the <...> brackets.
EQU might be useful with defining values that don't evaluate to an integer, like PI EQU <3.1416>
Unlike equal-sign directive, EQU cannot be redefined in the source code.
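The difference between the expression and text formats of EQU can be sketched like this. The names are hypothetical; matrix1 is evaluated once when it is defined, while matrix2 is pasted as text and evaluated where it is used.

```asm
.386
.model flat, stdcall
.stack 4096
ExitProcess PROTO, dwExitCode:DWORD

; hypothetical symbols, for illustration only
PI       EQU <3.1416>
pressKey EQU <"Press any key to continue...", 0>
matrix1  EQU 10 * 10              ; integer expression, evaluated to 100
matrix2  EQU <10 * 10>            ; text, substituted as written

.data
prompt BYTE  pressKey             ; becomes BYTE "Press any key to continue...", 0
piVal  REAL4 PI                   ; becomes REAL4 3.1416
M1     WORD  matrix1              ; becomes WORD 100
M2     WORD  matrix2              ; becomes WORD 10 * 10

.code
main PROC
    INVOKE ExitProcess, 0
main ENDP
END main
```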
TEXTEQU Directive
Similar to EQU, TEXTEQU creates what is known as a text macro. There are three formats:
name TEXTEQU <text>
name TEXTEQU textmacro
name TEXTEQU %constExpr
The first format assigns literal text, the second assigns the contents of an existing text macro, and the third assigns the value of a constant integer expression.
Text macros can build on each other. In the example below, count is set to the value of an expression that uses rowSize, then move is defined as mov, and finally setupAL is built from move and count.
rowSize = 5
count TEXTEQU %(rowSize * 2)
move TEXTEQU <mov>
setupAL TEXTEQU <move al, count>
====
; this can be used as follows:
setupAL
; which will be assembled into
mov al, 10
Symbol defined by TEXTEQU can be redefined at any time