2. Data

Data types

Binary data representation

Computer operates solely on groups of bits, treating them as binary numbers
Usually binary words of length 8·2^n bits
Non-numeric data must be expressed using numbers

Alphanumeric data

Every character is represented by a number denoting its position in code table
Most frequently used codes:
ASCII - 128 positions
Extended ASCII - 256 positions
128 original positions, new ones - national characters, special symbols
EBCDIC - 256 symbols, mainly used by IBM
UNICODE - initially 2^16, currently 2^21 positions
contains over 150000 characters
covers all alphabetic characters used in the world

ASCII

American Standard Code for Information Interchange
128 positions, 95 visible and 33 invisible
Invisible - codes 0...31 (0x00...0x1f) and 127 (0x7f)
white spaces, formatting codes, transmission and device control
Visible - codes 32...126 (0x20...0x7e)
space (code 32)
digits
latin letters
punctuation marks
common mathematical symbols

To remember:
Control codes - positions 0...31
7 - Bell/Alert (BEL)
8 - Backspace (BS)
9 - Horizontal tab (HT)
10 - Line feed (LF)
12 - Form feed (FF)
13 - Carriage return (CR)
Space - 32 (0x20)
Digits - codes 48...57 (0x30...0x39)
Letters:
Uppercase - 65...90 (0x41...0x5a)
Lowercase - 97...122 (0x61...0x7a)
Distance between uppercase and lowercase - 32 (0x20)
Other visible chars - 33...126
Delete (DEL) - 127 (0x7f)
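
The code ranges above allow simple arithmetic on characters. A minimal sketch, assuming ASCII input; the helper names `to_upper` and `digit_value` are hypothetical and chosen only for illustration:

```python
def to_upper(ch: str) -> str:
    """Convert an ASCII lowercase letter to uppercase by subtracting 0x20."""
    code = ord(ch)
    if 0x61 <= code <= 0x7A:       # 'a'...'z'
        code -= 0x20               # distance between lowercase and uppercase
    return chr(code)


def digit_value(ch: str) -> int:
    """Return the numeric value of an ASCII digit character."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:   # '0'...'9'
        raise ValueError("not an ASCII digit")
    return code - 0x30             # digits start at code 48 (0x30)


print(to_upper("q"), digit_value("7"), ord("\n"))  # Q 7 10
```

Characters outside the lowercase range are returned unchanged, so the helper is safe to apply to digits and punctuation.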

UNICODE

21-bit character code designed to include all alphabetic characters used in the world
Symbolic notation of a character - U+<hex_code>
UTF-8 is the most common representation
Char represented by 1...4 octets
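
The variable length of UTF-8 can be observed directly. A short sketch using Python's built-in `str.encode`; the four sample characters are arbitrary choices, one for each possible length:

```python
# One sample character per UTF-8 length (1 to 4 octets).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Show the symbolic notation U+<hex_code> next to the encoded octets.
    print(f"U+{ord(ch):04X} -> {len(encoded)} octet(s): {encoded.hex(' ')}")

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]
```

Characters from the ASCII range keep their single-octet ASCII encoding, which is why plain ASCII text is also valid UTF-8.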

Text strings

Sequence of characters. Must have an indication of the end.

Two conventions:
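
Two widely used conventions are a terminating character (as in C, where the NUL code marks the end) and an explicit length stored before the characters (as in Pascal strings). That these are the two conventions meant here is an assumption. A minimal sketch, with hypothetical helper names and a one-byte length field chosen for simplicity:

```python
def to_terminated(text: str) -> bytes:
    """C-style: the characters followed by a NUL (0x00) terminator."""
    return text.encode("ascii") + b"\x00"


def to_length_prefixed(text: str) -> bytes:
    """Pascal-style: one length byte followed by the characters."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("too long for a one-byte length prefix")
    return bytes([len(data)]) + data


def from_terminated(buf: bytes) -> str:
    """Read characters up to (not including) the first NUL byte."""
    return buf[:buf.index(0)].decode("ascii")


def from_length_prefixed(buf: bytes) -> str:
    """Read exactly as many characters as the first byte announces."""
    return buf[1:1 + buf[0]].decode("ascii")
```

With a terminator, the string cannot contain the terminator code itself; with a length prefix, the maximum length is limited by the width of the length field.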

Sounds and images

Sound

Images:

Units of information

bit (b) - 1 or 0
octet (o) - 8 bits
byte (B) - smallest unit of information addressed by computer, usually 8 bits
word - unit of information operated on by the computer (1, 2, 4, 8, 16 bytes)
processor word - unit of information operable by the processor (usually 32 or 64 bits)
memory word - unit of information that may be transferred in a single cycle between processor and memory (usually 64 or 128 bits)

Computers operate on words - groups of 2^n bits
common word lengths - 8, 16, 32, 64, 128 bits
Some computers are capable of operating on single bits and bit fields of arbitrary widths
Data stored as words in memory

Logical (boolean) data

False is always stored as a series of 0s - 00...00
But True can be stored in different ways:

This is important for logical operations - overall and bit-wise
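
The difference between overall (logical) and bit-wise operations can be sketched as follows. The 8-bit word width and the two encodings of True (a single 1 and all ones) are assumptions made for illustration:

```python
WIDTH = 8                      # assumed word width
MASK = (1 << WIDTH) - 1

TRUE_AS_ONE = 0x01             # True stored as 00...01
TRUE_AS_ALL_ONES = MASK        # True stored as 11...11


def is_true(word: int) -> bool:
    """Overall interpretation: any nonzero word counts as True."""
    return word != 0


def bitwise_not(word: int) -> int:
    """Invert every bit of the word."""
    return ~word & MASK


def logical_and(a: int, b: int) -> bool:
    """Overall AND: both operands are nonzero."""
    return is_true(a) and is_true(b)


# Bit-wise NOT of 00...01 gives 11...10, which is still nonzero (True).
print(is_true(bitwise_not(TRUE_AS_ONE)))        # True
# Bit-wise NOT of 11...11 gives 00...00 (False), as expected of negation.
print(is_true(bitwise_not(TRUE_AS_ALL_ONES)))   # False
# Two nonzero words may share no set bits: overall AND and bit-wise AND differ.
print(logical_and(0x01, 0x02), is_true(0x01 & 0x02))  # True False
```

With the all-ones encoding, bit-wise operations give the same answers as logical ones, as long as every True value uses exactly that pattern.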

Unsigned integers

BCD (binary-coded decimal) - used for fixed-point decimal numbers
Decimal digits encoded in binary - 4 bits (nibble, tetrad) per digit
Allowed nibble values - 0...9
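
A minimal sketch of the nibble-per-digit encoding, assuming the packed layout (two decimal digits per byte, most significant digit first); the function names are hypothetical:

```python
def to_packed_bcd(value: int) -> bytes:
    """Encode a non-negative integer with one decimal digit per nibble."""
    if value < 0:
        raise ValueError("only non-negative values are supported")
    digits = str(value)
    if len(digits) % 2:                 # pad to an even number of digits
        digits = "0" + digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def from_packed_bcd(data: bytes) -> int:
    """Decode packed BCD, rejecting nibbles outside 0...9."""
    value = 0
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError("invalid BCD nibble")
        value = value * 100 + high * 10 + low
    return value


print(to_packed_bcd(1995).hex())  # 1995
```

The hexadecimal dump of the encoded bytes reads the same as the decimal number, which is the main convenience of this representation.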

Formats:

Signed integer representations

How to correctly interpret a negative number?
The interpretation must still hold after negation.
[Figure: Pasted image 20250412174132.png]
Representation of zero
Ones' complement and sign-magnitude each have two representations of zero (+0 and -0)
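
The three common signed representations can be compared on a single value. A minimal sketch, assuming an 8-bit word by default; `encode_signed` is a hypothetical helper restricted to the range that all three notations share:

```python
def encode_signed(value: int, bits: int = 8):
    """Return (sign-magnitude, ones' complement, two's complement) patterns."""
    mask = (1 << bits) - 1
    magnitude = abs(value)
    if magnitude > (1 << (bits - 1)) - 1:
        raise ValueError("value outside the range common to all three")
    if value >= 0:
        # Non-negative numbers look identical in all three notations.
        return magnitude, magnitude, magnitude
    sign_magnitude = (1 << (bits - 1)) | magnitude   # set the sign bit
    ones = ~magnitude & mask                         # invert all bits
    twos = (ones + 1) & mask                         # invert and add one
    return sign_magnitude, ones, twos


for pattern in encode_signed(-5):
    print(f"{pattern:08b}")
```

For -5 this yields 10000101, 11111010 and 11111011 respectively.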

Fixed point notation

Obtained by shifting bit weights in integer notation
Bit weights are multiplied by 2^-f, where f is the number of bits in the fractional part
Used for 2's complement and unsigned numbers
Commonly used formats
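
A minimal sketch of fixed-point arithmetic, assuming f = 8 fractional bits (a choice made only for this example). Python integers do not overflow, so the word width is not modelled here:

```python
FRACTION_BITS = 8               # f, assumed for this example
SCALE = 1 << FRACTION_BITS      # 2^f


def to_fixed(x: float) -> int:
    """Store x as an integer whose bit weights are scaled by 2^-f."""
    return round(x * SCALE)


def from_fixed(raw: int) -> float:
    """Recover the real value from the stored integer."""
    return raw / SCALE


def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the product carries 2f fractional
    bits, so it is shifted right by f to return to the original format."""
    return (a * b) >> FRACTION_BITS


print(to_fixed(3.25))                                        # 832
print(from_fixed(fixed_mul(to_fixed(1.5), to_fixed(2.0))))   # 3.0
```

Addition and subtraction work directly on the stored integers; only multiplication and division need the extra shift.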

Binary floating point

Examples:
1.234·10^5, 0.1234·10^6, 12.34·10^4
Elements:

Binary floating point notation

If possible, numbers are stored in normalized form
Exponent field special values:
00...00 - zero or denormalized form
11...11 - infinity or Not-a-Number (NaN)
Exponent is stored in e-bit wide field as biased value
BIAS = 2^(e-1) - 1
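
The fields can be inspected with Python's standard `struct` module. A minimal sketch, assuming the IEEE 754 single-precision layout (1 sign bit, e = 8 exponent bits, 23 fraction bits); `decode_float32` is a hypothetical helper:

```python
import struct

E_BITS = 8
BIAS = 2 ** (E_BITS - 1) - 1    # 127 for single precision


def decode_float32(x: float):
    """Return (sign, biased exponent, fraction) of x stored as IEEE single."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    sign = raw >> 31
    exponent = (raw >> 23) & 0xFF
    fraction = raw & 0x7FFFFF
    return sign, exponent, fraction


sign, exponent, fraction = decode_float32(-2.5)
# -2.5 = -1.25 * 2^1, so the stored exponent is 1 + BIAS = 128.
print(sign, exponent - BIAS, hex(fraction))  # 1 1 0x200000
```

The special exponent values are visible the same way: infinity decodes to an all-ones exponent with a zero fraction, and a denormalized number to an all-zeros exponent with a nonzero fraction.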

Common formats

[Figure: Pasted image 20250412175943.png]

Floating point arithmetic

A floating point number is an approximation, so the results of arithmetic operations are also approximate.

Result may depend on order of operations

Testing computed results for exact equality often yields false, even when the values are mathematically equal

The precision of IEEE single (24-bit significand) is lower than that of a 32-bit integer or fixed-point number
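
The three effects above can be reproduced in a few lines. A sketch using Python floats (IEEE double) for the first two points and a round trip through `struct` format `f` to emulate IEEE single; the tolerance-based comparison with `math.isclose` is one common workaround, not the only one:

```python
import math
import struct

# Equality test: 0.1, 0.2 and 0.3 have no exact binary representation.
total = 0.1 + 0.2
print(total == 0.3)                 # False
print(math.isclose(total, 0.3))     # True

# Order of operations: the same three terms, grouped differently.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)                # False

# Precision: 2^24 + 1 fits in a 32-bit integer but not in a 24-bit significand.
big = 2 ** 24 + 1
as_single = struct.unpack("f", struct.pack("f", float(big)))[0]
print(big, as_single)               # 16777217 16777216.0
```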


Memory organization

In general-purpose computers, the 8-bit byte is the smallest addressable unit of memory
Multibyte data occupy the appropriate number of consecutive byte-sized memory locations

Multibyte data addressing

Little Endian - least significant byte of data word at the lowest address
Big Endian - most significant byte at the lowest address
[Figure: Pasted image 20250412180303.png]
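
Both byte orders can be shown with the standard `struct` module, where `<` selects little endian and `>` big endian. The value 0x12345678 is an arbitrary example:

```python
import struct

value = 0x12345678

little = struct.pack("<I", value)   # least significant byte first
big = struct.pack(">I", value)      # most significant byte first

print(little.hex(" "))  # 78 56 34 12
print(big.hex(" "))     # 12 34 56 78

# Reading a 16-bit item from the same starting address:
low_half = struct.unpack("<H", little[:2])[0]   # low half of the value
high_half = struct.unpack(">H", big[:2])[0]     # high half of the value
print(hex(low_half), hex(high_half))            # 0x5678 0x1234
```

The last two lines illustrate the type-casting difference: in little endian, the shorter item at the same address holds the low-order part of the value, while in big endian the address would have to be adjusted to reach it.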

Little Endian

Byte numbering corresponds to bit weights
Natural for computers, not for humans
Type casting preserves pointer value

Big Endian

Natural for humans
Type cast changes the pointer value
Fast string compare is possible

Logical vs. physical memory organization

[Figure: Pasted image 20250412180450.png]

Data alignment

Physically, the memory is organized as a vector of words, each word being a vector of bytes
Any number of bytes within a single word may be accessed at the same time
The memory word is usually at least as long as the processor word
Access to multibyte data within a single memory word is faster than when the data is split across two memory words (two addresses are needed)
Newer architectures enforce placement of every data item that ensures the fastest possible access

Size alignment

For the fastest access regardless of the physical memory width, each item is placed in memory at an address divisible by its length (rounded up to 2^n)
Newer architectures enforce size alignment, as it boosts efficiency
Each data type is characterized by two values:
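
The size-alignment rule can be sketched with two small helpers. The function names are hypothetical; the 10-byte item in the example stands for any type whose length is not a power of two:

```python
def natural_alignment(size: int) -> int:
    """Round the item length up to the nearest power of two."""
    alignment = 1
    while alignment < size:
        alignment *= 2
    return alignment


def align_up(address: int, alignment: int) -> int:
    """Return the lowest address >= address divisible by alignment
    (alignment must be a power of two)."""
    return (address + alignment - 1) & ~(alignment - 1)


print(natural_alignment(4))    # 4
print(natural_alignment(10))   # 16
print(align_up(13, 4))         # 16
```

An address that is already aligned is returned unchanged, so `align_up` can be applied unconditionally before placing each item.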


Vectors and arrays

The elements of vector occupy the consecutive memory locations, starting with the first element
Vector alignment is the same as its element alignment

[Figure: Pasted image 20250412181114.png]

Structures

Compiler is required to preserve the order of fields from the declaration. It cannot optimize the layout of a struct
Each field must be aligned according to its type requirements
Padding (unused bytes) is added between fields to achieve proper alignment
Structure must be aligned according to the alignment requirement of the field with the largest alignment
Structure is padded to a multiple of its alignment
In C - sizeof() returns the offset between two consecutive structures of a given type, necessary for allocating a vector of structs
[Figure: Pasted image 20250412181351.png]
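
The layout rules above can be sketched as a small calculator. The field sizes and alignments in the example (1-byte char, 4-byte int, 2-byte short) are typical values assumed for illustration, not guaranteed by any particular compiler; `layout` is a hypothetical helper:

```python
def layout(fields):
    """fields: list of (name, size, alignment) in declaration order.
    Returns (offsets, total_size, struct_alignment)."""
    offset = 0
    offsets = {}
    struct_alignment = 1
    for name, size, alignment in fields:
        # Insert padding so the field starts at a multiple of its alignment.
        offset = (offset + alignment - 1) // alignment * alignment
        offsets[name] = offset
        offset += size
        struct_alignment = max(struct_alignment, alignment)
    # Tail padding: the total size is a multiple of the struct alignment.
    total = (offset + struct_alignment - 1) // struct_alignment * struct_alignment
    return offsets, total, struct_alignment


example = [("c", 1, 1), ("i", 4, 4), ("s", 2, 2)]
print(layout(example))  # ({'c': 0, 'i': 4, 's': 8}, 12, 4)
```

Declaring the same fields in the order int, short, char gives a total of 8 bytes instead of 12, which shows why the field order chosen by the programmer matters when the compiler is not allowed to reorder.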

x86 AVX vector unit

AVX is a vector unit of x86 processors
Sixteen 256-bit registers (YMM0...YMM15)
AVX2 supports 256-bit integer vectors
A newer extension (AVX-512) supports 512-bit vectors