2. Data

Data types

Boolean (True/False)
Text characters
Numbers:
- integer (signed/unsigned)
- non-integer (floating point, fixed point)
Sounds and other unidimensional signals
Raster images

Binary data representation

Computer operates solely on groups of bits, treating them as binary numbers
Usually binary words of lengths $8 \cdot 2^{n}$ bits
Non-numeric data must be expressed using numbers

Alphanumeric data

Every character is represented by a number denoting its position in code table
Most frequently used codes:
ASCII - 128 positions
Extended ASCII - 256 positions
128 original positions, new ones - national characters, special symbols
EBCDIC - 256 symbols, mainly used by IBM
UNICODE - initially $2^{16}$, currently $2^{21}$ positions
contains over 150000 characters
covers all alphabetic characters used in the world

ASCII

American Standard Code for Information Interchange
128 positions, 95 visible and 33 invisible
Invisible - codes 0...31 (0x00...0x1f) and 127 (0x7f)
white spaces, formatting codes, transmission and device control
Visible - codes 32...126 (0x20...0x7e)
space (code 32)
digits
latin letters
punctuation marks
common mathematical symbols

To remember:
Control codes - positions 0...31
7 - Bell/Alert (BEL)
8 - Backspace (BS)
9 - Horizontal tab (HT)
10 - Line feed (LF)
12 - Form feed (FF)
13 - Carriage return (CR)
Space - 32 (0x20)
Digits - codes 48...57 (0x30...0x39)
Letters:
Uppercase - 65...90 (0x41...0x5a)
Lowercase - 97...122 (0x61...0x7a)
Distance between uppercase and lowercase - 32 (0x20)
Other visible chars - 33...126
Delete (DEL) - 127 (0x7f)
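The regular layout of the table makes some conversions plain arithmetic. A minimal sketch in Python (the helper names `to_upper_ascii` and `digit_value` are made up for illustration):

```python
# Code positions quoted in the list above.
assert ord(" ") == 0x20
assert ord("0") == 48 and ord("9") == 57
assert ord("A") == 65 and ord("a") == 97


def to_upper_ascii(ch: str) -> str:
    """Convert one lowercase ASCII letter to uppercase by clearing bit 0x20."""
    if "a" <= ch <= "z":
        return chr(ord(ch) & ~0x20)
    return ch  # everything else is left unchanged


def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit: its code minus the code of '0'."""
    if not "0" <= ch <= "9":
        raise ValueError("not an ASCII digit")
    return ord(ch) - ord("0")
```

Because uppercase and lowercase letters differ only in the bit of weight 32, case conversion needs no lookup table.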

UNICODE

21-bit character code designed to include all alphabetic characters used in the world
Symbolic notation of a character - U+<hex_code>
UTF-8 is the most common representation
Char represented by 1...4 octets
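The variable length of UTF-8 can be observed directly. A short sketch in Python; the four sample characters are arbitrary choices, one for each possible length:

```python
# One sample character for each UTF-8 length (arbitrary examples).
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]


def utf8_length(ch: str) -> int:
    """Number of octets used by the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # U+<hex_code> notation next to the octets actually stored
    print(f"U+{ord(ch):04X} -> {len(encoded)} octet(s): {encoded.hex(' ')}")
```

Characters from the ASCII range keep their single-octet codes, so plain ASCII text is already valid UTF-8.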

Text strings

Sequence of characters. Must have an indication of the end.

Two conventions:

End of string marked with a special code
Length of the string stored before the text
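Both conventions can be sketched on raw bytes. The example below assumes ASCII text, a zero octet as the terminator (as in C) and a one-octet length field (so at most 255 characters); the function names are illustrative only:

```python
def encode_terminated(text: str) -> bytes:
    """Terminator convention: text followed by a zero octet."""
    data = text.encode("ascii")
    if b"\x00" in data:
        raise ValueError("text must not contain the terminator itself")
    return data + b"\x00"


def decode_terminated(buf: bytes) -> str:
    end = buf.index(b"\x00")          # scan for the end marker
    return buf[:end].decode("ascii")


def encode_prefixed(text: str) -> bytes:
    """Length-prefix convention: one octet of length, then the text."""
    data = text.encode("ascii")
    if len(data) > 255:
        raise ValueError("length does not fit in one octet")
    return bytes([len(data)]) + data


def decode_prefixed(buf: bytes) -> str:
    length = buf[0]                   # length is known without scanning
    return buf[1:1 + length].decode("ascii")
```

The terminator variant cannot store the terminator code inside the text, while the prefixed variant limits the maximum length to what the length field can hold.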

Sounds and images

Sound

Samples of voltage (sampling freq. 8...48 kHz) saved as integer numbers

Images:

Rectangular arrays of square picture elements (pixels)
Every pixel has an assigned color - represented by three primaries (RGB)
Values of primaries stored as unsigned integer numbers
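A rough illustration of both encodings in Python. The concrete parameters are assumptions, not taken from the notes: 16-bit signed sound samples, 8 bits per primary, and the 0xRRGGBB packing order.

```python
import math

SAMPLE_RATE = 8000   # Hz, lower end of the range quoted above
FULL_SCALE = 32767   # assumed: 16-bit signed samples


def sample_sine(freq_hz: float, count: int) -> list:
    """Sample a full-scale sine wave and store each sample as an integer."""
    return [
        round(FULL_SCALE * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE))
        for n in range(count)
    ]


def pack_rgb(r: int, g: int, b: int) -> int:
    """Pack three 8-bit unsigned primaries into one number (0xRRGGBB)."""
    for v in (r, g, b):
        if not 0 <= v <= 255:
            raise ValueError("primary out of 8-bit range")
    return (r << 16) | (g << 8) | b
```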

Units of information

bit (b) - 1 or 0
octet (o) - 8 bits
byte (B) - smallest unit of information addressed by computer, usually 8 bits
word - unit of information operated on by the computer (1, 2, 4, 8, 16 bytes)
processor word - unit of information operable by the processor (us. 32/64b)
memory word - unit of information that may be transferred in a single cycle between processor and memory (us. 64/128b)

Computers operate on words - groups of $2^{n}$ bits
common word lengths - 8, 16, 32, 64, 128b
Some computers are capable of operating on single bits and bit fields of arbitrary widths
Data stored as words in memory

Logical (boolean) data

False is always saved as a series of 0s - 00...00
But True can be saved in different ways:

C => 00..01
Most languages => any non-zero value
Visual Basic => 11..11

This is important for logical operations - overall and bit-wise
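Why the encoding of True matters can be shown on 8-bit words. A sketch in Python; since Python integers are unbounded, the 8-bit width is simulated with a mask, and the constant names are made up:

```python
MASK = 0xFF        # assumed word width: 8 bits

TRUE_C = 0x01      # 00..01
TRUE_VB = 0xFF     # 11..11
FALSE = 0x00


def bitwise_not(x: int) -> int:
    """Bit-wise NOT restricted to 8 bits."""
    return ~x & MASK


def logical_not(x: int) -> int:
    """Logical NOT: any non-zero value counts as True."""
    return 0 if x != 0 else 1
```

With True stored as 11..11, bit-wise NOT gives 00..00, so bit-wise and logical negation agree. With True stored as 00..01, bit-wise NOT gives 11..10, which is non-zero and therefore still true. Similarly, `0x01 & 0x02` is 0 although both operands are logically true.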

Unsigned integers

BCD (binary-coded decimal) - used for fixed point decimal numbers
Decimal digits encoded in binary - 4 bits (nibble, tetrad) per digit
Allowed nibble values - 0...9

Formats:

packed - 2 digits per octet, octet value range 0...99
unpacked (ASCII) - one digit per octet, value range 0...9
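The two formats can be sketched as follows. This is a minimal Python illustration for non-negative integers, storing the most significant digits first (an assumption; the notes do not fix the digit order):

```python
def to_packed_bcd(value: int) -> bytes:
    """Two decimal digits per octet, high nibble = more significant digit."""
    if value < 0:
        raise ValueError("only non-negative values are handled here")
    digits = str(value)
    if len(digits) % 2:
        digits = "0" + digits          # pad to an even number of digits
    return bytes(
        (int(digits[i]) << 4) | int(digits[i + 1])
        for i in range(0, len(digits), 2)
    )


def from_packed_bcd(data: bytes) -> int:
    result = 0
    for octet in data:
        hi, lo = octet >> 4, octet & 0x0F
        if hi > 9 or lo > 9:
            raise ValueError("nibble outside the allowed range 0...9")
        result = result * 100 + hi * 10 + lo
    return result


def to_unpacked_bcd(value: int) -> bytes:
    """One decimal digit per octet, each octet in the range 0...9."""
    return bytes(int(d) for d in str(value))
```

Written in hexadecimal, a packed BCD octet reads the same as the decimal digits it holds: 1234 is stored as 0x12 0x34.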

Signed integer representations

How to correctly interpret a negative number?
The interpretation must remain consistent under the negation operation.

Representation of zero
Ones' complement and sign-magnitude have two representations of zero

Ones' complement - 11...11 and 00...00
Sign-magnitude - 10...00 and 00...00

Negate operations
Ones' complement - ~x
Two's complement - ~x + 1
Sign-magnitude - negate sign bit
Biased - 2·BIAS - x for a stored code x (the code of value -v is BIAS - v)
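The negate operations can be tried out on 8-bit codes. A sketch in Python, with the word width and BIAS = 127 chosen as assumptions. All functions take and return the stored code, so for the biased form a code c = v + BIAS is negated to 2·BIAS - c, which equals BIAS - v for the value v:

```python
BITS = 8                          # assumed word width
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)
BIAS = (1 << (BITS - 1)) - 1      # assumed bias: 127


def neg_ones(code: int) -> int:
    return ~code & MASK           # invert all bits


def neg_twos(code: int) -> int:
    return (~code + 1) & MASK     # invert all bits, then add one


def neg_sign_magnitude(code: int) -> int:
    return code ^ SIGN            # flip only the sign bit


def neg_biased(code: int) -> int:
    return (2 * BIAS - code) & MASK
```

Negating the code of zero shows the double-zero problem: ones' complement gives 11...11, sign-magnitude gives 10...00, while two's complement returns 00...00 again.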

Fixed point notation

Obtained by scaling bit weights of integer notation by $2^{-f}$, where f - no. of bits in fractional part
Used for 2's complement and unsigned numbers
Commonly used formats

One or two bits of integer part, rest as fractional
Half of the word as integer, rest as fraction
Arithmetic operations similar to integer operations
Scaling (shifting) required during multiplication and division
Do not require special instructions or hardware structures
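A small sketch of fixed point arithmetic done with ordinary integer operations. The format with 8 fractional bits is an arbitrary choice, the function names are illustrative, and the division shown is meant for non-negative operands only:

```python
FRAC_BITS = 8                  # assumed: f = 8 fractional bits
SCALE = 1 << FRAC_BITS         # weight of the integer 1 in this format


def to_fixed(x: float) -> int:
    return round(x * SCALE)


def to_float(q: int) -> float:
    return q / SCALE


def fx_add(a: int, b: int) -> int:
    return a + b                       # same scale, plain integer addition


def fx_mul(a: int, b: int) -> int:
    return (a * b) >> FRAC_BITS        # product has 2f fractional bits


def fx_div(a: int, b: int) -> int:
    return (a << FRAC_BITS) // b       # pre-scale dividend to keep f bits
```

For example, 1.5 is stored as 384 and 2.25 as 576; their integer product 221184 shifted right by 8 bits is 864, which is the code of 3.375.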

Binary floating point

Examples:
- $1.234 \cdot 10^{5}$
- $0.1234 \cdot 10^{6}$
- $12.34 \cdot 10^{4}$
Elements:

Sign
Significand - unsigned fixed point number
Exponent (signed integer) - the system base (here 10) is fixed
Normalized form - integer part of significand is expressed using single non-zero digit

Binary floating point notation

If possible, numbers are stored in normalized form
Exponent field special values:
00...00 - zero or denormalized form
11...11 - infinity or Not-a-Number (NaN)
Exponent is stored in e-bit wide field as biased value
BIAS = $2^{e - 1} - 1$
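The fields can be inspected with the standard `struct` module. The sketch below assumes the IEEE 754 single format (1 sign bit, e = 8 exponent bits, 23 fraction bits), in which the all-zeros exponent also encodes zero and the all-ones exponent also encodes infinity; `decompose` and `classify` are illustrative names:

```python
import struct

EXP_BITS = 8                          # assumed: IEEE 754 single precision
FRAC_BITS = 23
BIAS = (1 << (EXP_BITS - 1)) - 1      # 2^(e-1) - 1 = 127
EXP_MAX = (1 << EXP_BITS) - 1         # exponent field 11...11


def decompose(x: float):
    """Return (sign, biased exponent field, fraction field) of x as a single."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> FRAC_BITS) & EXP_MAX
    fraction = bits & ((1 << FRAC_BITS) - 1)
    return sign, exponent, fraction


def classify(x: float) -> str:
    _, exponent, fraction = decompose(x)
    if exponent == 0:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == EXP_MAX:
        return "infinity" if fraction == 0 else "NaN"
    return "normalized"
```

For 1.0 the exponent field holds 127, that is the true exponent 0 plus BIAS, and the fraction field is zero because the leading 1 of a normalized significand is not stored.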

Common formats

Floating point arithmetic

Floating point is an approximation, so the results of arithmetic operations are also approximate.

Result may depend on order of operations

Addition and subtraction should be performed in order of growing magnitude
if a << b, then a+b may be equal to b

Equality test often fails for computed values that are mathematically equal

Use $|a - b| < \varepsilon$ instead

Precision of IEEE single (24 bits) is smaller than that of 32-bit integer or fixed point
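These effects are easy to reproduce. A sketch in Python, whose `float` is an IEEE double; the single-precision case is simulated by a round trip through `struct`. The helper names and the tolerance 1e-9 are arbitrary choices, and an absolute tolerance like this only suits values of moderate magnitude:

```python
import struct


def naive_sum(values):
    """Add values strictly left to right, rounding after each step."""
    total = 0.0
    for v in values:
        total += v
    return total


def nearly_equal(a: float, b: float, eps: float = 1e-9) -> bool:
    """Tolerance test |a - b| < eps used instead of ==."""
    return abs(a - b) < eps


def to_single(x: float) -> float:
    """Round x to IEEE single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


values = [1e16] + [1.0] * 10

print(0.1 + 0.2 == 0.3)            # equality test fails
print(nearly_equal(0.1 + 0.2, 0.3))
print(naive_sum(values))           # small terms are lost one by one
print(naive_sum(sorted(values)))   # growing magnitude keeps them
```

Starting from 1e16, every added 1.0 is rounded away, while adding the ten small terms first preserves their total. The integer 16777217 = $2^{24} + 1$ fits in a 32-bit integer but `to_single(16777217.0)` returns 16777216.0.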

Memory organization

In general-purpose computers, 8-bit byte is the smallest addressable unit of memory
Multibyte data occupy the appropriate number of consecutive byte-sized memory locations

Multibyte data addressing

Little Endian - least significant byte of data word at the lowest address
Big Endian - most significant byte at the lowest address

Little Endian

Byte numbering corresponds to bit weights
Natural for computers, not for humans
Type casting preserves pointer value

Big Endian

Natural for humans
Type cast changes the pointer value
Fast string compare is possible
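Both byte orders can be compared on a 32-bit value. In the Python sketch below a `bytes` object stands for memory, with index 0 as the lowest address; `narrow_at_same_address` is a made-up helper that imitates reading a shorter type through the same pointer:

```python
value = 0x12345678

little = value.to_bytes(4, "little")   # 78 56 34 12
big = value.to_bytes(4, "big")         # 12 34 56 78


def narrow_at_same_address(buf: bytes, width: int, order: str) -> int:
    """Read only the first `width` bytes, as a cast to a shorter type would."""
    return int.from_bytes(buf[:width], order)


# Little Endian: the same address still yields the low-order part.
low16_le = narrow_at_same_address(little, 2, "little")   # 0x5678

# Big Endian: the same address yields the high-order part instead, so a
# cast that should keep the low part has to move the pointer.
high16_be = narrow_at_same_address(big, 2, "big")        # 0x1234
low16_be = int.from_bytes(big[2:], "big")                # 0x5678
```

The string compare remark can be checked the same way: interpreting `b"abcd"` and `b"abce"` as Big Endian words gives integers that compare in the same order as the strings, so several characters can be compared at once.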

Logical vs. physical memory organization

Data alignment

Physically the memory is organized as vector of words, word being vector of bytes
Any number of bytes within a single word may be accessed at the same time
Memory word is usually at least as long as the processor's word
The access to multibyte data in a single memory word is faster than in case of it being split into 2 memory words (2 addresses are needed)
Newer architectures enforce placement of every data item ensuring the fastest possible access

Size alignment

For the fastest access regardless of the physical memory width, each item is placed in memory at the address divisible by its length (rounded up to $2^{n}$ )
Newer architectures enforce size alignment, as it boosts efficiency
Each data type is characterized by two values:

Alignment - data starting address must be a multiple of alignment (_Alignof() in C)
Size - must be a multiple of alignment (sizeof() in C)
in modern implementations, size of a basic type is always $2^{n}$ and equal to its alignment
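Rounding an address up to the required alignment is a common small computation. A sketch in Python, assuming the alignment is a power of two (the function names are illustrative):

```python
def is_aligned(address: int, alignment: int) -> bool:
    """True if address is a multiple of alignment."""
    return address % alignment == 0


def align_up(address: int, alignment: int) -> int:
    """Smallest multiple of alignment that is not below address."""
    if alignment <= 0 or alignment & (alignment - 1):
        raise ValueError("alignment must be a power of two")
    # Adding alignment - 1 and clearing the low bits rounds up.
    return (address + alignment - 1) & ~(alignment - 1)
```

For example, a 4-byte item that would otherwise start at address 13 is moved to address 16.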

Vectors and arrays

The elements of vector occupy the consecutive memory locations, starting with the first element
Vector alignment is the same as its element alignment

Structures

Compiler is required to preserve the order of fields from the declaration. It cannot optimize the layout of a struct
Each field must be aligned according to its type requirements
Padding (unused bytes) is added between fields to achieve proper alignment
Structure must be aligned according to the alignment requirement of the field with the biggest alignment
Structure is padded to the multiple of its alignment
In C - sizeof() returns the offset between two structures of a given type, necessary for allocating a vector of structs
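The layout rules above can be followed step by step. The sketch below is a simplified model in Python, not the algorithm of any particular compiler; fields are given as hypothetical (name, size, alignment) triples:

```python
def round_up(offset: int, alignment: int) -> int:
    return (offset + alignment - 1) // alignment * alignment


def layout(fields):
    """Return (field offsets, structure size, structure alignment)."""
    offset = 0
    struct_align = 1
    offsets = {}
    for name, size, align in fields:       # declaration order is preserved
        offset = round_up(offset, align)   # padding before the field
        offsets[name] = offset
        offset += size
        struct_align = max(struct_align, align)
    total = round_up(offset, struct_align) # tail padding
    return offsets, total, struct_align


# Hypothetical struct: 1-byte tag, 4-byte value, 1-byte flag.
as_declared = [("tag", 1, 1), ("value", 4, 4), ("flag", 1, 1)]
reordered = [("value", 4, 4), ("tag", 1, 1), ("flag", 1, 1)]

print(layout(as_declared))   # offsets 0, 4, 8; size 12; alignment 4
print(layout(reordered))     # offsets 0, 4, 5; size 8; alignment 4
```

Since the compiler must keep the declared order, it is up to the programmer to order the fields so that padding is reduced; here the reordered variant takes 8 bytes instead of 12.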

x86 AVX vector unit

AVX is a vector unit of x86 processors
16 256-bit registers
AVX2 supports 256-bit integer vectors
New extension (AVX-512) supports 512-bit vectors