Project

Write a series of small programs using x86 assembly
You can call printf and scanf from C standard library, but nothing else

Basic information

Integer registers

In the x86 CPU architecture, there are 8 general-purpose registers:
- AX: accumulator, used in arithmetic operations
- BX: base, pointer to data
- CX: counter, used in loops
- DX: data, used in arithmetic operations and I/O
- SI: source in string operations
- DI: destination in string operations
- SP: stack pointer
- BP: base pointer, used for local variables
Under these names you can access 16 bits
32 bits are accessible with prefix E e.g. EAX
64 bits are accessible with prefix R e.g. RAX
In 64-bit mode, there are also 8 additional registers: R8, R9, …, R15
For these additional registers, to access 32 bits you can use suffix D e.g. R8D and for 16 bits you can use suffix W e.g. R8W
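The way the shorter names alias the lower part of a 64-bit register can be modelled in C with masks. This is only a sketch: the helper names (read_ax, read_eax, write_ax, write_eax) are hypothetical and exist only for this illustration. It also shows one rule of 64-bit mode that is easy to miss: writing a 32-bit register clears the upper 32 bits of the full register, while writing a 16-bit register leaves the remaining bits untouched.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of a 64-bit register value (what AX sees of RAX). */
uint16_t read_ax(uint64_t rax) {
    return (uint16_t)(rax & 0xFFFFu);
}

/* Low 32 bits of a 64-bit register value (what EAX sees of RAX). */
uint32_t read_eax(uint64_t rax) {
    return (uint32_t)(rax & 0xFFFFFFFFu);
}

/* Model of `mov ax, value`: only bits 0..15 change. */
uint64_t write_ax(uint64_t rax, uint16_t value) {
    return (rax & ~(uint64_t)0xFFFF) | value;
}

/* Model of `mov eax, value` in 64-bit mode: bits 32..63 become zero. */
uint64_t write_eax(uint64_t rax, uint32_t value) {
    (void)rax;  /* the old contents do not survive the write */
    return (uint64_t)value;
}
```

For example, starting from RAX = 0x1122334455667788, the model of mov ax, 0xABCD gives 0x112233445566ABCD, while the model of mov eax, 0xDEADBEEF gives 0x00000000DEADBEEF.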

Floating-point registers

Floating point numbers are handled differently:
- 1980: FPU (Floating-Point Unit) is introduced – a dedicated coprocessor for floating-point arithmetic
- 1989: FPU gets integrated within CPU
- 1999: SSE (Streaming SIMD Extensions) is introduced with 8 registers XMM0, …, XMM7, each 128-bit wide
- 2011: AVX (Advanced Vector Extensions) is introduced which provides 16 registers YMM0, …, YMM15, each 256-bit wide (the XMMn remains an alias for the lower part of YMMn)
- 2017: AVX-512 is introduced and provides 32 registers ZMM0, …, ZMM31, each 512-bit wide (as above, YMMn and XMMn are aliases for lower parts of ZMMn)
The XMM registers are 128-bit wide and can hold either:
- 16x 8-bit integers
- 8x 16-bit integers
- 4x 32-bit integers
- 2x 64-bit integers
- 4x 32-bit single-precision floats
- 2x 64-bit double-precision floats
For YMM registers these values are doubled and for ZMM they are quadrupled (e.g. ZMM can hold 8x 64-bit double-precision floats)
The above list of types works like a union in C
- For example, XMM0 might hold these 16 bytes (most significant byte first):
```
3F F0 00 00 00 00 00 00 3F F0 00 00 00 00 00 00
```
- If you treat it as 2x 64-bit floats, then it is equal to (1.0, 1.0)
- If you treat it as 8x 16-bit integers, then it is equal to (16368, 0, 0, 0, 16368, 0, 0, 0)
- Here “treating” means “using dedicated instructions”, so there isn’t one instruction for adding XMM registers, but rather a dedicated instruction for adding 2x 64-bit floats and another dedicated instruction for adding 8x 16-bit integers
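The union-like behaviour can be tried out in plain C on one 64-bit half of such a register. The sketch below takes 0x3FF0000000000000, the bit pattern of the double 1.0, and reads it back either as a double or as 16-bit lanes. The function names are hypothetical, and the code assumes the platform stores double in IEEE 754 binary64 format (true on x86-64).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Interpret a 64-bit pattern as a double, the way a "double" instruction
   would see it. memcpy is the well-defined way to reinterpret bits in C. */
double lane_as_double(uint64_t bits) {
    double value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

/* Extract 16-bit lane `index` (0 = least significant) from the same
   pattern, the way a "16-bit integer" instruction would see it. */
uint16_t lane_as_u16(uint64_t bits, unsigned index) {
    assert(index < 4);
    return (uint16_t)(bits >> (16 * index));
}
```

With this pattern, lane_as_double returns 1.0, while lane_as_u16 returns 0x3FF0 (16368) for the most significant lane (index 3) and 0 for the other three. The bits are the same in both cases; only the interpretation differs.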
Recommended reading: https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/
In 2012 Intel introduced the F16C extension to the instruction set, which can convert half-precision floats (using 16 bits) to single-precision and back

General remarks

Writing assembly code is an art of composing instructions one after another
Instructions usually perform a single operation on a register, a constant or a memory location
Other instructions control the flow of the program - they jump to other locations in the code
Most useful instructions:
- mov destination, source: copy the source into the destination
- add destination, operand and sub destination, operand: add or subtract the operand
- lea destination, [address]: load the address of a memory location
- call function and ret: call a function or return from one
- cmp op1, op2: compare two values
- jmp label: jump unconditionally
- jCC label: jump conditionally, depending on the condition code CC:
  - jnz is jump if not zero
  - jz is jump if zero
  - jg is jump if greater
  - jl is jump if less
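To see how cmp, a conditional jump and jmp combine into a loop, it can help to write the loop in C with goto, so that every C line corresponds to roughly one instruction. This is a sketch only: the function name sum_up_to and the register assignment in the comments (rdi for n, rcx for the counter, rax for the total) are choices made for this illustration.

```c
#include <assert.h>

/* Sum of 1..n, written so that each line maps to one instruction. */
long sum_up_to(long n) {
    long total = 0;           /* mov rax, 0                 */
    long i = 1;               /* mov rcx, 1                 */
check:
    if (i > n) goto done;     /* cmp rcx, rdi  then jg done */
    total += i;               /* add rax, rcx               */
    i += 1;                   /* add rcx, 1                 */
    goto check;               /* jmp check                  */
done:
    return total;             /* ret (result in rax)        */
}
```

Calling sum_up_to(10) returns 55. The conditional jump always reads the flags set by the cmp directly before it, which is why the two instructions appear as a pair.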
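In NASM, instruction mnemonics and register names are case insensitive, so you can use either form: RAX or rax, CMP or cmp, etc. Labels and other names that you define yourself are case sensitive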
Instruction reference: https://www.felixcloutier.com/x86/

Compiling

I recommend using NASM

You can use NASM with the following Makefile: (please download the file and not copy-paste from here, because Makefile syntax is sensitive to tabs vs spaces and copying from HTML page might break it)

inputs    = $(wildcard *.asm)
objects   = $(patsubst %.asm,%.o,$(inputs))
outputs   = $(patsubst %.asm,%,$(inputs))

all: $(outputs)

clean:
	$(RM) $(objects) $(outputs)

%.o: %.asm
	nasm -f elf64 $^

%: %.o
	cc -o $@ $^

This Makefile will:
- Grab all files with extension .asm in the current directory
- Execute nasm -f elf64 on them to produce object files with extension .o
- Execute cc to link object files with C standard library to produce the final executable
To use it, write your assembly code in any text editor and save the file with the .asm extension in the same directory as the Makefile. Then execute make in a console and run the generated executable (or read the error message and fix the code)

Debugging

I can recommend edb for debugging:
- Start your code with edb --run ./program
- Press F9 to run the program and stop in main function
- Now you can press F7 and F8 to step into / step over
- Panel on the right shows values of registers and flags
- Be sure to use the console window which appears next to the EDB one. When you step over scanf, the debugger will “freeze”, because the process is waiting for your input in the console

Examples

Hello world

bits    64
default rel

global  main

extern  printf

section .data
    format      db 'Hello world!', 0xA, 0

section .bss

section .text
    main:
        sub     rsp, 8

        lea     rdi, [format]
        mov     al, 0
        call    printf wrt ..plt

        add     rsp, 8
        sub     rax, rax
        ret

The bits directive tells the assembler which mode to generate code for (here, 64-bit)
default rel makes the assembler use relative (RIP-relative) addressing for memory operands by default
- In 16- and 32-bit architectures the usual addressing mode was absolute
- You used syntax mov reg, address, and the address was a constant 16- or 32-bit number
- With 64 bits typically the libraries and executables are compiled in position-independent mode, which requires relative addressing
- Instead of treating addresses as absolute 64-bit numbers, the assembler/compiler encodes each address as an offset from the address of the current instruction
- Because position-independent code can be loaded at any location in memory, your executable does not know the addresses (absolute or relative) of external functions (e.g. printf) until it is loaded
- In NASM if you write call printf wrt ..plt, it means that the call will actually jump to PLT (Procedure Linkage Table)
- The PLT contains a lazy lookup routine i.e. it will take the address of the function from GOT (Global Offset Table) if known or it will load it into GOT during the first call
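The lazy lookup can be modelled in C with a function pointer that initially points at a resolver and is patched during the first call. This is a simplified sketch of the idea, not of what a real dynamic linker does in detail; all names (got_slot, lazy_stub, call_via_plt, real_function, resolve_count) are hypothetical.

```c
#include <assert.h>

typedef int (*func_ptr)(int);

int resolve_count = 0;          /* how many times the lookup has run */

/* Stands in for the external function (e.g. printf). */
int real_function(int x) {
    return x + 1;
}

int lazy_stub(int x);           /* forward declaration */

/* The "GOT slot": points at the stub first, at the real function later. */
func_ptr got_slot = lazy_stub;

/* The first call lands here: resolve, patch the slot, forward the call. */
int lazy_stub(int x) {
    resolve_count++;
    got_slot = real_function;
    return got_slot(x);
}

/* The "PLT entry": always jumps through the GOT slot. */
int call_via_plt(int x) {
    return got_slot(x);
}
```

The first call_via_plt goes through lazy_stub and increments resolve_count; every later call reaches real_function directly, so the lookup cost is paid only once.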
global is used to mark a label in code as exported to the outside (here we mark main as global so that the linker can find it when the C startup code calls main)
extern is used to mark a symbol as one that will be resolved from external library
There are three main sections:
- section .data is where you put initialised data
- section .bss is where you declare uninitialised data
- section .text is where you put code

Initialised data is declared as:

_8bit       db 1, 1011b         ; char[2] = { 1, 0xB }
_16bit      dw 2, 755o          ; short[2] = { 2, 0755 }
_32bit      dd 3, 7Fh, 0.125    ; int[2] = { 3, 0x7F }, float[1] = { 0.125f }
_64bit      dq 4, 0x7F, 0.25    ; long[2] = { 4, 0x7F }, double[1] = { 0.25 }

Uninitialised data is reserved by providing number of elements required:

_8bit       resb 1              ; char[1]
_16bit      resw 2              ; short[2]
_32bit      resd 3              ; int[3] or float[3]
_64bit      resq 4              ; long[4] or double[4]

If the name of a data label is used as it is, then it corresponds to the address of the variable. To refer to the contents of that memory address, you need to put the name of the data label between brackets:

lea     rax, [variable]     ; load effective address of `variable` into rax register
mov     rax, [variable]     ; read contents of `variable` and store it in rax register
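For readers coming from C, the difference between the two lines above is the difference between taking the address of a variable and reading its value. A minimal sketch, with hypothetical function names and an arbitrary initial value:

```c
#include <assert.h>
#include <stdint.h>

int64_t variable = 42;

/* Counterpart of `lea rax, [variable]`: the result is the address. */
int64_t *address_of_variable(void) {
    return &variable;
}

/* Counterpart of `mov rax, [variable]`: the result is the stored value. */
int64_t contents_of_variable(void) {
    return variable;
}
```

Here contents_of_variable() returns 42, while address_of_variable() returns a pointer that has to be dereferenced to get the same value.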

NASM uses Intel’s syntax, which is instruction destination, source e.g. mov rax, 7 means “set value 7 to rax register”
In 64-bit Linux applications, the calling convention is defined by System V AMD64 ABI:

The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, R9 (…), while XMM0, XMM1, XMM2, XMM3, XMM4, XMM5, XMM6 and XMM7 are used for the first floating point arguments. (…) Integer return values up to 64 bits in size are stored in RAX while values up to 128 bit are stored in RAX and RDX. Floating-point return values are similarly stored in XMM0 and XMM1.

If the callee is a variadic function, then the number of floating point arguments passed to the function in vector registers must be provided by the caller in the AL register.
In addition, the stack must be aligned to 16 bytes at the moment of a call (the RSP register must be divisible by 16 without remainder). Because the call instruction pushes an 8-byte return address onto the stack, RSP is off by 8 when main starts, so at the beginning of the example we use sub rsp, 8 to restore the alignment
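The alignment arithmetic can be checked with a few lines of C. This is a sketch with hypothetical helper names; the stack pointer values are treated as plain numbers.

```c
#include <assert.h>
#include <stdint.h>

/* True when a stack pointer value satisfies the 16-byte alignment rule. */
int is_aligned_16(uint64_t rsp) {
    return (rsp % 16) == 0;
}

/* Number of bytes to subtract from rsp so that it becomes a multiple
   of 16 (the stack grows downwards, so alignment is restored by
   subtracting, never by adding). */
uint64_t padding_to_align_16(uint64_t rsp) {
    return rsp % 16;
}
```

If RSP was 0x7FFFFFFFE000 just before the call to main, it is 0x7FFFFFFFDFF8 on entry; padding_to_align_16 then returns 8, which matches the sub rsp, 8 in the example.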

Reading integers

bits    64
default rel
global  main
extern  scanf

section .data
    format_3x_d     db '%d %d %d', 0

section .bss
    array_int       resd 3

section .text
    main:
        sub     rsp, 8

        lea     rcx, [array_int + 8]    ; 4th integer/pointer argument
        lea     rdx, [array_int + 4]    ; 3rd integer/pointer argument
        lea     rsi, [array_int + 0]    ; 2nd integer/pointer argument
        lea     rdi, [format_3x_d]      ; 1st integer/pointer argument
        mov     al, 0                   ; no floating-point arguments
        call    scanf wrt ..plt

        add     rsp, 8
        sub     rax, rax
        ret

The screenshot shows the result of writing 10 20 30 on the terminal
The arguments were passed in rdi, rsi, rdx and rcx registers
Their values are shown below in colors green, yellow, red and purple respectively
In the data dump we can see the format string and 0xA, 0x14, 0x1E values (the 10, 20 and 30 as hexadecimal values)
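For comparison, a C counterpart of this listing could look as follows. It is a sketch: sscanf replaces scanf only so that the input can be supplied as a string, and read_three is a hypothetical name.

```c
#include <assert.h>
#include <stdio.h>

int array_int[3];   /* counterpart of `array_int resd 3` */

/* Parse three integers and return the number of values stored. */
int read_three(const char *input) {
    return sscanf(input, "%d %d %d",
                  &array_int[0],    /* array_int + 0 in the listing */
                  &array_int[1],    /* array_int + 4 in the listing */
                  &array_int[2]);   /* array_int + 8 in the listing */
}
```

The byte offsets 0, 4 and 8 in the assembly version are what the C compiler computes for &array_int[0], &array_int[1] and &array_int[2] when int is 4 bytes wide.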

Reading floating-point numbers

bits    64
default rel
global  main
extern  scanf

section .data
    format_3x_lf    db '%lf %lf %lf', 0

section .bss
    array_double    resq 3

section .text
    main:
        sub     rsp, 8

        lea     rcx, [array_double + 16]    ; 4th integer/pointer argument
        lea     rdx, [array_double + 8]     ; 3rd integer/pointer argument
        lea     rsi, [array_double + 0]     ; 2nd integer/pointer argument
        lea     rdi, [format_3x_lf]         ; 1st integer/pointer argument
        mov     al, 0                       ; no floating-point arguments
        call    scanf wrt ..plt

        add     rsp, 8
        sub     rax, rax
        ret

The screenshot shows the result of writing 1.0 2.0 3.0 on the terminal
The arguments were passed in rdi, rsi, rdx and rcx registers
Their values are shown below in colors green, yellow, red and purple respectively
The al register indicates that there are no floating-point arguments
In the data dump we can see the format string and 0x3FF0000000000000, 0x4000000000000000, 0x4008000000000000 values
The first one in double precision is (-1)^0 \cdot 2^0 \cdot 1.0 = 1.0
The second one in double precision is (-1)^0 \cdot 2^1 \cdot 1.0 = 2.0
The third one in double precision is (-1)^0 \cdot 2^1 \cdot 1.5 = 3.0
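The three decompositions above can be reproduced by splitting the 64-bit pattern into its sign bit, 11-bit exponent and 52-bit fraction. The following sketch handles only normal numbers (no zero, subnormals, infinities or NaN); the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of an IEEE 754 double-precision number (normal values only). */
struct double_fields {
    int sign;            /* 0 or 1                             */
    int exponent;        /* unbiased: stored exponent - 1023   */
    double significand;  /* 1.fraction, in the range [1, 2)    */
};

struct double_fields decode_double(uint64_t bits) {
    struct double_fields f;
    uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;   /* low 52 bits */
    f.sign = (int)(bits >> 63);
    f.exponent = (int)((bits >> 52) & 0x7FF) - 1023;
    /* 4503599627370496 is 2^52, the weight of the implicit leading 1. */
    f.significand = 1.0 + (double)fraction / 4503599627370496.0;
    return f;
}
```

For 0x4008000000000000 this gives sign 0, exponent 1 and significand 1.5, i.e. the value 3.0.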

Writing floating-point numbers

bits    64
default rel
global  main
extern  printf

section .data
    format_3x_f     db '%f %f %f', 0xA, 0
    align 16
    data            dq 1.0, 2.0, 3.0

section .text
    main:
        sub     rsp, 8

        movlpd  xmm2, [data + 16]       ; 3rd floating-point argument
        movlpd  xmm1, [data + 8]        ; 2nd floating-point argument
        movlpd  xmm0, [data]            ; 1st floating-point argument
        lea     rdi, [format_3x_f]      ; 1st integer/pointer argument
        mov     al, 3                   ; three floating-point arguments
        call    printf wrt ..plt

        add     rsp, 8
        sub     rax, rax
        ret

The screenshot shows the moment just before calling printf
The floating-point values are read from the data array using movlpd instruction (it loads one double-precision value and stores it in the lower half of an XMM register)
The integer/pointer argument is in rdi, while the floating-point arguments are passed in xmm0, xmm1 and xmm2
Their values are shown below in colors green, yellow, red and purple respectively
The al register indicates that there are three floating-point registers in use
In the data dump we can see the format string and 0x3FF0000000000000, 0x4000000000000000, 0x4008000000000000 values
In the register window, we can see correct values in XMM registers (the upper half of XMM1 contains some trash data, but we are using only the lower half of the register)
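A C counterpart of this listing is sketched below. snprintf replaces printf only so that the result can be inspected; format_three is a hypothetical name. In a variadic call the %f conversion consumes a double, which is why the data is declared with dq and loaded as double-precision values.

```c
#include <assert.h>
#include <stdio.h>

/* Format three doubles the way the listing does with printf.
   Returns the number of characters written, excluding the final NUL. */
int format_three(char *buffer, size_t size) {
    double data[3] = { 1.0, 2.0, 3.0 };   /* counterpart of `data dq ...` */
    return snprintf(buffer, size, "%f %f %f\n", data[0], data[1], data[2]);
}
```

The three double arguments here are the three values that the assembly version places in XMM registers and counts in al.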

Comparing floating point numbers

bits    64

default rel

global  main

extern  scanf
extern  printf

section .data
    format_2x_lf            db '%lf %lf', 0
    less_than_str           db '%lf is less than %lf', 0xA, 0
    greater_or_equal_str    db '%lf is greater or equal to %lf', 0xA, 0

section .bss
    array_double    resq 2

section .text
    main:
        sub     rsp, 8

        lea     rdx, [array_double + 8]     ; 3rd integer/pointer argument
        lea     rsi, [array_double + 0]     ; 2nd integer/pointer argument
        lea     rdi, [format_2x_lf]         ; 1st integer/pointer argument
        mov     al, 0                       ; no floating-point arguments
        call    scanf wrt ..plt

        movlpd  xmm0, [array_double]        ; load first value to lower half of xmm0

        movlpd  xmm1, [array_double + 8]    ; load second value to lower half of xmm1
        cmpltsd xmm0, xmm1                  ; check if lower half of xmm0 is LESS-THAN lower half of xmm1
        movq    rax, xmm0                   ; copy the comparison result to RAX
        cmp     rax, 0
        jz      greater_or_equal

    less_than:
        lea     rdi, [less_than_str]
        jmp     print_message

    greater_or_equal:
        lea     rdi, [greater_or_equal_str]

    print_message:
        movlpd  xmm1, [array_double + 8]
        movlpd  xmm0, [array_double]
        mov     al, 2
        call    printf wrt ..plt

        add     rsp, 8
        sub     rax, rax
        ret

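The cmpsd family of comparisons used here (cmpltsd, cmpeqsd, etc.) does not change the flags register, so conditional jumps will not work directly after it (the separate comisd and ucomisd instructions do set the flags)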
The comparison changes the bit pattern of the destination register to either all 1s (if the condition is true) or all 0s (otherwise)
Here, the name of the cmpltsd instruction is composed of:
- ...lt.. means “less than” (other possibilities: ...eq.., ...le.., etc.)
- .....sd means “scalar double” i.e. take only 1x double in the lower half of the register (other possibility: .....pd for “packed double”)
If the result of cmpltsd is true, then the lower half of the register will be set to 0xFFFFFFFFFFFFFFFF, otherwise it will be set to 0x0
movq will copy this value to rax which gets compared with 0 to know the actual comparison result of the floating-point numbers
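The mask-producing behaviour can be modelled in C as follows. This is a sketch with hypothetical function names; it mirrors the scalar case only.

```c
#include <assert.h>
#include <stdint.h>

/* Model of `cmpltsd a, b`: all ones when a < b, all zeros otherwise.
   A comparison that involves NaN is false, as with the instruction. */
uint64_t cmplt_mask(double a, double b) {
    return (a < b) ? UINT64_MAX : 0;
}

/* The test performed by `movq rax, xmm0`, `cmp rax, 0` and `jz`:
   a zero mask means "greater or equal", anything else "less than". */
int is_less_than(double a, double b) {
    return cmplt_mask(a, b) != 0;
}
```

For example, cmplt_mask(1.0, 2.0) is 0xFFFFFFFFFFFFFFFF, while cmplt_mask(2.0, 1.0) and cmplt_mask(1.0, 1.0) are both 0.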

Project specification

Requirement for 3.0

Write your own version of strcpy function

It should be an equivalent of:

#include <stdio.h>

int main() {
    char input[1024];
    char output[1024];

    scanf("%s", input);

    char *src = input;
    char *dst = output;
    while ((*dst = *src) != '\0') {
        src++;
        dst++;
    }

    printf("%s\n", output);
    return 0;
}

Useful links:
- REP/REPE/REPZ/REPNE/REPNZ — Repeat String Operation Prefix
- MOVS/MOVSB/MOVSW/MOVSD/MOVSQ — Move Data from String to String
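The two instructions listed above are typically combined as rep movsb, which copies RCX bytes from the address in RSI to the address in RDI. A C model of that behaviour is sketched below, assuming the direction flag is cleared so that both pointers move forward; the function names are hypothetical. It is one possible approach, not the required one.

```c
#include <assert.h>
#include <stddef.h>

/* Model of `rep movsb` with the direction flag cleared: copy `rcx`
   bytes from `rsi` to `rdi`, advancing both pointers after each byte. */
void rep_movsb(char *rdi, const char *rsi, size_t rcx) {
    while (rcx != 0) {
        *rdi++ = *rsi++;
        rcx--;
    }
}

/* rep movsb needs the byte count up front, so the terminator has to be
   located first; the returned count includes the '\0' itself. */
size_t length_with_terminator(const char *s) {
    size_t n = 0;
    while (s[n] != '\0') {
        n++;
    }
    return n + 1;
}
```

Copying "hello" this way means calling rep_movsb with a count of 6, so that the terminating '\0' is copied as well.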

Requirement for 3.5

Implement bubble sort in assembly

It should be an equivalent of:

#include <stdio.h>

int main() {
    int array[100];
    int n = 0;

    while (scanf("%d", &array[n]) == 1) {
        n++;
    }

    for (int i = 0; i < n; i++) {
        for (int j = n - 1; j > i; j--) {
            if (array[j] < array[j - 1]) {
                int tmp = array[j];
                array[j] = array[j - 1];
                array[j - 1] = tmp;
            }
        }
    }

    for (int i = 0; i < n; i++) {
        printf("%d ", array[i]);
    }

    return 0;
}

Useful links:
- LEA — Load Effective Address
- XCHG — Exchange Register/Memory with Register

Requirement for 4.0

Write a program which will calculate square root of a sequence of numbers with step 0.125

It should be an equivalent of: (but without calling sqrt() function)

#include <math.h>
#include <stdio.h>

int main() {
    double end;
    scanf("%lf", &end);

    for (double d = 0.0; d < end; d += 0.125) {
        printf("sqrt(%f) = %f\n", d, sqrt(d));
    }
    return 0;
}

Useful links:
- SQRTSD — Compute Square Root of Scalar Double-Precision Floating-Point Value
- CMPSD — Compare Scalar Double-Precision Floating-Point Value

Requirement for 4.5

The Maclaurin series for e^x is:
\sum_{k=0}^{\infty} \frac{x^k}{k!}
Write a program which will approximate e^x by truncating the series after the term with index k

It should be an equivalent of:

#include <stdio.h>

int main() {
    int k;
    double x;
    scanf("%i %lf", &k, &x);

    double series = 1;
    double numerator = 1;
    double denominator = 1;

    for (int i = 1; i <= k; i++) {
        numerator *= x;
        denominator *= i;
        series += numerator / denominator;
    }

    printf("e^x = %f\n", series);
    return 0;
}

Useful links:
- MULSD — Multiply Scalar Double-Precision Floating-Point Value
- DIVSD — Divide Scalar Double-Precision Floating-Point Value

Requirement for 5.0

Create the same variant as for 4.5, but make use of SIMD (Single Instruction Multiple Data) capabilities of SSE instructions
You can do it in one of two ways:
- Either compute several steps of the loop at once
- Or compute several, separate values at once (e.g. simultaneously compute e^x and e^{x+1})
Useful links:
- MULPD — Multiply Packed Double-Precision Floating-Point Values
- DIVPD — Divide Packed Double-Precision Floating-Point Values
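The second variant (several separate values at once) can be prototyped in C before writing the assembly. In the sketch below each two-element array stands for one XMM register holding two packed doubles, and each inner loop stands for one packed instruction acting on both lanes. The function name exp_two_lanes and the grouping into packed operations are assumptions made for this illustration, not a prescribed design.

```c
#include <assert.h>

/* Approximate e^x for two inputs at once, using terms 0..k of the series. */
void exp_two_lanes(int k, const double x[2], double result[2]) {
    double numerator[2]   = { 1.0, 1.0 };
    double denominator[2] = { 1.0, 1.0 };
    double series[2]      = { 1.0, 1.0 };

    for (int i = 1; i <= k; i++) {
        for (int lane = 0; lane < 2; lane++)    /* one mulpd */
            numerator[lane] *= x[lane];
        for (int lane = 0; lane < 2; lane++)    /* one mulpd */
            denominator[lane] *= i;
        for (int lane = 0; lane < 2; lane++)    /* one divpd, one addpd */
            series[lane] += numerator[lane] / denominator[lane];
    }
    result[0] = series[0];
    result[1] = series[1];
}
```

Calling it with x = {1.0, 2.0} and k = 20 yields approximations of e and e^2 in a single pass through the loop, which is the same work the scalar version would need two passes for.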