Zigtastic Async

Goal

The goal of this post is to give us some basic intuition about the machine code generated by the Zig compiler when we use asynchronous functions. We’ll start by digging into the simplest asynchronous function there is, slowly adding features like local variables, parameters, multi-threaded support, await and more.

The only tool I’ll use in this expedition is the Godbolt Compiler Explorer, and maybe an x86_64 instruction reference.

I’m going to assume some standard knowledge and mention concepts like registers, memory, stack frames and the heap without any elaboration. I’m also not going to into details about the syntax of Zig, C or intel x86_64 assembly. I don’t think knowing Zig is a preliminary for this blogpost, since the syntax is pretty straightforward in my opinion, but feel free to check the official Zig language documentation.

What the fuck is Zig

So. What is Zig? According to their site:

Zig is a general-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.

Sounds cool, doesn’t tell you a whole lot when you think about it. This is not a Zig tutorial or spotlight, so I won’t go into a long monologue about why I think it’s actually a great programming language, I’ll just mention the main features:

It’s a compiled language. That is, Zig source code is used to generate machine code that runs natively on the target machine. No VM, no runtime.
It has no hidden control flow: if something doesn’t look like a function call, it isn’t a function call.
It has four build modes: “debug”, “release-safe”, “release-fast” and “release-small”. In this blogpost, we’ll compile with “debug”.
It supports coroutines.

Why the fuck is Zig

The reason I chose Zig as the focus of my first few blogposts about asynchronous programming in compiled languages is its simplicity.

Zig supports very “low-level” features of coroutines, using only four keywords: async, suspend, resume and await. Essentially, it is exactly the “bare-minimum” of asynchronous programming. No hidden event loop, no yield, just “napping functions”. This is a great candidate for the first implementation to look into, since C++ for example supports generators and yields in its standard.

Environment

I’m going to use Godbolt’s Zig 0.6.0 compiler, and compile in debug and --single-threaded.

A simple program

Let’s start with simplest async program imaginable:

fn afoo() void {
    suspend;
}

export fn caller() void {
    var frame = async afoo();
}

We are going to go through the entire blob of machine-code generated by this small snippet. It might be a bit long, but I found it very educating. Afterwards, we’ll sum up our findings and list the questions that we didn’t answer yet.

We’ll compile it with --single-threaded, since there is a slight difference I saw when compiling without this flag, which might complicate things. Don’t worry, we’ll get back to that!

Stick this sucker in Godbolt, and let’s look at the output…

Click me for assembly and shit

Well. That’s a lot of code. We’ll go through it step by step.

Analysis

It starts with the standard prolog:

caller:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32

which means the frame variable is 32 bytes wide. We’ll write that down, and update it as we go.

struct afoo_frame {
    uint8_t unknown[32];
};

Godbolt tells us which line generates which instructions. So we see that the line:

var frame = async afoo();

generates:

        movabs  rax, offset afoo
        mov     qword ptr [rbp - 32], rax
        mov     qword ptr [rbp - 24], 0
        mov     qword ptr [rbp - 16], 0
        lea     rdi, [rbp - 32]
        mov     rsi, -3
        call    afoo

There are quite a few things going on here. We put the offset of afoo() in the first 8 bytes of the afoo_frame struct. We also put zeros in the 8th and 16th offsets of the structure. This makes us suspect that the structure is actually 24 bytes wide, and sub rsp, 32 was emitted to keep the stack 16-bytes aligned (or something of that sort). This makes us guess it looks a bit like this:

struct afoo_frame {
    fptr_t      func;       // The async function the frame holds
    uint64_t    unknown2;   // Initialized to 0
    uint64_t    unknown3;   // Initialized to 0
};

What’s the point of putting the function pointer in the frame? Remember, when we use async, we can continue a frame without knowing which function we’re calling. We’re continuing a frame, not a specific function. This means the frame has to hold, in some manner, the function we’re about to call.

Now, the actual call to afoo() looks like this:

afoo(
    &frame  // passed in %rdi%
    -3,     // passed in %rsi%
);

That’s odd. afoo() is supposed to be a void (*)(void) function, why is taking two parameters?

Well, the &frame parameter actually makes sense. Remember a function frame holds all of its parameters and arguments. Since coroutines are invoked asynchronously, they can’t push that data on the stack, cause stack might change when they are called again. They have to recieve a pointer to their frame, and trust their caller that this pointer is valid throughout all of their execution. My guess is that in a “real” asynchrounous program those frames will be heap allocated, to ensure the frame’s lifetime throughout their program. In our case, the compiler knows that this frame will only exist when caller() is running, so it’s making it stack allocated instead. We’ll come back to this topic later to check our assumption.

Now, the rsi parameter is a bit peculiar. I think it’s related to Zig’s safety guarentees or the fact we’re in debug mode. I think rsi is used as a parameter for the panic() function. This assumption is based on the fact that the only reference to it is in the panic assembly generated:

panic:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     qword ptr [rbp - 8], rsi
        call    zig_panic

With that assumption in mind, I’m going to rudely ignore it throughout this article (please forgive me).

Now, let’s take a look at the start of the afoo() function:

afoo:
        push    rbp
        mov     rbp, rsp
        sub     rsp, 32
        mov     rax, rdi
        add     rax, 16
        mov     rcx, rdi
        add     rcx, 8
        mov     rdx, qword ptr [rdi + 8]
        test    rdx, rdx
        mov     qword ptr [rbp - 8], rax
        mov     qword ptr [rbp - 16], rcx
        mov     qword ptr [rbp - 24], rdx
        je      .LBB2_1
        jmp     .LBB2_8

I trust you all to know your basic x86_64, so let’s zoom through this. We have the standard prolog, allocating 32 bytes on the stack. Remembering that &frame as passed in %rdi%, we understand that this code:

        mov     rax, rdi
        add     rax, 16
        mov     rcx, rdi
        add     rcx, 8
        mov     rdx, qword ptr [rdi + 8]
        ; -- snip --
        mov     qword ptr [rbp - 8], rax
        mov     qword ptr [rbp - 16], rcx
        mov     qword ptr [rbp - 24], rdx

Translates to something of this fashion:

void afoo(...) {
        uint64_t *unknown3_ptr = &frame_ptr->unknown3;
        uint64_t *unknown2_ptr = &frame_ptr->unknown2;
        uint64_t unknown2 = frame_ptr->unknown2;
        // -- snip --
}

Cool. Now, we have a have a branching:

        mov     rcx, rdi
        ; -- snip --
        mov     rdx, qword ptr [rdi + 8]
        test    rdx, rdx
        ; -- snip -- 
        je      .LBB2_1
        jmp     .LBB2_8

This suggests that frame.unknown2 is actually a branch identifier. I like to think of coroutines like this: every suspension point “splits” the function into two code blocks. The code following the suspension point and the code preceeding it. When resumeing the frame, we have to know which block it has to continue from. Our current assumption is that Zig acheives this using a branch identifier saved in the frame, telling it where to resume next.

Lets rewrite our structure according to this assumption:

struct afoo_frame {
    fptr_t      func;       // The async function the frame holds
    uint64_t    branch_id;  // The branch to execute the next time the frame gets resumed
    uint64_t    unknown3;
};

void afoo(...) {
        uint64_t *unknown3_ptr = &frame_ptr->unknown3;
        uint64_t *branch_id_ptr = &frame_ptr->branch_id;
        uint64_t stored_branch_id = frame_ptr->branch_id;
        // -- snip --
}

Let’s keep going! First, let’s look at what happens in our flow, that is, stored_branch_id is 0. In that case, we jump to .LBB2_1:

.LBB2_1:
        mov     rax, qword ptr [rbp - 16]
        mov     qword ptr [rax], 1
        mov     qword ptr [rax], 2
        add     rsp, 32
        pop     rbp
        ret

Ok, there’s something odd happening here. We set *branch_id_ptr to 1, and then immediatly set it to 2. My guess is that this is a result of us using the --single-threaded flag. My assumption that this is some sort of spinlock or condition variable, to prevent the running of the same frame in several threads. Let’s check that assumption later! For now, we’ll ignore it.

We can now “split” our afoo() function to two branches:

fn afoo() void {
    // branch_id 0
    suspend;
    // branch_id 2
}

Now, we clean up the stack frame and ret. This is important: A suspend translates to a ret. It doesnt “suspend” anything. It just returns to the caller, like any “normal” function!

Let’s check out the other branch:

.LBB2_8:
        mov     rax, qword ptr [rbp - 24]
        sub     rax, 1
        je      .LBB2_3
        jmp     .LBB2_9

Hmm. It checks that stored_branch_id is not 1. Our standing assumption that frame.branch_id == 1 is a flag, implying that another thread is running this frame currently. Indeed, if we follow the case stored_branch_id == 1 we arrive at:

.LBB2_3:
        xor     eax, eax
        mov     esi, eax
        movabs  rdi, offset __unnamed_2
        call    panic

__unnamed_5:
        .asciz  "resumed a non-suspended function"

__unnamed_2:
        .quad   __unnamed_5
        .quad   32

We see that in that case, panic() is called with %esi% = 0 and %rdi% pointing to the Pascal String containing the error message "resumed a non-suspended function". This confirms our suspicion: in multithreaded build, branch_id == 1 implies that the frame is currently running.

This brings up another thing that have been bothering me. Why both &frame_ptr->branch_id and frame_ptr->branch_id are stored in the local stack frame of afoo()? My guess is that the storing and loading of frame_ptr->branch_id in multithreaded builds happens atomically, to prevent race conditions. Let’s add that assumption to our ever-growing list of stuff to check.

Well, lets continue. If no one else is running the frame, we arrive at another check:

.LBB2_9:
        mov     rax, qword ptr [rbp - 24]
        sub     rax, 2
        je      .LBB2_4
        jmp     .LBB2_2

This checks if the current branch identifier is 2, that is, we are just after the first (and only) suspend point. If we’re not after that one, we arrive at the following panic() call:

.LBB2_2:
        xor     eax, eax
        mov     esi, eax
        movabs  rdi, offset __unnamed_1
        call    panic

__unnamed_4:
        .asciz  "resumed an async function which already returned"

__unnamed_1:
        .quad   __unnamed_4
        .quad   48

Makes sense: If the branch identifier is not 0 (the first branch), not 1 (the coroutine is currnetly running) and not 2 (the last branch), it means that this frame has already returned. Cool.

If the branch identifier is 2, we arrive at:

.LBB2_4:
        mov     rax, qword ptr [rbp - 16]
        mov     qword ptr [rax], 1
        mov     qword ptr [rax], -1
        mov     rcx, qword ptr [rbp - 8]
        mov     rdx, qword ptr [rcx]
        mov     rsi, rdx
        not     rsi
        mov     qword ptr [rcx], rsi
        mov     rsi, rdx
        sub     rsi, -1
        mov     qword ptr [rbp - 32], rdx
        je      .LBB2_5
        jmp     .LBB2_10

Let’s break this down:

We put 1 in frame->branch_id. This means we’re letting other threads know that this frame is now running.
We put -1 in frame->branch_id. That value is probably used to indiciate that this frame is done. The next time someone will try to resume this frame, the program will panic, since it’s branch identifier is not 0, 1 or 2.
We store unknown3_ptr in a local variable: original_u3 = *unknown3_ptr.
We take unknown3_ptr, and bitwise not it: *unknown3_ptr = ~(*unknown3_ptr).
We check if original_u3 was was equal to -1, and branch accordingly.

Now, this is strange. Really strange. The first two steps make sense, we already understand them pretty well: we set the flag saying that this frame is running, and then mark the frame as complete. It’s not that clear what’s going on with the other three, so let’s add them to our mystery list.

The branching is even more peculiar. If original_u3 == -1, we panic() with the following message: "async function returned twice", as we can see in:

.LBB2_5:
        xor     eax, eax
        mov     esi, eax
        movabs  rdi, offset __unnamed_3
        call    panic

__unnamed_6:
        .asciz  "async function returned twice"

__unnamed_3:
        .quad   __unnamed_6
        .quad   29

Otherwise, we check if original_u3 is actually 0:

.LBB2_10:
        mov     rax, qword ptr [rbp - 32]
        test    rax, rax
        je      .LBB2_6
        jmp     .LBB2_7

If it is, we just return, nothing special:

.LBB2_6:
        add     rsp, 32
        pop     rbp
        ret

Now here’s the weird part. If original_u3 isn’t 0, we clear the stack frame and perform an absolute jump to the address stored in original_u3:

.LBB2_7:
        mov     rax, qword ptr [rbp - 32]
        mov     rcx, qword ptr [rbp - 32]
        mov     rdx, qword ptr [rcx]
        mov     rsi, -2
        mov     rdi, rax
        add     rsp, 32
        pop     rbp
        jmp     rdx

So that’s it! We went over the entire machine code generated by the most simple asynchronous program. Let’s sum up what our assumption are so far.

An asynchronous function takes at least one parameter, which is a pointer to it’s frame.
An asynchrnous function is composed of branches, one for each code segment between two supsension points.
Each branch has a specific branch identifier: a number that corresponds to that specific branch.
There are a couple of special identifiers:
- 0 is the first branch identifier, which represents the first call to the function.
- 1 is a flag, which signals that this frame is currently running, maybe on another thread.
The frame of an empty function holds the following data:
- A pointer to the asynchronous function the frame holds, which we will call func.
- The frame’s current branch identifier. This is the next branch that will execute when resume is used on the frame. We will call this member branch_id
- An unknown qword, which we will call unknown3.
In “debug” builds, there are several safety-guarentees: A frame can’t run in multiple threads at the same time, and you can’t resume a frame that have already returned, or return from the same frame twice.

We also have a few questions that arose during our research:

What is the purpose of unknown3?
Is access to branch_id in multithreaded builds atomic?
What happens when multiple threads try to resume the same frame?
What is the absolute jump to *unknown3 used for?

Those are some pretty tough questions, we’ll try to tackle them one by one. The next features we’re going to take a look at is understanding what happens when we resume a frame, and the we’ll start adding arguments and local variables to our coroutine.

That was exhausting, but if you reached this far, good job! I hope youv’e learnt something new (I sure did).