Zigtastic Async
Goal
The goal of this post is to give us some basic intuition about the machine code generated by the Zig compiler when we use asynchronous functions. We’ll start by digging into the simplest asynchronous function there is, slowly adding features like local variables, parameters, multi-threaded support, await
and more.
The only tool I’ll use in this expedition is the Godbolt Compiler Explorer, and maybe an x86_64 instruction reference.
I’m going to assume some standard knowledge and mention concepts like registers, memory, stack frames and the heap without any elaboration. I’m also not going to into details about the syntax of Zig, C or intel x86_64 assembly. I don’t think knowing Zig is a preliminary for this blogpost, since the syntax is pretty straightforward in my opinion, but feel free to check the official Zig language documentation.
What the fuck is Zig
So. What is Zig? According to their site:
Zig is a general-purpose programming language and toolchain for maintaining robust, optimal, and reusable software.
Sounds cool, doesn’t tell you a whole lot when you think about it. This is not a Zig tutorial or spotlight, so I won’t go into a long monologue about why I think it’s actually a great programming language, I’ll just mention the main features:
- It’s a compiled language. That is, Zig source code is used to generate machine code that runs natively on the target machine. No VM, no runtime.
- It has no hidden control flow: if something doesn’t look like a function call, it isn’t a function call.
- It has four build modes: “debug”, “release-safe”, “release-fast” and “release-small”. In this blogpost, we’ll compile with “debug”.
- It supports coroutines.
Why the fuck is Zig
The reason I chose Zig as the focus of my first few blogposts about asynchronous programming in compiled languages is its simplicity.
Zig supports very “low-level” features of coroutines, using only four keywords: async
, suspend
, resume
and await
. Essentially, it is exactly the “bare-minimum” of asynchronous programming. No hidden event loop, no yield
, just “napping functions”. This is a great candidate for the first implementation to look into, since C++ for example supports generators and yield
s in its standard.
Environment
I’m going to use Godbolt’s Zig 0.6.0 compiler, and compile in debug
and --single-threaded
.
A simple program
Let’s start with simplest async program imaginable:
fn afoo() void {
suspend;
}
export fn caller() void {
var frame = async afoo();
}
We are going to go through the entire blob of machine-code generated by this small snippet. It might be a bit long, but I found it very educating. Afterwards, we’ll sum up our findings and list the questions that we didn’t answer yet.
We’ll compile it with --single-threaded
, since there is a slight difference I saw when compiling without this flag, which might complicate things. Don’t worry, we’ll get back to that!
Stick this sucker in Godbolt, and let’s look at the output…
Click me for assembly and shit
Well. That’s a lot of code. We’ll go through it step by step.
Analysis
It starts with the standard prolog:
caller:
push rbp
mov rbp, rsp
sub rsp, 32
which means the frame
variable is 32 bytes wide. We’ll write that down, and update it as we go.
struct afoo_frame {
uint8_t unknown[32];
};
Godbolt tells us which line generates which instructions. So we see that the line:
var frame = async afoo();
generates:
movabs rax, offset afoo
mov qword ptr [rbp - 32], rax
mov qword ptr [rbp - 24], 0
mov qword ptr [rbp - 16], 0
lea rdi, [rbp - 32]
mov rsi, -3
call afoo
There are quite a few things going on here. We put the offset of afoo()
in the first 8 bytes of the afoo_frame
struct. We also put zeros in the 8th and 16th offsets of the structure. This makes us suspect that the structure is actually 24 bytes wide, and sub rsp, 32
was emitted to keep the stack 16-bytes aligned (or something of that sort). This makes us guess it looks a bit like this:
struct afoo_frame {
fptr_t func; // The async function the frame holds
uint64_t unknown2; // Initialized to 0
uint64_t unknown3; // Initialized to 0
};
What’s the point of putting the function pointer in the frame? Remember, when we use async
, we can continue a frame without knowing which function we’re calling. We’re continuing a frame, not a specific function. This means the frame has to hold, in some manner, the function we’re about to call.
Now, the actual call to afoo()
looks like this:
afoo(
&frame // passed in %rdi%
-3, // passed in %rsi%
);
That’s odd. afoo()
is supposed to be a void (*)(void)
function, why is taking two parameters?
Well, the &frame
parameter actually makes sense. Remember a function frame holds all of its parameters and arguments. Since coroutines are invoked asynchronously, they can’t push that data on the stack, cause stack might change when they are called again. They have to recieve a pointer to their frame, and trust their caller that this pointer is valid throughout all of their execution. My guess is that in a “real” asynchrounous program those frames will be heap allocated, to ensure the frame’s lifetime throughout their program. In our case, the compiler knows that this frame will only exist when caller()
is running, so it’s making it stack allocated instead. We’ll come back to this topic later to check our assumption.
Now, the rsi
parameter is a bit peculiar. I think it’s related to Zig’s safety guarentees or the fact we’re in debug mode. I think rsi
is used as a parameter for the panic()
function. This assumption is based on the fact that the only reference to it is in the panic
assembly generated:
panic:
push rbp
mov rbp, rsp
sub rsp, 16
mov qword ptr [rbp - 8], rsi
call zig_panic
With that assumption in mind, I’m going to rudely ignore it throughout this article (please forgive me).
Now, let’s take a look at the start of the afoo()
function:
afoo:
push rbp
mov rbp, rsp
sub rsp, 32
mov rax, rdi
add rax, 16
mov rcx, rdi
add rcx, 8
mov rdx, qword ptr [rdi + 8]
test rdx, rdx
mov qword ptr [rbp - 8], rax
mov qword ptr [rbp - 16], rcx
mov qword ptr [rbp - 24], rdx
je .LBB2_1
jmp .LBB2_8
I trust you all to know your basic x86_64, so let’s zoom through this. We have the standard prolog, allocating 32 bytes on the stack. Remembering that &frame
as passed in %rdi%
, we understand that this code:
mov rax, rdi
add rax, 16
mov rcx, rdi
add rcx, 8
mov rdx, qword ptr [rdi + 8]
; -- snip --
mov qword ptr [rbp - 8], rax
mov qword ptr [rbp - 16], rcx
mov qword ptr [rbp - 24], rdx
Translates to something of this fashion:
void afoo(...) {
uint64_t *unknown3_ptr = &frame_ptr->unknown3;
uint64_t *unknown2_ptr = &frame_ptr->unknown2;
uint64_t unknown2 = frame_ptr->unknown2;
// -- snip --
}
Cool. Now, we have a have a branching:
mov rcx, rdi
; -- snip --
mov rdx, qword ptr [rdi + 8]
test rdx, rdx
; -- snip --
je .LBB2_1
jmp .LBB2_8
This suggests that frame.unknown2
is actually a branch identifier. I like to think of coroutines like this: every suspension point “splits” the function into two code blocks. The code following the suspension point and the code preceeding it. When resume
ing the frame, we have to know which block it has to continue from. Our current assumption is that Zig acheives this using a branch identifier saved in the frame, telling it where to resume next.
Lets rewrite our structure according to this assumption:
struct afoo_frame {
fptr_t func; // The async function the frame holds
uint64_t branch_id; // The branch to execute the next time the frame gets resumed
uint64_t unknown3;
};
void afoo(...) {
uint64_t *unknown3_ptr = &frame_ptr->unknown3;
uint64_t *branch_id_ptr = &frame_ptr->branch_id;
uint64_t stored_branch_id = frame_ptr->branch_id;
// -- snip --
}
Let’s keep going! First, let’s look at what happens in our flow, that is, stored_branch_id
is 0. In that case, we jump to .LBB2_1
:
.LBB2_1:
mov rax, qword ptr [rbp - 16]
mov qword ptr [rax], 1
mov qword ptr [rax], 2
add rsp, 32
pop rbp
ret
Ok, there’s something odd happening here. We set *branch_id_ptr
to 1, and then immediatly set it to 2. My guess is that this is a result of us using the --single-threaded
flag. My assumption that this is some sort of spinlock or condition variable, to prevent the running of the same frame in several threads. Let’s check that assumption later! For now, we’ll ignore it.
We can now “split” our afoo()
function to two branches:
fn afoo() void {
// branch_id 0
suspend;
// branch_id 2
}
Now, we clean up the stack frame and ret
. This is important: A suspend
translates to a ret
. It doesnt “suspend” anything. It just returns to the caller, like any “normal” function!
Let’s check out the other branch:
.LBB2_8:
mov rax, qword ptr [rbp - 24]
sub rax, 1
je .LBB2_3
jmp .LBB2_9
Hmm. It checks that stored_branch_id
is not 1. Our standing assumption that frame.branch_id == 1
is a flag, implying that another thread is running this frame currently. Indeed, if we follow the case stored_branch_id == 1
we arrive at:
.LBB2_3:
xor eax, eax
mov esi, eax
movabs rdi, offset __unnamed_2
call panic
__unnamed_5:
.asciz "resumed a non-suspended function"
__unnamed_2:
.quad __unnamed_5
.quad 32
We see that in that case, panic()
is called with %esi% = 0
and %rdi%
pointing to the Pascal String containing the error message "resumed a non-suspended function"
. This confirms our suspicion: in multithreaded build, branch_id == 1
implies that the frame is currently running.
This brings up another thing that have been bothering me. Why both &frame_ptr->branch_id
and frame_ptr->branch_id
are stored in the local stack frame of afoo()
? My guess is that the storing and loading of frame_ptr->branch_id
in multithreaded builds happens atomically, to prevent race conditions. Let’s add that assumption to our ever-growing list of stuff to check.
Well, lets continue. If no one else is running the frame, we arrive at another check:
.LBB2_9:
mov rax, qword ptr [rbp - 24]
sub rax, 2
je .LBB2_4
jmp .LBB2_2
This checks if the current branch identifier is 2, that is, we are just after the first (and only) suspend point. If we’re not after that one, we arrive at the following panic()
call:
.LBB2_2:
xor eax, eax
mov esi, eax
movabs rdi, offset __unnamed_1
call panic
__unnamed_4:
.asciz "resumed an async function which already returned"
__unnamed_1:
.quad __unnamed_4
.quad 48
Makes sense: If the branch identifier is not 0 (the first branch), not 1 (the coroutine is currnetly running) and not 2 (the last branch), it means that this frame has already returned. Cool.
If the branch identifier is 2, we arrive at:
.LBB2_4:
mov rax, qword ptr [rbp - 16]
mov qword ptr [rax], 1
mov qword ptr [rax], -1
mov rcx, qword ptr [rbp - 8]
mov rdx, qword ptr [rcx]
mov rsi, rdx
not rsi
mov qword ptr [rcx], rsi
mov rsi, rdx
sub rsi, -1
mov qword ptr [rbp - 32], rdx
je .LBB2_5
jmp .LBB2_10
Let’s break this down:
- We put 1 in
frame->branch_id
. This means we’re letting other threads know that this frame is now running. - We put -1 in
frame->branch_id
. That value is probably used to indiciate that this frame is done. The next time someone will try toresume
this frame, the program will panic, since it’s branch identifier is not 0, 1 or 2. - We store
unknown3_ptr
in a local variable:original_u3 = *unknown3_ptr
. - We take
unknown3_ptr
, and bitwise not it:*unknown3_ptr = ~(*unknown3_ptr)
. - We check if
original_u3
was was equal to -1, and branch accordingly.
Now, this is strange. Really strange. The first two steps make sense, we already understand them pretty well: we set the flag saying that this frame is running, and then mark the frame as complete. It’s not that clear what’s going on with the other three, so let’s add them to our mystery list.
The branching is even more peculiar. If original_u3 == -1
, we panic()
with the following message: "async function returned twice"
, as we can see in:
.LBB2_5:
xor eax, eax
mov esi, eax
movabs rdi, offset __unnamed_3
call panic
__unnamed_6:
.asciz "async function returned twice"
__unnamed_3:
.quad __unnamed_6
.quad 29
Otherwise, we check if original_u3
is actually 0:
.LBB2_10:
mov rax, qword ptr [rbp - 32]
test rax, rax
je .LBB2_6
jmp .LBB2_7
If it is, we just return, nothing special:
.LBB2_6:
add rsp, 32
pop rbp
ret
Now here’s the weird part. If original_u3
isn’t 0, we clear the stack frame and perform an absolute jump to the address stored in original_u3
:
.LBB2_7:
mov rax, qword ptr [rbp - 32]
mov rcx, qword ptr [rbp - 32]
mov rdx, qword ptr [rcx]
mov rsi, -2
mov rdi, rax
add rsp, 32
pop rbp
jmp rdx
So that’s it! We went over the entire machine code generated by the most simple asynchronous program. Let’s sum up what our assumption are so far.
- An asynchronous function takes at least one parameter, which is a pointer to it’s frame.
- An asynchrnous function is composed of branches, one for each code segment between two supsension points.
- Each branch has a specific branch identifier: a number that corresponds to that specific branch.
- There are a couple of special identifiers:
- 0 is the first branch identifier, which represents the first call to the function.
- 1 is a flag, which signals that this frame is currently running, maybe on another thread.
- The frame of an empty function holds the following data:
- A pointer to the asynchronous function the frame holds, which we will call
func
. - The frame’s current branch identifier. This is the next branch that will execute when
resume
is used on the frame. We will call this memberbranch_id
- An unknown qword, which we will call
unknown3
.
- A pointer to the asynchronous function the frame holds, which we will call
- In “debug” builds, there are several safety-guarentees: A frame can’t run in multiple threads at the same time, and you can’t
resume
a frame that have alreadyreturn
ed, orreturn
from the same frame twice.
We also have a few questions that arose during our research:
- What is the purpose of
unknown3
? - Is access to
branch_id
in multithreaded builds atomic? - What happens when multiple threads try to
resume
the same frame? - What is the absolute jump to
*unknown3
used for?
Those are some pretty tough questions, we’ll try to tackle them one by one. The next features we’re going to take a look at is understanding what happens when we resume
a frame, and the we’ll start adding arguments and local variables to our coroutine.
That was exhausting, but if you reached this far, good job! I hope youv’e learnt something new (I sure did).