Recently, we at NEAR spend significant time and brainpower to sum up our vision on using virtual machines at the blockchains, especially on the possible design and architectural problems and decisions. Format of FAQ seemed to be the most suitable for that particular communications, so please welcome "The Contract Runtime VM FAQ". It is not a typical architecture white paper, as we intentionally avoid being too pushy in design decisions, and mostly describe particular problem/dichotomy as a question and describe approaches and opinions we consider constructive in solving it. And of course, NEAR's own virtual machine for blockchain follows the guidelines and opinions we express here.
Tuesday, June 23, 2020
Wednesday, April 22, 2020
Precise gas metering
The goal of gas metering in next generation blockchains, such as NEAR, is to provide mutually acceptable by validators and users cost of elementary computations, such as hashing, storage access, and smart contract regular execution. Previously, NEAR used gas metering calibration based on time elapsed, but this approach is tied to the hardware used, affected by the OS scheduler behavior and system load, and is not very stable or reproducible. Instead, we decided to come up with the completely transparent and reproducible approach to the gas metering, which could be reproduced and understood by any NEAR blockchain user.
What could be such a metric, if we are talking about computation intensive operations? The number of CPU instructions produced by an optimizing compiler required for performing certain action looks like sensible and reproducible metrics. But how could we count that? Of course, modern CPU has counters, like RDTSC, which shows the number of CPU clock generator ticks at the moment, but it doesn't properly account for task switching, different CPU microarchitectures (so that same instruction may take different number of ticks) and other instability sources. And we want to be precise and stable!
Fortunately, QEMU emulator has the mode for running only userland part of the program. We likely do not care about the kernel price of operation, as it mostly reflects the kernel mechanisms, not the computational complexity of what we are doing. And moreover, recent QEMU provides API for instrumenting QEMU's dynamic translator named TCG (Tiny Code Generator).
We can now combine those technologies and implement the exact CPU instructions measurement technology suitable for arbitrary Linux program. Plugin need to instrument all code generated by TCG, when requested — start counting, and on another request — stop counting and report back the number of operations performed since counting started. Note that we need to always instrument code produced by the TCG, as it maintains the compiled code cache and it is possible (and highly likely) that code participating in the certain operation (for example, malloc(3) implementation) is already compiled at the moment we want to start counting. We also need to capture the moment when code is ready to start and stop counting. Easiest way to achieve those two goals is to install listeners on code compilation by TCG and on system call execution.
Following code install those two handlers:
In TCG instrumentation callback we do the following operation:
i.e. on every instruction translated by TCG we register execution callback which generates call to function vcpu_insn_exec_before before executing the code emulating the actual instruction.
And the callback itself is implemented as:
We do compare the current thread with the one we remembered earlier to only count instructions executed on the particular thread, as QEMU in usermode emulation maps 1:1 guest to host threads.
Another interesting problem is how do we active and deactivate counters, and return back the results.
It is achieved by the syscall listener we installed during plugin installation function.
This function is invoked when any system call is executed. Here we decided to take syscall number 0 (implementing read(2) in Linux kernel). and check if the file descriptor argument (first argument a1) is equal to the magic constant CATCH_BASE, which is the huge number 0xcafebabe, so it's very unlikely the actual file descriptor of normal read(2) call will be equal to that value.
It means, if the guest code will execute code like this one:
we will start counting. We stop counting in the similar manner, with an operation like:
and here the matching code in the plugin does an interesting trick:
So we look at the third argument of the syscall a3 (which is the buffer size), and if it is equal to 8, we use the second argument a2 (pointer to the buffer) to store the number of instructions counted there. It works, as QEMU uses 1:1 address space translation between the guest and the host, so pointer passed from the guest is the valid pointer in the simulator.
As the result, we are able to obtain reproducible (modulo compiler and libraries versions) instruction count from arbitrary Linux userland program and successfully use the obtained instruction counter to measure the gas cost for the NEAR blockchain.
Of course, life would be too easy, if it just worked out of the box, and my previous post describes the QEMU memory leak which had to be fixed before the whole gas price estimator can be run.
Afterwards, the described technique is pretty generic and could be used for fair gas price estimation in other blockchains and in other scenarios, where exact instruction cost of a userland program had to be figured out!
What could be such a metric, if we are talking about computation intensive operations? The number of CPU instructions produced by an optimizing compiler required for performing certain action looks like sensible and reproducible metrics. But how could we count that? Of course, modern CPU has counters, like RDTSC, which shows the number of CPU clock generator ticks at the moment, but it doesn't properly account for task switching, different CPU microarchitectures (so that same instruction may take different number of ticks) and other instability sources. And we want to be precise and stable!
Fortunately, QEMU emulator has the mode for running only userland part of the program. We likely do not care about the kernel price of operation, as it mostly reflects the kernel mechanisms, not the computational complexity of what we are doing. And moreover, recent QEMU provides API for instrumenting QEMU's dynamic translator named TCG (Tiny Code Generator).
We can now combine those technologies and implement the exact CPU instructions measurement technology suitable for arbitrary Linux program. Plugin need to instrument all code generated by TCG, when requested — start counting, and on another request — stop counting and report back the number of operations performed since counting started. Note that we need to always instrument code produced by the TCG, as it maintains the compiled code cache and it is possible (and highly likely) that code participating in the certain operation (for example, malloc(3) implementation) is already compiled at the moment we want to start counting. We also need to capture the moment when code is ready to start and stop counting. Easiest way to achieve those two goals is to install listeners on code compilation by TCG and on system call execution.
Following code install those two handlers:
In TCG instrumentation callback we do the following operation:
i.e. on every instruction translated by TCG we register execution callback which generates call to function vcpu_insn_exec_before before executing the code emulating the actual instruction.
And the callback itself is implemented as:
We do compare the current thread with the one we remembered earlier to only count instructions executed on the particular thread, as QEMU in usermode emulation maps 1:1 guest to host threads.
Another interesting problem is how do we active and deactivate counters, and return back the results.
It is achieved by the syscall listener we installed during plugin installation function.
This function is invoked when any system call is executed. Here we decided to take syscall number 0 (implementing read(2) in Linux kernel). and check if the file descriptor argument (first argument a1) is equal to the magic constant CATCH_BASE, which is the huge number 0xcafebabe, so it's very unlikely the actual file descriptor of normal read(2) call will be equal to that value.
It means, if the guest code will execute code like this one:
we will start counting. We stop counting in the similar manner, with an operation like:
and here the matching code in the plugin does an interesting trick:
So we look at the third argument of the syscall a3 (which is the buffer size), and if it is equal to 8, we use the second argument a2 (pointer to the buffer) to store the number of instructions counted there. It works, as QEMU uses 1:1 address space translation between the guest and the host, so pointer passed from the guest is the valid pointer in the simulator.
As the result, we are able to obtain reproducible (modulo compiler and libraries versions) instruction count from arbitrary Linux userland program and successfully use the obtained instruction counter to measure the gas cost for the NEAR blockchain.
Of course, life would be too easy, if it just worked out of the box, and my previous post describes the QEMU memory leak which had to be fixed before the whole gas price estimator can be run.
Afterwards, the described technique is pretty generic and could be used for fair gas price estimation in other blockchains and in other scenarios, where exact instruction cost of a userland program had to be figured out!
Sunday, April 19, 2020
Twoliner's tale
Let's begin from the very end. From the patch, the simple one.
--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -650,6 +650,8 @@ int target_munmap(abi_ulong start, abi_ulong len)
if (ret == 0) {
page_set_flags(start, start + len, 0);
tb_invalidate_phys_range(start, start + len);
+ if (start < mmap_next_start)
+ mmap_next_start = start;
}
mmap_unlock();
return ret;
This is the fix to QEMU's usermode Linux emulator used for gas cost estimator. The story of the gas price estimator will be told in the next post, but for now sufficient information is that this is the mechanism we use to compute the exact number of x86 instructions to perform certain operation.
We executed the program in question under QEMU and instrument every instruction with the callback, so that exact number of instructions is the reproducible operation cost.
To achieve this goal simple QEMU JIT plugin was written. It is activated by magic syscall (read on fd == 0xcafebabe), and starts counting instructions. When deactivated by read on magic fd == 0xcafebabf) we write the number of executed instruction into syscall's argument buffer, so that emulated program can know how many CPU instructions were executed.
Writing plugin was nice and straightforward, tests were working and everything was great. However if I execute the actual NEAR gas price estimator under QEMU (with or without instruction counter plugin) - it uses up all the memory and dies.
Hmm, so what could be wrong?
If the same application is executed without QEMU, it works fine and no memory leaks happens. So obviously, it's something wrong with the QEMU or how application is executed under QEMU. What could it be?
As this is memory leak, we need to understand what kind of memory is actually leaking. We hope, that leaking memory is the regular malloc(3) memory, so let's use GLIBC malloc tracing functionality mtrace(3).
We run QEMU with activated mtrace(3), and see that when benchmark enters the memory hog mode, there are a lot of 16K allocations, but where the come from? Let's attach to the QEMU with the debugger, it shall help, right? As we develop inside Docker container, so have to restart container with --cap-add=SYS_PTRACE --security-opt seccomp=unconfined otherwise gdb cannot attach to another process. So we wait until benchmark reaches the hog mode by running docker exec -it a4f8748760c0 top in another terminal, and attach to QEMU process using gdb's attach command.
And then we set breakpoint in g_malloc0() function, resume execution and see the following backtrace:
So, we execute mprotect syscall, and in its implementation we allocate this large 16K buffer. But why we do so?
It turns out that Wasmer WebAssembly VM used by NEAR to run contracts, has the following code in it:
Writing plugin was nice and straightforward, tests were working and everything was great. However if I execute the actual NEAR gas price estimator under QEMU (with or without instruction counter plugin) - it uses up all the memory and dies.
Hmm, so what could be wrong?
If the same application is executed without QEMU, it works fine and no memory leaks happens. So obviously, it's something wrong with the QEMU or how application is executed under QEMU. What could it be?
As this is memory leak, we need to understand what kind of memory is actually leaking. We hope, that leaking memory is the regular malloc(3) memory, so let's use GLIBC malloc tracing functionality mtrace(3).
We run QEMU with activated mtrace(3), and see that when benchmark enters the memory hog mode, there are a lot of 16K allocations, but where the come from? Let's attach to the QEMU with the debugger, it shall help, right? As we develop inside Docker container, so have to restart container with --cap-add=SYS_PTRACE --security-opt seccomp=unconfined otherwise gdb cannot attach to another process. So we wait until benchmark reaches the hog mode by running docker exec -it a4f8748760c0 top in another terminal, and attach to QEMU process using gdb's attach command.
And then we set breakpoint in g_malloc0() function, resume execution and see the following backtrace:
#0 0x00007f700b0cd900 in g_malloc0 () from /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#1 0x00005647d88718f7 in page_find_alloc (index=index@entry=225465344, alloc=alloc@entry=1) at /host/qemu/accel/tcg/translate-all.c:527
#2 0x00005647d8873619 in page_set_flags (start=start@entry=923492368384, end=end@entry=923559477248, flags=27, flags@entry=11)
at /host/qemu/accel/tcg/translate-all.c:2516
#3 0x00005647d889b99f in target_mprotect (start=923492368384, len=len@entry=67108864, prot=prot@entry=3) at /host/qemu/linux-user/mmap.c:119
#4 0x00005647d8893060 in do_syscall1 (cpu_env=cpu_env@entry=0x5647da1d2370, num=num@entry=10, arg1=<optimized out>, arg1@entry=923492368384,
arg2=<optimized out>, arg2@entry=67108864, arg3=<optimized out>, arg4=34, arg5=4294967295, arg6=0, arg8=0, arg7=0) at /host/qemu/linux-user/syscall.c:9151
#5 0x00005647d8897db3 in do_syscall (cpu_env=cpu_env@entry=0x5647da1d2370, num=10, arg1=923492368384, arg2=67108864, arg3=<optimized out>,
arg4=<optimized out>, arg5=4294967295, arg6=0, arg7=0, arg8=0) at /host/qemu/linux-user/syscall.c:12440
#6 0x00005647d88a3ca2 in cpu_loop (env=0x5647da1d2370) at /host/qemu/linux-user/x86_64/../i386/cpu_loop.c:214
#7 0x00005647d882dcbe in main (argc=<optimized out>, argv=0x7ffc6eba3ff8, envp=<optimized out>) at /host/qemu/linux-user/main.c:876So, we execute mprotect syscall, and in its implementation we allocate this large 16K buffer. But why we do so?
It turns out that Wasmer WebAssembly VM used by NEAR to run contracts, has the following code in it:
Thus, on every execution of the smart contract (Webassembly program) we allocate and free 6G of address space. So, we're getting closer. QEMU Linux usermode emulator manages memory mapping with this code. As one may see, even when mmap(2) syscall in guest doesn't have MAP_FIXED flag, controlling the desired placement of returned memory, QEMU still enforces fixed placement, and tries to manage VMA regions on its own.
And so address placing algorithm always finds the new base address for mmap(2) result. Could it be problematic or leaking? Indeed so!
Even if there's no real memory under VMA region, QEMU's software MMU manages page table entries for the newly created mapping, and so the page table entry is allocated, as we saw in the debugger.
And thus for the huge regions mmap'ed and munmap'ed we still maintain the page tables, and they never get disposed. This is exactly the leak! After we realized the nature of the problem fix becomes rather simple, and is just the two lines we started with. Whenever memory mapping is released, we allow VMA searching heuristics to find recently unmapped VMA, and thus avoid creating unneeded page table entries.
It only took whole weekend full of hacking and meditation, and resulted in simple patch recently sent to QEMU developers for inclusion in the mainline.
This little story shows the dark side of system-level programming, and unsuitability of LOC written or changed as metrics for efforts in software development.
Subscribe to:
Posts (Atom)