Sunday, April 19, 2020

Twoliner's tale

Let's begin from the very end: with the patch, a simple one.

--- a/linux-user/mmap.c
+++ b/linux-user/mmap.c
@@ -650,6 +650,8 @@ int target_munmap(abi_ulong start, abi_ulong len)
     if (ret == 0) {
         page_set_flags(start, start + len, 0);
         tb_invalidate_phys_range(start, start + len);
+        if (start < mmap_next_start)
+            mmap_next_start = start;
     }
     mmap_unlock();
     return ret;


This is the fix to QEMU's user-mode Linux emulator that we use for the gas cost estimator. The story of the gas price estimator will be told in the next post; for now it is enough to know that this is the mechanism we use to compute the exact number of x86 instructions it takes to perform a certain operation.

We execute the program in question under QEMU and instrument every instruction with a callback, so that the exact number of instructions becomes a reproducible cost of the operation.

To achieve this goal, a simple QEMU JIT plugin was written. It is activated by a magic syscall (a read(2) on fd == 0xcafebabe) and starts counting instructions. When it is deactivated (by a read(2) on the magic fd == 0xcafebabf), we write the number of executed instructions into the syscall's argument buffer, so that the emulated program can know how many CPU instructions were executed.
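
To make this concrete, here is roughly what the guest side looks like. This is just a sketch: the magic fd values are the ones mentioned above, but the exact way the count comes back through the read(2) buffer is our plugin's private convention, so don't take the details literally.

#include <inttypes.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

#define COUNTER_START_FD 0xcafebabeUL  /* magic fd: start counting instructions */
#define COUNTER_STOP_FD  0xcafebabfUL  /* magic fd: stop and report the count   */

int main(void)
{
    uint64_t count = 0;

    /* A read(2) on the start fd activates the counter in the plugin. */
    syscall(SYS_read, COUNTER_START_FD, &count, sizeof(count));

    /* ... the operation being measured goes here ... */

    /* A read(2) on the stop fd makes the plugin write the number of executed
       instructions into the buffer passed as the syscall argument. */
    syscall(SYS_read, COUNTER_STOP_FD, &count, sizeof(count));

    printf("instructions executed: %" PRIu64 "\n", count);
    return 0;
}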

Writing the plugin was nice and straightforward, tests were passing and everything was great. However, when I ran the actual NEAR gas price estimator under QEMU (with or without the instruction counter plugin), it used up all the memory and died.

Hmm, so what could be wrong?

If the same application is executed without QEMU, it works fine and no memory leak happens. So obviously, something is wrong either with QEMU itself or with how the application is executed under QEMU. What could it be?

As this is a memory leak, we need to understand what kind of memory is actually leaking. We hope that the leaking memory is regular malloc(3) memory, so let's use glibc's malloc tracing facility, mtrace(3).
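
For reference, one simple way to activate mtrace(3) for an unmodified binary (a sketch; the actual setup used here may have differed) is an LD_PRELOAD shim whose constructor calls mtrace(), so QEMU itself needs no patching:

/* mtrace_shim.c: build with
 *   gcc -shared -fPIC -o mtrace_shim.so mtrace_shim.c
 * and run the emulator as
 *   MALLOC_TRACE=/tmp/qemu.mtrace LD_PRELOAD=./mtrace_shim.so qemu-x86_64 ./benchmark
 * then post-process the log with the mtrace(1) script from glibc. */
#include <mcheck.h>

__attribute__((constructor))
static void start_mtrace(void)
{
    /* mtrace() starts logging every malloc/realloc/free into the file
       named by the MALLOC_TRACE environment variable. */
    mtrace();
}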

We run QEMU with mtrace(3) activated and see that when the benchmark enters memory hog mode there are a lot of 16K allocations. But where do they come from? Let's attach to QEMU with the debugger, that shall help, right? As we develop inside a Docker container, we have to restart the container with --cap-add=SYS_PTRACE --security-opt seccomp=unconfined, otherwise gdb cannot attach to another process. So we wait until the benchmark reaches hog mode, watching it with docker exec -it a4f8748760c0 top in another terminal, and attach to the QEMU process using gdb's attach command.

Then we set a breakpoint on the g_malloc0() function, resume execution and see the following backtrace:

#0  0x00007f700b0cd900 in g_malloc0 () from /usr/lib/x86_64-linux-gnu/libglib-2.0.so.0
#1  0x00005647d88718f7 in page_find_alloc (index=index@entry=225465344, alloc=alloc@entry=1) at /host/qemu/accel/tcg/translate-all.c:527
#2  0x00005647d8873619 in page_set_flags (start=start@entry=923492368384, end=end@entry=923559477248, flags=27, flags@entry=11)
    at /host/qemu/accel/tcg/translate-all.c:2516
#3  0x00005647d889b99f in target_mprotect (start=923492368384, len=len@entry=67108864, prot=prot@entry=3) at /host/qemu/linux-user/mmap.c:119
#4  0x00005647d8893060 in do_syscall1 (cpu_env=cpu_env@entry=0x5647da1d2370, num=num@entry=10, arg1=<optimized out>, arg1@entry=923492368384, 
    arg2=<optimized out>, arg2@entry=67108864, arg3=<optimized out>, arg4=34, arg5=4294967295, arg6=0, arg8=0, arg7=0) at /host/qemu/linux-user/syscall.c:9151
#5  0x00005647d8897db3 in do_syscall (cpu_env=cpu_env@entry=0x5647da1d2370, num=10, arg1=923492368384, arg2=67108864, arg3=<optimized out>, 
    arg4=<optimized out>, arg5=4294967295, arg6=0, arg7=0, arg8=0) at /host/qemu/linux-user/syscall.c:12440
#6  0x00005647d88a3ca2 in cpu_loop (env=0x5647da1d2370) at /host/qemu/linux-user/x86_64/../i386/cpu_loop.c:214
#7  0x00005647d882dcbe in main (argc=<optimized out>, argv=0x7ffc6eba3ff8, envp=<optimized out>) at /host/qemu/linux-user/main.c:876

So, we execute an mprotect syscall, and in its implementation we allocate this large 16K buffer. But why do we do so?

It turns out that the Wasmer WebAssembly VM, which NEAR uses to run contracts, reserves a huge region of address space for every instance it creates.
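
I won't reproduce the actual Wasmer code here; in terms of the syscalls QEMU ends up seeing, the pattern is roughly the following. This is an illustrative C sketch only (Wasmer itself is Rust), with the 6G reservation and the mprotect(2) size taken from the story rather than from Wasmer's sources.

/* Illustrative only: one loop iteration corresponds to one contract execution. */
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

#define RESERVATION (6ULL << 30)   /* ~6G of address space reserved per instance */
#define COMMITTED   (64ULL << 20)  /* the part actually made accessible */

int main(void)
{
    for (int i = 0; i < 1000; i++) {
        /* Reserve address space only: no MAP_FIXED, nothing accessible yet. */
        void *base = mmap(NULL, RESERVATION, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
        if (base == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        /* Commit the beginning of the reservation; this is the mprotect(2)
           we caught in the backtrace above. */
        mprotect(base, COMMITTED, PROT_READ | PROT_WRITE);

        /* ... run the contract ... */

        /* Drop the whole reservation again. */
        munmap(base, RESERVATION);
    }
    return 0;
}

Run natively, this loop is harmless; run under a qemu-x86_64 without the fix, its memory footprint should grow steadily, which is the behaviour the estimator exhibited.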


Thus, on every execution of the smart contract (a WebAssembly program) we allocate and free 6G of address space. So, we're getting closer. The QEMU Linux user-mode emulator manages guest memory mappings itself, in linux-user/mmap.c. As one may see there, even when the guest's mmap(2) call doesn't carry the MAP_FIXED flag, which requests a specific placement of the returned memory, QEMU still enforces fixed placement on the host and tries to manage the VMA regions on its own.

And so the address placement algorithm keeps finding a brand-new base address for every mmap(2) result, never going back to ranges that have already been unmapped. Could that be problematic or leaky? Indeed it is!
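
Very roughly, and simplified a lot, the placement logic behaves like the sketch below. The real mmap_find_vma() in linux-user/mmap.c also probes the host for genuinely free ranges and handles alignment, reserved regions and wrap-around; the only detail that matters for this story is the mmap_next_start hint that only ever moves up.

/* Not the real QEMU code: a much-simplified model of the behaviour of
 * mmap_find_vma() in linux-user/mmap.c that matters for this leak. */
#include <stdint.h>

typedef uint64_t abi_ulong;        /* stand-in for QEMU's guest-address type */

static abi_ulong mmap_next_start;  /* search hint; before the patch it only grows */

static abi_ulong find_vma_sketch(abi_ulong start, abi_ulong size)
{
    if (start == 0) {
        start = mmap_next_start;   /* resume the search where the last one ended */
    }

    /* ... scan upwards from 'start' for a 'size'-byte range that is free both
       in the guest layout and on the host; call the result 'addr' ... */
    abi_ulong addr = start;        /* placeholder for the address that was found */

    /* Remember where to continue next time. Before the two-line patch,
       target_munmap() never moved this hint back down, so freed ranges
       were never considered again. */
    mmap_next_start = addr + size;
    return addr;
}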

Even if there's no real memory behind the VMA region, QEMU's software MMU maintains page table entries for the newly created mapping, so a page table chunk gets allocated, as we saw in the debugger: those are exactly the 16K blocks mtrace(3) was reporting.

And thus, for the huge regions being mmap'ed and munmap'ed, we keep accumulating page tables that never get disposed of. This is exactly the leak! Once the nature of the problem is understood, the fix becomes rather simple, and it is just the two lines we started with: whenever a memory mapping is released, we move the search hint back, so that the VMA-searching heuristic can reuse the recently unmapped range, and thus we avoid creating ever more page table entries.

It only took a whole weekend of hacking and meditation, and resulted in a simple patch recently sent to the QEMU developers for inclusion in the mainline.

This little story shows the dark side of systems-level programming, and the unsuitability of lines of code written or changed as a metric for effort in software development.

