Leveraging eBPF as a feedback loop for runtime system tuning

Problem Motivation

When building large projects such as the Android Open Source Project (AOSP), system resource limits often decide whether the build succeeds or fails. For instance, running make -j$(nproc) on a high-end system (e.g., 128GB RAM and a modern multi-core CPU) typically completes successfully thanks to ample memory and processing power. By contrast, on a system with only 16GB RAM, the build is prone to failures caused by Linux’s OOM (Out-Of-Memory) killer terminating processes to reclaim memory, or by other faults that stem from resource contention, such as segmentation faults under extreme memory pressure.

However, the issue is rarely just about RAM. System-wide bottlenecks, such as a slow CPU that struggles to compile code efficiently or sluggish I/O (e.g., HDDs or slow SSDs) that delays read/write operations, can compound memory pressure. These limitations keep processes resident in memory longer than necessary, which overloads the system and triggers instability. In such cases, even adequate RAM may not prevent failures if the CPU or storage cannot keep pace with the build’s demands.

Background

We use eBPF to trace system-wide CPU, memory, I/O, and scheduler metrics. Throttling (that is, controlling build resource consumption) can be based on resource thresholds such as Pressure Stall Information (PSI) APIs [1-2] and memory watermarks [3]. We can read those watermark levels and tune the system on demand, for example by increasing background kswapd activity so enough memory remains available. PSI also reports how long the system has been under pressure for memory, I/O, and CPU.
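As a rough illustration of the PSI data mentioned above, the following user-space sketch parses the text format that the kernel documents for /proc/pressure/cpu, /proc/pressure/memory, and /proc/pressure/io [1]. The names PsiLine, parse_psi, and read_psi are hypothetical helpers introduced here, and this is a plain Python reader rather than the eBPF program itself.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PsiLine:
    """One line of a PSI file: stall percentages and cumulative stall time."""
    avg10: float    # % of time stalled over the last 10 seconds
    avg60: float    # % of time stalled over the last 60 seconds
    avg300: float   # % of time stalled over the last 300 seconds
    total_us: int   # cumulative stall time in microseconds


def parse_psi(text: str) -> Dict[str, PsiLine]:
    """Parse PSI text such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()          # kind is "some" or "full"
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = PsiLine(
            avg10=float(values["avg10"]),
            avg60=float(values["avg60"]),
            avg300=float(values["avg300"]),
            total_us=int(values["total"]),
        )
    return result


def read_psi(resource: str) -> Dict[str, PsiLine]:
    """Read PSI for 'cpu', 'memory', or 'io' (requires a kernel built with PSI)."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_psi(f.read())
```

For example, read_psi("memory")["some"].avg10 would give the share of the last 10 seconds in which at least one task was stalled on memory.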

Our approach has two steps: (1) perform runtime tuning, such as reducing the number of parallel jobs; and (2) if the system becomes saturated and no safe option remains, freeze the process at cgroup granularity instead of letting it get killed, tune the system, and then thaw the process. One practical method is to intercept make -j$(nproc) and run the build with a dynamically scaled job pool.
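The freeze-and-thaw part of step (2) could be sketched as follows, assuming a cgroup v2 hierarchy, where writing 1 or 0 to a cgroup's cgroup.freeze file freezes or thaws every process in it. The helper names set_frozen and frozen are hypothetical, and the tuning that happens while the build is frozen is left to the caller.

```python
from contextlib import contextmanager
from pathlib import Path


def set_frozen(cgroup_dir: Path, frozen_state: bool) -> None:
    """Freeze (True) or thaw (False) all processes in a cgroup v2 directory."""
    (cgroup_dir / "cgroup.freeze").write_text("1" if frozen_state else "0")


@contextmanager
def frozen(cgroup_dir: Path):
    """Keep the cgroup frozen for the duration of the with-block.

    The cgroup is thawed again even if the tuning code raises, so a bug in
    the tuner does not leave the build suspended forever.
    """
    set_frozen(cgroup_dir, True)
    try:
        yield
    finally:
        set_frozen(cgroup_dir, False)
```

A caller might write `with frozen(Path("/sys/fs/cgroup/build")): tune_system()`, where both the cgroup path and tune_system are placeholders for illustration.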

I am exploring an eBPF interceptor (a wrapper layer) that takes make -j$(nproc) and throttles build resource usage based on runtime conditions. Part of this logic may need kernel support. If so, we can use kfuncs to connect eBPF and kernel modules and perform throttling inside the kernel. The main challenge is safety: any kernel extension must be reliable, free of regressions, and well optimized, especially when combined with FDO (feedback-directed optimization), LTO (link-time optimization), and other compiler optimizations.

Let us walk through one way to use an eBPF wrapper to intercept and tune the build process.

  1. Create a cgroup to manage CPU, memory, and I/O (i.e., cgcreate -g cpu,io,memory:$(throttle_build_cgroup)).
  2. Develop an eBPF program that observes watermarks or the PSI APIs for CPU, I/O, and memory.
  3. Attach the cgroup to this eBPF tool (i.e., `$(eBPF_tool) --cgroup $(throttle_build_cgroup)`).
  4. Run the build process inside the cgroup (i.e., cgexec -g cpu,io,memory:$(throttle_build_cgroup) $(our_build_job)).
  5. $(eBPF_tool) has callbacks that detect memory, I/O, and CPU pressure, raise an alert, and handle it either directly or by delegating to another in-kernel (or user-space) handler.
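One way a user-space handler could be notified of pressure, without polling averages, is the PSI trigger interface described in the kernel documentation [1]: a trigger string of the form "<some|full> <stall_us> <window_us>" is written to a pressure file, and the file descriptor then reports POLLPRI when the threshold is crossed. The sketch below assumes that interface. The function names are hypothetical, and the window bounds checked here (500ms to 10s) follow the limits stated in that documentation.

```python
import os
import select


def psi_trigger_spec(kind: str, stall_us: int, window_us: int) -> str:
    """Build a PSI trigger string, e.g. 'some 150000 1000000'."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500ms and 10s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall threshold must be positive and fit in the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(path: str, spec: str, timeout_ms: int) -> bool:
    """Register a trigger on a pressure file and wait for one event.

    Returns True if the threshold was crossed before the timeout.
    """
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, spec.encode() + b"\0")   # kernel example writes the NUL too
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        for _, event in poller.poll(timeout_ms):
            if event & select.POLLERR:
                raise OSError("PSI event source is no longer available")
            if event & select.POLLPRI:
                return True
        return False
    finally:
        os.close(fd)
```

For instance, wait_for_pressure("/proc/pressure/memory", psi_trigger_spec("some", 150000, 1000000), 5000) would wait up to 5 seconds for 150ms of memory stall within any 1 second window. Registering triggers may require elevated privileges depending on the kernel version.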

Steps 1 through 5 are relatively straightforward. The harder part is deciding exactly how to handle each overload condition.

The handling logic can look like this:

if memory_usage > 85% || cpu_usage > 90% || io_latency > 100ms:
    - Solution 1:
        - pause the build (`kill -STOP $(pid_of_build_process)`)
        - NEW_JOBS=$(calc_new_job_count)  // choosing this value is the hard part and the focus of our research
        - re-run the build with the new job count (`make -j$NEW_JOBS`)
    - Solution 2:
        - cgroup throttling
            - echo "50000 100000" > /sys/fs/cgroup/"$(throttle_build_cgroup)"/cpu.max
    - Solution 3:
        - If the system saturates and free memory has fallen to the critical watermark levels, freeze the entire cgroup.
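To make Solutions 1 and 2 more concrete, here is a deliberately simple sketch. The job-count rule is only a placeholder heuristic (halve under high pressure, add one job under low pressure), not a proposed answer to the open question of how calc_new_job_count should work. The thresholds and the names next_job_count and cpu_max_value are assumptions made for illustration. The cpu.max string follows the cgroup v2 "quota period" format in microseconds.

```python
def next_job_count(current: int, max_jobs: int, pressure_avg10: float,
                   high: float = 20.0, low: float = 5.0) -> int:
    """Pick the next -j value from a PSI avg10 percentage.

    Placeholder policy: back off quickly when pressure is high, recover
    slowly when it is low, and otherwise keep the current value.
    """
    if pressure_avg10 >= high:
        proposed = current // 2      # multiplicative decrease
    elif pressure_avg10 <= low:
        proposed = current + 1       # additive increase
    else:
        proposed = current
    return max(1, min(max_jobs, proposed))   # always between 1 and max_jobs


def cpu_max_value(quota_us: int, period_us: int = 100_000) -> str:
    """Format a cgroup v2 cpu.max value; '50000 100000' caps usage at 50% of one CPU."""
    if quota_us <= 0 or period_us <= 0:
        raise ValueError("quota and period must be positive")
    return f"{quota_us} {period_us}"
```

With this rule, a build running 16 jobs under 35% memory pressure would drop to 8 jobs, and would then climb back one job at a time once pressure stays below the low threshold.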

One note: we also need to disable the system's default OOM killer and guarantee that our throttler is the only mechanism handling the low-resource condition.

A second note: if tuning or throttling turns out to be too slow when delegated to user space, we can do it in kernel modules instead. Likewise, when we hit the limits of what eBPF can do, we may need kfuncs that call into kernel modules, where we have more privileges and a larger context, to perform actions that are hard or impossible in eBPF alone. One challenge is that we need to guarantee the kernel modules do not introduce new kernel bugs.

Other on-the-fly system tuning ideas

  1. To help the build pass, we can also track resource utilization and locate the bottleneck; for example, if we detect a CPU bottleneck, we can overclock the CPU on the fly [5].

  2. Suppose we use a Mellanox Gigabit NIC that is prone to overheating. We can detect this using eBPF and adjust the fan speed on the fly. Another idea is to track recently modified files in order to speed up incremental backups [5].

  3. Detect which process is hung (e.g., a slow Firefox tab that makes the mouse pointer stutter) and kill it.

References

[1] https://docs.kernel.org/accounting/psi.html

[2] https://source.android.com/docs/core/perf/lmkd

[3] https://www.kernel.org/doc/gorman/html/understand/understand005.html

[4] https://lpc.events/event/4/contributions/404/attachments/326/550/Handling_memory_pressure_on_Android.pdf

[5] https://dl.acm.org/doi/10.1016/j.sysarc.2024.103130