2026-06-08
For thirty years, every time I wanted to know what the kernel was actually doing, I had only bad options. strace stops the traced process at every syscall via ptrace. perf samples but doesn't tell you why. SystemTap compiles a kernel module, and it made me cry in 2009. bpftrace finally fixes it: an awk-shaped language that compiles to eBPF, runs JIT-compiled in the kernel at near-native speed, and aggregates results in-kernel so you don't drown in events.
The one-liner most engineers learn first — and the one that has saved me a dozen late nights — is the system-wide openat() tap:
bpftrace -e '
tracepoint:syscalls:sys_enter_openat {
    printf("%-16s %s\n", comm, str(args->filename));
}'
Every process. Every file open. No attaching, no PID, no recompile. The overhead is maybe 1% on a busy box, and it scales with how many opens per second you are tracing. Try doing that with strace -f from PID 1 and watch your machine catch fire.
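When the system-wide firehose is too much, the same tap can be narrowed with a filter. What follows is a minimal sketch, not a canonical tool: the script name opens_by_comm.bt and the process name "nginx" are placeholders chosen for illustration.

```shell
# Write a filtered variant of the openat() tap to a script file.
# "nginx" is a hypothetical process name; substitute your own.
cat > opens_by_comm.bt <<'EOF'
tracepoint:syscalls:sys_enter_openat
/comm == "nginx"/
{
    // Print PID, process name, and the path being opened.
    printf("%-6d %-16s %s\n", pid, comm, str(args->filename));
}
EOF

# Run it (needs root): sudo bpftrace opens_by_comm.bt
```

The filter between the slashes is evaluated in the kernel, so non-matching opens never reach userspace.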
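The real party trick is in-kernel aggregation. Want a histogram of read latencies, bucketed in log2, in microseconds, live? This one times vfs_read(), so it covers every read that goes through the VFS (page-cache hits, pipes, and sockets included), not just block-device I/O: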
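bpftrace -e '
// Record the entry timestamp per thread.
kprobe:vfs_read { @start[tid] = nsecs; }
// On return, bucket the elapsed time (ns to us) into a log2 histogram.
kretprobe:vfs_read /@start[tid]/ {
    @us = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'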
Hit Ctrl-C and you get an ASCII histogram. No log file rotation, no awk post-processing, no missed events because your userspace was paged out.
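One natural extension is to key the histogram map, so each process gets its own distribution. This is a sketch under the same assumptions as the one-liner above (a kernel where vfs_read can be kprobed); the file name vfs_read_lat_by_comm.bt is made up for the example.

```shell
# Write a per-process variant of the vfs_read latency histogram.
cat > vfs_read_lat_by_comm.bt <<'EOF'
kprobe:vfs_read { @start[tid] = nsecs; }

kretprobe:vfs_read /@start[tid]/ {
    // Keying the map by comm yields one log2 histogram per process name.
    @us[comm] = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}

// Drop leftover timestamps so only the histograms print on exit.
END { clear(@start); }
EOF

# Run it (needs root), then press Ctrl-C: sudo bpftrace vfs_read_lat_by_comm.bt
```

The aggregation still happens in the kernel; only the finished histograms cross into userspace when the script exits.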
A few more I keep in my back pocket:
fsync? bpftrace -e 'tracepoint:syscalls:sys_enter_fsync { @[comm] = count(); }'. Catches the daemon that's making your SSD sing.
bpftrace -e 'kprobe:tcp_retransmit_skb { @[kstack] = count(); }'. Better than tcpdump for finding where the kernel decided to resend.
bpftrace -e 'software:faults:1 { @[comm] = count(); }'. The mystery memory hog reveals itself.
bpftrace -e 'tracepoint:signal:signal_generate /args->pid == 4242/ { printf("%s -> %d\n", comm, args->sig); }'. Finally know whose kill -9 it was.
List every probe point your kernel exposes (there are tens of thousands) with bpftrace -l 'tracepoint:*' or bpftrace -l 'kprobe:tcp_*'. The probe namespace alone is an education: uprobe, uretprobe, and usdt let you hook userspace symbols and USDT markers (PostgreSQL, Python, OpenJDK all ship them).
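To make the userspace side concrete, here is a small sketch of a uretprobe. It assumes bash lives at /bin/bash and exports a readline symbol in that binary, which is not true on every distro (some builds link readline from a shared library instead); the file name bash_readline.bt is hypothetical.

```shell
# Write a uretprobe script that prints each line read by interactive bash.
# Assumes /bin/bash contains the readline symbol; adjust the path if not.
cat > bash_readline.bt <<'EOF'
uretprobe:/bin/bash:readline
{
    // retval is the pointer returned by readline(); str() copies the string.
    printf("%-6d %s\n", pid, str(retval));
}
EOF

# Run it (needs root): sudo bpftrace bash_readline.bt
# Inspect a tracepoint's argument names before writing a script:
#   sudo bpftrace -lv 'tracepoint:syscalls:sys_enter_openat'
```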
Why this beats the mainstream tools:
strace uses ptrace, which stops the traced process at every syscall entry and exit and hands control to the tracer each time, so every observed syscall pays for extra context switches. On a 50k-syscall/sec process, you've slowed it to a crawl. bpftrace runs your filter inside the kernel, on the syscall path, with the event never leaving ring 0 unless your script says so.
perf samples; it tells you which stacks are hot, not why a specific event happened. bpftrace is deterministic: every event you ask about fires your handler.
Unlike SystemTap, there's no kernel module to compile, no DKMS dance, no kernel panic if you typo a script.
The catches: needs root (or CAP_BPF+CAP_PERFMON on 5.8+), needs a kernel with BTF for kprobe argument access by name (most distros ship it now), and you can hang yourself with an unbounded map. Use hist(), lhist(), and count(); don't printf a million events a second unless you want to watch a tree fall in a forest you can't observe.
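One way to keep both output and map growth bounded is to aggregate, flush on a timer, and give the script a hard stop. The sketch below is illustrative only: the one-second flush, the ten-second limit, and the file name opens_per_sec.bt are arbitrary choices.

```shell
# Write a self-limiting script: per-process open counts, flushed every second.
cat > opens_per_sec.bt <<'EOF'
tracepoint:syscalls:sys_enter_openat { @opens[comm] = count(); }

// Print and reset the per-process counts once a second.
interval:s:1 { print(@opens); clear(@opens); }

// Stop after 10 seconds so the script cannot run away.
interval:s:10 { exit(); }
EOF

# Run it (needs root): sudo bpftrace opens_per_sec.bt
```

Clearing the map on each tick keeps it from accumulating keys indefinitely, and userspace sees one summary per second instead of one line per event.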
Once it clicks, you stop reaching for strace entirely for production diagnostics. The Brendan Gregg book and the bpftrace reference guide on GitHub are the only docs you need.
