Daily Low-Level Programming: System Calls and the Kernel Boundary

2026-04-25

Every time your userspace code needs the kernel to do something — open a file, allocate memory, send a packet — it must cross the privilege boundary via a system call. Understanding this mechanism is essential because syscalls are the single most expensive "function call" your program makes regularly.

On x86-64 Linux, the modern syscall path uses the syscall instruction (replacing the older int 0x80 trap). The convention is precise:

RAX holds the syscall number (e.g., 0 = read, 1 = write, 59 = execve)
RDI, RSI, RDX, R10, R8, R9 carry up to six arguments
RCX and R11 are clobbered by the syscall instruction itself (the CPU saves the return RIP in RCX and RFLAGS in R11), so treat both as destroyed across the call
The return value lands in RAX; negative values in the range -4095 to -1 indicate errors (the negated errno)

On ARM64 (AArch64), the equivalent is the svc #0 instruction, with the syscall number in X8 and arguments in X0–X5.

Here's a minimal x86-64 write syscall in inline assembly — no libc involved:

const char msg[] = "hello\n";
long ret;
asm volatile(
    "syscall"
    : "=a"(ret)
    : "a"(1),            // syscall number: write
      "D"(1),            // fd: stdout
      "S"(msg),          // buffer
      "d"(sizeof(msg)-1) // count
    : "rcx", "r11", "memory"
);
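The register conventions and the error range described above can be sketched as a small wrapper. This is an illustrative sketch, not how any particular libc implements it: raw_syscall3, decode_ret, and raw_write are hypothetical names, and the code assumes Linux on x86-64 or AArch64 with GCC or Clang.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>   /* SYS_write: 1 on x86-64, 64 on AArch64 */

/* Hypothetical helper: issue a three-argument syscall without libc. */
static long raw_syscall3(long nr, long a0, long a1, long a2)
{
#if defined(__x86_64__)
    long ret;
    __asm__ volatile("syscall"
                     : "=a"(ret)                       /* result in RAX */
                     : "a"(nr), "D"(a0), "S"(a1), "d"(a2)
                     : "rcx", "r11", "memory");        /* clobbered by syscall */
    return ret;
#elif defined(__aarch64__)
    register long x8 __asm__("x8") = nr;               /* syscall number */
    register long x0 __asm__("x0") = a0;               /* arg 0, also result */
    register long x1 __asm__("x1") = a1;
    register long x2 __asm__("x2") = a2;
    __asm__ volatile("svc #0"
                     : "+r"(x0)
                     : "r"(x8), "r"(x1), "r"(x2)
                     : "memory");
    return x0;
#else
#error "this sketch covers x86-64 and AArch64 Linux only"
#endif
}

/* Hypothetical helper: map the raw kernel return value to the familiar
 * "-1 and errno" form. Values in [-4095, -1] are negated errno codes. */
static long decode_ret(long ret)
{
    if (ret < 0 && ret >= -4095) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}

/* write() assembled from the two helpers above. */
static long raw_write(int fd, const void *buf, size_t len)
{
    return decode_ret(raw_syscall3(SYS_write, fd, (long)buf, (long)len));
}
```

Calling raw_write(-1, "x", 1) should return -1 with errno set to EBADF, because the raw syscall hands back -EBADF for an invalid file descriptor.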

Real-world impact: A raw syscall on modern x86-64 hardware carries roughly 50–100 nanoseconds of overhead (mode switch, KPTI page table swap, speculative execution mitigations), and the exact figure varies with the CPU and with which mitigations are enabled. Compare that to a normal function call at ~1–2 ns. A tight loop doing one syscall per iteration is therefore roughly 25–100x slower than one doing purely userspace work. This is exactly why high-performance code uses techniques like io_uring (batch submissions), mmap (access file data through memory instead of issuing a read/write syscall per access), and the vDSO: a kernel-provided shared object mapped into every process that implements certain calls, such as gettimeofday and clock_gettime, entirely in userspace, eliminating the mode switch.
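One way to check these numbers on your own machine is to time the same operation through both paths. The sketch below is a rough micro-benchmark under stated assumptions: Linux with glibc, where syscall(SYS_clock_gettime, ...) forces a real kernel entry and the plain clock_gettime() call is normally served by the vDSO. The function names are hypothetical, and results will vary with hardware, kernel version, and virtualization (on some systems the vDSO path falls back to a real syscall).

```c
#define _GNU_SOURCE          /* for syscall(); must precede the includes */
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds between two CLOCK_MONOTONIC readings. */
static double elapsed_ns(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec) * 1e9
         + (double)(b->tv_nsec - a->tv_nsec);
}

/* Average cost of a call that always crosses into the kernel. */
double ns_per_real_syscall(long iters)
{
    struct timespec t0, t1, scratch;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(&t0, &t1) / (double)iters;
}

/* Average cost of the libc call, which usually stays in userspace (vDSO). */
double ns_per_libc_call(long iters)
{
    struct timespec t0, t1, scratch;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        clock_gettime(CLOCK_MONOTONIC, &scratch);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(&t0, &t1) / (double)iters;
}
```

Calling both functions with, say, 1,000,000 iterations and printing the two averages gives a per-call comparison. Treat the output as an estimate only: loop overhead and timer resolution are included in both figures.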

Rule of thumb: the kernel boundary overhead is negligible at ~10,000 syscalls per second and becomes a meaningful share of a single core somewhere above ~100,000 per second. At 100 ns each, 10,000 syscalls burn 1 ms per second, about 0.1% of a core. At 100,000 syscalls/sec that rises to 10 ms/sec, or 1%. At 1,000,000 syscalls/sec, it is 100 ms/sec: 10% of a core lost purely to mode switching.
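The arithmetic behind this rule of thumb is a single multiplication, sketched here so you can plug in your own measurements. The function names are hypothetical, and the 100 ns figure used in the comments is the article's estimate, not a measured constant.

```c
/* Fraction of one core spent purely on kernel-boundary overhead.
 * Example: 1,000,000 syscalls/sec at 100 ns each is
 * 1e6 * 100 / 1e9 = 0.10, i.e. 10% of a core. */
double core_fraction(double syscalls_per_sec, double ns_per_syscall)
{
    return syscalls_per_sec * ns_per_syscall / 1e9;
}

/* Inverse: how many syscalls per second fit inside a given overhead budget.
 * Example: a 1% budget at 100 ns each allows about 100,000 syscalls/sec. */
double syscalls_for_budget(double fraction_of_core, double ns_per_syscall)
{
    return fraction_of_core * 1e9 / ns_per_syscall;
}
```

Using the inverse form is often more practical: decide how much of a core you are willing to spend on mode switching, then compare the resulting rate with what strace or your own counters report.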

You can trace your program's syscalls with strace -c ./program to get a summary showing call counts and cumulative time. This is often the fastest way to diagnose unexpected I/O or permission issues, and to find syscalls worth batching or eliminating.
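As an illustration of what "worth batching" can look like, the sketch below writes the same records two ways: one write per record, and through a small userspace buffer that is flushed when full. All function names are hypothetical, the 4096-byte buffer size is an arbitrary choice, and for brevity a short write is treated as a failure rather than retried.

```c
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* One write(2) per record: n kernel crossings.
 * Returns the number of syscalls issued, or -1 on error. */
long write_unbatched(int fd, const char *rec, size_t len, long n)
{
    long calls = 0;
    for (long i = 0; i < n; i++) {
        if (write(fd, rec, len) != (ssize_t)len)
            return -1;
        calls++;
    }
    return calls;
}

/* Accumulate records in a userspace buffer and flush only when it is full.
 * Returns the number of syscalls issued, or -1 on error. */
long write_batched(int fd, const char *rec, size_t len, long n)
{
    char buf[4096];
    size_t used = 0;
    long calls = 0;

    if (len > sizeof buf)
        return -1;
    for (long i = 0; i < n; i++) {
        if (used + len > sizeof buf) {          /* no room: flush first */
            if (write(fd, buf, used) != (ssize_t)used)
                return -1;
            calls++;
            used = 0;
        }
        memcpy(buf + used, rec, len);
        used += len;
    }
    if (used > 0) {                             /* flush the remainder */
        if (write(fd, buf, used) != (ssize_t)used)
            return -1;
        calls++;
    }
    return calls;
}

/* Demo: 1000 six-byte records to /dev/null, reporting both syscall counts.
 * Returns 0 on success, -1 if /dev/null cannot be opened. */
int batching_demo(long *unbatched, long *batched)
{
    int fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;
    *unbatched = write_unbatched(fd, "hello\n", 6, 1000);
    *batched = write_batched(fd, "hello\n", 6, 1000);
    close(fd);
    return 0;
}
```

With these parameters the unbatched path issues 1000 writes and the batched path issues 2. Running a program built on each variant under strace -c should show a correspondingly different call count for write, which is the kind of signal to look for in a real trace.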

See it in action: watch "what is kernel space and user space ?" by The Digital Folks to see this theory applied.

Key Takeaway: System calls are the controlled gateway between user and kernel mode; each one costs 50–100 ns on modern x86-64, so minimizing, batching, or eliminating them (via vDSO, mmap, or io_uring) is a core performance strategy.
