2026-05-09
When Docker says "limit this container to 2 CPUs and 4 GiB," it's not magic: it's writing numbers into files under /sys/fs/cgroup/. Cgroups (control groups) are the kernel mechanism that accounts for and constrains the resource usage of process trees. Cgroups v2, the upstream systemd default since v243 and generally available (though not mandatory) in Kubernetes since 1.25, unifies the messy per-controller v1 hierarchies into a single tree.
The hierarchy. Every process belongs to exactly one cgroup, listed in /proc/<pid>/cgroup. Cgroups form a tree under /sys/fs/cgroup/, and limits apply hierarchically: a child cgroup can tighten the limits of its ancestors but can never exceed them. Each directory has files that are the control surface: cgroup.procs (membership), memory.max, cpu.max, io.max, plus accounting files like memory.current and cpu.stat.
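To make the lookup concrete, here is a minimal Python sketch that turns the contents of /proc/<pid>/cgroup into a path under /sys/fs/cgroup/. It assumes the standard mount point named above and relies on the v2 entry having the form "0::<path>". The function names are illustrative, not part of any library.

```python
from pathlib import Path
from typing import Optional

CGROUP_ROOT = Path("/sys/fs/cgroup")  # assumed mount point, as in the text


def v2_cgroup_path(proc_cgroup_text: str) -> Optional[str]:
    """Return the cgroup v2 path from the contents of /proc/<pid>/cgroup.

    The v2 entry looks like "0::<path>". v1 entries carry a non-zero
    hierarchy ID and a controller list, so they are skipped.
    """
    for line in proc_cgroup_text.splitlines():
        if not line.strip():
            continue
        hierarchy_id, controllers, path = line.split(":", 2)
        if hierarchy_id == "0" and controllers == "":
            return path
    return None


def control_file(cgroup_path: str, name: str) -> Path:
    """Map a cgroup path and a file name to its location under the mount."""
    return CGROUP_ROOT / cgroup_path.lstrip("/") / name
```

On a cgroup v2 host, something like control_file(v2_cgroup_path(Path("/proc/self/cgroup").read_text()), "memory.max") should point at the calling process's own limit file. In the root cgroup some of these files do not exist.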
CPU limits use a quota/period model, with both values in microseconds. Writing "200000 100000" to cpu.max means "200ms of CPU time per 100ms window", i.e., two full CPUs' worth. The scheduler tracks runtime per period; when you exhaust the quota, your tasks get throttled until the next period starts. This is why a thread can look idle in top while latency spikes: it's runnable but parked. Check cpu.stat for nr_throttled and throttled_usec; values that keep climbing mean your limit is biting.
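The quota/period arithmetic and the throttling counters can be sketched in a few lines of Python. This is an illustrative sketch, not a monitoring tool: the function names are made up for this post, and it uses the nr_periods field that cpu.stat reports alongside nr_throttled to express throttling as a fraction of periods.

```python
from typing import Dict, Optional


def cpus_from_cpu_max(text: str) -> Optional[float]:
    """Convert cpu.max contents ("<quota> <period>") to a CPU count.

    Both fields are in microseconds. A quota of "max" means no limit,
    which is reported here as None.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def parse_cpu_stat(text: str) -> Dict[str, int]:
    """Parse the flat "key value" lines of cpu.stat into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def throttled_fraction(stats: Dict[str, int]) -> float:
    """Fraction of enforcement periods in which the cgroup was throttled."""
    periods = stats.get("nr_periods", 0)
    return stats.get("nr_throttled", 0) / periods if periods else 0.0
```

Because the counters are cumulative, comparing two samples taken a few seconds apart says more about current throttling than a single reading does.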
Memory limits are harder. memory.max is a hard wall: when usage reaches it and reclaim cannot free enough, the kernel invokes the cgroup OOM killer, which targets a process inside that cgroup rather than ransacking the host. memory.high is softer: exceed it and the kernel throttles allocations and aggressively reclaims memory (page cache first, where it can) from your cgroup before it hits the hard ceiling. Some container setups set memory.high slightly below memory.max to get graceful pressure instead of cliff-edge OOMs.
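The relationship between the soft and hard limits can be expressed as a small classifier. The sketch below is a simplification: the state names are invented for illustration, and the real kernel behaviour (reclaim, throttling, OOM selection) is more gradual than three labels suggest. It does handle the literal "max" that both files use for "no limit".

```python
from typing import Optional


def parse_limit(text: str) -> Optional[int]:
    """Parse memory.max or memory.high: a byte count, or "max" for unlimited."""
    value = text.strip()
    return None if value == "max" else int(value)


def memory_state(current: int, high: Optional[int], maximum: Optional[int]) -> str:
    """Classify memory.current against memory.high (soft) and memory.max (hard)."""
    if maximum is not None and current >= maximum:
        # At the hard wall: if reclaim cannot help, the cgroup OOM killer runs.
        return "at-hard-limit"
    if high is not None and current > high:
        # Above the soft limit: allocations are throttled and memory reclaimed.
        return "throttled"
    return "ok"
```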
Real example. A Java service on Kubernetes was getting OOMKilled at exactly its 4 GiB limit despite -Xmx3g. The culprit: the JVM's off-heap memory (Metaspace, direct buffers, thread stacks) plus page cache for mmap'd JARs pushed memory.current over memory.max. The memory controller counts everything charged to the cgroup, including page cache and kernel allocations, not just the heap.
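One way to see where the non-heap memory went is the cgroup's memory.stat file. The text above does not mention it, but it breaks the charged memory down by type (anon, file, kernel_stack, slab, and others). The Python sketch below ranks a chosen set of byte counters. The helper names are illustrative, and the selection of keys is an assumption about which counters are worth comparing.

```python
from typing import Dict, Iterable, List, Tuple


def parse_memory_stat(text: str) -> Dict[str, int]:
    """Parse the flat "key value" lines of memory.stat into a dict."""
    stats: Dict[str, int] = {}
    for line in text.splitlines():
        if line.strip():
            key, value = line.split()
            stats[key] = int(value)
    return stats


def largest_consumers(
    stats: Dict[str, int], keys: Iterable[str], top: int = 3
) -> List[Tuple[str, int]]:
    """Return the `top` largest of the selected byte counters, biggest first.

    The caller picks the keys because memory.stat also contains event
    counts (such as pgfault) that are not byte sizes.
    """
    picked = [(key, stats[key]) for key in keys if key in stats]
    return sorted(picked, key=lambda item: item[1], reverse=True)[:top]
```

For a JVM like the one described, a large "file" figure next to "anon" would point at page cache rather than heap growth.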
Rule of thumb for sizing. Set cpu.max quota = (desired vCPUs) × period. For latency-sensitive workloads, set the period shorter (e.g., 10ms) so throttling, when it happens, parks you for milliseconds not tens of milliseconds. For memory, budget ~25–30% above your application's measured RSS to absorb page cache, kernel slab, and short-lived spikes — anything tighter and you're playing OOM roulette.
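The sizing rule translates directly into arithmetic. The Python sketch below builds the string for cpu.max from a vCPU count and a period, and pads a measured RSS by a headroom fraction. The function names and the 25% default are illustrative choices taken from the rule of thumb above, not fixed recommendations.

```python
def cpu_max_line(vcpus: float, period_us: int = 100_000) -> str:
    """Build the value to write to cpu.max: quota = vCPUs x period.

    Both numbers are in microseconds; 100_000 is the kernel's default period.
    """
    quota_us = round(vcpus * period_us)
    return f"{quota_us} {period_us}"


def memory_budget(measured_rss: int, headroom: float = 0.25) -> int:
    """Return measured RSS plus fractional headroom, in bytes."""
    return int(measured_rss * (1 + headroom))
```

For example, cpu_max_line(2) gives "200000 100000", and cpu_max_line(1.5, period_us=10_000) gives "15000 10000" for a latency-sensitive service on a 10ms period.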
Watching the throttling counters in cpu.stat and the accounting in memory.current is often the difference between a healthy container and a mysteriously slow one.
