2026-05-07
Every developer eventually writes the same awful awk script: read a column of numbers, compute mean, stddev, maybe median if they're brave. Median in awk requires sorting and array indexing. Percentiles? Good luck. Group-by aggregates? Now you're writing a small program.
GNU datamash has been quietly sitting in your distro's repos for over a decade, doing all of this in one line. It's from the GNU project itself, not some npm-of-the-week.
apt install datamash # or brew install datamash
The basics — pipe in numbers, ask for statistics:
$ seq 100 | datamash sum 1 mean 1 median 1 sstdev 1
5050 50.5 50.5 29.011491975882
$ seq 100 | datamash perc:95 1 perc:99 1
95.05 99.01
$ seq 100 | datamash count 1 unique 1 first 1 last 1
100 100 1 100
Operators include min max range mean median mode antimode q1 q3 iqr, plus sum count countunique and string ops unique collapse countunique. Twenty-plus operators, no awk required.
The real power is group-by. Imagine a CSV of HTTP request logs:
$ cat requests.tsv
endpoint status ms
/api/users 200 45
/api/users 200 52
/api/users 500 3001
/api/posts 200 12
/api/posts 200 18
$ datamash -H -g endpoint,status count ms mean ms perc:99 ms < requests.tsv
GroupBy(endpoint) GroupBy(status) count(ms) mean(ms) perc:99(ms)
/api/users 200 2 48.5 51.93
/api/users 500 1 3001 3001
/api/posts 200 2 15 17.94
-H reads the header, -g groups by named columns. Add --sort and datamash handles sorting; otherwise it expects pre-sorted input (much faster on huge files via sort | datamash).
Some less-known tricks that pay rent:
Transpose a matrix — flip rows and columns without dragging in numpy:
$ printf 'a b c\n1 2 3\n4 5 6\n' | datamash -W transpose
a 1 4
b 2 5
c 3 6
Crosstab (pivot table) — turn long format into wide:
$ printf 'host metric value\nweb1 cpu 45\nweb1 mem 60\nweb2 cpu 30\nweb2 mem 80\n' \
| datamash -H -W crosstab 1,2 mean 3
cpu mem
web1 45 60
web2 30 80
Bin/histogram on the fly — combine with --narm to skip NaNs:
$ awk '{print int($1/100)*100}' latencies.txt | sort | datamash -g 1 count 1
0 4523
100 891
200 203
500 12
Round-trip with check — datamash check validates that every row has the expected number of fields, perfect for pipeline guard rails:
$ cat data.tsv | datamash check 5 fields
data.tsv: line 17 has 4 fields (expected 5)
The reason this beats awk isn't capability — awk is Turing-complete — it's that you write zero arithmetic. Median is median 1, not a sort-and-index loop. Standard deviation is sstdev 1 with the correct Bessel correction, not your half-remembered formula. When the operation is named, mistakes have nowhere to hide.
For ad-hoc analysis on TSV/CSV without leaving the shell, datamash sits in the sweet spot between awk (too low-level) and pandas (too heavyweight). It reads from stdin, writes to stdout, and composes with everything else.
