datamash: Statistics and Group-By for the Command Line

2026-05-07

Every developer eventually writes the same awful awk script: read a column of numbers, compute mean, stddev, maybe median if they're brave. Median in awk requires sorting and array indexing. Percentiles? Good luck. Group-by aggregates? Now you're writing a small program.

GNU datamash has been quietly sitting in your distro's repos for over a decade, doing all of this in one line. It's from the GNU project itself, not some npm-of-the-week.

apt install datamash    # or brew install datamash

The basics — pipe in numbers, ask for statistics:

$ seq 100 | datamash sum 1 mean 1 median 1 sstdev 1
5050    50.5    50.5    29.011491975882

$ seq 100 | datamash perc:95 1 perc:99 1
95.05   99.01

$ seq 100 | datamash count 1 unique 1 first 1 last 1
100     100     1       100

Operators include min max range mean median mode antimode q1 q3 iqr, plus sum count countunique and string ops unique collapse countunique. Twenty-plus operators, no awk required.

The real power is group-by. Imagine a CSV of HTTP request logs:

$ cat requests.tsv
endpoint   status  ms
/api/users 200     45
/api/users 200     52
/api/users 500     3001
/api/posts 200     12
/api/posts 200     18

$ datamash -H -g endpoint,status count ms mean ms perc:99 ms < requests.tsv
GroupBy(endpoint) GroupBy(status) count(ms) mean(ms) perc:99(ms)
/api/users        200             2         48.5     51.93
/api/users        500             1         3001     3001
/api/posts        200             2         15       17.94

-H reads the header, -g groups by named columns. Add --sort and datamash handles sorting; otherwise it expects pre-sorted input (much faster on huge files via sort | datamash).

Some less-known tricks that pay rent:

Transpose a matrix — flip rows and columns without dragging in numpy:

$ printf 'a b c\n1 2 3\n4 5 6\n' | datamash -W transpose
a 1 4
b 2 5
c 3 6

Crosstab (pivot table) — turn long format into wide:

$ printf 'host  metric  value\nweb1  cpu  45\nweb1  mem  60\nweb2  cpu  30\nweb2  mem  80\n' \
    | datamash -H -W crosstab 1,2 mean 3
        cpu   mem
web1    45    60
web2    30    80

Bin/histogram on the fly — combine with --narm to skip NaNs:

$ awk '{print int($1/100)*100}' latencies.txt | sort | datamash -g 1 count 1
0       4523
100     891
200     203
500     12

Round-trip with check — datamash check validates that every row has the expected number of fields, perfect for pipeline guard rails:

$ cat data.tsv | datamash check 5 fields
data.tsv: line 17 has 4 fields (expected 5)

The reason this beats awk isn't capability — awk is Turing-complete — it's that you write zero arithmetic. Median is median 1, not a sort-and-index loop. Standard deviation is sstdev 1 with the correct Bessel correction, not your half-remembered formula. When the operation is named, mistakes have nowhere to hide.

For ad-hoc analysis on TSV/CSV without leaving the shell, datamash sits in the sweet spot between awk (too low-level) and pandas (too heavyweight). It reads from stdin, writes to stdout, and composes with everything else.

Key Takeaway: When you find yourself writing awk to compute statistics or group-by aggregates, stop — datamash already has the operator named, debugged, and statistically correct.

All newsletters