2026-05-14
You learned awk in 1989 and it's served you well for positional fields. But the world filled up with CSV headers, JSONL streams, and TSV exports from analysts who don't know what column 7 is — only what it's called. Miller (mlr, John Kerl, ~2015) is awk's spiritual successor for name-indexed records. It speaks CSV, TSV, JSON, JSON Lines, key-value pairs (DKVP), positional indexed (NIDX), and pretty-print — and converts between them in a single pipe.
Convert formats by changing flags:
# CSV in, JSON Lines out (--ojson would emit one JSON array instead)
mlr --icsv --ojsonl cat orders.csv
# JSONL in, pretty-printed table out
mlr --ijsonl --opprint cat events.jsonl
# Shortcut: --c2p is "csv to pprint", --j2c is "json to csv", etc.
mlr --c2p cat orders.csv
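If you want to poke at the conversion flags without real data, here is a small self-contained sketch. The file names and the two sample rows are invented for illustration, and it assumes a Miller 6 release recent enough to have the --ijsonl and --ojsonl flags.

```shell
# Hypothetical sample data, invented for this sketch.
cat > sample_orders.csv <<'EOF'
order_id,status,amount
1001,paid,250
1002,refunded,40
EOF

# CSV in, JSON Lines out: one JSON object per input row.
mlr --icsv --ojsonl cat sample_orders.csv > sample_orders.jsonl

# JSON Lines in, CSV out: the round trip should reproduce the original file.
mlr --ijsonl --ocsv cat sample_orders.jsonl > roundtrip.csv
diff sample_orders.csv roundtrip.csv && echo "round trip OK"
```

Because the flags only describe the input and output formats, the verb in the middle (cat here) stays the same whichever direction you convert.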
Filter and compute with a real expression language, where fields are referenced by name ($amount) rather than by position ($7):
# Filter rows, then add a computed column
mlr --csv filter '$status == "paid" && $amount > 100' \
then put '$tax = $amount * 0.0875' \
then cut -f order_id,amount,tax orders.csv
# Group-by aggregation (the part that makes you delete 40 lines of awk)
mlr --csv stats1 -a mean,stddev,p95 -f latency_ms -g endpoint requests.csv
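For a sense of what stats1 emits, here is a hedged sketch on invented data. The file names and values are hypothetical; the output columns follow Miller's field-plus-accumulator naming, for example latency_ms_mean.

```shell
# Hypothetical request log, invented for this sketch.
cat > sample_requests.csv <<'EOF'
endpoint,latency_ms
/login,100
/search,50
/login,300
EOF

# One output row per endpoint, with columns latency_ms_count,
# latency_ms_mean and latency_ms_max.
mlr --csv stats1 -a count,mean,max -f latency_ms -g endpoint \
  sample_requests.csv > sample_stats.csv
cat sample_stats.csv
```

Groups come out in the order they first appear in the input, so /login is listed before /search here.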
The then keyword chains verbs into a single pipeline: one pass over the file, no temp tables. Most verbs stream record by record; a few, such as sort and tac, have to hold every record before they can emit anything. Verbs include sort, uniq, top, tac, head, tail, cat -n (numbering), nest (explode delimited fields), reshape (wide ↔ long), and joins:
# SQL-style join across two CSVs
mlr --csv join -j user_id -f users.csv then sort -f signup_date orders.csv
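A small sketch of the join semantics, on invented files. As far as I know, the default is an inner join that holds the -f (left) file in memory and streams the other file past it, while --np --ur flips it to emit only the unmatched records from the streamed file. All file names and rows here are hypothetical.

```shell
# Hypothetical left file: the lookup table passed with -f.
cat > sample_users.csv <<'EOF'
user_id,name
1,ada
2,grace
EOF

# Hypothetical right file: the one that gets streamed.
cat > sample_user_orders.csv <<'EOF'
order_id,user_id,amount
9001,1,20
9002,3,75
EOF

# Inner join: only order 9001 has a matching user.
mlr --csv join -j user_id -f sample_users.csv sample_user_orders.csv > joined.csv

# Anti-join: orders whose user_id is missing from sample_users.csv.
mlr --csv join --np --ur -j user_id -f sample_users.csv sample_user_orders.csv > orphans.csv

cat joined.csv orphans.csv
```

The practical upshot: put the smaller file after -f and let the big one be the streamed input.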
# Histogram of values in a column
mlr --csv histogram -f response_time --lo 0 --hi 1000 --nbins 20 logs.csv
# Frequency table — like sort | uniq -c, but it understands columns
mlr --csv count-distinct -f country,plan subscribers.csv
Where it leaves xsv behind: JSON. mlr happily flattens nested JSON into flat records and back. If your data pipeline crosses formats — say, ingesting JSONL from an API and emitting CSV for a BI tool — you no longer need jq piped into csvkit piped into awk:
# Tail a JSONL stream, project a few fields, emit CSV
tail -F app.jsonl | mlr --ijsonl --ocsv cut -f ts,user_id,event,latency_ms
# Nested objects? Non-JSON output flattens them to keys like a.b; --flatsep picks a different separator
mlr --ijsonl --opprint --flatsep : cat nested.jsonl
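To make the flattening concrete, a minimal sketch on one invented record. It assumes Miller 6, where converting JSON to a non-JSON format flattens nested objects automatically and joins key names with a dot by default. The record and file names are hypothetical.

```shell
# Hypothetical nested record, invented for this sketch.
cat > sample_nested.jsonl <<'EOF'
{"user": {"id": 7, "plan": "pro"}, "event": "login"}
EOF

# Nested keys become user.id and user.plan in the CSV header.
mlr --ijsonl --ocsv cat sample_nested.jsonl > sample_flat.csv
cat sample_flat.csv

# Going back to JSON re-nests the dotted keys under "user".
mlr --icsv --ojson cat sample_flat.csv
```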
The DSL has proper types, regex, string functions, and even tee for splitting output mid-pipeline:
# Route errors and successes to different files in one pass
mlr --csv put -q '
if ($status >= 500) { tee > "errors.csv", $* }
else { tee > "ok.csv", $* }
' access.csv
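One more sleeper feature: record-at-a-time streaming. Miller doesn't slurp the file. You can pipe a 50GB JSONL log through mlr stats1 -a mean and watch RSS stay flat, because running accumulators such as count, sum, and mean keep only a few numbers per group. Percentiles like p95 are the exception, since they retain the values they see. That's the real upgrade over loading it into pandas just to compute a mean.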
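One way to see the streaming behaviour without a 50GB file is to feed Miller from a generator. This sketch assumes seq is available and uses a made-up field name, n; the count and mean accumulators only need a counter and a running sum.

```shell
# A million one-field records from a pipe; the input never touches disk.
# --inidx reads space-separated positional fields, and label names field 1 "n".
seq 1 1000000 \
  | mlr --inidx --ifs ' ' --ocsv label n then stats1 -a count,mean -f n \
  > sample_stream_stats.csv
cat sample_stream_stats.csv
```

Bump the upper bound by a few orders of magnitude and watch top while it runs if you want to check the memory claim for yourself.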
Install: apt install miller, brew install miller, or grab a static binary. Single executable, zero dependencies, written in Go since v6.
mlr — it's awk that understands headers, JSON, and group-by, all in one streaming binary.
