xsv: A CSV Swiss Army Knife That Makes awk Look Slow

2026-05-01

Every few months, someone hands you a 2GB CSV file and asks a "quick question" about it. You reach for awk, remember the field separator quoting nightmare, switch to Python pandas, wait 45 seconds for it to load into RAM, and hate everything. There's a better way.

xsv is a Rust-based CSV toolkit by Andrew Gallant (the same person who wrote ripgrep). It's fast enough to process gigabyte-scale CSVs in seconds, handles RFC 4180 quoting correctly (which your awk one-liner does not), and provides subcommands that compose with Unix pipes exactly the way you'd expect.
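
To make the quoting problem concrete, here is a minimal sketch of what goes wrong with a comma-split in awk. The sample file and its contents are made up for illustration; only standard awk and printf are assumed.

```shell
# Hypothetical sample: the "name" field contains a comma inside quotes.
sample=$(mktemp)
printf '%s\n' 'id,name,city' '1,"Smith, Jane",Boston' > "$sample"

# Naive split on commas: awk counts four fields on the data row,
# so the third field is the tail of the name, not the city.
naive_fields=$(awk -F',' 'NR == 2 { print NF }' "$sample")
naive_city=$(awk -F',' 'NR == 2 { print $3 }' "$sample")

echo "awk sees $naive_fields fields and thinks city is:$naive_city"
rm -f "$sample"
```

Run against the same file, xsv select city prints the header city followed by Boston, because it parses the quoted field as a single value.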

# Install
cargo install xsv
# or: brew install xsv (availability in other package managers varies; check yours)

Inspect a CSV instantly. Before you do anything, you want to know what you're dealing with:

# Show column names and their index positions
xsv headers data.csv

# Row count without loading the whole file into memory
xsv count data.csv

# Per-column statistics: type, sum, min, max, mean, stddev, and more
xsv stats data.csv | xsv table

That last command is the killer. xsv stats computes summary statistics in a single streaming pass, then xsv table pretty-prints the output into aligned columns. On a 1.5GB file with 12 million rows, this finishes in about 8 seconds. Pandas would still be allocating memory.
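
For a feel of what "a single streaming pass" means, here is a rough sketch of a one-pass mean and standard deviation using Welford's online algorithm. This is not xsv's code: the choice of Welford, the population (divide by n) variance, and the eight sample values are all assumptions made for illustration.

```shell
# Hypothetical numeric column, one value per line.
# Each value is seen exactly once; only n, mean, and m2 are kept in memory.
stats=$(printf '%s\n' 2 4 4 4 5 5 7 9 | awk '
  {
    n += 1
    delta = $1 - mean          # distance from the old running mean
    mean += delta / n          # update the running mean
    m2 += delta * ($1 - mean)  # accumulate squared deviations
  }
  END { printf "n=%d mean=%.2f stddev=%.2f\n", n, mean, sqrt(m2 / n) }
')
echo "$stats"
```

Memory use stays constant no matter how many rows flow through, which is why this style of computation scales to files far larger than RAM.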

Select, slice, and filter. Grab specific columns by name or index; no more counting $1, $2, $7:

# Select columns by name
xsv select name,email,signup_date users.csv

# Exclude a column
xsv select '!password_hash' users.csv

# First 100 rows (like head, but CSV-aware — preserves the header)
xsv slice -l 100 data.csv

# Search with regex on a specific column
xsv search -s status 'cancel|refund' orders.csv

Joins. This is where people's jaws drop. You can do SQL-style joins on the command line:

# Inner join orders.csv and customers.csv on customer_id
xsv join customer_id orders.csv id customers.csv

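That's a proper hash join, not a nested loop: one input is indexed in memory by its join key and the other is streamed past it, so it handles millions of rows comfortably as long as that index fits in RAM. Need a left join? Add --left. Need to join more than two files? Run the joins one after another, writing each intermediate result to a file, because xsv join expects file paths rather than stdin.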
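
If "hash join" is unfamiliar, the idea can be sketched in a few lines: build a lookup table from one file, then stream the other file once and probe the table. The sketch below is a conceptual illustration, not xsv's implementation; the file contents are invented and deliberately free of quotes, since this awk version would mishandle real-world CSV.

```shell
# Hypothetical, quote-free sample data.
customers=$(mktemp); orders=$(mktemp)
printf '%s\n' 'id,name' '1,Ada' '2,Grace' > "$customers"
printf '%s\n' 'order_id,customer_id,amount' '10,2,250' '11,1,40' '12,3,99' > "$orders"

# Build phase: load customers into a hash table keyed by id.
# Probe phase: stream orders once, keeping rows whose key is in the table.
joined=$(awk -F',' '
  NR == FNR  { if (FNR > 1) name[$1] = $2; next }  # first file: build
  FNR == 1   { print $0 ",name"; next }            # second file: header
  $2 in name { print $0 "," name[$2] }             # second file: probe
' "$customers" "$orders")

echo "$joined"
rm -f "$customers" "$orders"
```

Each order costs one table lookup instead of a scan over every customer, which is the difference between a hash join and a nested loop. Order 12 has no matching customer, so an inner join drops it.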

Frequency tables, sorting, and duplicate hunting:

# Top values in the "country" column
xsv frequency -s country -l 20 users.csv | xsv table

# Sort by a column (-N for numeric, -R for descending)
xsv sort -s revenue -N -R sales.csv | xsv slice -l 10

# Spot duplicate values in a column: xsv has no dedup subcommand,
# but -l 0 lists every value with its count, most frequent first
xsv frequency -s email -l 0 data.csv | xsv slice -l 10

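The real power is composition. Because every subcommand reads and writes well-formed CSV, you can build pipelines that rival SQL queries:

# "Show me the 5 countries with the most orders
#  of $100 or more in 2025"
# (xsv has no group-by average, so this counts orders per country.
#  join wants file paths, not stdin, hence the intermediate file;
#  big_2025.csv is just a scratch name.)
xsv search -s year '^2025$' orders.csv \
  | xsv search -s amount '^[1-9][0-9]{2,}(\.[0-9]+)?$' > big_2025.csv
xsv join customer_id big_2025.csv id customers.csv \
  | xsv frequency -s country -l 5 \
  | xsv table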
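
Numeric filters written as regexes deserve a quick sanity check, because xsv search matches anywhere in the field unless the pattern is anchored. The sketch below uses grep -E as a stand-in for the regex engine (an assumption, so that it runs without xsv) and compares an unanchored three-digit pattern with an anchored one on made-up amounts.

```shell
# Count how many hypothetical amounts each pattern accepts as "100 or more".
# Unanchored: any three digits in a row match, including the decimals in 99.500.
loose=$(printf '%s\n' 99.500 100 1250.75 7 | grep -cE '[1-9][0-9]{2,}')

# Anchored: the integer part itself must have at least three digits.
strict=$(printf '%s\n' 99.500 100 1250.75 7 | grep -cE '^[1-9][0-9]{2,}(\.[0-9]+)?$')

echo "unanchored matches: $loose, anchored matches: $strict"
```

The unanchored pattern lets 99.500 through; the anchored one accepts only 100 and 1250.75. Neither handles thousands separators or negative numbers, so for anything beyond a quick look it is safer to do the comparison in a tool that treats the column as a number.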

There's also xsv split (break a huge CSV into N-row chunks), xsv sample (random sample without loading the full file), xsv fmt (convert delimiters, TSV to CSV and back), and xsv cat (concatenate CSVs by row or by column; when stacking rows, the files need the same columns in the same order, so run xsv select first if they differ).

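The performance difference compared to scripting languages isn't marginal; it's often 10-50x. xsv is compiled Rust built on a fast CSV parser, most subcommands stream records instead of loading the whole file, and xsv index can build a small index file that lets commands such as count, slice, and stats skip work or run in parallel. For ad-hoc data exploration on the command line, nothing else comes close.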

Key Takeaway: xsv gives you SQL-like operations on CSV files at speeds that embarrass scripting languages, all while composing cleanly with Unix pipes — install it and stop wrestling with awk -F',' forever.
