Tool Nobody Knows: pdsh: Parallel Distributed Shell From the Folks Who Run Actual Supercomputers

2026-06-07

Before Ansible, before Salt, before anyone reached for a 200MB Python install to run uptime on a few boxes, sysadmins at Lawrence Livermore wrote pdsh. It's a small C program from the late '90s that still ships in Debian, Ubuntu, EPEL, and Homebrew, and it remains the fastest way to fan a command out to a herd of hosts. No inventory YAML. No facts gathering. No agent. Just SSH, in parallel, with grown-up output handling.

The killer feature is the hostlist syntax, which collapses runs of nodes into one expression. If you've ever typed a "for h in ...; do ssh ..." loop by hand, this will hurt:

pdsh -w node[01-32] uptime
pdsh -w node[01-04,06,10-20].dc1,gw[1-2].dc2 'systemctl is-active nginx'
pdsh -w ^/etc/cluster/web-tier 'nginx -t'   # ^file reads a hostfile
pdsh -x node07 -w node[01-32] 'reboot'      # exclude with -x
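
To see what the bracket syntax stands for, here is a minimal sketch in plain POSIX shell and awk that expands a single bracket group into individual hostnames. The function name expand_hostlist is made up for illustration, it handles only one bracket group per expression, and it is not how pdsh itself parses hostlists.

```shell
# Hypothetical helper, not part of pdsh: expand one bracket group,
# e.g. "node[01-03,07].dc1", into one hostname per line.
expand_hostlist() {
    case $1 in
        *\[*\]*)
            hl_prefix=${1%%\[*}        # text before "["
            hl_rest=${1#*\[}           # text after "["
            hl_ranges=${hl_rest%%\]*}  # comma-separated ranges inside the brackets
            hl_suffix=${hl_rest#*\]}   # text after "]"
            printf '%s\n' "$hl_ranges" | tr ',' '\n' |
            while IFS=- read -r lo hi; do
                [ -n "$hi" ] || hi=$lo  # a bare "07" is a one-element range
                awk -v p="$hl_prefix" -v s="$hl_suffix" -v lo="$lo" -v hi="$hi" '
                    BEGIN {
                        # Pad to the width of the lower bound so 01 stays 01.
                        fmt = "%s%0" length(lo) "d%s\n"
                        for (i = lo + 0; i <= hi + 0; i++) printf fmt, p, i, s
                    }' </dev/null
            done
            ;;
        *)
            printf '%s\n' "$1"          # no brackets: already a single host
            ;;
    esac
}

expand_hostlist 'node[01-03,07].dc1'
# node01.dc1
# node02.dc1
# node03.dc1
# node07.dc1
```

The real parser goes further (comma-separated groups, exclusions, hostfiles), but the zero-padded range is the part that replaces the hand-written loop.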

Output from each node is prefixed with the hostname, which is fine for ten hosts and unreadable for two hundred. That's where its little sibling dshbak earns its keep. With -c it coalesces identical output, grouping hosts that produced the same lines:

$ pdsh -w node[01-32] uname -r | dshbak -c
----------------
node[01-30,32]
----------------
5.15.0-91-generic
----------------
node31
----------------
5.15.0-50-generic

That report — "31 nodes agree, one is a snowflake" — is exactly the question you usually want answered, and it falls out of the pipeline for free. Try getting that out of ansible -m shell without writing a custom callback.

Companions in the package handle file movement: pdcp pushes files out and rpdcp gathers them back. Unlike plain pdsh, both need pdcp installed on the target nodes as well:

pdcp  -w node[01-10] /etc/resolv.conf /etc/resolv.conf
rpdcp -w node[01-10] /var/log/dmesg /tmp/dmesg-gather/
# rpdcp appends the source hostname to each file: /tmp/dmesg-gather/dmesg.node01, dmesg.node02, ...

The transport is pluggable via rcmd modules: -R ssh, -R rsh, -R krb4, -R mrsh, and -R exec (which runs an arbitrary local command per host and is handy for testing). Which modules exist and which one is the default depend on how your package was built; pdsh -V lists them. Where ssh isn't the default, you can set PDSH_RCMD_TYPE=ssh in your shell rc and forget about it. There's also -f for fanout (how many concurrent connections; the default is 32, so bump it for big clusters), and -u for a per-command timeout so one hung node doesn't wedge the whole run.

The genuinely Unix-wizard trick is using the WCOLL environment variable. Point it at a file of hostnames and every pdsh invocation in that shell uses it by default:

export WCOLL=~/clusters/web-tier
pdsh 'tail -n1 /var/log/nginx/error.log' | dshbak -c
pdsh 'date -u' | dshbak -c   # quick clock-skew check across the fleet

Why use this instead of Ansible? Three reasons. First, latency: pdsh is a tiny binary; on a warm SSH ControlMaster it answers in milliseconds, not the second-plus Ansible spends starting Python, loading plugins, and shipping a module to every host. Second, composability: it's a pipe-friendly Unix tool, so the output flows into dshbak, awk, sort | uniq -c, or whatever else you've got. Third, zero remote requirements: no Python on the target, no agent, nothing to install (pdcp and rpdcp are the exception, since they need pdcp on the far end). If ssh user@host command works, pdsh works.
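
As a small illustration of the composability point, this sketch strips the host prefix and tallies identical lines with sort | uniq -c. The sample lines and the function name tally_output are invented for the example; in practice stdin would be the output of a pdsh run.

```shell
# Hypothetical helper: count how many hosts produced each distinct line.
# Expects pdsh-style "host: output" lines on stdin.
tally_output() {
    sed 's/^[^:]*: //' |   # drop the "host: " prefix
        sort |             # bring identical lines together
        uniq -c |          # count each distinct line
        sort -rn           # most common output first
}

# Sample data standing in for: pdsh -w node[01-32] uname -r
printf '%s\n' \
    'node01: 5.15.0-91-generic' \
    'node02: 5.15.0-91-generic' \
    'node31: 5.15.0-50-generic' | tally_output
```

This answers "how many" where dshbak -c answers "which ones"; the two pair well when the fleet is large and the outliers are few.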

It's not a replacement for configuration management. It's the tool you reach for when you need to look at two hundred machines right now and decide what to do next.

Key Takeaway: When the question is "what does the fleet look like right this second?", pdsh plus dshbak -c answers it in one pipeline, with no agents, no YAML, and no apology.
