xidel: The CLI Scraper That Speaks XPath, XQuery, CSS, and JSONiq

2026-05-20

You need to pluck data out of an HTML page. The usual options: pipe curl through grep and pray the markup never shifts, fire up Python with BeautifulSoup, or wire up a Node toolchain. There's a fourth option that has been quietly around since 2012: xidel. It is a single binary written in Pascal with no runtime to install, and it understands HTML, XML, and JSON through the same expression engine.

The simplest form looks like wget with a brain:

# Grab every story link off Hacker News
xidel https://news.ycombinator.com -e '//span[@class="titleline"]/a/@href'

# Same thing with a CSS selector
xidel https://news.ycombinator.com -e 'css("span.titleline a")/@href'

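The -e flag takes an XPath expression by default; XQuery, JSONiq-style JSON navigation, and CSS selectors (through the css() function) run on the same engine. xidel auto-detects whether the input is HTML, XML, or JSON and parses accordingly, so there is no flipping between jq and pup.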
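The source argument does not have to be a URL: a local file or - for standard input works too, which makes expressions easy to try offline. A minimal sketch, assuming xidel is on your PATH; links_from_stdin is a hypothetical helper name, and -s (--silent) only suppresses status messages.

```shell
# Hypothetical helper: print the href of every link in HTML read from stdin.
links_from_stdin() {
  # "-" tells xidel to read the document from stdin instead of fetching a URL.
  xidel -s - -e '//a/@href'
}
```

Call it at the end of any pipeline that produces HTML, for example: printf '<a href="/docs">Docs</a>' | links_from_stdin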

The real power shows up when you need multiple fields at once. XQuery lets you build structured records inline:

xidel https://news.ycombinator.com -e '
  for $row in //tr[@class="athing"] return {
    "title": $row//span[@class="titleline"]/a,
    "url":   $row//span[@class="titleline"]/a/@href,
    "rank":  $row//span[@class="rank"]
  }
' --output-format=json-wrapped

One process, structured JSON output, no glue script. Now the killer feature — automatic link following:

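# Front page → every comment thread → comment bodies, in one invocation
# (the titleline links point at the external stories; the age link under
# each story points at its Hacker News thread, item?id=...)
xidel https://news.ycombinator.com \
  -f '//span[@class="age"]/a/@href' \
  -e '//div[@class="comment"]//text()'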

The -f flag means "follow these URLs and apply the next -e to each." Chain -f and -e arbitrarily — five levels deep, paginated archive crawls, login-then-scrape — all in one process. Cookies persist across the chain automatically.
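A paginated crawl can be sketched the same way. This assumes that an -e placed before a trailing -f is re-applied to every followed page (the pattern the xidel README uses for next-link crawls) and that the "More" link on Hacker News carries class="morelink"; hn_titles_all_pages is a hypothetical helper name.

```shell
# Hypothetical helper: print story titles from a listing page and from
# every page reachable through its "More" link.
hn_titles_all_pages() {
  xidel -s "$1" \
    -e '//span[@class="titleline"]/a' \
    -f '//a[@class="morelink"]/@href'
}
```

Invoke it as hn_titles_all_pages https://news.ycombinator.com. The crawl keeps going until a page has no matching link, so treat it as a sketch and be considerate about request volume.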

It speaks JSON with the same engine:

# The dependencies object of every package entry in a package-lock.json
xidel package-lock.json -e '$json/packages/*/dependencies'
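Several -e expressions can be applied to the same JSON document. A small sketch under the assumption that the file is an ordinary package.json with top-level name and version keys; pkg_name_version is a hypothetical helper name.

```shell
# Hypothetical helper: print the name and version fields of a package.json.
pkg_name_version() {
  # Each -e is evaluated against the same parsed document, bound to $json.
  xidel -s "$1" -e '$json/name' -e '$json/version'
}
```

Usage: pkg_name_version ./package.json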

Need to log in first? It does forms and sessions:

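# Cookies set by the login response stay in memory for the rest of the
# chain, so no cookie file is needed for a single invocation.
xidel https://site.example/login \
  --post='user=alice&pass=secret' \
  -f '//a[text()="Dashboard"]/@href' \
  -e '//table[@id="stats"]//tr'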
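When the login form carries hidden fields such as a CSRF token, posting a hand-written body may not be enough. xidel's form() extension function builds the request from the form on the page and overrides only the fields you name. The sketch below is an assumption-laden variant of the example above: the field names user and pass, the site URL, and the helper name login_and_scrape are all illustrative, and the exact form() argument syntax is worth checking against your xidel version.

```shell
# Hypothetical helper: submit the page's first form with overridden
# credentials, follow the Dashboard link, then extract the stats rows.
login_and_scrape() {
  xidel -s "$1" \
    -f 'form((//form)[1], "user=alice&pass=secret")' \
    -f '//a[text()="Dashboard"]/@href' \
    -e '//table[@id="stats"]//tr'
}
```

Usage: login_and_scrape https://site.example/login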

Output formatters cover the cases you actually need. --output-format= accepts values such as adhoc, json-wrapped, xml, html, and bash. The bash mode is gold: for each variable assigned with := xidel emits a shell assignment (user=..., mail=...) that you can eval straight into your shell:

eval "$(xidel https://example.com/api/whoami \
  -e 'user:=//span[@id="username"]' \
  -e 'mail:=//a[@class="email"]' \
  --output-format=bash)"
echo "$user $mail"
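Because eval runs whatever text it is given, it can be worth keeping the resulting variables out of your interactive shell. One way to sketch that is a function whose body is a subshell; whoami_line is a hypothetical helper name, and the expressions are the ones from the example above.

```shell
# Hypothetical helper: print "user <mail>" for a profile page.
# The function body is a subshell, so the eval'd variables do not leak
# into the calling shell.
whoami_line() (
  eval "$(xidel -s "$1" \
    -e 'user:=//span[@id="username"]' \
    -e 'mail:=//a[@class="email"]' \
    --output-format=bash)"
  printf '%s <%s>\n' "$user" "$mail"
)
```

Usage: whoami_line https://example.com/api/whoami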

Why this beats the alternatives: grep can't parse nested HTML. jq can't read HTML at all. BeautifulSoup needs a Python interpreter and a script around it for what xidel does in one expression. pup handles CSS but not XPath, and has no link-following or session handling. xidel collapses fetcher, parser, session manager, and serializer into one tool, and its expression languages are real, standardized ones you already half-know.

Key Takeaway: xidel replaces curl + parser + glue script with a single expression-based pipeline that walks HTML, JSON, cookies, and forms using the same XPath/XQuery/CSS syntax.
