2026-05-20
You need to pluck data out of an HTML page. The usual options: pipe curl through grep and pray the markup never shifts, fire up Python with BeautifulSoup, or wire up a Node toolchain. There's a fourth option that has been quietly around since 2012: xidel. One Pascal-compiled binary, no runtime, and it understands HTML, XML, and JSON through the same expression engine.
The simplest form looks like wget with a brain:
# Grab every story link off Hacker News
xidel https://news.ycombinator.com -e '//span[@class="titleline"]/a/@href'
# Same thing with a CSS selector (child combinator, so nested links are not picked up)
xidel https://news.ycombinator.com -e 'css("span.titleline > a")/@href'
The -e flag takes XPath or XQuery, including JSONiq-style JSON syntax, and CSS selectors work through the css() function. xidel auto-detects HTML, XML, and JSON input and parses accordingly, so there is no flipping between jq and pup.
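To experiment with expressions without hitting the network, one option is to pipe a fragment in on stdin. This is a minimal sketch, assuming xidel is on your PATH; the HTML fragment and its URL are made up for illustration.

```shell
# Hypothetical fragment shaped like a Hacker News title cell.
html='<html><body><span class="titleline"><a href="https://a.example/post">A post</a></span></body></html>'

if command -v xidel >/dev/null 2>&1; then
  # -s silences status messages; "-" reads the document from stdin.
  links=$(printf '%s\n' "$html" | xidel -s - -e '//span[@class="titleline"]/a/@href')
  printf '%s\n' "$links"
else
  # Keep the sketch harmless on machines without xidel.
  links=''
  echo 'xidel not found on PATH; skipping' >&2
fi
```

Swapping the expression for the css() form is a quick way to check that both selectors pick out the same nodes.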
The real power shows up when you need multiple fields at once. XQuery lets you build structured records inline:
xidel https://news.ycombinator.com -e '
  for $row in //tr[@class="athing"] return {
    "title": $row//span[@class="titleline"]/a,
    "url": $row//span[@class="titleline"]/a/@href,
    "rank": $row//span[@class="rank"]
  }
' --output-format=json-wrapped
One process, structured JSON output, no glue script. Now the killer feature — automatic link following:
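# Front page → every comment thread → comment bodies, in one invocation
# (the title links point at the stories themselves; the comments link is the last one in the subline)
xidel https://news.ycombinator.com \
  -f '//span[@class="subline"]/a[last()]/@href' \
  -e '//div[@class="comment"]//text()'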
The -f flag means "follow these URLs and apply the next -e to each." Chain -f and -e arbitrarily — five levels deep, paginated archive crawls, login-then-scrape — all in one process. Cookies persist across the chain automatically.
It speaks JSON with the same engine:
# The dependencies object of every package entry in a package-lock.json
xidel package-lock.json -e '$json/packages/*/dependencies'
Need to log in first? It does forms and sessions:
xidel https://site.example/login \
  --post='user=alice&pass=secret' \
  --save-cookies=cookies.txt \
  -f '//a[text()="Dashboard"]/@href' \
  -e '//table[@id="stats"]//tr'
Output formatters cover the cases you actually need. --output-format= takes adhoc, json-wrapped, xml, html, or bash. The bash mode is gold: name your results with := and xidel emits shell assignments such as user='...' that you can eval straight into your shell:
eval "$(xidel https://example.com/api/whoami \
  -e 'user:=//span[@id="username"]' \
  -e 'mail:=//a[@class="email"]' \
  --output-format=bash)"
echo "$user $mail"
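To see what the eval step actually consumes, here is a sketch with the xidel call replaced by a hard-coded string. The assignment lines are an assumed approximation of the bash output format, not captured output, and the values are made up.

```shell
# Stand-in for: xidel ... --output-format=bash
# (assumed shape: one quoted shell assignment per named variable)
simulated_output="user='alice'
mail='alice@example.com'"

# eval runs each assignment in the current shell, defining $user and $mail.
eval "$simulated_output"
echo "$user $mail"
```

Because eval executes whatever it is handed, it is worth printing the raw output once before relying on this pattern against pages you do not control.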
Why this beats the alternatives: grep can't parse nested HTML. jq can't read HTML at all. BeautifulSoup needs a Python interpreter and fifty lines for what xidel does in one expression. pup handles CSS but not XPath, and has no link-following or session handling. xidel collapses fetcher, parser, session manager, and serializer into one tool — and the expression language is a real, standardized one you already half-know.
