2026-05-17
You know the dance. Hardware datasheet you need to search? RFC printed to PDF? A 400-page vendor manual where the table of contents lies? You reach for pdftotext file.pdf - | grep foo, lose the page numbers, can't recurse across a directory, and discover the next PDF was scanned at a weird angle and your regex falls off a cliff.
pdfgrep is exactly what the name promises: a grep-compatible binary that opens PDFs natively, knows about pages, and supports the flags your fingers already type. It's been in Debian since 2010 and most people I show it to have never installed it.
# Basic — just like grep, but the file is a PDF
pdfgrep "TLS 1.3" rfc8446.pdf
# Page numbers (the killer feature pdftotext loses); add -H if you also want the filename
pdfgrep -n "deprecated" manual.pdf
# 42: deprecated since version 3.0
# 118: this API is deprecated...
# Recursive across a directory of papers
pdfgrep -r --include "*.pdf" "byzantine fault" ~/papers/
# Context lines, just like grep -C
pdfgrep -C 3 "CVE-2025" advisories/*.pdf
# Perl regex — lookbehinds work
pdfgrep -P "(?<=Section )\d+\.\d+\.\d+" ieee-spec.pdf
# Only the matching part, not the whole line
pdfgrep -o -P "RFC ?\d{4,5}" draft.pdf | sort -u
# Encrypted PDFs (yes, really)
pdfgrep --password hunter2 "settlement amount" contract.pdf
The hidden gem is --cache. PDF text extraction is slow — fonts, ligatures, CID maps, that one paper from 1998 with embedded PostScript. With a cache, the second search through a corpus is nearly instant:
# First run extracts and caches; subsequent runs are fast
pdfgrep --cache -r "homomorphic" ~/library/
Pair it with fd or xargs for serious sweeps. Want to find every PDF in your downloads that mentions a specific contract number, sorted by which page it appears on?
pdfgrep -rHn "PO-2026-0421" ~/Downloads/ \
| cut -d: -f1,2 \
| sort -t: -k2,2n
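If one line per file is more useful than one line per hit, the same stream can be folded down further. This is a minimal sketch, assuming the `file:page:text` shape that `pdfgrep -Hn` prints and filenames that contain no colons; `pages_per_file` is a hypothetical helper name, not something pdfgrep ships.

```shell
# pages_per_file: hypothetical helper, not part of pdfgrep.
# Reads "file:page:text" lines on stdin (assumed shape of `pdfgrep -Hn` output,
# with no colons in filenames) and prints "file<TAB>page,page,..." per file.
pages_per_file() {
  awk -F: '
    {
      key = $1 SUBSEP $2
      if (key in seen) next            # this page is already listed for this file
      seen[key] = 1
      if ($1 in pages) {
        pages[$1] = pages[$1] "," $2   # append another page to a known file
      } else {
        order[++n] = $1                # remember files in first-seen order
        pages[$1] = $2
      }
    }
    END {
      for (i = 1; i <= n; i++) printf "%s\t%s\n", order[i], pages[order[i]]
    }
  '
}
```

Usage would look like `pdfgrep -rHn "PO-2026-0421" ~/Downloads/ | pages_per_file`. Filenames with spaces survive because only the colon is treated as a separator.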
Or audit a stack of vendor manuals for deprecated config options, exporting filename and page for a JIRA ticket:
pdfgrep -rHn -P "\b(deprecated|obsolete|removed in)\b" vendor-docs/ \
| column -t -s:
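For pasting into the ticket itself, the colon-separated output can be reshaped into a table. A rough sketch, assuming the ticket field accepts Jira's wiki-style table markup (`||header||` rows and `|cell|` rows) and that filenames contain no colons; `to_jira_table` is a made-up helper name, and it does not escape pipe characters inside the matched text.

```shell
# to_jira_table: hypothetical helper, not part of pdfgrep.
# Reads "file:page:text" lines on stdin and prints a Jira wiki-markup table.
# Only the first two colons are treated as separators, so colons inside the
# matched text are preserved.
to_jira_table() {
  printf '||File||Page||Match||\n'
  awk '
    {
      i = index($0, ":")
      if (i == 0) next                      # not a file:page:text line
      file = substr($0, 1, i - 1)
      rest = substr($0, i + 1)
      j = index(rest, ":")
      if (j == 0) next
      page = substr(rest, 1, j - 1)
      text = substr(rest, j + 1)
      gsub(/^[ \t]+|[ \t]+$/, "", text)     # trim layout padding around the match
      printf "|%s|%s|%s|\n", file, page, text
    }
  '
}
```

Swap it in for the `column` step: `pdfgrep -rHn -P "\b(deprecated|obsolete|removed in)\b" vendor-docs/ | to_jira_table`.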
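The -H/-h, -i, -c, -l, -L, --include, --exclude flags all behave like their GNU grep counterparts. There's almost nothing to relearn. pdfgrep --help reads like grep --help with a few PDF-specific twists: -n prints page numbers where grep would print line numbers, -p prints match counts per page, --page-range 50-100 limits the search to those pages, and --unac strips accents before matching (only if your build was compiled with unac support).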
Why not just pdftotext piped to grep?
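You can, but you end up rebuilding the page bookkeeping yourself. pdftotext separates pages with a form feed character, so keeping page numbers means counting them by hand. Here is a rough sketch of what that looks like, for a single file only; `grep_with_pages` is a hypothetical name, and the pattern is treated as an awk extended regex rather than a grep one.

```shell
# grep_with_pages: hypothetical helper showing the bookkeeping pdfgrep does for you.
# Reads pdftotext output on stdin, where pages are separated by form feeds (\f),
# and prints "page:line" for every line matching the awk regex given as $1.
# Caveat: awk -v processes backslash escapes in the pattern.
grep_with_pages() {
  pattern=$1
  awk -v pat="$pattern" '
    BEGIN { page = 1 }
    {
      n = split($0, parts, "\f")   # each form feed in a line starts a new page
      for (i = 1; i <= n; i++) {
        if (i > 1) page++
        if (parts[i] ~ pat) printf "%d:%s\n", page, parts[i]
      }
    }
  '
}
```

Typical use would be `pdftotext -layout manual.pdf - | grep_with_pages deprecated`. That covers one flag for one file; recursion, context lines, filenames in the output, and passwords would each need more glue.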
Install: apt install pdfgrep, brew install pdfgrep, pacman -S pdfgrep. It links against poppler, which you almost certainly already have because something on your machine renders PDFs.
The day you stop reaching for pdftotext | grep is the day searching documentation stops being a chore.
