Tool Nobody Knows: pdfgrep: grep That Actually Reads Your PDFs

2026-05-17

You know the dance. Hardware datasheet you need to search? RFC printed to PDF? A 400-page vendor manual where the table of contents lies? You reach for pdftotext file.pdf - | grep foo, lose the page numbers, can't recurse across a directory, and pay the full text-extraction cost again every time you tweak the regex.

pdfgrep is exactly what the name promises: a grep-compatible binary that opens PDFs natively, knows about pages, and supports the flags your fingers already type. It's been in Debian since 2010 and most people I show it to have never installed it.

# Basic — just like grep, but the file is a PDF
pdfgrep "TLS 1.3" rfc8446.pdf

# Page numbers (the killer feature pdftotext loses)
pdfgrep -n "deprecated" manual.pdf
# 42: deprecated since version 3.0
# 118: this API is deprecated...
# (add -H to prefix the filename when searching a single file)

# Recursive across a directory of papers
pdfgrep -r --include "*.pdf" "byzantine fault" ~/papers/

# Context lines, just like grep -C
pdfgrep -C 3 "CVE-2025" advisories/*.pdf

# Perl regex — lookbehinds work
pdfgrep -P "(?<=Section )\d+\.\d+\.\d+" ieee-spec.pdf

# Only the matching part, not the whole line
pdfgrep -o -P "RFC ?\d{4,5}" draft.pdf | sort -u

# Encrypted PDFs (yes, really)
pdfgrep --password hunter2 "settlement amount" contract.pdf

The hidden gem is --cache. PDF text extraction is slow — fonts, ligatures, CID maps, that one paper from 1998 with embedded PostScript. With a cache, the second search through a corpus is nearly instant:

# First run extracts and caches; subsequent runs are fast
pdfgrep --cache -r "homomorphic" ~/library/

Pair it with fd or xargs for serious sweeps. Want to find every PDF in your downloads that mentions a specific contract number, sorted by which page it appears on?

# Keep ":" as the delimiter throughout so filenames with spaces survive
pdfgrep -rHn "PO-2026-0421" ~/Downloads/ \
  | cut -d: -f1,2 \
  | sort -t: -k2,2n

Or audit a stack of vendor manuals for deprecated config options, exporting filename and page for a JIRA ticket:

pdfgrep -rHn -P "\b(deprecated|obsolete|removed in)\b" vendor-docs/ \
  | column -t -s:
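
For a ticket you usually want one line per file rather than one line per match. A minimal sketch, assuming the file:page:text shape that -Hn produces and filenames without colons; pages_per_file is a made-up helper name, not part of pdfgrep:

```shell
# Hypothetical helper: collapse "file:page:text" lines (the shape of
# pdfgrep -Hn output) into one "file: p1,p2,..." line per file.
# Assumption: filenames contain no ":" characters.
pages_per_file() {
  awk -F: '
    {
      key = $1 SUBSEP $2
      if (key in seen) next            # page already listed for this file
      seen[key] = 1
      if ($1 in pages) {
        pages[$1] = pages[$1] "," $2   # append another page
      } else {
        order[++n] = $1                # remember first-seen file order
        pages[$1] = $2
      }
    }
    END {
      for (i = 1; i <= n; i++) print order[i] ": " pages[order[i]]
    }
  '
}
```

Pipe the audit into it: pdfgrep -rHn -P "\b(deprecated|obsolete|removed in)\b" vendor-docs/ | pages_per_file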

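The -H/-h, -i, -c, -l, -L, --include, and --exclude flags behave like their GNU grep counterparts, so there is little to relearn. Two differences worth knowing: -n prints page numbers rather than line numbers, and -v (invert match) is not among the supported options. pdfgrep --help reads like grep --help with a few PDF-specific additions (-p for per-page match counts, --page-range 50-100 to restrict the search to those pages, --unac to strip accents before matching, if your build includes unac support).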

Why not just pdftotext piped to grep?

You lose page numbers — the most useful piece of metadata a PDF has.
Recursion requires a wrapper script.
No caching, so repeated searches pay the extraction cost every time.
Encrypted files need pdftotext's own password flags (-opw/-upw) threaded through whatever wrapper you wrote.
Multi-column layouts and running headers come out differently depending on pdftotext's mode (default, -layout, or -raw), so a pattern that matches in one mode can miss in another.
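
To make the first point concrete, here is roughly what a wrapper has to do just to get page numbers back. It is a sketch that leans on pdftotext ending each page with a form feed character (its default unless -nopgbrk is given); grep_with_pages is a hypothetical name, and the pattern is treated as an awk regular expression rather than grep syntax:

```shell
# Hypothetical helper: read pdftotext-style text on stdin and print
# "page:line" for every line matching the given awk regex.
# Assumption: each page ends with a form feed (\f), so the first line
# of the following page starts with one.
grep_with_pages() {
  pattern=$1
  awk -v pat="$pattern" '
    BEGIN { page = 1 }
    {
      # Each leading form feed means one more page boundary was crossed.
      while (sub(/^\f/, "")) page++
      if ($0 ~ pat) print page ":" $0
    }
  '
}
```

Usage would look like pdftotext -layout manual.pdf - | grep_with_pages deprecated, and that still covers only one file, with no recursion, caching, or context lines.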

Install: apt install pdfgrep, brew install pdfgrep, pacman -S pdfgrep. It links against poppler, which you almost certainly already have because something on your machine renders PDFs.

The day you stop reaching for pdftotext | grep is the day searching documentation stops being a chore.

Key Takeaway: pdfgrep is a drop-in grep for PDFs that preserves page numbers, recurses directories, caches extractions, and handles encryption — replacing the lossy pdftotext | grep pipeline with one tool that already speaks your grep muscle memory.
