2026-05-21
You know ripgrep. Fast, recursive, respects .gitignore, the default search tool for anyone who has typed rg more than once. But ripgrep stops at the file boundary: hand it a PDF, a zip, a SQLite database, or an EPUB and it sees binary data, which it skips by default. ripgrep-all (binary name rga) is a ripgrep wrapper that extracts text from a range of archive and document formats before handing the stream to ripgrep, and caches the extracted text so the second search is nearly free.
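Install it on macOS with brew install rga, or grab a static binary from the GitHub releases. On Debian/Ubuntu, check whether your release actually packages ripgrep-all before relying on apt; the project's README has you download the binary and install the helpers with apt install ripgrep pandoc poppler-utils ffmpeg. Whichever route you take, you need rg itself plus pandoc, pdftotext (from poppler-utils), and ffmpeg on your PATH, because rga's adapters shell out to them.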
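Before the first search it's worth checking what is actually on your PATH. A minimal sketch, not part of rga: missing_tools is a helper name invented for this post, and pdftotext is the binary that poppler-utils installs.

```shell
# Print the name of every given program that is NOT on PATH, one per line.
# Prints nothing when all of them are found.
# (missing_tools is a hypothetical helper, not something rga ships.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# rg and rga themselves, plus the helper programs the adapters call.
missing=$(missing_tools rg rga pdftotext pandoc ffmpeg)
if [ -n "$missing" ]; then
    printf 'not found on PATH:\n%s\n' "$missing"
else
    echo "all rga helpers found"
fi
```

A missing helper presumably only costs you the formats it handles, so treat the output as a to-do list rather than a hard failure.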
The killer demo:
$ rga "kerberos delegation" ~/research/
research/papers/active_directory.pdf: Page 14: Unconstrained kerberos delegation allows...
research/notes.tar.gz: ad-notes.md: see also: kerberos delegation attack chains
research/archive.zip: slides.pptx: Slide 7: Kerberos Delegation (Constrained vs. RBCD)
research/log.sqlite: events: row 4821: "kerberos delegation detected on host DC01"
That's one command searching inside a PDF, a gzipped tarball, a PowerPoint inside a zip, and a SQLite database. No running pdftotext by hand, no unzip -p, no scripting. To list the adapters rga knows about:
$ rga --rga-list-adapters
Adapters:
- pandoc: epub,odt,docx,fb2,ipynb,html,rtf
- poppler: pdf
- postprocpagebreaks (adds page numbers to pdftotext output)
- ffmpeg: mkv,mp4,mov,avi,wmv,webm (extracts subtitle tracks!)
- zip: zip
- decompress: gz,bz2,xz,zst,br
- tar: tar
- sqlite: db,db3,sqlite,sqlite3
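Yes, it greps through video subtitles. rga "i am your father" ~/Movies/ works as you'd hope, provided the files carry embedded subtitle tracks: rga reads text that is already in the container and does no speech recognition. Useful in genuine ways too: searching captioned webinar recordings for a quote, scanning a directory of conference talks, finding a specific exchange in a recorded meeting that shipped with subtitles.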
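Subtitle search only has something to find when the file actually contains subtitle streams. A quick way to check, assuming ffprobe (it ships with ffmpeg) is installed; subtitle_stream_count is a name made up for this sketch, and this is not necessarily how rga itself probes files.

```shell
# Print how many subtitle streams ffprobe reports for a video file.
# (subtitle_stream_count is a hypothetical helper for illustration.)
subtitle_stream_count() {
    # -select_streams s        : look at subtitle streams only
    # -show_entries stream=index, -of csv=p=0 : one bare stream index per line
    ffprobe -v error -select_streams s \
        -show_entries stream=index -of csv=p=0 "$1" |
        awk 'NF { n++ } END { print n + 0 }'
}

# Usage (the path is only an example):
#   subtitle_stream_count ~/Movies/talk.mkv
```

A result of 0 means a subtitle search will come up empty for that file no matter what the audio says.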
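The cache is the secret weapon. rga stores extracted text in ~/.cache/rga/ (the directory name can differ by version and platform), keyed on the file's path and modification time rather than a hash of its contents. The first search through a 400-page PDF takes seconds, the next twenty are near-instant, and a file you edit is simply re-extracted the next time it is searched. Only extractions under a size limit are cached. Inspect or tune it: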
$ du -sh ~/.cache/rga
142M /home/shaun/.cache/rga
$ rga --rga-cache-max-blob-len=50M "kerberos delegation" ~/research/ # raise the size limit for what gets cached
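For the "wipe" half, a small helper sketch. It assumes the cache sits at ~/.cache/rga as in the du output above, falling back through XDG_CACHE_HOME (my assumption about the lookup). RGA_CACHE_DIR and the three function names are invented for this script; rga does not read that variable.

```shell
# Resolve the cache directory this script will operate on.
# RGA_CACHE_DIR is a hypothetical override used only here, not by rga.
rga_cache_dir() {
    printf '%s\n' "${RGA_CACHE_DIR:-${XDG_CACHE_HOME:-$HOME/.cache}/rga}"
}

# Show how much disk the cache uses, or say that there is none.
rga_cache_size() {
    dir=$(rga_cache_dir)
    if [ -d "$dir" ]; then
        du -sh "$dir"
    else
        echo "no cache at $dir"
    fi
}

# Delete the cache directory; rga re-extracts files on the next search.
rga_cache_wipe() {
    dir=$(rga_cache_dir)
    [ -d "$dir" ] || return 0   # nothing to do
    rm -rf -- "$dir"
}

rga_cache_size
```

Wiping is safe in the sense that the cache only holds derived text; the cost is that the next search pays the full extraction time again.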
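A few flags worth knowing. --rga-adapters takes a comma-separated list with a single leading sign: --rga-adapters=-zip,tar uses the defaults minus those two, --rga-adapters=+name adds opt-in adapters to the defaults, and a bare list such as --rga-adapters=poppler,pandoc uses only the ones named. --rga-no-cache bypasses the cache for one-shot searches. Pass --rga-accurate to detect file types by content (magic bytes) instead of by extension; it is slower, but it catches files such as SQLite databases with an unusual or missing extension. Everything else (-i, -A 3, --type, -l) is just ripgrep, because rga passes the flags it doesn't recognize through to rg.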
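Mixing the two flag families looks like this. A sketch only: search_docs is a wrapper name I made up, and it uses nothing beyond flags already mentioned above.

```shell
# List the files under a directory that mention a pattern, ignoring case
# and skipping the cache. (search_docs is a hypothetical wrapper.)
#   -i, -l          : ordinary ripgrep flags (ignore case, list file names only)
#   --rga-no-cache  : rga flag, do not read or write the extraction cache
search_docs() {
    pattern=$1
    dir=${2:-.}    # default to the current directory
    rga -i -l --rga-no-cache "$pattern" "$dir"
}

# Usage (the path is only an example):
#   search_docs "kerberos delegation" ~/research
```

Drop --rga-no-cache from the wrapper if you expect to search the same tree more than once.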
Why this beats the alternatives: the standard advice is "use pdftotext in a loop" or "write a shell script with find." Those work for one format. rga handles a dozen with no glue code, caches transparently, and gives you ripgrep's speed and ergonomics for everything. It's the tool I reach for when someone says "I think I mentioned this in a doc somewhere, but I don't remember which one or what format it was."
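For contrast, here is roughly what the "pdftotext in a loop" advice turns into. This is my own sketch of the hand-rolled approach, with grep_pdfs as an invented name; it covers PDFs only, re-extracts every file on every run, and breaks on file names containing newlines.

```shell
# Search every PDF under a directory for a pattern, case-insensitively,
# printing "path: matching line". (grep_pdfs is a hypothetical helper.)
grep_pdfs() {
    pattern=$1
    dir=${2:-.}
    find "$dir" -type f -name '*.pdf' | while IFS= read -r f; do
        # "-" makes pdftotext write the extracted text to stdout.
        # grep exits non-zero when a file has no match; that is not an error here.
        pdftotext "$f" - 2>/dev/null | { grep -i -- "$pattern" || true; } |
            while IFS= read -r line; do
                printf '%s: %s\n' "$f" "$line"
            done
    done
}

# Usage (the path is only an example):
#   grep_pdfs "kerberos delegation" ~/research
```

Every additional format means another branch like this one, which is the glue code rga saves you from writing.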
