Can it (or any tool) perform proximity searches on scanned PDFs? E.g word1 withi...

phiresky · on Dec 2, 2020

Scanned PDFs only work well if they already have an OCR layer. There's some optional integration of rga with tesseract, but it's pretty slow and not as good as external OCR tools.

ripgrep-all can run the same regexes as rg on any file type it supports. So you could do something like --multiline and foo(\w+[\s\n]+){,20}bar

It won't work exactly like this, but something similar should do it:

* --multiline enables multiline matching

* foo searches for foo

* \w+ searches for at least one word character

* [\s\n]+ searches for at least one whitespace character (use [\W]+ instead to also allow nonword characters like sentence marks)

* {,20} searches for at most 20 iterations of the word-space combination

* bar searches for bar
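
To make the idea above concrete, here is a minimal sketch in Python's re module rather than rg/rga itself, so the syntax may need adjusting for ripgrep. The helper name within_n_words and the max_gap parameter are made up for illustration. It assumes the text has already been extracted from the PDF, and it adds a separator right after the first word so that "foo bar" with nothing in between also matches.

```python
import re


def within_n_words(first, second, max_gap=20):
    """Compile a regex matching `first` followed by `second`
    with at most `max_gap` words in between (hypothetical helper)."""
    pattern = (
        rf"\b{re.escape(first)}\b"     # the first word, as a whole word
        r"\W+"                         # at least one space/nonword character
        rf"(?:\w+\W+){{0,{max_gap}}}?" # up to max_gap intervening words
        rf"{re.escape(second)}\b"      # the second word
    )
    # \W also matches newlines, so matches can span lines without extra flags.
    return re.compile(pattern, re.IGNORECASE)


if __name__ == "__main__":
    rx = within_n_words("foo", "bar", max_gap=2)
    for sample in ("foo one two bar", "foo one two three bar", "foo,\nthen bar"):
        print(repr(sample), bool(rx.search(sample)))
```

This only finds the words in the given order; searching in both directions would need a second pattern with the words swapped.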

ballmerspeak · on Dec 2, 2020

If it's a scanned PDF (essentially a collection of 1 image per page), there would need to be an OCR step to get some text out first. Tesseract would work for this.

Once that's done, you have all the options available to perform that search. But I don't know of a search tool that does the OCR for you. I did read a blog post by someone who uploaded PDFs to Google Drive (which OCRs them on upload) as an easy way to do this.
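
A rough sketch of that OCR step, for illustration only: it assumes the tesseract command-line tool is installed, and it additionally assumes Poppler's pdftoppm is available to turn each page into an image (pdftoppm is not mentioned above; any PDF-to-image converter would do). The function names are hypothetical.

```python
import subprocess
import tempfile
from pathlib import Path


def rasterize_cmd(pdf, prefix, dpi=300):
    """Command that renders each PDF page to <prefix>-<n>.png (assumes pdftoppm)."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr_cmd(image):
    """Command that OCRs one image and writes the text to standard output."""
    return ["tesseract", str(image), "stdout"]


def ocr_scanned_pdf(pdf):
    """Return the OCR text of every page of a scanned PDF, in page order."""
    with tempfile.TemporaryDirectory() as tmp:
        subprocess.run(rasterize_cmd(pdf, Path(tmp) / "page"), check=True)
        # pdftoppm zero-pads page numbers, so a plain sort keeps page order.
        pages = sorted(Path(tmp).glob("page-*.png"))
        texts = [
            subprocess.run(
                ocr_cmd(page), check=True, capture_output=True, text=True
            ).stdout
            for page in pages
        ]
    return "\n".join(texts)
```

The returned text can then be searched with whatever tool or regex you like.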