This is wrong. Boyer-Moore does the same thing. It just always selects the last byte to feed into memchr. If that byte happens to be rare, then it works great. But if it happens to be common, then it gets into the weeds pretty quickly. My suggested heuristic does the same thing, but increases the chances that it selects a rare byte.
And yes, the performance difference can be very large. Here's an example on a ~9.3GB file:
$ time rg --no-mmap 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
6698
real 3.006
user 1.658
sys 1.345
maxmem 8 MB
faults 0
$ time grep 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
6698
real 9.023
user 7.921
sys 1.092
maxmem 8 MB
faults 0
Notice that the pattern is `Sherlock `. The last byte is an ASCII space character, which is incredibly common. Boyer-Moore blindly picks this as the skip byte, but ripgrep uses a simple pre-computed frequency table to select a better byte as the skip byte.
> Presumably this is not too bad for ripgrep itself, as long as it falls back to something sensible when the assumption fails.
It does. That's why I said, "ripgrep does not use Boyer-Moore in most searches."
What pathological case are you referring to? The worst case I can think of is a completely random haystack in which case it is approximately equivalent to a naive search.
Even Boyer-Moore is not superior to a naive search in literally every case, e.g. short needle or large alphabet.
This kind of thing is an pattern unless the benefit over Boyer-Moore huge.
A small performance gain in the common-case is not worth the pain of introducing pathological cases that only bite you once you are deeply committed.
Presumably this is not too bad for ripgrep itself, as long as it falls back to something sensible when the assumption fails.