>... but it turns out that a guess works pretty well most of time since most thi...

burntsushi · on March 31, 2019

This is wrong. Boyer-Moore does the same thing. It just always selects the last byte to feed into memchr. If that byte happens to be rare, then it works great. But if it happens to be common, then it gets into the weeds pretty quickly. My suggested heuristic does the same thing, but increases the chances that it selects a rare byte.

And yes, the performance difference can be very large. Here's an example on a ~9.3GB file:

    $ time rg --no-mmap 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
    6698
    
    real    3.006
    user    1.658
    sys     1.345
    maxmem  8 MB
    faults  0
    
    $ time grep 'Sherlock ' OpenSubtitles2016.raw.en | wc -l
    6698
    
    real    9.023
    user    7.921
    sys     1.092
    maxmem  8 MB
    faults  0

Notice that the pattern is `Sherlock `. The last byte is an ASCII space character, which is incredibly common. Boyer-Moore blindly picks this as the skip byte, but ripgrep uses a simple pre-computed frequency table to select a better byte as the skip byte.

> Presumably this is not too bad for ripgrep itself, as long as it falls back to something sensible when the assumption fails.

It does. That's why I said, "ripgrep does not use Boyer-Moore in most searches."

TheCoelacanth · on March 31, 2019

What pathological case are you referring to? The worst case I can think of is a completely random haystack in which case it is approximately equivalent to a naive search.

Even Boyer-Moore is not superior to a naive search in literally every case, e.g. short needle or large alphabet.