It's probably worth pointing out that OS X is using BSD grep. You'll see a phenomenal difference if you install GNU grep via Homebrew:
$ brew info grep
GNU grep, egrep and fgrep
https://www.gnu.org/software/grep/
/usr/local/Cellar/grep/3.3 (21 files, 885.3KB) *
Poured from bottle on 2019-01-03 at 12:23:37
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/grep.rb
...
This is a fairly arbitrary example, but I've got a 1.5 GB application log sitting on my machine right now that'll suffice. I'll even do this in a way that might give a slight performance advantage to BSD grep, by running GNU grep first (so BSD grep benefits from a warm file cache):
$ time /usr/local/bin/ggrep "foobarbaz" application.log
real 0m1.319s
user 0m0.948s
sys 0m0.345s
vs. OS X's grep:
$ time /usr/bin/grep "foobarbaz" application.log
real 0m37.225s
user 0m31.036s
sys 0m1.286s
1s vs 37s. Same file.
There's nothing odd about the file: straight, boring application logs, and each line starts with the logging level in capital letters. So I can use a regex anchored to the start of the line, about as optimal as you can get:
First, GNU grep:
$ time /usr/local/bin/ggrep -c "^INFO" application.log
1527786
real 0m0.622s
user 0m0.323s
sys 0m0.292s
Then OS X's native BSD grep:
$ time /usr/bin/grep -c "^INFO" application.log
1527786
real 0m3.588s
user 0m3.206s
sys 0m0.349s
BSD grep did significantly better here than on the prior search, but it is still notably slower than GNU grep.
In my experience this isn't the most pathological example, but it's close. Note that the difference also applies to other tools that run grep under the hood, like zgrep, etc.
I've specifically aliased grep to ggrep in my shell so that I avoid BSD grep whenever I can.
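For reference, a minimal version of that alias (this assumes Homebrew installed GNU grep under the ggrep name), e.g. in ~/.bashrc or ~/.zshrc:

```shell
# Use GNU grep whenever it is available; otherwise keep the system grep.
if command -v ggrep >/dev/null 2>&1; then
  alias grep='ggrep'
fi
```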
All the OS X text tools are painfully slow due to something broken inside their Unicode support.
Try something like tr:
time gtr a b < application.log > /dev/null
time tr a b < application.log > /dev/null
If you set LC_ALL=C it might be a little faster.
$ time gtr a b < http_10x.log > /dev/null
real 0m0.061s
user 0m0.032s
sys 0m0.028s
$ time tr a b < http_10x.log > /dev/null
real 0m2.284s
user 0m2.267s
sys 0m0.014s
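A sketch of the LC_ALL=C trick for tr, using a small generated file as a stand-in for the big log above:

```shell
# Generate a small sample file as a stand-in for the big log.
seq 100000 > sample.log

# Default locale: tr may decode multibyte characters.
time tr a b < sample.log > /dev/null

# Byte-oriented C locale for this one invocation: tr works on raw
# bytes, which is usually noticeably faster on large inputs.
time LC_ALL=C tr a b < sample.log > /dev/null
```

Prefixing the variable assignment like this applies it to that single command only, leaving the rest of the shell session in its normal locale.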
The same LC_ALL=C trick also speeds up GNU tools, sort in particular:
$ shuf /var/log/syslog > shuf-syslog
$ time sort shuf-syslog > /dev/null
real 0m0.536s
user 0m0.870s
sys 0m0.031s
$ time LC_ALL=C sort shuf-syslog > /dev/null
real 0m0.094s
user 0m0.109s
sys 0m0.024s
It's more dramatic with a bigger file, of course.
I once got annoyed with sort's speed, threw together a parallel external sort program that dramatically outperformed it, then realized the difference was not as dramatic with LC_ALL=C...oh well, it was a fun afternoon project anyway.
It's certainly worth considering, but the answer depends on what you're doing with it. If your goal is to binary search over it for an exact string match, for example, the important thing is that your sorting and searching code are consistent. Might as well have both use the simpler/faster LC_ALL=C semantics.
Locale specific sorting is not the same as sorting by codepoint. The Unicode collation algorithm is fairly involved: https://www.unicode.org/reports/tr10/
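You can see the difference on a two-line input: locale-aware collation typically interleaves cases, while LC_ALL=C compares raw byte values (the en_US.UTF-8 locale below is just an example and assumes it's installed):

```shell
# With a UTF-8 locale, collation usually sorts case-insensitively: a, B
printf 'a\nB\n' | LC_ALL=en_US.UTF-8 sort

# With the C locale, sorting is by byte value, so uppercase (B = 0x42)
# sorts before lowercase (a = 0x61): B, a
printf 'a\nB\n' | LC_ALL=C sort
```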
If you're just using sort because you need uniq, you can replace the whole pipeline with an awk command that prints unique lines without sorting. I think it's faster.
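A sketch of that awk approach: it keeps the first occurrence of each line and preserves input order, with no sort required (at the cost of holding all unique lines in memory):

```shell
# seen[$0]++ evaluates to 0 (false) the first time a line appears,
# so !seen[$0]++ triggers the default print action exactly once per
# distinct line.
printf 'a\nb\na\nb\nc\n' | awk '!seen[$0]++'
# prints a, b, c (one per line)
```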
Just don't forget about these aliases, since GNU grep and the built-in grep can vary in what they support. I once lost a head-scratching afternoon when `sed` accepted GNU extensions on the command line but not inside a shell script (aliases aren't expanded in scripts, so the script was running the BSD sed), and it took far too long to figure that out!
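One well-known divergence along those lines is sed's in-place flag: GNU sed treats the backup suffix for -i as optional, while BSD/macOS sed requires an explicit (possibly empty) suffix argument. A small illustration, assuming GNU sed:

```shell
printf 'hello\n' > demo.txt

# GNU sed: -i with no suffix edits the file in place.
sed -i 's/hello/goodbye/' demo.txt

# BSD/macOS sed would need an explicit empty suffix instead:
#   sed -i '' 's/hello/goodbye/' demo.txt

cat demo.txt   # prints "goodbye"
```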
Keep in mind the "BSD grep" in OS X is an ancient version. Some work was done in the last few years to improve it somewhat. Today I'd skip both tools in favor of smarter searchers like ripgrep or ag (the_silver_searcher).
Why wouldn't pinning to start of line help for some patterns/data? You do still have to find the next newline, but you don't have to make other, additional comparisons once ^x is false, for that line.
Or, roughly, grepping for '^x' instead of 'x' should result in less work when the line doesn't start with x: one fewer comparison for each character after the first.
You seem to be thinking about handling a line at a time. In that context, checking for an x at the beginning would be much faster than looking for an x anywhere in the line. However, for anything fast you're going to be handling a large buffer at a time, if not the entire memory-mapped file. In that context, there's no difference between ^x and yx, except that ^x also needs to match at start-of-file (so if anything it should be marginally slower). Finding just x is probably faster than finding yx - longer patterns can be faster when you can skip looking at some fraction of the input (as per TFA) but I expect 2 characters isn't enough to allow that.
Just to elaborate a little on my quick callout. Doing line oriented searches actually requires you to look at every character, if only to find all the line breaks.
Instead, treat the file as a collection of characters and just look for patterns. Which may include the line break character.
That make sense?
Edit: I should also add that you should look for burntsushi's posts. Turns out some of this had become out of date. And largely depends on size of what you are searching.
It's a pity that Apple lets license politics outweigh technological excellence in the software it chooses to ship.
Way back when, the first thing I'd do on any non-GNU host was to install a full GNU userland. I could write a book on my issues with GNU tools, but they are on balance preferable to the alternatives. All IMHO, of course.
In the latest version, if you need to use these commands with their normal names, you can add a "gnubin" directory to your PATH from your bashrc like:
PATH="/usr/local/opt/grep/libexec/gnubin:$PATH"
According to Apple's MacBook Pro marketing page [0], their SSD supports sequential read speeds of up to 3.2GB/s. So this isn't as far-fetched as you imply, even if we're discounting other factors like the filesystem cache.
Yes, I see these specs all the time but I never see them materialize in the real world. Agree about the cache but that's also an easy way to fool yourself when you are measuring performance.
> easy way to fool yourself when you are measuring performance
In this case isn't the "already in RAM" test a more accurate reflection of performance anyway, as we are talking about the performance of grep and not the IO subsystem?
There are many cases where grep's performance won't be bottlenecked by IO, or at least not impacted directly by a bottleneck there. Anywhere when the input is coming from another program, essentially, and even if that task is IO bound it might be sending output to grep much faster than it is consuming input (perhaps gunzip <file | grep searchtext).
And in the case of searching a log file interactively, it's common that you won't run grep on the file just once in a short space of time; you'll run it a few times as you refine your search, so for most of the runs the file will be in cache (assuming it's not too large).
This usage is literally ideal for pretty much any file I/O - it's a straight sequential read. Even spinning rust will achieve >400MB/s on this type of load. Take a look at the sequential burst speeds at QD1, first chart: https://www.anandtech.com/show/13761/the-samsung-970-evo-plu...
Nearly every SSD listed achieves well over 1 GB/s in an actual benchmark, not just on a spec sheet. And these are just boring, off-the-shelf consumer drives. Nothing crazy.
HDDs almost never sustain 400 MB/s unless you're talking about something pretty exotic. 5400 RPM drives are generally in the 100-130 MB/s range, and 7200 RPM drives are proportionally faster but still usually under 200 MB/s.
On my puny laptop with an SSD I get ~400MB/s from cold cache, and ~5GB/s after the first run. So the answer is likely "it's in the FS cache".
That's a very common use-case with grep. Either grepping a file you recently wrote, or running grep multiple times as you refine the regex, at which point the files will be in the FS cache.
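That effect is easy to reproduce: run the same grep twice in a row and the second run is served from the page cache (the generated file here is just a stand-in for a real log):

```shell
# Build a stand-in log file.
seq 1000000 | sed 's/^/INFO line /' > app-sample.log

# First run: data may still have to come off disk.
time grep -c "INFO" app-sample.log

# Immediate repeat: the file is now in the page cache, so this run
# measures grep itself rather than the IO subsystem.
time grep -c "INFO" app-sample.log
```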
This would not help, since the backing storage doesn't provide support for this kind of resolution. It would end up reading in the entire file anyways, unless your input string is on the order of an actual block.
Sure it would help: not with the IO part, but with the CPU-bound part of actually checking each character, which is apparently the tighter bottleneck in this case.
Yeah, that's why the article talks about decreasing the amount of CPU work. From the context of disk IO though (which is what this thread seems to be about) this can't help.
Loading the file from disk into memory doesn’t require reads by the CPU. That’s significantly different than the cpu doing comparisons (or even reads) on each byte.
This is why the linked article specifically says it does not look for newlines:
> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
It can whenever you don't ask for line numbers, can't it?
> It probably looks for newlines after a match is found
Probably, yeah. Counting number of newlines in a range, without caring just where they fall, can probably be pretty darned fast with vector instructions. Idk if that's worth the additional pass, though.
It doesn't need to look for newlines. That is looking for ^ABCD is not harder than looking for ABCD. The set of matches for the former is a subset of the latter, so if you have some fast way of doing the latter you have a fast way of doing the former (with an additional check for the preceding newline).
Another way of looking at it is to treat the ^ as just another character (plus one-off special handling for the start of the file).
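The subset relationship is easy to check on a toy input with grep -c, which counts matching lines; every line that matches the anchored pattern also matches the unanchored one:

```shell
printf 'ABCD first\nsee ABCD\nno match\n' > toy.txt

# Unanchored: both lines containing ABCD count.
grep -c 'ABCD' toy.txt    # prints 2

# Anchored: only the line that starts with ABCD counts.
grep -c '^ABCD' toy.txt   # prints 1
```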
Others have mentioned the SSD and VFS cache, however spinning disks in a RAID configuration can easily surpass this in raw sequential read performance.