It's probably worth pointing out that OS X is using BSD grep. You'll see a phenomenal difference if you install GNU grep via Homebrew:
$ brew info grep
GNU grep, egrep and fgrep
https://www.gnu.org/software/grep/
/usr/local/Cellar/grep/3.3 (21 files, 885.3KB) *
Poured from bottle on 2019-01-03 at 12:23:37
From: https://github.com/Homebrew/homebrew-core/blob/master/Formula/grep.rb
...
This is a fairly arbitrary example, but I've got a 1.5 GB application log sitting on my machine right now that'll suffice. I'll even do this in a way that might give a slight performance advantage to BSD grep, by running GNU grep first (so BSD grep benefits from a warm file cache):
$ time /usr/local/bin/ggrep "foobarbaz" application.log
real 0m1.319s
user 0m0.948s
sys 0m0.345s
vs. OS X's grep:
$ time /usr/bin/grep "foobarbaz" application.log
real 0m37.225s
user 0m31.036s
sys 0m1.286s
1s vs 37s. Same file.
There's nothing odd about the file: straight, boring application logs, and each line starts with the logging level in capital letters. So I can use a regex anchored to the start of the line, about as optimal as you can get:
First, GNU grep:
$ time /usr/local/bin/ggrep -c "^INFO" application.log
1527786
real 0m0.622s
user 0m0.323s
sys 0m0.292s
Then OS X's native BSD grep:
$ time /usr/bin/grep -c "^INFO" application.log
1527786
real 0m3.588s
user 0m3.206s
sys 0m0.349s
BSD grep did significantly better here than on the prior search, but it is still notably slower than GNU grep.
In my experience this isn't the most pathological example, but it's close. Note that the difference also applies to other tools that run grep under the hood, like zgrep, etc.
I've specifically aliased grep to ggrep in my shell so that I avoid BSD grep whenever I can.
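For reference, a minimal version of that alias (this assumes Homebrew installed GNU grep under the ggrep name), e.g. in ~/.bashrc or ~/.zshrc:

```shell
# Use GNU grep whenever it is available; otherwise keep the system grep.
if command -v ggrep >/dev/null 2>&1; then
  alias grep='ggrep'
fi
```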
All the OS X text tools are painfully slow due to something broken inside their Unicode support.
Try something like tr:
time gtr a b < application.log > /dev/null
time tr a b < application.log > /dev/null
If you set LC_ALL=C it might be a little faster.
$ time gtr a b < http_10x.log > /dev/null
real 0m0.061s
user 0m0.032s
sys 0m0.028s
$ time tr a b < http_10x.log > /dev/null
real 0m2.284s
user 0m2.267s
sys 0m0.014s
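A sketch of the LC_ALL=C trick for tr, using a small generated file as a stand-in for the big log above:

```shell
# Generate a small sample file as a stand-in for the big log.
seq 100000 > sample.log

# Default locale: tr may decode multibyte characters.
time tr a b < sample.log > /dev/null

# Byte-oriented C locale for this one invocation: tr works on raw
# bytes, which is usually noticeably faster on large inputs.
time LC_ALL=C tr a b < sample.log > /dev/null
```

Prefixing the variable assignment like this applies it to that single command only, leaving the rest of the shell session in its normal locale.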
The same LC_ALL=C trick also speeds up GNU tools, sort in particular:
$ shuf /var/log/syslog > shuf-syslog
$ time sort shuf-syslog > /dev/null
real 0m0.536s
user 0m0.870s
sys 0m0.031s
$ time LC_ALL=C sort shuf-syslog > /dev/null
real 0m0.094s
user 0m0.109s
sys 0m0.024s
It's more dramatic with a bigger file, of course.
I once got annoyed with sort's speed, threw together a parallel external sort program that dramatically outperformed it, then realized the difference was not as dramatic with LC_ALL=C...oh well, it was a fun afternoon project anyway.
It's certainly worth considering, but the answer depends on what you're doing with it. If your goal is to binary search over it for an exact string match, for example, the important thing is that your sorting and searching code are consistent. Might as well have both use the simpler/faster LC_ALL=C semantics.
Locale specific sorting is not the same as sorting by codepoint. The Unicode collation algorithm is fairly involved: https://www.unicode.org/reports/tr10/
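You can see the difference on a two-line input: locale-aware collation typically interleaves cases, while LC_ALL=C compares raw byte values (the en_US.UTF-8 locale below is just an example and assumes it's installed):

```shell
# With a UTF-8 locale, collation usually sorts case-insensitively: a, B
printf 'a\nB\n' | LC_ALL=en_US.UTF-8 sort

# With the C locale, sorting is by byte value, so uppercase (B = 0x42)
# sorts before lowercase (a = 0x61): B, a
printf 'a\nB\n' | LC_ALL=C sort
```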
If you're just using sort because you need uniq, you can replace the whole pipeline with an awk command that prints unique lines without sorting. I think it's faster.
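A sketch of that awk approach: it keeps the first occurrence of each line and preserves input order, with no sort required (at the cost of holding all unique lines in memory):

```shell
# seen[$0]++ evaluates to 0 (false) the first time a line appears,
# so !seen[$0]++ triggers the default print action exactly once per
# distinct line.
printf 'a\nb\na\nb\nc\n' | awk '!seen[$0]++'
# prints a, b, c (one per line)
```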
Just don't forget about these aliases, since GNU grep and the built-in grep can vary in what they support. I once lost a head-scratching afternoon when `sed` accepted GNU extensions on the command line but not inside a shell script (aliases aren't expanded in scripts, so the script was running the BSD sed), and it took far too long to figure that out!
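One well-known divergence along those lines is sed's in-place flag: GNU sed treats the backup suffix for -i as optional, while BSD/macOS sed requires an explicit (possibly empty) suffix argument. A small illustration, assuming GNU sed:

```shell
printf 'hello\n' > demo.txt

# GNU sed: -i with no suffix edits the file in place.
sed -i 's/hello/goodbye/' demo.txt

# BSD/macOS sed would need an explicit empty suffix instead:
#   sed -i '' 's/hello/goodbye/' demo.txt

cat demo.txt   # prints "goodbye"
```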
Keep in mind the "BSD grep" in OS X is an ancient version. Some work was done in the last few years to improve it somewhat. Today I'd skip both tools in favor of smarter searchers like ripgrep or ag (the_silver_searcher).
Why wouldn't pinning to start of line help for some patterns/data? You do still have to find the next newline, but you don't have to make other, additional comparisons once ^x is false, for that line.
Or, roughly, grepping for '^x' instead of 'x' should result in less work when the line doesn't start with x: one fewer comparison for each character after the first.
You seem to be thinking about handling a line at a time. In that context, checking for an x at the beginning would be much faster than looking for an x anywhere in the line. However, for anything fast you're going to be handling a large buffer at a time, if not the entire memory-mapped file. In that context, there's no difference between ^x and yx, except that ^x also needs to match at start-of-file (so if anything it should be marginally slower). Finding just x is probably faster than finding yx - longer patterns can be faster when you can skip looking at some fraction of the input (as per TFA) but I expect 2 characters isn't enough to allow that.
Just to elaborate a little on my quick callout. Doing line oriented searches actually requires you to look at every character, if only to find all the line breaks.
Instead, treat the file as a collection of characters and just look for patterns. Which may include the line break character.
That make sense?
Edit: I should also add that you should look for burntsushi's posts. Turns out some of this had become out of date. And largely depends on size of what you are searching.
It's a pity that Apple lets license politics outweigh technological excellence in the software it chooses to ship.
Way back when, the first thing I'd do on any non-GNU host was to install a full GNU userland. I could write a book on my issues with GNU tools, but they are on balance preferable to the alternatives. All IMHO, of course.
In the latest version, if you need to use these commands with their normal names, you can add a "gnubin" directory to your PATH from your bashrc like:
PATH="/usr/local/opt/grep/libexec/gnubin:$PATH"
According to Apple's MacBook Pro marketing page [0], their SSD supports sequential read speeds of up to 3.2GB/s. So this isn't as far-fetched as you imply, even if we're discounting other factors like the filesystem cache.
Yes, I see these specs all the time but I never see them materialize in the real world. Agree about the cache but that's also an easy way to fool yourself when you are measuring performance.
> easy way to fool yourself when you are measuring performance
In this case isn't the "already in RAM" test a more accurate reflection of performance anyway, as we are talking about the performance of grep and not the IO subsystem?
There are many cases where grep's performance won't be bottlenecked by IO, or at least not impacted directly by a bottleneck there. Anywhere when the input is coming from another program, essentially, and even if that task is IO bound it might be sending output to grep much faster than it is consuming input (perhaps gunzip <file | grep searchtext).
And in the case of searching a log file interactively, it's common that you won't run grep on the file just once in a short space of time; you'll run it a few times as you refine your search, so for most of the runs the file will be in cache (assuming it's not too large).
This usage is literally ideal for pretty much any file I/O - it's a straight sequential read. Even spinning rust will achieve >400MB/s on this type of load. Take a look at the sequential burst speeds at QD1, first chart: https://www.anandtech.com/show/13761/the-samsung-970-evo-plu...
Nearly every SSD listed achieves well over 1 GB/s in an actual benchmark, not just on a spec sheet. And these are just boring, off-the-shelf consumer drives. Nothing crazy.
HDDs almost never sustain 400 MB/s unless you're talking about something pretty exotic. 5400 RPM drives are generally in the 100-130 MB/s range, and 7200 RPM drives are proportionally faster but still usually under 200 MB/s.
On my puny laptop with an SSD I get ~400MB/s from cold cache, and ~5GB/s after the first run. So the answer is likely "it's in the FS cache".
That's a very common use-case with grep. Either grepping a file you recently wrote, or running grep multiple times as you refine the regex, at which point the files will be in the FS cache.
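That effect is easy to reproduce: run the same grep twice in a row and the second run is served from the page cache (the generated file here is just a stand-in for a real log):

```shell
# Build a stand-in log file.
seq 1000000 | sed 's/^/INFO line /' > app-sample.log

# First run: data may still have to come off disk.
time grep -c "INFO" app-sample.log

# Immediate repeat: the file is now in the page cache, so this run
# measures grep itself rather than the IO subsystem.
time grep -c "INFO" app-sample.log
```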
This would not help, since the backing storage doesn't provide support for this kind of resolution. It would end up reading in the entire file anyways, unless your input string is on the order of an actual block.
Sure it would help: not with the IO part, but with the CPU-bound part of actually checking each character, which is apparently the tighter bottleneck in this case.
Yeah, that's why the article talks about decreasing the amount of CPU work. From the context of disk IO though (which is what this thread seems to be about) this can't help.
Loading the file from disk into memory doesn’t require reads by the CPU. That’s significantly different than the cpu doing comparisons (or even reads) on each byte.
This is why the linked article specifically says it does not look for newlines:
> Moreover, GNU grep AVOIDS BREAKING THE INPUT INTO LINES. Looking for newlines would slow grep down by a factor of several times, because to find the newlines it would have to look at every byte!
It can whenever you don't ask for line numbers, can't it?
> It probably looks for newlines after a match is found
Probably, yeah. Counting number of newlines in a range, without caring just where they fall, can probably be pretty darned fast with vector instructions. Idk if that's worth the additional pass, though.
It doesn't need to look for newlines. That is looking for ^ABCD is not harder than looking for ABCD. The set of matches for the former is a subset of the latter, so if you have some fast way of doing the latter you have a fast way of doing the former (with an additional check for the preceding newline).
Another way of looking at it is to treat the ^ as just another character (plus one-off special handling for the start of the file).
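The subset relationship is easy to check on a toy input with grep -c, which counts matching lines; every line that matches the anchored pattern also matches the unanchored one:

```shell
printf 'ABCD first\nsee ABCD\nno match\n' > toy.txt

# Unanchored: both lines containing ABCD count.
grep -c 'ABCD' toy.txt    # prints 2

# Anchored: only the line that starts with ABCD counts.
grep -c '^ABCD' toy.txt   # prints 1
```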
Others have mentioned the SSD and VFS cache, however spinning disks in a RAID configuration can easily surpass this in raw sequential read performance.