Faster Mac dev tools with custom allocators (eisel.me)
113 points by meisel on Nov 1, 2021 | hide | past | favorite | 33 comments


Pathfinder got enormous benefits (3x performance difference as I recall?) switching from Apple's default allocator to jemalloc. Apple's allocator has not kept pace with the competition, especially for multithreaded workloads.


I saw similar results in malloc()-heavy benchmarks. libc on macOS seems to have a fair number of performance traps; as another example, I found that timegm() on macOS is 100x slower than Linux, and 1000x slower than a reasonably optimized standalone algorithm: https://blog.reverberate.org/2020/05/12/optimizing-date-algo...


Yeah, related to timegm(), I made this to fix a performance issue: https://github.com/michaeleisel/JJLISO8601DateFormatter


What’s stopping Apple from switching its default allocator?


I can't speak to Apple's reasons specifically, but usually it comes down to:

1. Degenerate cases with new allocator designs, e.g. being better for some workloads and not others

2. Bug-for-bug compatibility - applications which break due to dependencies on undocumented behavior or the old allocator's memory structures.

3. Boundary conflicts - systems where an allocator change would mean allocation and free are hitting different implementations across module boundaries, as one module allocates memory for another to consume. Some systems and programming languages are more vulnerable to this sort of issue.


A fourth one would be debugging support. They have a bunch of stuff in there (perhaps dated these days) to auto scribble on malloc/free, allocate guard pages etc. They probably could/should look into integrating a more modern allocator (mimalloc, jemalloc) and falling back to the old one when those features are needed (assuming they can’t bring them forward).


MallocScribble is available as "opt.junk" in jemalloc [1]. As for guard pages, tcmalloc has TCMALLOC_PAGE_FENCE [2], and there is an issue [3] in jemalloc.

In any case, Apple and others have invested hugely in LLVM AddressSanitizer, so the Electric Fence-like malloc debugging features are considered more of a last resort these days.

[1]: http://jemalloc.net/jemalloc.3.html

[2]: https://chromium.googlesource.com/external/gperftools/+/gper...

[3]: https://github.com/jemalloc/jemalloc/issues/1664


When I worked at Apple they were the first resort, as they were usually cheaper to iterate with (no recompilation, lower overhead, etc.), but ASan has only gotten better and was the primary focus. I agree that jemalloc might be a good drop-in replacement, but it just might not be a priority for the libc maintainers. Lots of low-hanging fruit can be missed for very long periods of time because you have to pick what to focus on.

GWP-ASan from TCMalloc might be the better direction for runtime detection of memory corruption from malloc, as it gives a lot of ASan-like protection with an Electric Fence-like performance profile (and can be turned on/off at runtime).


Yeah, but ASan requires recompilation. GuardMalloc and MallocScribble apply to your whole process, and so even malloc()s called by system frameworks (ex AppKit) get checked. I've found several OS bugs this way over the years. Maybe internally Apple has ASan builds of the whole OS, but they don't distribute it externally.


Security-critical components like the kernel and WebKit have ASan variants, that’s for sure. But they also use custom allocators.


Valgrind works OK on macOS, right?


According to release notes, it supports 10.12 (with preliminary support for 10.13). That's several releases behind (10.14, 10.15, 11, 12).


Homebrew won't even install valgrind anymore.


it hasn't for years afaik


Apple has a whole set of heap debugging tools that they use internally and are quite useful even as a third party developer, and I’m sure they rely on having certain heap metadata around so they can reconstruct a memory graph after the fact. Take a look at heap(1) and leaks(1) to see what they offer.


I don't think 3. is an issue for Apple, because there is no static libSystem.


My understanding is that Apple really cares about memory footprint, since they ship hundreds of daemons on machines that often don’t have much memory to spare. So IMO their allocator prioritizes that over, say, throughput, which might annoy app developers but isn’t necessarily a bad compromise. (In places where Apple has different needs, e.g. WebKit, different allocators are used.)


There’s more to measure with allocators than how fast they can pump out memory. You’d also want to check memory overhead, heap fragmentation, and various security considerations.


I doubt there is any particularly compelling reason other than not having yet paid down that piece of technical debt.


Custom allocators have always been available for specialized needs; even the JDK ships with a number of options, each of which can be tuned further. I wrote a fast and safe memory allocator for the old MacOS that was briefly popular before Mac OS X appeared and obsoleted it. Having built such a beast (and written the tons of test apps you need to ensure it works under all conditions), I can say there is always room to optimize for needs that you can't serve in a generalized allocator. Like everything, you can't optimize for all cases and still be good enough for the average case.


I'd be really interested to see benchmarks of Apple's allocator versus others, both in memory consumption and performance. Sometimes, Apple's version of things is worse in almost all use cases, but I'll reserve judgment and wait to see numbers.


We use a modified (by the author) jemalloc in a heavily threaded server that runs on CentOS.

The reason was due to memory fragmentation over time and jemalloc reduced this to a level that we could live with.

We’ve tried other allocators over the years, but jemalloc - or our version of it, to be precise - is still the winner in memory usage and fragmentation.

Edit: performance-wise, gains can be had by doing your own memory pools on a per-thread basis and making use of stack objects rather than heap objects. Making larger allocations for your own object pools and managing those yourself reduces the calls to malloc, which also helps reduce memory fragmentation and keeps your vmsize from diverging from your rss.


Just to add… std::vector can lead to fragmentation in heavily threaded applications. We found std::deque solved that issue.


Has anyone given mimalloc, Microsoft's allocator (MIT license), a try?

It appears to benchmark better than even jemalloc and others.

https://github.com/microsoft/mimalloc#Performance


mimalloc is briefly mentioned in my article, and I was surprised that its mechanism for swapping itself in worked on the Mac. When I measured, I found its performance to be roughly the same as jemalloc's.


For even more bang, rebuild LLVM with your custom allocator, profile guidance, and link-time optimization. If you're planning to invoke the compiler a million times, you might as well get it peak optimized.


I'd love to read a detailed walkthrough about this!



Yeah, building Swift is not straightforward. There are several repos that need to be checked out and coordinated in lockstep. I usually start with the "swift/release/xxx" branches in llvm / cmark / swift, and call this to find what I am missing: https://github.com/apple/swift/blob/main/utils/build-script


Mimalloc improved Bun’s performance on macOS by 10%.


Are there any official benchmarks for Swift compilation times? I’m curious to see whether they’ve been going up or down with the latest releases.


Perhaps there's a sandbox escape hiding in the workaround they did to get swiftc to use their jemalloc build?


It’s a build variable in Xcode telling it which compiler to use. At that point you have arbitrary code execution anyways as you control the build process.



