Pathfinder saw enormous benefits (a 3x performance difference, as I recall?) from switching from Apple's default allocator to jemalloc. Apple's allocator hasn't kept pace with the competition, especially for multithreaded workloads.
I saw similar results in malloc()-heavy benchmarks. libc on macOS seems to have a fair number of performance traps; as another example, I found that timegm() on macOS is 100x slower than Linux, and 1000x slower than a reasonably optimized standalone algorithm: https://blog.reverberate.org/2020/05/12/optimizing-date-algo...
I can't speak to Apple's reasons specifically, but usually it comes down to:
1. Degenerate cases with new allocator designs, e.g. a design that is better for some workloads but worse for others
2. Bug-for-bug compatibility - applications which break due to dependencies on undocumented behavior or on the old allocator's memory structures.
3. Boundary conflicts - systems where an allocator change would mean allocation and free are hitting different implementations across module boundaries, as one module allocates memory for another to consume. Some systems and programming languages are more vulnerable to this sort of issue.
A fourth one would be debugging support. They have a bunch of stuff in there (perhaps dated these days) to automatically scribble on malloc/free, allocate guard pages, etc. They probably could/should look into integrating a more modern allocator (mimalloc, jemalloc) and falling back to the old one when those features are needed (assuming they can’t bring them forward).
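For reference, those features are switched on per-process through environment variables, with no recompilation (a usage sketch; `my_app` is a placeholder binary, and `man malloc` / `man libgmalloc` on macOS document the full variable list):

```shell
# MallocScribble fills freed memory with 0x55, MallocPreScribble fills
# newly allocated memory with 0xAA, to catch use of freed or
# uninitialized data:
MallocScribble=1 MallocPreScribble=1 ./my_app

# Guard Malloc: each allocation gets its own page(s) with guard pages
# around it, so buffer overruns fault immediately:
DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib ./my_app
```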
MallocScribble is available as "opt.junk" in jemalloc [1]. As for guard pages, tcmalloc has TCMALLOC_PAGE_FENCE [2], and there is an issue [3] in jemalloc.
In any case, Apple and others have invested hugely in LLVM AddressSanitizer, so the Electric Fence-like malloc debugging features are considered more of a last resort these days.
When I worked at Apple they were the first resort, as they were usually cheaper to iterate with (no recompilation, lower overhead, etc), but ASAN has only gotten better & was the primary focus. I agree that jemalloc might be a good drop-in replacement, but it just might not be a priority for the libc maintainers. Lots of low hanging fruit can be missed for very long periods of time because you have to pick what to focus on.
GWP-ASan from TCMalloc might be the better direction for runtime detection of memory corruption from malloc, as it gives a lot of ASan-like protection with an Electric Fence-like performance profile (& can be turned on/off at runtime).
Yeah, but ASan requires recompilation. GuardMalloc and MallocScribble apply to your whole process, so even malloc()s called by system frameworks (e.g. AppKit) get checked. I've found several OS bugs this way over the years. Maybe internally Apple has ASan builds of the whole OS, but they don't distribute them externally.
Apple has a whole set of heap debugging tools that they use internally and are quite useful even as a third party developer, and I’m sure they rely on having certain heap metadata around so they can reconstruct a memory graph after the fact. Take a look at heap(1) and leaks(1) to see what they offer.
My understanding is that Apple really cares about memory footprint, since they ship hundreds of daemons on machines that often don’t have much memory to spare. So IMO their allocator prioritizes that over, say, throughput, which might annoy app developers but isn’t necessarily a bad compromise. (In places where Apple has different needs, e.g. WebKit, different allocators are used.)
There’s more to measure with allocators than how fast it can pump out memory. You’d probably want to check for memory overhead and heap fragmentation and various security considerations.
Custom allocators have always been available for specialized needs, even the JDK ships with a number of options which each can be tuned further. I wrote a fast and safe memory allocator for the old MacOS that was briefly popular before MacOSX appeared which obsoleted it. But having built such a beast (and written the tons of test apps you need to ensure it works under all conditions) there is always room to optimize for needs that you can't employ in a generalized allocator. Like everything, you can't optimize for all cases and still be good enough for the average case.
I'd be really interested to see benchmarks of Apple's allocator versus others, both in memory consumption and performance. Sometimes, Apple's version of things is worse in almost all use cases, but I'll reserve judgment and wait to see numbers.
We use a modified (by its author) jemalloc in a heavily threaded server that runs on CentOS.
The reason was memory fragmentation over time, and jemalloc reduced this to a level that we could live with.
We’ve tried other allocators over the years, but jemalloc, or our version of it to be precise, is still the winner in memory usage and fragmentation.
Edit: performance wise, gains can be had by doing your own memory pools on a per-thread basis and making use of stack objects rather than heap objects. Making larger allocations for your own object pools and managing those yourself reduces the calls to malloc, and also helps reduce memory fragmentation and keeps your vmsize from diverging from your rss.
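The per-thread pool idea above can be sketched as a thread-local bump arena (a minimal sketch with hypothetical names; a production pool would chain additional blocks instead of failing when one fills up, and would handle oversized requests separately):

```c
#include <stdlib.h>

/* Hypothetical per-thread bump arena: one large malloc per thread,
 * small objects carved out of it with no locking, and everything
 * released at once with a reset. */
#define ARENA_BYTES (64 * 1024)

static _Thread_local char  *arena_base = NULL;  /* this thread's block */
static _Thread_local size_t arena_used = 0;

void *arena_alloc(size_t n) {
    n = (n + 15) & ~(size_t)15;               /* keep 16-byte alignment */
    if (arena_base == NULL)
        arena_base = malloc(ARENA_BYTES);     /* the one big call into malloc */
    if (arena_base == NULL || arena_used + n > ARENA_BYTES)
        return NULL;                          /* a real pool would chain a new block */
    void *p = arena_base + arena_used;
    arena_used += n;
    return p;
}

/* "Free" every object this thread allocated, in O(1). */
void arena_reset(void) { arena_used = 0; }
```

Since each thread only ever touches its own arena, there is no lock contention, and freeing is a single store, which is where the throughput and fragmentation wins come from.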
mimalloc is briefly mentioned in my article, and I was surprised that it worked on the mac for swapping itself in. I found its performance to be roughly the same as jemalloc's when I measured.
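The swap-in works through dynamic interposition (a usage sketch; the dylib path and `my_benchmark` are placeholders, assuming you've built mimalloc's shared library from its repo):

```shell
# Interpose mimalloc over the system allocator on macOS for a single
# run, no recompilation of the target program needed:
DYLD_INSERT_LIBRARIES=/path/to/libmimalloc.dylib ./my_benchmark
```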
For even more bang, rebuild LLVM with your custom allocator, profile guidance, and link-time optimization. If you're planning to invoke the compiler a million times, you might as well get it peak optimized.
Yeah, building Swift is not straightforward. There are several repos that need to be checked out and coordinated in lock-step. I usually start with the "swift/release/xxx" branches in llvm / cmark / swift, and call this to find what I am missing: https://github.com/apple/swift/blob/main/utils/build-script
It’s a build variable in Xcode telling it which compiler to use. At that point you have arbitrary code execution anyway, as you control the build process.