SurPlus
A fix for the race condition that MacOS 11.3+ exhibits on unsupported Macs
26 September 2021
It took me six months and a lot of sleepless nights, and I ended up having to write my own debugger to get the answers that I needed, but I finally found the source of the race condition that's been plaguing older Mac Pros (and others) using Big Sur 11.3+ and Monterey. I still don't know why this problem doesn't seem to appear on newer systems; my focus has been on finding the problem and creating a solution, and the question of newer systems' success seems like an academic exercise for someone with more time on their hands. (Thanks to insight from @vit9696, it seems that newer Macs don't suffer from this problem because those CPUs support the rdrand instruction, meaning they don't need floating-point access during early boot.)
The patch posted here is intended to be incorporated into an OpenCore config.plist file. If you need assistance with this task, please use one (or more) of the following resources:
- The SurPlus thread on MacRumors
- The OpenCore documentation
- The OpenCore thread on MacRumors
- The OpenCore Legacy Patcher (OCLP) site
- The OpenCore site
The patch itself appears at the bottom of this page, in a format suitable for cutting-and-pasting into the appropriate location in an OpenCore config.plist file.
Questions, comments, or discussion about the "race condition" bug or this patch should be directed to the SurPlus thread on MacRumors.
If this information or the patch extends the usable life of your classic Mac, or you find other value in them, donations are gratefully accepted. (To be clear, I am not a tax-exempt organization, charity, or non-profit, and there is no tax benefit to you should you choose to donate - just my heartfelt thanks.)
<b>Update 30oct21</b> - Monterey 12.1b1 was seeded to developers this week, and adds a new wrinkle to this ongoing mess. See this MacRumors thread for more details (a writeup similar to this one will eventually appear here, either as a separate README or a separate repository). Also, I have updated the MaxKernel values in the enclosed SurPlus patches to 21.2.0, allowing them to be applied to 12.1b1.
<b>Update 20dec21</b> - It appears that the released version of Monterey 12.1 does not require the SurPlus patch. While this may change in a subsequent release, it's certainly welcome news. Also, note that the MonteRand patch referenced in my 30oct21 update is only required for 12.1b1; other betas and releases do not exhibit the problem that MonteRand addresses.
<h2>The problem and solution for non-programmers:</h2>
MacOS consists of many separate parts working together. Two of those parts, cryptography and kernel memory management, are interdependent (each makes calls to the other). Starting with 11.3, there is a circular dependency between them - crypto needs something from memory management, and that same part of memory management needs something from crypto. If that dependency is encountered at the wrong time during boot, a deadlock occurs - crypto is waiting on memory management, and memory management is waiting on crypto. The boot process is then hung until the system is forcibly stopped.
There are several possible solutions, some better than others. I've refined the solution I'm now posting to where it only modifies three bytes of code, and it doesn't impair the functionality or security of the system.
<h2>The problem and solution for more technical folks:</h2>
First, a bit of relevant background:
- Intel does something interesting with floating point. "Floating point" (FP) includes not only the obvious (floating point math), but also all of the vector and SIMD instructions (SSE, AVX, basically anything that touches the XMM/YMM/ZMM registers). When an operating system does a context switch (where the CPU pauses execution of one thread to continue execution of another), it has to save and restore all the registers so nothing gets lost. When you add in all the SIMD registers, that's a lot of overhead - so Intel lets the OS assume that FP is unlikely to happen, and the CPU will throw a #NM exception if any FP operation occurs. That way, context switches can just deal with the "normal" registers and ignore the SIMD registers unless a thread actually uses them; if an FP operation occurs, the system catches the #NM exception, allocates some memory to save the SIMD registers, and saves/restores those registers when switching contexts for that thread.
- Since at least El Capitan (and probably earlier), the MacOS kernel has utilized "zone allocation" (zalloc) for internal dynamic memory allocation. zalloc has evolved slowly over the years, but it changed rapidly during Big Sur's various releases; Apple added both refinements and security. While earlier versions of zalloc simply worked with contiguous chunks of memory, in 11.3 Apple introduced random gaps in the memory pages (presumably for security reasons, although it's a bit unclear to me what vulnerabilities this scheme mitigates - if an attacker has access low enough for that to matter, you've already lost the battle).
- The corecrypto kext handles most cryptography tasks for MacOS, including random number generation. Like many other kexts, it tests the hardware it's running on and uses the most advanced instructions the hardware will support (e.g. AVX2 on newer machines, AVX1 or AES-NI or SSE3 on older machines). I was surprised at the extent to which corecrypto is used - many/most MacOS subsystems touch corecrypto at some point.
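The lazy floating-point bookkeeping described in the first bullet can be pictured with a toy model. This is only a sketch in Python, not kernel code: the class names, the 512-byte save-area size, and the allocator callback are all assumptions made for illustration.

```python
class Thread:
    """Toy thread: no SIMD save area exists until the thread first uses FP."""

    def __init__(self, name):
        self.name = name
        self.fp_save_area = None


class ToyCPU:
    """Toy CPU that counts the #NM-style faults it raises."""

    def __init__(self, allocate):
        self.allocate = allocate  # stand-in for the kernel allocator (assumption)
        self.nm_faults = 0

    def fp_instruction(self, thread):
        if thread.fp_save_area is None:
            # Models the #NM fault: the handler allocates the save area once.
            self.nm_faults += 1
            thread.fp_save_area = self.allocate(512)  # illustrative size
        return "executed"

    def context_switch_cost(self, thread):
        # Only threads that have touched FP pay for the SIMD registers.
        return "general+simd" if thread.fp_save_area is not None else "general"


cpu = ToyCPU(allocate=bytearray)
worker = Thread("worker")
print(cpu.context_switch_cost(worker))  # before any FP use
cpu.fp_instruction(worker)
cpu.fp_instruction(worker)
print(cpu.nm_faults, cpu.context_switch_cost(worker))
```

The point to notice is that the first FP instruction on a thread is the one that needs memory from the allocator, which is what matters for the rest of this writeup.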
OK, so here's the gist of the problem. At startup, MacOS launches multiple threads to initialize the system - to discover and configure the hardware, etc. At some point, a thread will call IOLockAlloc() (to allocate a lock group for IOKit), which will in turn call zalloc to allocate memory. At this early stage of booting, zalloc has not yet initialized, so it does that now. Remember those random memory gaps I mentioned? zalloc initialization needs random numbers to make the gaps random, so it calls early_random() to get some random numbers.
So far, so good. Now, at this point in the boot, corecrypto may or may not have been initialized yet. If it has not yet been initialized, early_random() will just use its own SHA1 random number generator. This is the "success" case, because that generator doesn't use any floating point instructions, so early_random() will return a random number, zalloc will finish initializing, and everybody's happy.
Unfortunately, corecrypto's initialization code is short and sweet, so most of the time, corecrypto is already initialized when that call to early_random() occurs. In that case, early_random() hands the request to corecrypto, which chooses the best instructions to use (on the Mac Pro 5,1-era machines, that's SSE3 and AES-NI, both of which are floating-point as far as the CPU is concerned). corecrypto then acquires a thread lock (the necessity of which is not entirely clear to me), then it starts generating the random number by executing a floating point/SIMD instruction - which throws a #NM exception so MacOS knows to keep track of the SIMD registers. That exception calls an OS routine called fpnoextflt(), which (among other things) allocates a buffer to save the SIMD registers for this thread. Where does it get that buffer? zalloc, of course.
As you'll recall, the only reason we're executing anything in corecrypto at all is because zalloc needed some random numbers to initialize itself. Therefore, when fpnoextflt() wants to allocate a buffer, zalloc is still uninitialized, and it tries to request some random numbers from corecrypto. (Sound familiar?)
This unhappy set of circumstances would lead to an infinite loop of zalloc requesting random numbers and corecrypto's FP instructions triggering #NM exceptions that call zalloc, except for the corecrypto thread lock I mentioned. The second time through, corecrypto tries to acquire that same lock again. The lock mechanism doesn't care that the same thread already holds that lock; it just knows that the lock is held and that the current request has to wait for it to be released, which effectively puts the thread to sleep while it waits for a lock it can never acquire.
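The self-deadlock on a non-reentrant lock is easy to reproduce in miniature. The sketch below uses Python's threading locks purely as an analogy (it says nothing about how the corecrypto lock is actually implemented), and it uses a timeout so the demonstration returns instead of hanging the way the kernel does.

```python
import threading


def reacquire(lock, timeout=0.1):
    """Take `lock`, then try to take it again from the same thread."""
    with lock:
        got_it = lock.acquire(timeout=timeout)
        if got_it:
            lock.release()
        return got_it


# A plain lock does not care who holds it: the second acquire just waits.
plain = reacquire(threading.Lock())
# A reentrant lock recognizes its owner and lets the same thread back in.
reentrant = reacquire(threading.RLock())
print(plain, reentrant)
```

Without the timeout, the plain-lock case would wait forever, which is the same shape as the hang described above.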
The "race" here is between the zalloc and corecrypto subsystems getting initialized (or, more accurately, between zalloc getting initialized and the execution of any SIMD or floating-point instruction, which is most likely to occur in corecrypto). If a floating-point/SIMD instruction occurs at any point before an attempt to use zalloc, fpnoextflt() will invoke zalloc, which will invoke corecrypto, which will deadlock (as described above). If no floating-point/SIMD instructions occur before zalloc is invoked/initialized, the boot proceeds normally, and everybody's happy. For all the current problematic MacOS versions (11.3-11.6 and every 12.x Monterey beta to date), the solution is to delay using corecrypto until the zalloc subsystem is initialized.
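The whole chain can be condensed into a small simulation. This is a toy model in Python, not a reflection of the real kernel data structures: the class, its flags, and the `delay_corecrypto` switch are hypothetical names, and the deadlock is modeled as an exception so the script terminates.

```python
class Deadlock(Exception):
    """Raised where the real kernel would sleep forever on the held lock."""


class ToyKernel:
    def __init__(self, corecrypto_ready, delay_corecrypto=False):
        self.corecrypto_ready = corecrypto_ready
        self.delay_corecrypto = delay_corecrypto  # models the fix (assumption)
        self.zalloc_ready = False
        self.crypto_lock_held = False
        self.fp_state_allocated = False

    def zalloc(self):
        if not self.zalloc_ready:
            self.early_random()  # the random gaps need random numbers
            self.zalloc_ready = True
        return "memory"

    def early_random(self):
        held_back = self.delay_corecrypto and not self.zalloc_ready
        if self.corecrypto_ready and not held_back:
            return self.corecrypto_random()
        return "sha1-random"  # no FP/SIMD instructions on this path

    def corecrypto_random(self):
        if self.crypto_lock_held:
            raise Deadlock()  # same thread, non-reentrant lock
        self.crypto_lock_held = True
        try:
            self.fp_instruction()
            return "simd-random"
        finally:
            self.crypto_lock_held = False

    def fp_instruction(self):
        if not self.fp_state_allocated:
            self.zalloc()  # the #NM handler needs a save buffer
            self.fp_state_allocated = True


def boot(corecrypto_ready, delay_corecrypto=False):
    kernel = ToyKernel(corecrypto_ready, delay_corecrypto)
    try:
        kernel.zalloc()
    except Deadlock:
        return "hangs"
    return "boots"


print(boot(corecrypto_ready=False))
print(boot(corecrypto_ready=True))
print(boot(corecrypto_ready=True, delay_corecrypto=True))
```

In this model the outcome depends only on whether corecrypto is already usable when zalloc first initializes, which mirrors the "timing and luck" nature of the real bug.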
After quickly rejecting a kext-based solution (because any kext would just become part of the race condition), I developed a "blunt instrument" solution that just skipped over the zalloc code that introduces random gaps in the memory map (by modifying zone_allocate_va() to NOP out the call to zone_allocate_random_early_gap()). That produced no functional changes to the system, and, most likely, no reduction in system security. However, "most likely" isn't good enough, and since we don't know exactly why Apple chose to add those gaps, we shouldn't cavalierly remove them. In addition, this solution was too narrow - it covered the most likely race case, but there could be other similar cases on differently-configured systems that would require separate patching.
I then settled on a more elegant "scalpel" solution, which is what is included below. It only modifies three bytes of code. The idea here is to modify the early random number generator ("early" meaning it's only called during early boot) to only use its SHA1 method and never call corecrypto during early boot. (This involves modifying read_erandom(), which the compiler inlines into early_random().) This way, zalloc can still introduce those gaps in the memory map, and system security is not compromised. Also, because it covers all of the random number generation during the early boot process, it should handle most or all future changes to MacOS that might introduce a new variant of this same race condition (such as increased use of SIMD instructions during early boot). This patch has minimal impact on the system, and will hopefully have some longevity.
As an aside, I had originally considered APFS as my prime suspect for being the root cause of this problem. As it turns out, APFS is both a victim and an agent provocateur, but not a direct cause. Because cryptography is fundamental to APFS, as soon as an APFS disk is detected and the APFS kext gets involved, both zalloc and corecrypto come into play, as well as SIMD instructions that the APFS kext itself may use. Booting successfully then boils down to timing and luck. By staggering the timing a bit, latebloom helped improve the luck. By circumventing the actual cause, luck becomes irrelevant (to this particular problem, anyway).
Note that future releases of MacOS might introduce new variants of this race condition, if new early boot code uses SIMD instructions, or if Apple continues to tinker with the zalloc initialization sequence. If that happens, at least we know where to start looking.
<h2>The gory details, for anyone wanting to make their own patch</h2>
The public portions of the MacOS source code are at https://opensource.apple.com/.
The relevant random number code is in osfmk/prng/prng_random.c. The relevant memory allocation code is in osfmk/kern/zalloc.c. Unfortunately, many of the relevant functions are declared as static, so their symbols are not public, and it takes more effort to locate their code in the kernel binary or at runtime. In addition, the compiler often inlines the relevant functions, so there is no discrete function call to manipulate; in those cases, you're stuck locating the code in situ, possibly in multiple places.
<h3>The "blunt instrument" patch (included here only for completeness)</h3>
During zalloc initialization, zone_expand_locked() will eventually be called, which will call zone_allocate_va(), which will call zone_allocate_random_early_gap(), which is where those random gaps in the memory allocation map are inserted (which calls early_random(), which almost always invokes corecrypto, which throws a #NM floating-point exception, which eventually causes a hang, as described earlier). We circumvent that call by locating the zone_allocate_random_early_gap() code, then finding the only call to that address. In 11.5, zone_allocate_random_early_gap() looks like this:
By searching for a call to 0xffffff80002d94e0, we can find the zone_allocate_va() code (which is inlined):
That callq 0xffffff80002d94e0 needs to go, so we just NOP it out (overwrite e8 b0 06 00 00 with 90 90 90 90 90). Now the random gaps don't get inserted, and zalloc can initialize without involving corecrypto.
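As a rough illustration of that overwrite, the following Python sketch NOPs out a five-byte near call in a byte buffer and decodes the call's relative target. The helper names and the bytes surrounding the call in `sample` are made up for the example; only the call bytes, the NOP bytes, and the target address come from the text above. The call-site address in the last line is derived arithmetically from those two figures, not read from a kernel.

```python
CALL = bytes.fromhex("e8 b0 06 00 00")  # callq with a 32-bit relative offset
NOP = 0x90


def nop_out_call(code: bytes, call_bytes: bytes) -> bytes:
    """Replace the single occurrence of a 5-byte near call with NOPs."""
    if len(call_bytes) != 5 or call_bytes[0] != 0xE8:
        raise ValueError("expected a 5-byte e8 rel32 call")
    if code.count(call_bytes) != 1:
        raise ValueError("pattern must match exactly once")
    return code.replace(call_bytes, bytes([NOP]) * 5)


def call_target(call_addr: int, call_bytes: bytes) -> int:
    """Target of a near call: address of the next instruction plus rel32."""
    rel32 = int.from_bytes(call_bytes[1:5], "little", signed=True)
    return call_addr + 5 + rel32


# Hypothetical bytes around the call, for illustration only.
sample = bytes.fromhex("48 89 df") + CALL + bytes.fromhex("eb 10")
print(nop_out_call(sample, CALL).hex(" "))

# With rel32 = 0x6b0, a call landing on 0xffffff80002d94e0 would sit here:
print(hex(call_target(0xFFFFFF80002D8E2B, CALL)))
```

Refusing to patch unless the pattern matches exactly once is a cheap guard against hitting the wrong call.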
Note that all of the code in the second block there (from cmpb through jmp) is RIP-relative, meaning the linker and loader won't alter it, and we can search for those exact bytes without having to mask them. (Sadly, the nearest public symbol prior to this is _work_interval_port_type_render_server, which is more than 2kB away, and unrelated to zalloc.)
<h3>The "scalpel" patch (as included below)</h3>
During early boot, the function early_random() is called for random number generation. The first time through, it sets things up for corecrypto, then calls read_erandom() to generate a random number. read_erandom() looks at the static int flag prng_ready to see if corecrypto is ready for use; if prng_ready is set, read_erandom() passes the request off to corecrypto. If prng_ready is not set, read_erandom() calls ccdrbg_generate(&erandom...) to generate the random number; ccdrbg_generate(&erandom...) uses SHA1 and does not use any floating-point instructions that will throw a #NM exception. Note that it's the erandom structure which contains pointers to the generator function and the digest data.
When corecrypto initializes, it calls a function in osfmk/prng/prng_random.c called register_and_init_prng(). That function initializes pointers to functions and data passed in by corecrypto, then zeros out the erandom structure (erasing the pointer to the SHA1 generator and data).
With a simple patch, we can prevent register_and_init_prng() from zeroing out the erandom structure, so the SHA1 generator remains intact.
Fortunately, read_erandom() is only called from two places, and it gets inlined in each - meaning that we can patch the instance embedded in early_random() without affecting the instance inlined in read_frandom(), which is used elsewhere in the kernel during normal (post-boot) operation.
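The interplay between prng_ready, the erandom structure, and the two patch points can be modeled in a few lines of Python. This is a behavioral sketch only: the class and its flags are hypothetical, it assumes prng_ready becomes set when corecrypto registers, and a call through the zeroed structure is modeled as a RuntimeError rather than a real kernel fault.

```python
class ToyPrng:
    def __init__(self, patch_read=False, patch_register=False):
        self.patch_read = patch_read          # models the early_random() patch
        self.patch_register = patch_register  # models the register_and_init_prng() patch
        self.prng_ready = 0
        self.erandom = {"generate": lambda: "sha1"}

    def register_and_init_prng(self):
        self.prng_ready = 1  # assumption: set when corecrypto registers
        if not self.patch_register:
            self.erandom = {"generate": None}  # models zeroing the structure

    def read_erandom(self):
        if self.prng_ready and not self.patch_read:
            return "corecrypto"  # SIMD path, can trigger #NM
        generate = self.erandom["generate"]
        if generate is None:
            raise RuntimeError("call through zeroed erandom structure")
        return generate()


def early_boot(patch_read, patch_register):
    prng = ToyPrng(patch_read, patch_register)
    prng.register_and_init_prng()  # corecrypto wins the race
    try:
        return prng.read_erandom()
    except RuntimeError:
        return "fault"


for flags in [(False, False), (True, True), (True, False), (False, True)]:
    print(flags, early_boot(*flags))
```

In this model, forcing the SHA1 path without preserving the erandom structure ends in a fault, which is presumably why the two changes have to be applied as a pair.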
In 11.3, the read_erandom() instance inlined in early_random() contains this code:
We do a one-byte patch to change "je <zero_case>" to "jmp <zero_case>" (i.e. change 74 23 to eb 23), so this instance of read_erandom() always behaves as if prng_ready is 0. (Note that to decrease the odds of a false pattern match, the patch uses more than just those two bytes: it looks for 00 74 23 48 8b and replaces it with 00 eb 23 48 8b.)
In 11.3, the register_and_init_prng() function contains this code at the end of the function:
To prevent the erandom structure from getting zeroed out by _secure_memset, we just jump over that function call. The patch searches for ba 48 01 00 00 31 f6 (movl $0x148,%edx; xorl %esi,%esi) and replaces it with ba 48 01 00 00 eb 05 (movl $0x148,%edx; jmp $+5).
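For anyone building a patch by hand, the two find/replace pairs can be checked with a short Python sketch. It counts the bytes that actually change and base64-encodes each pattern, since config.plist stores binary data as base64. The helper names and the sample buffer are illustrative assumptions, not part of any existing tool.

```python
import base64

PATCHES = {
    "read_erandom": (
        bytes.fromhex("00 74 23 48 8b"),
        bytes.fromhex("00 eb 23 48 8b"),
    ),
    "register_and_init_prng": (
        bytes.fromhex("ba 48 01 00 00 31 f6"),
        bytes.fromhex("ba 48 01 00 00 eb 05"),
    ),
}


def changed_bytes(find: bytes, replace: bytes) -> int:
    """Number of positions where the replacement differs from the original."""
    if len(find) != len(replace):
        raise ValueError("find and replace must be the same length")
    return sum(a != b for a, b in zip(find, replace))


def apply_patch(code: bytes, find: bytes, replace: bytes) -> bytes:
    """Patch only if the pattern matches exactly once."""
    if code.count(find) != 1:
        raise ValueError("pattern must match exactly once")
    return code.replace(find, replace)


def b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


for name, (find, replace) in PATCHES.items():
    print(name, changed_bytes(find, replace), b64(find), b64(replace))

total = sum(changed_bytes(f, r) for f, r in PATCHES.values())
print("bytes modified in total:", total)
```

The base64 strings printed here should match the Find and Replace values in the plist entries further down, and the total should come out to the three modified bytes mentioned earlier.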
When both of these patches are applied, we effectively make the early boot random number generator always use its non-floating-point SHA1 code. We don't prevent zalloc from introducing its random gaps, or make any other functional changes to the system. By avoiding the "floating point before zalloc initialization" deadlock, we allow Big Sur and Monterey to boot just as smoothly as their predecessors.
And now, the beautiful part - the handful of bytes that we look for and modify have not changed between 11.3 and 12b7, meaning we don't (currently, at least) need to have different patches for different versions of MacOS. This will probably change at some point, but for now, we got very lucky.
<h2>The actual SurPlus patch</h2>
The patch below is intended to be inserted in the Kernel - Patch array of the OpenCore config.plist file.
Classic Macs that don't use OpenCore will need something to patch their kernel. I'll happily work with anyone who wishes to develop such a patcher (creating a new patcher isn't on my to-do list, since OpenCore makes it simple).
Remember to disable or uninstall latebloom if you have it installed!
The SurPlus patch for both Big Sur and Monterey (through Beta 7, at least):
<dict>
<key>Arch</key>
<string>x86_64</string>
<key>Base</key>
<string>_early_random</string>
<key>Comment</key>
<string>SurPlus v1 - PART 1 of 2 - Patch read_erandom (inlined in _early_random)</string>
<key>Count</key>
<integer>1</integer>
<key>Enabled</key>
<true/>
<key>Find</key>
<data>
AHQjSIs=
</data>
<key>Identifier</key>
<string>kernel</string>
<key>Limit</key>
<integer>800</integer>
<key>Mask</key>
<data>
</data>
<key>MaxKernel</key>
<string>21.2.0</string>
<key>MinKernel</key>
<string>20.4.0</string>
<key>Replace</key>
<data>
AOsjSIs=
</data>
<key>ReplaceMask</key>
<data>
</data>
<key>Skip</key>
<integer>0</integer>
</dict>
<dict>
<key>Arch</key>
<string>x86_64</string>
<key>Base</key>
<string>_register_and_init_prng</string>
<key>Comment</key>
<string>SurPlus v1 - PART 2 of 2 - Patch register_and_init_prng</string>
<key>Count</key>
<integer>1</integer>
<key>Enabled</key>
<true/>
<key>Find</key>
<data>
ukgBAAAx9g==
</data>
<key>Identifier</key>
<string>kernel</string>
<key>Limit</key>
<integer>256</integer>
<key>Mask</key>
<data>
</data>
<key>MaxKernel</key>
<string>21.2.0</string>
<key>MinKernel</key>
<string>20.4.0</string>
<key>Replace</key>
<data>
ukgBAADrBQ==
</data>
<key>ReplaceMask</key>
<data>
</data>
<key>Skip</key>
<integer>0</integer>
</dict>
Note that this is a two-part patch (there are two separate patches included), and both parts need to successfully load or the kernel will most likely panic. Assuming your installation of OpenCore is relatively new (and stable), this should not be an issue.
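Before pasting the entries into a live config.plist, it can be worth a quick sanity check that both parts decode cleanly. The sketch below is one possible way to do that with Python's standard plistlib; it embeds a trimmed copy of the two entries (only the keys it inspects) and the `check_entries` helper is a hypothetical name, not an OpenCore tool. It assumes Find and Replace should have equal length and that MinKernel should not exceed MaxKernel.

```python
import plistlib

PLIST = b"""<?xml version="1.0" encoding="UTF-8"?>
<plist version="1.0">
<array>
  <dict>
    <key>Base</key><string>_early_random</string>
    <key>Find</key><data>AHQjSIs=</data>
    <key>Replace</key><data>AOsjSIs=</data>
    <key>MinKernel</key><string>20.4.0</string>
    <key>MaxKernel</key><string>21.2.0</string>
  </dict>
  <dict>
    <key>Base</key><string>_register_and_init_prng</string>
    <key>Find</key><data>ukgBAAAx9g==</data>
    <key>Replace</key><data>ukgBAADrBQ==</data>
    <key>MinKernel</key><string>20.4.0</string>
    <key>MaxKernel</key><string>21.2.0</string>
  </dict>
</array>
</plist>"""


def version_tuple(text):
    return tuple(int(part) for part in text.split("."))


def check_entries(entries):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    for entry in entries:
        base = entry["Base"]
        if len(entry["Find"]) != len(entry["Replace"]):
            problems.append(base + ": Find and Replace differ in length")
        if version_tuple(entry["MinKernel"]) > version_tuple(entry["MaxKernel"]):
            problems.append(base + ": MinKernel is newer than MaxKernel")
    return problems


entries = plistlib.loads(PLIST)  # <data> values arrive as decoded bytes
for entry in entries:
    print(entry["Base"], entry["Find"].hex(" "), "->", entry["Replace"].hex(" "))
print(check_entries(entries))
```

A check like this only catches transcription mistakes; it cannot tell whether the patterns still match a given kernel build.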
<h2>What should Apple do about this?</h2>
First, let's be realistic - it's unreasonable to expect Apple to make modifications to MacOS because of a problem that seems to only affect unsupported systems. Patches like this are probably the best path forward for these older systems.
That being said, I have yet to see any reason why this problem could not manifest itself on a supported system. It may simply be that the newer systems are fast enough that this problem never shows up, or there may be some piece of this that I've overlooked. In any case, it's a definite flaw for there to be a circular dependency like this, and I hope Apple will consider removing it (there are several possible means to accomplish this). (I'm not holding my breath, though.)
Update: Thanks to insight from @vit9696, it seems that newer models are unaffected by this bug because those CPUs support the rdrand instruction, meaning they don't require FP access during early boot. From Apple's perspective, the race condition fixed by the SurPlus patch is not a bug, since it doesn't affect supported systems. Therefore, Apple has no reason to address this issue at all.