workaround Clang v15 AArch64 miscompile that affects parallel collection #879

Merged
merged 1 commit into from
Oct 14, 2024


@mflatt mflatt commented Oct 13, 2024

This patch avoids a miscompile using Clang v15 on macOS. The default compiler on macOS was recently upgraded to Clang v16, which appears to fix the problem, and I have not been able to replicate the problem with Clang v15 variants that are available in Linux distributions. So, it might be ok to just ignore the problem. But since v15 installations are likely to hang around for a while in other macOS installations, since the workaround is simple, since Racket users who build themselves are affected, and since I spent a lot of time tracking down the problem, I'm inclined to include a workaround.

For details on the miscompile as it affects Chez Scheme, see clang15-miscompile.zip.

mflatt commented Oct 13, 2024

I spent so long tracking this down that I'd like to tell you the long story, even though it doesn't really matter. The miscompile seems like a run-of-the-mill compiler error, but the way it affected Chez Scheme and Racket made it especially difficult to find.

From 2022 to 2024, I tried off and on to track down an occasional failure in Racket builds on my macOS M1/M2 laptops. Memory would get mangled late in the build, specifically during documentation rendering for the "math" library, which uses libgmp and libmpfr in multi-threaded mode. Since the problem never happened on x86_64, and since it only happened during parallel documentation rendering, I was pretty sure that I was looking for some sort of race condition exposed by AArch64's weak memory ordering.

Although I discovered that I could provoke a crash by just rebuilding documentation, even that step takes 10 minutes, and the crash would only happen rarely, so getting a crash would take hours. Any little change I made to try to gather information would make the crash go away or become much more difficult to provoke, so hours turned to days.

Meanwhile, users of the Racket main distribution were not running into problems, which I chalked up to the fact that documentation is pre-rendered. Also, maybe more generally libgmp or libmpfr needed to be involved, so maybe it wasn't my problem. In any case, the lack of reports made the problem feel less of an emergency than I would normally consider crashing bugs, especially since I had so much trouble replicating the crash or pinpointing an issue. So, I'd burn a day or three on the issue every few months.

In September 2024, I finally gathered evidence to suspect that the problem was in the GC's parallel mode. And with that suspicion, I was finally able to make a small Chez Scheme program with the right ingredients to crash, showing that the problem was independent of Racket and math libraries. The big difference was being able to provoke a crash within seconds instead of hours, and I found the problem over the next day.

In retrospect, it's clear why the problem was so difficult to find. I was pretty sure I was looking for a memory race, but that turned out to be because only multi-threaded programs could reach the miscompiled code. And only during parallel collections. And only when the collector is looking at specific words within a thread representing virtual registers, which are not something that programs normally use directly. The effect of the miscompile was that a "does this object belong to me?" check would succeed when it shouldn't. That matters only when a thread has an object in its virtual register that was allocated by a different thread, which is an even rarer use of a virtual register. And even when it goes wrong, there's only a small chance that different collector threads will end up looking at the same object at the same time, and even concurrent traversal of the same object will turn out ok a lot of the time! Finally, and most perniciously, the miscompile creates a race that isn't in the source code, and in a code template that is put in place by a macro that is used dozens of times in the output (and compiled ok in all other instances).
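To make the shape of that ownership check concrete, here is a minimal sketch in C. It is not Chez Scheme's collector code: every name in it (`object_t`, `owner_id`, `belongs_to_me`, `classify`) is hypothetical, and the "leave it for the owner" branch is an assumption about how a parallel collector might avoid two threads traversing the same object. The point is only that if the equality test wrongly reports a match, an object owned by another collector thread gets traversed locally, which opens the window for the race described above.

```c
/* Hypothetical sketch only; these are NOT Chez Scheme's real types or names. */
typedef struct {
    int owner_id;   /* id of the collector thread that allocated the object */
} object_t;

typedef enum {
    SWEEP_LOCALLY,   /* this collector thread may traverse the object */
    LEAVE_FOR_OWNER  /* assumed behavior: another thread is responsible */
} action_t;

/* The "does this object belong to me?" check. A miscompile that makes this
   return nonzero for a foreign object lets two threads touch one object. */
static int belongs_to_me(const object_t *obj, int my_id) {
    return obj->owner_id == my_id;
}

/* Decide what a collector thread with id my_id should do with obj. */
static action_t classify(const object_t *obj, int my_id) {
    return belongs_to_me(obj, my_id) ? SWEEP_LOCALLY : LEAVE_FOR_OWNER;
}
```

In this sketch the source code has no race at all, since each object is traversed only by its owner; a race appears only if the compiled comparison misbehaves, which matches the description of a race that "isn't in the source code".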

Meanwhile, Racket distributions are compiled with Clang v12, which is why it hasn't been a problem for Racket users, even when they run programs with parallelism.

@jltaylor-us jltaylor-us left a comment

Great detective work, Matthew!

maoif commented Oct 14, 2024

Thanks for the fix and for sharing your experience of tracking down this tricky bug.

@ufo5260987423

You are the hero!

@mflatt mflatt merged commit fc577f2 into cisco:main Oct 14, 2024
15 checks passed
mflatt added a commit to racket/racket that referenced this pull request Oct 14, 2024
@glandium

#if defined(__arm64__) && defined(__clang__) && (__clang_major__ == 15)

FYI, __clang_major__ from the clang provided by Apple on macOS/Xcode does not match the upstream clang version's __clang_major__. For some reason, Apple decided that clang's version should be Xcode's, so it doesn't match the LLVM version it's derived from. Xcode 15's clang is based on LLVM 16.
So, __clang_major__ == 15 also matches an entirely different compiler when clang doesn't come from Xcode/Apple. You may want to add defined(__apple_build_version__).
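
A guard narrowed along those lines might look like the sketch below. The predefined macros (`__clang__`, `__clang_major__`, `__apple_build_version__`, `__arm64__`, `__aarch64__`) are real compiler macros; the `IS_APPLE_CLANG`, `IS_AARCH64`, and `NEEDS_CLANG15_WORKAROUND` names and the helper function are made up for illustration. Whether the merged patch was later changed this way is not stated in this thread.

```c
/* Hypothetical macro names; only the compiler-predefined macros are real. */

/* Apple's clang defines __apple_build_version__; upstream clang does not. */
#if defined(__clang__) && defined(__apple_build_version__)
# define IS_APPLE_CLANG 1
#else
# define IS_APPLE_CLANG 0
#endif

/* Apple toolchains define __arm64__; __aarch64__ is the generic spelling. */
#if defined(__arm64__) || defined(__aarch64__)
# define IS_AARCH64 1
#else
# define IS_AARCH64 0
#endif

/* Match only Apple clang 15 (Xcode 15 numbering), not upstream clang 15. */
#if IS_AARCH64 && IS_APPLE_CLANG && (__clang_major__ == 15)
# define NEEDS_CLANG15_WORKAROUND 1
#else
# define NEEDS_CLANG15_WORKAROUND 0
#endif

/* Runtime-visible copy of the compile-time decision, handy for logging. */
int needs_clang15_workaround(void) {
    return NEEDS_CLANG15_WORKAROUND;
}
```

With this shape, upstream clang 15 on Linux compiles the unmodified code path, while Apple's clang 15 on AArch64 selects the workaround.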
