Analyzing a stack allocation overflow error

I recently took an interesting dive into a curious C++ bug, and I'm going to write it up here, just in case it's interesting to anyone else. Note: some of the early-stage debugging was done by others in parallel (although I did all the late-stage debugging); I'm simplifying somewhat by focusing on what I did.

The issue started with unit tests failing. Not just random unit tests, though: around 2/3 of the unit test projects stopped running at all (which technically doesn't appear as "failures", so it wasn't noticed immediately). Once it was noticed, it was traced to a recent change: running the UTs against the Debug|x64 build (versus the previous Debug|x86 build). But that was just the starting point for the analysis, which at that point got effectively punted to me ("benefits" of being one of the most senior devs, and often the last resort for hard problems).

Initially, I looked at why the tests failed to execute; it turned out the UT binaries were crashing on startup. Looking at the crash dump, the binaries appeared to be crashing during initialization, and closer inspection showed a stack overflow exception being reported (even more odd, since main hadn't run yet, so the traditional stack wasn't even in play yet). I traced this to the initialization of a single object, which was being constructed on the heap at global initialization time (the object was held in a std::unique_ptr<>, for which there was a global instance, and it was populated in its constructor).
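
The actual code isn't shown here, but the shape of the pattern was roughly the following (LookupTable and g_table are hypothetical names, used only for illustration):

    #include <map>
    #include <memory>
    #include <string>

    // Hypothetical reconstruction of the pattern described above: a globally
    // scoped std::unique_ptr whose pointee builds a large std::map in its
    // constructor, so all of the population work runs before main().
    struct LookupTable {
        std::map<std::string, int> entries;
        LookupTable() {
            // In the real code this is a large population method.
            entries.emplace("example-key", 42);
            // ... many more entries ...
        }
    };

    // Global instance: constructed (and populated) during static initialization,
    // before main() and before any test harness gets control.
    std::unique_ptr<LookupTable> g_table = std::make_unique<LookupTable>();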

For context, this object contained an internal std::map, which was large and populated with many entries. Some other devs had theorized that the map itself might be using the stack for storage; I dismissed that theory. Within the population method, there were no objects with excessively long lifetimes (although there were numerous objects with automatic duration during the map population, almost all of them anonymous). I theorized that perhaps something else was happening, and that what we were seeing was just bad data because the failure was happening at initialization time (when not all data structures/info were necessarily correct). This was essentially the starting point from which I did all the remaining analysis.

First things first: the global initialization didn't follow my recommended pattern of runtime initialization for shared instances, so I converted this object to be runtime initialized (including all references to it). With that done, I could write a simple bootstrap UT to exhibit the issue, and verify that the app did not crash when the data structure was left uninitialized (which it didn't: it's good to check all assumptions along the way). With a test bed in place, I could observe the actual error more clearly: a stack overflow exception was being raised in the population call (although, curiously, the debugger showed it very early in the population method; I initially thought this might be a red-herring artifact of the runtime).
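
I won't detail the exact runtime-initialization pattern here, but one common shape for it, in the spirit of what's described, is a lazily constructed accessor (again with hypothetical names):

    #include <map>
    #include <string>

    struct LookupTable {
        std::map<std::string, int> entries;
        LookupTable() { entries.emplace("example-key", 42); /* ... */ }
    };

    // Instead of a global constructed before main(), callers go through an
    // accessor, and the shared instance is constructed on first use, at runtime.
    LookupTable& GetLookupTable() {
        static LookupTable instance;  // constructed on first call; thread-safe since C++11
        return instance;
    }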

Focusing on the reported stack overflow, I set about validating it. I found some prototype stack-checking code here: https://devblogs.microsoft.com/oldnewthing/20200610-00/?p=103855, which I adapted for my new test bed. From trace output, I could see that the stack remaining went from ~1MB in the UT wrapper to ~350k in the population method (again, interestingly, at the start of the method). This was the first definitive clue as to what was going on, although it would take a bit more knowledge and analysis to be sure.
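
I won't reproduce the linked code, but a minimal sketch of the same idea, assuming the Windows-specific GetCurrentThreadStackLimits API (Windows 8 and later), looks something like this:

    #include <windows.h>
    #include <cstdio>

    // Approximate the stack remaining on the current thread by comparing the
    // address of a local variable against the low limit of the thread's stack.
    // This is a rough probe for tracing, not an exact measurement.
    static size_t ApproxStackRemaining() {
        ULONG_PTR low = 0, high = 0;
        GetCurrentThreadStackLimits(&low, &high);
        char marker;  // sits near the current top of the stack
        return static_cast<size_t>(reinterpret_cast<ULONG_PTR>(&marker) - low);
    }

    void TraceStackRemaining(const char* where) {
        std::printf("%s: ~%zu bytes of stack remaining\n", where, ApproxStackRemaining());
    }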

First, though, a small but material diversion. In C++, we typically think of the stack as logically growing with automatic-lifetime variables and function calls (stack frames). While this is a fairly accurate mental model, it's not exactly aligned with what the compiler actually generates; one obvious example is inlined functions, but there are others. In particular, the "stack" as a concept is not defined in the C++ standard, and compilers can rearrange variable locations in memory arbitrarily, as long as the observable semantics remain consistent. This is leveraged, for example, to place stack guard cookies on the stack, to pad allocations for buffer-overrun checks in some compilation configurations, and so on.

This is material in this case because, going back to the previous test, I next decided to try various build configurations (with the above stack-remaining function). Interestingly, there were significant differences. In Debug|x86, the stack remaining on entry was ~600k; in Release builds, it was considerably more (~800k+). This meant that something was very different between build types, and the change to Debug|x64 was seemingly causing some limit to be exceeded.

The next part is a bit of educated speculation, as compiler internals are not documented, but I noticed a reference in the exception stack to __chkstk_ms. Pulling on this thread led to a pretty sparse info page from MS (https://learn.microsoft.com/en-us/windows/win32/devnotes/-win32-__chkstk), and a more detailed write-up in relation to platforms emulating Windows (https://nullprogram.com/blog/2024/02/05/). In simplified summary, this is a function the compiler adds internally to probe for page faults on the stack, but it also appears to check overall size usage. It's different per bitness, undocumented, somewhat unknown, and seemingly related to what was happening here.
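
As a concrete (if simplified) illustration of when these probes appear: with MSVC, a function whose locals exceed roughly one page of stack (4 KB by default, tunable via /Gs) gets a compiler-inserted probe call on entry, so that each page of the new frame is touched in order and the guard-page mechanism can grow the stack:

    // Illustration only: the large local frame below is enough (under default
    // MSVC settings) to make the compiler emit a stack-probe call (__chkstk on
    // MSVC; MinGW/GCC use a __chkstk_ms variant) at function entry.
    void BigFrame() {
        char buffer[16 * 1024];  // well over one 4 KB page of locals
        buffer[0] = 0;           // touch the buffer so it isn't trivially removed
        (void)buffer;
    }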

Then, more educated speculation. In MSVC, the compiler supports a certain amount of debug-time variable exploration, and sometimes manipulation (with Edit and Continue). This implies that the compiler is reserving some amount of buffer space on the stack, which might scale with the number of possible references within a function, for example anonymous variables with automatic lifetime. If that was happening, and the reservation size was scaling with the number of anonymous automatics, and the page-fault check was larger for x64, then this could be what we were hitting. Spoiler: this was the root cause (see the sketch below).
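
To make that concrete, here is a compressed, hypothetical sketch of the shape of code involved (Entry and PopulateTable are made-up names). Each call site creates anonymous temporaries; in an optimized build their storage can be reused, but in a debug build the compiler may give each one its own slot in the enclosing frame, so a method with hundreds of such lines reserves a very large frame at entry:

    #include <map>
    #include <string>

    struct Entry { int id; std::string label; };  // hypothetical value type

    void PopulateTable(std::map<std::string, Entry>& table) {
        // Each line below creates anonymous temporaries (the key string and the
        // Entry value). Repeat this pattern hundreds of times (often via macros)
        // and, in a debug build, the frame reserved on entry grows accordingly.
        table.emplace(std::string("alpha"), Entry{1, "first"});
        table.emplace(std::string("beta"),  Entry{2, "second"});
        table.emplace(std::string("gamma"), Entry{3, "third"});
        // ... many more ...
    }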

At first, I thought I could mitigate this by using nested blocks to control scope, but it turns out that the compile-time stack reservation doesn't seem to account for inner blocks (see the previous note about the difference between the mental model and actual compilation). However, encapsulating some of the call code (with its anonymous automatics) within a lambda did reduce the stack reservation at the start of the method. We had a winner! By restructuring the common, repeated call into a lambda, the remaining stack size at function entry increased to ~600k, and the crash was gone.
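
In rough terms, and again with hypothetical names, the restructuring looked something like this: the repeated call and its temporaries move into a lambda, so their stack slots belong to the lambda's own (per-call) frame rather than being reserved up front in the big method's frame:

    #include <map>
    #include <string>

    struct Entry { int id; std::string label; };  // hypothetical value type, as above

    void PopulateTable(std::map<std::string, Entry>& table) {
        // The temporaries now live in the lambda's frame, created per call,
        // instead of contributing to the reservation at PopulateTable's entry.
        auto add = [&table](const char* key, int id, const char* label) {
            table.emplace(std::string(key), Entry{id, std::string(label)});
        };

        add("alpha", 1, "first");
        add("beta",  2, "second");
        add("gamma", 3, "third");
        // ... many more ...
    }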

Once that was done, it was just a matter of "mopping up", so to speak. There were a surprising number of automatics in the aggregate method; it turns out macros can generate a lot of "hidden" code. But the initial fix, with some additional refactoring, eventually took the reservation size for the method to <200k, which should be sufficient for now (and, importantly, should not grow as more entries are added in the future). As an aside, the other ancillary code updates (such as correct usage of std::move, etc.) also reduced the runtime cost of the initialization by about half.
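
As a generic illustration of that last point (not the actual code): copying a locally built value into a container duplicates its buffers, while moving it transfers ownership and skips the extra allocation per entry.

    #include <map>
    #include <string>
    #include <utility>

    // Hypothetical helper: taking the arguments by value and moving them into
    // the map avoids an extra copy (and its allocation) per inserted entry.
    void AddEntry(std::map<std::string, std::string>& table,
                  std::string key, std::string value) {
        table.emplace(std::move(key), std::move(value));
    }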

So that's the story of how I figured out and fixed an annoying crash in our UTs, in Debug mode, when we switched to the x64 build. There are more details, but hopefully the above is clear enough to be at least somewhat interesting and/or informative.

Addendum: at this point, I have probably spent more time explaining the issue to other people than it took to figure it out. There's a saying in the industry (and probably other fields as well): "I can explain it to you, but I cannot understand it for you." That's very pertinent for cases like this, where the root cause is somewhat esoteric and requires fairly in-depth knowledge, plus educated speculation about compilation semantics and low-level language behavior, to understand what's probably going on. That's part of the job, though: solving the technical problem is usually only a fraction of the actual work involved in getting an issue fully resolved.

 
