Debugging Heisenbugs: A tale of parallel processing
A bug that disappears when you try to debug it.
We had recently done a very large migration of the Python Analyzer codebase to make it faster. Among other changes, one major improvement was that the analyzer now used concurrent.futures.ProcessPoolExecutor to run independent tasks in parallel, which allowed us to use all the CPU cores.
And it worked pretty well — we saw significant speed improvements on large repositories, which would only improve over time now that we have better control over how we're running the issue checkers.
Chapter 0: The Initial Bump
Within the first couple weeks of deploying the upgrade, we started seeing some analysis runs crashing. This was partly expected — such a major migration is bound to have edge cases, you just need to be able to debug it as soon as the issue shows up.
Looking into the logs of these runs, we found the process pool crashing with a BrokenProcessPool error. These crashes usually happened only when analyzing a large number of files in a big repository.
This would not happen every time. The same analysis run would sometimes go through just fine. The runs were flaky.
This answer suggests that this could happen because of the process running out of memory. Time to look into monitoring.
We use Google Cloud to run our analysis jobs, and they run inside containers that can be monitored using GCP Monitoring. Retrying one of the failed analysis runs and monitoring the RAM usage shows the following:
That seems a lot like the container is running out of memory.
We confirmed that the container was indeed sitting at 100% of the memory limit that was set on it.
After some more research, we figured out what seemed to be the problem: the task queue.
The way we were setting up the futures was not correct. For example, here's what you'd do to run a function in parallel on all the items inside any given data:
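Here's a minimal sketch of that pattern; `process_item` and `run_all` are illustrative stand-ins for the real checker code:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_item(item):
    # Stand-in for a real, memory-hungry analysis task.
    return item * 2

def run_all(data):
    results = []
    with ProcessPoolExecutor() as executor:
        # One future per item, all submitted up front.
        futures = [executor.submit(process_item, item) for item in data]
        for future in as_completed(futures):
            results.append(future.result())
        # `futures` still holds every future -- and every result --
        # until the whole list goes out of scope.
    return results
```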
It seems straightforward and correct at first glance, and for the most part it is, but it forgets one thing: there's only so much memory to go around.
Every executor.submit call is creating a future. That future will now hold a copy of the data that you passed to it. Then we start executing the futures. Each future does some memory allocation, processing, and then returns a result. At the end of the calculation, we're holding on to every single future and its memory inside the futures list.
And when the number of items is high, hundreds or even thousands of futures get created, all hogging memory. In a container with limited memory, this will eventually cause memory allocations to fail, leading to BrokenProcessPool errors.
There were two ways to fix this. The first was to start batching the tasks, with code that would look something like this:
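A batched version might look like this; `BATCH_SIZE` and `process_item` are illustrative names, not the analyzer's real ones:

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

BATCH_SIZE = 10

def process_item(item):
    return item * 2

def run_batched(data):
    results = []
    with ProcessPoolExecutor() as executor:
        for start in range(0, len(data), BATCH_SIZE):
            batch = data[start:start + BATCH_SIZE]
            # Only BATCH_SIZE futures are alive at any given time.
            futures = [executor.submit(process_item, item) for item in batch]
            for future in as_completed(futures):
                results.append(future.result())
            # Drop the references so the futures can be garbage collected.
            del futures
    return results
```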
Now we only store ten futures at a time, and discard them once we get their results. We also delete the items that have already been processed.
But there's an even simpler way to do this:
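The change is just swapping the list comprehension's square brackets for parentheses (sketched with the same illustrative `process_item`):

```python
from concurrent.futures import ProcessPoolExecutor, as_completed

def process_item(item):
    return item * 2

def run_lazy(data):
    results = []
    with ProcessPoolExecutor() as executor:
        # A generator expression instead of a list comprehension:
        # we never hold our own list of futures.
        futures = (executor.submit(process_item, item) for item in data)
        for future in as_completed(futures):
            results.append(future.result())
    return results
```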
This is interesting: We've turned futures into a generator.
Unlike before, we no longer hold our own list of futures. as_completed still iterates through the generator and tracks every future internally, but once a future finishes and is yielded to our for-loop, ours is effectively the last reference to it. So at the end of each loop iteration, that future, along with the result it holds, becomes eligible for garbage collection.
This means we're getting the result, and then immediately getting rid of the unneeded future. No memory hoarding, problem solved! ...right?
Chapter 1: Phantom timeouts
Most of the time the Analyzer now worked perfectly as intended. But on some rare occasions, we started seeing some timeouts.
We have a cap of 25 minutes for how long the analysis of one repository can run, and if it ends up taking longer, we cancel the analysis and mark it as "timed out". 25 minutes is a very long time for analysis, and usually timeouts only happen if there's some bug in the analyzer.
The behaviour seemed very erratic and unpredictable.
Every now and then, about 10 minutes into the analysis run, logs would simply stop showing up. As in, the container would suddenly stop printing any logs, as if nothing was going on.
25 minutes later, the container would be killed. Since no result is generated, this is declared a timeout.
There's no real way to figure out more about the cause of the hang, because production containers can't be accessed from outside.
This is a classic multiprocessing problem at first glance. It looked like we either have a deadlock, or some worker process that we are waiting on is stuck inside an infinite loop. So we started digging into source code:
- Infinite loops, any while loops anywhere in the codebase? Nope.
- Extremely long processing time? Every single process that we spawn has a timeout, so that can't be the problem.
- Deadlocks? This is possible. If there's a deadlock the processes should not be consuming CPU. Usage should be almost zero. Time to look at monitoring again:
All the while, there were also memory issues going on. As a safety measure against spurious BrokenProcessPool errors, we had modified the code to re-create the pool if it happens:
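The safety net looked roughly like this; `run_with_retry` and `double` are illustrative names, not the analyzer's actual code:

```python
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def double(item):
    # Illustrative task.
    return item * 2

def run_with_retry(func, items, max_retries=3):
    for _ in range(max_retries):
        try:
            with ProcessPoolExecutor() as executor:
                futures = [executor.submit(func, item) for item in items]
                return [future.result() for future in futures]
        except BrokenProcessPool:
            # The pool is unusable once broken; loop around and
            # let the `with` block build a fresh one.
            continue
    raise RuntimeError("process pool kept breaking; giving up")
```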
And we were indeed seeing pool re-creation in the logs, so memory was being hogged as well. We didn't pay much attention to that yet, as all signs pointed to a deadlock.
Now the problem was that we needed to re-create this behaviour somehow. The repository causing the analysis to hang was private, so we couldn't re-create that same environment locally. We'd have to find a public repository that triggered the same problem. There was one clue though: the repository had 800 or so Python files, so it was a large project.
We tried running other large projects like django through the production Analyzer. No luck: analysis ran smoothly, every single time.
We looked for historical timeouts in the analyzer to find other large repos that have timed out before, and found a couple that were open source. And somehow, after trying about a dozen projects and not being able to get a deadlock, one of them was able to reproduce that problem: a project called weblate.
We tried re-running the Analyzer to see if the bug can be reliably reproduced on this weblate repository. And... no luck.
Retrying a few times, no timeout. A true Heisenbug indeed. Well, at least the behaviour is consistent in that regard: The timeouts that led to this investigation also go away if you re-run that same analysis again.
Running analysis on weblate on our internal instance of DeepSource also didn't help: no timeouts, no hangs.
So instead of one repository that randomly fails, we now had two. It sounds like we just doubled the problem, but this wasn't a dead end: now we could draw parallels between the two runs and see what's similar. Maybe we'd figure out more about the exact conditions that need to be met for this bug to show up.
Digging into the logs of the two, we noticed something peculiar. In the beginning, there's one line:
This was a bit odd, because we knew our internal cluster runs 24 processes at a time, and the number of processes submitted to the executor is double the core count:
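The sizing logic was along these lines (`MAX_PROCESSES` is an illustrative name; only the doubling is from the text):

```python
import os

CPU_COUNT = os.cpu_count()
# Two worker processes per reported core, to keep the pool saturated.
MAX_PROCESSES = CPU_COUNT * 2
```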
We had added some extra logging when trying to reproduce this issue, and indeed, both failing analysis runs were running 48 processes at once, instead of the expected 24.
Checking the logs of the internal runs that didn't timeout, the logs show:
12 cores, 24 processes.
Analysis passes when there are 12 cores, and fails when there are 24 cores. We have found common ground. Now, to be 100% sure, we need to reliably reproduce this bug.
Let's look at the CPU_COUNT definition:
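In essence, it came straight from the standard library (a reconstruction from context; the surrounding module is omitted):

```python
import os

# The worker count was derived directly from the machine's core count.
CPU_COUNT = os.cpu_count()
```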
We changed this to a hardcoded value:
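Something like this, with 24 matching the core count of the failing production machines:

```python
# Hardcoded to match the 24-core production machines, so the bug
# could be reproduced on our smaller internal machines too.
CPU_COUNT = 24
```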
After doing this and trying to analyze weblate on our internal instance of the analyzer: et voilà, it reproduced. The analysis got stuck after 10 minutes, every single time.
We have nailed down the bug: When os.cpu_count() is high, we get timeouts.
The real problem
Our containers run on Google Kubernetes Engine, which is a managed k8s cluster with many cores. But we have set resource limits on our analysis clusters such that they are limited to only run on 4 CPU cores at maximum.
Well, here's the thing. After reading the docs, it turns out that resource limiting in Kubernetes is usually done by limiting "CPU time", not core count.
This means that when you set a container to use "4 CPU cores", it won't actually be pinned to 4 cores. It will run on every core of the cluster, but the scheduler limits the CPU cycles the container gets to the equivalent of 4 cores.
Doesn't make sense? Let me explain with an example:
Say that your cluster has 40 cores. If you say that each container should use a maximum of 4 cores, the container isn't restricted to 4 physical cores; it's limited to 10% of the cluster's total CPU time. If there's only one container currently running, its processes can take up 10% of the processing time on every single CPU core in that cluster.
This means two things:
- os.cpu_count() will see every single core of the cluster, even with CPU limitations. And since we use that count to spawn the number of processes, we are spawning more processes for the same amount of "CPU time".
- In our previous example, if we spawn 5 processes, each process gets 2% of the total 10% CPU time available to that container. But if we spawn 10 processes, each process suddenly slows down to half speed, as it's only getting 1% of the total resources.
- So if the machine the process is running on has more CPU cores, each parallel process that we do suddenly slows down for no reason.
- The more processes we spawn, the more memory we will end up using. If the machine changes from 12 to 24 cores, suddenly our memory usage has doubled. And since we also have memory limits set on the container, we are bound to see memory issues.
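Python itself doesn't account for these quotas: os.cpu_count() reports the host's cores. If you did want a quota-aware count, a best-effort sketch (not what we shipped; the file paths assume Linux cgroup v2/v1) could read the cgroup quota directly:

```python
import os

def effective_cpu_count():
    """Best-effort core count that respects cgroup CPU quotas.

    Falls back to os.cpu_count() outside a container or on
    platforms without these cgroup files.
    """
    try:
        # cgroup v2: file contains "<quota> <period>", or "max <period>"
        # when the container is unlimited.
        with open("/sys/fs/cgroup/cpu.max") as f:
            quota, period = f.read().split()
        if quota != "max":
            return max(1, int(quota) // int(period))
    except OSError:
        pass
    try:
        # cgroup v1: quota is -1 when unlimited.
        with open("/sys/fs/cgroup/cpu/cpu.cfs_quota_us") as f:
            quota = int(f.read())
        with open("/sys/fs/cgroup/cpu/cpu.cfs_period_us") as f:
            period = int(f.read())
        if quota > 0:
            return max(1, quota // period)
    except OSError:
        pass
    return os.cpu_count() or 1
```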
So the bug was not with the parallel processing, it was due to scheduling. This explains why it didn't reproduce on the internal instance: our internal instance was issuing smaller machines into the cluster, while production had larger machines with more cores.
The fix was straightforward: Limit max process count to a small number, say, 8:
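In essence (a sketch; the cap of 8 and the constant name come from the text above):

```python
import os

# Cap the worker count: never spawn more than 8 processes,
# no matter how many cores the machine reports.
CPU_COUNT = min(os.cpu_count() or 1, 8)
```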
After doing that, all timeouts disappeared.
This doesn't reduce the performance of our analyzer because as long as your processes are utilizing 100% of the CPU time given to your container, you're running the fastest you can.
In fact, this actually increases performance in this case because using less memory leads to fewer page swaps, thrashing, and so on, ultimately making each single process faster overall.
This whole debugging saga didn't teach us anything particularly novel, but the process we had to go through reinforced some lessons:
- Reproducing the issue is step one. If you can't reproduce the bug reliably, no matter what fix you make, you can't know for sure that the fix works.
- Process of elimination and finding common patterns is step two. Noticing similarities between two instances of the bug, adding logs, and eliminating possible sources of the problem will eventually lead you to the root cause.
- Know your infrastructure, even if you're running in a container. Docker solves a lot of problems by isolating your environment, but resource constraints aren't one of them.
By the time this fix was implemented, our Ruby Analyzer had started seeing similar memory problems, with DeadWorker exceptions. The exact same fix (limiting the analyzer to spawn a maximum of 8 processes) resolved its timeouts as well.
A week later, the same thing happened with PHP, and the same fix worked.
We squashed one nasty Heisenbug.