Skip to content

Excessive CPU usage unless using --gil option under Python 3.12 asyncio #811

@esarp

Description

@esarp

We have a rather bespoke Python setup where we build everything, including pip modules, from source. Most everything is running with gRPC as well, which may be a compounding factor, and we host a daemon on each service that allows a gRPC request to trigger a pyspy profile (via subprocess.run) of arbitrary duration and report back to the caller. We have a service that periodically calls into this endpoint for all of our Python services to collect and aggregate profiles for analysis.

While working on upgrading our Python version from 3.11 to 3.12 we found that our services that are running exclusively under asyncio would become unresponsive and unrecoverable shortly after a profile was requested. We found that the pyspy process would run significantly longer than requested and reach a "zombie" state, which prevented the node from being shutdown. We have that particular issue resolved now, and can now observe that when a profile is requested under Python 3.12, pyspy quickly begins consuming nearly all the CPU on server, preventing it from serving any traffic, which ends up causing it to fail health checks.

Running pyspy manually, I'm able to see a large number of messages in the form of "X.XXs behind in sampling, results may be inaccurate. Try reducing the sampling rate." where the time behind continues to increase with each message. I'm assuming that pyspy running for longer than requested is due to it continuing to collect samples until at least the requested amount is collected, while falling further and further behind, creating a scenario where it can't complete the request at all.

I've tried the following to see if they had any impact, but each only marginally changed the results:

  • Reducing the sample rate from 50 to 1
  • Collecting native and non-native samples
  • Reducing the duration from our default of 30s to 5s
  • Calling pyspy via asyncio.create_subprocess_shell to ensure nothing was unintentionally blocking

The only thing I've found that helps significantly is passing --gil to pyspy. Doing so I don't see any "behind in sampling" messages, and the server is still able to serve ~80% of the requests it normally would. My theory is that the changes to asyncio / GIL in 3.12 might be returning samples on all pending asyncio tasks instead of only some subset of active tasks. These servers are handling ~150-200 QPS, and have a similar number of tasks at any given time, so it may be too much to process in a reasonable timeframe between each collection of samples.

While using --gil is an acceptable temporary fix, ideally we'd like to continue to collect all samples, as using --gil does seem to lose data from gRPC, SSL, and other common C libraries that we're working on improving as well.

Mainly filing this to see if anyone has experienced the same or if this is a known issue, and to look for any ideas to debug further. I can provide any additional details on our setup as needed, but didn't want to flood the post with info right off the bat.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions