
Conversation

@Kha (Member) commented Jan 19, 2026

No description provided.

@thorimur (Contributor) commented Jan 19, 2026

In case it helps: I've been adding tracing to try to investigate the cause of this in #12013. The original thought was that maybe there was a race condition with m_shutting_down in spawn_worker() (flipping to true only after we enter the false branch, so that we miss the notification), but now I'm wondering if it has to do with the "maximum number of workers" condition and logic (in that same function). Each of the hanging traces below satisfies that condition.
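For reference, here is a minimal, self-contained sketch of the kind of missed-notification race that first hypothesis describes. It is not the Lean runtime code; all names (shutting_down, racy_waiter, safe_waiter, signaller) are invented for illustration.

// Hypothetical illustration only; none of this is taken from the Lean sources.
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool shutting_down = false;

// Buggy pattern (what the hypothesis feared): the flag is read before the lock
// is taken, so the signaller can flip it and notify in the gap between the
// check and wait(); the notification is then lost and wait() blocks forever.
void racy_waiter() {
    if (!shutting_down) {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock);
    }
}

// Correct pattern: the predicate is (re-)checked while holding the same mutex
// the signaller holds when flipping the flag, so the wakeup cannot be missed.
void safe_waiter() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return shutting_down; });
}

void signaller() {
    {
        std::lock_guard<std::mutex> lock(m);
        shutting_down = true;
    }
    cv.notify_all();
}

int main() {
    std::thread w(safe_waiter), s(signaller);
    w.join();
    s.join();
    return 0;
}

In the actual spawn_worker() loop quoted below, the flag is only ever read with the lock held (as in safe_waiter), which is part of why the second hypothesis now seems more plausible.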

Note: the tracing was necessary because using a debug build apparently fixed the problem!

So far, I've had it trace in that block just before waiting ("waiting during shutdown"), and again ten minutes later if the wait times out ("timeout after shutdown"). The tracing in the critical loop in spawn_worker() looks like this:

...
while (true) {
    if (m_queues_size == 0 && m_shutting_down) {
        break;
    }
    if (m_queues_size == 0 ||
            // If we have reached the maximum number of standard workers (because the
            // maximum was decreased by `task_get`), wait for someone else to become
            // idle before picking up new work.
            m_std_workers.size() - m_idle_std_workers >= m_max_std_workers) {
        if (m_shutting_down) {
                internal_trace("waiting during shutdown");
        }
        auto status = m_queue_cv.wait_for(lock, chrono::seconds(600));
        if (status == std::cv_status::timeout) {
            if (m_shutting_down) {
                internal_trace("timeout_after_shutdown");
            }
            m_queue_cv.wait(lock);
        }
        continue;
    }
    ...

Here are examples of what has been printed in some hanging runs, obtained by running runLinter over and over (the last line of each block is the last thing printed before the hang). Note that queues is the size of the queues; sometimes it's 1, sometimes 0. Note also that linting completes in each case.

...
-- Linting passed for Mathlib.
task_manager: waiting during shutdown: tm=0x567bc62e4620 tid_hash=134722822846016 mutex=0x567bc62e4620 queue_cv=0x567bc62e4828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: timeout_after_shutdown: tm=0x567bc62e4620 tid_hash=134722822846016 mutex=0x567bc62e4620 queue_cv=0x567bc62e4828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
...
-- Linting passed for Mathlib.
<many lines where (only) queues decreases>
...
task_manager: waiting during shutdown: tm=0x5a0b562e6620 tid_hash=133824134510144 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=344 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5a0b562e6620 tid_hash=133824134510144 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=295 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5a0b562e6620 tid_hash=133824134510144 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=200 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5a0b562e6620 tid_hash=133824134510144 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5a0b562e6620 tid_hash=133824134510144 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5a0b562e6620 tid_hash=133824224851520 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: timeout_after_shutdown: tm=0x5a0b562e6620 tid_hash=133824224851520 mutex=0x5a0b562e6620 queue_cv=0x5a0b562e6828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
...
-- Linting passed for Mathlib.
<many lines where queues decreases>
...
task_manager: waiting during shutdown: tm=0x5f096e973620 tid_hash=132140666693184 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=230 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5f096e973620 tid_hash=132140666693184 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=169 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5f096e973620 tid_hash=132140666693184 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5f096e973620 tid_hash=132140666693184 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5f096e973620 tid_hash=132140700264000 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5f096e973620 tid_hash=132140700264000 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=2 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: timeout_after_shutdown: tm=0x5f096e973620 tid_hash=132140700264000 mutex=0x5f096e973620 queue_cv=0x5f096e973828 shutting_down=1 queues=0 idle=1 std_workers=13 max_std=12 max_prio=0

In most of the successful linting cases, there are no workers waiting during shutdown. But here is an example of a case where a worker does wait during shutdown and we nevertheless do not hang.

...
-- Linting passed for Mathlib.
<many lines where queues decreases>
...
task_manager: waiting during shutdown: tm=0x5bea38448620 tid_hash=128772017747520 mutex=0x5bea38448620 queue_cv=0x5bea38448828 shutting_down=1 queues=429 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5bea38448620 tid_hash=128772017747520 mutex=0x5bea38448620 queue_cv=0x5bea38448828 shutting_down=1 queues=376 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5bea38448620 tid_hash=128772017747520 mutex=0x5bea38448620 queue_cv=0x5bea38448828 shutting_down=1 queues=204 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5bea38448620 tid_hash=128772017747520 mutex=0x5bea38448620 queue_cv=0x5bea38448828 shutting_down=1 queues=4 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5bea38448620 tid_hash=128772154050112 mutex=0x5bea38448620 queue_cv=0x5bea38448828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0
task_manager: waiting during shutdown: tm=0x5bea38448620 tid_hash=128772154050112 mutex=0x5bea38448620 queue_cv=0x5bea38448828 shutting_down=1 queues=1 idle=1 std_workers=13 max_std=12 max_prio=0

Anyway, just wanted to share this information publicly in case it's helpful.

We're going to run the runLinter loop under this PR's toolchain to see if we catch a hang. Please feel free to reach out if any other experiments would help. :)

github-actions bot added the toolchain-available label (A toolchain is available for this PR, at leanprover/lean4-pr-releases:pr-release-NNNN) on Jan 19, 2026
@leanprover-community-bot (Collaborator) commented:

Mathlib CI status (docs):

  • ❗ Batteries/Mathlib CI will not be attempted unless your PR branches off the nightly-with-mathlib branch. Try git rebase 99b26ce49e8071d541692eb52a86731f6addd28a --onto 3dfd125337305c9de6ddcc7c0330c50f0e39fb8e. You can force Mathlib CI using the force-mathlib-ci label. (2026-01-19 22:22:50)

@leanprover-bot (Collaborator) commented:

Reference manual CI status:

  • ❗ Reference manual CI will not be attempted unless your PR branches off the nightly-with-manual branch. Try git rebase 99b26ce49e8071d541692eb52a86731f6addd28a --onto 4af9cc05924d81e1d5dd35105fdce007c5bbc2e8. You can force reference manual CI using the force-manual-ci label. (2026-01-19 22:22:52)

Review thread on the following code:

m_queue_cv.notify_all();
// we can assume that `m_std_workers` will not be changed after this line
}
m_queue_cv.notify_all();


Actually, if I'm not wrong, I think this doesn't change much as it stands: m_shutting_down is always accessed by the workers with the lock acquired, so 'visibility' wouldn't change with this reordering.
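For what it's worth, here is a sketch of that argument. It is illustrative only, not the Lean shutdown code, and the names (mtx, shutting_down_flag, worker_wait, ...) are invented: as long as the worker only reads the flag under the mutex, it cannot tell whether notify_all() ran while the mutex was still held or just after it was released.

// Illustrative only: two shutdown orderings that a lock-protected waiter
// cannot distinguish. All names are invented for this sketch.
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex mtx;
std::condition_variable queue_cv;
bool shutting_down_flag = false;

// Variant A: notify while still holding the lock. Notified waiters re-acquire
// the mutex only after it is released at the end of this scope.
void shutdown_notify_inside() {
    std::unique_lock<std::mutex> lock(mtx);
    shutting_down_flag = true;
    queue_cv.notify_all();
}

// Variant B: notify right after releasing the lock.
void shutdown_notify_after() {
    {
        std::unique_lock<std::mutex> lock(mtx);
        shutting_down_flag = true;
    }
    queue_cv.notify_all();
}

// The waiter only reads the flag under the mutex, so either it observes
// shutting_down_flag == true before waiting, or it is already blocked in
// wait() and receives the notification.
void worker_wait() {
    std::unique_lock<std::mutex> lock(mtx);
    queue_cv.wait(lock, [] { return shutting_down_flag; });
}

int main() {
    std::thread w(worker_wait), s(shutdown_notify_after);
    w.join();
    s.join();
    return 0;
}

Either ordering gives the waiter the same two possibilities, so the lost-wakeup behaviour is unchanged by the reordering.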

I think #12052 is more likely to be the problem: you can see from @thorimur's comment in this PR that we caught this happening with the workers at the limit:

[screenshot of the trace output]
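For concreteness, here is a back-of-the-envelope check of how the traced state evaluates against the two conditions at the top of the spawn_worker() loop. The numbers are copied from the traces above; the local variable names exist only in this sketch.

// Illustrative only: plug the values from the final lines of most of the
// hanging traces (std_workers=13, idle=1, max_std=12, queues=1, shutting_down=1)
// into the conditions at the top of the spawn_worker() loop.
#include <cstdio>

int main() {
    unsigned queues_size      = 1;   // "queues=1" in the trace
    unsigned std_workers      = 13;  // m_std_workers.size()
    unsigned idle_std_workers = 1;   // m_idle_std_workers
    unsigned max_std_workers  = 12;  // m_max_std_workers
    bool     shutting_down    = true;

    // Exit condition: only taken when the queues are empty.
    bool can_break = (queues_size == 0) && shutting_down;                  // false

    // Wait condition: 13 - 1 = 12 >= 12, so the worker waits on m_queue_cv
    // even though a task is still queued and we are shutting down.
    bool must_wait = (queues_size == 0) ||
                     (std_workers - idle_std_workers >= max_std_workers);  // true

    std::printf("can_break=%d must_wait=%d\n", can_break, must_wait);
    return 0;
}

If that reading is right, then with one task still queued and the standard workers pinned at the maximum, this worker keeps waiting on m_queue_cv during shutdown, which would match the timeout_after_shutdown lines above.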
