Skip to content

test: ensure temporary directories are cleaned up after tests#5163

Open
jetjinser wants to merge 2 commits intonervosnetwork:developfrom
jetjinser:4177
Open

test: ensure temporary directories are cleaned up after tests#5163
jetjinser wants to merge 2 commits intonervosnetwork:developfrom
jetjinser:4177

Conversation

@jetjinser
Copy link
Copy Markdown

@jetjinser jetjinser commented Apr 7, 2026

What problem does this PR solve?

After running make test, temporary directories with the ckb-tmp- prefix remained in $TMPDIR because background tasks held Arc<Shared> references, preventing TempDir cleanup.
This is fixed by broadcasting the global exit signal or passing weak Arc.

For make integration, the old temp_path function would create a TempDir, immediately drop it, and return a plain PathBuf. Obviously the TempDir was gone right after that function returned, but the test code would go ahead and recreate the directory at that same path later—and nobody ever cleaned it up. So ckb-it-* dirs just piled up.
The fix is to stop throwing away the TempDir handle: the Worker now owns one root TempDir for the whole run, and everything else gets a subpath under it. When the root goes out of scope, the entire tree gets wiped automatically.

Issue Number: close #4177

What is changed and how it works?

Unit tests (make test): RpcTestSuite got a Drop impl that calls ckb_stop_handler::broadcast_exit_signals() so background services actually stop before Shared(TempDir) goes away.

Integration tests (make integration): instead of temp_path throwing away the TempDir, the Worker creates a root TempDir upfront and passes the path down to Node::new and Net::new. All test subdirs live under that root, so cleanup is automatic and works.

Related changes

This PR should not have any impact except test.

Fixes the issue where hundreds of `ckb-tmp-*` directories accumulate in `TMPDIR`
after running `make test` or `make integration`.

The `Arc<SledBackend>` held by spawned tasks prevented the `TempDir` from being
dropped. Lets tasks exit and release their refs.
@jetjinser jetjinser requested a review from a team as a code owner April 7, 2026 23:45
@jetjinser jetjinser requested review from zhangsoledad and removed request for a team April 7, 2026 23:45
@jetjinser jetjinser marked this pull request as draft April 8, 2026 00:41
@jetjinser jetjinser marked this pull request as ready for review April 9, 2026 03:34
@quake quake requested a review from Copilot April 10, 2026 02:09
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses lingering temporary directories created during unit (make test) and integration (make integration) test runs by changing how test working directories are owned and by ensuring certain background tasks no longer keep strong references that prevent cleanup.

Changes:

  • Integration tests: introduce a root TempDir owned by the test Worker, and derive node/net working directories from it instead of generating “throwaway” temp paths.
  • Unit tests: adjust background behavior to allow tempdir cleanup (HeaderMap uses Weak, RPC tests trigger stop signaling on teardown).
  • Refactor test APIs: update Spec::before_run, Node::new, and Net::new signatures to accept a root directory.

Reviewed changes

Copilot reviewed 31 out of 31 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
test/src/worker.rs Creates and retains a root TempDir; passes per-spec subpaths into test setup.
test/src/utils.rs Removes temp_path helper that previously discarded TempDir handles.
test/src/specs/mod.rs Updates Spec::before_run to accept a root_dir for per-spec setup.
test/src/specs/** Updates specs to pass root_dir into Node::new and Net::new.
test/src/node.rs Makes node working dirs deterministic under a provided root; stores root_dir on Node.
test/src/net.rs Makes net working dir deterministic under a provided root; updates constructor signature.
test/src/lib.rs Minor import adjustment.
shared/src/types/header_map/mod.rs Prevents background task from keeping HeaderMap alive by using Weak.
rpc/src/tests/mod.rs Adds Drop for RpcTestSuite that broadcasts stop signals on teardown.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +169 to +182
@@ -162,7 +177,9 @@ impl Worker {
})
.unwrap();

let mut nodes = spec.before_run();
let nodes_root = spec_dir.join("nodes");
let mut nodes = spec.before_run(nodes_root);

Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

run_spec builds spec_dir solely from spec.name(), and then all node/net working dirs become deterministic under that path. This changes previous behavior where temp_path generated a fresh random directory each time; with retries (or multiple nets/nodes in a spec), stale data can be reused and different specs with the same type-name suffix could collide. Consider making the per-spec root unique per attempt (e.g., include retried or a random suffix / TempDir per spec run) and/or removing any existing spec_dir before creating nodes.

Copilot uses AI. Check for mistakes.
Comment on lines +228 to +233
// Shut down background tasks to release `Arc<Shared>` and allow temp dir cleanup.
// It's should be safe in this test suite.
impl Drop for RpcTestSuite {
fn drop(&mut self) {
ckb_stop_handler::broadcast_exit_signals();
}
Copy link

Copilot AI Apr 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Calling ckb_stop_handler::broadcast_exit_signals() in RpcTestSuite's Drop will cancel the global stop token for the entire test process and cannot be reset. Since many RPC tests create and drop their own RpcTestSuite, this risks prematurely stopping services in other tests (including tests that run after the first one), leading to flakiness or failures. Prefer an explicit, suite-local shutdown mechanism (e.g., stop/join the services started in setup_rpc_test_suite) or ensure the broadcast is performed once at the very end of the whole test binary rather than on every suite drop.

Suggested change
// Shut down background tasks to release `Arc<Shared>` and allow temp dir cleanup.
// It's should be safe in this test suite.
impl Drop for RpcTestSuite {
fn drop(&mut self) {
ckb_stop_handler::broadcast_exit_signals();
}
// Do not broadcast global exit signals from a suite-local `Drop`.
// `ckb_stop_handler::broadcast_exit_signals()` cancels a process-global stop token
// and cannot be reset, so calling it here would interfere with other tests that
// create and drop their own `RpcTestSuite`.
impl Drop for RpcTestSuite {
fn drop(&mut self) {}

Copilot uses AI. Check for mistakes.
@quake
Copy link
Copy Markdown
Member

quake commented Apr 10, 2026

thanks for the PR, good catch!

I have a small suggestion:

Net::new doesn't need a root_dir parameter. Its working_dir is only used internally for NetworkConfig::path (secret key, peer store, etc.), and Net::working_dir() is never called externally.

Instead of threading root_dir through Spec::before_run - Node - Net (which adds a root_dir field/method to Node and touches ~30 call sites), just let Net own a TempDir directly:

pub struct Net {
    // ...
    _temp_dir: TempDir, // auto-cleanup on drop
}
impl Net {
    // signature unchanged — no root_dir needed
    pub fn new(spec_name: &str, consensus: &Consensus, protocols: Vec<SupportProtocols>) -> Self {
        let temp_dir = tempfile::Builder::new()
            .prefix(&format!("ckb-it-{}-net-", spec_name))
            .tempdir()
            .expect("create tempdir failed");
        let working_dir = temp_dir.path().to_path_buf();
        // ...
    }
}

This eliminates the root_dir field and root_dir() method on Node, keeps Net::new's original signature, and avoids changing all ~30 call sites.

@jetjinser
Copy link
Copy Markdown
Author

jetjinser commented Apr 10, 2026

Instead of threading root_dir through Spec::before_run - Node - Net (which adds a root_dir field/method to Node and touches ~30 call sites), just let Net own a TempDir directly:

pub struct Net {
    // ...
    _temp_dir: TempDir, // auto-cleanup on drop
}
impl Net {
    // signature unchanged — no root_dir needed
    pub fn new(spec_name: &str, consensus: &Consensus, protocols: Vec<SupportProtocols>) -> Self {
        let temp_dir = tempfile::Builder::new()
            .prefix(&format!("ckb-it-{}-net-", spec_name))
            .tempdir()
            .expect("create tempdir failed");
        let working_dir = temp_dir.path().to_path_buf();
        // ...
    }
}

This eliminates the root_dir field and root_dir() method on Node, keeps Net::new's original signature, and avoids changing all ~30 call sites.

Thanks for the review and advice!

However, we cannot simply bind the lifetime of TempDir to Net. The NetworkStat owns a PathBuf from the TempDir, which DumpPeerStoreService uses to keep active file handles open. Since DumpPeerStoreService is dropped after Net, these unreleased handles prevent the TempDir from cleaning up the directory properly (i.e., it fails silently).

Ideally, this would be the best approach for cleanup—in fact, it’s exactly what I attempted before implementing the current solution in this PR.

Please let me know if I’ve overlooked any details or if you have further suggestions!

@quake
Copy link
Copy Markdown
Member

quake commented Apr 10, 2026

However, we cannot simply bind the lifetime of TempDir to Net. The NetworkStat owns a PathBuf from the TempDir, which DumpPeerStoreService uses to keep active file handles open. Since DumpPeerStoreService is dropped after Net, these unreleased handles prevent the TempDir from cleaning up the directory properly (i.e., it fails silently).

Ideally, this would be the best approach for cleanup—in fact, it’s exactly what I attempted before implementing the current solution in this PR.

Please let me know if I’ve overlooked any details or if you have further suggestions!

_async_runtime is declared before _working_dir, so the runtime drops first. This shuts down all spawned tasks including DumpPeerStoreService, which runs its Drop impl (writing peer store to disk) while the directory still exists. Only after that does _working_dir: TempDir drop and clean up the directory.

even if there's a async timing issue, it probably doesn't matter here. DumpPeerStoreService::drop() writes peer store data, which is irrelevant in a test context, we're about to delete the entire directory anyway. A failed write just produces a warn-level log.

Have you tried the simpler approach locally, letting Net own its own TempDir internally (keeping the current Net::new signature unchanged)?. If you haven't tried it, I can give it a shot on my machine to verify it works.

@jetjinser
Copy link
Copy Markdown
Author

jetjinser commented Apr 10, 2026

_async_runtime is declared before _working_dir, so the runtime drops first. This shuts down all spawned tasks including DumpPeerStoreService, which runs its Drop impl (writing peer store to disk) while the directory still exists. Only after that does _working_dir: TempDir drop and clean up the directory.

Yes! I completely forgot about that.

Based on your suggestion, I pushed a new commit.

Comment on lines +169 to +172
// Used to extend the lifecycle of TempPathBuf
let node_paths = nodes
.iter()
.map(|node| node.working_dir())
.map(|node| node.owned_working_dir())
Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the reason for the changes to the Notify struct: node_paths is now passed along with Notify to ensure that the TempDir within the Node isn't dropped prematurely when the current function ends. This prevents potential errors in subsequent operations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unit tests and Integration tests do not clean up temporary files

3 participants