[CI] Restore sharding tests #5446

Draft
giordano wants to merge 6 commits into main from mg/restore-sharding

@giordano
Collaborator

Follow-up to #5318. Also, install earlyoom to handle OOMs more gracefully.

@giordano giordano added the reactant ∇ all day I dream about MLIR label Mar 27, 2026
@giordano
Collaborator Author

https://github.com/CliMA/Oceananigans.jl/actions/runs/23672027105/job/68967345439?pr=5446#step:7:3168

E0000 00:00:1774656104.133710    9079 rendezvous.cc:116] [id=0] This thread has been waiting for `all reduce RendezvousKey{run_id=RunId: 41, global_devices=[0, 2], num_local_participants=2, collective_op_kind=cross_module, op_id=141}` for 20 seconds and may be stuck. Expected 2 threads to join the rendezvous, but not all of them arrived on time.

@codecov

codecov bot commented Mar 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.59%. Comparing base (f70a4e4) to head (57b26a3).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #5446   +/-   ##
=======================================
  Coverage   73.59%   73.59%           
=======================================
  Files         400      400           
  Lines       22867    22867           
=======================================
+ Hits        16829    16830    +1     
+ Misses       6038     6037    -1     
Flag         Coverage Δ
buildkite    69.00% <ø> (+<0.01%) ⬆️
julia        69.00% <ø> (+<0.01%) ⬆️
reactant_1   6.45% <ø> (ø)
reactant_2   10.44% <ø> (ø)
reactant_3   9.56% <ø> (ø)

@giordano giordano force-pushed the mg/restore-sharding branch from 57b26a3 to 4bcf6d7 Compare March 29, 2026 17:45
@giordano giordano marked this pull request as draft March 29, 2026 18:18
@giordano
Collaborator Author

giordano commented Mar 29, 2026

https://github.com/CliMA/Oceananigans.jl/actions/runs/23715157779/job/69080960191?pr=5446#step:7:3364

┌ Debug: MLIR module written to /tmp/reactant_UzsUOk/module_839_reactant_tokw_post_xla_compile.mlir
└ @ Reactant.MLIR.IR ~/.julia/packages/Reactant/Qucp9/src/mlir/IR/Pass.jl:148
mem avail:  3599 of 15993 MiB (22.50%), swap free: 3067 of 3071 MiB (99.86%)
mem avail:   593 of 15993 MiB ( 3.71%), swap free: 2661 of 3071 MiB (86.63%)
mem avail:   656 of 15993 MiB ( 4.11%), swap free: 2143 of 3071 MiB (69.77%)
mem avail:   691 of 15993 MiB ( 4.32%), swap free: 1634 of 3071 MiB (53.20%)
mem avail:   695 of 15993 MiB ( 4.35%), swap free: 1111 of 3071 MiB (36.18%)
mem avail:   701 of 15993 MiB ( 4.39%), swap free:  601 of 3071 MiB (19.59%)

[9064] signal 15: Terminated

Edit: after reshuffling how the tests are run: https://github.com/CliMA/Oceananigans.jl/actions/runs/23716270980/job/69083893151?pr=5446#step:8:1765

┌ Debug: MLIR module written to /tmp/reactant_p3ESZl/module_839_reactant_tokw_post_xla_compile.mlir
└ @ Reactant.MLIR.IR ~/.julia/packages/Reactant/Qucp9/src/mlir/IR/Pass.jl:148
mem avail:  1620 of 15989 MiB (10.14%), swap free: 3040 of 3071 MiB (98.98%)
mem avail:  1491 of 15989 MiB ( 9.33%), swap free: 2413 of 3071 MiB (78.56%)
mem avail:  1425 of 15989 MiB ( 8.92%), swap free: 1877 of 3071 MiB (61.11%)
mem avail:  1393 of 15989 MiB ( 8.71%), swap free: 1388 of 3071 MiB (45.19%)
E0000 00:00:1774810117.468986    2826 rendezvous.cc:116] [id=0] This thread has been waiting for `all reduce RendezvousKey{run_id=RunId: 41, global_devices=[0, 2], num_local_participants=2, collective_op_kind=cross_module, op_id=141}` for 20 seconds and may be stuck. Expected 2 threads to join the rendezvous, but not all of them arrived on time.
mem avail:  1387 of 15989 MiB ( 8.68%), swap free:  838 of 3071 MiB (27.30%)
mem avail:  1387 of 15989 MiB ( 8.68%), swap free:  241 of 3071 MiB ( 7.86%)
[2814] signal 15: Terminated

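The job is clearly "only" running out of memory on this machine: free swap drains steadily while available memory stays low, until the process receives signal 15.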
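
To put a number on the swap drain, the memory report lines quoted above can be parsed with a few lines of Python. This is only a sketch: the regular expression is inferred from the log lines in this thread, not from any documented format, and `parse_memory_reports` and `swap_consumed_mib` are hypothetical helper names, not part of the CI setup.

```python
import re

# Pattern inferred from the "mem avail: ... swap free: ..." lines quoted
# in this thread (an assumption, not a documented log format).
LINE_RE = re.compile(
    r"mem avail:\s*(\d+) of\s*(\d+) MiB \(\s*[\d.]+%\), "
    r"swap free:\s*(\d+) of\s*(\d+) MiB \(\s*[\d.]+%\)"
)


def parse_memory_reports(log_text):
    """Return one dict per memory report line found in log_text."""
    reports = []
    for match in LINE_RE.finditer(log_text):
        mem_avail, mem_total, swap_free, swap_total = map(int, match.groups())
        reports.append({
            "mem_avail_mib": mem_avail,
            "mem_total_mib": mem_total,
            "swap_free_mib": swap_free,
            "swap_total_mib": swap_total,
        })
    return reports


def swap_consumed_mib(reports):
    """Swap used between the first and the last report, in MiB."""
    return reports[0]["swap_free_mib"] - reports[-1]["swap_free_mib"]


# The six memory report lines from the second run quoted above.
SAMPLE = """\
mem avail:  1620 of 15989 MiB (10.14%), swap free: 3040 of 3071 MiB (98.98%)
mem avail:  1491 of 15989 MiB ( 9.33%), swap free: 2413 of 3071 MiB (78.56%)
mem avail:  1425 of 15989 MiB ( 8.92%), swap free: 1877 of 3071 MiB (61.11%)
mem avail:  1393 of 15989 MiB ( 8.71%), swap free: 1388 of 3071 MiB (45.19%)
mem avail:  1387 of 15989 MiB ( 8.68%), swap free:  838 of 3071 MiB (27.30%)
mem avail:  1387 of 15989 MiB ( 8.68%), swap free:  241 of 3071 MiB ( 7.86%)
"""

if __name__ == "__main__":
    reports = parse_memory_reports(SAMPLE)
    print(f"{len(reports)} reports, swap consumed: {swap_consumed_mib(reports)} MiB")
```

On the six lines from the second run, free swap drops by 2799 MiB (from 3040 to 241 MiB) before the process is terminated.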

@giordano giordano force-pushed the mg/restore-sharding branch from f987a96 to 989718e Compare March 29, 2026 21:09
