Defer slicing for compute-on-fetch writer outputs by taimoorsohail · Pull Request #5502 · CliMA/Oceananigans.jl

taimoorsohail · 2026-04-14T05:14:16Z

Summary

This PR fixes a distributed output-writer bug for with_halos=false.

For diagnostics evaluated when output is written (derived fields, reductions, and time averages), interior slicing could be applied too early. In distributed runs, that can strip indices the computation still needs for halo exchange and trigger halo, bounds, or MPI failures.

Minimal Reproducer (before fix)

using Oceananigans
using Oceananigans.OutputWriters

# distributed tripolar setup omitted for brevity
u = model.velocities.u
diag = Field(u * u; indices=(:, :, grid.Nz))  # write-time evaluated derived field

simulation.output_writers[:diag] = JLD2Writer(
    model, (; diag),
    schedule=IterationInterval(1),
    with_halos=false,
    filename="diag.jld2",
    overwrite_existing=true
)

Failure mode: the construction/fetch path can use overly sliced indices for the compute step, causing distributed halo communication to fail.

What Changed (by file)

`src/OutputWriters/output_construction.jl`

Added logic to identify outputs that are evaluated at write time and require a halo-safe compute path.
For with_halos=false in those cases, construction now creates:
1. a halo-capable source output for computation, and
2. interior write indices for persisted output.
Construction then returns a deferred internal wrapper instead of eagerly forcing a single sliced field to serve both roles.

`src/OutputWriters/fetch_output.jl`

Added internal DeferredSlicedOutput handling.
Fetch now:
1. computes/fetches from the halo-capable source output,
2. then applies write slicing (write_indices) before conversion/write.

This preserves correct distributed compute behavior while still writing interior-only data for with_halos=false.
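As a rough illustration of the compute-then-slice fetch path described above, here is a minimal plain-Julia sketch that needs no Oceananigans. The names `ToyDeferredSlicedOutput` and `toy_fetch_output` are hypothetical stand-ins, and a zero-argument function plays the role of the halo-capable source output; the real wrapper and its methods may differ.

```julia
# Toy stand-in for the PR's wrapper; all names here are illustrative only.
struct ToyDeferredSlicedOutput{SO, I}
    source_output :: SO   # zero-argument function that computes on halo-including data
    write_indices :: I    # tuple of ranges selecting the interior of that result
end

# Compute on the halo-including data first, then slice for writing.
function toy_fetch_output(output::ToyDeferredSlicedOutput)
    full_output = output.source_output()
    return view(full_output, output.write_indices...)
end

# One-dimensional "field": N interior points padded by H halo points per side.
H, N = 1, 4
u = collect(1.0:(N + 2H))        # [1, 2, 3, 4, 5, 6]
compute_with_halos() = u .* u    # stands in for a derived field such as u * u

deferred = ToyDeferredSlicedOutput(compute_with_halos, ((H + 1):(H + N),))
written = toy_fetch_output(deferred)
```

In this toy, the compute step sees all six points including halos, while `written` exposes only the four interior values.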

`test/test_jld2_writer.jl`

Added regressions covering with_halos=false for write-time evaluated diagnostics under:
- TimeInterval
- IterationInterval
- AveragedTimeInterval

`test/test_mpi_tripolar.jl`

Added a distributed tripolar MPI regression test for a computed surface diagnostic with with_halos=false.
It verifies that writer setup and the first write no longer hit the earlier halo/bounds/MPI failure.

Behavioral Clarification

This does not compute a full 3D domain and then keep a tiny 2D slice.

It computes over the requested region with halo-capable indices (for safe communication), then applies interior slicing only for what is written.
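The two slicing stages can be pictured with plain arrays. This is only a sketch under assumed sizes: `Nx`, `Ny`, `Nz`, `Hx`, and `Hy` are hypothetical, the toy has no vertical halo, and an elementwise square stands in for the real diagnostic.

```julia
Nx, Ny, Nz = 4, 3, 2   # interior size (hypothetical)
Hx, Hy = 1, 1          # horizontal halo widths (hypothetical)

# Parent data including horizontal halos; the vertical has no halo in this toy.
data = reshape(collect(1.0:((Nx + 2Hx) * (Ny + 2Hy) * Nz)), Nx + 2Hx, Ny + 2Hy, Nz)

# Slice 1 (construction): keep only the top level, but keep the horizontal halos.
source = data[:, :, Nz:Nz]
computed = source .^ 2   # the "compute" step still sees halo points

# Slice 2 (write): strip halos from the already-reduced result.
written = view(computed, (Hx + 1):(Hx + Nx), (Hy + 1):(Hy + Ny), :)

size(source)    # (6, 5, 1): one level with halos, never the full 3D volume
size(written)   # (4, 3, 1): interior only
```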

Compatibility

No public API changes.
with_halos=false output shape semantics remain interior-only.
with_halos=true behavior unchanged.

With Codex

taimoorsohail · 2026-04-14T05:43:43Z

@simone-silvestri did I just repeat work you already did in #5492 ??

simone-silvestri · 2026-04-14T05:48:18Z

Nono, this is an independent bug with output writers, good catch

simone-silvestri · 2026-04-14T05:50:47Z


+function fetch_output(output::DeferredSlicedOutput, model)
+    full_output = fetch_output(output.source_output, model)
+    return view(full_output, output.write_indices...)


I think fetch_output for AbstractFields returns plain arrays, while this seems to return a Field?

This computes a view on the result of fetch_output. So if fetch_output returns an Array, then view(full_output, ...) is a SubArray.

This returns something Array-like, which I think is OK?
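The point about `view` can be checked in plain Julia. This is a small sketch with an ordinary `Array` standing in for whatever `fetch_output` returns; the index ranges are arbitrary.

```julia
A = reshape(collect(1:24), 4, 3, 2)   # plain 3D Array standing in for fetched data
v = view(A, 2:3, :, 2)                # drop the edge rows, pick the second level

v isa SubArray         # true: a view wraps the parent rather than copying it
v isa AbstractArray    # true: generic array code accepts it
size(v)                # (2, 3)
collect(v) isa Array   # true: materializes a plain copy if one is needed
```

If a downstream writer needs a plain `Array` rather than any `AbstractArray`, `collect` or `Array(v)` would produce one at the cost of a copy.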

simone-silvestri · 2026-04-14T06:43:47Z

    return nothing
 end

+function test_jld2_with_halos_false_compute_outputs(arch)


These tests also pass on main, so I am not sure what gets exercised here.

glwagner · 2026-04-14T15:52:47Z

I might be confused about the description, but we of course do not want to compute 3D and then only output a small 2D slice? Can this sentence be clarified:

> The fix defers slicing until fetch/write time for compute-on-fetch outputs.

glwagner · 2026-04-14T15:54:34Z

    additional_kw = user_output isa Field ? NamedTuple() : (; compute=false)

-    return Field(user_output; indices, additional_kw...)
+    if !with_halos && requires_compute_on_fetch(user_output)


I'm struggling to grok what "compute on fetch" means. Can we try to rethink the semantics? I am admittedly pretty dense 😂

what about needs_halo_compute_for_write? Is that better framing?

what are the cases that require halo slicing but do not require the second condition?

Is the condition being sought iscomputed? In that case, it would be a test for non-Nothing Field.operand and could belong in Fields module.
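A test along the lines suggested above could look like the following sketch. `ToyField` and `toy_iscomputed` are hypothetical names for illustration; whether the real `Field` type can be tested this way is an assumption taken from the comment, not verified against the Fields module.

```julia
# Hypothetical miniature of a field that may or may not carry an operand.
struct ToyField{O, D}
    operand :: O   # `nothing` for plain fields, an operation otherwise
    data :: D
end

# A field counts as "computed" when its operand is something other than `nothing`.
toy_iscomputed(field::ToyField) = !isnothing(field.operand)

plain   = ToyField(nothing, zeros(3))
derived = ToyField(x -> x .^ 2, zeros(3))
```

With a predicate like this, the branch in output construction would depend on a property of the field itself rather than on writer-specific wording.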

glwagner · 2026-04-14T15:55:05Z

+struct DeferredSlicedOutput{SO, I}
+    source_output :: SO
+    write_indices :: I
+end


If the purpose of this is only to slice off halo regions, should we call it OutputWithoutHalos or something?

or SlicedOutputWithoutHalos

how about ComputeThenSliceOutput? Because that is what it is doing...?

Or ComputeWithHalosThenSliceOutput to be more clear

Ah I see. Maybe ComputedHalosWithoutHalos.

This represents output that must be `compute!`d?

Is there any case where we would compute without halos? This part is causing the confusion for me

glwagner · 2026-04-14T15:56:12Z

-    return Field(user_output; indices, additional_kw...)
+    if !with_halos && requires_compute_on_fetch(user_output)
+        source_indices = output_indices(user_output, user_indices, true)
+        source_output = Field(user_output; indices=source_indices, additional_kw...)


This answers my earlier concern, since the Field is indeed constructed with non-trivial indices.

glwagner · 2026-04-14T15:57:21Z

> I might be confused about the description, but we of course do not want to compute 3D and then only output a small 2D slice? Can this sentence be clarified:
>
> > The fix defers slicing until fetch/write time for compute-on-fetch outputs.

I think the answer is that this description is a little misleading. Slicing is not completely deferred. We slice TWICE: during output construction, and then further for output. Is that right?

taimoorsohail · 2026-04-16T02:52:24Z

> > I might be confused about the description, but we of course do not want to compute 3D and then only output a small 2D slice? Can this sentence be clarified:
> >
> > > The fix defers slicing until fetch/write time for compute-on-fetch outputs.
>
> I think the answer is that this description is a little misleading. Slicing is not completely deferred. We slice TWICE: during output construction, and then further for output. Is that right?

Yes, but during output construction we slice with halos (the halo information is required for the compute), and then for output writing we remove the halos before saving.

taimoorsohail · 2026-04-16T11:16:30Z

OK, I removed one of the tests which was indeed not accomplishing anything. The other one is good though.

taimoorsohail · 2026-04-16T11:16:48Z

I had a chat with @simone-silvestri and I think this is good to go?

simone-silvestri · 2026-04-16T12:22:50Z

Is this only for distributed? It would be nice to see a test that fails on main (serial) but passes on this PR?

taimoorsohail · 2026-04-17T02:08:52Z

Yes, this is only an issue for distributed, where the domain is partitioned. I believe that tripolar_compute_output_writer_script should fail on main (distributed) but not in this PR (distributed)

navidcy · 2026-04-17T03:29:04Z

Sorry to come late in the chat, but don't we anyway always want to slice off the halos only at the very end, just before writing to disk? Why is this only a distributed-related thing?

taimoorsohail · 2026-04-17T04:20:24Z

@navidcy You are right. I think the bug is only exposed in distributed runs; for some reason it works in serial (i.e. the tests pass), but it is still a bug...

Defer slicing for compute-on-fetch writer outputs

d7f6ec2

taimoorsohail requested review from glwagner, navidcy and simone-silvestri April 14, 2026 05:15

taimoorsohail self-assigned this Apr 14, 2026

taimoorsohail added the bug 🐞 and output 💾 labels Apr 14, 2026

simone-silvestri reviewed Apr 14, 2026

Update test_mpi_tripolar.jl

89baaa1

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>

simone-silvestri reviewed Apr 14, 2026

glwagner reviewed Apr 14, 2026

Merge branch 'main' into ts/remove-early-slicing-in-outputwrters

87d7b61

taimoorsohail added 3 commits April 16, 2026 13:07

Add DeferredSlicedOutput metadata pass-through methods

6f24867

Fix DeferredSlicedOutput indices method import

609a52e

Fix OutputWriters import extensions and clean Advection architecture …

9fff92e

…warning

simone-silvestri reviewed Apr 16, 2026

taimoorsohail and others added 2 commits April 16, 2026 20:52

Update src/OutputWriters/fetch_output.jl

e1bd943

Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>

Remove serial with_halos=false compute-output JLD2 test

918098d
