Skip to content

Defer slicing for compute-on-fetch writer outputs#5502

Open
taimoorsohail wants to merge 8 commits intomainfrom
ts/remove-early-slicing-in-outputwrters
Open

Defer slicing for compute-on-fetch writer outputs#5502
taimoorsohail wants to merge 8 commits intomainfrom
ts/remove-early-slicing-in-outputwrters

Conversation

@taimoorsohail
Copy link
Copy Markdown
Collaborator

@taimoorsohail taimoorsohail commented Apr 14, 2026

Summary

This PR fixes a distributed output-writer bug for with_halos=false.

For diagnostics evaluated when output is written (derived fields, reductions, and time averages), interior slicing could be applied too early. In distributed runs, that can remove halo-needed indices from the compute path and trigger halo/bounds/MPI failures.

Minimal Reproducer (before fix)

using Oceananigans
using Oceananigans.OutputWriters

# distributed tripolar setup omitted for brevity
u = model.velocities.u
diag = Field(u * u; indices=(:, :, grid.Nz))  # write-time evaluated derived field

simulation.output_writers[:diag] = JLD2Writer(
    model, (; diag),
    schedule=IterationInterval(1),
    with_halos=false,
    filename="diag.jld2",
    overwrite_existing=true
)

Failure mode: construction/fetch path can use overly-sliced indices for the compute step, causing distributed halo communication failure.

What Changed (by file)

src/OutputWriters/output_construction.jl

  • Added logic to identify outputs that are evaluated at write time and require a halo-safe compute path.
  • For with_halos=false in those cases, construction now creates:
    1. a halo-capable source output for computation, and
    2. interior write indices for persisted output.
  • Returns a deferred internal wrapper instead of eagerly forcing a single sliced field for both roles.

src/OutputWriters/fetch_output.jl

  • Added internal DeferredSlicedOutput handling.
  • Fetch now:
    1. computes/fetches from the halo-capable source output,
    2. then applies write slicing (write_indices) before conversion/write.

This preserves correct distributed compute behavior while still writing interior-only data for with_halos=false.

test/test_jld2_writer.jl

  • Added regressions covering with_halos=false for write-time evaluated diagnostics under:
    • TimeInterval
    • IterationInterval
    • AveragedTimeInterval

test/test_mpi_tripolar.jl

  • Added distributed tripolar MPI regression for a computed surface diagnostic with with_halos=false.
  • Verifies writer setup + first write path avoids prior halo/bounds/MPI failure.

Behavioral Clarification

This does not compute a full 3D domain and then keep a tiny 2D slice.

It computes over the requested region with halo-capable indices (for safe communication), then applies interior slicing only for what is written.

Compatibility

  • No public API changes.
  • with_halos=false output shape semantics remain interior-only.
  • with_halos=true behavior unchanged.
With Codex

@taimoorsohail taimoorsohail self-assigned this Apr 14, 2026
@taimoorsohail taimoorsohail added bug 🐞 Even a perfect program still has bugs output 💾 labels Apr 14, 2026
@taimoorsohail
Copy link
Copy Markdown
Collaborator Author

@simone-silvestri did I just repeat work you already did in #5492 ??

@simone-silvestri
Copy link
Copy Markdown
Collaborator

Nono, this is an independent bug with output writers, good catch


function fetch_output(output::DeferredSlicedOutput, model)
full_output = fetch_output(output.source_output, model)
return view(full_output, output.write_indices...)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the fetch_output for AbstractFields return plain arrays, while this seems to return a Field?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This computes a view on the result of fetch_output. So if fetch_output returns an Array, then view(full_output, ...) is a SubArray.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This returns something Array-like, which I think is OK?

Comment thread test/test_mpi_tripolar.jl Outdated
Co-authored-by: Simone Silvestri <silvestri.simone0@gmail.com>
Comment thread test/test_jld2_writer.jl Outdated
return nothing
end

function test_jld2_with_halos_false_compute_outputs(arch)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these tests pass also on main, I am not sure what gets exercised here

@glwagner
Copy link
Copy Markdown
Member

I might be confused about the description, but we of course do not want to compute 3D and then only output a small 2D slice? Can this sentence be clarified:

The fix defers slicing until fetch/write time for compute-on-fetch outputs.

additional_kw = user_output isa Field ? NamedTuple() : (; compute=false)

return Field(user_output; indices, additional_kw...)
if !with_halos && requires_compute_on_fetch(user_output)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to grok what "compute on fetch" means. Can we try to rethink the semantics? I am admittedly pretty dense 😂

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what about needs_halo_compute_for_write? Is that better framing?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are the cases that require halo slicing but do not require the second condition?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the condition being sought iscomputed? In that case, it would be a test for non-Nothing Field.operand and could belong in Fields module.

struct DeferredSlicedOutput{SO, I}
source_output :: SO
write_indices :: I
end
Copy link
Copy Markdown
Member

@glwagner glwagner Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the purpose of this is only to slice of halo regions, should we call it OutputWithoutHalos or something?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or SlicedOutputWithoutHalos

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about ComputeThenSliceOutput? Because that is what it is doing...?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or ComputeWithHalosThenSliceOutput to be more clear

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Maybe ComputedHalosWithoutHalos.

This represents output that must be compute! ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where we would compute without halos? This part is causing the confusion for me

return Field(user_output; indices, additional_kw...)
if !with_halos && requires_compute_on_fetch(user_output)
source_indices = output_indices(user_output, user_indices, true)
source_output = Field(user_output; indices=source_indices, additional_kw...)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this answers my earlier concern since the Field is indeed constructed with a non-trivial indices

@glwagner
Copy link
Copy Markdown
Member

I might be confused about the description, but we of course do not want to compute 3D and then only output a small 2D slice? Can this sentence be clarified:

The fix defers slicing until fetch/write time for compute-on-fetch outputs.

I think the answer is that this description is a little misleading. Slicing is not completely deferred. We slice TWICE: during output construction, and then further for output. Is that right?

@taimoorsohail
Copy link
Copy Markdown
Collaborator Author

taimoorsohail commented Apr 16, 2026

I might be confused about the description, but we of course do not want to compute 3D and then only output a small 2D slice? Can this sentence be clarified:

The fix defers slicing until fetch/write time for compute-on-fetch outputs.

I think the answer is that this description is a little misleading. Slicing is not completely deferred. We slice TWICE: during output construction, and then further for output. Is that right?

Yes, but during output construction, we slice with halos (as the halo information is required for compute) and then for output writing we remove the halos before saving.

Comment thread src/OutputWriters/fetch_output.jl
taimoorsohail and others added 2 commits April 16, 2026 20:52
@taimoorsohail
Copy link
Copy Markdown
Collaborator Author

OK, I removed one of the tests which was indeed not accomplishing anything. The other one is good though.

@taimoorsohail
Copy link
Copy Markdown
Collaborator Author

I had a chat with @simone-silvestri and I think this is good to go?

@simone-silvestri
Copy link
Copy Markdown
Collaborator

Is this only for distributed? It would be nice to see a test that fails on main (serial) but passes on this PR?

@taimoorsohail
Copy link
Copy Markdown
Collaborator Author

Yes, this is only an issue for distributed, where the domain is partitioned. I believe that tripolar_compute_output_writer_script should fail on main (distributed) but not in this PR (distributed)

@navidcy
Copy link
Copy Markdown
Member

navidcy commented Apr 17, 2026

Sorry to come late in the chat, but don't we anyway always want to slice off the halos only at the very end, just before writing to disk? Why is this only a distributed-related thing?

@taimoorsohail
Copy link
Copy Markdown
Collaborator Author

@navidcy You are right -- I think that the bug is exposed only in distributed, for some reason it works in serial (i.e. tests pass) but is still a bug...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug 🐞 Even a perfect program still has bugs output 💾

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants