Issue - 1711 - Subset MPI communicator #1769
Rohit-Kakodkar wants to merge 5 commits into issue-1711-MPI-tests from
Conversation
Implements subset communicator execution within ``COMM_WORLD``:
- Adds a new MPI constructor
- Creates a subset communicator and sets it so that simulation communication happens in this communicator
- Updates some usages.
Pull request overview
Adds support for running SPECFEM++ using a subset MPI communicator (while still launching under MPI_COMM_WORLD), and updates a few call sites to use the wrapper-provided communicator.
Changes:
- Extend the MPI wrapper to store an active communicator and expose it via `world_communicator()`.
- Add a new MPI initialization overload intended to create a subset communicator via `MPI_Comm_split`.
- Update some reductions/barriers in core code and tests to use the wrapper communicator.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/unit-tests/mesh/dim3/adjacency_graph.cpp | Switches collectives to use the wrapper communicator. |
| core/specfem/mpi/mpi.hpp | Introduces communicator storage/access + non-MPI shims and new initialization API. |
| core/specfem/mpi/mpi.cpp | Implements communicator initialization (including subset split) and frees custom communicator on finalize. |
| core/specfem/io/mesh/impl/fortran/dim2/mesh.cpp | Switches mesh-parameter reductions to the wrapper communicator. |
| core/specfem.cpp | Gates execution based on communicator participation and adds world barrier/reduce for exit status. |
core/specfem/mpi/mpi.hpp
Outdated
```cpp
namespace MPI {

#ifndef SPECFEM_ENABLE_MPI
enum MPI_Comm { WORLD_COMM = 0, NULL_COMM = -1 };
#endif
} // namespace MPI
} // namespace specfem

#ifndef SPECFEM_ENABLE_MPI
#define MPI_COMM_WORLD specfem::MPI::WORLD_COMM
#define MPI_COMM_NULL specfem::MPI::NULL_COMM
#endif

namespace specfem::MPI {
```
The new namespace specfem { namespace MPI { ... } } block (added for non-MPI builds) turns specfem::MPI into a namespace, but the rest of the codebase (and even this header’s usage examples/macros) calls specfem::MPI::get_rank()/check_context()/sync_all() as if specfem::MPI were a class. As written, this change will cause widespread compile errors (no free functions in specfem::MPI, and the class would become specfem::MPI::MPI). Consider keeping specfem::MPI as the existing class and introducing the non-MPI MPI_Comm/MPI_COMM_* shims in a non-conflicting scope (e.g., global using MPI_Comm = int; plus constexpr/#define values), or add explicit namespace-level forwarding functions so existing call sites continue to compile.
```diff
-namespace MPI {
-#ifndef SPECFEM_ENABLE_MPI
-enum MPI_Comm { WORLD_COMM = 0, NULL_COMM = -1 };
-#endif
-} // namespace MPI
-} // namespace specfem
-#ifndef SPECFEM_ENABLE_MPI
-#define MPI_COMM_WORLD specfem::MPI::WORLD_COMM
-#define MPI_COMM_NULL specfem::MPI::NULL_COMM
-#endif
-namespace specfem::MPI {
-} // namespace specfem
+#ifndef SPECFEM_ENABLE_MPI
+using MPI_Comm = int;
+constexpr MPI_Comm MPI_COMM_WORLD = 0;
+constexpr MPI_Comm MPI_COMM_NULL = -1;
+#endif
+namespace specfem {
```
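To see why the reviewer's suggestion resolves the compile errors, here is a standalone sketch (plain C++, no MPI headers; the wrapper class body is a hypothetical stand-in, not the real SPECFEM++ class): the `MPI_Comm` shim lives at global scope, so `specfem::MPI` stays a class and its member definition still finds the type.

```cpp
// Non-MPI shim at global scope, as the review suggests: MPI_Comm is a plain
// integer type, so it is visible both globally and inside namespace specfem.
#ifndef SPECFEM_ENABLE_MPI
using MPI_Comm = int;
constexpr MPI_Comm MPI_COMM_WORLD = 0;
constexpr MPI_Comm MPI_COMM_NULL = -1;
#endif

namespace specfem {
// Minimal stand-in for the existing specfem::MPI wrapper class (members are
// illustrative only, just enough to show the shimmed type resolves here).
class MPI {
public:
  static MPI_Comm world_communicator() { return comm_; }

private:
  static MPI_Comm comm_;
};
// Unqualified MPI_Comm resolves via the global shim, matching mpi.cpp usage.
MPI_Comm MPI::comm_ = MPI_COMM_WORLD;
} // namespace specfem
```

With this layout, call sites like `specfem::MPI::world_communicator()` keep compiling because `specfem::MPI` remains a class rather than becoming a namespace.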
```cpp
static MPI_Comm world_communicator() {
  check_context();
  return comm_;
}
```
world_communicator() returns the wrapper’s active communicator, which may be a subset communicator (not necessarily MPI_COMM_WORLD). The name is misleading and has already led to mixed usage of MPI_COMM_WORLD vs world_communicator() in this PR. Consider renaming to something like communicator() / simulation_communicator() to reflect semantics and avoid accidental use in collectives that must be world-wide.
```cpp
// Static storage for MPI rank and size (-1 indicates uninitialized)
int MPI::rank_ = -1;
int MPI::size_ = -1;
MPI_Comm MPI::comm_ = MPI_COMM_WORLD;
```
Non-MPI builds: MPI_Comm is introduced as specfem::MPI::MPI_Comm, but this TU defines MPI_Comm MPI::comm_ unqualified inside namespace specfem, so MPI_Comm will be unknown when SPECFEM_ENABLE_MPI is off. Either qualify the type consistently or move the non-MPI MPI_Comm shim into a scope that matches these uses (typically global or namespace specfem).
```diff
-MPI_Comm MPI::comm_ = MPI_COMM_WORLD;
+MPI::MPI_Comm MPI::comm_ = MPI_COMM_WORLD;
```
```cpp
  // Only use the first nnodes ranks in the new communicator
  int color = (world_rank < nnodes) ? 1 : MPI_UNDEFINED;
  MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm_);
}

MPI_Comm_size(comm_, &size_);
MPI_Comm_rank(comm_, &rank_);
```
initialize(argc, argv, nnodes): ranks excluded by MPI_Comm_split will get comm_ == MPI_COMM_NULL, but the code unconditionally calls MPI_Comm_size(comm_, …) / MPI_Comm_rank(comm_, …) afterwards. Calling these on MPI_COMM_NULL is invalid and will error/hang. Handle the excluded-rank case explicitly (e.g., if comm_ == MPI_COMM_NULL, set rank_/size_ to sentinel values and return without calling communicator collectives).
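The control flow the comment asks for can be sketched without MPI (plain C++; the sentinel constants stand in for `MPI_COMM_NULL`/`MPI_UNDEFINED`, and the struct is hypothetical): excluded ranks get the undefined color from the split and must skip the communicator queries entirely.

```cpp
// Stand-ins for MPI constants so the logic can be checked standalone.
constexpr int kCommNull = -1;  // models MPI_COMM_NULL
constexpr int kUndefined = -2; // models MPI_UNDEFINED

// Color passed to MPI_Comm_split: the first nnodes ranks join the subset;
// the rest get the undefined color and receive a null communicator back.
int subset_color(int world_rank, int nnodes) {
  return (world_rank < nnodes) ? 1 : kUndefined;
}

// Hypothetical wrapper state: -1 marks "uninitialized", as in the PR.
struct WrapperState {
  int rank = -1;
  int size = -1;
};

// After the split, excluded ranks must not query a null communicator:
// keep sentinel values and return early instead.
WrapperState after_split(int comm) {
  WrapperState s;
  if (comm == kCommNull) {
    return s; // excluded rank: skip MPI_Comm_size/MPI_Comm_rank entirely
  }
  // Real code would call MPI_Comm_size(comm, &s.size) and
  // MPI_Comm_rank(comm, &s.rank) here; placeholders for the sketch:
  s.rank = 0;
  s.size = 1;
  return s;
}
```

The early return is the key point: `MPI_Comm_size`/`MPI_Comm_rank` on `MPI_COMM_NULL` is erroneous, so the guard must come before those calls, not after.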
```cpp
// Free custom communicator if it was created
if (comm_ != MPI_COMM_WORLD) {
  MPI_Comm_free(&comm_);
```
finalize(): MPI_Comm_free must not be called on MPI_COMM_NULL. With subset initialization, excluded ranks may have comm_ == MPI_COMM_NULL, and comm_ != MPI_COMM_WORLD would still be true. Add an explicit comm_ != MPI_COMM_NULL guard (and consider resetting comm_ after freeing).
```diff
-// Free custom communicator if it was created
-if (comm_ != MPI_COMM_WORLD) {
-  MPI_Comm_free(&comm_);
+// Free custom communicator if it was created and is valid
+if (comm_ != MPI_COMM_WORLD && comm_ != MPI_COMM_NULL) {
+  MPI_Comm_free(&comm_);
+  comm_ = MPI_COMM_NULL;
```
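The suggested guard reduces to a simple predicate, sketched here with integer stand-ins for the MPI handles (illustrative names, not the real constants): a communicator is safe to free only if it is a custom one, i.e. neither the world nor the null communicator.

```cpp
// Stand-ins for MPI communicator handles (the real ones are opaque types).
constexpr int kWorldComm = 0; // models MPI_COMM_WORLD
constexpr int kNullComm = -1; // models MPI_COMM_NULL

// True only for a custom (split) communicator that MPI_Comm_free may accept:
// freeing MPI_COMM_WORLD is forbidden, and freeing MPI_COMM_NULL is an error.
bool should_free(int comm) {
  return comm != kWorldComm && comm != kNullComm;
}
```

Resetting `comm_` to the null handle after freeing, as the suggestion does, also makes `finalize()` idempotent if it is ever called twice.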
```cpp
// Only run simulation if current node is part of the world communicator
// (i.e. not excluded by user-defined nnodes)
// For simulations that use the entire world communicator, this check will
// pass for all ranks.
if (specfem::MPI::world_communicator() != MPI_COMM_NULL) {

  const YAML::Node parameter_dict = YAML::LoadFile(opts.parameters_file);
```
This new subset-communicator gating relies on specfem::MPI::world_communicator() becoming MPI_COMM_NULL for excluded ranks, but there is currently no call site in the repo that uses the new MPI::initialize(argc, argv, nnodes) overload (Context still calls the 2-arg initialize). As-is, the added branch is dead code and the PR doesn’t actually enable subset execution. Either wire the new overload into specfem::program::Context/CLI config or remove the gating until it is supported.
```cpp
// Ensure all ranks have reached this point, reduce the result across ranks
SPECFEM_MPI_SAFECALL(MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT, MPI_MAX,
                                0, MPI_COMM_WORLD));
```
MPI_Reduce(MPI_IN_PLACE, &result, …, 0, MPI_COMM_WORLD) is invalid on non-root ranks (MPI_IN_PLACE is only allowed on the root for MPI_Reduce). This can crash or produce undefined behavior. Use MPI_IN_PLACE only when the calling rank is root; non-root ranks should pass &result as the send buffer.
```diff
-// Ensure all ranks have reached this point, reduce the result across ranks
-SPECFEM_MPI_SAFECALL(MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT, MPI_MAX,
-                                0, MPI_COMM_WORLD));
+int world_rank = 0;
+SPECFEM_MPI_SAFECALL(MPI_Comm_rank(MPI_COMM_WORLD, &world_rank));
+// Ensure all ranks have reached this point, reduce the result across ranks
+if (world_rank == 0) {
+  SPECFEM_MPI_SAFECALL(
+      MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT, MPI_MAX, 0,
+                 MPI_COMM_WORLD));
+} else {
+  SPECFEM_MPI_SAFECALL(
+      MPI_Reduce(&result, nullptr, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD));
+}
```
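The buffer-selection rule behind this fix can be stated as a small pure function (a sketch with hypothetical names; the strings merely label which arguments each rank should pass): `MPI_IN_PLACE` is legal as the send buffer of `MPI_Reduce` only on the root, while non-root ranks send their local value and their receive buffer is ignored.

```cpp
#include <string>
#include <utility>

// Which (send, recv) buffers a rank should pass to MPI_Reduce, per the
// root-only MPI_IN_PLACE rule. Strings are illustrative labels only.
std::pair<std::string, std::string> reduce_buffers(int world_rank, int root) {
  if (world_rank == root) {
    // Root contributes its own value in place and receives the result.
    return {"MPI_IN_PLACE", "&result"};
  }
  // Non-root ranks send &result; MPI_Reduce ignores their recv buffer.
  return {"&result", "(ignored)"};
}
```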
```cpp
  const auto &expected = expected_adjacency_map.at(param_name);
  expected.check(adjacency_graph);
}
```
This test moved to another file
Description
Implements subset communicator execution within ``COMM_WORLD``.

Issue Number
If there is an issue created for these changes, link it here.

Checklist
Please make sure to check developer documentation on specfem docs.