Issue - 1711 - Subset MPI communicator #1769

Open
Rohit-Kakodkar wants to merge 5 commits into issue-1711-MPI-tests from issue-1711-sub-communicators

Conversation

@Rohit-Kakodkar
Collaborator

Description

Implements subset communicator execution within COMM_WORLD

  • Adds a new MPI constructor
  • Creates a subset communicator and sets it so that simulation communication happens in this communicator
  • Updates some call sites to use the wrapper-provided communicator.

Issue Number

If there is an issue created for these changes, link it here

Checklist

Please make sure to check developer documentation on specfem docs.

  • I ran the code through pre-commit to check style
  • The documentation builds without warnings/errors
  • I have added labels to the PR (see right hand side of the PR page)
  • My code passes all the integration tests
  • I have added sufficient unittests to test my changes
  • I have added/updated documentation for the changes I am proposing
  • I have updated CMakeLists to ensure my code builds
  • My code builds across all platforms

Contributor

Copilot AI left a comment


Pull request overview

Adds support for running SPECFEM++ using a subset MPI communicator (while still launching under MPI_COMM_WORLD), and updates a few call sites to use the wrapper-provided communicator.

Changes:

  • Extend the MPI wrapper to store an active communicator and expose it via world_communicator().
  • Add a new MPI initialization overload intended to create a subset communicator via MPI_Comm_split.
  • Update some reductions/barriers in core code and tests to use the wrapper communicator.
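The subset split described above can be sketched as follows. This is an illustrative helper, not repo code: `split_color` and the local `UNDEFINED` constant (standing in for `MPI_UNDEFINED` from `<mpi.h>`) are assumptions so the snippet compiles without an MPI runtime.

```cpp
#include <cassert>

// Stand-in for MPI_UNDEFINED so the logic can be shown without an MPI
// runtime; the real code passes MPI_UNDEFINED from <mpi.h>.
constexpr int UNDEFINED = -32766;

// Color assignment used with MPI_Comm_split: ranks below nnodes join the
// subset communicator (color 1); all other ranks pass MPI_UNDEFINED and
// receive MPI_COMM_NULL from the split.
int split_color(int world_rank, int nnodes) {
  return (world_rank < nnodes) ? 1 : UNDEFINED;
}
```

In the wrapper this color would be passed to `MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm_)`, so excluded ranks come back holding `MPI_COMM_NULL`.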

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Summary per file:
  • tests/unit-tests/mesh/dim3/adjacency_graph.cpp: Switches collectives to use the wrapper communicator.
  • core/specfem/mpi/mpi.hpp: Introduces communicator storage/access, non-MPI shims, and the new initialization API.
  • core/specfem/mpi/mpi.cpp: Implements communicator initialization (including the subset split) and frees the custom communicator on finalize.
  • core/specfem/io/mesh/impl/fortran/dim2/mesh.cpp: Switches mesh-parameter reductions to the wrapper communicator.
  • core/specfem.cpp: Gates execution based on communicator participation and adds a world barrier/reduce for the exit status.


Comment on lines +25 to +39
namespace MPI {

#ifndef SPECFEM_ENABLE_MPI
enum MPI_Comm { WORLD_COMM = 0, NULL_COMM = -1 };
#endif
} // namespace MPI
} // namespace specfem

#ifndef SPECFEM_ENABLE_MPI
#define MPI_COMM_WORLD specfem::MPI::WORLD_COMM
#define MPI_COMM_NULL specfem::MPI::NULL_COMM
#endif

namespace specfem::MPI {


Copilot AI Apr 1, 2026


The new namespace specfem { namespace MPI { ... } } block (added for non-MPI builds) turns specfem::MPI into a namespace, but the rest of the codebase (and even this header’s usage examples/macros) calls specfem::MPI::get_rank()/check_context()/sync_all() as if specfem::MPI were a class. As written, this change will cause widespread compile errors (no free functions in specfem::MPI, and the class would become specfem::MPI::MPI). Consider keeping specfem::MPI as the existing class and introducing the non-MPI MPI_Comm/MPI_COMM_* shims in a non-conflicting scope (e.g., global using MPI_Comm = int; plus constexpr/#define values), or add explicit namespace-level forwarding functions so existing call sites continue to compile.

Suggested change
- namespace MPI {
- #ifndef SPECFEM_ENABLE_MPI
- enum MPI_Comm { WORLD_COMM = 0, NULL_COMM = -1 };
- #endif
- } // namespace MPI
- } // namespace specfem
- #ifndef SPECFEM_ENABLE_MPI
- #define MPI_COMM_WORLD specfem::MPI::WORLD_COMM
- #define MPI_COMM_NULL specfem::MPI::NULL_COMM
- #endif
- namespace specfem::MPI {
+ } // namespace specfem
+ #ifndef SPECFEM_ENABLE_MPI
+ using MPI_Comm = int;
+ constexpr MPI_Comm MPI_COMM_WORLD = 0;
+ constexpr MPI_Comm MPI_COMM_NULL = -1;
+ #endif
+ namespace specfem {
Comment on lines +113 to +116
static MPI_Comm world_communicator() {
check_context();
return comm_;
}

Copilot AI Apr 1, 2026


world_communicator() returns the wrapper’s active communicator, which may be a subset communicator (not necessarily MPI_COMM_WORLD). The name is misleading and has already led to mixed usage of MPI_COMM_WORLD vs world_communicator() in this PR. Consider renaming to something like communicator() / simulation_communicator() to reflect semantics and avoid accidental use in collectives that must be world-wide.

// Static storage for MPI rank and size (-1 indicates uninitialized)
int MPI::rank_ = -1;
int MPI::size_ = -1;
MPI_Comm MPI::comm_ = MPI_COMM_WORLD;

Copilot AI Apr 1, 2026


Non-MPI builds: MPI_Comm is introduced as specfem::MPI::MPI_Comm, but this TU defines MPI_Comm MPI::comm_ unqualified inside namespace specfem, so MPI_Comm will be unknown when SPECFEM_ENABLE_MPI is off. Either qualify the type consistently or move the non-MPI MPI_Comm shim into a scope that matches these uses (typically global or namespace specfem).

Suggested change
- MPI_Comm MPI::comm_ = MPI_COMM_WORLD;
+ MPI::MPI_Comm MPI::comm_ = MPI_COMM_WORLD;

Comment on lines +52 to +58
// Only use the first nnodes ranks in the new communicator
int color = (world_rank < nnodes) ? 1 : MPI_UNDEFINED;
MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &comm_);
}

MPI_Comm_size(comm_, &size_);
MPI_Comm_rank(comm_, &rank_);

Copilot AI Apr 1, 2026


initialize(argc, argv, nnodes): ranks excluded by MPI_Comm_split will get comm_ == MPI_COMM_NULL, but the code unconditionally calls MPI_Comm_size(comm_, …) / MPI_Comm_rank(comm_, …) afterwards. Calling these on MPI_COMM_NULL is invalid and will error/hang. Handle the excluded-rank case explicitly (e.g., if comm_ == MPI_COMM_NULL, set rank_/size_ to sentinel values and return without calling communicator collectives).
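A minimal sketch of the guard this comment asks for. `Comm`, `COMM_NULL`, `MpiState`, and `finish_initialize` are stand-ins (not repo code) for `MPI_Comm`, `MPI_COMM_NULL`, and the wrapper's static state, so the control flow compiles without an MPI runtime.

```cpp
#include <cassert>

// Stand-ins for MPI_Comm and MPI_COMM_NULL from <mpi.h>.
using Comm = int;
constexpr Comm COMM_NULL = -1;

struct MpiState {
  int rank = -1; // -1 indicates "not a participant"
  int size = -1;
};

// Returns false for ranks excluded by MPI_Comm_split: they keep sentinel
// rank/size values and never call collectives on the null communicator.
bool finish_initialize(Comm comm, MpiState &state) {
  if (comm == COMM_NULL) {
    state.rank = -1;
    state.size = 0;
    return false;
  }
  // Real code would call MPI_Comm_size(comm, &state.size) and
  // MPI_Comm_rank(comm, &state.rank) here; placeholder values keep
  // the sketch self-contained.
  state.size = 1;
  state.rank = 0;
  return true;
}
```

The key design point is that the excluded-rank branch returns before any communicator query, which is what prevents the error/hang described above.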

Comment on lines +74 to +76
// Free custom communicator if it was created
if (comm_ != MPI_COMM_WORLD) {
MPI_Comm_free(&comm_);

Copilot AI Apr 1, 2026


finalize(): MPI_Comm_free must not be called on MPI_COMM_NULL. With subset initialization, excluded ranks may have comm_ == MPI_COMM_NULL, and comm_ != MPI_COMM_WORLD would still be true. Add an explicit comm_ != MPI_COMM_NULL guard (and consider resetting comm_ after freeing).

Suggested change
- // Free custom communicator if it was created
- if (comm_ != MPI_COMM_WORLD) {
-   MPI_Comm_free(&comm_);
+ // Free custom communicator if it was created and is valid
+ if (comm_ != MPI_COMM_WORLD && comm_ != MPI_COMM_NULL) {
+   MPI_Comm_free(&comm_);
+   comm_ = MPI_COMM_NULL;
Comment on lines +64 to +71
// Only run simulation if current node is part of the world communicator
// (i.e. not excluded by user-defined nnodes)
// For simulations that use the entire world communicator, this check will
// pass for all ranks.
if (specfem::MPI::world_communicator() != MPI_COMM_NULL) {

const YAML::Node parameter_dict = YAML::LoadFile(opts.parameters_file);


Copilot AI Apr 1, 2026


This new subset-communicator gating relies on specfem::MPI::world_communicator() becoming MPI_COMM_NULL for excluded ranks, but there is currently no call site in the repo that uses the new MPI::initialize(argc, argv, nnodes) overload (Context still calls the 2-arg initialize). As-is, the added branch is dead code and the PR doesn’t actually enable subset execution. Either wire the new overload into specfem::program::Context/CLI config or remove the gating until it is supported.
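One way the overload could be wired in, sketched under stated assumptions: nothing in the PR does this yet, and `choose_initialize` and `nnodes_option` are hypothetical names standing in for whatever Context/CLI plumbing the project adopts.

```cpp
#include <cassert>
#include <string>

// Hypothetical dispatch: a non-positive nnodes option means "use every
// rank", i.e. the existing 2-arg initialize; a positive value selects
// the new subset overload. The returned strings merely name the call
// that would be made.
std::string choose_initialize(int nnodes_option) {
  if (nnodes_option > 0) {
    return "initialize(argc, argv, nnodes)";
  }
  return "initialize(argc, argv)";
}
```

With dispatch like this in `Context` (or wherever startup lives), the `world_communicator() != MPI_COMM_NULL` gate above would no longer be dead code.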

Comment on lines +111 to +113
// Ensure all ranks have reached this point, reduce the result across ranks
SPECFEM_MPI_SAFECALL(MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT, MPI_MAX,
0, MPI_COMM_WORLD));

Copilot AI Apr 1, 2026


MPI_Reduce(MPI_IN_PLACE, &result, …, 0, MPI_COMM_WORLD) is invalid on non-root ranks (MPI_IN_PLACE is only allowed on the root for MPI_Reduce). This can crash or produce undefined behavior. Use MPI_IN_PLACE only when the calling rank is root; non-root ranks should pass &result as the send buffer.

Suggested change
- // Ensure all ranks have reached this point, reduce the result across ranks
- SPECFEM_MPI_SAFECALL(MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT, MPI_MAX,
-                                 0, MPI_COMM_WORLD));
+ int world_rank = 0;
+ SPECFEM_MPI_SAFECALL(MPI_Comm_rank(MPI_COMM_WORLD, &world_rank));
+ // Ensure all ranks have reached this point, reduce the result across ranks
+ if (world_rank == 0) {
+   SPECFEM_MPI_SAFECALL(MPI_Reduce(MPI_IN_PLACE, &result, 1, MPI_INT,
+                                   MPI_MAX, 0, MPI_COMM_WORLD));
+ } else {
+   SPECFEM_MPI_SAFECALL(
+       MPI_Reduce(&result, nullptr, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD));
+ }

const auto &expected = expected_adjacency_map.at(param_name);
expected.check(adjacency_graph);
}

Collaborator Author


This test was moved to another file.

