Conversation

Force-pushed from 73b715e to 32e7d13
MarcelKoch left a comment
The overall structure is good. I have only minor things to add. One thing I definitely would like to see is the removal of the weighting stuff.
 * This is the transpose of the RowGatherer operation:
 * - RowGatherer does: y_local = R * x_distributed (gather remote values)
 * - RowScatterer does: x_distributed += R^T * y_local (scatter and accumulate)
nit: It probably makes sense to have the reverse of this in the RowGatherer doc.
 *
 * @return a unique_ptr to the created distributed::RowScatterer
 */
static std::unique_ptr<RowScatterer> create_from_gatherer(
Probably also want to have the inverse of this in RowGatherer.
[[nodiscard]] mpi::request apply_async(
    ptr_param<const LinOp> weights,
    ptr_param<const LinOp> local_values) const;
Since the weighting is just an extra step before sending the data, I would prefer not to have this overload and instead just require on the user side that they scale their input.
core/distributed/row_scatterer.cpp (Outdated)
auto send_size_in_bytes =
    sizeof(ValueType) * send_size[0] * send_size[1];
if (!send_workspace_.get_executor() ||
    !mpi_exec->memory_accessible(
        send_workspace_.get_executor())) {
    send_workspace_.set_executor(mpi_exec);
}
if (send_size_in_bytes > send_workspace_.get_size()) {
    send_workspace_.resize_and_reset(send_size_in_bytes);
}
auto send_buffer = matrix::Dense<ValueType>::create(
    mpi_exec, send_size,
    make_array_view(mpi_exec, send_size[0] * send_size[1],
                    reinterpret_cast<ValueType*>(
                        send_workspace_.get_data())),
    send_size[1]);
I think this might be replaceable with using GenericDenseCache instead of the array as workspace. Same for RowGatherer.
Why is the send workspace necessary in the first place? Can't we send from local_values directly (assuming GPU aware mpi)?
core/distributed/row_scatterer.cpp (Outdated)
dim<2> recv_size(coll_comm_->get_recv_size(), ncols);
auto recv_size_in_bytes =
    sizeof(ValueType) * recv_size[0] * recv_size[1];
if (!recv_workspace_.get_executor() ||
    !mpi_exec->memory_accessible(
        recv_workspace_.get_executor())) {
    recv_workspace_.set_executor(mpi_exec);
}
if (recv_size_in_bytes > recv_workspace_.get_size()) {
    recv_workspace_.resize_and_reset(recv_size_in_bytes);
}
Same regarding the GenericDenseCache.
core/distributed/row_scatterer.cpp (Outdated)
// Synchronize before MPI (GPU stream safety)
std::shared_ptr<const gko::detail::Event> ev = nullptr;
lv_local->get_executor()->run(event::make_record_event(ev));
ev->synchronize();
This seems unnecessary. I would not add the event until we split this up as we do for the row gatherer (although I think splitting the row scatter up doesn't make sense).
Suggested change:
- // Synchronize before MPI (GPU stream safety)
- std::shared_ptr<const gko::detail::Event> ev = nullptr;
- lv_local->get_executor()->run(event::make_record_event(ev));
- ev->synchronize();
+ exec->synchronize();
 * Must have the same number of columns as this matrix
 * and `scatter_indices->get_size()` rows.
 */
void scatter_add(const array<int32>* scatter_indices, const Dense* source);
should this also be row_scatter_add?
Probably not. In the long term row_gather should probably be changed instead.
 */
static std::unique_ptr<RowScatterer> create_from_gatherer(
    std::shared_ptr<const Executor> exec,
    const RowGatherer<LocalIndexType>& gatherer);
Suggested change:
- const RowGatherer<LocalIndexType>& gatherer);
+ ptr_param<const RowGatherer> gatherer);
Or bare pointer, but it shouldn't be a reference.
}

TYPED_TEST(RowScatterer, CanOverlapWorkWithScatter)
This test and the ones below are also in test/.... I would suggest removing the ones here.
omp/matrix/dense_kernels.cpp (Outdated)
#pragma omp critical
tgt_vals[target_row * tgt_stride + j] += val;
Maybe use the atomic_add instead?
Force-pushed from 51d146e to 368863c
This PR adds a RowScatterer class, similar to the RowGatherer class. It can be created from the RowGatherer, as the communication is essentially the inverse.

It also adds a weighted scattering, which scatters diag(weights) * local_values. To overlap communication and computation, the operation is split into two parts, apply_async and wait_and_accumulate: computation can be done after apply_async returns an mpi::request, and wait_and_accumulate then does a req.wait() internally before doing the accumulation on the receiver side.

It also adds a scatter_add operation + kernel for Dense. I kept it in this PR, but can also extract that into a separate PR. This functionality will be useful when building DD preconditioners (overlapping Schwarz, BDDC, etc.).