[RFC] Introduce Completion Counters verbs#1701
[RFC] Introduce Completion Counters verbs#1701mrgolin wants to merge 1 commit intolinux-rdma:masterfrom
Conversation
083d6ab to
189d11c
Compare
|
@jgunthorpe @rleon Any comments on this? I'd like to start pushing the implementation. |
|
I've been feeling a high reluctance lately to introduce new verbs that rely on a single HW implementations.. That has not gone well in the past. Consider using dv, but also consider the questions below if it should be a new verb: Can any existing HW implement this, or just one? |
|
As a general comment, completion counters were introduced into libfabric v1.0 (~2016), and the counter APIs are used by MPI. I think only the CXI provider has HW support, however. UEC defines mandatory support for counters as part of the AI Full and HPC profiles (UEC 2.2.5.3.7). The UET-verbs PR (which is very much out of date) only targeted the AI Base profile, so counters were excluded. I expect once UET-verbs expands to cover AI Full and HPC profiles, UEC will ask for completion counters as well. I will review the actual submission here, but my expectation is that the proposal is aligned with a libfabric implementation, which would also align with UEC. |
shefty
left a comment
There was a problem hiding this comment.
Comments are based on comparing to UEC spec requirements, libfabric API, and looking at the CXI implementation, which is the only other HW-based counter implementation I'm aware of.
| number of completed operations. This makes them well suited for applications | ||
| that need to know how many requests have completed without requiring | ||
| per-request details, such as credit based flow control or tracking responses | ||
| from remote peers. |
There was a problem hiding this comment.
libfabric and UEC call out 2 separate types of counters. One counts completed data transfer operations. The other counts the number of application bytes carried by a data transfer. For example, the number of bytes written into a RDMA write target buffer.
There was a problem hiding this comment.
Thanks for your feedback, replied in another comment below.
| *cc_attr*. On success, the returned **ibv_comp_cntr** structure contains | ||
| pointers to the completion and error count values. The maximum number of | ||
| completion counters a device supports is reported by the *max_comp_cntr* | ||
| field of **ibv_device_attr_ex**. |
There was a problem hiding this comment.
Returning direct pointers to the memory locations where the counters are stored is forcing a specific implementation. Counters could be local to the NIC and require specific commands to read. It is more generic and aligns with libfabric to provide function calls to read/write the counters.
From what I can tell looking at the CXI libfabric implementation, this is how counters work on CXI. Counters are a HW object. Reading a counter submits a command that copies the value back to host memory. A counter may also be associated with a threshold value. Once the counter reaches that value, it will also update host memory with the value.
There was a problem hiding this comment.
One of the use cases we want this API to support is direct polling of the counters from GPUs, what requires having counter memory addresses. In this case I actually expect counters to be created with external memory located in HBM so exposing addresses might be redundant.
I think we can go with your suggestion, add ibv_cntr_read verb and pave the path to kernel, then each provider can decide to optimize it by simple memory read when possible.
There was a problem hiding this comment.
One of the use cases we want this API to support is direct polling of the counters from GPUs, what requires having counter memory addresses.
Isn't this handled through dmabuf mechanism?
There was a problem hiding this comment.
Yes, right. In the scenario I described the buffer is allocated in GPU memory, exported as dmabuf which is then passed in when creating counters. GPU will directly read counter values from its memory.
For this case exposing counter VA isn't needed and it might not be available.
| in the returned **ibv_comp_cntr** will point to it. Using external memory | ||
| allows the counter values to reside in application-managed buffers or in | ||
| memory exported through DMA-BUF, enabling zero-copy observation of completion | ||
| progress by co-located processes or devices. |
There was a problem hiding this comment.
This deviates from libfabric and isn't called out by UEC. Letting the app specify where the counter should reside is forcing a very specific HW implementation. I agree with Jason here and would definitely want to have multiple NIC implementations available before adding this.
Can we simplify the API to meet more minimal requirements?
AFAICT, CXI can support an option such as this, but the buffer would be a write-back buffer, not the actual counter. This suggests that a call such as ibv_read_cntr(cntr, &value, flags) might be an option to direct the counter value to a specific memory location, versus the libfabric method of returning the value directly from the function.
There was a problem hiding this comment.
I added this to common verbs because I believe completion counters in GPU memory will be one of the main use cases for verbs completion counters. Device implementations can vary, but one of the options would be maintaining an internal counter and syncing it back to the user-supplied buffer, as you suggested CXI is doing. That said, I understand your concern that this may be difficult or impossible to implement on some devices that otherwise support the rest of the API.
We can either make IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM support optional, or move the capability to create counters with external memory to direct verbs. @jgunthorpe what's your opinion on this?
| uint64_t *comp_count; | ||
| uint64_t *err_count; | ||
| uint64_t comp_count_max_value; | ||
| uint64_t err_count_max_value; |
There was a problem hiding this comment.
See prior comments about exposing 'raw counters'.
| uint32_t flags; | ||
| struct ibv_memory_location comp_cntr_ext_mem; | ||
| struct ibv_memory_location err_cntr_ext_mem; | ||
| }; |
There was a problem hiding this comment.
libfabric attributes include an enum for events to count (transfers or bytes). UEC lists support for both as mandatory. Adding an enum here, even if only transfers are included to start with, would make this more extensible.
libfabric also supports blocking counter read calls. It's possible this might align with a verbs ibv_comp_channel (or ibv_cntr_channel?) object.
I don't see any statement in UEC regarding blocking counters (or CQs). However, since libfabric counters include blocking support, they may simply not have thought it important to call that support out separately. (I don't think MPI uses blocking counters.)
CXI implements blocking reads by setting a triggered threshold. HW writes the counter value to memory once it hits the trigger. SW in this case spins until the counter is triggered.
At least based on UEC 1.0, I would defer adding wait or trigger APIs until there's a larger discussion.
There was a problem hiding this comment.
Regarding events vs. bytes I think it's already easy to extend by adding another flag like IBV_COMP_CNTR_INIT_COUNT_BYTES.
For your comment about blocking counters, I would try keep comp_cntr and comp_channel separate in rdma-core. Upper layers (like Libfabric) can combine these constructs to implement more advanced behaviors. Anyway I think we would be able to add this support later if we see the need.
| uint32_t comp_mask; | ||
| uint32_t op_mask; | ||
| }; | ||
| ``` |
There was a problem hiding this comment.
This behavior seems aligned with libfabric, so it should work with UEC.
It depends on device implementation, EFA is going to support this in existing devices.
As @shefty mentioned Libfabric has a similar interface and UE is going to support it as well.
I think the bare minimum needed is the ability to map QP + opcode to a counter during completion handling and increment this counter. With @shefty's suggestion to add ibv_cntr_read, read can be implemented using an admin command. |
Extend verbs interface to support Completion Counters that can be seen as a light-weight alternative to polling CQ. A completion counter object separately counts successful and error completions, can be attached to multiple QPs and be configured to count completions of a subset of operation types. This is especially useful for batch or credit based workloads running on accelerators but can serve many other types of applications as well. Expose supported number of completion counters through query device extended verb. Reviewed-by: Yonatan Nachum <ynachum@amazon.com> Signed-off-by: Michael Margolin <mrgolin@amazon.com>
|
Updated according to suggestions above. |
Extend verbs interface to support Completion Counters that can be seen as a light-weight alternative to polling CQ. A completion counter object separately counts successful and error completions, can be attached to multiple QPs and be configured to count completions of a subset of operation types. This is especially useful for batch or credit based workloads running on accelerators but can serve many other types of applications as well.
This PR intended for discussion over the interface extension suggestion and once agreed following ioctl and kernel changes will be submitted, including first provider and driver implementations.
Thanks.