[RFC] Introduce Completion Counters verbs#1701

Open
mrgolin wants to merge 1 commit into linux-rdma:master from amzn:comp-counters

Conversation

@mrgolin
Contributor

@mrgolin mrgolin commented Feb 16, 2026

Extend the verbs interface to support Completion Counters, which can be seen as a lightweight alternative to polling a CQ. A completion counter object counts successful and error completions separately, can be attached to multiple QPs, and can be configured to count completions for a subset of operation types. This is especially useful for batch- or credit-based workloads running on accelerators, but it can serve many other types of applications as well.

This PR is intended for discussion of the suggested interface extension; once it is agreed upon, the corresponding ioctl and kernel changes will be submitted, including the first provider and driver implementations.

Thanks.

@mrgolin
Contributor Author

mrgolin commented Mar 3, 2026

@jgunthorpe @rleon Any comments on this? I'd like to start pushing the implementation.

@jgunthorpe
Member

I've been quite reluctant lately to introduce new verbs that rely on a single HW implementation. That has not gone well in the past. Consider using dv, but if this should be a new verb, also consider the questions below:

Can any existing HW implement this, or just one?
Are there any prior-art standards that already implement this idea, like UE or something?
What does hardware actually have to do to implement it?

@shefty
Contributor

shefty commented Mar 13, 2026

As a general comment, completion counters were introduced into libfabric v1.0 (~2016), and the counter APIs are used by MPI. I think only the CXI provider has HW support, however.

UEC defines mandatory support for counters as part of the AI Full and HPC profiles (UEC 2.2.5.3.7). The UET-verbs PR (which is very much out of date) only targeted the AI Base profile, so counters were excluded. I expect once UET-verbs expands to cover AI Full and HPC profiles, UEC will ask for completion counters as well.

I will review the actual submission here, but my expectation is that the proposal is aligned with a libfabric implementation, which would also align with UEC.

@shijin-aws
Contributor

shijin-aws commented Mar 13, 2026

@shefty Yes, the proposal aims to be aligned with libfabric semantics, following discussion between @mrgolin and us libfabric folks at AWS.

Contributor

@shefty shefty left a comment

Comments are based on comparing to UEC spec requirements, libfabric API, and looking at the CXI implementation, which is the only other HW-based counter implementation I'm aware of.

number of completed operations. This makes them well suited for applications
that need to know how many requests have completed without requiring
per-request details, such as credit based flow control or tracking responses
from remote peers.
Contributor

libfabric and UEC call out 2 separate types of counters. One counts completed data transfer operations. The other counts the number of application bytes carried by a data transfer, for example the number of bytes written into an RDMA write target buffer.

Contributor Author

Thanks for your feedback, replied in another comment below.

*cc_attr*. On success, the returned **ibv_comp_cntr** structure contains
pointers to the completion and error count values. The maximum number of
completion counters a device supports is reported by the *max_comp_cntr*
field of **ibv_device_attr_ex**.
Contributor

Returning direct pointers to the memory locations where the counters are stored is forcing a specific implementation. Counters could be local to the NIC and require specific commands to read. It is more generic and aligns with libfabric to provide function calls to read/write the counters.

From what I can tell looking at the CXI libfabric implementation, this is how counters work on CXI. Counters are a HW object. Reading a counter submits a command that copies the value back to host memory. A counter may also be associated with a threshold value. Once the counter reaches that value, it will also update host memory with the value.

Contributor Author

One of the use cases we want this API to support is direct polling of the counters from GPUs, which requires having counter memory addresses. In that case I actually expect counters to be created with external memory located in HBM, so exposing addresses might be redundant.
I think we can go with your suggestion: add an ibv_cntr_read verb and pave the path to the kernel; each provider can then decide to optimize it with a simple memory read when possible.

Member

One of the use cases we want this API to support is direct polling of the counters from GPUs, which requires having counter memory addresses.

Isn't this handled through the dmabuf mechanism?

Contributor Author

Yes, right. In the scenario I described, the buffer is allocated in GPU memory and exported as a dmabuf, which is then passed in when creating counters. The GPU will directly read counter values from its own memory.
In this case exposing the counter VA isn't needed, and it might not be available.

in the returned **ibv_comp_cntr** will point to it. Using external memory
allows the counter values to reside in application-managed buffers or in
memory exported through DMA-BUF, enabling zero-copy observation of completion
progress by co-located processes or devices.
Contributor

This deviates from libfabric and isn't called out by UEC. Letting the app specify where the counter should reside is forcing a very specific HW implementation. I agree with Jason here and would definitely want to have multiple NIC implementations available before adding this.

Can we simplify the API to meet more minimal requirements?

AFAICT, CXI can support an option such as this, but the buffer would be a write-back buffer, not the actual counter. This suggests that a call such as ibv_read_cntr(cntr, &value, flags) might be an option to direct the counter value to a specific memory location, versus the libfabric method of returning the value directly from the function.

Contributor Author

I added this to common verbs because I believe completion counters in GPU memory will be one of the main use cases for verbs completion counters. Device implementations can vary, but one of the options would be maintaining an internal counter and syncing it back to the user-supplied buffer, as you suggested CXI is doing. That said, I understand your concern that this may be difficult or impossible to implement on some devices that otherwise support the rest of the API.

We can either make IBV_COMP_CNTR_INIT_WITH_EXTERNAL_MEM support optional, or move the capability to create counters with external memory to direct verbs. @jgunthorpe what's your opinion on this?

uint64_t *comp_count;
uint64_t *err_count;
uint64_t comp_count_max_value;
uint64_t err_count_max_value;
Contributor

See prior comments about exposing 'raw counters'.

uint32_t flags;
struct ibv_memory_location comp_cntr_ext_mem;
struct ibv_memory_location err_cntr_ext_mem;
};
Contributor

libfabric attributes include an enum for events to count (transfers or bytes). UEC lists support for both as mandatory. Adding an enum here, even if only transfers are included to start with, would make this more extensible.

libfabric also supports blocking counter read calls. It's possible this might align with a verbs ibv_comp_channel (or ibv_cntr_channel?) object.

I don't see any statement in UEC regarding blocking counters (or CQs). However, since libfabric counters include blocking support, they may simply not have thought it important to call that support out separately. (I don't think MPI uses blocking counters.)

CXI implements blocking reads by setting a triggered threshold. HW writes the counter value to memory once it hits the trigger. SW in this case spins until the counter is triggered.

At least based on UEC 1.0, I would defer adding wait or trigger APIs until there's a larger discussion.

Contributor Author

Regarding events vs. bytes, I think it's already easy to extend this by adding another flag, like IBV_COMP_CNTR_INIT_COUNT_BYTES.

For your comment about blocking counters, I would try to keep comp_cntr and comp_channel separate in rdma-core. Upper layers (like libfabric) can combine these constructs to implement more advanced behaviors. In any case, I think we can add this support later if we see the need.

uint32_t comp_mask;
uint32_t op_mask;
};
```
Contributor

This behavior seems aligned with libfabric, so it should work with UEC.

@mrgolin
Contributor Author

mrgolin commented Mar 18, 2026

@jgunthorpe

Can any existing HW implement this, or just one?

It depends on the device implementation; EFA is going to support this in existing devices.

Are there any prior-art standards that already implement this idea, like UE or something?

As @shefty mentioned, libfabric has a similar interface, and UE is going to support it as well.

What does hardware actually have to do to implement it?

I think the bare minimum needed is the ability to map QP + opcode to a counter during completion handling and increment this counter. With @shefty's suggestion to add ibv_cntr_read, read can be implemented using an admin command.

Extend the verbs interface to support Completion Counters, which can
be seen as a lightweight alternative to polling a CQ. A completion
counter object counts successful and error completions separately,
can be attached to multiple QPs, and can be configured to count
completions for a subset of operation types. This is especially
useful for batch- or credit-based workloads running on accelerators,
but it can serve many other types of applications as well.

Expose supported number of completion counters through query device
extended verb.

Reviewed-by: Yonatan Nachum <ynachum@amazon.com>
Signed-off-by: Michael Margolin <mrgolin@amazon.com>
@mrgolin
Contributor Author

mrgolin commented Mar 22, 2026

Updated according to the suggestions above.
Looks like there is some issue with the pipeline.
