-
Notifications
You must be signed in to change notification settings - Fork 7.2k
Fix crash in ReferenceCounter::CleanupBorrowersOnRefRemoved due to race between message_published_callback and publisher_failed_callback on WORKER_REF_REMOVED_CHANNEL #60495
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
… callbacks Signed-off-by: Mao Yancan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request effectively addresses a critical race condition in ReferenceCounter::CleanupBorrowersOnRefRemoved by making the cleanup process idempotent. The introduction of early exit conditions based on whether the object or borrower has already been cleaned up prevents RAY_CHECK failures and improves the robustness of the reference counting mechanism. This is a solid fix for a potential crash scenario.
| // Erase the previous borrower. | ||
| auto it = object_id_refs_.find(object_id); | ||
| RAY_CHECK(it != object_id_refs_.end()) << object_id; | ||
| RAY_CHECK(it->second.mutable_borrow()->borrowers.erase(borrower_addr)); | ||
| DeleteReferenceInternal(it, nullptr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment "Erase the previous borrower." on line 1245 is now slightly misleading. The actual erasure of the borrower from the borrowers set happens earlier on line 1235. Line 1246 DeleteReferenceInternal(it, nullptr); is responsible for potentially deleting the entire Reference object if its overall reference count reaches zero, not just erasing the borrower. Consider updating the comment for better clarity.
| // Erase the previous borrower. | |
| auto it = object_id_refs_.find(object_id); | |
| RAY_CHECK(it != object_id_refs_.end()) << object_id; | |
| RAY_CHECK(it->second.mutable_borrow()->borrowers.erase(borrower_addr)); | |
| DeleteReferenceInternal(it, nullptr); | |
| // Delete the reference if its overall count reaches zero. | |
| DeleteReferenceInternal(it, nullptr); |
|
@dayshah @ZacAttack PTAL |
Signed-off-by: Mao Yancan <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Signed-off-by: Mao Yancan <[email protected]>
dayshah
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fix! Seems like this was pretty broken. The idea to make it idempotent sounds right to me but there's some possible side-effects that i'm concerned about.
Can you also write a unit test for the core worker or reference counter that tests this scenario
| return; | ||
| } | ||
|
|
||
| MergeRemoteBorrowers(object_id, borrower_addr, new_borrower_refs); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a possible behavior change here
Before we would MergeRemoteBorrowers even if the ref had been deleted from object_id_refs_ as long as it's in borrowed_refs_. And then MergeRemoteBorrowers would add it into object_id_refs_ - now that doesn't happen.
I think if the borrower death notification comes first, followed by the actual message from the borrower (pretty unlikely but still possible) we could end up in an unexpected state.
| it = object_id_refs_.find(object_id); | ||
| if (it == object_id_refs_.end()) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's only possible for MergeRemoteBorrowers to add to object_id_refs_ but it can't ever remove the obj id from it.
Signed-off-by: Mao Yancan <[email protected]>
|
Thanks @dayshah for the review! I agree with your points and have moved the object_id_refs_ check after MergeRemoteBorrowers. To avoid dangling objects, I also removed the early return after erasing the borrower (it->second.mutable_borrow()->borrowers.erase(borrower_addr)). A unit test has been added to cover this race condition. |
Description
Summary
When a borrower process releases an object reference (ref count reaches 0) and then dies immediately, the owner may crash with a RAY_CHECK failure inside ReferenceCounter::CleanupBorrowersOnRefRemoved() .
This is caused by a race in the WORKER_REF_REMOVED_CHANNEL pub/sub path: both message_published_callback (published ref-removed reply) and publisher_failed_callback (long-polling connection failure) can be posted to the subscriber IO service, and both may execute, leading to duplicate cleanup.
Error logs:
Analysis
The owner subscribes in ReferenceCounter::WaitForRefRemoved() and provides two callbacks:
If the borrower publishes the ref-removed message but dies before the subscriber IO service executes message_published_callback , the subscriber can also enqueue publisher_failed_callback . Both callbacks may run, causing the owner to attempt borrower cleanup twice:
Solution
Make the cleanup path idempotent / deduplicated:
In ReferenceCounter::CleanupBorrowersOnRefRemoved :
Related issues
Fixes #60494
Additional information