[Directory] Strengthen single activation guarantees even further.#9958
[Directory] Strengthen single activation guarantees even further.#9958ledjon-behluli wants to merge 11 commits intodotnet:mainfrom
Conversation
src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs
Outdated
Show resolved
Hide resolved
|
@ReubenBond addressed the feedback |
src/Orleans.Core.Abstractions/Exceptions/DirectoryLeaseHoldException.cs
Outdated
Show resolved
Hide resolved
…eption.cs Co-authored-by: Reuben Bond <203839+ReubenBond@users.noreply.github.com>
2d3d58a to
1840151
Compare
There was a problem hiding this comment.
Pull request overview
This PR strengthens the strongly-consistent distributed grain directory’s single-activation guarantees by introducing “safety lease holds” after ungraceful silo failures and during recovery scenarios, delaying re-registrations to reduce split-brain risk.
Changes:
- Add silo-level and range-level lease hold tracking to directory partitions, including periodic cleanup of expired leases.
- Enforce lease holds during
RegisterAsyncby rejecting registrations with a newDirectoryLeaseHoldException. - Add configuration (
SafetyLeaseHoldDuration) and a new test validating reactivation blocking after an ungraceful shutdown.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| test/Orleans.GrainDirectory.Tests/Orleans.GrainDirectory.Tests.csproj | Adds Microsoft.Extensions.TimeProvider.Testing package for FakeTimeProvider in tests. |
| test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs | New test exercising lease-hold behavior after KillSiloAsync. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.cs | Tracks/creates lease holds, applies holds on silo death, and adds cleanup logic + logging. |
| src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.Interface.cs | Enforces lease holds on registration by throwing DirectoryLeaseHoldException. |
| src/Orleans.Runtime/GrainDirectory/DistributedGrainDirectory.cs | Computes effective lease duration, wires TimeProvider, and runs periodic lease cleanup requests. |
| src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs | Adds SafetyLeaseHoldDuration option and documentation; switches to file-scoped namespace. |
| src/Orleans.Core.Abstractions/Exceptions/DirectoryLeaseHoldException.cs | Introduces a new exception type for lease-hold rejections. |
Comments suppressed due to low confidence (2)
src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs:88
- Typo in XML doc comment: "clcks skues" should be "clock skews".
/// Conditional deregistration is used for lazy clean-up of activations whose prompt deregistration failed for some reason (e.g., message failure).
/// This should always be at least one minute, since we compare the times on the directory partition, so message delays and clcks skues have
/// to be allowed.
test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs:65
- Grammar in comment: "its the only one alive" should be "it's the only one alive".
// Time has expired, we can place it now, and it will be the primary as its the only one alive.
Assert.Equal(await leaseGrain.GetAddress(), primary.SiloAddress);
test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs
Show resolved
Hide resolved
src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs
Outdated
Show resolved
Hide resolved
src/Orleans.Core.Abstractions/Exceptions/DirectoryLeaseHoldException.cs
Outdated
Show resolved
Hide resolved
ReubenBond
left a comment
There was a problem hiding this comment.
We should add some more tests. I'll ask copilot to add some more coverage.
…lookup, and multi-grain scenarios Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a safety lease hold is active (e.g., after an ungraceful silo crash or during cluster startup with non-contiguous view changes), the partition returns DirectoryResult.RetryAfter(remainingDuration) instead of throwing. InvokeAsync checks RetryAfterDelay and waits the suggested duration before retrying, keeping the retry inside the directory layer. Previously, throwing DirectoryLeaseHoldException would propagate to the activation layer, which deactivates with DirectoryFailure and forwards the message to the same silo. Because DirectoryFailure is treated as a transient error, the message ForwardCount is decremented on each cycle, preventing the hop limit from being reached and creating an infinite zero-delay forwarding loop that hangs the system. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return RetryAfter from RegisterAsync instead of throwing, so InvokeAsync waits the exact remaining lease duration using the injected TimeProvider. This removes the DirectoryLeaseHoldException class entirely. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
@ledjon-behluli I made a minor change to not throw when a lease is still held. Now, the directory returns a 'retry after delay' result instead. |
…s.cs Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
|
@ReubenBond is there something left on my side? |
This PR adds safety lease holds to the new strongly-consistent grain directory to prevent split-brain scenarios after silo crashes, utilizing both specific grain leases (tombstones) and broader range leases when snapshot transfers fail. Under normal graceful shutdowns the new directory performs handovers, so no range lease holds are applied.
Users can specify the lease duration, otherwise if duration <= zero than it is computed as:
2 × ProbeTimeout × NumMissedProbesLimitwhich should provide a good buffer so that a network-partitioned silo has enough time to realize its isolated and shut itself down before the rest of the cluster allows those grains to reactivate elsewhere.The impact on normal registrations in a healthy cluster should be practically unobservable.
Note, this wont work with the eventually-consistent directory!
Microsoft Reviewers: Open in CodeFlow