Skip to content

[Directory] Strengthen single activation guarantees even further.#9958

Open
ledjon-behluli wants to merge 11 commits intodotnet:mainfrom
ledjon-behluli:grain_leases
Open

[Directory] Strengthen single activation guarantees even further.#9958
ledjon-behluli wants to merge 11 commits intodotnet:mainfrom
ledjon-behluli:grain_leases

Conversation

@ledjon-behluli
Copy link
Copy Markdown
Contributor

@ledjon-behluli ledjon-behluli commented Mar 1, 2026

This PR adds safety lease holds to the new strongly-consistent grain directory to prevent split-brain scenarios after silo crashes, utilizing both specific grain leases (tombstones) and broader range leases when snapshot transfers fail. Under normal graceful shutdowns the new directory performs handovers, so no range lease holds are applied.

Users can specify the lease duration, otherwise if duration <= zero than it is computed as: 2 × ProbeTimeout × NumMissedProbesLimit which should provide a good buffer so that a network-partitioned silo has enough time to realize its isolated and shut itself down before the rest of the cluster allows those grains to reactivate elsewhere.

The impact on normal registrations in a healthy cluster should be practically unobservable.

Note, this wont work with the eventually-consistent directory!

Microsoft Reviewers: Open in CodeFlow

@ledjon-behluli
Copy link
Copy Markdown
Contributor Author

@ReubenBond addressed the feedback

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR strengthens the strongly-consistent distributed grain directory’s single-activation guarantees by introducing “safety lease holds” after ungraceful silo failures and during recovery scenarios, delaying re-registrations to reduce split-brain risk.

Changes:

  • Add silo-level and range-level lease hold tracking to directory partitions, including periodic cleanup of expired leases.
  • Enforce lease holds during RegisterAsync by rejecting registrations with a new DirectoryLeaseHoldException.
  • Add configuration (SafetyLeaseHoldDuration) and a new test validating reactivation blocking after an ungraceful shutdown.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
test/Orleans.GrainDirectory.Tests/Orleans.GrainDirectory.Tests.csproj Adds Microsoft.Extensions.TimeProvider.Testing package for FakeTimeProvider in tests.
test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs New test exercising lease-hold behavior after KillSiloAsync.
src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.cs Tracks/creates lease holds, applies holds on silo death, and adds cleanup logic + logging.
src/Orleans.Runtime/GrainDirectory/GrainDirectoryPartition.Interface.cs Enforces lease holds on registration by throwing DirectoryLeaseHoldException.
src/Orleans.Runtime/GrainDirectory/DistributedGrainDirectory.cs Computes effective lease duration, wires TimeProvider, and runs periodic lease cleanup requests.
src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs Adds SafetyLeaseHoldDuration option and documentation; switches to file-scoped namespace.
src/Orleans.Core.Abstractions/Exceptions/DirectoryLeaseHoldException.cs Introduces a new exception type for lease-hold rejections.
Comments suppressed due to low confidence (2)

src/Orleans.Runtime/Configuration/Options/GrainDirectoryOptions.cs:88

  • Typo in XML doc comment: "clcks skues" should be "clock skews".
    /// Conditional deregistration is used for lazy clean-up of activations whose prompt deregistration failed for some reason (e.g., message failure).
    /// This should always be at least one minute, since we compare the times on the directory partition, so message delays and clcks skues have
    /// to be allowed.

test/Orleans.GrainDirectory.Tests/GrainDirectory/GrainDirectoryLeaseTests.cs:65

  • Grammar in comment: "its the only one alive" should be "it's the only one alive".
            // Time has expired, we can place it now, and it will be the primary as its the only one alive.
            Assert.Equal(await leaseGrain.GetAddress(), primary.SiloAddress);

Copy link
Copy Markdown
Member

@ReubenBond ReubenBond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should add some more tests. I'll ask copilot to add some more coverage.

ReubenBond and others added 3 commits March 26, 2026 10:11
…lookup, and multi-grain scenarios

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When a safety lease hold is active (e.g., after an ungraceful silo crash or
during cluster startup with non-contiguous view changes), the partition
returns DirectoryResult.RetryAfter(remainingDuration) instead of throwing.
InvokeAsync checks RetryAfterDelay and waits the suggested duration before
retrying, keeping the retry inside the directory layer.

Previously, throwing DirectoryLeaseHoldException would propagate to the
activation layer, which deactivates with DirectoryFailure and forwards the
message to the same silo. Because DirectoryFailure is treated as a transient
error, the message ForwardCount is decremented on each cycle, preventing the
hop limit from being reached and creating an infinite zero-delay forwarding
loop that hangs the system.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Return RetryAfter from RegisterAsync instead of throwing, so InvokeAsync
waits the exact remaining lease duration using the injected TimeProvider.
This removes the DirectoryLeaseHoldException class entirely.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ReubenBond
Copy link
Copy Markdown
Member

@ledjon-behluli I made a minor change to not throw when a lease is still held. Now, the directory returns a 'retry after delay' result instead.

ledjon-behluli and others added 2 commits April 2, 2026 21:46
…s.cs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ledjon-behluli
Copy link
Copy Markdown
Contributor Author

@ReubenBond is there something left on my side?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants