Skip to content

fix(cluster): complete Issue #905 step 2 correctness for Rand and Rou…#915

Open
mochengqian wants to merge 3 commits intoapache:developfrom
mochengqian:feat/issue-905
Open

fix(cluster): complete Issue #905 step 2 correctness for Rand and Rou…#915
mochengqian wants to merge 3 commits intoapache:developfrom
mochengqian:feat/issue-905

Conversation

@mochengqian
Copy link
Copy Markdown

@mochengqian mochengqian commented Apr 18, 2026

What this PR does:

This PR delivers Step 1 (tests + benchmarks) and Step 2 (correctness) for Issue #905, and is intended to be the first mergeable slice with a clean boundary.

It:

  • fixes Rand to choose only from the healthy endpoint slice and return nil when no healthy endpoint exists
  • fixes RoundRobin to return nil when no healthy endpoint exists
  • makes the round-robin cursor safe for concurrent picks
  • adds deterministic regression coverage for the current Rand / RoundRobin correctness issues
  • extends manager-path regression coverage for missing-cluster / unhealthy-endpoint nil behavior
  • adds and refines the benchmark suite, including:
    • PickEndpoint serial baseline
    • cleaner PickEndpoint parallel baseline
    • lookup-only benchmark
    • simple LB hot-path-only benchmark
    • healthy endpoint filtering cost
    • mixed CompareAndSetStore workload
    • consistent-hash resolve baseline

This PR does not start the Step 3+ runtime-consistency / snapshot refactor yet.

Which issue(s) this PR fixes:

Fixes #905

Special notes for your reviewer:

Issue #905 step progress:

  • Add tests and benchmarks

  • Fix current correctness issues

  • Tighten runtime consistency

  • Switch cluster lookup to O(1)

  • Introduce healthy endpoint snapshots

  • Optimize simple LB hot path

  • Optimize consistent-hash LB last

  • This PR intentionally covers Step 1 (tests + benchmarks) and Step 2 (correctness) only.

  • Runtime consistency for UpdateCluster / CompareAndSetStore, O(1) cluster lookup, healthy endpoint snapshots, simple LB hot-path optimization, and consistent-hash optimization will follow in later PRs.

  • Detailed benchmark baseline numbers are posted in a separate PR comment so later steps can compare against the same baseline.

  • Local validation completed:

    • go test ./pkg/model ./pkg/server ./pkg/cluster/loadbalancer/rand ./pkg/cluster/loadbalancer/roundrobin -count=1
    • go test ./pkg/server ./pkg/cluster/loadbalancer/rand ./pkg/cluster/loadbalancer/roundrobin -race -gcflags=-l -count=1
    • go vet ./pkg/server ./pkg/cluster/loadbalancer/... ./pkg/model
    • go test ./pkg/server -run '^$' -bench '^BenchmarkCluster' -benchmem -count=5

Does this PR introduce a user-facing change?:

NONE

@mochengqian
Copy link
Copy Markdown
Author

Issue #905 benchmark baseline (Step 1 + Step 2)

Environment:

  • goos=darwin
  • goarch=arm64
  • cpu=Apple M5

Command:

go test ./pkg/server -run '^$' -bench '^BenchmarkCluster' -benchmem -count=5

Each Mean value below is the arithmetic mean across 5 runs. Range shows the min-max span across the same 5 runs.

Raw output was captured locally for reference.

PickEndpoint / Lookup

Benchmark Mean ns/op Range ns/op B/op allocs/op
PickEndpoint serial / Rand / clusters=1 16.12 15.94-16.29 0 0
PickEndpoint serial / Rand / clusters=32 33.09 32.98-33.32 0 0
PickEndpoint serial / Rand / clusters=256 113.58 112.80-115.30 0 0
PickEndpoint serial / Rand / clusters=1024 685.30 669.10-692.30 0 0
PickEndpoint serial / RoundRobin / clusters=1 13.51 13.29-13.96 0 0
PickEndpoint serial / RoundRobin / clusters=32 29.59 29.48-29.82 0 0
PickEndpoint serial / RoundRobin / clusters=256 111.00 110.20-112.30 0 0
PickEndpoint serial / RoundRobin / clusters=1024 767.10 675.90-1054.00 0 0
Lookup serial / clusters=1 3.33 3.32-3.35 0 0
Lookup serial / clusters=32 17.82 17.77-17.85 0 0
Lookup serial / clusters=256 111.68 111.50-112.20 0 0
Lookup serial / clusters=1024 744.18 738.00-749.60 0 0
PickEndpoint parallel / Rand 100.42 98.42-101.30 0 0
PickEndpoint parallel / RoundRobin 107.66 103.10-110.10 0 0

LB Hot Path / Healthy Filter

Benchmark Mean ns/op Range ns/op B/op allocs/op
LB hot path / Rand / endpoints=4 13.72 13.51-13.93 0 0
LB hot path / Rand / endpoints=64 134.96 131.40-136.70 512 1
LB hot path / Rand / endpoints=512 860.96 852.00-881.30 4864 1
LB hot path / RoundRobin / endpoints=4 9.29 8.96-9.57 0 0
LB hot path / RoundRobin / endpoints=64 125.40 124.60-128.10 512 1
LB hot path / RoundRobin / endpoints=512 858.72 851.60-868.40 4864 1
Healthy filter / endpoints=8, healthy=100% 20.02 19.92-20.11 64 1
Healthy filter / endpoints=8, healthy=50% 19.11 18.98-19.45 64 1
Healthy filter / endpoints=8, healthy=0% 16.78 16.70-16.91 64 1
Healthy filter / endpoints=64, healthy=100% 120.60 120.30-121.70 512 1
Healthy filter / endpoints=64, healthy=50% 102.90 102.00-104.50 512 1
Healthy filter / endpoints=64, healthy=0% 87.86 87.38-88.61 512 1
Healthy filter / endpoints=512, healthy=100% 900.28 894.20-908.50 4864 1
Healthy filter / endpoints=512, healthy=50% 744.82 730.60-758.70 4864 1
Healthy filter / endpoints=512, healthy=0% 498.62 495.40-501.60 4864 1

CAS / Consistent Hash

Benchmark Mean ns/op Range ns/op B/op allocs/op
CompareAndSetStore mixed 586839.40 573165.00-615044.00 1850084 6417
Consistent hash / RingHash 2644.00 2617.00-2690.00 2212 128
Consistent hash / Maglev 1944.20 1921.00-1964.00 1828 99

Notes

  • The split benchmarks are intended to make later step-by-step attribution clearer:

    • Step 4 should primarily move the lookup-only numbers.
    • Step 5/6 should primarily move the healthy-filter and simple LB hot-path numbers.
    • Step 7 should primarily move the consistent-hash numbers.
  • PickEndpoint serial / RoundRobin / clusters=1024 showed one isolated high value in the count=5 suite run. A dedicated rerun with:

    go test ./pkg/server -run '^$' -bench '^BenchmarkClusterPickEndpointSerial/RoundRobin/clusters=1024$' -benchmem -count=10

    was stable at:

    • mean: 687.15 ns/op
    • median: 685.75 ns/op
    • range: 673.60-712.40 ns/op

    I am treating the earlier 1054.00 ns/op sample as an outlier rather than a representative signal.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Implements Issue #905 Step 1 (tests/benchmarks) and Step 2 (correctness) for cluster endpoint picking, focusing on fixing Rand / RoundRobin behavior with unhealthy endpoints and improving concurrency safety for round-robin picks.

Changes:

  • Fix Rand and RoundRobin to pick only from healthy endpoints and return nil when none are healthy.
  • Make round-robin cursor updates concurrency-safe via atomic operations.
  • Add regression tests and a benchmark suite covering pick-path behavior and related hot paths.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pkg/server/cluster_manager_test.go Adds/extends manager-path regression tests (missing cluster, unhealthy endpoints) and a concurrent pick race-style test.
pkg/server/cluster_manager_bench_test.go Introduces a benchmark suite for PickEndpoint, lookup cost, LB hot path, health filtering cost, CAS workload, and consistent-hash resolve.
pkg/model/cluster_test.go Adds unit tests for GetEndpoint health filtering, consistent-hash initialization, and Endpoint.GetHost.
pkg/model/cluster.go Changes PrePickEndpointIndex type to uint32 to support atomic increments.
pkg/cluster/loadbalancer/roundrobin/round_robin_test.go Adds deterministic tests for all-unhealthy behavior and ordering/cursor handling.
pkg/cluster/loadbalancer/roundrobin/round_robin.go Updates round-robin handler to return nil when no healthy endpoints and uses atomic cursor increments.
pkg/cluster/loadbalancer/rand/load_balancer_rand_test.go Adds deterministic tests for Rand correctness and all-unhealthy behavior without panics.
pkg/cluster/loadbalancer/rand/load_balancer_rand.go Fixes Rand to use healthy slice length, return nil on empty, and adds a test hook for deterministic randomness.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/model/cluster.go Outdated
Comment thread pkg/server/cluster_manager_bench_test.go
@sonarqubecloud
Copy link
Copy Markdown

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 22.23%. Comparing base (e6be678) to head (3b89c87).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop     #915      +/-   ##
===========================================
+ Coverage    22.05%   22.23%   +0.17%     
===========================================
  Files          270      270              
  Lines        20069    20070       +1     
===========================================
+ Hits          4426     4462      +36     
+ Misses       15193    15160      -33     
+ Partials       450      448       -2     
Flag Coverage Δ
unittests 22.23% <100.00%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@Alanxtl Alanxtl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cool work

Comment thread pkg/model/cluster.go
HealthChecks []HealthCheckConfig `yaml:"health_checks" json:"health_checks"`
Endpoints []*Endpoint `yaml:"endpoints" json:"endpoints"`
PrePickEndpointIndex int
PrePickEndpointIndex uint32 `yaml:"-" json:"-"` // runtime-only round-robin cursor state
Copy link
Copy Markdown
Contributor

@Alanxtl Alanxtl Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking PrePickEndpointIndex as yaml:"-" json:"-" makes the cursor disappear when CloneStore() serializes the current store and fetchCompareAndSet() rebuilds a fresh store before CompareAndSetStore().
Spring Cloud 的 adapter 会通过 CloneStore -> CompareAndSetStore 这条链路动态刷新 cluster
On the Spring Cloud refresh path, that means every registry sync resets round-robin back to endpoint 0, biasing traffic toward the first healthy endpoint whenever the store is refreshed. Before this change the exported field survived the clone, so this is a behavior regression outside the new unit coverage.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[FEATURE] cluster 模块问题梳理与优化方向

4 participants