fix(cluster): complete Issue #905 step 2 correctness for Rand and Rou…#915
fix(cluster): complete Issue #905 step 2 correctness for Rand and Rou…#915mochengqian wants to merge 3 commits intoapache:developfrom
Conversation
|
Issue #905 benchmark baseline (Step 1 + Step 2) Environment:
Command: go test ./pkg/server -run '^$' -bench '^BenchmarkCluster' -benchmem -count=5Each Raw output was captured locally for reference. PickEndpoint / Lookup
LB Hot Path / Healthy Filter
CAS / Consistent Hash
Notes
|
There was a problem hiding this comment.
Pull request overview
Implements Issue #905 Step 1 (tests/benchmarks) and Step 2 (correctness) for cluster endpoint picking, focusing on fixing Rand / RoundRobin behavior with unhealthy endpoints and improving concurrency safety for round-robin picks.
Changes:
- Fix
RandandRoundRobinto pick only from healthy endpoints and returnnilwhen none are healthy. - Make round-robin cursor updates concurrency-safe via atomic operations.
- Add regression tests and a benchmark suite covering pick-path behavior and related hot paths.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/server/cluster_manager_test.go | Adds/extends manager-path regression tests (missing cluster, unhealthy endpoints) and a concurrent pick race-style test. |
| pkg/server/cluster_manager_bench_test.go | Introduces a benchmark suite for PickEndpoint, lookup cost, LB hot path, health filtering cost, CAS workload, and consistent-hash resolve. |
| pkg/model/cluster_test.go | Adds unit tests for GetEndpoint health filtering, consistent-hash initialization, and Endpoint.GetHost. |
| pkg/model/cluster.go | Changes PrePickEndpointIndex type to uint32 to support atomic increments. |
| pkg/cluster/loadbalancer/roundrobin/round_robin_test.go | Adds deterministic tests for all-unhealthy behavior and ordering/cursor handling. |
| pkg/cluster/loadbalancer/roundrobin/round_robin.go | Updates round-robin handler to return nil when no healthy endpoints and uses atomic cursor increments. |
| pkg/cluster/loadbalancer/rand/load_balancer_rand_test.go | Adds deterministic tests for Rand correctness and all-unhealthy behavior without panics. |
| pkg/cluster/loadbalancer/rand/load_balancer_rand.go | Fixes Rand to use healthy slice length, return nil on empty, and adds a test hook for deterministic randomness. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #915 +/- ##
===========================================
+ Coverage 22.05% 22.23% +0.17%
===========================================
Files 270 270
Lines 20069 20070 +1
===========================================
+ Hits 4426 4462 +36
+ Misses 15193 15160 -33
+ Partials 450 448 -2
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| HealthChecks []HealthCheckConfig `yaml:"health_checks" json:"health_checks"` | ||
| Endpoints []*Endpoint `yaml:"endpoints" json:"endpoints"` | ||
| PrePickEndpointIndex int | ||
| PrePickEndpointIndex uint32 `yaml:"-" json:"-"` // runtime-only round-robin cursor state |
There was a problem hiding this comment.
Marking PrePickEndpointIndex as yaml:"-" json:"-" makes the cursor disappear when CloneStore() serializes the current store and fetchCompareAndSet() rebuilds a fresh store before CompareAndSetStore().
Spring Cloud 的 adapter 会通过 CloneStore -> CompareAndSetStore 这条链路动态刷新 cluster
On the Spring Cloud refresh path, that means every registry sync resets round-robin back to endpoint 0, biasing traffic toward the first healthy endpoint whenever the store is refreshed. Before this change the exported field survived the clone, so this is a behavior regression outside the new unit coverage.



What this PR does:
This PR delivers Step 1 (tests + benchmarks) and Step 2 (correctness) for Issue #905, and is intended to be the first mergeable slice with a clean boundary.
It:
Randto choose only from the healthy endpoint slice and returnnilwhen no healthy endpoint existsRoundRobinto returnnilwhen no healthy endpoint existsRand/RoundRobincorrectness issuesPickEndpointserial baselinePickEndpointparallel baselineCompareAndSetStoreworkloadThis PR does not start the Step 3+ runtime-consistency / snapshot refactor yet.
Which issue(s) this PR fixes:
Fixes #905
Special notes for your reviewer:
Issue #905 step progress:
Add tests and benchmarks
Fix current correctness issues
Tighten runtime consistency
Switch cluster lookup to O(1)
Introduce healthy endpoint snapshots
Optimize simple LB hot path
Optimize consistent-hash LB last
This PR intentionally covers Step 1 (tests + benchmarks) and Step 2 (correctness) only.
Runtime consistency for
UpdateCluster/CompareAndSetStore, O(1) cluster lookup, healthy endpoint snapshots, simple LB hot-path optimization, and consistent-hash optimization will follow in later PRs.Detailed benchmark baseline numbers are posted in a separate PR comment so later steps can compare against the same baseline.
Local validation completed:
go test ./pkg/model ./pkg/server ./pkg/cluster/loadbalancer/rand ./pkg/cluster/loadbalancer/roundrobin -count=1go test ./pkg/server ./pkg/cluster/loadbalancer/rand ./pkg/cluster/loadbalancer/roundrobin -race -gcflags=-l -count=1go vet ./pkg/server ./pkg/cluster/loadbalancer/... ./pkg/modelgo test ./pkg/server -run '^$' -bench '^BenchmarkCluster' -benchmem -count=5Does this PR introduce a user-facing change?: