fix inconsistent store pointers between the region-cache and store-cache#1826
Conversation
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
b79e950 to
4c68ce5
Compare
zyguan
left a comment
There was a problem hiding this comment.
LGTM
The following micro-bench results show there is no significant performance regression.
| workload | threads | baseline | this fix | diff |
|---|---|---|---|---|
| pointget | 64 | 84676.63 | 84574.63 | -0.12% |
| batchget | 64 | 24640.70 | 24687.14 | 0.19% |
| tblscan | 64 | 41237.60 | 40384.53 | -2.07% |
| idxscan | 64 | 39560.02 | 39342.72 | -0.55% |
| idxlookup | 64 | 16367.99 | 16306.71 | -0.37% |
📝 WalkthroughWalkthroughIntroduce thread-safe store accessors and in-place metadata updates: add per-store mutex and GetAddr/GetPeerAddr/GetLabels/StoreType, update reResolve to modify Store objects instead of replacing them, remove the Deleted resolve state, and update region cache, region requests, logging, and tests to use the new accessors. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant PD as PD/Meta
participant StoreCache as StoreCache
participant Store as Store (metaMu)
participant RegionCache as RegionCache
participant Health as HealthCheck
PD->>StoreCache: publish store metadata update (addr/labels/type)
StoreCache->>Store: find existing Store pointer
StoreCache->>Store: call updateMetadataFrom(meta)
Store->>Store: lock metaMu and update addr/labels/type, unlock
StoreCache->>RegionCache: keep existing Store pointer (no replace)
RegionCache->>Store: call StoreType()/GetAddr()/GetLabels() when evaluating replicas
Health->>Store: call GetAddr()/StoreType() for health checks / reResolve
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@internal/locate/store_cache.go`:
- Around line 315-320: GetLabels currently returns the internal slice s.labels
directly under Store; change GetLabels in type Store to return a defensive copy
of s.labels while holding metaMu (use s.metaMu.RLock()/RUnlock()) so callers
cannot mutate the internal slice (e.g., via append) and break thread-safety—copy
elements into a new slice and return that copy instead of s.labels.
🧹 Nitpick comments (1)
internal/locate/region_cache_test.go (1)
3118-3120: Avoid mutating Store labels returned byGetLabels
GetLabels()exposes internal metadata; appending here can mutate the store’s labels and mask the update-path the test is trying to validate. Clone before append so the test remains isolated.♻️ Suggested tweak
- labels := store1.GetLabels() - labels = append(labels, &metapb.StoreLabel{Key: "host", Value: "0.0.0.0:20161"}) + labels := append([]*metapb.StoreLabel(nil), store1.GetLabels()...) + labels = append(labels, &metapb.StoreLabel{Key: "host", Value: "0.0.0.0:20161"})
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
internal/locate/region_cache.gointernal/locate/region_cache_test.gointernal/locate/region_request.gointernal/locate/region_request3_test.gointernal/locate/replica_selector_test.gointernal/locate/store_cache.go
🧰 Additional context used
🧬 Code graph analysis (4)
internal/locate/region_request3_test.go (3)
tikvrpc/endpoint.go (1)
TiKV(44-44)internal/locate/region_cache.go (1)
WithMatchLabels(859-863)tikv/region.go (1)
WithMatchLabels(181-183)
internal/locate/region_request.go (6)
util/redact/redact.go (1)
String(34-39)metrics/metrics.go (1)
TiKVStoreLimitErrorCounter(93-93)internal/locate/store_cache.go (1)
Store(216-240)tikv/region.go (1)
Store(142-142)internal/mockstore/mocktikv/cluster.go (1)
Store(762-765)tikvrpc/endpoint.go (1)
TiFlashCompute(47-47)
internal/locate/store_cache.go (3)
tikvrpc/endpoint.go (2)
TiFlash(45-45)TiKV(44-44)metrics/shortcuts.go (1)
RegionCacheCounterWithInvalidateStoreRegionsOK(125-125)internal/logutil/log.go (1)
BgLogger(46-48)
internal/locate/region_cache.go (2)
internal/locate/store_cache.go (1)
Store(216-240)tikvrpc/endpoint.go (2)
TiKV(44-44)TiFlash(45-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
- GitHub Check: test
- GitHub Check: integration-raw-tikv (v1ttl)
- GitHub Check: race-test
- GitHub Check: integration-local-race
- GitHub Check: integration-tikv
- GitHub Check: integration-raw-tikv (v2)
- GitHub Check: integration-local
- GitHub Check: integration-next-gen-tikv
🔇 Additional comments (15)
internal/locate/region_request.go (5)
781-788: Logging uses GetAddr—nice alignment with accessor-based metadata.Keeps log output consistent with refreshed store metadata.
1699-1708: Metrics label now uses GetAddr—good consistency with store accessor API.
1781-1783: Store type check via StoreType() is consistent with encapsulation.
1791-1804: TiFlash-related backoff now uses StoreType() accessor—looks good.
2086-2094: ServerIsBusy TiFlash path uses StoreType() accessor—consistent update.internal/locate/region_request3_test.go (6)
237-243: Test liveness hook now uses GetAddr—aligned with accessor API.
368-376: RPC context assertions use GetAddr for both target and proxy—nice consistency.
1069-1072: Leader address derived via GetAddr—keeps tests consistent with store accessors.
1427-1436: Follower label/addr assertions via GetLabels/GetAddr—consistent with new API.
1454-1460: PreferLeader test uses GetAddr for leader match—good accessor alignment.
1519-1593: Leader handoff test now reads/writes leader addr via GetAddr—clean update.internal/locate/replica_selector_test.go (1)
69-72: Store label assertions now use GetLabels—consistent with accessor-based metadata.internal/locate/region_cache_test.go (1)
367-367: Tests now use accessor-based assertionsGood alignment with the new encapsulation (
GetAddr()/GetLabels()), keeping tests consistent with thread-safe metadata access.Also applies to: 445-451, 2303-2310, 3141-3147
internal/locate/store_cache.go (1)
149-154: Metadata access is now well-encapsulated
metaMu+updateMetadataFromand accessor usage keep store metadata consistent and support in-place updates, which helps with pointer consistency.Also applies to: 217-273, 285-313, 337-365, 423-475, 478-525, 560-565, 628-632, 659-665, 689-732, 1119-1123
internal/locate/region_cache.go (1)
379-384: Accessor adoption across region cache paths looks consistentGood to see
StoreType()/GetAddr()/GetLabels()used everywhere, keeping region-cache logic aligned with the new store metadata encapsulation.Also applies to: 802-804, 838-844, 939-940, 1008-1009, 1043-1044, 1055-1056, 1065-1066, 1791-1792, 2202-2210, 2706-2707, 2719-2720, 2818-2822, 2851-2852
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
Signed-off-by: AndreMouche <[email protected]>
|
@cfzjywxk @ekexium @MyonKeminta PTAL, thanks |
Signed-off-by: AndreMouche <[email protected]>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: AndreMouche <[email protected]>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func (s *Store) GetLabels() []*metapb.StoreLabel { | ||
| s.metaMu.RLock() | ||
| defer s.metaMu.RUnlock() | ||
| return s.labels |
There was a problem hiding this comment.
GetLabels() returns the backing slice from the store. Even with the “not safe to modify” note, callers can still accidentally append/mutate it without holding metaMu, which breaks the thread-safety guarantees this PR is introducing. Consider returning a shallow copy of the slice header (e.g., clone the slice) or providing APIs that avoid exposing a mutable slice to callers.
| return s.labels | |
| labels := make([]*metapb.StoreLabel, len(s.labels)) | |
| copy(labels, s.labels) | |
| return labels |
There was a problem hiding this comment.
We intentionally avoid deep cloning here for performance reasons:
-
Performance considerations:
GetLabels()is called frequently throughout the codebase (25+ call sites)- Deep cloning would introduce unnecessary memory allocations and copy overhead
- Labels are typically used in read-only operations (comparison, lookup) rather than modification
-
Current protection measures:
- The function documentation explicitly states: "it is not safe to modify the returned labels"
- Read access is protected by
metaMu.RLock() - Label updates are performed atomically via
updateMetadataFrom()undermetaMu.Lock()protection
-
Risk assessment:
- Modifying the returned labels would violate the API contract
- All current usages are read-only (e.g.,
IsSameLabels,IsLabelsMatch,GetLabelValue) - The performance benefit outweighs the risk, especially given the explicit documentation
The current implementation is appropriate for our use case.
Signed-off-by: AndreMouche <[email protected]>
5562e67 to
61e0568
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: ekexium, zyguan The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
In response to a cherrypick label: cannot checkout |
|
/cherry-pick tidb-8.5 |
fix tikv#1823 Signed-off-by: ti-chi-bot <[email protected]>
|
@zyguan: new pull request created to branch DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository. |
…che (#1826) (#1874) fix #1823 Signed-off-by: ti-chi-bot <[email protected]> Signed-off-by: zyguan <[email protected]> Co-authored-by: Shirly <[email protected]> Co-authored-by: zyguan <[email protected]>
fix #1823
Summary by CodeRabbit
Refactor
Bug Fixes
Tests
✏️ Tip: You can customize this high-level summary in your review settings.