Skip to content

fix inconsistent store pointers between the region-cache and store-cache#1826

Merged
ti-chi-bot[bot] merged 24 commits intotikv:masterfrom
AndreMouche:store_cache_update_addr
Jan 29, 2026
Merged

fix inconsistent store pointers between the region-cache and store-cache#1826
ti-chi-bot[bot] merged 24 commits intotikv:masterfrom
AndreMouche:store_cache_update_addr

Conversation

@AndreMouche
Copy link
Member

@AndreMouche AndreMouche commented Dec 19, 2025

fix #1823

Summary by CodeRabbit

  • Refactor

    • Store metadata handling made thread-safe; metadata updates now occur in-place and logging/address/type reporting improved.
  • Bug Fixes

    • Address and label changes no longer disrupt replica references or selection stability during restarts and re-resolves.
  • Tests

    • Tests updated and extended to cover address/label updates, restart scenarios, and leader/proxy selection consistency.

✏️ Tip: You can customize this high-level summary in your review settings.

@ti-chi-bot ti-chi-bot bot added dco-signoff: yes Indicates the PR's author has signed the dco. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 19, 2025
Signed-off-by: AndreMouche <[email protected]>
@AndreMouche AndreMouche force-pushed the store_cache_update_addr branch from b79e950 to 4c68ce5 Compare December 19, 2025 00:53
Copy link
Contributor

@zyguan zyguan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

The following micro-bench results show there is no significant performance regression.

workload threads baseline this fix diff
pointget 64 84676.63 84574.63 -0.12%
batchget 64 24640.70 24687.14 0.19%
tblscan 64 41237.60 40384.53 -2.07%
idxscan 64 39560.02 39342.72 -0.55%
idxlookup 64 16367.99 16306.71 -0.37%

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Dec 19, 2025
@coderabbitai
Copy link

coderabbitai bot commented Jan 16, 2026

📝 Walkthrough

Walkthrough

Introduce thread-safe store accessors and in-place metadata updates: add per-store mutex and GetAddr/GetPeerAddr/GetLabels/StoreType, update reResolve to modify Store objects instead of replacing them, remove the Deleted resolve state, and update region cache, region requests, logging, and tests to use the new accessors.

Changes

Cohort / File(s) Summary
Store Cache Core
internal/locate/store_cache.go
Add metaMu sync.RWMutex; add updateMetadataFrom(*metapb.Store) and Put(store); add thread-safe accessors GetAddr(), GetPeerAddr(), GetLabels(); adjust reResolve to update metadata in-place and remove Deleted resolve state.
Region Cache Core
internal/locate/region_cache.go
Replace direct accesses to addr, storeType, labels with GetAddr(), StoreType(), GetLabels(); remove changeToActiveStore; update proxy-store handling, RPCContext.String(), region/journal logic, and logging to use accessors.
Region Request Handling
internal/locate/region_request.go
Use GetAddr() for metrics/logging and StoreType() for TiFlash-related checks and liveness; update metric labels to use accessors.
Tests: Region / Replica / Requests
internal/locate/region_cache_test.go, internal/locate/region_request3_test.go, internal/locate/replica_selector_test.go, internal/locate/region_request*_test.go
Replace direct field access with GetAddr()/GetLabels() across tests; adjust imports and assertions; rename/adjust one health-check test to reflect address-change semantics.
Logging & RPC Contexts
internal/locate/...
Update RPCContext and logging/metric fields to call GetAddr() and StoreType().Name(); replace raw-field prints with accessor calls.
Manifest & Misc
go.mod, manifest_file, test imports
Small module/test import adjustments and minor manifest updates.

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant PD as PD/Meta
    participant StoreCache as StoreCache
    participant Store as Store (metaMu)
    participant RegionCache as RegionCache
    participant Health as HealthCheck

    PD->>StoreCache: publish store metadata update (addr/labels/type)
    StoreCache->>Store: find existing Store pointer
    StoreCache->>Store: call updateMetadataFrom(meta)
    Store->>Store: lock metaMu and update addr/labels/type, unlock
    StoreCache->>RegionCache: keep existing Store pointer (no replace)
    RegionCache->>Store: call StoreType()/GetAddr()/GetLabels() when evaluating replicas
    Health->>Store: call GetAddr()/StoreType() for health checks / reResolve
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested labels

lgtm

Suggested reviewers

  • cfzjywxk

Poem

🐰
I nibbled fields and stitched them tight,
A mutex hug to guard the night,
Getters whisper where labels play,
Stores stay whole — no pointers stray,
Hops and logs hum safe and bright.

🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 22.22% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main objective: fixing the inconsistency between region-cache and store-cache store pointers when store metadata changes.
Linked Issues check ✅ Passed The PR implements the core solution to issue #1823 by adding accessor methods and updating the store metadata in-place, ensuring region-cache and store-cache maintain consistent pointers.
Out of Scope Changes check ✅ Passed All changes are within scope: adding GetAddr() and GetLabels() accessors with thread-safe guards, implementing updateMetadataFrom() to update stores in-place, replacing direct field accesses with accessors throughout the codebase to maintain consistency with the new accessor-based API.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@internal/locate/store_cache.go`:
- Around line 315-320: GetLabels currently returns the internal slice s.labels
directly under Store; change GetLabels in type Store to return a defensive copy
of s.labels while holding metaMu (use s.metaMu.RLock()/RUnlock()) so callers
cannot mutate the internal slice (e.g., via append) and break thread-safety—copy
elements into a new slice and return that copy instead of s.labels.
🧹 Nitpick comments (1)
internal/locate/region_cache_test.go (1)

3118-3120: Avoid mutating Store labels returned by GetLabels

GetLabels() exposes internal metadata; appending here can mutate the store’s labels and mask the update-path the test is trying to validate. Clone before append so the test remains isolated.

♻️ Suggested tweak
- labels := store1.GetLabels()
- labels = append(labels, &metapb.StoreLabel{Key: "host", Value: "0.0.0.0:20161"})
+ labels := append([]*metapb.StoreLabel(nil), store1.GetLabels()...)
+ labels = append(labels, &metapb.StoreLabel{Key: "host", Value: "0.0.0.0:20161"})
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e5f6398 and c0c4fab.

📒 Files selected for processing (6)
  • internal/locate/region_cache.go
  • internal/locate/region_cache_test.go
  • internal/locate/region_request.go
  • internal/locate/region_request3_test.go
  • internal/locate/replica_selector_test.go
  • internal/locate/store_cache.go
🧰 Additional context used
🧬 Code graph analysis (4)
internal/locate/region_request3_test.go (3)
tikvrpc/endpoint.go (1)
  • TiKV (44-44)
internal/locate/region_cache.go (1)
  • WithMatchLabels (859-863)
tikv/region.go (1)
  • WithMatchLabels (181-183)
internal/locate/region_request.go (6)
util/redact/redact.go (1)
  • String (34-39)
metrics/metrics.go (1)
  • TiKVStoreLimitErrorCounter (93-93)
internal/locate/store_cache.go (1)
  • Store (216-240)
tikv/region.go (1)
  • Store (142-142)
internal/mockstore/mocktikv/cluster.go (1)
  • Store (762-765)
tikvrpc/endpoint.go (1)
  • TiFlashCompute (47-47)
internal/locate/store_cache.go (3)
tikvrpc/endpoint.go (2)
  • TiFlash (45-45)
  • TiKV (44-44)
metrics/shortcuts.go (1)
  • RegionCacheCounterWithInvalidateStoreRegionsOK (125-125)
internal/logutil/log.go (1)
  • BgLogger (46-48)
internal/locate/region_cache.go (2)
internal/locate/store_cache.go (1)
  • Store (216-240)
tikvrpc/endpoint.go (2)
  • TiKV (44-44)
  • TiFlash (45-45)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (8)
  • GitHub Check: test
  • GitHub Check: integration-raw-tikv (v1ttl)
  • GitHub Check: race-test
  • GitHub Check: integration-local-race
  • GitHub Check: integration-tikv
  • GitHub Check: integration-raw-tikv (v2)
  • GitHub Check: integration-local
  • GitHub Check: integration-next-gen-tikv
🔇 Additional comments (15)
internal/locate/region_request.go (5)

781-788: Logging uses GetAddr—nice alignment with accessor-based metadata.

Keeps log output consistent with refreshed store metadata.


1699-1708: Metrics label now uses GetAddr—good consistency with store accessor API.


1781-1783: Store type check via StoreType() is consistent with encapsulation.


1791-1804: TiFlash-related backoff now uses StoreType() accessor—looks good.


2086-2094: ServerIsBusy TiFlash path uses StoreType() accessor—consistent update.

internal/locate/region_request3_test.go (6)

237-243: Test liveness hook now uses GetAddr—aligned with accessor API.


368-376: RPC context assertions use GetAddr for both target and proxy—nice consistency.


1069-1072: Leader address derived via GetAddr—keeps tests consistent with store accessors.


1427-1436: Follower label/addr assertions via GetLabels/GetAddr—consistent with new API.


1454-1460: PreferLeader test uses GetAddr for leader match—good accessor alignment.


1519-1593: Leader handoff test now reads/writes leader addr via GetAddr—clean update.

internal/locate/replica_selector_test.go (1)

69-72: Store label assertions now use GetLabels—consistent with accessor-based metadata.

internal/locate/region_cache_test.go (1)

367-367: Tests now use accessor-based assertions

Good alignment with the new encapsulation (GetAddr() / GetLabels()), keeping tests consistent with thread-safe metadata access.

Also applies to: 445-451, 2303-2310, 3141-3147

internal/locate/store_cache.go (1)

149-154: Metadata access is now well-encapsulated

metaMu + updateMetadataFrom and accessor usage keep store metadata consistent and support in-place updates, which helps with pointer consistency.

Also applies to: 217-273, 285-313, 337-365, 423-475, 478-525, 560-565, 628-632, 659-665, 689-732, 1119-1123

internal/locate/region_cache.go (1)

379-384: Accessor adoption across region cache paths looks consistent

Good to see StoreType() / GetAddr() / GetLabels() used everywhere, keeping region-cache logic aligned with the new store metadata encapsulation.

Also applies to: 802-804, 838-844, 939-940, 1008-1009, 1043-1044, 1055-1056, 1065-1066, 1791-1792, 2202-2210, 2706-2707, 2719-2720, 2818-2822, 2851-2852

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 16, 2026
Signed-off-by: AndreMouche <[email protected]>
@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jan 16, 2026
@ti-chi-bot ti-chi-bot bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 16, 2026
@AndreMouche
Copy link
Member Author

@cfzjywxk @ekexium @MyonKeminta PTAL, thanks

Signed-off-by: AndreMouche <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: AndreMouche <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

func (s *Store) GetLabels() []*metapb.StoreLabel {
s.metaMu.RLock()
defer s.metaMu.RUnlock()
return s.labels
Copy link

Copilot AI Jan 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GetLabels() returns the backing slice from the store. Even with the “not safe to modify” note, callers can still accidentally append/mutate it without holding metaMu, which breaks the thread-safety guarantees this PR is introducing. Consider returning a shallow copy of the slice header (e.g., clone the slice) or providing APIs that avoid exposing a mutable slice to callers.

Suggested change
return s.labels
labels := make([]*metapb.StoreLabel, len(s.labels))
copy(labels, s.labels)
return labels

Copilot uses AI. Check for mistakes.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We intentionally avoid deep cloning here for performance reasons:

  1. Performance considerations:

    • GetLabels() is called frequently throughout the codebase (25+ call sites)
    • Deep cloning would introduce unnecessary memory allocations and copy overhead
    • Labels are typically used in read-only operations (comparison, lookup) rather than modification
  2. Current protection measures:

    • The function documentation explicitly states: "it is not safe to modify the returned labels"
    • Read access is protected by metaMu.RLock()
    • Label updates are performed atomically via updateMetadataFrom() under metaMu.Lock() protection
  3. Risk assessment:

    • Modifying the returned labels would violate the API contract
    • All current usages are read-only (e.g., IsSameLabels, IsLabelsMatch, GetLabelValue)
    • The performance benefit outweighs the risk, especially given the explicit documentation

The current implementation is appropriate for our use case.

Signed-off-by: AndreMouche <[email protected]>
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@AndreMouche
Copy link
Member Author

@ekexium @cfzjywxk PTAL Again,thanks

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 29, 2026
@ti-chi-bot
Copy link

ti-chi-bot bot commented Jan 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ekexium, zyguan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link

ti-chi-bot bot commented Jan 29, 2026

[LGTM Timeline notifier]

Timeline:

  • 2025-12-19 08:00:21.602995324 +0000 UTC m=+1805566.416772906: ☑️ agreed by zyguan.
  • 2026-01-29 03:51:38.265566012 +0000 UTC m=+1243525.879522868: ☑️ agreed by ekexium.

@ti-chi-bot ti-chi-bot bot merged commit beba787 into tikv:master Jan 29, 2026
19 checks passed
@ti-chi-bot ti-chi-bot bot added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Feb 6, 2026
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: cannot checkout release-8.5: error checking out release-8.5: exit status 1. output: error: pathspec 'release-8.5' did not match any file(s) known to git

@zyguan
Copy link
Contributor

zyguan commented Feb 6, 2026

/cherry-pick tidb-8.5

ti-chi-bot pushed a commit to ti-chi-bot/client-go that referenced this pull request Feb 6, 2026
@ti-chi-bot
Copy link
Member

@zyguan: new pull request created to branch tidb-8.5: #1874.
But this PR has conflicts, please resolve them!

Details

In response to this:

/cherry-pick tidb-8.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

ti-chi-bot bot pushed a commit that referenced this pull request Feb 12, 2026
…che (#1826) (#1874)

fix #1823

Signed-off-by: ti-chi-bot <[email protected]>
Signed-off-by: zyguan <[email protected]>

Co-authored-by: Shirly <[email protected]>
Co-authored-by: zyguan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Inconsistent store pointers between the region-cache and store-cache cause stale regions to become inaccessible.

5 participants