Summary
We are experiencing constant throttling of vreplication flows when Vitess Tablet Throttler is enabled on a large cluster (20 replicas across 2 regions). As soon as throttling is turned on via the UpdateThrottlerConfig gRPC endpoint, vreplication stalls and never recovers. The error persists indefinitely until throttling is disabled.
reason_throttled: vplayer:mv_fix_1:vreplication is denied access due to unexpected error: metric not collected yet
Environment
- Large Vitess cluster with ~20 replicas
- Region A: 10 replicas
- Region B: 10 replicas (50–70 ms cross-region latency)
- Reference keyspace replicating into the main keyspace
- Throttler enabled via UpdateThrottlerConfig (activeCollectInterval: 250 ms)
Observed Behavior
- Once throttling is enabled, vreplication stops progressing. _vt.vreplication shows repeated errors:
reason_throttled: vplayer:mv_fix_1:vreplication is denied access due to unexpected error: metric not collected yet
- Running vtctldclient GetThrottlerStatus shows:
- Self-scope metrics are collected correctly.
- Shard-scope metrics consistently fail with metric not collected yet.
The error never clears. Disabling throttling immediately restores vreplication to normal behavior.
Expected Behavior
Throttler metric collection should complete within the 250 ms collection window (perhaps a long shot for clusters this large), or degrade gracefully without blocking vreplication. vreplication should not stall indefinitely.
Reproduction Steps
- Run a cluster with 20 replicas across 2 regions.
- Start vreplication from the reference keyspace into the main keyspace.
- Enable the throttler using UpdateThrottlerConfig.
Observe that:
- vreplication stalls,
- shard-scope metrics fail permanently,
- errors persist until throttler is fully disabled.
Suspected Root Cause (still under investigation; we have not yet confirmed whether the points below actually fix the issue)
1. Cluster size + cross-region latency exceeds collection window
- Shard-scope metric collection requires the PRIMARY to reach out to all 20 replicas every 250 ms.
- The hardcoded collection interval is `activeCollectInterval = 250 * time.Millisecond` (the PRIMARY polls replicas on this loop).
- Cross-region replicas add 50–70 ms of latency to half of these calls.
- We suspect the 250 ms loop is insufficient for 20 RPCs, half of them cross-region, creating a constant backlog and continuous “metric not collected yet” failures (see the back-of-envelope sketch below).
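Here is that sketch; the RTT values and the collect helper are assumptions standing in for the per-replica CheckThrottler RPC, not the throttler's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// collect stands in for one per-replica CheckThrottler RPC.
func collect(rtt time.Duration) { time.Sleep(rtt) }

func main() {
	const localReplicas, remoteReplicas = 10, 10
	localRTT := 2 * time.Millisecond   // same-region round trip (assumed)
	remoteRTT := 60 * time.Millisecond // cross-region round trip (50–70 ms observed)
	window := 250 * time.Millisecond   // activeCollectInterval

	// Serialized collection, e.g. when all calls funnel through one mutex.
	start := time.Now()
	for i := 0; i < localReplicas; i++ {
		collect(localRTT)
	}
	for i := 0; i < remoteReplicas; i++ {
		collect(remoteRTT)
	}
	fmt.Printf("serialized: %v (window %v)\n", time.Since(start), window) // ~620 ms, well over the window

	// Fully concurrent collection fits comfortably in the window.
	start = time.Now()
	var wg sync.WaitGroup
	for i := 0; i < localReplicas+remoteReplicas; i++ {
		rtt := localRTT
		if i >= localReplicas {
			rtt = remoteRTT
		}
		wg.Add(1)
		go func(d time.Duration) { defer wg.Done(); collect(d) }(rtt)
	}
	wg.Wait()
	fmt.Printf("concurrent: %v (window %v)\n", time.Since(start), window) // ~60 ms
}
```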
2. grpctmclient concurrency bottleneck (mutex contention)
We suspect issues with go/vt/vttablet/grpctmclient/client.go:
- A single mutex protects unrelated structures; one mutex (client.mu) guards both:
  - rpcClientMap (buffered channel pool)
  - rpcDialPoolMap (dedicated connection pool)
From vitess/go/vt/vttablet/grpctmclient/client.go (lines 107 to 117 at a586590):

```go
// grpcClient implements both dialer and poolDialer.
type grpcClient struct {
	// This cache of connections is to maximize QPS for ExecuteFetchAs{Dba,App},
	// CheckThrottler and FullStatus. Note we'll keep the clients open and close them upon Close() only.
	// But that's OK because usually the tasks that use them are one-purpose only.
	// The map is protected by the mutex.
	mu             sync.Mutex
	rpcClientMap   map[string]chan *tmc
	rpcDialPoolMap map[DialPoolGroup]addrTmcMap
}
```
Implications:
- During tablet inventory refresh or startup, dial-pool initialization can hold the mutex.
- Meanwhile, throttler metric collection calls (CheckThrottler) block on this mutex.
- These calls hit the 1 s timeout, causing repeated “metric not collected yet” errors.
- Increasing tablet_manager_grpc_concurrency (e.g., from 8 → 64) may reduce pressure but does not fix the root problem; a minimal sketch of the contention pattern follows below.
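Here is that sketch; it is simplified and self-contained, and the 2 s dial and the 1 s deadline are assumptions mirroring the behavior described above rather than the real grpctmclient code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// pool mimics a client whose dial cache and RPC path share one mutex.
type pool struct {
	mu sync.Mutex
}

// dialAll mimics inventory refresh or startup: it holds the mutex while doing slow network I/O.
func (p *pool) dialAll() {
	p.mu.Lock()
	defer p.mu.Unlock()
	time.Sleep(2 * time.Second) // simulated slow dials to many tablets
}

// checkThrottler mimics a metric-collection RPC that must take the same mutex.
func (p *pool) checkThrottler(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		p.mu.Lock()
		p.mu.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err() // deadline exceeded -> "metric not collected yet"
	}
}

func main() {
	p := &pool{}
	go p.dialAll() // e.g. tablet inventory refresh grabs the mutex first
	time.Sleep(10 * time.Millisecond)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("CheckThrottler:", p.checkThrottler(ctx)) // context deadline exceeded
}
```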
Possible Fix:
- In go/vt/vttablet/grpctmclient/client.go, have:

```go
// poolMu protects rpcClientMap, dedicatedMu protects rpcDialPoolMap.
poolMu      sync.Mutex
dedicatedMu sync.Mutex
```

- Use poolMu for dialPool and dedicatedMu for dialDedicatedPool.
3. Remaining bottleneck after separating poolMu and dedicatedMu
Even after separating the mutexes:
- dialDedicatedPool still holds dedicatedMu while executing network I/O inside createTmc().
- When PRIMARY collects metrics from 20+ replicas, it spawns many goroutines.
- All of them try calling dialDedicatedPool, but serialization on dedicatedMu causes:
- Wait times exceeding several seconds,
- gRPC context deadline exceeded (1 s)
- Continuous “metric not collected yet” throttler errors.
- Since the collector runs every 250 ms but each cycle can take multiple seconds, this leads to a permanent backlog and never-ending throttling.
Possible fix:
- Add the following to grpctmclient/client.go:
```go
type tmcEntry struct {
	once sync.Once
	tmc  *tmc
	err  error
}

type addrTmcMap map[string]*tmcEntry
```
- Change dialDedicatedPool to:
```go
func (client *grpcClient) dialDedicatedPool(ctx context.Context, dialPoolGroup DialPoolGroup, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, invalidatorFunc, error) {
	addr := netutil.JoinHostPort(tablet.Hostname, int32(tablet.PortMap["grpc"]))
	opt, err := grpcclient.SecureDialOption(cert, key, ca, crl, name)
	if err != nil {
		return nil, nil, err
	}

	client.dedicatedMu.Lock()
	if client.rpcDialPoolMap == nil {
		client.rpcDialPoolMap = make(map[DialPoolGroup]addrTmcMap)
	}
	if _, ok := client.rpcDialPoolMap[dialPoolGroup]; !ok {
		client.rpcDialPoolMap[dialPoolGroup] = make(addrTmcMap)
	}
	m := client.rpcDialPoolMap[dialPoolGroup]
	entry, ok := m[addr]
	if !ok {
		entry = &tmcEntry{}
		m[addr] = entry
	}
	// Unlock here, before the potentially slow createTmc call, so other
	// addresses can be dialed concurrently.
	client.dedicatedMu.Unlock()

	// Initialize the connection exactly once, without holding the mutex.
	entry.once.Do(func() {
		entry.tmc, entry.err = client.createTmc(ctx, addr, opt)
	})
	if entry.err != nil {
		// Drop the failed entry so a later call retries the dial instead of
		// returning the error cached by the sync.Once forever.
		client.dedicatedMu.Lock()
		delete(m, addr)
		client.dedicatedMu.Unlock()
		return nil, nil, entry.err
	}

	invalidator := func() {
		client.dedicatedMu.Lock()
		defer client.dedicatedMu.Unlock()
		if entry.tmc != nil && entry.tmc.cc != nil {
			entry.tmc.cc.Close()
		}
		delete(m, addr)
	}
	return entry.tmc.client, invalidator, nil
}
```
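If this direction holds up, the key property of the sketch above is that dedicatedMu is only held for map bookkeeping and never across network I/O: concurrent callers for the same address share a single dial through the per-entry sync.Once, while different addresses dial in parallel. We still need to verify on our cluster that this actually clears the collection backlog.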
Binary Version
Vitess v21
Operating System and Environment details
Darwin 24.5.0
arm64