Summary
We are experiencing constant throttling of vreplication flows when Vitess Tablet Throttler is enabled on a large cluster (20 replicas across 2 regions). As soon as throttling is turned on via the UpdateThrottlerConfig gRPC endpoint, vreplication stalls and never recovers. The error persists indefinitely until throttling is disabled.
reason_throttled: vplayer:mv_fix_1:vreplication is denied access due to unexpected error: metric not collected yet
Environment
- Large Vitess cluster with ~20 replicas
- Region A: 10 replicas
- Region B: 10 replicas (50–70 ms cross-region latency)
- Reference keyspace replicating into the main keyspace
- Throttler enabled via UpdateThrottlerConfig (activeCollectInterval: 250 ms)
Observed Behavior
- Once throttling is enabled, vreplication stops progressing. _vt.vreplication shows repeated errors:
reason_throttled: vplayer:mv_fix_1:vreplication is denied access due to unexpected error: metric not collected yet
- Running vtctldclient GetThrottlerStatus shows:
- Self-scope metrics are collected correctly.
- Shard-scope metrics consistently fail with metric not collected yet.
The error never clears. Disabling throttling immediately restores vreplication to normal behavior.
Expected Behavior
Throttler metric collection should complete within the 250 ms collection window (perhaps a long shot for clusters this large), or degrade gracefully without blocking vreplication. vreplication should not stall indefinitely.
Reproduction Steps
- Run a cluster with 20 replicas across 2 regions.
- Start vreplication from the reference keyspace into the main keyspace.
- Enable the throttler using UpdateThrottlerConfig.
Observe that:
- vreplication stalls,
- shard-scope metrics fail permanently,
- errors persist until throttler is fully disabled.
Suspected Root Cause (still under investigation; we have not yet confirmed whether the points below actually fix the issue)
1. Cluster size + cross-region latency exceeds collection window
- Shard-scope metric collection requires the PRIMARY to reach out to all 20 replicas every 250 ms.
- The hardcoded collection interval is `activeCollectInterval = 250 * time.Millisecond` (the PRIMARY polls replicas on this loop).
- Cross-region replicas add 50–70 ms of latency to half of these calls.
- We suspect the 250 ms loop is insufficient for 20 RPCs, half of them cross-region, creating a constant backlog and continuous “metric not collected yet” failures (see the back-of-envelope sketch below).
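Here is that sketch; the RTT values and the collect helper are assumptions standing in for the per-replica CheckThrottler RPC, not the throttler's actual code:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// collect stands in for one per-replica CheckThrottler RPC.
func collect(rtt time.Duration) { time.Sleep(rtt) }

func main() {
	const localReplicas, remoteReplicas = 10, 10
	localRTT := 2 * time.Millisecond   // same-region round trip (assumed)
	remoteRTT := 60 * time.Millisecond // cross-region round trip (50–70 ms observed)
	window := 250 * time.Millisecond   // activeCollectInterval

	// Serialized collection, e.g. when all calls funnel through one mutex.
	start := time.Now()
	for i := 0; i < localReplicas; i++ {
		collect(localRTT)
	}
	for i := 0; i < remoteReplicas; i++ {
		collect(remoteRTT)
	}
	fmt.Printf("serialized: %v (window %v)\n", time.Since(start), window) // ~620 ms, well over the window

	// Fully concurrent collection fits comfortably in the window.
	start = time.Now()
	var wg sync.WaitGroup
	for i := 0; i < localReplicas+remoteReplicas; i++ {
		rtt := localRTT
		if i >= localReplicas {
			rtt = remoteRTT
		}
		wg.Add(1)
		go func(d time.Duration) { defer wg.Done(); collect(d) }(rtt)
	}
	wg.Wait()
	fmt.Printf("concurrent: %v (window %v)\n", time.Since(start), window) // ~60 ms
}
```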
2. grpctmclient concurrency bottleneck (mutex contention)
We suspect issues with go/vt/vttablet/grpctmclient/client.go:
- A single mutex protects unrelated structures; one mutex (client.mu) guards both:
  - rpcClientMap (buffered channel pool)
  - rpcDialPoolMap (dedicated connection pool)
From vitess/go/vt/vttablet/grpctmclient/client.go (lines 107 to 117 at a586590):

```go
// grpcClient implements both dialer and poolDialer.
type grpcClient struct {
	// This cache of connections is to maximize QPS for ExecuteFetchAs{Dba,App},
	// CheckThrottler and FullStatus. Note we'll keep the clients open and close them upon Close() only.
	// But that's OK because usually the tasks that use them are one-purpose only.
	// The map is protected by the mutex.
	mu             sync.Mutex
	rpcClientMap   map[string]chan *tmc
	rpcDialPoolMap map[DialPoolGroup]addrTmcMap
}
```
Implications:
- During tablet inventory refresh or startup, dial-pool initialization can hold the mutex.
- Meanwhile, throttler metric collection calls (CheckThrottler) block on this mutex.
- These calls hit the 1 s timeout, causing repeated “metric not collected yet” errors.
- Increasing tablet_manager_grpc_concurrency (e.g., from 8 → 64) may reduce pressure but does not fix the root problem; a minimal sketch of the contention pattern follows below.
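Here is that sketch; it is simplified and self-contained, and the 2 s dial and the 1 s deadline are assumptions mirroring the behavior described above rather than the real grpctmclient code:

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// pool mimics a client whose dial cache and RPC path share one mutex.
type pool struct {
	mu sync.Mutex
}

// dialAll mimics inventory refresh or startup: it holds the mutex while doing slow network I/O.
func (p *pool) dialAll() {
	p.mu.Lock()
	defer p.mu.Unlock()
	time.Sleep(2 * time.Second) // simulated slow dials to many tablets
}

// checkThrottler mimics a metric-collection RPC that must take the same mutex.
func (p *pool) checkThrottler(ctx context.Context) error {
	done := make(chan struct{})
	go func() {
		p.mu.Lock()
		p.mu.Unlock()
		close(done)
	}()
	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err() // deadline exceeded -> "metric not collected yet"
	}
}

func main() {
	p := &pool{}
	go p.dialAll() // e.g. tablet inventory refresh grabs the mutex first
	time.Sleep(10 * time.Millisecond)

	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("CheckThrottler:", p.checkThrottler(ctx)) // context deadline exceeded
}
```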
Possible Fix:
- In go/vt/vttablet/grpctmclient/client.go, have:

```go
// poolMu protects rpcClientMap, dedicatedMu protects rpcDialPoolMap.
poolMu      sync.Mutex
dedicatedMu sync.Mutex
```

- Use poolMu for dialPool and dedicatedMu for dialDedicatedPool.
3. Remaining bottleneck after separating poolMu and dedicatedMu
Even after separating the mutexes:
- dialDedicatedPool still holds dedicatedMu while executing network I/O inside createTmc().
- When PRIMARY collects metrics from 20+ replicas, it spawns many goroutines.
- All of them try calling dialDedicatedPool, but serialization on dedicatedMu causes:
- Wait times exceeding several seconds,
- gRPC context deadline exceeded (1 s)
- Continuous “metric not collected yet” throttler errors.
- Since the collector runs every 250 ms but each cycle can take multiple seconds, this leads to a permanent backlog and never-ending throttling.
Possible fix:
- Add the following to grpctmclient/client.go:
```go
type tmcEntry struct {
	once sync.Once
	tmc  *tmc
	err  error
}

type addrTmcMap map[string]*tmcEntry
```
- Change dialDedicatedPool to:
```go
func (client *grpcClient) dialDedicatedPool(ctx context.Context, dialPoolGroup DialPoolGroup, tablet *topodatapb.Tablet) (tabletmanagerservicepb.TabletManagerClient, invalidatorFunc, error) {
	addr := netutil.JoinHostPort(tablet.Hostname, int32(tablet.PortMap["grpc"]))
	opt, err := grpcclient.SecureDialOption(cert, key, ca, crl, name)
	if err != nil {
		return nil, nil, err
	}

	client.dedicatedMu.Lock()
	if client.rpcDialPoolMap == nil {
		client.rpcDialPoolMap = make(map[DialPoolGroup]addrTmcMap)
	}
	if _, ok := client.rpcDialPoolMap[dialPoolGroup]; !ok {
		client.rpcDialPoolMap[dialPoolGroup] = make(addrTmcMap)
	}
	m := client.rpcDialPoolMap[dialPoolGroup]
	entry, ok := m[addr]
	if !ok {
		entry = &tmcEntry{}
		m[addr] = entry
	}
	// Unlock here, before the potentially slow createTmc call, so other
	// addresses can be dialed concurrently.
	client.dedicatedMu.Unlock()

	// Initialize the connection exactly once, without holding the mutex.
	entry.once.Do(func() {
		entry.tmc, entry.err = client.createTmc(ctx, addr, opt)
	})
	if entry.err != nil {
		// Drop the failed entry so a later call retries the dial instead of
		// returning the error cached by the sync.Once forever.
		client.dedicatedMu.Lock()
		delete(m, addr)
		client.dedicatedMu.Unlock()
		return nil, nil, entry.err
	}

	invalidator := func() {
		client.dedicatedMu.Lock()
		defer client.dedicatedMu.Unlock()
		if entry.tmc != nil && entry.tmc.cc != nil {
			entry.tmc.cc.Close()
		}
		delete(m, addr)
	}
	return entry.tmc.client, invalidator, nil
}
```
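If this direction holds up, the key property of the sketch above is that dedicatedMu is only held for map bookkeeping and never across network I/O: concurrent callers for the same address share a single dial through the per-entry sync.Once, while different addresses dial in parallel. We still need to verify on our cluster that this actually clears the collection backlog.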
Binary Version
Vitess v21
Operating System and Environment details
Darwin 24.5.0
arm64