
coild: SetupIPAM serializes all CNI Add under a global mutex, causing cluster-wide pod-creation stalls (~120 s) on newly-added nodes #368

@eumel8

Description


Version: coil v2.13.0 (image ghcr.io/cybozu-go/coil:2.13.0)
Kubernetes: v1.34.4
OS/Kernel: Ubuntu 24.04.4, kernel 6.8.0-100
Runtime: containerd 2.2.1
CNI chain: coil → coil-router (compat-calico enabled)
Flags: --enable-originating-only=true --compat-calico (IPv4-only; no egress tunnels in use)

Summary

On a small number of newly-joined worker nodes, coild enters a state in which every CNI Add request serializes for ~120 seconds, producing DeadlineExceeded errors at the CNI client (containerd) and Internal: failed to setup pod network IPAM at the coild gRPC server. The node stays in this state indefinitely (observed >15 h). Deleting the coild pod on that node immediately restores normal sub-second Add latency.

The cluster-wide impact is severe: with two degraded nodes out of 13 we had 144 ContainerCreating + 39 Init:0/1 pods queued, blocking deployments and autoscaling.

Evidence

Captured on coild-c8jrc (node tenant-1-default-6q2xx-8j65t) after ~16 h in the stalled state. A second coild (coild-9dgrs) on another node shows the same pattern; all other 11 coilds on the same image/OS/kernel are healthy.

gRPC metrics (process lifetime ~16 h)

grpc_server_started_total{grpc_method="Add"}                    60059
grpc_server_handled_total{grpc_method="Add", grpc_code="OK"}        17863   (30 %)
grpc_server_handled_total{grpc_method="Add", grpc_code="Internal"}  42099   (70 %)
grpc_server_handled_total{grpc_method="Add", grpc_code="DeadlineExceeded"}  0

go_goroutines                                562
process_resident_memory_bytes         195_862_528  (~187 MiB)

~70 % of Add RPCs fail server-side with Internal; the client (containerd) cancels at its own ~120 s deadline, which is why DeadlineExceeded is 0 at the server.

Add latency distribution

From the most recent 5 000 log lines (492 successful + 869 failed Add calls):

              n    min    p50    p95    max   mean
Add OK      492  114 s  122 s  123 s  124 s  122 s
Add Failed  869  120 s  121 s  123 s  124 s  121 s

Even successful Adds take ~120 s. The system is fully serialized. There are no sub-second successes.

Log signature

{"level":"info","msg":"finished call",
 "grpc.method":"Add",
 "grpc.code":"Internal",
 "grpc.error":"rpc error: code = Internal desc = failed to setup pod network IPAM",
 "grpc.time_ms":121142.97, ...}

The error is produced at v2/runners/coild_server.go:253:

result, err = s.podNet.SetupIPAM(args.Netns, pod.Name, pod.Namespace, config)
if err != nil {
    ...
    return nil, newInternalError(err, "failed to setup pod network IPAM")
}

So the failure point is the pod network setup (veth/netns), not IPAM allocation (which would produce "failed to allocate address").

Goroutines

go_goroutines = 562 vs. ~50–100 on healthy coilds. Strong indicator of a large queue of gRPC handlers blocked on a shared lock. Unfortunately coild exposes no /debug/pprof endpoint (see feature request below), so a goroutine dump could not be taken non-destructively.
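
As a stopgap for future diagnosis, an opt-in pprof listener is a small addition. A minimal sketch, assuming a standalone flag (the --pprof-addr name comes from fix 4 below and is not an existing coild flag):

package main

import (
    "flag"
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    pprofAddr := flag.String("pprof-addr", "", "pprof listen address; empty disables")
    flag.Parse()
    if *pprofAddr != "" {
        go func() {
            // Bind to localhost (e.g. 127.0.0.1:9386) so the endpoint is
            // reachable only via kubectl exec or port-forward.
            log.Println(http.ListenAndServe(*pprofAddr, nil))
        }()
    }
    select {} // stand-in for the rest of coild's startup
}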

Root-cause hypothesis

pkg/nodenet/pod.go:199–201 takes a process-wide mutex (pn.mu) for the entire duration of a pod's network setup:

func (pn *podNetwork) SetupIPAM(nsPath, podName, podNS string, conf *PodNetConf) (*current.Result, error) {
    pn.mu.Lock()
    defer pn.mu.Unlock()
    ...
}

The critical section holds the lock across:

  • ns.GetNS(nsPath) — open pod netns
  • lookup(...) + optional netlink.LinkDel — garbage-veth cleanup
  • containerNS.Do(...) — enter pod netns, SetupVethWithName, LinkSetUp, AddrAdd (v4+v6)
  • ip.SettleAddresses(conf.IFace, 10*time.Second) — up to 10 s for IPv6 DAD
  • host-side LinkByName, routing/sysctl, and a later ip.SettleAddresses(hName, 10*time.Second) — up to another 10 s

Any single slow netlink/netns operation (kernel netlink congestion, udev delay, a slow SettleAddresses) serializes every subsequent CNI Add on that node behind it. Once throughput falls below the arrival rate (kubelet sandbox creation during a node add), the queue grows; Adds spend more time waiting for the lock than on actual work; the client 120 s deadline fires; kubelet retries; the queue never drains.
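
To make the queueing arithmetic concrete: at ~1.5 s of lock-hold time per pod, the 80th queued Add waits roughly 80 × 1.5 s ≈ 120 s before its own work even begins. A toy simulation of this effect (illustrative only, not coil code; constants scaled down 100× so it finishes in about a second):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const n = 80                        // pods scheduled at once
    const hold = 15 * time.Millisecond  // per-pod work, scaled down from ~1.5 s
    var (
        mu, worstMu sync.Mutex
        wg          sync.WaitGroup
        worst       time.Duration
    )
    start := time.Now()
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            t0 := time.Now()
            mu.Lock() // every Add queues here, like pn.mu
            wait := time.Since(t0)
            time.Sleep(hold) // stand-in for netns/veth/SettleAddresses work
            mu.Unlock()
            worstMu.Lock()
            if wait > worst {
                worst = wait
            }
            worstMu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Printf("elapsed=%v worst-queue-wait=%v\n",
        time.Since(start).Round(time.Millisecond), worst.Round(time.Millisecond))
}

The worst queue wait comes out at roughly n × hold — the same shape as the observed ~120 s floor, where waiting dominates actual work.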

Observations that support this hypothesis:

  1. Every Add, including successes, takes ~120 s — consistent with serialized access under heavy queueing, not with individual slow ops.
  2. go_goroutines ≈ 562 — many gRPC handlers parked on the mutex.
  3. The condition is per-node and sticky; restarting coild on that node instantly restores normal operation (nothing in IPAM state or on the node is corrupted).
  4. Triggered by bulk pod creation on freshly-added nodes (Kubernetes update wave: nodes added, then kubelet schedules ~80 pods in seconds).
  5. The code path and lock are unchanged at least from v2.11.0 through main (same pn.mu.Lock() at SetupIPAM, DestroyIPAM, and SetupEgress).


Reproduction

Observed (not deliberately reproduced) on:

  • Kubernetes 1.34.4 worker joining a cluster and receiving ~80 pending pods simultaneously.
  • coil v2.13.0 with --compat-calico and --enable-originating-only=true, IPv4-only.
  • After the first stall, the node never recovers without a coild pod restart.

Minimal synthetic reproduction (not yet run, but should suffice):

  1. Cordon a node, drain it, uncordon; schedule ~100 pods onto it at once.
  2. Observe grpc_server_handled_total{grpc_method="Add", grpc_code="Internal"} climbing continuously for the remaining lifetime of the process, and go_goroutines rising with it.
  3. CNI Adds in kubelet logs: plugin type="coil" failed (add): stream terminated by RST_STREAM with error code: CANCEL; rpc error: code = DeadlineExceeded.

Suggested fixes (in priority order)

  1. Parallelize per-pod setup. The current global mutex is only strictly needed for operations on shared host-side resources (e.g. the routing table, FDB); per-pod netns/veth work is independent and should not block other pods. A sketch follows this list.
    • Replace the single sync.Mutex on podNetwork with a much smaller critical section around genuinely shared host-side state, or
    • Use a bounded pool of workers (e.g. 8–16) so that one slow netlink call cannot starve the whole node.
  2. Add a watchdog / lock-wait metric. Expose coild_setup_mutex_wait_seconds and coild_setup_in_flight histograms; page operators when the lock wait exceeds e.g. 5 s.
  3. Shorter, configurable SettleAddresses timeout. 10 s × 2 per pod with IPv6 can hold the global lock for up to 20 s in the bad case. IPv4-only deployments should skip it entirely (appears already bypassed when conf.IPv6 == nil, but the host-side SettleAddresses(hName, 10*time.Second) runs unconditionally — please double-check).
  4. Expose net/http/pprof (opt-in) in coild. Currently coild exposes only /metrics (9384) and /healthz|/readyz (9385), so a stuck-goroutine situation like this one is diagnosable only by killing the process (SIGQUIT), which destroys the evidence. A flag-gated pprof endpoint (e.g. --pprof-addr=127.0.0.1:9386; sketched under Goroutines above) would dramatically improve operators' ability to produce the exact stack traces this project needs from users.
  5. Emit a log/metric when an Add takes longer than, say, 10 s. This gives early warning before the node has silently queued 500 requests.
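
For fixes 1 and 2, a minimal sketch of the shape such a change could take, assuming per-pod netns/veth work is truly independent. setupSem, hostMu, the added ctx parameter, and the coild_setup_wait_seconds metric are all hypothetical names for illustration; the real SetupIPAM signature and return values (quoted above) are elided:

package nodenet

import (
    "context"
    "sync"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "golang.org/x/sync/semaphore"
)

// Hypothetical metric (fix 2): how long an Add waited before setup began.
var setupWait = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "coild_setup_wait_seconds",
    Help:    "Time an Add spent queued before per-pod setup started.",
    Buckets: prometheus.ExponentialBuckets(0.01, 2, 14), // 10 ms .. ~80 s
})

func init() { prometheus.MustRegister(setupWait) }

type podNetwork struct {
    setupSem *semaphore.Weighted // e.g. semaphore.NewWeighted(16): bounded pool
    hostMu   sync.Mutex          // only for shared host-side state (routes, FDB)
}

func (pn *podNetwork) SetupIPAM(ctx context.Context, nsPath, podName, podNS string) error {
    t0 := time.Now()
    // Bounded admission instead of one global lock: a single slow
    // netlink call now delays at most one of N slots, not every Add.
    if err := pn.setupSem.Acquire(ctx, 1); err != nil {
        return err // canceled while queued (client deadline)
    }
    defer pn.setupSem.Release(1)
    setupWait.Observe(time.Since(t0).Seconds())

    // Per-pod work (ns.GetNS, SetupVethWithName, AddrAdd,
    // SettleAddresses, ...) runs here with no process-wide lock.

    pn.hostMu.Lock() // narrow critical section for genuinely shared state
    // host-side routing table / FDB updates
    pn.hostMu.Unlock()
    return nil
}

The semaphore weight doubles as the bounded worker-pool size from the second bullet of fix 1, and observing the same histogram above a threshold gives the early-warning signal of fix 2.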

Workaround

Deleting the affected coild pod (kubectl delete pod coild-<x> -n kube-system) restores normal Add latency immediately; in our environment the replacement coild processed the backlog in under a minute. No state loss was observed.

Environment details

  • 13 worker nodes, same base image and coil v2.13.0 — only 2 of 13 were affected, both freshly joined during a Kubernetes upgrade wave.
  • coild args: --zap-stacktrace-level=panic --enable-originating-only=true --compat-calico
  • No IPAM exhaustion; AddressPool/AddressBlock CRs healthy; controller leader stable.
  • Error on kubelet side (example):
    networkPlugin cni failed: plugin type="coil" failed (add): stream terminated by RST_STREAM with error code: CANCEL; rpc error: code = DeadlineExceeded
