
coild: SetupIPAM serializes all CNI Add under a global mutex, causing cluster-wide pod-creation stalls (~120 s) on newly-added nodes #368

@eumel8

Description


Version: coil v2.13.0 (image ghcr.io/cybozu-go/coil:2.13.0)
Kubernetes: v1.34.4
OS/Kernel: Ubuntu 24.04.4, kernel 6.8.0-100
Runtime: containerd 2.2.1
CNI chain: coil → coil-router (compat-calico enabled)
Flags: --enable-originating-only=true --compat-calico (IPv4-only; no egress tunnels in use)

Summary

On a small number of newly-joined worker nodes, coild enters a state in which every CNI Add request serializes for ~120 seconds, producing DeadlineExceeded errors at the CNI client (containerd) and Internal: failed to setup pod network IPAM at the coild gRPC server. The node stays in this state indefinitely (observed >15 h). Deleting the coild pod on that node immediately restores normal sub-second Add latency.

The cluster-wide impact is severe: with two degraded nodes out of 13 we had 144 ContainerCreating + 39 Init:0/1 pods queued, blocking deployments and autoscaling.

Evidence

Captured on coild-c8jrc (node tenant-1-default-6q2xx-8j65t) after ~16 h in the stalled state. A second coild (coild-9dgrs) on another node shows the same pattern; all other 11 coilds on the same image/OS/kernel are healthy.

gRPC metrics (process lifetime ~16 h)

grpc_server_started_total{grpc_method="Add"}                    60059
grpc_server_handled_total{grpc_method="Add", grpc_code="OK"}        17863   (30 %)
grpc_server_handled_total{grpc_method="Add", grpc_code="Internal"}  42099   (70 %)
grpc_server_handled_total{grpc_method="Add", grpc_code="DeadlineExceeded"}  0

go_goroutines                                562
process_resident_memory_bytes         195_862_528  (~187 MiB)

~70 % of Add RPCs fail server-side with Internal; the client (containerd) cancels at its own ~120 s deadline, which is why DeadlineExceeded is 0 at the server.

Add latency distribution

From the most recent 5 000 log lines (492 successful + 869 failed Add calls):

              n    min    p50    p95    max   mean
Add OK      492  114 s  122 s  123 s  124 s  122 s
Add Failed  869  120 s  121 s  123 s  124 s  121 s

Even successful Adds take ~120 s. The system is fully serialized. There are no sub-second successes.

Log signature

{"level":"info","msg":"finished call",
 "grpc.method":"Add",
 "grpc.code":"Internal",
 "grpc.error":"rpc error: code = Internal desc = failed to setup pod network IPAM",
 "grpc.time_ms":121142.97, ...}

The error is produced at v2/runners/coild_server.go:253:

result, err = s.podNet.SetupIPAM(args.Netns, pod.Name, pod.Namespace, config)
if err != nil {
    ...
    return nil, newInternalError(err, "failed to setup pod network IPAM")
}

So the failure point is the pod network setup (veth/netns), not IPAM allocation (which would produce "failed to allocate address").

Goroutines

go_goroutines = 562 vs. ~50–100 on healthy coilds. Strong indicator of a large queue of gRPC handlers blocked on a shared lock. Unfortunately coild exposes no /debug/pprof endpoint (see feature request below), so a goroutine dump could not be taken non-destructively.
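
As a stopgap for future diagnosis, an opt-in pprof listener is a small addition. A minimal sketch, assuming a standalone flag (the --pprof-addr name comes from fix 4 below and is not an existing coild flag):

package main

import (
    "flag"
    "log"
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    pprofAddr := flag.String("pprof-addr", "", "pprof listen address; empty disables")
    flag.Parse()
    if *pprofAddr != "" {
        go func() {
            // Bind to localhost (e.g. 127.0.0.1:9386) so the endpoint is
            // reachable only via kubectl exec or port-forward.
            log.Println(http.ListenAndServe(*pprofAddr, nil))
        }()
    }
    select {} // stand-in for the rest of coild's startup
}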

Root-cause hypothesis

pkg/nodenet/pod.go:199–201 takes a process-wide mutex (pn.mu) for the entire duration of a pod's network setup:

func (pn *podNetwork) SetupIPAM(nsPath, podName, podNS string, conf *PodNetConf) (*current.Result, error) {
    pn.mu.Lock()
    defer pn.mu.Unlock()
    ...
}

The critical section holds the lock across:

  • ns.GetNS(nsPath) — open pod netns
  • lookup(...) + optional netlink.LinkDel — garbage-veth cleanup
  • containerNS.Do(...) — enter pod netns, SetupVethWithName, LinkSetUp, AddrAdd (v4+v6)
  • ip.SettleAddresses(conf.IFace, 10*time.Second) — up to 10 s for IPv6 DAD
  • host-side LinkByName, routing/sysctl, and a later ip.SettleAddresses(hName, 10*time.Second) — up to another 10 s

Any single slow netlink/netns operation (kernel netlink congestion, udev delay, a slow SettleAddresses) serializes every subsequent CNI Add on that node behind it. Once throughput falls below the arrival rate (kubelet sandbox creation during a node add), the queue grows; Adds spend more time waiting for the lock than on actual work; the client 120 s deadline fires; kubelet retries; the queue never drains.
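
To make the queueing arithmetic concrete: at ~1.5 s of lock-hold time per pod, the 80th queued Add waits roughly 80 × 1.5 s ≈ 120 s before its own work even begins. A toy simulation of this effect (illustrative only, not coil code; constants scaled down 100× so it finishes in about a second):

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    const n = 80                        // pods scheduled at once
    const hold = 15 * time.Millisecond  // per-pod work, scaled down from ~1.5 s
    var (
        mu, worstMu sync.Mutex
        wg          sync.WaitGroup
        worst       time.Duration
    )
    start := time.Now()
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            t0 := time.Now()
            mu.Lock() // every Add queues here, like pn.mu
            wait := time.Since(t0)
            time.Sleep(hold) // stand-in for netns/veth/SettleAddresses work
            mu.Unlock()
            worstMu.Lock()
            if wait > worst {
                worst = wait
            }
            worstMu.Unlock()
        }()
    }
    wg.Wait()
    fmt.Printf("elapsed=%v worst-queue-wait=%v\n",
        time.Since(start).Round(time.Millisecond), worst.Round(time.Millisecond))
}

The worst queue wait comes out at roughly n × hold — the same shape as the observed ~120 s floor, where waiting dominates actual work.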

Observations that support this hypothesis:

  1. Every Add, including successes, takes ~120 s — consistent with serialized access under heavy queueing, not with individual slow ops.
  2. go_goroutines ≈ 562 — many gRPC handlers parked on the mutex.
  3. The condition is per-node and sticky; restarting coild on that node instantly restores normal operation (nothing in IPAM state or on the node is corrupted).
  4. Triggered by bulk pod creation on freshly-added nodes (Kubernetes update wave: nodes added, then kubelet schedules ~80 pods in seconds).
  5. The code path and lock are unchanged at least from v2.11.0 through main (same pn.mu.Lock() at SetupIPAM, DestroyIPAM, and SetupEgress).


Reproduction

Observed (not deliberately reproduced) on:

  • Kubernetes 1.34.4 worker joining a cluster and receiving ~80 pending pods simultaneously.
  • coil v2.13.0 with --compat-calico and --enable-originating-only=true, IPv4-only.
  • After the first stall, the node never recovers without a coild pod restart.

Minimal synthetic reproduction (not yet run, but should suffice):

  1. Cordon a node, drain it, uncordon; schedule ~100 pods onto it at once.
  2. Observe grpc_server_handled_total{grpc_method="Add", grpc_code="Internal"} climbing continuously for the remaining lifetime of the process, and go_goroutines rising with it.
  3. CNI Adds in kubelet logs: plugin type="coil" failed (add): stream terminated by RST_STREAM with error code: CANCEL; rpc error: code = DeadlineExceeded.

Suggested fixes (in priority order)

  1. Parallelize per-pod setup. The current global mutex is only strictly needed for operations on shared host-side resources (e.g. the routing table, FDB); per-pod netns/veth work is independent and should not block other pods. A sketch follows this list.
    • Replace the single sync.Mutex on podNetwork with a much smaller critical section around genuinely shared host-side state, or
    • Use a bounded pool of workers (e.g. 8–16) so that one slow netlink call cannot starve the whole node.
  2. Add a watchdog / lock-wait metric. Expose coild_setup_mutex_wait_seconds and coild_setup_in_flight histograms; page operators when the lock wait exceeds e.g. 5 s.
  3. Shorter, configurable SettleAddresses timeout. 10 s × 2 per pod with IPv6 can hold the global lock for up to 20 s in the bad case. IPv4-only deployments should skip it entirely (appears already bypassed when conf.IPv6 == nil, but the host-side SettleAddresses(hName, 10*time.Second) runs unconditionally — please double-check).
  4. Expose net/http/pprof (opt-in) in coild. Currently coild exposes only /metrics (9384) and /healthz|/readyz (9385), so a stuck-goroutine situation like this one is diagnosable only by killing the process (SIGQUIT), which destroys the evidence. A flag-gated pprof endpoint (e.g. --pprof-addr=127.0.0.1:9386; sketched under Goroutines above) would dramatically improve operators' ability to produce the exact stack traces this project needs from users.
  5. Emit a log/metric when an Add takes longer than, say, 10 s. This gives early warning before the node has silently queued 500 requests.
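
For fixes 1 and 2, a minimal sketch of the shape such a change could take, assuming per-pod netns/veth work is truly independent. setupSem, hostMu, the added ctx parameter, and the coild_setup_wait_seconds metric are all hypothetical names for illustration; the real SetupIPAM signature and return values (quoted above) are elided:

package nodenet

import (
    "context"
    "sync"
    "time"

    "github.com/prometheus/client_golang/prometheus"
    "golang.org/x/sync/semaphore"
)

// Hypothetical metric (fix 2): how long an Add waited before setup began.
var setupWait = prometheus.NewHistogram(prometheus.HistogramOpts{
    Name:    "coild_setup_wait_seconds",
    Help:    "Time an Add spent queued before per-pod setup started.",
    Buckets: prometheus.ExponentialBuckets(0.01, 2, 14), // 10 ms .. ~80 s
})

func init() { prometheus.MustRegister(setupWait) }

type podNetwork struct {
    setupSem *semaphore.Weighted // e.g. semaphore.NewWeighted(16): bounded pool
    hostMu   sync.Mutex          // only for shared host-side state (routes, FDB)
}

func (pn *podNetwork) SetupIPAM(ctx context.Context, nsPath, podName, podNS string) error {
    t0 := time.Now()
    // Bounded admission instead of one global lock: a single slow
    // netlink call now delays at most one of N slots, not every Add.
    if err := pn.setupSem.Acquire(ctx, 1); err != nil {
        return err // canceled while queued (client deadline)
    }
    defer pn.setupSem.Release(1)
    setupWait.Observe(time.Since(t0).Seconds())

    // Per-pod work (ns.GetNS, SetupVethWithName, AddrAdd,
    // SettleAddresses, ...) runs here with no process-wide lock.

    pn.hostMu.Lock() // narrow critical section for genuinely shared state
    // host-side routing table / FDB updates
    pn.hostMu.Unlock()
    return nil
}

The semaphore weight doubles as the bounded worker-pool size from the second bullet of fix 1, and observing the same histogram above a threshold gives the early-warning signal of fix 2.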

Workaround

Deleting the affected coild pod (kubectl delete pod coild-<x> -n kube-system) restores normal Add latency immediately; in our environment the replacement coild processed the backlog in under a minute. No state loss was observed.

Environment details

  • 13 worker nodes, same base image and coil v2.13.0 — only 2 of 13 were affected, both freshly joined during a Kubernetes upgrade wave.
  • coild args: --zap-stacktrace-level=panic --enable-originating-only=true --compat-calico
  • No IPAM exhaustion; AddressPool/AddressBlock CRs healthy; controller leader stable.
  • Error on kubelet side (example):
    networkPlugin cni failed: plugin type="coil" failed (add): stream terminated by RST_STREAM with error code: CANCEL; rpc error: code = DeadlineExceeded
