
docs: add SSA patterns, error handling, and troubleshooting enhancements #88

Draft
doxxx93 wants to merge 3 commits into kube-rs:main from doxxx93:docs/patterns-ssa-errors

Conversation


@doxxx93 (Member) commented Feb 26, 2026

Summary

  • controllers/ssa.md (new): Server-Side Apply patterns — common pitfalls (missing apiVersion/kind, force misuse, unnecessary field ownership), status patching, typed SSA
  • controllers/errors.md (new): Error handling across layers (Client → Api → watcher → Controller), watcher backoff configuration, reconciler error_policy patterns, client-level retry guidance, timeout strategies
  • troubleshooting.md (enhanced): Added symptom-based diagnosis tables (infinite loop, memory growth, watch recovery, 429 throttling, finalizer deadlock, reconciler not running), debugging tools (RUST_LOG levels, tracing spans, kubectl inspection), and profiling guidance (jemalloc, tokio-console)
  • mkdocs.yml: Added new pages to navigation — SSA under Concepts (after Reconciler), Error Handling under Operational (after Observability)

Cross-references added to avoid duplication with existing pages (optimization, observability, gc).

Context

Discussed with @clux on Discord — agreed to start with practical patterns (SSA, error handling) as they come up frequently in GitHub issues/discussions.

@doxxx93 force-pushed the docs/patterns-ssa-errors branch from ba34a80 to 4479b74 on February 26, 2026 01:05

@clux (Member) left a comment


This is great. Had a quick read through and spotted a few minor things here and there. Some bits could have follow-ups elsewhere, so I left some comments, but generally this is very nice and I can see it being helpful!

Comment thread docs/controllers/errors.md Outdated
Comment on lines +114 to +117
!!! note "Current limitations"

`error_policy` is a **synchronous** function. You cannot perform async operations (sending metrics, updating status) inside it. For per-key exponential backoff, wrap the reconciler itself — see the pattern described in the [[reconciler]] documentation.

Member


yeah, this is a great callout.

Member


actually looking at this some more, there's no such pattern described in the reconciler documentation.
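To make the constraint concrete, here is a minimal sketch of a synchronous `error_policy` (the `Ctx` and `ReconcileError` types are illustrative placeholders, not from this PR):

```rust
use std::{sync::Arc, time::Duration};
use kube::runtime::controller::Action;

// Placeholder context and error types, for illustration only.
struct Ctx;
#[derive(Debug)]
struct ReconcileError;

// No `async`, no `.await`: the policy can only compute a requeue decision.
fn error_policy<K>(_obj: Arc<K>, _err: &ReconcileError, _ctx: Arc<Ctx>) -> Action {
    // Fixed delay; per-key exponential backoff would need state tracked
    // outside this function (e.g. a map held in the context).
    Action::requeue(Duration::from_secs(30))
}
```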

Comment thread docs/controllers/errors.md Outdated
Comment thread docs/troubleshooting.md Outdated

| Cause | How to verify | Solution |
|-------|--------------|----------|
| [Store] not yet initialized | Readiness probe fails | Wait for [Store::wait_until_ready] |
Member


this is probably the more advanced/unlikely one you'd only see if you're using the streams interface (we wait by default in the controller otherwise).
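For the manual-streams case, a rough sketch of where that wait fits (names like `primed_store` are illustrative, `anyhow` and `tokio` are assumed for brevity; `Controller::run` performs the equivalent wait for you):

```rust
use futures::{future, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client,
};

async fn primed_store(client: Client) -> anyhow::Result<reflector::Store<Pod>> {
    let pods: Api<Pod> = Api::default_namespaced(client);
    let (reader, writer) = reflector::store::<Pod>();
    let stream = reflector(writer, watcher(pods, watcher::Config::default()))
        .applied_objects()
        .boxed();

    // The store only fills while the stream is polled, so drive it in the background.
    tokio::spawn(stream.for_each(|_| future::ready(())));

    // The readiness gate from the table; only needed when wiring streams by hand.
    reader.wait_until_ready().await?;
    Ok(reader)
}
```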

Comment thread docs/troubleshooting.md
Comment thread docs/troubleshooting.md Outdated
Signed-off-by: doxxx93 <doxxx93@gmail.com>

@doxxx93 (Member, Author) commented Feb 26, 2026

Addressed all review feedback — thanks for the thorough review!

errors.md:

  • Added silently emphasis (L35)
  • Fixed watcher comment: "terminates stream" → "tight retry loop" (L41)
  • Added RetryPolicy v3 code example with link (L122)
  • Simplified timeout section — removed the 295s client-split advice since it's largely obsolete after kube#1945 (Remove global read_timeout default, add watcher-level idle timeout) (L147)
  • Reworded to "wrapping individual API calls" (L160)

ssa.md:

  • Added ApplyConfigurations limitation note with kube#649 link (L104)

troubleshooting.md:

  • Changed memory symptom to "higher than expected Pod memory" (L118)
  • Added "or" to clarify solutions are not mutually exclusive (L122-123)
  • Added RBAC / NetworkPolicies row to Watch Connection table (L135)
  • Removed try_join! — replaced with "batch where possible" (L148)
  • Added error_policy metrics mention for cleanup failure detection (L156)
  • Noted Store init is advanced/streams-only (L168)
  • Added NetworkPolicies row to Reconciler Not Running table (L170)
  • Added tokio-metrics as lightweight alternative to tokio-console (L258)

(LOL.. github profile mistake..)


@clux (Member) left a comment


some more comments, realized I need to check this more carefully

Comment thread docs/controllers/ssa.md Outdated

!!! note "Current limitation: no ApplyConfigurations in Rust"

Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) — fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches avoids this issue.
Member


Suggested change
Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches avoids this issue.
Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) - fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches works around this issue.
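To make the `serde_json::json!()` route concrete, a hedged sketch of a partial typed-resource apply (the HPA resource, names, and `maxReplicas` value are illustrative; `anyhow` assumed for brevity):

```rust
use k8s_openapi::api::autoscaling::v2::HorizontalPodAutoscaler;
use kube::{
    api::{Patch, PatchParams},
    Api, Client,
};
use serde_json::json;

async fn set_max_replicas(client: Client) -> anyhow::Result<()> {
    let hpas: Api<HorizontalPodAutoscaler> = Api::default_namespaced(client);
    // Only the fields this manager owns; no need to fill in the required
    // typed fields that a full HorizontalPodAutoscalerSpec would demand.
    let patch = json!({
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "spec": { "maxReplicas": 5 }
    });
    hpas.patch("my-hpa", &PatchParams::apply("my-controller"), &Patch::Apply(&patch))
        .await?;
    Ok(())
}
```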

let pod = tokio::time::timeout(
    Duration::from_secs(10),
    api.get("my-pod"),
).await??;
Member


double question mark
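For reference on why two `?`s appear here (an illustrative restatement, not from the PR): `tokio::time::timeout` wraps the call's own `Result` in a second `Result`, so the outer `?` handles the elapsed timer and the inner one the `kube::Error`:

```rust
use std::time::Duration;
use k8s_openapi::api::core::v1::Pod;
use kube::{Api, Client};

async fn get_with_deadline(client: Client) -> anyhow::Result<Pod> {
    let api: Api<Pod> = Api::default_namespaced(client);
    // Outer `?`: tokio::time::error::Elapsed. Inner `?`: kube::Error from the GET.
    let pod = tokio::time::timeout(Duration::from_secs(10), api.get("my-pod")).await??;
    Ok(pod)
}
```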


Comment thread docs/controllers/errors.md Outdated
Comment on lines +136 to +145
Not all errors are retryable:

| Error | Retryable | Reason |
|-------|-----------|--------|
| 5xx | Yes | Server-side transient failure |
| Timeout | Yes | Temporary network issue |
| 429 Too Many Requests | Yes | Rate limit — wait and retry |
| Network error | Yes | Temporary connectivity failure |
| 4xx (400, 403, 404) | No | The request itself is wrong |
| 409 Conflict | No | SSA ownership conflict — fix the logic |
Member


Because this table sits within the client-level retry section right after the RetryPolicy, it gives the impression that the RetryPolicy does the retrying for all of the errors marked as Yes, but that's not true.

Probably need to restructure this section so that it's less ambiguous.
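Independent of which layer performs the retries, a rough sketch of how the table's classification could look against `kube::Error` (a sketch only, not a complete policy, and deliberately conservative about non-API errors):

```rust
use kube::Error;

fn is_retryable(err: &Error) -> bool {
    match err {
        // Status codes reported by the API server: retry 429s and 5xx.
        Error::Api(resp) => matches!(resp.code, 429 | 500..=599),
        // Anything else (connection resets, timeouts, …) needs its own
        // classification; do not retry here to stay conservative.
        _ => false,
    }
}
```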

Comment thread docs/controllers/ssa.md
Comment on lines +67 to +73
```rust
// ✗ Uses default field manager → unintended ownership conflicts
let pp = PatchParams::default();

// ✓ Explicit field manager
let pp = PatchParams::apply("my-controller");
```
Member


field managers are required for server-side apply, so using PatchParams::default with apply should probably be validated as an error in PatchParams rather than documented here as an eternal footgun.

Member Author


Agreed — this should be a client-side validation rather than a doc-only warning. PatchParams::validate() already rejects force with non-Apply patches, but doesn't check field_manager: None with Patch::Apply. I'll open an issue on kube-rs/kube for adding this check.
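For illustration, roughly the shape such a check could take (a standalone sketch with hypothetical names, not kube's actual validation code):

```rust
// Reject server-side apply requests that carry no explicit field manager.
fn validate_apply(field_manager: Option<&str>, is_apply_patch: bool) -> Result<(), String> {
    if is_apply_patch && field_manager.map_or(true, str::is_empty) {
        return Err("server-side apply requires an explicit field manager".to_string());
    }
    Ok(())
}
```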

Comment thread docs/controllers/ssa.md Outdated
let pp = PatchParams::apply("my-controller");
```

Always specify an explicit field manager. Without one, you risk ownership collisions with other controllers or kubectl users.
Member


what is the implication of collisions if you are not using a field manager?

Comment thread docs/troubleshooting.md Outdated
jeprof --svg ./my-controller jeprof.*.heap > heap.svg
```

If `AHashMap` allocations dominate the profile, the [Store] cache is likely the bottleneck. Apply `.modify()` or switch to [metadata_watcher].

@clux (Member) Feb 26, 2026


What is the Store cache bottlenecking?

i.e. strange choice of wording. it shouldn't slow anything down, it should just allocate more.
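For reference, the `.modify()` route the quoted text mentions, as a hedged sketch (the field choices are illustrative; the point is to shrink objects before they are cached):

```rust
use futures::Stream;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, WatchStreamExt},
    Api, Client, ResourceExt,
};

// Trim each object before it reaches any downstream reflector/Store;
// managed fields and annotations are often the bulk of the allocation.
fn slim_pod_events(
    client: Client,
) -> impl Stream<Item = Result<watcher::Event<Pod>, watcher::Error>> {
    let pods: Api<Pod> = Api::default_namespaced(client);
    watcher(pods, watcher::Config::default()).modify(|pod| {
        pod.managed_fields_mut().clear();
        pod.annotations_mut().clear();
    })
}
```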

Comment thread docs/troubleshooting.md Outdated

| Cause | How to verify | Solution |
|-------|--------------|----------|
| Re-list memory spikes | Periodic spikes visible in memory graphs | Use `streaming_lists()`, or reduce `page_size` |

@clux (Member) Feb 26, 2026


in my experience, re-lists actually don't cause spikes because once maps have reserved capacity they generally do not give it back that quickly. are you seeing something else?
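For completeness, the two knobs from the quoted row, assuming a recent kube version (a sketch; the page size is arbitrary):

```rust
use kube::runtime::watcher;

fn relist_configs() -> (watcher::Config, watcher::Config) {
    // Stream the initial list (requires the cluster's WatchList feature)…
    let streaming = watcher::Config::default().streaming_lists();
    // …or keep paginated LISTs, just with smaller pages.
    let paged = watcher::Config::default().page_size(50);
    (streaming, paged)
}
```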

@doxxx93 marked this pull request as draft on February 27, 2026 01:18
Signed-off-by: doxxx93 <doxxx93@gmail.com>