
docs: add SSA patterns, error handling, and troubleshooting enhancements #88

Draft
doxxx93 wants to merge 3 commits into kube-rs:main from doxxx93:docs/patterns-ssa-errors

Conversation


@doxxx93 (Member) commented Feb 26, 2026

Summary

  • controllers/ssa.md (new): Server-Side Apply patterns — common pitfalls (missing apiVersion/kind, force misuse, unnecessary field ownership), status patching, typed SSA
  • controllers/errors.md (new): Error handling across layers (Client → Api → watcher → Controller), watcher backoff configuration, reconciler error_policy patterns, client-level retry guidance, timeout strategies
  • troubleshooting.md (enhanced): Added symptom-based diagnosis tables (infinite loop, memory growth, watch recovery, 429 throttling, finalizer deadlock, reconciler not running), debugging tools (RUST_LOG levels, tracing spans, kubectl inspection), and profiling guidance (jemalloc, tokio-console)
  • mkdocs.yml: Added new pages to navigation — SSA under Concepts (after Reconciler), Error Handling under Operational (after Observability)

Cross-references added to avoid duplication with existing pages (optimization, observability, gc).

Context

Discussed with @clux on Discord — agreed to start with practical patterns (SSA, error handling) as they come up frequently in GitHub issues/discussions.

@doxxx93 force-pushed the docs/patterns-ssa-errors branch from ba34a80 to 4479b74 on February 26, 2026 01:05

@clux (Member) left a comment


This is great. Had a quick read through and spotted a few minor things here and there. Some bits could have follow-ups elsewhere, so I left some comments, but generally this is very nice and I can see it being helpful!

Comment thread docs/controllers/errors.md Outdated
Comment on lines +114 to +117
!!! note "Current limitations"

`error_policy` is a **synchronous** function. You cannot perform async operations (sending metrics, updating status) inside it. For per-key exponential backoff, wrap the reconciler itself — see the pattern described in the [[reconciler]] documentation.

Member


yeah, this is a great callout.

Member


actually looking at this some more, there's no such pattern described in the reconciler documentation.
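To make the constraint concrete, here is a minimal sketch of a synchronous `error_policy` (the `Ctx` and `ReconcileError` types are illustrative placeholders, not from this PR):

```rust
use std::{sync::Arc, time::Duration};
use kube::runtime::controller::Action;

// Placeholder context and error types, for illustration only.
struct Ctx;
#[derive(Debug)]
struct ReconcileError;

// No `async`, no `.await`: the policy can only compute a requeue decision.
fn error_policy<K>(_obj: Arc<K>, _err: &ReconcileError, _ctx: Arc<Ctx>) -> Action {
    // Fixed delay; per-key exponential backoff would need state tracked
    // outside this function (e.g. a map held in the context).
    Action::requeue(Duration::from_secs(30))
}
```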

Comment thread docs/controllers/errors.md Outdated
Comment thread docs/troubleshooting.md Outdated

| Cause | How to verify | Solution |
|-------|--------------|----------|
| [Store] not yet initialized | Readiness probe fails | Wait for [Store::wait_until_ready] |
Member


this is probably the more advanced/unlikely one you'd only see if you're using the streams interface (we wait by default in the controller otherwise).
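For the manual-streams case, a rough sketch of where that wait fits (names like `primed_store` are illustrative, `anyhow` and `tokio` are assumed for brevity; `Controller::run` performs the equivalent wait for you):

```rust
use futures::{future, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client,
};

async fn primed_store(client: Client) -> anyhow::Result<reflector::Store<Pod>> {
    let pods: Api<Pod> = Api::default_namespaced(client);
    let (reader, writer) = reflector::store::<Pod>();
    let stream = reflector(writer, watcher(pods, watcher::Config::default()))
        .applied_objects()
        .boxed();

    // The store only fills while the stream is polled, so drive it in the background.
    tokio::spawn(stream.for_each(|_| future::ready(())));

    // The readiness gate from the table; only needed when wiring streams by hand.
    reader.wait_until_ready().await?;
    Ok(reader)
}
```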

Comment thread docs/troubleshooting.md
Comment thread docs/troubleshooting.md Outdated
Signed-off-by: doxxx93 <doxxx93@gmail.com>

@doxxx93 (Member, Author) commented Feb 26, 2026

Addressed all review feedback — thanks for the thorough review!

errors.md:

  • Added silently emphasis (L35)
  • Fixed watcher comment: "terminates stream" → "tight retry loop" (L41)
  • Added RetryPolicy v3 code example with link (L122)
  • Simplified timeout section — removed the 295s client-split advice since it's largely obsolete after kube#1945 (Remove global read_timeout default, add watcher-level idle timeout) (L147)
  • Reworded to "wrapping individual API calls" (L160)

ssa.md:

  • Added ApplyConfigurations limitation note with kube#649 link (L104)

troubleshooting.md:

  • Changed memory symptom to "higher than expected Pod memory" (L118)
  • Added "or" to clarify solutions are not mutually exclusive (L122-123)
  • Added RBAC / NetworkPolicies row to Watch Connection table (L135)
  • Removed try_join! — replaced with "batch where possible" (L148)
  • Added error_policy metrics mention for cleanup failure detection (L156)
  • Noted Store init is advanced/streams-only (L168)
  • Added NetworkPolicies row to Reconciler Not Running table (L170)
  • Added tokio-metrics as lightweight alternative to tokio-console (L258)

(LOL.. github profile mistake..)


@clux (Member) left a comment


some more comments, realized I need to check this more carefully

Comment thread docs/controllers/ssa.md Outdated

!!! note "Current limitation: no ApplyConfigurations in Rust"

Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) — fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches avoids this issue.
Member


Suggested change
Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches avoids this issue.
Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) - fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches works around this issue.
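To make the `serde_json::json!()` route concrete, a hedged sketch of a partial typed-resource apply (the HPA resource, names, and `maxReplicas` value are illustrative; `anyhow` assumed for brevity):

```rust
use k8s_openapi::api::autoscaling::v2::HorizontalPodAutoscaler;
use kube::{
    api::{Patch, PatchParams},
    Api, Client,
};
use serde_json::json;

async fn set_max_replicas(client: Client) -> anyhow::Result<()> {
    let hpas: Api<HorizontalPodAutoscaler> = Api::default_namespaced(client);
    // Only the fields this manager owns; no need to fill in the required
    // typed fields that a full HorizontalPodAutoscalerSpec would demand.
    let patch = json!({
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "spec": { "maxReplicas": 5 }
    });
    hpas.patch("my-hpa", &PatchParams::apply("my-controller"), &Patch::Apply(&patch))
        .await?;
    Ok(())
}
```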

let pod = tokio::time::timeout(
    Duration::from_secs(10),
    api.get("my-pod"),
).await??;
Member


double question mark
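For reference on why two `?`s appear here (an illustrative restatement, not from the PR): `tokio::time::timeout` wraps the call's own `Result` in a second `Result`, so the outer `?` handles the elapsed timer and the inner one the `kube::Error`:

```rust
use std::time::Duration;
use k8s_openapi::api::core::v1::Pod;
use kube::{Api, Client};

async fn get_with_deadline(client: Client) -> anyhow::Result<Pod> {
    let api: Api<Pod> = Api::default_namespaced(client);
    // Outer `?`: tokio::time::error::Elapsed. Inner `?`: kube::Error from the GET.
    let pod = tokio::time::timeout(Duration::from_secs(10), api.get("my-pod")).await??;
    Ok(pod)
}
```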


Comment thread docs/controllers/errors.md Outdated
Comment on lines +136 to +145
Not all errors are retryable:

| Error | Retryable | Reason |
|-------|-----------|--------|
| 5xx | Yes | Server-side transient failure |
| Timeout | Yes | Temporary network issue |
| 429 Too Many Requests | Yes | Rate limit — wait and retry |
| Network error | Yes | Temporary connectivity failure |
| 4xx (400, 403, 404) | No | The request itself is wrong |
| 409 Conflict | No | SSA ownership conflict — fix the logic |
Member


Because this table sits within the client-level retry section right after the RetryPolicy, it gives the impression that the RetryPolicy does the retrying for all of the errors marked as Yes, but that's not true.

Probably need to restructure this section so that it's less ambiguous.
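Independent of which layer performs the retries, a rough sketch of how the table's classification could look against `kube::Error` (a sketch only, not a complete policy, and deliberately conservative about non-API errors):

```rust
use kube::Error;

fn is_retryable(err: &Error) -> bool {
    match err {
        // Status codes reported by the API server: retry 429s and 5xx.
        Error::Api(resp) => matches!(resp.code, 429 | 500..=599),
        // Anything else (connection resets, timeouts, …) needs its own
        // classification; do not retry here to stay conservative.
        _ => false,
    }
}
```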

Comment thread docs/controllers/ssa.md
Comment on lines +67 to +73
```rust
// ✗ Uses default field manager → unintended ownership conflicts
let pp = PatchParams::default();

// ✓ Explicit field manager
let pp = PatchParams::apply("my-controller");
```
Member


field managers are required for server-side apply, so using PatchParams::default with apply should probably be validated as an error in PatchParams rather than documented here as an eternal footgun.

Member Author


Agreed — this should be a client-side validation rather than a doc-only warning. PatchParams::validate() already rejects force with non-Apply patches, but doesn't check field_manager: None with Patch::Apply. I'll open an issue on kube-rs/kube for adding this check.
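For illustration, roughly the shape such a check could take (a standalone sketch with hypothetical names, not kube's actual validation code):

```rust
// Reject server-side apply requests that carry no explicit field manager.
fn validate_apply(field_manager: Option<&str>, is_apply_patch: bool) -> Result<(), String> {
    if is_apply_patch && field_manager.map_or(true, str::is_empty) {
        return Err("server-side apply requires an explicit field manager".to_string());
    }
    Ok(())
}
```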

Comment thread docs/controllers/ssa.md Outdated
let pp = PatchParams::apply("my-controller");
```

Always specify an explicit field manager. Without one, you risk ownership collisions with other controllers or kubectl users.
Member


what is the implication of collisions if you are not using a field manager?

Comment thread docs/troubleshooting.md Outdated
jeprof --svg ./my-controller jeprof.*.heap > heap.svg
```

If `AHashMap` allocations dominate the profile, the [Store] cache is likely the bottleneck. Apply `.modify()` or switch to [metadata_watcher].

@clux (Member) Feb 26, 2026


What is the Store cache bottlenecking?

i.e. strange choice of wording. it shouldn't slow anything down, it should just allocate more.
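For reference, the `.modify()` route the quoted text mentions, as a hedged sketch (the field choices are illustrative; the point is to shrink objects before they are cached):

```rust
use futures::Stream;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{watcher, WatchStreamExt},
    Api, Client, ResourceExt,
};

// Trim each object before it reaches any downstream reflector/Store;
// managed fields and annotations are often the bulk of the allocation.
fn slim_pod_events(
    client: Client,
) -> impl Stream<Item = Result<watcher::Event<Pod>, watcher::Error>> {
    let pods: Api<Pod> = Api::default_namespaced(client);
    watcher(pods, watcher::Config::default()).modify(|pod| {
        pod.managed_fields_mut().clear();
        pod.annotations_mut().clear();
    })
}
```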

Comment thread docs/troubleshooting.md Outdated

| Cause | How to verify | Solution |
|-------|--------------|----------|
| Re-list memory spikes | Periodic spikes visible in memory graphs | Use `streaming_lists()`, or reduce `page_size` |

@clux (Member) Feb 26, 2026


in my experience, re-lists actually don't cause spikes because once maps have reserved capacity they generally do not give it back that quickly. are you seeing something else?
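For completeness, the two knobs from the quoted row, assuming a recent kube version (a sketch; the page size is arbitrary):

```rust
use kube::runtime::watcher;

fn relist_configs() -> (watcher::Config, watcher::Config) {
    // Stream the initial list (requires the cluster's WatchList feature)…
    let streaming = watcher::Config::default().streaming_lists();
    // …or keep paginated LISTs, just with smaller pages.
    let paged = watcher::Config::default().page_size(50);
    (streaming, paged)
}
```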

@doxxx93 marked this pull request as draft on February 27, 2026 01:18
Signed-off-by: doxxx93 <doxxx93@gmail.com>