docs: add SSA patterns, error handling, and troubleshooting enhancements #88
doxxx93 wants to merge 3 commits into kube-rs:main
Conversation
Signed-off-by: doxxx93 <doxxx93@gmail.com>
doxxx93 force-pushed from ba34a80 to 4479b74
clux
left a comment
This is great. Had a quick read through and spotted a few minor things here and there. Some bits could have follow-ups elsewhere so left some comments, but generally this is very nice and i can see it being helpful!
> !!! note "Current limitations"
>
>     `error_policy` is a **synchronous** function. You cannot perform async operations (sending metrics, updating status) inside it. For per-key exponential backoff, wrap the reconciler itself — see the pattern described in the [[reconciler]] documentation.
actually looking at this some more, there's no such pattern described in the reconciler documentation.
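(For reference, a minimal sketch of the kind of per-key backoff wrapper that sentence alludes to; nothing here is kube API, and `BackoffTracker` is a hypothetical name. The reconciler would call `next_delay` on a retryable failure and return `Ok(Action::requeue(delay))`, calling `reset` after a successful pass, so the per-key state lives outside the synchronous `error_policy`.)

```rust
use std::{collections::HashMap, time::Duration};
use tokio::sync::Mutex;

/// Hypothetical per-key failure counter, shared via the controller context.
#[derive(Default)]
struct BackoffTracker {
    failures: Mutex<HashMap<String, u32>>,
}

impl BackoffTracker {
    /// Exponential backoff per key, capped at 5 minutes.
    async fn next_delay(&self, key: &str) -> Duration {
        let mut failures = self.failures.lock().await;
        let n = failures.entry(key.to_string()).or_insert(0);
        *n += 1;
        Duration::from_secs(2u64.saturating_pow(*n).min(300))
    }

    /// Called from the reconciler after a successful pass.
    async fn reset(&self, key: &str) {
        self.failures.lock().await.remove(key);
    }
}
```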
> | Cause | How to verify | Solution |
> |-------|---------------|----------|
> | [Store] not yet initialized | Readiness probe fails | Wait for [Store::wait_until_ready] |
this is probably the more advanced/unlikely one you'd only see if you're using the streams interface (we wait by default in the controller otherwise).
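(A rough sketch of the streams-interface case where the wait matters, assuming Pods in the current namespace and the usual tokio/anyhow example scaffolding; the Controller path waits for this internally.)

```rust
use futures::{future, StreamExt};
use k8s_openapi::api::core::v1::Pod;
use kube::{
    runtime::{reflector, watcher, WatchStreamExt},
    Api, Client,
};

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let client = Client::try_default().await?;
    let pods: Api<Pod> = Api::default_namespaced(client);

    // Hand-rolled reflector instead of Controller: we drive the stream ourselves.
    let (reader, writer) = reflector::store::<Pod>();
    let stream = reflector(writer, watcher(pods, watcher::Config::default()))
        .applied_objects()
        .boxed();
    tokio::spawn(stream.for_each(|_| future::ready(())));

    // The store is only populated once the initial LIST has been applied;
    // gate readiness probes (or first use of the cache) on this.
    reader.wait_until_ready().await?;
    println!("cache ready with {} pods", reader.state().len());
    Ok(())
}
```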
Signed-off-by: doxxx93 <doxxx93@gmail.com>
Addressed all review feedback — thanks for the thorough review!

errors.md:
ssa.md:
troubleshooting.md:

(LOL.. github profile mistake..)
clux
left a comment
some more comments, realized i need to check this more carefully
> !!! note "Current limitation: no ApplyConfigurations in Rust"
>
>     Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) — fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches avoids this issue.
Suggested change:
- Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) — fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches avoids this issue.
+ Go's client-go provides [ApplyConfigurations](https://pkg.go.dev/k8s.io/client-go/applyconfigurations) - fully optional builder types designed specifically for SSA. Rust does not have an equivalent yet ([kube#649](https://github.com/kube-rs/kube/issues/649)). Some [k8s-openapi] fields are not fully optional (e.g. certain integer fields like `maxReplicas`), which can make typed partial SSA awkward. Using `serde_json::json!()` for partial patches works around this issue.
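(A sketch of the `json!()`-based partial apply that paragraph describes, using an HPA so the `maxReplicas` point is visible; the resource name and field manager are placeholders.)

```rust
use k8s_openapi::api::autoscaling::v2::HorizontalPodAutoscaler;
use kube::api::{Api, Patch, PatchParams};
use serde_json::json;

async fn set_min_replicas(hpas: Api<HorizontalPodAutoscaler>) -> Result<(), kube::Error> {
    // Only the fields this controller owns; required typed fields such as
    // `maxReplicas` never need to be populated in an untyped patch.
    let patch = json!({
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "spec": { "minReplicas": 2 }
    });
    let pp = PatchParams::apply("my-controller");
    hpas.patch("my-hpa", &pp, &Patch::Apply(&patch)).await?;
    Ok(())
}
```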
> let pod = tokio::time::timeout(
>     Duration::from_secs(10),
>     api.get("my-pod"),
> ).await??;
> Not all errors are retryable:
>
> | Error | Retryable | Reason |
> |-------|-----------|--------|
> | 5xx | Yes | Server-side transient failure |
> | Timeout | Yes | Temporary network issue |
> | 429 Too Many Requests | Yes | Rate limit — wait and retry |
> | Network error | Yes | Temporary connectivity failure |
> | 4xx (400, 403, 404) | No | The request itself is wrong |
> | 409 Conflict | No | SSA ownership conflict — fix the logic |
Because this table sits within the client-level retry section, right after the RetryPolicy, it gives the impression that the RetryPolicy does the retrying for all of the errors marked Yes, but that's not true.
Probably need to restructure this section so that it's less ambiguous.
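(To make the split concrete, a sketch of a controller-level `error_policy` doing its own classification, independent of any client-level retry middleware; the `Error` enum and durations are assumptions, not kube API.)

```rust
use std::{sync::Arc, time::Duration};
use k8s_openapi::api::core::v1::Pod;
use kube::runtime::controller::Action;

/// Hypothetical reconciler error type, for illustration only.
#[derive(Debug, thiserror::Error)]
enum Error {
    #[error("kube api error: {0}")]
    Kube(#[from] kube::Error),
    #[error("invalid spec: {0}")]
    InvalidSpec(String),
}

struct Ctx;

// Controller-level retry: runs regardless of (and in addition to) any
// client-level retry middleware, which only covers the request it wraps.
fn error_policy(_obj: Arc<Pod>, err: &Error, _ctx: Arc<Ctx>) -> Action {
    match err {
        // Transient API problems (5xx, timeouts, 429): back off and requeue.
        Error::Kube(_) => Action::requeue(Duration::from_secs(10)),
        // A bad spec will not fix itself: requeue far in the future and rely
        // on the next change event instead of hammering the API.
        Error::InvalidSpec(_) => Action::requeue(Duration::from_secs(3600)),
    }
}
```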
> ```rust
> // ✗ Uses default field manager → unintended ownership conflicts
> let pp = PatchParams::default();
>
> // ✓ Explicit field manager
> let pp = PatchParams::apply("my-controller");
> ```
field managers are required for server-side apply, so using PatchParams::default with apply should probably be validated as an error in PatchParams rather than documented here as an eternal footgun.
Agreed — this should be a client-side validation rather than a doc-only warning. PatchParams::validate() already rejects force with non-Apply patches, but doesn't check field_manager: None with Patch::Apply. I'll open an issue on kube-rs/kube for adding this check.
> let pp = PatchParams::apply("my-controller");
>
> Always specify an explicit field manager. Without one, you risk ownership collisions with other controllers or kubectl users.
what is the implication of collisions if you are not using a field manager?
> jeprof --svg ./my-controller jeprof.*.heap > heap.svg
>
> If `AHashMap` allocations dominate the profile, the [Store] cache is likely the bottleneck. Apply `.modify()` or switch to [metadata_watcher].
What is the Store cache bottlenecking?
i.e. strange choice of wording. it shouldn't slow anything down, it should just allocate more.
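(A rough sketch of the `metadata_watcher` alternative the quoted line points at, assuming cluster-wide Pods and an anyhow wrapper for brevity; the reflector then caches only `PartialObjectMeta`, so the allocations in question shrink rather than anything getting faster.)

```rust
use futures::StreamExt;
use k8s_openapi::api::core::v1::Pod;
use kube::{
    core::PartialObjectMeta,
    runtime::{metadata_watcher, reflector, watcher, WatchStreamExt},
    Api, Client,
};

async fn run(client: Client) -> anyhow::Result<()> {
    let pods: Api<Pod> = Api::all(client);

    // Cache only metadata: same watch semantics, but the reflector store
    // never holds full Pod specs, so it allocates far less.
    let (reader, writer) = reflector::store::<PartialObjectMeta<Pod>>();
    let stream = reflector(writer, metadata_watcher(pods, watcher::Config::default()))
        .applied_objects();
    futures::pin_mut!(stream);
    while let Some(next) = stream.next().await {
        let meta = next?;
        println!("saw {:?}, {} objects cached", meta.metadata.name, reader.state().len());
    }
    Ok(())
}
```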
> | Cause | How to verify | Solution |
> |-------|---------------|----------|
> | Re-list memory spikes | Periodic spikes visible in memory graphs | Use `streaming_lists()`, or reduce `page_size` |
in my experience, re-lists actually don't cause spikes because once maps have reserved capacity they generally do not give it back that quickly. are you seeing something else?
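(For completeness, the knobs the quoted row refers to, assuming the `watcher::Config` helpers by those names; streaming lists need the WatchList feature on the apiserver, Kubernetes 1.27+.)

```rust
use kube::runtime::watcher;

fn configs() -> (watcher::Config, watcher::Config) {
    // Streaming lists: initial sync arrives as a watch stream instead of
    // paginated LIST pages.
    let streaming = watcher::Config::default().streaming_lists();
    // Classic paginated LIST, but with smaller pages.
    let paginated = watcher::Config::default().page_size(50);
    (streaming, paginated)
}
```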
Signed-off-by: doxxx93 <doxxx93@gmail.com>
Summary
- controllers/ssa.md (new): Server-Side Apply patterns — common pitfalls (missing apiVersion/kind, force misuse, unnecessary field ownership), status patching, typed SSA
- controllers/errors.md (new): Error handling across layers (Client → Api → watcher → Controller), watcher backoff configuration, reconciler error_policy patterns, client-level retry guidance, timeout strategies
- troubleshooting.md (enhanced): Added symptom-based diagnosis tables (infinite loop, memory growth, watch recovery, 429 throttling, finalizer deadlock, reconciler not running), debugging tools (RUST_LOG levels, tracing spans, kubectl inspection), and profiling guidance (jemalloc, tokio-console)
- mkdocs.yml: Added new pages to navigation — SSA under Concepts (after Reconciler), Error Handling under Operational (after Observability)

Cross-references added to avoid duplication with existing pages (optimization, observability, gc).
Context
Discussed with @clux on Discord — agreed to start with practical patterns (SSA, error handling) as they come up frequently in GitHub issues/discussions.