Status loss during clusterctl move and its impact on Servers and ServerClaims #626

@afritzler

Background

When moving resources using clusterctl move, status fields are not preserved.
For several resources this is acceptable because their status can be fully reconstructed by controllers.

Examples where this is not a problem:

  • Endpoints
  • BMCs
  • BMCSecrets

Their controllers will naturally reconcile and rebuild the required status.
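
For illustration, this self-healing pattern looks roughly like the following controller-runtime sketch, where status is recomputed from scratch on every reconcile (the group/version and field names are assumptions for illustration, not the exact metal-operator API):

```go
package controllers

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// bmcGVK identifies a resource whose status is purely derived
// (hypothetical group/version; adjust to the real metal-operator API).
var bmcGVK = schema.GroupVersionKind{
	Group: "metal.ironcore.dev", Version: "v1alpha1", Kind: "BMC",
}

// BMCReconciler rebuilds status from scratch on every reconcile,
// so a status wiped by clusterctl move is simply repopulated.
type BMCReconciler struct {
	client.Client
}

func (r *BMCReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(bmcGVK)
	if err := r.Get(ctx, req.NamespacedName, obj); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}

	// Recompute status from spec and the external system only;
	// "Enabled" is a placeholder for a real probe of the BMC endpoint.
	// Nothing here depends on the previous status.
	base := obj.DeepCopy()
	_ = unstructured.SetNestedField(obj.Object, "Enabled", "status", "powerState")

	return ctrl.Result{}, r.Status().Patch(ctx, obj, client.MergeFrom(base))
}
```

Because the reconcile loop never reads the old status, losing it during a move is harmless for resources like these.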

Problem

For Servers and ServerClaims, losing status information causes serious issues.

Specifically:

  • ServerClaims must end up in the Bound phase, bound to their corresponding Server
  • The associated Server must end up in the Reserved state
  • After a move, this relationship and these states are lost

Additionally, we lose discovery-derived status data, including but not limited to:

  • NetworkInterface information
  • Other hardware and inventory data aggregated during the discovery boot

This data is currently lost after a move and must be manually patched back into the resource from a backup.
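
To make the loss concrete, the state involved has roughly the following shape (field and type names here are illustrative assumptions, not the verbatim metal-operator definitions):

```go
package v1alpha1

// Assumed shapes of the status fields that clusterctl move wipes;
// names are illustrative, not the exact metal-operator API.

type ServerClaimStatus struct {
	// Phase must read "Bound" for a claim that holds a Server.
	Phase string `json:"phase,omitempty"`
}

type ServerStatus struct {
	// State must read "Reserved" while a ServerClaim holds the Server.
	State string `json:"state,omitempty"`
	// NetworkInterfaces and similar inventory data are populated once,
	// during the discovery boot, and cannot be cheaply recomputed.
	NetworkInterfaces []NetworkInterfaceStatus `json:"networkInterfaces,omitempty"`
}

type NetworkInterfaceStatus struct {
	Name       string `json:"name"`
	MACAddress string `json:"macAddress"`
	IP         string `json:"ip,omitempty"`
}
```

Unlike the reconcilable resources above, none of this can be rebuilt from spec alone.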

Why this is a problem

  • The system cannot reliably infer the correct post-move state
  • Controllers may not have enough information to safely re-bind claims
  • Re-running discovery is either undesirable or impossible in some scenarios
  • Manual intervention does not scale and is error-prone

Discussion Points / Open Questions

  1. Reconstruction strategy

    • How can we reliably reconstruct Server and ServerClaim status after a move?
    • Can we deterministically infer the Bound / Reserved state from spec-only data? (See the first sketch after this list.)
  2. Persistence of discovery data

    • Should discovery results be partially or fully moved from status into spec or another persisted object?
    • Do we need a dedicated inventory or snapshot resource that survives cluster moves? (See the second sketch after this list.)
  3. Move-awareness

    • Should we introduce move-aware reconciliation logic?
    • Is there value in explicitly detecting a “post-move” scenario and running a special recovery flow?
  4. Alternative approaches

    • Explicit export/import of status-like data before and after clusterctl move
    • A controller-driven re-discovery flow with safety guarantees
    • A new abstraction to decouple discovery results from transient controller state
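
On point 1: if a ServerClaim's spec already references its Server, the Bound / Reserved pair can in principle be re-derived from spec-only data. A minimal sketch of such a recovery pass, assuming a spec.serverRef field, cluster-scoped Servers, and the illustrative Bound / Reserved values from above (none of these names are verified against the metal-operator API):

```go
package recovery

import (
	"context"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

var (
	// Hypothetical GVKs; adjust to the real metal-operator API group/version.
	claimGVK  = schema.GroupVersionKind{Group: "metal.ironcore.dev", Version: "v1alpha1", Kind: "ServerClaim"}
	serverGVK = schema.GroupVersionKind{Group: "metal.ironcore.dev", Version: "v1alpha1", Kind: "Server"}
)

// recoverBinding re-derives the Bound/Reserved pair after a move,
// treating the claim's spec.serverRef as the persisted source of truth.
func recoverBinding(ctx context.Context, c client.Client, claimKey types.NamespacedName) error {
	claim := &unstructured.Unstructured{}
	claim.SetGroupVersionKind(claimGVK)
	if err := c.Get(ctx, claimKey, claim); err != nil {
		return err
	}

	// spec.serverRef.name survives the move because spec is preserved.
	serverName, found, err := unstructured.NestedString(claim.Object, "spec", "serverRef", "name")
	if err != nil || !found {
		return err // nothing to infer from; a different strategy is needed
	}

	// Servers are assumed cluster-scoped here, hence no namespace.
	server := &unstructured.Unstructured{}
	server.SetGroupVersionKind(serverGVK)
	if err := c.Get(ctx, types.NamespacedName{Name: serverName}, server); err != nil {
		return err
	}

	// Restore the lost status on both sides of the relationship.
	serverBase := server.DeepCopy()
	if err := unstructured.SetNestedField(server.Object, "Reserved", "status", "state"); err != nil {
		return err
	}
	if err := c.Status().Patch(ctx, server, client.MergeFrom(serverBase)); err != nil {
		return err
	}

	claimBase := claim.DeepCopy()
	if err := unstructured.SetNestedField(claim.Object, "Bound", "status", "phase"); err != nil {
		return err
	}
	return c.Status().Patch(ctx, claim, client.MergeFrom(claimBase))
}
```

This only works if spec.serverRef is guaranteed to be set before the move; if the binding is recorded exclusively in status, some persisted record (point 2) has to exist first.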
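
On point 2: one option is a dedicated resource that carries discovery results in its spec, so clusterctl move preserves them by construction. ServerInventorySnapshot below is an invented name for illustration, reusing the illustrative NetworkInterfaceStatus shape from the earlier sketch:

```go
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// ServerInventorySnapshot is a hypothetical resource that persists
// discovery results in spec rather than status, so clusterctl move
// carries them over and a controller can restore Server status from it.
type ServerInventorySnapshot struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec ServerInventorySnapshotSpec `json:"spec,omitempty"`
}

type ServerInventorySnapshotSpec struct {
	// ServerRef names the Server this snapshot was taken from.
	ServerRef string `json:"serverRef"`
	// NetworkInterfaces mirrors the discovery-derived data that today
	// lives only in Server status (type defined in the earlier sketch).
	NetworkInterfaces []NetworkInterfaceStatus `json:"networkInterfaces,omitempty"`
}
```

A post-move recovery flow could then rebuild Server status from the snapshot instead of re-running the discovery boot.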

Goal

Develop a clear and reliable concept to:

  • Restore correct Server / ServerClaim states after a cluster move
  • Preserve or safely reconstruct discovery-derived data
  • Make clusterctl move a supported and predictable operation for metal-operator resources
