kubectl-grove v2: Comprehensive CLI for Grove Cluster Management#338
Draft
athreesh wants to merge 20 commits into ai-dynamo:main
Comprehensive design document for transforming Arborist from a basic diagnostics tool into a full-featured CLI for Grove operations.

Key features planned:
- P0: arborist status (match RBG), arborist generate (AIC integration)
- P1: arborist topology (visualization), arborist health (gang monitoring)
- P2: arborist compare (plan vs actual), arborist metrics (Prometheus)
- P2+: arborist tui (interactive terminal UI)

The strategy is to leapfrog RBG by building observability features they don't have, leveraging Grove's unique data (PlacementScore, ClusterTopology, TerminationDelay countdown).

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Major changes:
- Rename operator/cmd/arborist/ → operator/cmd/kubectl-grove/
- Delete empty cli-plugin/ directory (was placeholder only)
- Update all references from arborist to kubectl-grove
- Update requirements doc with PM decisions:
  - CLI naming: kubectl grove (kubectl plugin)
  - P0 priority: Parallel (status + topology together)
  - Plan storage: ConfigMap with grove.io/aic-plan label
  - TUI priority: Phase 2 (higher than originally planned)
  - Metrics: Direct pod scraping (no Prometheus dependency)

The kubectl-grove plugin will be Grove's answer to kubectl rbg, with differentiating features like topology visualization and PlacementScore display that RBG doesn't have.

Co-Authored-By: Claude Opus 4.5 <[email protected]>
…ate, plan)

Add comprehensive kubectl plugin functionality for Grove:
- status: Show PodCliqueSet status with clique and gang information
- health: Gang health dashboard with threshold monitoring
- topology: Visualize pod placement across topology domains
- generate: Generate Grove manifests using AIConfigurator logic
- plan: Store, show, and diff deployment plans

Features:
- Watch mode for real-time updates (topology, health)
- ASCII visualization for topology tree
- Plan storage in ConfigMaps for GitOps workflows
- Comprehensive test coverage

Closes ai-dynamo#329, ai-dynamo#330, ai-dynamo#331, ai-dynamo#332, ai-dynamo#333

Co-Authored-By: Claude Opus 4.5 <[email protected]>
New commands:
- tui: Interactive terminal UI with Bubble Tea framework
  - 4 tab-switchable views: Hierarchy, Topology, Health, Help
  - Vim-style navigation (j/k, g/G, Enter to expand)
  - Real-time updates via K8s watch API
- metrics: Direct pod metrics scraping
  - Auto-detect inference engine (SGLang, vLLM, TRT-LLM)
  - Prometheus format parsing
  - Watch mode with trend indicators
  - JSON output support
- compare: Plan vs actual comparison
  - Compare configuration (replicas, GPUs, TP size)
  - Topology/placement score analysis
  - Auto-generate diagnosis and recommendations

Closes ai-dynamo#334, ai-dynamo#335, ai-dynamo#336

Co-Authored-By: Claude Opus 4.5 <[email protected]>
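For readers unfamiliar with what "Prometheus format parsing" involves when scraping a pod directly, the following is a simplified Go sketch using only the standard library. It is not the PR's parser: `parseMetrics` is a hypothetical helper, the metric names in `main` are made up, and the sketch assumes label values contain no spaces (a full parser must handle quoting and escapes).

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// parseMetrics reads Prometheus text exposition format and returns the last
// value seen for each series, keyed by the series text (name plus labels)
// exactly as it appears in the input. Comment lines (# HELP, # TYPE) and
// blank lines are skipped; an optional trailing timestamp is ignored.
// Simplification: label values must not contain spaces.
func parseMetrics(text string) (map[string]float64, error) {
	samples := make(map[string]float64)
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fields := strings.Fields(line)
		if len(fields) < 2 {
			return nil, fmt.Errorf("malformed sample line: %q", line)
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			return nil, fmt.Errorf("bad value in %q: %w", line, err)
		}
		samples[fields[0]] = v
	}
	return samples, sc.Err()
}

func main() {
	// Hypothetical scrape body; real engines expose their own metric names.
	body := "# TYPE demo_requests_running gauge\n" +
		"demo_requests_running{model=\"m\"} 3\n" +
		"demo_queue_depth 7.5 1700000000000\n"
	samples, err := parseMetrics(body)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(samples["demo_queue_depth"])
}
```

A parser like this is enough for gauges and counters; histograms and summaries arrive as several series (`_bucket`, `_sum`, `_count`) that a caller would need to group.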
This was referenced Jan 16, 2026
- Add ANSI color support to topology visualization:
  - Green/Yellow/Red for pod status (Running/Pending/Failed)
  - Cyan [P] / Magenta [D] role badges for prefill/decode
  - Color-coded GPU utilization bars with Unicode blocks
  - Placement score coloring based on quality
  - Colored warnings with icons
  - Respects NO_COLOR env var and TTY detection
- Add k9s plugin configuration (cmd/kubectl-grove/k9s/):
  - plugins.yaml: 14 shortcuts for Grove commands
  - aliases.yaml: CRD shortcuts (:pcs, :pc, :pg, :ct)
  - README.md: Installation and usage guide

Co-Authored-By: Claude Opus 4.5 <[email protected]>
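As an illustration of how a color-coded GPU bar can respect NO_COLOR and TTY detection, here is a small Go sketch. The function names and the 70% / 90% color thresholds are assumptions made for this example, not values taken from the PR. The TTY check uses the standard-library character-device test, which is an approximation (for instance, /dev/null is also a character device).

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

const (
	green  = "\x1b[32m"
	yellow = "\x1b[33m"
	red    = "\x1b[31m"
	reset  = "\x1b[0m"
)

// colorEnabled follows the NO_COLOR convention: any non-empty value disables
// color. Otherwise color is used only when the output is a terminal.
func colorEnabled(noColor string, isTTY bool) bool {
	return noColor == "" && isTTY
}

// stdoutIsTTY approximates TTY detection with the standard library only.
func stdoutIsTTY() bool {
	info, err := os.Stdout.Stat()
	if err != nil {
		return false
	}
	return info.Mode()&os.ModeCharDevice != 0
}

// gpuBar renders utilization (0.0 to 1.0) as a bar of Unicode block characters.
// The color thresholds are hypothetical.
func gpuBar(util float64, width int, color bool) string {
	if width < 0 {
		width = 0
	}
	if util < 0 {
		util = 0
	}
	if util > 1 {
		util = 1
	}
	filled := int(util*float64(width) + 0.5) // round to nearest cell
	bar := strings.Repeat("█", filled) + strings.Repeat("░", width-filled)
	if !color {
		return bar
	}
	c := green
	switch {
	case util >= 0.9:
		c = red
	case util >= 0.7:
		c = yellow
	}
	return c + bar + reset
}

func main() {
	useColor := colorEnabled(os.Getenv("NO_COLOR"), stdoutIsTTY())
	fmt.Println(gpuBar(0.75, 20, useColor))
}
```

Passing the NO_COLOR value and the TTY result in as parameters, instead of reading them inside `gpuBar`, keeps the rendering deterministic in tests.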
Collaborator
@athreesh while this PR is draft can you please use an existing go module
- Create new cli-plugin/ module as standalone kubectl-grove CLI
- Replace Bubbletea TUI with tview-based Arborist TUI
- Add topology visualization to TUI (press 't' to view)
- Add hierarchical navigation: Forest -> PodCliqueSet -> PodGang -> PodClique -> Pod
- Copy diagnostics package to cli-plugin (can't import internal from another module)
- Fix test label selectors (grove.io/podcliqueset -> app.kubernetes.io/part-of)
- Fix GPU bar tests to use Unicode characters

New Arborist TUI features:
- Split-pane UI with resources table and events panel
- Drill-down navigation with Enter, back with Escape
- Tab to switch between resources/events panes
- Auto-refresh every 2 seconds
- Color-coded status (green=Running, yellow=Pending, red=Failed)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
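The drill-down navigation described above (Enter to descend, Escape to go back) can be modeled as a stack over the fixed hierarchy. The Go sketch below is UI-agnostic and does not use tview; the `navigator` type and its methods are hypothetical, and it assumes each view level corresponds to one entry in the Forest -> PodCliqueSet -> PodGang -> PodClique -> Pod chain.

```go
package main

import (
	"fmt"
	"strings"
)

// levels mirrors the hierarchy named in the commit message.
var levels = []string{"Forest", "PodCliqueSet", "PodGang", "PodClique", "Pod"}

// navigator is a hypothetical model of the TUI's position in the hierarchy.
// path holds the names of the resources selected so far, one per level
// below Forest.
type navigator struct {
	path []string
}

// level reports the hierarchy level of the current view.
func (n *navigator) level() string { return levels[len(n.path)] }

// drillDown (Enter) descends into the named child. It returns false when the
// view is already at the Pod level.
func (n *navigator) drillDown(name string) bool {
	if len(n.path) == len(levels)-1 {
		return false
	}
	n.path = append(n.path, name)
	return true
}

// back (Escape) returns to the parent view. It returns false at the root.
func (n *navigator) back() bool {
	if len(n.path) == 0 {
		return false
	}
	n.path = n.path[:len(n.path)-1]
	return true
}

// breadcrumb renders the current position for a title bar.
func (n *navigator) breadcrumb() string {
	return strings.Join(append([]string{"Forest"}, n.path...), " > ")
}

func main() {
	nav := &navigator{}
	nav.drillDown("my-pcs") // hypothetical resource names
	nav.drillDown("my-pcs-0")
	fmt.Println(nav.level(), "|", nav.breadcrumb())
	nav.back()
	fmt.Println(nav.level(), "|", nav.breadcrumb())
}
```

Keeping the navigation state separate from the widgets means the key handlers only call `drillDown` and `back`, and the table contents can be reloaded from `level()` and `path` on each auto-refresh.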
## AIConfigurator Integration (Rewritten)
- Properly execute aiconfigurator CLI as subprocess
- Parse generator_config.yaml output (not stdout)
- Transform to PodCliqueSet manifests
- Use JSON struct tags for sigs.k8s.io/yaml compatibility

## Bug Fixes
- Fix metrics port-forward: Use actual SPDY port-forwarding instead of falling back to direct pod IP (which doesn't work outside cluster)
- Fix silent failures in arborist_client.go: Log warnings instead of silently continuing on conversion errors
- Fix hardcoded image version: Use AIConfigurator-provided image as fallback, with :latest as final default

## New Features
- Namespace resolution from kubeconfig context
- Shell completion support (bash, zsh, fish)
- Topology watch mode (-w flag now works)

## Documentation
- Add docs/user-guide/cli.md comprehensive CLI reference
- Add docs/designs/cli-update-commands.md for rolling updates design

Co-Authored-By: Claude Opus 4.5 <[email protected]>
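The image-version fix describes a fallback chain. One way to read it is sketched below in Go, under the assumption that an explicitly supplied image takes precedence, the AIConfigurator-provided image comes next, and a `:latest` tag on a default repository is the last resort. `resolveImage`, its parameters, and the example image names are hypothetical; the PR's actual precedence rules are not visible on this page.

```go
package main

import "fmt"

// resolveImage picks the container image using a fallback chain.
// Assumed order: explicit user choice, then the image reported by
// AIConfigurator, then defaultRepo tagged :latest.
func resolveImage(explicit, fromAIC, defaultRepo string) string {
	switch {
	case explicit != "":
		return explicit
	case fromAIC != "":
		return fromAIC
	default:
		return defaultRepo + ":latest"
	}
}

func main() {
	// Hypothetical image names for illustration only.
	fmt.Println(resolveImage("", "registry.example.com/engine:1.2.3", "registry.example.com/engine"))
	fmt.Println(resolveImage("", "", "registry.example.com/engine"))
}
```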
This PR is in Draft mode for prototyping Grove CLI.

## Line Breakdown (~62K total)

| Category               | Lines | Notes                                    |
|------------------------|-------|------------------------------------------|
| Test files (*_test.go) | ~29K  | ~47% of PR - comprehensive test coverage |
| cli-plugin/ (total)    | ~15K  | New CLI module (includes ~7K tests)      |
| operator/ (non-test)   | ~12K  | Operator code changes                    |
| docs/                  | ~2.8K | User guide, design docs, API reference   |
| CRD YAML               | ~1.7K | Generated CRD manifests                  |
| go.mod/go.sum          | ~1.5K | Dependency lockfiles (machine-generated) |

Key insight: Almost half (~29K) is test code. Actual implementation is ~30K lines across operator and cli-plugin.

## Split Strategy (if needed for merge)

Option to split into 2-3 smaller PRs:

1. **PR 1: cli-plugin migration** (~15K lines)
   - New standalone CLI module
   - Moved from operator/cmd/kubectl-grove/
   - Includes Arborist TUI, commands, AIC integration
2. **PR 2: Operator changes** (~12K lines)
   - API changes, webhook validation
   - Topology constraints
   - ClusterTopology CRD
3. **PR 3: Tests & Docs** (~32K lines)
   - All *_test.go files
   - Documentation updates
   - Can be reviewed/merged last

## Review Tips

- Skip generated files: CRDs, go.sum, zz_generated.*
- 74 commits total - can review by commit
- Tests validate behavior - review implementation first

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Design document proposing kubectl-grove, a kubectl plugin for managing Grove AI inference workloads.

Key features:
- Arborist TUI for hierarchical resource navigation
- Topology visualization with GPU allocation and fragmentation warnings
- Status, health, and diagnostics commands
- Lifecycle management (rollout, scale, update, restart, apply)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
Summary
This PR implements the kubectl-grove CLI with comprehensive commands for managing Grove AI inference workloads on Kubernetes. It includes a complete migration of the CLI to a standalone module and replaces the Bubbletea TUI with the Arborist (tview-based) TUI.
What's Included
CLI Commands
- kubectl grove status
- kubectl grove topology
- kubectl grove health
- kubectl grove metrics
- kubectl grove tui
- kubectl grove diagnostics

Key Features
- Migrated from operator/cmd/kubectl-grove/ to cli-plugin/
- Watch mode for the topology and health commands

PR Size Breakdown (~62K lines)
- Test files (*_test.go): ~29K lines

Key insight: Almost half (~29K) is test code. Actual implementation is ~30K lines.
Split Strategy (if needed for merge)
This PR can be split into 2-3 smaller PRs:
PR 1: cli-plugin migration (~15K lines)
- cli-plugin/

PR 2: Operator changes (~12K lines)

PR 3: Tests & Docs (~32K lines)
- *_test.go files

Review Tips
- zz_generated.*

Example Usage

Test plan
- cd cli-plugin && go test ./...

Related Issues
Closes #329 (kubectl grove status)
Closes #330 (kubectl grove topology)
Closes #331 (kubectl grove generate)
Closes #332 (kubectl grove plan)
Closes #333 (kubectl grove health)
🤖 Generated with Claude Code