feat: Add Kubernetes Metrics Endpoint Auto-Discovery #648

Open
ardecode wants to merge 1 commit into ai-dynamo:main from ardecode:feat/k8s-metrics-discovery-630

Conversation

@ardecode ardecode commented Feb 6, 2026

Summary

  • Add automatic discovery of Prometheus /metrics endpoints in Kubernetes environments
  • Query K8s API for pods with Dynamo labels or Prometheus annotations
  • Support user-provided label selectors for custom deployments

New CLI Options

  • --metrics-discovery-mode: auto | kubernetes | disabled (default: auto)
  • --metrics-label-selector: Label selector for custom deployments
  • --metrics-namespace: Kubernetes namespace to search

Discovery Hierarchy

  1. Dynamo labels: nvidia.com/metrics-enabled: "true"
  2. Prometheus annotations: prometheus.io/scrape: "true"
  3. User-provided label selector
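
A rough sketch of the eligibility check this hierarchy implies (the function and argument names below are illustrative, not necessarily those used in src/aiperf/discovery/kubernetes.py):

def is_pod_eligible_sketch(
    labels: dict[str, str],
    annotations: dict[str, str],
    user_selector_provided: bool,
) -> bool:
    """Illustrative priority check mirroring the hierarchy above."""
    if labels.get("nvidia.com/metrics-enabled") == "true":
        return True
    if annotations.get("prometheus.io/scrape") == "true":
        return True
    # Pods listed for an explicit --metrics-label-selector are accepted as-is,
    # since the selector is already applied server-side by the Kubernetes API.
    return user_selector_provided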

Test plan

  • Unit tests for discovery module pass
  • Manual test in K8s environment with Dynamo deployment

Part of #630

Summary by CodeRabbit

  • New Features
    • Automatic discovery of metrics endpoints from Kubernetes environments with configurable modes (Auto, Kubernetes, Disabled)
    • New CLI configuration parameters for metrics discovery including label selector and namespace filtering
    • Enhanced metrics manager with integrated Kubernetes-based endpoint detection

github-actions bot commented Feb 6, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@93a2c9b6eefbfe9f5d651077061eccdb0f38e283

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@93a2c9b6eefbfe9f5d651077061eccdb0f38e283

Last updated for commit: 93a2c9b

@github-actions github-actions bot added the feat label Feb 6, 2026

coderabbitai bot commented Feb 6, 2026

Walkthrough

This PR introduces Kubernetes-based metrics endpoint discovery for the aiperf tool. It adds a new kubernetes API client dependency, exposes metrics discovery configuration options (mode, label selector, namespace), defines new discovery-related enums, implements asynchronous Kubernetes pod enumeration and URL construction, integrates discovery into the metrics manager, and provides comprehensive unit tests for the discovery logic.
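
The scrape-URL construction mentioned here is essentially an annotation lookup with configured defaults. A minimal sketch under that assumption (the helper name and signature are illustrative, not the module's actual API):

def build_scrape_url_sketch(
    pod_ip: str | None,
    annotations: dict[str, str],
    default_scheme: str = "http",
    default_path: str = "/metrics",
    default_port: int | None = None,
) -> str | None:
    """Illustrative: combine prometheus.io/* annotations with configured defaults."""
    scheme = annotations.get("prometheus.io/scheme", default_scheme)
    path = annotations.get("prometheus.io/path", default_path)
    port = annotations.get("prometheus.io/port", default_port)
    if not pod_ip or not port:
        return None  # pods without an IP or a discoverable port are skipped
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{pod_ip}:{port}{path}"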

Changes

Cohort / File(s) Summary
Configuration & Dependencies
pyproject.toml, src/aiperf/common/config/user_config.py
Added kubernetes>=31.0.0 dependency and three new user configuration fields: metrics_discovery_mode (AUTO/KUBERNETES/DISABLED), metrics_label_selector, and metrics_namespace, all exposed via CLI.
Enums
src/aiperf/common/enums/enums.py, src/aiperf/common/enums/__init__.py
Added MetricsDiscoveryMode (AUTO, KUBERNETES, DISABLED) and MetricsDiscoverySource (DYNAMO, PROMETHEUS, SELECTOR, MANUAL, ENDPOINT) enum classes and exported them from the package.
Discovery Implementation
src/aiperf/discovery/__init__.py, src/aiperf/discovery/kubernetes.py
Introduced new discovery package with Kubernetes endpoint discovery module implementing async pod enumeration, scrape URL construction from annotations/defaults, port resolution, and environment detection (in-cluster vs kubeconfig).
Manager Integration
src/aiperf/server_metrics/manager.py
Enhanced metrics manager to support discovery mode configuration, invoke auto-discovery during profile setup, accumulate discovered endpoints alongside user-configured URLs, and handle AUTO/KUBERNETES/DISABLED discovery modes.
Tests
tests/unit/discovery/__init__.py, tests/unit/discovery/test_kubernetes_discovery.py
Added license headers and comprehensive unit tests covering in-cluster detection, pod-to-URL conversion with annotation handling, discovery function behavior, and URL deduplication logic.
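
The environment detection noted above (in-cluster vs kubeconfig) typically reduces to an env-var check plus a config-loading fallback. A rough sketch; is_running_in_kubernetes is the name exercised by this PR's tests, while the config-loading helper below is purely illustrative:

import os

from kubernetes import config

def is_running_in_kubernetes() -> bool:
    # KUBERNETES_SERVICE_HOST is injected into every pod by the kubelet,
    # so its presence is a cheap signal that we are running in-cluster.
    return bool(os.environ.get("KUBERNETES_SERVICE_HOST"))

def load_kubernetes_config_sketch() -> None:
    # Illustrative fallback: prefer the in-cluster service-account config,
    # otherwise fall back to the local kubeconfig.
    if is_running_in_kubernetes():
        config.load_incluster_config()
    else:
        config.load_kube_config()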

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A curious bunny hops through pods in the cloud,
Discovering metrics with K8s so proud!
Annotations guide the way, defaults light the path,
Auto-discovery flows like a gentle stream's bath.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title 'feat: Add Kubernetes Metrics Endpoint Auto-Discovery' accurately describes the main change: introducing automatic discovery of Prometheus metrics endpoints in Kubernetes environments.
Docstring Coverage | ✅ Passed | Docstring coverage is 95.24%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@src/aiperf/discovery/kubernetes.py`:
- Around line 35-69: KubernetesDiscoveryConfig currently inherits from
pydantic.BaseModel; change it to inherit from AIPerfBaseModel and update imports
accordingly by replacing or adding the import for AIPerfBaseModel (instead of
BaseModel) so the class definition reads "class
KubernetesDiscoveryConfig(AIPerfBaseModel):" and remove or stop using the direct
BaseModel import; keep all existing Field definitions and types unchanged.

In `@src/aiperf/server_metrics/manager.py`:
- Around line 466-498: Wrap calls to discover_kubernetes_endpoints inside
_run_metrics_discovery in a try/except so any unexpected exception (e.g., from
asyncio.to_thread or the kubernetes SDK) is caught, logged (use self.warning or
self.error and include the exception details) and an empty list is returned to
avoid bubbling the error up to _profile_configure_command; do this for both the
explicit MetricsDiscoveryMode.KUBERNETES branch and the AUTO branch where
is_running_in_kubernetes() is checked. Also change the final self.debug call to
use the lambda form (self.debug(lambda: "Server Metrics: Not in K8s, skipping
auto-discovery")) to match the file's logging guideline for potentially
expensive messages.
- Around line 121-131: Discovered metric endpoints returned by
_run_metrics_discovery() are already full metrics paths and must not be passed
through normalize_metrics_endpoint_url; change the loop in the auto-discovery
block to append the discovered URL strings as-is (deduplicating by exact string
against self._server_metrics_endpoints) instead of calling
normalize_metrics_endpoint_url, so unique discovered endpoints (from
_pod_to_url/_run_metrics_discovery) are added intact and the info() log uses the
resulting added count.

In `@tests/unit/discovery/test_kubernetes_discovery.py`:
- Around line 70-94: The test test_deduplicates_urls currently only provides one
pod so deduplication isn't exercised; update the fixture so
mock_list.return_value contains two distinct MagicMock pod objects that both map
to the same URL (create a separate duplicate_pod = MagicMock() copying the same
status.pod_ip, metadata.annotations/labels and spec.containers as mock_pod) so
that discover_kubernetes_endpoints and _pod_to_url are invoked twice and the
set-based deduplication is actually tested.
🧹 Nitpick comments (5)
src/aiperf/common/config/user_config.py (1)

567-613: Consider validating K8s-specific options against metrics_discovery_mode.

The metrics_label_selector and metrics_namespace fields are only meaningful when discovery mode is auto or kubernetes. This file already validates unused options elsewhere (see validate_unused_options at Lines 265-320, which raises errors for options set without their required companions).

For consistency with that pattern, consider adding a validator that warns or errors when metrics_label_selector or metrics_namespace is explicitly set but metrics_discovery_mode is DISABLED:

♻️ Suggested validator
@model_validator(mode="after")
def validate_metrics_discovery_options(self) -> Self:
    if self.metrics_discovery_mode == MetricsDiscoveryMode.DISABLED:
        k8s_options = []
        if "metrics_label_selector" in self.model_fields_set:
            k8s_options.append("--metrics-label-selector")
        if "metrics_namespace" in self.model_fields_set:
            k8s_options.append("--metrics-namespace")
        if k8s_options:
            raise ValueError(
                f"{', '.join(k8s_options)} can only be used when "
                "--metrics-discovery-mode is 'auto' or 'kubernetes'."
            )
    return self
tests/unit/discovery/test_kubernetes_discovery.py (1)

16-22: Missing negative test case for is_running_in_kubernetes.

Only the True case is tested. Adding a False test provides complete coverage for this function and guards against regressions (e.g., someone changing the env var name).

♻️ Suggested addition
def test_returns_false_when_service_host_not_set(self):
    """Should return False when KUBERNETES_SERVICE_HOST is not set."""
    with patch.dict("os.environ", {}, clear=True):
        assert is_running_in_kubernetes() is False
pyproject.toml (1)

60-60: Add upper bound to kubernetes version constraint; consider making it optional.

The blocking I/O concern is already properly handled in the discovery module—all kubernetes API calls are wrapped with asyncio.to_thread() (see _load_core_api_blocking and _list_running_pods_blocking), so the event loop won't be blocked.

Two remaining improvements:

  1. Version constraint: kubernetes>=31.0.0 has no upper bound, unlike other dependencies (e.g., aiohttp~=3.13.3). Add an upper bound like kubernetes>=31.0.0,<37 or use kubernetes~=31.0 to prevent unexpected breaking changes in future major releases.

  2. Optional dependency: Consider moving this to [project.optional-dependencies] since K8s discovery is guarded by require_in_k8s and can be disabled, so users who don't need it won't incur the install overhead.

src/aiperf/discovery/kubernetes.py (2)

131-143: In large clusters, listing all running pods without a label selector can be expensive.

When label_selector is None (and namespace is also None), this fetches every running pod across all namespaces. The client-side _is_eligible filter then discards most of them. For clusters with thousands of pods, the API response could be very large.

Consider applying a server-side OR-style filter (e.g., always include the Dynamo label in the selector) or documenting that --metrics-label-selector / --metrics-namespace is recommended for large clusters.
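
One possible shape of that server-side narrowing, sketched with standard kubernetes client calls (note that Prometheus-annotated pods lacking the Dynamo label still cannot be filtered server-side, because label selectors do not match annotations):

from kubernetes import client

def list_candidate_pods_sketch(
    api: client.CoreV1Api,
    namespace: str | None,
    user_selector: str | None,
) -> list:
    """Illustrative: issue label-filtered list calls instead of fetching every running pod."""
    selectors = ["nvidia.com/metrics-enabled=true"]
    if user_selector:
        selectors.append(user_selector)
    pods = []
    for selector in selectors:
        if namespace:
            resp = api.list_namespaced_pod(
                namespace,
                label_selector=selector,
                field_selector="status.phase=Running",
            )
        else:
            resp = api.list_pod_for_all_namespaces(
                label_selector=selector,
                field_selector="status.phase=Running",
            )
        pods.extend(resp.items)
    return pods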


174-185: Eligibility fallback is subtly correct but could use a more explicit comment.

Line 185 returns True for any pod when label_selector is set, because the server already filtered to matching pods. The existing comment ("label_selector is already applied server-side; this is a fallback gate") is helpful, but a reader might still wonder why a pod without Dynamo/Prometheus markers is eligible.

Consider expanding slightly:

     # label_selector is already applied server-side; this is a fallback gate.
+    # All pods returned by the K8s API already matched the user's label_selector,
+    # so they are considered eligible even without Dynamo or Prometheus markers.
     return config.label_selector is not None

Comment on lines +35 to +69
class KubernetesDiscoveryConfig(BaseModel):
    namespace: str | None = Field(
        default=None,
        description="Namespace to search. None searches all namespaces.",
    )
    label_selector: str | None = Field(
        default=None,
        description=(
            "Optional Kubernetes label selector applied server-side at list-time. "
            "Used as fallback eligibility for non-Dynamo/non-annotated pods."
        ),
    )
    default_scheme: str = Field(
        default="http",
        description="Default scheme if prometheus.io/scheme is not set.",
    )
    default_path: str = Field(
        default="/metrics",
        description="Default path if prometheus.io/path is not set.",
    )
    default_port: int | None = Field(
        default=None,
        description=(
            "Fallback port if prometheus.io/port is absent and no container port is found. "
            "If None, pods without a discoverable port are skipped."
        ),
    )
    prefer_port_name: str = Field(
        default="metrics",
        description="Preferred container port name to use when selecting a port.",
    )
    require_in_k8s: bool = Field(
        default=True,
        description="If True, return [] when not running inside Kubernetes.",
    )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find AIPerfBaseModel definition
fd -e py -x rg -l "class AIPerfBaseModel" {}
# Check where it's imported from
rg "from.*AIPerfBaseModel|import.*AIPerfBaseModel" --type=py -C1

Repository: ai-dynamo/aiperf

Length of output: 7860


🏁 Script executed:

#!/bin/bash
# Check if kubernetes.py is a service or utility module
head -50 src/aiperf/discovery/kubernetes.py | rg -E "^import|^from|^class|^async def|^def" -n
# Check what BaseModel it uses
rg "^from|^import" src/aiperf/discovery/kubernetes.py -n | head -20

Repository: ai-dynamo/aiperf

Length of output: 642


🏁 Script executed:

#!/bin/bash
# Look for other config classes using BaseModel in the discovery module
fd -path "src/aiperf/discovery/*.py" -type f -exec rg -l "class.*Config.*BaseModel" {} \;
# Also check how server_metrics configs are structured
fd -path "src/aiperf/server_metrics/*.py" -type f -exec rg -l "class.*Config.*BaseModel" {} \;

Repository: ai-dynamo/aiperf

Length of output: 397


🏁 Script executed:

#!/bin/bash
# Get AIPerfBaseModel definition
head -100 src/aiperf/common/models/base_models.py | cat -n

Repository: ai-dynamo/aiperf

Length of output: 937


🏁 Script executed:

#!/bin/bash
# Check what the kubernetes.py file looks like (is it a service or utility?)
wc -l src/aiperf/discovery/kubernetes.py
head -35 src/aiperf/discovery/kubernetes.py | cat -n

Repository: ai-dynamo/aiperf

Length of output: 1368


🏁 Script executed:

#!/bin/bash
# Look for other discovery config classes or similar patterns
find src/aiperf/discovery -name "*.py" -type f
# Also search for BaseModel usage in discovery directory
rg "class.*BaseModel" src/aiperf/discovery/ -n

Repository: ai-dynamo/aiperf

Length of output: 209


🏁 Script executed:

#!/bin/bash
# Check what BaseConfig is and how it's used
rg "class BaseConfig" src/aiperf -n -A3 | head -30
# Check examples of config classes using it
rg "BaseConfig" src/aiperf --type=py -l | head -5

Repository: ai-dynamo/aiperf

Length of output: 532


🏁 Script executed:

#!/bin/bash
# Verify the guideline applies to discovery module by checking imports
head -30 src/aiperf/timing/config.py | cat -n
# Compare with another config class that follows guidelines
rg "class.*Config.*AIPerfBaseModel" src/aiperf -n | head -5

Repository: ai-dynamo/aiperf

Length of output: 1617


🏁 Script executed:

#!/bin/bash
# Check if kubernetes.py discovery module is a service or utility module
# by looking at its usage/imports from other modules
rg "from.*kubernetes import|import.*kubernetes" src/aiperf --type=py | head -10
# Also check the discovery module structure
cat src/aiperf/discovery/__init__.py

Repository: ai-dynamo/aiperf

Length of output: 1137


🏁 Script executed:

#!/bin/bash
# Compare: how is BaseModel vs AIPerfBaseModel used for configs
# Check if there's any pattern about BaseModel-only config usage
rg "class.*Config.*\(BaseModel\)" src/aiperf --type=py | head -10

Repository: ai-dynamo/aiperf

Length of output: 207


🏁 Script executed:

#!/bin/bash
# Let's verify the actual usage pattern - check BaseConfig more closely
cat src/aiperf/common/config/base_config.py | head -30
# And check how it's actually used vs AIPerfBaseModel
rg "class.*\(BaseConfig\)" src/aiperf --type=py | head -5
rg "class.*\(AIPerfBaseModel\)" src/aiperf --type=py | wc -l

Repository: ai-dynamo/aiperf

Length of output: 1349


🏁 Script executed:

#!/bin/bash
# Let's understand the distinction: what does BaseConfig add vs BaseModel?
# Check if KubernetesDiscoveryConfig needs YAML serialization
rg "KubernetesDiscoveryConfig" src/aiperf --type=py -B2 -A2 | head -40
# Does kubernetes.py use any BaseConfig features?
grep -n "serialize\|yaml\|dict\|model_dump" src/aiperf/discovery/kubernetes.py

Repository: ai-dynamo/aiperf

Length of output: 2774


Use AIPerfBaseModel for data models.

The guidelines for src/**/*.py state: "Use AIPerfBaseModel for data models and BaseConfig for configuration." KubernetesDiscoveryConfig currently extends pydantic.BaseModel directly. Other data model classes throughout src/aiperf (e.g., TimingConfig, CreditPhaseConfig) use AIPerfBaseModel for consistency. Update the import and class definition to use AIPerfBaseModel.
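
A minimal sketch of the requested change (the AIPerfBaseModel import path is assumed from the base_models.py location found above):

from pydantic import Field

from aiperf.common.models.base_models import AIPerfBaseModel  # assumed import path

class KubernetesDiscoveryConfig(AIPerfBaseModel):
    namespace: str | None = Field(
        default=None,
        description="Namespace to search. None searches all namespaces.",
    )
    # ... remaining Field definitions unchanged ...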

🤖 Prompt for AI Agents
In `@src/aiperf/discovery/kubernetes.py` around lines 35 - 69,
KubernetesDiscoveryConfig currently inherits from pydantic.BaseModel; change it
to inherit from AIPerfBaseModel and update imports accordingly by replacing or
adding the import for AIPerfBaseModel (instead of BaseModel) so the class
definition reads "class KubernetesDiscoveryConfig(AIPerfBaseModel):" and remove
or stop using the direct BaseModel import; keep all existing Field definitions
and types unchanged.

Comment on lines 121 to 131
        # Run auto-discovery if enabled
        if self._discovery_mode != MetricsDiscoveryMode.DISABLED:
            discovered_urls = await self._run_metrics_discovery()
            added = 0
            for url in discovered_urls:
                normalized = normalize_metrics_endpoint_url(url)
                if normalized not in self._server_metrics_endpoints:
                    self._server_metrics_endpoints.append(normalized)
                    added += 1
            if added > 0:
                self.info(f"Server Metrics: Auto-discovery added {added} endpoint(s)")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the normalize_metrics_endpoint_url function
rg -A 15 "def normalize_metrics_endpoint_url" src/

Repository: ai-dynamo/aiperf

Length of output: 1171


🏁 Script executed:

#!/bin/bash
# Find the _pod_to_url function to understand what paths it generates
rg -A 10 "def _pod_to_url|def _pod_to_endpoints" src/aiperf/server_metrics/

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Find kubernetes.py to see the discovery implementation
fd "kubernetes.py" src/ --exec cat {}

Repository: ai-dynamo/aiperf

Length of output: 7136


🏁 Script executed:

#!/bin/bash
# Get the full normalize_metrics_endpoint_url function
rg -A 30 "def normalize_metrics_endpoint_url" src/aiperf/common/metric_utils.py

Repository: ai-dynamo/aiperf

Length of output: 1204


🏁 Script executed:

#!/bin/bash
# Get more lines of the function
rg -A 40 "def normalize_metrics_endpoint_url" src/aiperf/common/metric_utils.py

Repository: ai-dynamo/aiperf

Length of output: 1250


Discovered URLs with custom Prometheus paths are corrupted by normalization.

The normalize_metrics_endpoint_url function appends /metrics to any URL not ending with that suffix. However, discovered URLs from _pod_to_url are already fully qualified metrics endpoints—they include the complete path from the prometheus.io/path annotation or the configured default. Normalizing them treats them as base URLs, causing URLs with custom paths to be malformed (e.g., http://pod:8080/custom/stats becomes http://pod:8080/custom/stats/metrics).

Discovered URLs should bypass normalization since they are already complete:

Proposed fix
             discovered_urls = await self._run_metrics_discovery()
             added = 0
             for url in discovered_urls:
-                normalized = normalize_metrics_endpoint_url(url)
+                normalized = url  # discovered URLs are already fully qualified
                 if normalized not in self._server_metrics_endpoints:
                     self._server_metrics_endpoints.append(normalized)
                     added += 1
🤖 Prompt for AI Agents
In `@src/aiperf/server_metrics/manager.py` around lines 121 - 131, Discovered
metric endpoints returned by _run_metrics_discovery() are already full metrics
paths and must not be passed through normalize_metrics_endpoint_url; change the
loop in the auto-discovery block to append the discovered URL strings as-is
(deduplicating by exact string against self._server_metrics_endpoints) instead
of calling normalize_metrics_endpoint_url, so unique discovered endpoints (from
_pod_to_url/_run_metrics_discovery) are added intact and the info() log uses the
resulting added count.

Comment on lines 466 to 498
    async def _run_metrics_discovery(self) -> list[str]:
        """Run metrics endpoint auto-discovery based on configuration.

        Returns:
            List of discovered endpoint URLs.
        """
        if self._discovery_mode == MetricsDiscoveryMode.DISABLED:
            return []

        if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
            if not is_running_in_kubernetes():
                self.warning(
                    "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
                )
                return []
            self.info("Server Metrics: Running Kubernetes discovery...")
            cfg = KubernetesDiscoveryConfig(
                namespace=self._namespace,
                label_selector=self._label_selector,
            )
            return await discover_kubernetes_endpoints(cfg)

        # AUTO mode: try K8s if available
        if is_running_in_kubernetes():
            self.info("Server Metrics: Running Kubernetes auto-discovery...")
            cfg = KubernetesDiscoveryConfig(
                namespace=self._namespace,
                label_selector=self._label_selector,
            )
            return await discover_kubernetes_endpoints(cfg)

        self.debug("Server Metrics: Not in K8s, skipping auto-discovery")
        return []

🛠️ Refactor suggestion | 🟠 Major

Wrap discovery call in try/except for resilience.

discover_kubernetes_endpoints has internal guards, but an unexpected exception (e.g., from asyncio.to_thread, a kubernetes SDK change, or a malformed response) would propagate into _profile_configure_command and fail the entire configuration phase. A defensive catch here keeps discovery failures non-fatal.

Proposed fix
     async def _run_metrics_discovery(self) -> list[str]:
         """Run metrics endpoint auto-discovery based on configuration.

         Returns:
             List of discovered endpoint URLs.
         """
         if self._discovery_mode == MetricsDiscoveryMode.DISABLED:
             return []

-        if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
-            if not is_running_in_kubernetes():
-                self.warning(
-                    "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
-                )
-                return []
-            self.info("Server Metrics: Running Kubernetes discovery...")
-            cfg = KubernetesDiscoveryConfig(
-                namespace=self._namespace,
-                label_selector=self._label_selector,
-            )
-            return await discover_kubernetes_endpoints(cfg)
-
-        # AUTO mode: try K8s if available
-        if is_running_in_kubernetes():
-            self.info("Server Metrics: Running Kubernetes auto-discovery...")
-            cfg = KubernetesDiscoveryConfig(
-                namespace=self._namespace,
-                label_selector=self._label_selector,
-            )
-            return await discover_kubernetes_endpoints(cfg)
-
-        self.debug("Server Metrics: Not in K8s, skipping auto-discovery")
-        return []
+        try:
+            if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
+                if not is_running_in_kubernetes():
+                    self.warning(
+                        "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
+                    )
+                    return []
+                self.info("Server Metrics: Running Kubernetes discovery...")
+                cfg = KubernetesDiscoveryConfig(
+                    namespace=self._namespace,
+                    label_selector=self._label_selector,
+                )
+                return await discover_kubernetes_endpoints(cfg)
+
+            # AUTO mode: try K8s if available
+            if is_running_in_kubernetes():
+                self.info("Server Metrics: Running Kubernetes auto-discovery...")
+                cfg = KubernetesDiscoveryConfig(
+                    namespace=self._namespace,
+                    label_selector=self._label_selector,
+                )
+                return await discover_kubernetes_endpoints(cfg)
+
+            self.debug(lambda: "Server Metrics: Not in K8s, skipping auto-discovery")
+            return []
+        except Exception as e:
+            self.warning(f"Server Metrics: Auto-discovery failed: {e!r}")
+            return []

Also note: Line 497 uses self.debug("...") without a lambda wrapper, inconsistent with the rest of the file. As per coding guidelines, use lambda for expensive logs: self.debug(lambda: "Server Metrics: Not in K8s, skipping auto-discovery").

🤖 Prompt for AI Agents
In `@src/aiperf/server_metrics/manager.py` around lines 466 - 498, Wrap calls to
discover_kubernetes_endpoints inside _run_metrics_discovery in a try/except so
any unexpected exception (e.g., from asyncio.to_thread or the kubernetes SDK) is
caught, logged (use self.warning or self.error and include the exception
details) and an empty list is returned to avoid bubbling the error up to
_profile_configure_command; do this for both the explicit
MetricsDiscoveryMode.KUBERNETES branch and the AUTO branch where
is_running_in_kubernetes() is checked. Also change the final self.debug call to
use the lambda form (self.debug(lambda: "Server Metrics: Not in K8s, skipping
auto-discovery")) to match the file's logging guideline for potentially
expensive messages.

Comment on lines 70 to 94

⚠️ Potential issue | 🟡 Minor

Deduplication test doesn't exercise cross-pod deduplication.

_pod_to_url is called once per pod and returns a single URL. With only one mock pod in mock_list.return_value = [mock_pod], the set in discover_kubernetes_endpoints trivially has one element. The comment "Should only appear once despite matching both Dynamo and Prometheus" is misleading — the function doesn't produce multiple URLs per pod regardless.

To meaningfully test deduplication, supply two distinct pod objects that resolve to the same URL:

♻️ Suggested fix
     async def test_deduplicates_urls(self):
-        """Should deduplicate URLs found via multiple methods."""
-        mock_pod = MagicMock()
-        mock_pod.status.pod_ip = "10.1.2.3"
-        mock_pod.metadata.annotations = {
+        """Should deduplicate identical URLs from different pods."""
+        pod1 = MagicMock()
+        pod1.status.pod_ip = "10.1.2.3"
+        pod1.metadata.annotations = {
             "prometheus.io/port": "8080",
             "prometheus.io/scrape": "true",
         }
-        mock_pod.metadata.labels = {"nvidia.com/metrics-enabled": "true"}
-        mock_pod.spec.containers = []
+        pod1.metadata.labels = {"nvidia.com/metrics-enabled": "true"}
+        pod1.spec.containers = []
+
+        pod2 = MagicMock()
+        pod2.status.pod_ip = "10.1.2.3"
+        pod2.metadata.annotations = {
+            "prometheus.io/port": "8080",
+            "prometheus.io/scrape": "true",
+        }
+        pod2.metadata.labels = {}
+        pod2.spec.containers = []
 
         with (
             patch.dict("os.environ", {"KUBERNETES_SERVICE_HOST": "10.0.0.1"}),
             patch("kubernetes.config.load_incluster_config"),
             patch("kubernetes.client.CoreV1Api") as mock_api,
             patch("aiperf.discovery.kubernetes._list_running_pods") as mock_list,
         ):
-            mock_list.return_value = [mock_pod]
+            mock_list.return_value = [pod1, pod2]
             mock_api.return_value = MagicMock()
 
             urls = await discover_kubernetes_endpoints(KubernetesDiscoveryConfig())
 
-        # Should only appear once despite matching both Dynamo and Prometheus
+        # Same URL from two pods should appear exactly once
         assert urls.count("http://10.1.2.3:8080/metrics") == 1
🤖 Prompt for AI Agents
In `@tests/unit/discovery/test_kubernetes_discovery.py` around lines 70 - 94, The
test test_deduplicates_urls currently only provides one pod so deduplication
isn't exercised; update the fixture so mock_list.return_value contains two
distinct MagicMock pod objects that both map to the same URL (create a separate
duplicate_pod = MagicMock() copying the same status.pod_ip,
metadata.annotations/labels and spec.containers as mock_pod) so that
discover_kubernetes_endpoints and _pod_to_url are invoked twice and the
set-based deduplication is actually tested.

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 64.37500% with 57 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/aiperf/discovery/kubernetes.py | 65.71% | 27 Missing and 9 partials ⚠️
src/aiperf/server_metrics/manager.py | 32.25% | 15 Missing and 6 partials ⚠️

📢 Thoughts on this report? Let us know!

@ardecode ardecode marked this pull request as draft February 6, 2026 00:43
@ardecode ardecode marked this pull request as ready for review February 6, 2026 00:46
@ardecode ardecode force-pushed the feat/k8s-metrics-discovery-630 branch from 1d80533 to 1592f6a on February 7, 2026 00:47
@ardecode ardecode marked this pull request as draft February 7, 2026 00:48
@ardecode ardecode force-pushed the feat/k8s-metrics-discovery-630 branch from 1592f6a to 62a37cc on February 7, 2026 01:12
@ardecode ardecode marked this pull request as ready for review February 7, 2026 01:13

Contributor

@ajcasagrande ajcasagrande left a comment

@ardecode It's looking really good so far! I got through most of it; there are a few files I may need to take a closer look at. Thanks!

Comment on lines 759 to 778

Albeit a long name, I'm thinking we may want to prefix everything with --server, so --server-metrics-discovery-mode. This will just make it obvious to the user that these options are server-metrics related, as the current names could confuse them.

In the near future, when we enable config v2 with yaml support, it will be much more streamlined:

server_metrics:
  discovery:
    type: kubernetes
    label_selector: app=vllm,env=prod
    namespace: inference

server_metrics:
  discovery: disabled

server_metrics:
  discovery: auto
or similar....

the type field is a discriminated union for pydantic schema validation.
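
For reference, a minimal sketch of what that discriminated union might look like in pydantic (class and field names are illustrative, and the bare-string forms like "discovery: auto" above would need separate coercion):

from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field

class KubernetesDiscovery(BaseModel):
    type: Literal["kubernetes"] = "kubernetes"
    label_selector: str | None = None
    namespace: str | None = None

class AutoDiscovery(BaseModel):
    type: Literal["auto"] = "auto"

class DisabledDiscovery(BaseModel):
    type: Literal["disabled"] = "disabled"

# Pydantic selects the concrete model from the "type" field during validation.
DiscoveryConfig = Annotated[
    Union[KubernetesDiscovery, AutoDiscovery, DisabledDiscovery],
    Field(discriminator="type"),
]

class ServerMetricsConfig(BaseModel):
    discovery: DiscoveryConfig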

Comment on lines +193 to +222
def _resolve_port(
    pod: V1Pod,
    annotation_port: str | None,
    preferred_port_name: str,
    fallback: int | None,
) -> int | None:
    """Resolve port from annotation -> named port -> first port -> fallback."""
    if annotation_port:
        try:
            return int(annotation_port)
        except ValueError:
            pass

    spec = getattr(pod, "spec", None)
    containers = getattr(spec, "containers", None) if spec else None
    if not containers:
        return fallback

    first_port = None
    for container in containers:
        for port_spec in getattr(container, "ports", None) or []:
            container_port = getattr(port_spec, "container_port", None)
            if not container_port:
                continue
            if first_port is None:
                first_port = int(container_port)
            if getattr(port_spec, "name", None) == preferred_port_name:
                return int(container_port)

    return first_port if first_port is not None else fallback

Is there any way this doesn't have to use all the getattrs? Or is that just a side effect of the kubernetes library?

Add automatic discovery of Prometheus /metrics endpoints in Kubernetes
environments by querying the K8s API for pods with specific labels and
annotations.

Discovery hierarchy (priority order):
1. Dynamo labels: nvidia.com/metrics-enabled: "true"
2. Prometheus annotations: prometheus.io/scrape: "true"
3. User-provided label selector via --metrics-label-selector

New CLI options:
- --metrics-discovery-mode: auto | kubernetes | disabled (default: auto)
- --metrics-label-selector: Label selector for custom deployments
- --metrics-namespace: Kubernetes namespace to search

Part of ai-dynamo#630

Signed-off-by: ardecode <desaiarijit@gmail.com>
@ardecode ardecode force-pushed the feat/k8s-metrics-discovery-630 branch from 62a37cc to 93a2c9b on February 8, 2026 22:28