feat: Add Kubernetes Metrics Endpoint Auto-Discovery #648

Open
ardecode wants to merge 1 commit into ai-dynamo:main from ardecode:feat/k8s-metrics-discovery-630

Conversation

@ardecode ardecode commented Feb 6, 2026

Summary

  • Add automatic discovery of Prometheus /metrics endpoints in Kubernetes environments
  • Query K8s API for pods with Dynamo labels or Prometheus annotations
  • Support user-provided label selectors for custom deployments

New CLI Options

  • --metrics-discovery-mode: auto | kubernetes | disabled (default: auto)
  • --metrics-label-selector: Label selector for custom deployments
  • --metrics-namespace: Kubernetes namespace to search

Discovery Hierarchy

  1. Dynamo labels: nvidia.com/metrics-enabled: "true"
  2. Prometheus annotations: prometheus.io/scrape: "true"
  3. User-provided label selector
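
A rough sketch of the eligibility check this hierarchy implies (the function and argument names below are illustrative, not necessarily those used in src/aiperf/discovery/kubernetes.py):

def is_pod_eligible_sketch(
    labels: dict[str, str],
    annotations: dict[str, str],
    user_selector_provided: bool,
) -> bool:
    """Illustrative priority check mirroring the hierarchy above."""
    if labels.get("nvidia.com/metrics-enabled") == "true":
        return True
    if annotations.get("prometheus.io/scrape") == "true":
        return True
    # Pods listed for an explicit --metrics-label-selector are accepted as-is,
    # since the selector is already applied server-side by the Kubernetes API.
    return user_selector_provided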

Test plan

  • Unit tests for discovery module pass
  • Manual test in K8s environment with Dynamo deployment

Part of #630

Summary by CodeRabbit

  • New Features
    • Automatic discovery of metrics endpoints from Kubernetes environments with configurable modes (Auto, Kubernetes, Disabled)
    • New CLI configuration parameters for metrics discovery including label selector and namespace filtering
    • Enhanced metrics manager with integrated Kubernetes-based endpoint detection

github-actions bot commented Feb 6, 2026

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@93a2c9b6eefbfe9f5d651077061eccdb0f38e283

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@93a2c9b6eefbfe9f5d651077061eccdb0f38e283

Last updated for commit: 93a2c9b

@github-actions github-actions bot added the feat label Feb 6, 2026

coderabbitai bot commented Feb 6, 2026

Walkthrough

This PR introduces Kubernetes-based metrics endpoint discovery for the aiperf tool. It adds a new kubernetes API client dependency, exposes metrics discovery configuration options (mode, label selector, namespace), defines new discovery-related enums, implements asynchronous Kubernetes pod enumeration and URL construction, integrates discovery into the metrics manager, and provides comprehensive unit tests for the discovery logic.
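
The scrape-URL construction mentioned here is essentially an annotation lookup with configured defaults. A minimal sketch under that assumption (the helper name and signature are illustrative, not the module's actual API):

def build_scrape_url_sketch(
    pod_ip: str | None,
    annotations: dict[str, str],
    default_scheme: str = "http",
    default_path: str = "/metrics",
    default_port: int | None = None,
) -> str | None:
    """Illustrative: combine prometheus.io/* annotations with configured defaults."""
    scheme = annotations.get("prometheus.io/scheme", default_scheme)
    path = annotations.get("prometheus.io/path", default_path)
    port = annotations.get("prometheus.io/port", default_port)
    if not pod_ip or not port:
        return None  # pods without an IP or a discoverable port are skipped
    if not path.startswith("/"):
        path = "/" + path
    return f"{scheme}://{pod_ip}:{port}{path}"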

Changes

Cohort / File(s) Summary
Configuration & Dependencies
pyproject.toml, src/aiperf/common/config/user_config.py
Added kubernetes>=31.0.0 dependency and three new user configuration fields: metrics_discovery_mode (AUTO/KUBERNETES/DISABLED), metrics_label_selector, and metrics_namespace, all exposed via CLI.
Enums
src/aiperf/common/enums/enums.py, src/aiperf/common/enums/__init__.py
Added MetricsDiscoveryMode (AUTO, KUBERNETES, DISABLED) and MetricsDiscoverySource (DYNAMO, PROMETHEUS, SELECTOR, MANUAL, ENDPOINT) enum classes and exported them from the package.
Discovery Implementation
src/aiperf/discovery/__init__.py, src/aiperf/discovery/kubernetes.py
Introduced new discovery package with Kubernetes endpoint discovery module implementing async pod enumeration, scrape URL construction from annotations/defaults, port resolution, and environment detection (in-cluster vs kubeconfig).
Manager Integration
src/aiperf/server_metrics/manager.py
Enhanced metrics manager to support discovery mode configuration, invoke auto-discovery during profile setup, accumulate discovered endpoints alongside user-configured URLs, and handle AUTO/KUBERNETES/DISABLED discovery modes.
Tests
tests/unit/discovery/__init__.py, tests/unit/discovery/test_kubernetes_discovery.py
Added license headers and comprehensive unit tests covering in-cluster detection, pod-to-URL conversion with annotation handling, discovery function behavior, and URL deduplication logic.
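
The environment detection noted above (in-cluster vs kubeconfig) typically reduces to an env-var check plus a config-loading fallback. A rough sketch; is_running_in_kubernetes is the name exercised by this PR's tests, while the config-loading helper below is purely illustrative:

import os

from kubernetes import config

def is_running_in_kubernetes() -> bool:
    # KUBERNETES_SERVICE_HOST is injected into every pod by the kubelet,
    # so its presence is a cheap signal that we are running in-cluster.
    return bool(os.environ.get("KUBERNETES_SERVICE_HOST"))

def load_kubernetes_config_sketch() -> None:
    # Illustrative fallback: prefer the in-cluster service-account config,
    # otherwise fall back to the local kubeconfig.
    if is_running_in_kubernetes():
        config.load_incluster_config()
    else:
        config.load_kube_config()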

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 A curious bunny hops through pods in the cloud,
Discovering metrics with K8s so proud!
Annotations guide the way, defaults light the path,
Auto-discovery flows like a gentle stream's bath.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled.
Title check | ✅ Passed | The title 'feat: Add Kubernetes Metrics Endpoint Auto-Discovery' accurately describes the main change: introducing automatic discovery of Prometheus metrics endpoints in Kubernetes environments.
Docstring Coverage | ✅ Passed | Docstring coverage is 95.24%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In `@src/aiperf/discovery/kubernetes.py`:
- Around line 35-69: KubernetesDiscoveryConfig currently inherits from
pydantic.BaseModel; change it to inherit from AIPerfBaseModel and update imports
accordingly by replacing or adding the import for AIPerfBaseModel (instead of
BaseModel) so the class definition reads "class
KubernetesDiscoveryConfig(AIPerfBaseModel):" and remove or stop using the direct
BaseModel import; keep all existing Field definitions and types unchanged.

In `@src/aiperf/server_metrics/manager.py`:
- Around line 466-498: Wrap calls to discover_kubernetes_endpoints inside
_run_metrics_discovery in a try/except so any unexpected exception (e.g., from
asyncio.to_thread or the kubernetes SDK) is caught, logged (use self.warning or
self.error and include the exception details) and an empty list is returned to
avoid bubbling the error up to _profile_configure_command; do this for both the
explicit MetricsDiscoveryMode.KUBERNETES branch and the AUTO branch where
is_running_in_kubernetes() is checked. Also change the final self.debug call to
use the lambda form (self.debug(lambda: "Server Metrics: Not in K8s, skipping
auto-discovery")) to match the file's logging guideline for potentially
expensive messages.
- Around line 121-131: Discovered metric endpoints returned by
_run_metrics_discovery() are already full metrics paths and must not be passed
through normalize_metrics_endpoint_url; change the loop in the auto-discovery
block to append the discovered URL strings as-is (deduplicating by exact string
against self._server_metrics_endpoints) instead of calling
normalize_metrics_endpoint_url, so unique discovered endpoints (from
_pod_to_url/_run_metrics_discovery) are added intact and the info() log uses the
resulting added count.

In `@tests/unit/discovery/test_kubernetes_discovery.py`:
- Around line 70-94: The test test_deduplicates_urls currently only provides one
pod so deduplication isn't exercised; update the fixture so
mock_list.return_value contains two distinct MagicMock pod objects that both map
to the same URL (create a separate duplicate_pod = MagicMock() copying the same
status.pod_ip, metadata.annotations/labels and spec.containers as mock_pod) so
that discover_kubernetes_endpoints and _pod_to_url are invoked twice and the
set-based deduplication is actually tested.
🧹 Nitpick comments (5)
src/aiperf/common/config/user_config.py (1)

567-613: Consider validating K8s-specific options against metrics_discovery_mode.

The metrics_label_selector and metrics_namespace fields are only meaningful when discovery mode is auto or kubernetes. This file already validates unused options elsewhere (see validate_unused_options at Lines 265-320, which raises errors for options set without their required companions).

For consistency with that pattern, consider adding a validator that warns or errors when metrics_label_selector or metrics_namespace is explicitly set but metrics_discovery_mode is DISABLED:

♻️ Suggested validator
@model_validator(mode="after")
def validate_metrics_discovery_options(self) -> Self:
    if self.metrics_discovery_mode == MetricsDiscoveryMode.DISABLED:
        k8s_options = []
        if "metrics_label_selector" in self.model_fields_set:
            k8s_options.append("--metrics-label-selector")
        if "metrics_namespace" in self.model_fields_set:
            k8s_options.append("--metrics-namespace")
        if k8s_options:
            raise ValueError(
                f"{', '.join(k8s_options)} can only be used when "
                "--metrics-discovery-mode is 'auto' or 'kubernetes'."
            )
    return self
tests/unit/discovery/test_kubernetes_discovery.py (1)

16-22: Missing negative test case for is_running_in_kubernetes.

Only the True case is tested. Adding a False test provides complete coverage for this function and guards against regressions (e.g., someone changing the env var name).

♻️ Suggested addition
def test_returns_false_when_service_host_not_set(self):
    """Should return False when KUBERNETES_SERVICE_HOST is not set."""
    with patch.dict("os.environ", {}, clear=True):
        assert is_running_in_kubernetes() is False
pyproject.toml (1)

60-60: Add upper bound to kubernetes version constraint; consider making it optional.

The blocking I/O concern is already properly handled in the discovery module—all kubernetes API calls are wrapped with asyncio.to_thread() (see _load_core_api_blocking and _list_running_pods_blocking), so the event loop won't be blocked.

Two remaining improvements:

  1. Version constraint: kubernetes>=31.0.0 has no upper bound, unlike other dependencies (e.g., aiohttp~=3.13.3). Add an upper bound like kubernetes>=31.0.0,<37 or use kubernetes~=31.0 to prevent unexpected breaking changes in future major releases.

  2. Optional dependency: Consider moving this to [project.optional-dependencies] since K8s discovery is guarded by require_in_k8s and can be disabled, so users who don't need it won't incur the install overhead.

src/aiperf/discovery/kubernetes.py (2)

131-143: In large clusters, listing all running pods without a label selector can be expensive.

When label_selector is None (and namespace is also None), this fetches every running pod across all namespaces. The client-side _is_eligible filter then discards most of them. For clusters with thousands of pods, the API response could be very large.

Consider applying a server-side OR-style filter (e.g., always include the Dynamo label in the selector) or documenting that --metrics-label-selector / --metrics-namespace is recommended for large clusters.
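
One possible shape of that server-side narrowing, sketched with standard kubernetes client calls (note that Prometheus-annotated pods lacking the Dynamo label still cannot be filtered server-side, because label selectors do not match annotations):

from kubernetes import client

def list_candidate_pods_sketch(
    api: client.CoreV1Api,
    namespace: str | None,
    user_selector: str | None,
) -> list:
    """Illustrative: issue label-filtered list calls instead of fetching every running pod."""
    selectors = ["nvidia.com/metrics-enabled=true"]
    if user_selector:
        selectors.append(user_selector)
    pods = []
    for selector in selectors:
        if namespace:
            resp = api.list_namespaced_pod(
                namespace,
                label_selector=selector,
                field_selector="status.phase=Running",
            )
        else:
            resp = api.list_pod_for_all_namespaces(
                label_selector=selector,
                field_selector="status.phase=Running",
            )
        pods.extend(resp.items)
    return pods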


174-185: Eligibility fallback is subtly correct but could use a more explicit comment.

Line 185 returns True for any pod when label_selector is set, because the server already filtered to matching pods. The existing comment ("label_selector is already applied server-side; this is a fallback gate") is helpful, but a reader might still wonder why a pod without Dynamo/Prometheus markers is eligible.

Consider expanding slightly:

     # label_selector is already applied server-side; this is a fallback gate.
+    # All pods returned by the K8s API already matched the user's label_selector,
+    # so they are considered eligible even without Dynamo or Prometheus markers.
     return config.label_selector is not None

Comment on lines +35 to +69
class KubernetesDiscoveryConfig(BaseModel):
    namespace: str | None = Field(
        default=None,
        description="Namespace to search. None searches all namespaces.",
    )
    label_selector: str | None = Field(
        default=None,
        description=(
            "Optional Kubernetes label selector applied server-side at list-time. "
            "Used as fallback eligibility for non-Dynamo/non-annotated pods."
        ),
    )
    default_scheme: str = Field(
        default="http",
        description="Default scheme if prometheus.io/scheme is not set.",
    )
    default_path: str = Field(
        default="/metrics",
        description="Default path if prometheus.io/path is not set.",
    )
    default_port: int | None = Field(
        default=None,
        description=(
            "Fallback port if prometheus.io/port is absent and no container port is found. "
            "If None, pods without a discoverable port are skipped."
        ),
    )
    prefer_port_name: str = Field(
        default="metrics",
        description="Preferred container port name to use when selecting a port.",
    )
    require_in_k8s: bool = Field(
        default=True,
        description="If True, return [] when not running inside Kubernetes.",
    )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find AIPerfBaseModel definition
fd -e py -x rg -l "class AIPerfBaseModel" {}
# Check where it's imported from
rg "from.*AIPerfBaseModel|import.*AIPerfBaseModel" --type=py -C1

Repository: ai-dynamo/aiperf

Length of output: 7860


🏁 Script executed:

#!/bin/bash
# Check if kubernetes.py is a service or utility module
head -50 src/aiperf/discovery/kubernetes.py | rg -E "^import|^from|^class|^async def|^def" -n
# Check what BaseModel it uses
rg "^from|^import" src/aiperf/discovery/kubernetes.py -n | head -20

Repository: ai-dynamo/aiperf

Length of output: 642


🏁 Script executed:

#!/bin/bash
# Look for other config classes using BaseModel in the discovery module
fd -path "src/aiperf/discovery/*.py" -type f -exec rg -l "class.*Config.*BaseModel" {} \;
# Also check how server_metrics configs are structured
fd -path "src/aiperf/server_metrics/*.py" -type f -exec rg -l "class.*Config.*BaseModel" {} \;

Repository: ai-dynamo/aiperf

Length of output: 397


🏁 Script executed:

#!/bin/bash
# Get AIPerfBaseModel definition
head -100 src/aiperf/common/models/base_models.py | cat -n

Repository: ai-dynamo/aiperf

Length of output: 937


🏁 Script executed:

#!/bin/bash
# Check what the kubernetes.py file looks like (is it a service or utility?)
wc -l src/aiperf/discovery/kubernetes.py
head -35 src/aiperf/discovery/kubernetes.py | cat -n

Repository: ai-dynamo/aiperf

Length of output: 1368


🏁 Script executed:

#!/bin/bash
# Look for other discovery config classes or similar patterns
find src/aiperf/discovery -name "*.py" -type f
# Also search for BaseModel usage in discovery directory
rg "class.*BaseModel" src/aiperf/discovery/ -n

Repository: ai-dynamo/aiperf

Length of output: 209


🏁 Script executed:

#!/bin/bash
# Check what BaseConfig is and how it's used
rg "class BaseConfig" src/aiperf -n -A3 | head -30
# Check examples of config classes using it
rg "BaseConfig" src/aiperf --type=py -l | head -5

Repository: ai-dynamo/aiperf

Length of output: 532


🏁 Script executed:

#!/bin/bash
# Verify the guideline applies to discovery module by checking imports
head -30 src/aiperf/timing/config.py | cat -n
# Compare with another config class that follows guidelines
rg "class.*Config.*AIPerfBaseModel" src/aiperf -n | head -5

Repository: ai-dynamo/aiperf

Length of output: 1617


🏁 Script executed:

#!/bin/bash
# Check if kubernetes.py discovery module is a service or utility module
# by looking at its usage/imports from other modules
rg "from.*kubernetes import|import.*kubernetes" src/aiperf --type=py | head -10
# Also check the discovery module structure
cat src/aiperf/discovery/__init__.py

Repository: ai-dynamo/aiperf

Length of output: 1137


🏁 Script executed:

#!/bin/bash
# Compare: how is BaseModel vs AIPerfBaseModel used for configs
# Check if there's any pattern about BaseModel-only config usage
rg "class.*Config.*\(BaseModel\)" src/aiperf --type=py | head -10

Repository: ai-dynamo/aiperf

Length of output: 207


🏁 Script executed:

#!/bin/bash
# Let's verify the actual usage pattern - check BaseConfig more closely
cat src/aiperf/common/config/base_config.py | head -30
# And check how it's actually used vs AIPerfBaseModel
rg "class.*\(BaseConfig\)" src/aiperf --type=py | head -5
rg "class.*\(AIPerfBaseModel\)" src/aiperf --type=py | wc -l

Repository: ai-dynamo/aiperf

Length of output: 1349


🏁 Script executed:

#!/bin/bash
# Let's understand the distinction: what does BaseConfig add vs BaseModel?
# Check if KubernetesDiscoveryConfig needs YAML serialization
rg "KubernetesDiscoveryConfig" src/aiperf --type=py -B2 -A2 | head -40
# Does kubernetes.py use any BaseConfig features?
grep -n "serialize\|yaml\|dict\|model_dump" src/aiperf/discovery/kubernetes.py

Repository: ai-dynamo/aiperf

Length of output: 2774


Use AIPerfBaseModel for data models.

The guidelines for src/**/*.py state: "Use AIPerfBaseModel for data models and BaseConfig for configuration." KubernetesDiscoveryConfig currently extends pydantic.BaseModel directly. Other data model classes throughout src/aiperf (e.g., TimingConfig, CreditPhaseConfig) use AIPerfBaseModel for consistency. Update the import and class definition to use AIPerfBaseModel.
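
A minimal sketch of the requested change (the AIPerfBaseModel import path is assumed from the base_models.py location found above):

from pydantic import Field

from aiperf.common.models.base_models import AIPerfBaseModel  # assumed import path

class KubernetesDiscoveryConfig(AIPerfBaseModel):
    namespace: str | None = Field(
        default=None,
        description="Namespace to search. None searches all namespaces.",
    )
    # ... remaining Field definitions unchanged ...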

🤖 Prompt for AI Agents
In `@src/aiperf/discovery/kubernetes.py` around lines 35 - 69,
KubernetesDiscoveryConfig currently inherits from pydantic.BaseModel; change it
to inherit from AIPerfBaseModel and update imports accordingly by replacing or
adding the import for AIPerfBaseModel (instead of BaseModel) so the class
definition reads "class KubernetesDiscoveryConfig(AIPerfBaseModel):" and remove
or stop using the direct BaseModel import; keep all existing Field definitions
and types unchanged.

Comment on lines 121 to 131
        # Run auto-discovery if enabled
        if self._discovery_mode != MetricsDiscoveryMode.DISABLED:
            discovered_urls = await self._run_metrics_discovery()
            added = 0
            for url in discovered_urls:
                normalized = normalize_metrics_endpoint_url(url)
                if normalized not in self._server_metrics_endpoints:
                    self._server_metrics_endpoints.append(normalized)
                    added += 1
            if added > 0:
                self.info(f"Server Metrics: Auto-discovery added {added} endpoint(s)")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find the normalize_metrics_endpoint_url function
rg -A 15 "def normalize_metrics_endpoint_url" src/

Repository: ai-dynamo/aiperf

Length of output: 1171


🏁 Script executed:

#!/bin/bash
# Find the _pod_to_url function to understand what paths it generates
rg -A 10 "def _pod_to_url|def _pod_to_endpoints" src/aiperf/server_metrics/

Repository: ai-dynamo/aiperf

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Find kubernetes.py to see the discovery implementation
fd "kubernetes.py" src/ --exec cat {}

Repository: ai-dynamo/aiperf

Length of output: 7136


🏁 Script executed:

#!/bin/bash
# Get the full normalize_metrics_endpoint_url function
rg -A 30 "def normalize_metrics_endpoint_url" src/aiperf/common/metric_utils.py

Repository: ai-dynamo/aiperf

Length of output: 1204


🏁 Script executed:

#!/bin/bash
# Get more lines of the function
rg -A 40 "def normalize_metrics_endpoint_url" src/aiperf/common/metric_utils.py

Repository: ai-dynamo/aiperf

Length of output: 1250


Discovered URLs with custom Prometheus paths are corrupted by normalization.

The normalize_metrics_endpoint_url function appends /metrics to any URL not ending with that suffix. However, discovered URLs from _pod_to_url are already fully qualified metrics endpoints—they include the complete path from the prometheus.io/path annotation or the configured default. Normalizing them treats them as base URLs, causing URLs with custom paths to be malformed (e.g., http://pod:8080/custom/stats becomes http://pod:8080/custom/stats/metrics).

Discovered URLs should bypass normalization since they are already complete:

Proposed fix
             discovered_urls = await self._run_metrics_discovery()
             added = 0
             for url in discovered_urls:
-                normalized = normalize_metrics_endpoint_url(url)
+                normalized = url  # discovered URLs are already fully qualified
                 if normalized not in self._server_metrics_endpoints:
                     self._server_metrics_endpoints.append(normalized)
                     added += 1
🤖 Prompt for AI Agents
In `@src/aiperf/server_metrics/manager.py` around lines 121 - 131, Discovered
metric endpoints returned by _run_metrics_discovery() are already full metrics
paths and must not be passed through normalize_metrics_endpoint_url; change the
loop in the auto-discovery block to append the discovered URL strings as-is
(deduplicating by exact string against self._server_metrics_endpoints) instead
of calling normalize_metrics_endpoint_url, so unique discovered endpoints (from
_pod_to_url/_run_metrics_discovery) are added intact and the info() log uses the
resulting added count.

Comment on lines 466 to 498
    async def _run_metrics_discovery(self) -> list[str]:
        """Run metrics endpoint auto-discovery based on configuration.

        Returns:
            List of discovered endpoint URLs.
        """
        if self._discovery_mode == MetricsDiscoveryMode.DISABLED:
            return []

        if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
            if not is_running_in_kubernetes():
                self.warning(
                    "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
                )
                return []
            self.info("Server Metrics: Running Kubernetes discovery...")
            cfg = KubernetesDiscoveryConfig(
                namespace=self._namespace,
                label_selector=self._label_selector,
            )
            return await discover_kubernetes_endpoints(cfg)

        # AUTO mode: try K8s if available
        if is_running_in_kubernetes():
            self.info("Server Metrics: Running Kubernetes auto-discovery...")
            cfg = KubernetesDiscoveryConfig(
                namespace=self._namespace,
                label_selector=self._label_selector,
            )
            return await discover_kubernetes_endpoints(cfg)

        self.debug("Server Metrics: Not in K8s, skipping auto-discovery")
        return []

🛠️ Refactor suggestion | 🟠 Major

Wrap discovery call in try/except for resilience.

discover_kubernetes_endpoints has internal guards, but an unexpected exception (e.g., from asyncio.to_thread, a kubernetes SDK change, or a malformed response) would propagate into _profile_configure_command and fail the entire configuration phase. A defensive catch here keeps discovery failures non-fatal.

Proposed fix
     async def _run_metrics_discovery(self) -> list[str]:
         """Run metrics endpoint auto-discovery based on configuration.

         Returns:
             List of discovered endpoint URLs.
         """
         if self._discovery_mode == MetricsDiscoveryMode.DISABLED:
             return []

-        if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
-            if not is_running_in_kubernetes():
-                self.warning(
-                    "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
-                )
-                return []
-            self.info("Server Metrics: Running Kubernetes discovery...")
-            cfg = KubernetesDiscoveryConfig(
-                namespace=self._namespace,
-                label_selector=self._label_selector,
-            )
-            return await discover_kubernetes_endpoints(cfg)
-
-        # AUTO mode: try K8s if available
-        if is_running_in_kubernetes():
-            self.info("Server Metrics: Running Kubernetes auto-discovery...")
-            cfg = KubernetesDiscoveryConfig(
-                namespace=self._namespace,
-                label_selector=self._label_selector,
-            )
-            return await discover_kubernetes_endpoints(cfg)
-
-        self.debug("Server Metrics: Not in K8s, skipping auto-discovery")
-        return []
+        try:
+            if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
+                if not is_running_in_kubernetes():
+                    self.warning(
+                        "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
+                    )
+                    return []
+                self.info("Server Metrics: Running Kubernetes discovery...")
+                cfg = KubernetesDiscoveryConfig(
+                    namespace=self._namespace,
+                    label_selector=self._label_selector,
+                )
+                return await discover_kubernetes_endpoints(cfg)
+
+            # AUTO mode: try K8s if available
+            if is_running_in_kubernetes():
+                self.info("Server Metrics: Running Kubernetes auto-discovery...")
+                cfg = KubernetesDiscoveryConfig(
+                    namespace=self._namespace,
+                    label_selector=self._label_selector,
+                )
+                return await discover_kubernetes_endpoints(cfg)
+
+            self.debug(lambda: "Server Metrics: Not in K8s, skipping auto-discovery")
+            return []
+        except Exception as e:
+            self.warning(f"Server Metrics: Auto-discovery failed: {e!r}")
+            return []

Also note: Line 497 uses self.debug("...") without a lambda wrapper, inconsistent with the rest of the file. As per coding guidelines, use lambda for expensive logs: self.debug(lambda: "Server Metrics: Not in K8s, skipping auto-discovery").

🤖 Prompt for AI Agents
In `@src/aiperf/server_metrics/manager.py` around lines 466 - 498, Wrap calls to
discover_kubernetes_endpoints inside _run_metrics_discovery in a try/except so
any unexpected exception (e.g., from asyncio.to_thread or the kubernetes SDK) is
caught, logged (use self.warning or self.error and include the exception
details) and an empty list is returned to avoid bubbling the error up to
_profile_configure_command; do this for both the explicit
MetricsDiscoveryMode.KUBERNETES branch and the AUTO branch where
is_running_in_kubernetes() is checked. Also change the final self.debug call to
use the lambda form (self.debug(lambda: "Server Metrics: Not in K8s, skipping
auto-discovery")) to match the file's logging guideline for potentially
expensive messages.

Comment on lines 70 to 94

⚠️ Potential issue | 🟡 Minor

Deduplication test doesn't exercise cross-pod deduplication.

_pod_to_url is called once per pod and returns a single URL. With only one mock pod in mock_list.return_value = [mock_pod], the set in discover_kubernetes_endpoints trivially has one element. The comment "Should only appear once despite matching both Dynamo and Prometheus" is misleading — the function doesn't produce multiple URLs per pod regardless.

To meaningfully test deduplication, supply two distinct pod objects that resolve to the same URL:

♻️ Suggested fix
     async def test_deduplicates_urls(self):
-        """Should deduplicate URLs found via multiple methods."""
-        mock_pod = MagicMock()
-        mock_pod.status.pod_ip = "10.1.2.3"
-        mock_pod.metadata.annotations = {
+        """Should deduplicate identical URLs from different pods."""
+        pod1 = MagicMock()
+        pod1.status.pod_ip = "10.1.2.3"
+        pod1.metadata.annotations = {
             "prometheus.io/port": "8080",
             "prometheus.io/scrape": "true",
         }
-        mock_pod.metadata.labels = {"nvidia.com/metrics-enabled": "true"}
-        mock_pod.spec.containers = []
+        pod1.metadata.labels = {"nvidia.com/metrics-enabled": "true"}
+        pod1.spec.containers = []
+
+        pod2 = MagicMock()
+        pod2.status.pod_ip = "10.1.2.3"
+        pod2.metadata.annotations = {
+            "prometheus.io/port": "8080",
+            "prometheus.io/scrape": "true",
+        }
+        pod2.metadata.labels = {}
+        pod2.spec.containers = []
 
         with (
             patch.dict("os.environ", {"KUBERNETES_SERVICE_HOST": "10.0.0.1"}),
             patch("kubernetes.config.load_incluster_config"),
             patch("kubernetes.client.CoreV1Api") as mock_api,
             patch("aiperf.discovery.kubernetes._list_running_pods") as mock_list,
         ):
-            mock_list.return_value = [mock_pod]
+            mock_list.return_value = [pod1, pod2]
             mock_api.return_value = MagicMock()
 
             urls = await discover_kubernetes_endpoints(KubernetesDiscoveryConfig())
 
-        # Should only appear once despite matching both Dynamo and Prometheus
+        # Same URL from two pods should appear exactly once
         assert urls.count("http://10.1.2.3:8080/metrics") == 1
🤖 Prompt for AI Agents
In `@tests/unit/discovery/test_kubernetes_discovery.py` around lines 70 - 94, The
test test_deduplicates_urls currently only provides one pod so deduplication
isn't exercised; update the fixture so mock_list.return_value contains two
distinct MagicMock pod objects that both map to the same URL (create a separate
duplicate_pod = MagicMock() copying the same status.pod_ip,
metadata.annotations/labels and spec.containers as mock_pod) so that
discover_kubernetes_endpoints and _pod_to_url are invoked twice and the
set-based deduplication is actually tested.

codecov bot commented Feb 6, 2026

Codecov Report

❌ Patch coverage is 64.37500% with 57 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
src/aiperf/discovery/kubernetes.py | 65.71% | 27 Missing and 9 partials ⚠️
src/aiperf/server_metrics/manager.py | 32.25% | 15 Missing and 6 partials ⚠️

📢 Thoughts on this report? Let us know!

@ardecode ardecode marked this pull request as draft February 6, 2026 00:43
@ardecode ardecode marked this pull request as ready for review February 6, 2026 00:46
@ardecode ardecode force-pushed the feat/k8s-metrics-discovery-630 branch from 1d80533 to 1592f6a on February 7, 2026 00:47
@ardecode ardecode marked this pull request as draft February 7, 2026 00:48
@ardecode ardecode force-pushed the feat/k8s-metrics-discovery-630 branch from 1592f6a to 62a37cc on February 7, 2026 01:12
@ardecode ardecode marked this pull request as ready for review February 7, 2026 01:13

Contributor

@ajcasagrande ajcasagrande left a comment

@ardecode It's looking really good so far! I got through most of it; there are a few files I may need to take a closer look at. Thanks!

Comment on lines 759 to 778

Albeit a long name, I'm thinking we may want to prefix everything with --server, so --server-metrics-discovery-mode. This will just make it obvious to the user that these options are server-metrics related, as the current names could confuse them.

In the near future, when we enable config v2 with yaml support, it will be much more streamlined:

server_metrics:
  discovery:
    type: kubernetes
    label_selector: app=vllm,env=prod
    namespace: inference

server_metrics:
  discovery: disabled

server_metrics:
  discovery: auto
or similar....

the type field is a discriminated union for pydantic schema validation.
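
For reference, a minimal sketch of what that discriminated union might look like in pydantic (class and field names are illustrative, and the bare-string forms like "discovery: auto" above would need separate coercion):

from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field

class KubernetesDiscovery(BaseModel):
    type: Literal["kubernetes"] = "kubernetes"
    label_selector: str | None = None
    namespace: str | None = None

class AutoDiscovery(BaseModel):
    type: Literal["auto"] = "auto"

class DisabledDiscovery(BaseModel):
    type: Literal["disabled"] = "disabled"

# Pydantic selects the concrete model from the "type" field during validation.
DiscoveryConfig = Annotated[
    Union[KubernetesDiscovery, AutoDiscovery, DisabledDiscovery],
    Field(discriminator="type"),
]

class ServerMetricsConfig(BaseModel):
    discovery: DiscoveryConfig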

Comment on lines +193 to +222
def _resolve_port(
    pod: V1Pod,
    annotation_port: str | None,
    preferred_port_name: str,
    fallback: int | None,
) -> int | None:
    """Resolve port from annotation -> named port -> first port -> fallback."""
    if annotation_port:
        try:
            return int(annotation_port)
        except ValueError:
            pass

    spec = getattr(pod, "spec", None)
    containers = getattr(spec, "containers", None) if spec else None
    if not containers:
        return fallback

    first_port = None
    for container in containers:
        for port_spec in getattr(container, "ports", None) or []:
            container_port = getattr(port_spec, "container_port", None)
            if not container_port:
                continue
            if first_port is None:
                first_port = int(container_port)
            if getattr(port_spec, "name", None) == preferred_port_name:
                return int(container_port)

    return first_port if first_port is not None else fallback

Is there any way this doesn't have to use all the getattrs? Or is that just a side effect of the kubernetes library?

Add automatic discovery of Prometheus /metrics endpoints in Kubernetes
environments by querying the K8s API for pods with specific labels and
annotations.

Discovery hierarchy (priority order):
1. Dynamo labels: nvidia.com/metrics-enabled: "true"
2. Prometheus annotations: prometheus.io/scrape: "true"
3. User-provided label selector via --metrics-label-selector

New CLI options:
- --metrics-discovery-mode: auto | kubernetes | disabled (default: auto)
- --metrics-label-selector: Label selector for custom deployments
- --metrics-namespace: Kubernetes namespace to search

Part of ai-dynamo#630

Signed-off-by: ardecode <desaiarijit@gmail.com>
@ardecode ardecode force-pushed the feat/k8s-metrics-discovery-630 branch from 62a37cc to 93a2c9b on February 8, 2026 22:28