feat: Add Kubernetes Metrics Endpoint Auto-Discovery #648
ardecode wants to merge 1 commit into ai-dynamo:main from
Conversation
Try out this PR

Quick install:
pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@93a2c9b6eefbfe9f5d651077061eccdb0f38e283

Recommended with virtual environment (using uv):
uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@93a2c9b6eefbfe9f5d651077061eccdb0f38e283

Last updated for commit:
Walkthrough

This PR introduces Kubernetes-based metrics endpoint discovery for the aiperf tool. It adds a new kubernetes API client dependency, exposes metrics discovery configuration options (mode, label selector, namespace), defines new discovery-related enums, implements asynchronous Kubernetes pod enumeration and URL construction, integrates discovery into the metrics manager, and provides comprehensive unit tests for the discovery logic.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In `@src/aiperf/discovery/kubernetes.py`:
- Around line 35-69: KubernetesDiscoveryConfig currently inherits from
pydantic.BaseModel; change it to inherit from AIPerfBaseModel and update imports
accordingly by replacing or adding the import for AIPerfBaseModel (instead of
BaseModel) so the class definition reads "class
KubernetesDiscoveryConfig(AIPerfBaseModel):" and remove or stop using the direct
BaseModel import; keep all existing Field definitions and types unchanged.
In `@src/aiperf/server_metrics/manager.py`:
- Around line 466-498: Wrap calls to discover_kubernetes_endpoints inside
_run_metrics_discovery in a try/except so any unexpected exception (e.g., from
asyncio.to_thread or the kubernetes SDK) is caught, logged (use self.warning or
self.error and include the exception details) and an empty list is returned to
avoid bubbling the error up to _profile_configure_command; do this for both the
explicit MetricsDiscoveryMode.KUBERNETES branch and the AUTO branch where
is_running_in_kubernetes() is checked. Also change the final self.debug call to
use the lambda form (self.debug(lambda: "Server Metrics: Not in K8s, skipping
auto-discovery")) to match the file's logging guideline for potentially
expensive messages.
- Around line 121-131: Discovered metric endpoints returned by
_run_metrics_discovery() are already full metrics paths and must not be passed
through normalize_metrics_endpoint_url; change the loop in the auto-discovery
block to append the discovered URL strings as-is (deduplicating by exact string
against self._server_metrics_endpoints) instead of calling
normalize_metrics_endpoint_url, so unique discovered endpoints (from
_pod_to_url/_run_metrics_discovery) are added intact and the info() log uses the
resulting added count.
In `@tests/unit/discovery/test_kubernetes_discovery.py`:
- Around line 70-94: The test test_deduplicates_urls currently only provides one
pod so deduplication isn't exercised; update the fixture so
mock_list.return_value contains two distinct MagicMock pod objects that both map
to the same URL (create a separate duplicate_pod = MagicMock() copying the same
status.pod_ip, metadata.annotations/labels and spec.containers as mock_pod) so
that discover_kubernetes_endpoints and _pod_to_url are invoked twice and the
set-based deduplication is actually tested.
🧹 Nitpick comments (5)
src/aiperf/common/config/user_config.py (1)
567-613: Consider validating K8s-specific options against `metrics_discovery_mode`.

The `metrics_label_selector` and `metrics_namespace` fields are only meaningful when discovery mode is `auto` or `kubernetes`. This file already validates unused options elsewhere (see `validate_unused_options` at Lines 265-320, which raises errors for options set without their required companions). For consistency with that pattern, consider adding a validator that warns or errors when `metrics_label_selector` or `metrics_namespace` is explicitly set but `metrics_discovery_mode` is `DISABLED`:

♻️ Suggested validator

```python
@model_validator(mode="after")
def validate_metrics_discovery_options(self) -> Self:
    if self.metrics_discovery_mode == MetricsDiscoveryMode.DISABLED:
        k8s_options = []
        if "metrics_label_selector" in self.model_fields_set:
            k8s_options.append("--metrics-label-selector")
        if "metrics_namespace" in self.model_fields_set:
            k8s_options.append("--metrics-namespace")
        if k8s_options:
            raise ValueError(
                f"{', '.join(k8s_options)} can only be used when "
                "--metrics-discovery-mode is 'auto' or 'kubernetes'."
            )
    return self
```

tests/unit/discovery/test_kubernetes_discovery.py (1)
16-22: Missing negative test case for `is_running_in_kubernetes`.

Only the `True` case is tested. Adding a `False` test provides complete coverage for this function and guards against regressions (e.g., someone changing the env var name).

♻️ Suggested addition

```python
def test_returns_false_when_service_host_not_set(self):
    """Should return False when KUBERNETES_SERVICE_HOST is not set."""
    with patch.dict("os.environ", {}, clear=True):
        assert is_running_in_kubernetes() is False
```

pyproject.toml (1)
60-60: Add upper bound to kubernetes version constraint; consider making it optional.

The blocking I/O concern is already properly handled in the discovery module: all kubernetes API calls are wrapped with `asyncio.to_thread()` (see `_load_core_api_blocking` and `_list_running_pods_blocking`), so the event loop won't be blocked. Two remaining improvements:

- Version constraint: `kubernetes>=31.0.0` has no upper bound, unlike other dependencies (e.g., `aiohttp~=3.13.3`). Add an upper bound like `kubernetes>=31.0.0,<37` or use `kubernetes~=31.0` to prevent unexpected breaking changes in future major releases.
- Optional dependency: Consider moving this to `[project.optional-dependencies]`, since K8s discovery is guarded by `require_in_k8s` and can be disabled, so users who don't need it won't incur the install overhead.

src/aiperf/discovery/kubernetes.py (2)
131-143: In large clusters, listing all running pods without a label selector can be expensive.

When `label_selector` is `None` (and `namespace` is also `None`), this fetches every running pod across all namespaces. The client-side `_is_eligible` filter then discards most of them. For clusters with thousands of pods, the API response could be very large. Consider applying a server-side OR-style filter (e.g., always include the Dynamo label in the selector) or documenting that `--metrics-label-selector` / `--metrics-namespace` is recommended for large clusters.
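For illustration, a minimal sketch of the server-side narrowing this comment recommends. The helper name `list_candidate_pods` is hypothetical; `list_namespaced_pod`, `list_pod_for_all_namespaces`, and their `label_selector`/`field_selector` keyword arguments are part of the official kubernetes Python client:

```python
from kubernetes import client


def list_candidate_pods(
    api: client.CoreV1Api,
    namespace: str | None = None,
    label_selector: str | None = None,
) -> list:
    """Sketch: push filtering to the API server where possible."""
    kwargs: dict = {"field_selector": "status.phase=Running"}
    if label_selector:
        # Server-side: only pods matching the user-provided selector are returned.
        kwargs["label_selector"] = label_selector
    if namespace:
        return api.list_namespaced_pod(namespace, **kwargs).items
    # Without a namespace, this still walks the whole cluster.
    return api.list_pod_for_all_namespaces(**kwargs).items
```

Passing the selector at list time keeps the response proportional to the matching pods rather than to the whole cluster.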
174-185: Eligibility fallback is subtly correct but could use a more explicit comment.

Line 185 returns `True` for any pod when `label_selector` is set, because the server already filtered to matching pods. The existing comment ("label_selector is already applied server-side; this is a fallback gate") is helpful, but a reader might still wonder why a pod without Dynamo/Prometheus markers is eligible. Consider expanding slightly:

```diff
 # label_selector is already applied server-side; this is a fallback gate.
+# All pods returned by the K8s API already matched the user's label_selector,
+# so they are considered eligible even without Dynamo or Prometheus markers.
 return config.label_selector is not None
```
```python
class KubernetesDiscoveryConfig(BaseModel):
    namespace: str | None = Field(
        default=None,
        description="Namespace to search. None searches all namespaces.",
    )
    label_selector: str | None = Field(
        default=None,
        description=(
            "Optional Kubernetes label selector applied server-side at list-time. "
            "Used as fallback eligibility for non-Dynamo/non-annotated pods."
        ),
    )
    default_scheme: str = Field(
        default="http",
        description="Default scheme if prometheus.io/scheme is not set.",
    )
    default_path: str = Field(
        default="/metrics",
        description="Default path if prometheus.io/path is not set.",
    )
    default_port: int | None = Field(
        default=None,
        description=(
            "Fallback port if prometheus.io/port is absent and no container port is found. "
            "If None, pods without a discoverable port are skipped."
        ),
    )
    prefer_port_name: str = Field(
        default="metrics",
        description="Preferred container port name to use when selecting a port.",
    )
    require_in_k8s: bool = Field(
        default=True,
        description="If True, return [] when not running inside Kubernetes.",
    )
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find AIPerfBaseModel definition
fd -e py -x rg -l "class AIPerfBaseModel" {}
# Check where it's imported from
rg "from.*AIPerfBaseModel|import.*AIPerfBaseModel" --type=py -C1Repository: ai-dynamo/aiperf
Length of output: 7860
🏁 Script executed:
#!/bin/bash
# Check if kubernetes.py is a service or utility module
head -50 src/aiperf/discovery/kubernetes.py | rg -E "^import|^from|^class|^async def|^def" -n
# Check what BaseModel it uses
rg "^from|^import" src/aiperf/discovery/kubernetes.py -n | head -20Repository: ai-dynamo/aiperf
Length of output: 642
🏁 Script executed:
#!/bin/bash
# Look for other config classes using BaseModel in the discovery module
fd -path "src/aiperf/discovery/*.py" -type f -exec rg -l "class.*Config.*BaseModel" {} \;
# Also check how server_metrics configs are structured
fd -path "src/aiperf/server_metrics/*.py" -type f -exec rg -l "class.*Config.*BaseModel" {} \;Repository: ai-dynamo/aiperf
Length of output: 397
🏁 Script executed:
#!/bin/bash
# Get AIPerfBaseModel definition
head -100 src/aiperf/common/models/base_models.py | cat -n

Repository: ai-dynamo/aiperf
Length of output: 937
🏁 Script executed:
#!/bin/bash
# Check what the kubernetes.py file looks like (is it a service or utility?)
wc -l src/aiperf/discovery/kubernetes.py
head -35 src/aiperf/discovery/kubernetes.py | cat -n

Repository: ai-dynamo/aiperf
Length of output: 1368
🏁 Script executed:
#!/bin/bash
# Look for other discovery config classes or similar patterns
find src/aiperf/discovery -name "*.py" -type f
# Also search for BaseModel usage in discovery directory
rg "class.*BaseModel" src/aiperf/discovery/ -nRepository: ai-dynamo/aiperf
Length of output: 209
🏁 Script executed:
#!/bin/bash
# Check what BaseConfig is and how it's used
rg "class BaseConfig" src/aiperf -n -A3 | head -30
# Check examples of config classes using it
rg "BaseConfig" src/aiperf --type=py -l | head -5Repository: ai-dynamo/aiperf
Length of output: 532
🏁 Script executed:
#!/bin/bash
# Verify the guideline applies to discovery module by checking imports
head -30 src/aiperf/timing/config.py | cat -n
# Compare with another config class that follows guidelines
rg "class.*Config.*AIPerfBaseModel" src/aiperf -n | head -5Repository: ai-dynamo/aiperf
Length of output: 1617
🏁 Script executed:
#!/bin/bash
# Check if kubernetes.py discovery module is a service or utility module
# by looking at its usage/imports from other modules
rg "from.*kubernetes import|import.*kubernetes" src/aiperf --type=py | head -10
# Also check the discovery module structure
cat src/aiperf/discovery/__init__.py

Repository: ai-dynamo/aiperf
Length of output: 1137
🏁 Script executed:
#!/bin/bash
# Compare: how is BaseModel vs AIPerfBaseModel used for configs
# Check if there's any pattern about BaseModel-only config usage
rg "class.*Config.*\(BaseModel\)" src/aiperf --type=py | head -10Repository: ai-dynamo/aiperf
Length of output: 207
🏁 Script executed:
#!/bin/bash
# Let's verify the actual usage pattern - check BaseConfig more closely
cat src/aiperf/common/config/base_config.py | head -30
# And check how it's actually used vs AIPerfBaseModel
rg "class.*\(BaseConfig\)" src/aiperf --type=py | head -5
rg "class.*\(AIPerfBaseModel\)" src/aiperf --type=py | wc -lRepository: ai-dynamo/aiperf
Length of output: 1349
🏁 Script executed:
#!/bin/bash
# Let's understand the distinction: what does BaseConfig add vs BaseModel?
# Check if KubernetesDiscoveryConfig needs YAML serialization
rg "KubernetesDiscoveryConfig" src/aiperf --type=py -B2 -A2 | head -40
# Does kubernetes.py use any BaseConfig features?
grep -n "serialize\|yaml\|dict\|model_dump" src/aiperf/discovery/kubernetes.pyRepository: ai-dynamo/aiperf
Length of output: 2774
Use AIPerfBaseModel for data models.
The guidelines for src/**/*.py state: "Use AIPerfBaseModel for data models and BaseConfig for configuration." KubernetesDiscoveryConfig currently extends pydantic.BaseModel directly. Other data model classes throughout src/aiperf (e.g., TimingConfig, CreditPhaseConfig) use AIPerfBaseModel for consistency. Update the import and class definition to use AIPerfBaseModel.
🤖 Prompt for AI Agents
In `@src/aiperf/discovery/kubernetes.py` around lines 35 - 69,
KubernetesDiscoveryConfig currently inherits from pydantic.BaseModel; change it
to inherit from AIPerfBaseModel and update imports accordingly by replacing or
adding the import for AIPerfBaseModel (instead of BaseModel) so the class
definition reads "class KubernetesDiscoveryConfig(AIPerfBaseModel):" and remove
or stop using the direct BaseModel import; keep all existing Field definitions
and types unchanged.
```python
# Run auto-discovery if enabled
if self._discovery_mode != MetricsDiscoveryMode.DISABLED:
    discovered_urls = await self._run_metrics_discovery()
    added = 0
    for url in discovered_urls:
        normalized = normalize_metrics_endpoint_url(url)
        if normalized not in self._server_metrics_endpoints:
            self._server_metrics_endpoints.append(normalized)
            added += 1
    if added > 0:
        self.info(f"Server Metrics: Auto-discovery added {added} endpoint(s)")
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Find the normalize_metrics_endpoint_url function
rg -A 15 "def normalize_metrics_endpoint_url" src/Repository: ai-dynamo/aiperf
Length of output: 1171
🏁 Script executed:
#!/bin/bash
# Find the _pod_to_url function to understand what paths it generates
rg -A 10 "def _pod_to_url|def _pod_to_endpoints" src/aiperf/server_metrics/Repository: ai-dynamo/aiperf
Length of output: 42
🏁 Script executed:
#!/bin/bash
# Find kubernetes.py to see the discovery implementation
fd "kubernetes.py" src/ --exec cat {}Repository: ai-dynamo/aiperf
Length of output: 7136
🏁 Script executed:
#!/bin/bash
# Get the full normalize_metrics_endpoint_url function
rg -A 30 "def normalize_metrics_endpoint_url" src/aiperf/common/metric_utils.pyRepository: ai-dynamo/aiperf
Length of output: 1204
🏁 Script executed:
#!/bin/bash
# Get more lines of the function
rg -A 40 "def normalize_metrics_endpoint_url" src/aiperf/common/metric_utils.pyRepository: ai-dynamo/aiperf
Length of output: 1250
Discovered URLs with custom Prometheus paths are corrupted by normalization.
The normalize_metrics_endpoint_url function appends /metrics to any URL not ending with that suffix. However, discovered URLs from _pod_to_url are already fully qualified metrics endpoints—they include the complete path from the prometheus.io/path annotation or the configured default. Normalizing them treats them as base URLs, causing URLs with custom paths to be malformed (e.g., http://pod:8080/custom/stats → http://pod:8080/custom/stats/metrics).
Discovered URLs should bypass normalization since they are already complete:
Proposed fix

```diff
 discovered_urls = await self._run_metrics_discovery()
 added = 0
 for url in discovered_urls:
-    normalized = normalize_metrics_endpoint_url(url)
+    normalized = url  # discovered URLs are already fully qualified
     if normalized not in self._server_metrics_endpoints:
         self._server_metrics_endpoints.append(normalized)
         added += 1
```

🤖 Prompt for AI Agents
In `@src/aiperf/server_metrics/manager.py` around lines 121 - 131, Discovered
metric endpoints returned by _run_metrics_discovery() are already full metrics
paths and must not be passed through normalize_metrics_endpoint_url; change the
loop in the auto-discovery block to append the discovered URL strings as-is
(deduplicating by exact string against self._server_metrics_endpoints) instead
of calling normalize_metrics_endpoint_url, so unique discovered endpoints (from
_pod_to_url/_run_metrics_discovery) are added intact and the info() log uses the
resulting added count.
```python
async def _run_metrics_discovery(self) -> list[str]:
    """Run metrics endpoint auto-discovery based on configuration.

    Returns:
        List of discovered endpoint URLs.
    """
    if self._discovery_mode == MetricsDiscoveryMode.DISABLED:
        return []

    if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
        if not is_running_in_kubernetes():
            self.warning(
                "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
            )
            return []
        self.info("Server Metrics: Running Kubernetes discovery...")
        cfg = KubernetesDiscoveryConfig(
            namespace=self._namespace,
            label_selector=self._label_selector,
        )
        return await discover_kubernetes_endpoints(cfg)

    # AUTO mode: try K8s if available
    if is_running_in_kubernetes():
        self.info("Server Metrics: Running Kubernetes auto-discovery...")
        cfg = KubernetesDiscoveryConfig(
            namespace=self._namespace,
            label_selector=self._label_selector,
        )
        return await discover_kubernetes_endpoints(cfg)

    self.debug("Server Metrics: Not in K8s, skipping auto-discovery")
    return []
```
🛠️ Refactor suggestion | 🟠 Major
Wrap discovery call in try/except for resilience.
discover_kubernetes_endpoints has internal guards, but an unexpected exception (e.g., from asyncio.to_thread, a kubernetes SDK change, or a malformed response) would propagate into _profile_configure_command and fail the entire configuration phase. A defensive catch here keeps discovery failures non-fatal.
Proposed fix

```diff
 async def _run_metrics_discovery(self) -> list[str]:
     """Run metrics endpoint auto-discovery based on configuration.

     Returns:
         List of discovered endpoint URLs.
     """
     if self._discovery_mode == MetricsDiscoveryMode.DISABLED:
         return []

-    if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
-        if not is_running_in_kubernetes():
-            self.warning(
-                "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
-            )
-            return []
-        self.info("Server Metrics: Running Kubernetes discovery...")
-        cfg = KubernetesDiscoveryConfig(
-            namespace=self._namespace,
-            label_selector=self._label_selector,
-        )
-        return await discover_kubernetes_endpoints(cfg)
-
-    # AUTO mode: try K8s if available
-    if is_running_in_kubernetes():
-        self.info("Server Metrics: Running Kubernetes auto-discovery...")
-        cfg = KubernetesDiscoveryConfig(
-            namespace=self._namespace,
-            label_selector=self._label_selector,
-        )
-        return await discover_kubernetes_endpoints(cfg)
-
-    self.debug("Server Metrics: Not in K8s, skipping auto-discovery")
-    return []
+    try:
+        if self._discovery_mode == MetricsDiscoveryMode.KUBERNETES:
+            if not is_running_in_kubernetes():
+                self.warning(
+                    "Server Metrics: Kubernetes discovery requested but not in K8s cluster"
+                )
+                return []
+            self.info("Server Metrics: Running Kubernetes discovery...")
+            cfg = KubernetesDiscoveryConfig(
+                namespace=self._namespace,
+                label_selector=self._label_selector,
+            )
+            return await discover_kubernetes_endpoints(cfg)
+
+        # AUTO mode: try K8s if available
+        if is_running_in_kubernetes():
+            self.info("Server Metrics: Running Kubernetes auto-discovery...")
+            cfg = KubernetesDiscoveryConfig(
+                namespace=self._namespace,
+                label_selector=self._label_selector,
+            )
+            return await discover_kubernetes_endpoints(cfg)
+
+        self.debug(lambda: "Server Metrics: Not in K8s, skipping auto-discovery")
+        return []
+    except Exception as e:
+        self.warning(f"Server Metrics: Auto-discovery failed: {e!r}")
+        return []
```

Also note: Line 497 uses self.debug("...") without a lambda wrapper, inconsistent with the rest of the file. As per coding guidelines, use lambda for expensive logs: self.debug(lambda: "Server Metrics: Not in K8s, skipping auto-discovery").
🤖 Prompt for AI Agents
In `@src/aiperf/server_metrics/manager.py` around lines 466 - 498, Wrap calls to
discover_kubernetes_endpoints inside _run_metrics_discovery in a try/except so
any unexpected exception (e.g., from asyncio.to_thread or the kubernetes SDK) is
caught, logged (use self.warning or self.error and include the exception
details) and an empty list is returned to avoid bubbling the error up to
_profile_configure_command; do this for both the explicit
MetricsDiscoveryMode.KUBERNETES branch and the AUTO branch where
is_running_in_kubernetes() is checked. Also change the final self.debug call to
use the lambda form (self.debug(lambda: "Server Metrics: Not in K8s, skipping
auto-discovery")) to match the file's logging guideline for potentially
expensive messages.
Deduplication test doesn't exercise cross-pod deduplication.
_pod_to_url is called once per pod and returns a single URL. With only one mock pod in mock_list.return_value = [mock_pod], the set in discover_kubernetes_endpoints trivially has one element. The comment "Should only appear once despite matching both Dynamo and Prometheus" is misleading — the function doesn't produce multiple URLs per pod regardless.
To meaningfully test deduplication, supply two distinct pod objects that resolve to the same URL:
♻️ Suggested fix

```diff
 async def test_deduplicates_urls(self):
-    """Should deduplicate URLs found via multiple methods."""
-    mock_pod = MagicMock()
-    mock_pod.status.pod_ip = "10.1.2.3"
-    mock_pod.metadata.annotations = {
+    """Should deduplicate identical URLs from different pods."""
+    pod1 = MagicMock()
+    pod1.status.pod_ip = "10.1.2.3"
+    pod1.metadata.annotations = {
         "prometheus.io/port": "8080",
         "prometheus.io/scrape": "true",
     }
-    mock_pod.metadata.labels = {"nvidia.com/metrics-enabled": "true"}
-    mock_pod.spec.containers = []
+    pod1.metadata.labels = {"nvidia.com/metrics-enabled": "true"}
+    pod1.spec.containers = []
+
+    pod2 = MagicMock()
+    pod2.status.pod_ip = "10.1.2.3"
+    pod2.metadata.annotations = {
+        "prometheus.io/port": "8080",
+        "prometheus.io/scrape": "true",
+    }
+    pod2.metadata.labels = {}
+    pod2.spec.containers = []

     with (
         patch.dict("os.environ", {"KUBERNETES_SERVICE_HOST": "10.0.0.1"}),
         patch("kubernetes.config.load_incluster_config"),
         patch("kubernetes.client.CoreV1Api") as mock_api,
         patch("aiperf.discovery.kubernetes._list_running_pods") as mock_list,
     ):
-        mock_list.return_value = [mock_pod]
+        mock_list.return_value = [pod1, pod2]
         mock_api.return_value = MagicMock()

         urls = await discover_kubernetes_endpoints(KubernetesDiscoveryConfig())

-        # Should only appear once despite matching both Dynamo and Prometheus
+        # Same URL from two pods should appear exactly once
         assert urls.count("http://10.1.2.3:8080/metrics") == 1
```

🤖 Prompt for AI Agents
In `@tests/unit/discovery/test_kubernetes_discovery.py` around lines 70 - 94, The
test test_deduplicates_urls currently only provides one pod so deduplication
isn't exercised; update the fixture so mock_list.return_value contains two
distinct MagicMock pod objects that both map to the same URL (create a separate
duplicate_pod = MagicMock() copying the same status.pod_ip,
metadata.annotations/labels and spec.containers as mock_pod) so that
discover_kubernetes_endpoints and _pod_to_url are invoked twice and the
set-based deduplication is actually tested.
Codecov Report

❌ Patch coverage is
Force-pushed: 1d80533 to 1592f6a
Force-pushed: 1592f6a to 62a37cc
ajcasagrande left a comment
@ardecode It's looking really good so far! I got through most of it; a few files I may need to take a closer look at. Thanks!
docs/cli_options.md
Outdated
albeit the long name, I'm thinking we may want to prefix everything with --server so --server-metrics-discovery-mode. This will just make it obvious to the user they are server metrics related, as it could confuse them.
In the near future, when we enable config v2 with yaml support, it will be much more streamlined:
```yaml
server_metrics:
  discovery:
    type: kubernetes
    label_selector: app=vllm,env=prod
    namespace: inference
```

```yaml
server_metrics:
  discovery: disabled
```

```yaml
server_metrics:
  discovery: auto
```

or similar....
the type field is a discriminated union for pydantic schema validation.
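Roughly, that discriminated union could look like the sketch below (pydantic v2; all class and field names here are hypothetical, not part of this PR):

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class KubernetesDiscovery(BaseModel):
    type: Literal["kubernetes"] = "kubernetes"
    label_selector: str | None = None
    namespace: str | None = None


class AutoDiscovery(BaseModel):
    type: Literal["auto"] = "auto"


class DisabledDiscovery(BaseModel):
    type: Literal["disabled"] = "disabled"


# pydantic picks the concrete model based on the "type" field.
DiscoveryConfig = Annotated[
    Union[KubernetesDiscovery, AutoDiscovery, DisabledDiscovery],
    Field(discriminator="type"),
]


class ServerMetricsConfig(BaseModel):
    discovery: DiscoveryConfig = Field(default_factory=AutoDiscovery)
```

Note the bare-string shorthands in the YAML above (discovery: disabled / discovery: auto) would still need a small before-validator to coerce the string into the corresponding model.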
```python
def _resolve_port(
    pod: V1Pod,
    annotation_port: str | None,
    preferred_port_name: str,
    fallback: int | None,
) -> int | None:
    """Resolve port from annotation -> named port -> first port -> fallback."""
    if annotation_port:
        try:
            return int(annotation_port)
        except ValueError:
            pass

    spec = getattr(pod, "spec", None)
    containers = getattr(spec, "containers", None) if spec else None
    if not containers:
        return fallback

    first_port = None
    for container in containers:
        for port_spec in getattr(container, "ports", None) or []:
            container_port = getattr(port_spec, "container_port", None)
            if not container_port:
                continue
            if first_port is None:
                first_port = int(container_port)
            if getattr(port_spec, "name", None) == preferred_port_name:
                return int(container_port)

    return first_port if first_port is not None else fallback
```
Is there any way this doesn't have to use all the getattrs? Or is that just a side effect of the kubernetes library?
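For comparison, a sketch of the same resolution logic without getattr. It assumes the kubernetes client's typed models (V1Pod, V1Container, V1ContainerPort) expose these attributes as possibly-None properties, which is how the generated client models behave:

```python
from kubernetes.client import V1Pod


def _resolve_port_without_getattr(
    pod: V1Pod,
    annotation_port: str | None,
    preferred_port_name: str,
    fallback: int | None,
) -> int | None:
    """Sketch only: plain attribute access with None checks instead of getattr."""
    if annotation_port:
        try:
            return int(annotation_port)
        except ValueError:
            pass

    # spec and containers may be None on the typed models, so guard for None
    # rather than for missing attributes.
    containers = pod.spec.containers if pod.spec and pod.spec.containers else []
    first_port = None
    for container in containers:
        for port_spec in container.ports or []:
            if not port_spec.container_port:
                continue
            if first_port is None:
                first_port = int(port_spec.container_port)
            if port_spec.name == preferred_port_name:
                return int(port_spec.container_port)

    return first_port if first_port is not None else fallback
```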
Add automatic discovery of Prometheus /metrics endpoints in Kubernetes environments by querying the K8s API for pods with specific labels and annotations.

Discovery hierarchy (priority order):
1. Dynamo labels: nvidia.com/metrics-enabled: "true"
2. Prometheus annotations: prometheus.io/scrape: "true"
3. User-provided label selector via --metrics-label-selector

New CLI options:
- --metrics-discovery-mode: auto | kubernetes | disabled (default: auto)
- --metrics-label-selector: Label selector for custom deployments
- --metrics-namespace: Kubernetes namespace to search

Part of ai-dynamo#630

Signed-off-by: ardecode <desaiarijit@gmail.com>
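A minimal sketch of the eligibility hierarchy this commit message describes; the helper name and exact checks are illustrative, and the PR's actual _is_eligible implementation in src/aiperf/discovery/kubernetes.py may differ:

```python
def is_pod_eligible(pod, label_selector: str | None) -> bool:
    """Sketch of the priority order: Dynamo label, then Prometheus annotation,
    then the user-provided label selector as a fallback gate."""
    labels = pod.metadata.labels or {}
    annotations = pod.metadata.annotations or {}
    # 1. Dynamo label marks the pod explicitly.
    if labels.get("nvidia.com/metrics-enabled") == "true":
        return True
    # 2. Standard Prometheus scrape annotation.
    if annotations.get("prometheus.io/scrape") == "true":
        return True
    # 3. Fallback: pods were already filtered server-side by --metrics-label-selector.
    return label_selector is not None
```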
Force-pushed: 62a37cc to 93a2c9b
Summary

- Auto-discovery of /metrics endpoints in Kubernetes environments

New CLI Options

- --metrics-discovery-mode: auto | kubernetes | disabled (default: auto)
- --metrics-label-selector: Label selector for custom deployments
- --metrics-namespace: Kubernetes namespace to search

Discovery Hierarchy

1. nvidia.com/metrics-enabled: "true"
2. prometheus.io/scrape: "true"

Test plan
Part of #630
Summary by CodeRabbit