[Dashboard] Support autoscaler v2 for cluster-level node metrics #60504
Conversation
Code Review
The pull request successfully integrates Autoscaler v2 support for cluster-level node metrics, including the new cluster_idle_nodes gauge. The changes correctly branch logic based on the autoscaler version, ensuring backward compatibility with v1 while enabling new v2 features. The new test cases adequately cover both v1 and v2 scenarios, verifying the correct metric emission and data handling. However, there is a critical issue regarding a blocking call within an asynchronous context that needs to be addressed.
Force-pushed from f70738b to 6a646d3
@sampan-s-nayak @rueian PTAL
```python
# Autoscaler v2 only - get cluster_status from gcs via RPC(get_cluster_status())
autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=self._gcs_client)
```
should we fetch this once and make it a member variable so that we can keep referring to it? (we can consider doing this if it is not possible to switch between autoscaler v1 and v2 once a ray cluster has started)
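For illustration, the one-time caching could look roughly like this (a minimal sketch: the constructor shape is hypothetical and the import path is an assumption, while the `is_autoscaler_v2` call itself comes from the PR diff):

```python
from ray.autoscaler.v2.utils import is_autoscaler_v2  # import path assumed

class ReporterAgent:
    def __init__(self, gcs_client):
        self._gcs_client = gcs_client
        # Resolve the autoscaler version once at startup and reuse it on
        # every stats cycle, assuming it cannot change at runtime.
        self._autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=gcs_client)
```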
Hello, thank you for the review.
I've confirmed that the autoscaler version is determined by environment variables set before the process starts, and I haven't found any logic supporting runtime v1↔v2 switching in my investigation so far.
However, I'm slightly concerned about edge cases where one-time caching could lead to incorrect branching:
- Custom deployments running multiple autoscaler components (e.g., built-in autoscaler process + separate autoscaler pod)
- Transient detection inconsistencies during initial boot
This is an edge case, but I’m not sure what’s best here.
cc: @rueian
```python
# Autoscaler v2 only - get cluster_status from gcs via RPC(get_cluster_status())
autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=self._gcs_client)
if self._is_head_node and autoscaler_v2_enabled:
    cluster_stats = await asyncio.to_thread(self._get_cluster_stats_v2)
```
Could we, if possible, run this on self._executor instead of spawning a new thread every time?
Since _async_compose_stats_payload() already executes inside self._executor (via _run_in_executor()), and RAY_DASHBOARD_REPORTER_AGENT_TPE_MAX_WORKERS defaults to 1, submitting a nested task to the same executor would make it wait indefinitely for itself to finish.
That is why I used asyncio.to_thread to offload the work to a separate pool.
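To illustrate the hazard with a standalone toy example (not code from this PR): with max_workers=1, a task that submits to its own executor and blocks on the nested result is waiting on itself:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=1)

def inner():
    return "inner ran"

def outer():
    nested = executor.submit(inner)  # queued behind outer() itself
    try:
        # The only worker is busy running outer(), so inner() cannot start
        # while we wait here; without a timeout this would block forever.
        return nested.result(timeout=2)
    except TimeoutError:
        return "deadlock: nested task never started"

print(executor.submit(outer).result())  # -> "deadlock: nested task never started"
executor.shutdown()
```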
Oh, in that case do we really need to offload work here? We can let self._get_cluster_stats_v2 continue running on this thread. Because of the GIL, I don't think we would get enough benefit to justify offloading this piece of work.
Initially I submitted this as a sync RPC call, but an automated reviewer warned that a blocking sync call inside an async method could block the main event loop, so I updated it and force-pushed.
However, I've confirmed that the get_cluster_status() RPC call releases the GIL (with nogil), so it should be fine to call _get_cluster_stats_v2() synchronously here.
I'll update the PR to call _get_cluster_stats_v2 synchronously.
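The simplified branch would then look roughly like this (an excerpt-style sketch of the intended final shape, not the exact PR code):

```python
# Sketch: _async_compose_stats_payload() already runs on self._executor, and
# the underlying get_cluster_status() RPC releases the GIL, so the blocking
# call can stay synchronous without stalling the event loop.
if self._is_head_node and autoscaler_v2_enabled:
    cluster_stats = self._get_cluster_stats_v2()
```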
Force-pushed from df71613 to 4d35dcb
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Force-pushed from 4d35dcb to 44c78f4
I rebased incorrectly and the PR history got polluted. I force-pushed a cleaned-up branch. Sorry about that.
Description
This PR makes Dashboard ReporterAgent work correctly with Autoscaler v2 for cluster-level node metrics (`cluster_*_nodes`).

Previously, ReporterAgent relied on the v1 autoscaler's JSON debug payload stored in GCS internal KV under `DEBUG_AUTOSCALING_STATUS` to compute metrics like `cluster_active_nodes`, `cluster_pending_nodes`, and `cluster_failed_nodes`. However, Autoscaler v2 doesn't populate that KV key, so when v2 is enabled these metrics end up missing or always empty.

To fix that, when Autoscaler v2 is enabled the legacy KV path is skipped and cluster status is fetched via RPC (`get_cluster_status()`). The result is then reshaped into the `cluster_stats` format that `_to_records()` already expects and sent through the same metrics pipeline.

On top of that, Autoscaler v2 introduces an extra node state, idle, so this PR also adds a `cluster_idle_nodes` gauge. This metric is emitted only when v2 is enabled, so it won't affect v1 behavior.

Related issues
Closes: #59930
Additional information
Implementation details
ReporterAgent branches based on whether Autoscaler v2 is enabled:
- v1: legacy path (reads `DEBUG_AUTOSCALING_STATUS` from GCS internal KV)
- v2: skips `DEBUG_AUTOSCALING_STATUS` and fetches cluster status using `get_cluster_status()`

For the v2 path, the fetched status is normalized to match the shape expected by `_to_records()`, as sketched after this list:
- `active_nodes`, `idle_nodes`: counted by `ray_node_type_name`
- `pending_nodes`: converted into a list of `(ip, node_type, details)` tuples
- `failed_nodes`: converted into a list of `(ip, node_type)` tuples
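A hedged sketch of that normalization (the status attribute names follow the PR description and are assumptions rather than the exact Ray schema; the outer nesting that `_to_records()` expects is simplified away):

```python
from collections import Counter

def normalize_v2_status(status):
    """Reshape a get_cluster_status() result into the cluster_stats layout.

    Attribute names (active_nodes, idle_nodes, pending_nodes, failed_nodes,
    ray_node_type_name, ip_address, details) are taken from the PR
    description and may differ from the actual ClusterStatus schema.
    """
    active = Counter(n.ray_node_type_name for n in status.active_nodes)
    idle = Counter(n.ray_node_type_name for n in status.idle_nodes)
    pending = [
        (n.ip_address, n.ray_node_type_name, n.details)
        for n in status.pending_nodes
    ]
    failed = [(n.ip_address, n.ray_node_type_name) for n in status.failed_nodes]
    return {
        "active_nodes": dict(active),
        "idle_nodes": dict(idle),
        "pending_nodes": pending,
        "failed_nodes": failed,
    }
```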
New metric: `cluster_idle_nodes`
- Adds `cluster_idle_nodes` to expose the new Autoscaler v2 node state `idle_nodes`.
- Gated by `has_autoscaler_v2_stats`, so `cluster_idle_nodes` is emitted only when Autoscaler v2 is on.
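A rough sketch of that gating (all names here are stand-ins, not the reporter agent's actual record or gauge types):

```python
def idle_node_records(cluster_stats):
    # Hypothetical helper: emit (metric_name, value, tags) triples for the
    # idle-node gauge, but only when the payload came from Autoscaler v2,
    # so v1 clusters never see the new metric.
    records = []
    if cluster_stats.get("has_autoscaler_v2_stats"):
        for node_type, count in cluster_stats.get("idle_nodes", {}).items():
            records.append(("cluster_idle_nodes", count, {"node_type": node_type}))
    return records
```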