
Conversation

@jinbum-kim
Contributor

Description

This PR makes Dashboard ReporterAgent work correctly with Autoscaler v2 for cluster-level node metrics (cluster_*_nodes).

Previously, ReporterAgent relied on the v1 autoscaler’s JSON debug payload stored in GCS internal KV under DEBUG_AUTOSCALING_STATUS to compute metrics like cluster_active_nodes, cluster_pending_nodes, and cluster_failed_nodes.
However, Autoscaler v2 doesn’t populate that KV key, so when v2 is enabled these metrics end up missing or always empty.

To fix that, when Autoscaler v2 is enabled the legacy KV path is skipped and cluster status is fetched via an RPC (get_cluster_status()). The result is then reshaped into the cluster_stats format that _to_records() already expects and sent through the same metrics pipeline.

On top of that, Autoscaler v2 introduces an extra node state, idle, so this PR also adds a cluster_idle_nodes gauge. This metric is emitted only when v2 is enabled, so it won’t affect v1 behavior.
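The version branch described above can be sketched roughly as follows. This is an illustrative stand-in, not the actual ReporterAgent code: the `agent` object and its attribute names (`is_head_node`, `autoscaler_v2_enabled`, `get_cluster_stats_v2`, `gcs_kv_get`) are hypothetical placeholders for the pieces named in this description.

```python
import json

def compose_cluster_stats(agent):
    """Hedged sketch of the v1/v2 branch; `agent` is a hypothetical stand-in."""
    if agent.is_head_node and agent.autoscaler_v2_enabled:
        # v2 never writes DEBUG_AUTOSCALING_STATUS, so fetch over RPC instead.
        return agent.get_cluster_stats_v2()
    # v1: legacy path, read the JSON debug payload from GCS internal KV.
    payload = agent.gcs_kv_get("DEBUG_AUTOSCALING_STATUS")
    return json.loads(payload) if payload else {}
```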

Related issues

Closes: #59930

Additional information

Implementation details

  • ReporterAgent branches based on whether Autoscaler v2 is enabled:

    • v1: keep the existing behavior (read DEBUG_AUTOSCALING_STATUS from GCS internal KV)
    • v2: skip DEBUG_AUTOSCALING_STATUS and fetch cluster status using get_cluster_status()
  • For the v2 path, the fetched status is normalized to match the shape expected by _to_records():

    • active_nodes, idle_nodes: counted by ray_node_type_name
    • pending_nodes: converted into a list of (ip, node_type, details) tuples
    • failed_nodes: converted into a list of (ip, node_type) tuples
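The normalization steps above can be sketched as a small pure function. Attribute names on the status object (`ip_address`, `details`, the node-list fields) are assumptions for illustration and may not match the exact v2 schema:

```python
from collections import Counter

def normalize_v2_status(status):
    """Reshape a v2 cluster-status-like object into the cluster_stats dict
    shape that _to_records() expects (field names are illustrative)."""
    return {
        "autoscaler_report": {
            # active/idle: per-node-type counts keyed by ray_node_type_name
            "active_nodes": dict(Counter(n.ray_node_type_name for n in status.active_nodes)),
            "idle_nodes": dict(Counter(n.ray_node_type_name for n in status.idle_nodes)),
            # pending: (ip, node_type, details) tuples
            "pending_nodes": [
                (n.ip_address, n.ray_node_type_name, n.details)
                for n in status.pending_nodes
            ],
            # failed: (ip, node_type) tuples
            "failed_nodes": [
                (n.ip_address, n.ray_node_type_name) for n in status.failed_nodes
            ],
        }
    }
```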

New metric: cluster_idle_nodes

  • Adds cluster_idle_nodes to expose the new Autoscaler v2 node state idle_nodes.
  • Uses has_autoscaler_v2_stats so cluster_idle_nodes is emitted only when Autoscaler v2 is on.
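The v2-only gating can be sketched like this. The record shape and the way `has_autoscaler_v2_stats` is derived here are simplified illustrations, not the actual ReporterAgent implementation:

```python
def to_node_metric_records(cluster_stats):
    """Sketch: emit cluster_* node gauges, adding cluster_idle_nodes only
    when the v2-only idle_nodes data is present."""
    report = cluster_stats.get("autoscaler_report", {})
    has_autoscaler_v2_stats = "idle_nodes" in report
    records = {
        "cluster_active_nodes": sum(report.get("active_nodes", {}).values()),
        "cluster_pending_nodes": len(report.get("pending_nodes", [])),
        "cluster_failed_nodes": len(report.get("failed_nodes", [])),
    }
    if has_autoscaler_v2_stats:
        # Only emitted under Autoscaler v2, so v1 dashboards are unaffected.
        records["cluster_idle_nodes"] = sum(report["idle_nodes"].values())
    return records
```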

@jinbum-kim jinbum-kim requested a review from a team as a code owner January 26, 2026 19:16

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

The pull request successfully integrates Autoscaler v2 support for cluster-level node metrics, including the new cluster_idle_nodes gauge. The changes correctly branch logic based on the autoscaler version, ensuring backward compatibility with v1 while enabling new v2 features. The new test cases adequately cover both v1 and v2 scenarios, verifying the correct metric emission and data handling. However, there is a critical issue regarding a blocking call within an asynchronous context that needs to be addressed.

@jinbum-kim jinbum-kim force-pushed the fix/dashboard-autoscaler-v2-cluster-metrics branch 2 times, most recently from f70738b to 6a646d3 Compare January 26, 2026 20:05
@ray-gardener ray-gardener bot added core Issues that should be addressed in Ray Core observability Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling community-contribution Contributed by the community labels Jan 27, 2026
@edoakes
Collaborator

edoakes commented Jan 27, 2026

@sampan-s-nayak @rueian PTAL


# Autoscaler v2 only - get cluster_status from gcs via RPC(get_cluster_status())
autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=self._gcs_client)
Contributor


Should we fetch this once and make it a member variable so that we can keep referring to it? (We can consider doing this if it is not possible to switch between autoscaler v1 and v2 once a Ray cluster has started.)

Contributor Author


Hello, thank you for the review.

I've confirmed that the autoscaler version is determined by env variables set before process start, and I haven't found any logic supporting runtime v1↔v2 switching in my investigation so far.

However, I'm slightly concerned about edge cases where one-time caching could lead to incorrect branching:

  • Custom deployments running multiple autoscaler components (e.g., built-in autoscaler process + separate autoscaler pod)
  • Transient detection inconsistencies during initial boot

This is an edge case, but I’m not sure what’s best here.

Contributor


cc: @rueian

# Autoscaler v2 only - get cluster_status from gcs via RPC(get_cluster_status())
autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=self._gcs_client)
if self._is_head_node and autoscaler_v2_enabled:
cluster_stats = await asyncio.to_thread(self._get_cluster_stats_v2)
Contributor


Could we, if possible, run this on self.executor instead of spawning a new thread every time?

Contributor Author

@jinbum-kim jinbum-kim Jan 29, 2026


Since _async_compose_stats_payload() is already executing inside self._executor (via _run_in_executor()), and RAY_DASHBOARD_REPORTER_AGENT_TPE_MAX_WORKERS defaults to 1, submitting a nested task to the same executor causes it to wait indefinitely for itself to finish.

That is why I used asyncio.to_thread to offload the work to a separate pool.
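The single-worker deadlock described here can be reproduced in a stand-alone sketch (illustrative only; it uses a plain ThreadPoolExecutor with a timeout so the demo fails fast instead of hanging, rather than the actual ReporterAgent executor):

```python
import concurrent.futures

# One worker, mirroring RAY_DASHBOARD_REPORTER_AGENT_TPE_MAX_WORKERS defaulting to 1.
executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)

def outer():
    # The single worker is already busy running outer(), so this nested
    # task can never start; waiting on its result would block forever.
    inner = executor.submit(lambda: "done")
    return inner.result(timeout=1)  # timeout makes the deadlock observable

try:
    executor.submit(outer).result()
    deadlocked = False
except concurrent.futures.TimeoutError:
    deadlocked = True

print("nested submit deadlocked:", deadlocked)  # → nested submit deadlocked: True
```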

Contributor


Oh, in that case do we really need to offload work here? We can let self._get_cluster_stats_v2 continue running on this thread. Because of the GIL, I don't think we will get enough benefit to justify offloading this piece of work.

Contributor Author


Initially I submitted this as a sync RPC call, but an automated reviewer warned that calling a blocking sync call inside an async method could block the main event loop, so I updated it and force-pushed.

However, I confirmed that the get_cluster_status() RPC call releases the GIL (with nogil), so it should be fine to call _get_cluster_stats_v2() synchronously here.

I’ll update the PR to call _get_cluster_stats_v2 synchronously.

@jinbum-kim jinbum-kim force-pushed the fix/dashboard-autoscaler-v2-cluster-metrics branch from df71613 to 4d35dcb Compare January 29, 2026 10:31
@jinbum-kim jinbum-kim requested review from a team and WangTaoTheTonic as code owners January 29, 2026 10:31
@jinbum-kim jinbum-kim marked this pull request as draft January 29, 2026 10:32
@jinbum-kim jinbum-kim closed this Jan 29, 2026
@jinbum-kim jinbum-kim deleted the fix/dashboard-autoscaler-v2-cluster-metrics branch January 29, 2026 10:32

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 3 potential issues.

@jinbum-kim jinbum-kim restored the fix/dashboard-autoscaler-v2-cluster-metrics branch January 29, 2026 10:59
@jinbum-kim jinbum-kim reopened this Jan 29, 2026
@jinbum-kim jinbum-kim force-pushed the fix/dashboard-autoscaler-v2-cluster-metrics branch from 4d35dcb to 44c78f4 Compare January 29, 2026 10:59
@jinbum-kim
Contributor Author

I rebased incorrectly and the PR history got polluted. I force-pushed a cleaned-up branch. Sorry about that.

@jinbum-kim jinbum-kim marked this pull request as ready for review January 29, 2026 14:18
@aslonnie aslonnie removed request for a team, aslonnie and jjyao January 29, 2026 20:49

Labels

community-contribution: Contributed by the community
core: Issues that should be addressed in Ray Core
observability: Issues related to the Ray Dashboard, Logging, Metrics, Tracing, and/or Profiling


Development

Successfully merging this pull request may close these issues.

[Core][Dashboard][Autoscaler] ReporterAgent cluster metrics missing when using autoscaler v2 (still reads DEBUG_AUTOSCALING_STATUS)
