[Dashboard] Support autoscaler v2 for cluster-level node metrics #60504
Conversation
Code Review
The pull request successfully integrates Autoscaler v2 support for cluster-level node metrics, including the new cluster_idle_nodes gauge. The changes correctly branch logic based on the autoscaler version, ensuring backward compatibility with v1 while enabling new v2 features. The new test cases adequately cover both v1 and v2 scenarios, verifying the correct metric emission and data handling. However, there is a critical issue regarding a blocking call within an asynchronous context that needs to be addressed.
Force-pushed from f70738b to 6a646d3
@sampan-s-nayak @rueian PTAL
```python
# Autoscaler v2 only - get cluster_status from gcs via RPC(get_cluster_status())
autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=self._gcs_client)
```
should we fetch this once and make it a member variable so that we can keep referring to it? (we can consider doing this if it is not possible to switch between autoscaler v1 and v2 once a ray cluster has started)
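For illustration, the one-time caching could look roughly like this (a minimal sketch: the constructor shape is hypothetical and the import path is an assumption, while the `is_autoscaler_v2` call itself comes from the PR diff):

```python
from ray.autoscaler.v2.utils import is_autoscaler_v2  # import path assumed

class ReporterAgent:
    def __init__(self, gcs_client):
        self._gcs_client = gcs_client
        # Resolve the autoscaler version once at startup and reuse it on
        # every stats cycle, assuming it cannot change at runtime.
        self._autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=gcs_client)
```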
Hello, thank you for the review.
I've confirmed that the autoscaler version is determined by environment variables set before the process starts, and I haven't found any logic supporting runtime v1↔v2 switching in my investigation so far.
However, I'm slightly concerned about edge cases where one-time caching could lead to incorrect branching:
- Custom deployments running multiple autoscaler components (e.g., built-in autoscaler process + separate autoscaler pod)
- Transient detection inconsistencies during initial boot
This is an edge case, but I’m not sure what’s best here.
cc: @rueian
```python
# Autoscaler v2 only - get cluster_status from gcs via RPC(get_cluster_status())
autoscaler_v2_enabled = is_autoscaler_v2(gcs_client=self._gcs_client)
if self._is_head_node and autoscaler_v2_enabled:
    cluster_stats = await asyncio.to_thread(self._get_cluster_stats_v2)
```
Could we, if possible, run this on self._executor instead of spawning a new thread every time?
Since _async_compose_stats_payload() already executes inside self._executor (via _run_in_executor()), and RAY_DASHBOARD_REPORTER_AGENT_TPE_MAX_WORKERS defaults to 1, submitting a nested task to the same executor would make it wait indefinitely for itself to finish.
That is why I used asyncio.to_thread to offload the work to a separate pool.
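To illustrate the hazard with a standalone toy example (not code from this PR): with max_workers=1, a task that submits to its own executor and blocks on the nested result is waiting on itself:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError

executor = ThreadPoolExecutor(max_workers=1)

def inner():
    return "inner ran"

def outer():
    nested = executor.submit(inner)  # queued behind outer() itself
    try:
        # The only worker is busy running outer(), so inner() cannot start
        # while we wait here; without a timeout this would block forever.
        return nested.result(timeout=2)
    except TimeoutError:
        return "deadlock: nested task never started"

print(executor.submit(outer).result())  # -> "deadlock: nested task never started"
executor.shutdown()
```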
Oh, in that case do we really need to offload work here? We can let self._get_cluster_stats_v2 continue running on this thread. Because of the GIL, I don't think we would get enough benefit to justify offloading this piece of work.
Initially I submitted this as a sync RPC call, but an automated reviewer warned that a blocking sync call inside an async method could block the main event loop, so I updated it and force-pushed.
However, I've confirmed that the get_cluster_status() RPC call releases the GIL (with nogil), so it should be fine to call _get_cluster_stats_v2() synchronously here.
I'll update the PR to call _get_cluster_stats_v2 synchronously.
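The simplified branch would then look roughly like this (an excerpt-style sketch of the intended final shape, not the exact PR code):

```python
# Sketch: _async_compose_stats_payload() already runs on self._executor, and
# the underlying get_cluster_status() RPC releases the GIL, so the blocking
# call can stay synchronous without stalling the event loop.
if self._is_head_node and autoscaler_v2_enabled:
    cluster_stats = self._get_cluster_stats_v2()
```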
Force-pushed from df71613 to 4d35dcb
Cursor Bugbot has reviewed your changes and found 3 potential issues.
Force-pushed from 4d35dcb to 44c78f4
I rebased incorrectly and the PR history got polluted. I force-pushed a cleaned-up branch. Sorry about that.
Description
This PR makes Dashboard ReporterAgent work correctly with Autoscaler v2 for cluster-level node metrics (`cluster_*_nodes`).

Previously, ReporterAgent relied on the v1 autoscaler's JSON debug payload stored in GCS internal KV under `DEBUG_AUTOSCALING_STATUS` to compute metrics like `cluster_active_nodes`, `cluster_pending_nodes`, and `cluster_failed_nodes`. However, Autoscaler v2 doesn't populate that KV key, so when v2 is enabled these metrics end up missing or always empty.

To fix that, when Autoscaler v2 is enabled the legacy KV path is skipped and cluster status is fetched via RPC (`get_cluster_status()`). The result is then reshaped into the `cluster_stats` format that `_to_records()` already expects and sent through the same metrics pipeline.

On top of that, Autoscaler v2 introduces an extra node state, idle, so this PR also adds a `cluster_idle_nodes` gauge. This metric is emitted only when v2 is enabled, so it won't affect v1 behavior.

Related issues
Closes: #59930
Additional information
Implementation details
ReporterAgent branches based on whether Autoscaler v2 is enabled:
- v1: legacy path (reads `DEBUG_AUTOSCALING_STATUS` from GCS internal KV)
- v2: skips `DEBUG_AUTOSCALING_STATUS` and fetches cluster status using `get_cluster_status()`

For the v2 path, the fetched status is normalized to match the shape expected by `_to_records()`, as sketched after this list:
- `active_nodes`, `idle_nodes`: counted by `ray_node_type_name`
- `pending_nodes`: converted into a list of `(ip, node_type, details)` tuples
- `failed_nodes`: converted into a list of `(ip, node_type)` tuples
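A hedged sketch of that normalization (the status attribute names follow the PR description and are assumptions rather than the exact Ray schema; the outer nesting that `_to_records()` expects is simplified away):

```python
from collections import Counter

def normalize_v2_status(status):
    """Reshape a get_cluster_status() result into the cluster_stats layout.

    Attribute names (active_nodes, idle_nodes, pending_nodes, failed_nodes,
    ray_node_type_name, ip_address, details) are taken from the PR
    description and may differ from the actual ClusterStatus schema.
    """
    active = Counter(n.ray_node_type_name for n in status.active_nodes)
    idle = Counter(n.ray_node_type_name for n in status.idle_nodes)
    pending = [
        (n.ip_address, n.ray_node_type_name, n.details)
        for n in status.pending_nodes
    ]
    failed = [(n.ip_address, n.ray_node_type_name) for n in status.failed_nodes]
    return {
        "active_nodes": dict(active),
        "idle_nodes": dict(idle),
        "pending_nodes": pending,
        "failed_nodes": failed,
    }
```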
New metric: `cluster_idle_nodes`
- Adds `cluster_idle_nodes` to expose the new Autoscaler v2 node state `idle_nodes`.
- Gated by `has_autoscaler_v2_stats`, so `cluster_idle_nodes` is emitted only when Autoscaler v2 is on.
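A rough sketch of that gating (all names here are stand-ins, not the reporter agent's actual record or gauge types):

```python
def idle_node_records(cluster_stats):
    # Hypothetical helper: emit (metric_name, value, tags) triples for the
    # idle-node gauge, but only when the payload came from Autoscaler v2,
    # so v1 clusters never see the new metric.
    records = []
    if cluster_stats.get("has_autoscaler_v2_stats"):
        for node_type, count in cluster_stats.get("idle_nodes", {}).items():
            records.append(("cluster_idle_nodes", count, {"node_type": node_type}))
    return records
```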