Labels: High; new feature (New feature that is newly developed and adds additional functionality to the package)
Description
This ticket tracks the progress of integrating ROSE with the HPE Cray SmartSim DataClient.
Very high-level design:
┌──────────────────────────────────────────────────────────────────────────────┐
│                                     ROSE                                     │
│──────────────────────────────────────────────────────────────────────────────│
│ • Manages and scales distributed AL/RL, and ML workflows across HPC resources│
│ • Submits multiple Learners in parallel (each manages Simulation + Training) │
│ • Owns an integrated DataClient (leveraging SmartSim) for in-memory exchange │
│                                                                              │
│    ┌────────────────────────────────────────────────────────────────────┐    │
│    │                       DataClient (SmartSim)                        │    │
│    │────────────────────────────────────────────────────────────────────│    │
│    │ • ROSE-integrated interface for data movement                      │    │
│    │ • Uses SmartSim/Redis backend for in-memory tensor exchange        │    │
│    │ • Enables fast communication between Simulations and AI tasks      │    │
│    └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│    ┌────────────────────────────────────────────────────────────────────┐    │
│    │                        Learners (Parallel)                         │    │
│    │────────────────────────────────────────────────────────────────────│    │
│    │ Each Learner orchestrates coupled Simulation and Training tasks:   │    │
│    │                                                                    │    │
│    │    ┌────────────────────────┐       ┌────────────────────────┐     │    │
│    │    │    Simulation Task     │──────▶│   AI / Training Task   │     │    │
│    │    │    (Data Producer)     │◀──────│ (Data Consumer/Updater)│     │    │
│    │    └────────────────────────┘       └────────────────────────┘     │    │
│    │                │                                 │                 │    │
│    │                └─────────── uses DataClient ─────┘                 │    │
│    │                                                                    │    │
│    │   ... (many Learners running concurrently, managed by ROSE) ...    │    │
│    └────────────────────────────────────────────────────────────────────┘    │
│                                                                              │
│    ┌────────────────────────────────────────────────────────────────────┐    │
│    │   Underlying SmartSim / Redis/Dragon Backend (In-memory Store)     │    │
│    │────────────────────────────────────────────────────────────────────│    │
│    │ • Global in-memory database shared across all Learners             │    │
│    │ • Handles tensors, metadata, and model states                      │    │
│    │ • Scales across HPC nodes with SmartSim integration                │    │
│    └────────────────────────────────────────────────────────────────────┘    │
└──────────────────────────────────────────────────────────────────────────────┘
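To make the producer/consumer flow in the diagram more concrete, here is a minimal sketch of several Learners sharing one store. It does not use the actual ROSE or SmartSim APIs: `InMemoryDataClient`, `simulation_task`, `training_task`, `run_learners`, and the key names are all hypothetical, and a plain dictionary stands in for the SmartSim/Redis backend. The method names `put_tensor`, `poll_tensor`, and `get_tensor` are chosen to resemble the SmartRedis client, but whether the ROSE DataClient will expose the same names is an assumption.

```python
import threading


class InMemoryDataClient:
    """Hypothetical stand-in for the ROSE DataClient.

    A dict guarded by a condition variable replaces the SmartSim/Redis store,
    so the sketch runs without any database.
    """

    def __init__(self):
        self._store = {}
        self._cond = threading.Condition()

    def put_tensor(self, name, data):
        # Store a copy of the data and wake up any task waiting on a key.
        with self._cond:
            self._store[name] = list(data)
            self._cond.notify_all()

    def poll_tensor(self, name, timeout_s=5.0):
        # Block until the key exists; returns False if the timeout expires.
        with self._cond:
            return self._cond.wait_for(lambda: name in self._store, timeout=timeout_s)

    def get_tensor(self, name):
        with self._cond:
            return list(self._store[name])


def simulation_task(client, learner_id, n_samples=4):
    # Data producer: writes fake simulation output under a per-Learner key.
    samples = [float(learner_id * 10 + i) for i in range(n_samples)]
    client.put_tensor(f"learner{learner_id}:samples", samples)


def training_task(client, learner_id):
    # Data consumer/updater: waits for samples, then writes back a "model"
    # (here just the mean of the samples).
    key = f"learner{learner_id}:samples"
    if not client.poll_tensor(key):
        raise TimeoutError(f"no data for {key}")
    samples = client.get_tensor(key)
    client.put_tensor(f"learner{learner_id}:model", [sum(samples) / len(samples)])


def run_learners(n_learners=3):
    # One shared store for all Learners; each Learner runs its two tasks in threads.
    client = InMemoryDataClient()
    threads = []
    for lid in range(n_learners):
        threads.append(threading.Thread(target=training_task, args=(client, lid)))
        threads.append(threading.Thread(target=simulation_task, args=(client, lid)))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return {lid: client.get_tensor(f"learner{lid}:model")[0] for lid in range(n_learners)}


if __name__ == "__main__":
    print(run_learners(3))
```

Prefixing keys with a Learner identifier is one simple way to keep concurrent Learners from overwriting each other in a global store; the real integration may handle namespacing differently.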
Current work is on the following branch: https://github.com/radical-cybertools/ROSE/tree/integration/smartsim
Example (not a final version): https://github.com/radical-cybertools/ROSE/blob/integration/smartsim/examples/data_flow_learner.py