Integrate ROSE + SmartSim Data Client #73

@AymenFJA

This ticket tracks progress on integrating ROSE with the HPE Cray SmartSim DataClient.

Very high-level design:

┌──────────────────────────────────────────────────────────────────────────────┐
│                                   ROSE                                       │
│──────────────────────────────────────────────────────────────────────────────│
│ • Manages and scales distributed AL/RL and ML workflows across HPC resources │
│ • Submits multiple Learners in parallel (each manages Simulation + Training) │
│ • Owns an integrated DataClient (leveraging SmartSim) for in-memory exchange │
│                                                                              │
│   ┌────────────────────────────────────────────────────────────────────┐     │
│   │                         DataClient (SmartSim)                      │     │
│   │────────────────────────────────────────────────────────────────────│     │
│   │ • ROSE-integrated interface for data movement                      │     │
│   │ • Uses SmartSim/Redis backend for in-memory tensor exchange        │     │
│   │ • Enables fast communication between Simulations and AI tasks      │     │
│   └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│   ┌────────────────────────────────────────────────────────────────────┐     │
│   │                            Learners (Parallel)                     │     │
│   │────────────────────────────────────────────────────────────────────│     │
│   │  Each Learner orchestrates coupled Simulation and Training tasks:  │     │
│   │                                                                    │     │
│   │   ┌────────────────────────┐       ┌────────────────────────┐      │     │
│   │   │   Simulation Task      │──────▶│  AI / Training Task    │      │     │
│   │   │  (Data Producer)       │◀──────│ (Data Consumer/Updater)│      │     │
│   │   └────────────────────────┘       └────────────────────────┘      │     │
│   │            │                                │                      │     │
│   │            └────────── uses DataClient ─────┘                      │     │
│   │                                                                    │     │
│   │   ... (many Learners running concurrently, managed by ROSE) ...    │     │
│   └────────────────────────────────────────────────────────────────────┘     │
│                                                                              │
│   ┌────────────────────────────────────────────────────────────────────┐     │
│   │       Underlying SmartSim / Redis/Dragon Backend (In-memory Store) │     │
│   │────────────────────────────────────────────────────────────────────│     │
│   │ • Global in-memory database shared across all Learners             │     │
│   │ • Handles tensors, metadata, and model states                      │     │
│   │ • Scales across HPC nodes with SmartSim integration                │     │
│   └────────────────────────────────────────────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────────────────┘
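
To make the data flow in the diagram concrete, here is a minimal, self-contained sketch of several Learners running in parallel, each coupling a simulation task (producer) and a training task (consumer/updater) through one shared client. Everything in it is an assumption made for illustration: `InMemoryDataClient`, its `put`/`get` methods, the key layout, and the toy linear model are hypothetical and are not the ROSE DataClient or the SmartSim API. A plain dictionary stands in for the Redis/Dragon in-memory store so the sketch runs without any backend.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryDataClient:
    """Hypothetical stand-in for the DataClient; a dict replaces the real store."""

    def __init__(self):
        self._store = {}
        self._cond = threading.Condition()

    def put(self, key, value):
        # Publish a value and wake up any task waiting on it.
        with self._cond:
            self._store[key] = value
            self._cond.notify_all()

    def get(self, key, timeout=5.0):
        # Block until the key exists, or fail after `timeout` seconds.
        with self._cond:
            if not self._cond.wait_for(lambda: key in self._store, timeout=timeout):
                raise TimeoutError(f"key {key!r} was never published")
            return self._store[key]


def simulation_task(client, learner_id, iteration):
    # Data producer: emit samples of a toy target y = slope * x.
    slope = float(learner_id + 1)  # each Learner gets its own target
    samples = [(x, slope * x) for x in (0.0, 1.0, 2.0, 3.0)]
    client.put(f"learner{learner_id}/iter{iteration}/samples", samples)


def training_task(client, learner_id, iteration, lr=0.05):
    # Data consumer/updater: read samples and model state, write new state back.
    samples = client.get(f"learner{learner_id}/iter{iteration}/samples")
    weight = client.get(f"learner{learner_id}/weight")
    # One gradient-descent step on the mean squared error of y = weight * x.
    grad = sum(2.0 * (weight * x - y) * x for x, y in samples) / len(samples)
    client.put(f"learner{learner_id}/weight", weight - lr * grad)


def run_learner(client, learner_id, iterations):
    # A Learner alternates simulation and training through the shared client.
    client.put(f"learner{learner_id}/weight", 0.0)
    for iteration in range(iterations):
        simulation_task(client, learner_id, iteration)
        training_task(client, learner_id, iteration)
    return client.get(f"learner{learner_id}/weight")


def run_all(n_learners=3, iterations=50):
    # Many Learners run concurrently and share one store, keyed per Learner.
    client = InMemoryDataClient()
    with ThreadPoolExecutor(max_workers=n_learners) as pool:
        return list(pool.map(lambda i: run_learner(client, i, iterations),
                             range(n_learners)))


if __name__ == "__main__":
    for learner_id, weight in enumerate(run_all()):
        print(f"learner {learner_id}: learned weight {weight:.6f}")
```

Prefixing every key with the Learner id is one simple way to let all Learners share a global store without overwriting each other's tensors or model state; the real integration may use a different naming or namespacing scheme.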

Current work is under the following branch: https://github.com/radical-cybertools/ROSE/tree/integration/smartsim
Example (not a final version): https://github.com/radical-cybertools/ROSE/blob/integration/smartsim/examples/data_flow_learner.py

Labels: High, new feature
