System design and components of the SWF Testbed.
The testbed plan is based on ePIC streaming computing model WG discussions in the streaming computing model meeting [1], guided by the ePIC streaming computing model report [2] and the ePIC workflow management system requirements draft [3].
The testbed prototypes the ePIC streaming computing model's workflows and dataflows from Echelon 0 (E0) egress (the DAQ exit buffer) through the processing that takes place at the two Echelon 1 computing facilities at BNL and JLab.
The testbed scope, timeline and workplan are described in a planning document [4]. Detailed progress tracking and development discussion is in a progress document [5].
See the E0-E1 overview slide deck [6] for more information on the E0-E1 workflow and dataflow. The following is a schematic of the system the testbed targets (from the blue DAQ external subnet rightwards).
Figure: E0-E1 data flow and processing schematic
Overall system design and implementation notes:
- We aim to follow the Software Statement of Principles of the EIC and ePIC in the design and implementation of the testbed software.
- The implementation language is Python 3.9 or greater.
- Testbed modules are implemented as a set of loosely coupled agents, each with a specific role in the system.
- The agents communicate via messaging, using ActiveMQ as the message broker.
- The PanDA [7] distributed workload management system and its ancillary components are used for workflow orchestration and workload execution.
- The Rucio [8] distributed data management system is used for management and distribution of data and associated metadata, in close orchestration with PanDA.
- High-quality monitoring and centralized management of system data (metadata, bookkeeping, logs, etc.) is a primary design goal. Monitoring and system data gathering and distribution are implemented via a web service backed by a relational database, with a REST API for data access and reporting.
The streaming workflow (swf prefix) set of repositories makes up the software for the ePIC streaming workflow testbed project, with development begun in June 2025. This swf-testbed repository serves as the umbrella repository for the testbed: the central place for documentation, overall configuration, and high-level project information.
The repositories mapping to testbed components are:

- swf-monitor: A web service providing system monitoring and comprehensive information about the testbed's state, both via browser-based dashboards and a JSON-based REST API. It manages the databases used by the testbed and offers a REST API for other agents in the system to report status and retrieve information. It acts as a listener for the ActiveMQ message broker, receiving messages from other agents, storing relevant data in the database, and presenting message histories in the monitor. It also hosts a Model Context Protocol (MCP) server through which the agents share information with LLM clients, creating an intelligent assistant for the testbed.
- swf-daqsim-agent: The information agent designed to simulate the Data Acquisition (DAQ) system and other EIC machine and ePIC detector influences on streaming processing. This dynamic simulator acts as the primary input and driver of activity within the testbed.
- swf-data-agent: The central data handling agent within the testbed. It listens to the swf-daqsim-agent, manages Rucio subscriptions of run datasets and STF files, creates new run datasets, and sends messages to the swf-processing-agent for run processing and to the swf-fastmon-agent announcing new STF availability. It will also have a 'watcher' role to identify and report stalls or anomalies.
- swf-processing-agent: The prompt processing agent that configures and submits PanDA processing jobs to execute the streaming workflows of the testbed.
- swf-fastmon-agent: The fast monitoring agent designed to consume (fractions of) STF data for quick, near real-time monitoring. This agent will reside at the E1s and perform remote data reads from STF files in the DAQ exit buffer, skimming a fraction of the data of interest for fast monitoring. The agent is notified of new STF availability by the swf-data-agent.
- An MCP agent may be added in the future for managing Model Context Protocol (MCP) services. For the moment, this is done in swf-monitor (colocated with the agent data the MCP services provide).

Note Paul Nilsson's ask-panda example of an MCP server and client, which we want to integrate into the testbed. Tadashi Maeno has also implemented MCP capability in the core PanDA services; we will want to integrate that as well.
The testbed supports two deployment modes:
Development Mode (Docker-managed infrastructure):
- PostgreSQL and ActiveMQ run as Docker containers
- Managed by `docker-compose.yml`
- Started/stopped via `swf-testbed start` / `swf-testbed stop`
- Ideal for development and testing environments
System Mode (System-managed infrastructure):
- PostgreSQL and ActiveMQ run as system services (e.g., `postgresql-16.service`, `artemis.service`)
- Managed by the system service manager (systemd)
- Testbed manages only agent processes via `swf-testbed start-local` / `swf-testbed stop-local`
- Typical for shared development systems like pandaserver02
Both modes use supervisord to manage the Python agent processes. Use `python report_system_status.py` to verify service availability and determine which mode is active.
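The kind of check `report_system_status.py` performs can be sketched as a simple TCP probe of the infrastructure services. The host and port numbers below are assumptions for a default local deployment, not taken from the actual script:

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Assumed default ports for the two infrastructure services.
SERVICES = {
    "PostgreSQL": ("localhost", 5432),
    "ActiveMQ (STOMP)": ("localhost", 61613),
}

if __name__ == "__main__":
    for name, (host, port) in SERVICES.items():
        status = "up" if port_open(host, port) else "down"
        print(f"{name} on {host}:{port}: {status}")
```

A probe like this works the same whether the services run in Docker containers or as systemd units, which is what makes it useful for distinguishing the two modes.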
The database schema for the monitoring system is automatically maintained in the swf-monitor repository; the current schema can be viewed in `testbed-schema.dbml`. To generate the schema manually:

```shell
cd swf-monitor/src
python manage.py dbml > ../testbed-schema.dbml
```

To visualize the schema, paste the DBML content into dbdiagram.io.
The testbed uses a global sequential agent ID system to ensure unique agent identification:
Agent Naming Convention:
- Format: `{agent_type}-agent-{username}-{sequential_id}`
- Examples: `data-agent-wenauseic-1`, `processing-agent-wenauseic-2`, `daq-simulator-wenauseic-3`
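The naming convention above reduces to a one-line helper. This is a sketch of the pattern, not the testbed's actual helper function:

```python
def agent_name(agent_type: str, username: str, sequential_id: int) -> str:
    """Build an agent name following {agent_type}-agent-{username}-{sequential_id}."""
    return f"{agent_type}-agent-{username}-{sequential_id}"

# e.g. agent_name("data", "wenauseic", 1) -> "data-agent-wenauseic-1"
```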
Sequential ID Assignment:
- Global counter shared across all agent types
- Stored in PersistentState database model
- Thread-safe atomic assignment via `SELECT FOR UPDATE`
- API endpoint: `/api/state/next-agent-id/`
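The atomic read-and-increment behind the endpoint can be illustrated with SQLite. SQLite has no `SELECT FOR UPDATE`, so `BEGIN IMMEDIATE` is used here as the closest analogue of the row lock the PostgreSQL-backed PersistentState model would take; the table and column names are illustrative, not the real schema:

```python
import sqlite3

def next_agent_id(conn: sqlite3.Connection) -> int:
    """Atomically read-and-increment a global counter row.

    Hypothetical schema: persistent_state(key TEXT PRIMARY KEY, value INTEGER).
    BEGIN IMMEDIATE takes the write lock up front, so concurrent callers
    serialize and each receives a distinct id.
    """
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")
    try:
        row = cur.execute(
            "SELECT value FROM persistent_state WHERE key = 'agent_id'"
        ).fetchone()
        new_id = (row[0] if row else 0) + 1
        cur.execute(
            "INSERT INTO persistent_state (key, value) VALUES ('agent_id', ?) "
            "ON CONFLICT(key) DO UPDATE SET value = excluded.value",
            (new_id,),
        )
        cur.execute("COMMIT")
    except Exception:
        cur.execute("ROLLBACK")
        raise
    return new_id
```

The same shape in Django would use `select_for_update()` inside `transaction.atomic()`; the lock-then-increment-then-commit ordering is what guarantees uniqueness under concurrent agent startups.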
Benefits:
- Guaranteed unique agent names across the system
- Easy tracking of agent generations over time
- No collision risk even with concurrent agent startups
- Human-readable sequential numbering (1, 2, 3...)
This replaces the previous random suffix approach and ensures long-term uniqueness as the system scales.
Namespaces provide logical isolation for workflows and agents:
- Purpose: Allow users to discriminate their workflows from others and from other users
- Scope: All workflow messages include namespace for filtering
- Configuration: Set in `workflows/testbed.toml` before running workflows
- Collaboration: Multiple users can share a namespace for collaborative work
- UI Integration: Monitor UI supports namespace filtering on agents, executions, and messages
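Because every workflow message carries a namespace, client-side filtering reduces to a simple predicate. The message shape below is an assumption for illustration; the real message schema lives in the agent code:

```python
from typing import Iterable

def in_namespace(messages: Iterable[dict], namespace: str) -> list[dict]:
    """Keep only messages tagged with the given namespace."""
    return [m for m in messages if m.get("namespace") == namespace]

# Hypothetical messages as they might arrive from the broker.
msgs = [
    {"msg_type": "stf_available", "namespace": "epic-fastmon-dev"},
    {"msg_type": "run_start", "namespace": "someone-else"},
]
# in_namespace(msgs, "epic-fastmon-dev") keeps only the first message
```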
Example namespace configuration:
```toml
[testbed]
namespace = "epic-fastmon-dev"
```

The testbed implements a three-layer workflow architecture to support both complex orchestration and agile parameter experimentation:
Layer 1: Snakemake (Complex Workflows)
- For complex multi-facility workflows with dependencies
- Handles orchestration-heavy scenarios (e.g., calibration workflows)
- Integrates with PanDA for distributed execution
- Used when workflow logic involves conditional execution and complex dependencies
Layer 2: TOML Configuration (Parameter Management)
- Human-readable parameter definitions for workflow variants
- Supports systematic experimentation and comparison
- Version control friendly for tracking parameter evolution
- Enables rapid iteration on workflow configurations
Layer 3: Python + SimPy (Execution Engine)
- Direct SimPy integration for simulation and execution
- Native Python expressiveness for workflow logic
- Real-time parameter-driven execution
- Optimized for fast processing workflows requiring rapid experimentation
Workflow Type Classification:
- Fast Processing Workflows: Use TOML + Python+SimPy (Layers 2 & 3) for rapid experimentation with worker counts, processing targets, and sampling rates
- Complex Orchestration Workflows: Use all three layers when sophisticated dependency management and multi-facility coordination are required
This architecture maintains testbed agility while supporting both simple parameter experiments and complex production-like workflows.
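The interplay of Layers 2 and 3 can be sketched without the SimPy dependency: parameters that would come from a TOML file drive a small discrete-event loop. All names and numbers here are illustrative, not the testbed's actual configuration or simulation logic:

```python
import heapq

# Parameters as they might appear after parsing workflows/testbed.toml
# (hypothetical keys for a fast processing workflow variant).
params = {"worker_count": 2, "stf_interval": 5.0, "processing_time": 8.0, "n_stfs": 4}

def simulate(p: dict) -> float:
    """Discrete-event sketch: STFs arrive periodically; workers process them."""
    events = []  # (time, kind, index) min-heap of pending events
    for i in range(p["n_stfs"]):
        heapq.heappush(events, (i * p["stf_interval"], "stf", i))
    workers_free_at = [0.0] * p["worker_count"]  # next-free time per worker
    finish = 0.0
    while events:
        t, kind, i = heapq.heappop(events)
        if kind == "stf":
            # Assign the STF to the worker that frees up earliest.
            w = min(range(len(workers_free_at)), key=workers_free_at.__getitem__)
            start = max(t, workers_free_at[w])
            workers_free_at[w] = start + p["processing_time"]
            finish = max(finish, workers_free_at[w])
    return finish  # time at which the last STF finishes processing
```

Editing the TOML-style parameter dict and re-running is exactly the rapid-iteration loop Layer 2 is meant to enable; Layer 3's SimPy version replaces the explicit event heap with SimPy processes and resources.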
The following diagram shows the testbed's agent-based architecture and data flows:
```mermaid
graph LR
    DAQ[E1/E2 Fast Monitor]
    PanDA[PanDA]
    Rucio[Rucio]
    ActiveMQ[ActiveMQ]
    PostgreSQL[PostgreSQL]
    DAQSim[swf-daqsim-agent]
    DataAgent[swf-data-agent]
    ProcAgent[swf-processing-agent]
    FastMon[swf-fastmon-agent]
    Monitor[swf-monitor]
    WebUI[Web Dashboard]
    RestAPI[REST API]
    MCP[MCP Server]
    DAQSim -->|1| ActiveMQ
    ActiveMQ -->|1| DataAgent
    ActiveMQ -->|1| ProcAgent
    DataAgent -->|2| Rucio
    ProcAgent -->|3| PanDA
    DAQSim -->|4| ActiveMQ
    ActiveMQ -->|4| DataAgent
    ActiveMQ -->|4| ProcAgent
    ActiveMQ -->|4| FastMon
    DataAgent -->|5| Rucio
    ProcAgent -->|6| PanDA
    FastMon -.->|7| DAQ
    ActiveMQ -.-> Monitor
    Monitor --> PostgreSQL
    Monitor --> WebUI
    Monitor --> RestAPI
    Monitor --> MCP
```
Figure: Testbed agent architecture and data flow diagram
Workflow Steps (numbers match the diagram edge labels):

1. Run Start: the daqsim-agent generates a run start broadcast message indicating a new datataking run is beginning
2. Dataset Creation: the data-agent sees the run start message and has Rucio create a dataset for the run
3. Processing Task: the processing-agent sees the run start message and establishes a PanDA processing task for the run
4. STF Available: the daqsim-agent generates a broadcast message that a new STF data file is available
5. STF Transfer: the data-agent sees the message and initiates Rucio registration and transfer of the STF file to the E1 facilities
6. STF Processing: the processing-agent sees that the new STF file is in the dataset and has been transferred to the E1 by Rucio, and initiates a PanDA job to process the STF
7. Fast Monitoring: the fastmon-agent sees the broadcast message that a new STF data file is available and performs a partial read to inject a data sample into E1/E2 fast monitoring
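The fan-out of these broadcasts can be mocked with plain dictionaries. The field names are assumptions for illustration; the real message schema lives in the agent code. The routing follows the diagram: the run start broadcast reaches the data and processing agents, while the STF-available broadcast additionally reaches the fast monitoring agent:

```python
# Illustrative messages for one run with a single STF (hypothetical fields).
run_start = {"msg_type": "run_start", "run_id": 1001}
stf_available = {"msg_type": "stf_available", "run_id": 1001}

def route(msg: dict) -> list[str]:
    """Which agents react to a given broadcast, per the workflow steps above."""
    if msg["msg_type"] == "run_start":
        return ["swf-data-agent", "swf-processing-agent"]
    if msg["msg_type"] == "stf_available":
        return ["swf-data-agent", "swf-processing-agent", "swf-fastmon-agent"]
    return []
```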
