Releases: KatherLab/MediSwarm

v1.4.1

08 Apr 05:02
40182cc

MediSwarm v1.4.1

Changes

  • Update 2-site deploy test configuration to use DL0 (RUMC_1) + DL2 (MHA_1)
  • Bump version to 1.4.1

Deploy Test Validation

Successfully validated challenge_1DivideAndConquer over Tailscale VPN with 2 clients:

  • 30+ error-free training rounds across DL0 (RUMC_1) and DL2 (MHA_1)
  • P2P model exchange (689MB model): ~2-3 seconds
  • Adaptive epoch calculation with EPOCHS_MAX_CAP=10 working correctly
  • Both swarm_config and swarm_start phases completed cleanly

Full Changelog

v1.4.0...v1.4.1

v1.4.0

07 Apr 18:47

What's New

Webviewer (Live Monitor)

  • Fix age column flicker — Replaced <meta http-equiv="refresh"> with JS-based auto-refresh and client-side age ticking. Age now counts up smoothly without resetting to 0s on reload.
  • Hostname column — Dashboard now shows which machine each run is coming from (parsed from heartbeat.json).
  • Error status detection — Runs that hit FATAL_SYSTEM_ERROR, EXECUTION_EXCEPTION, RuntimeError, OutOfMemoryError, or CUDA out of memory are now flagged with a red "error" badge instead of appearing as "stale" or "finished".
  • Default metrics visibility — Only train/val ACC and AUC-ROC are shown by default in charts. All other series are hidden but toggleable via the Chart.js legend.
  • Label distribution chart — Detail page now shows a grouped bar chart of class counts per train/val/test split, parsed from console output.
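
The error-status rule above can be sketched as a small classifier. This is only an illustration: the marker strings are taken from the bullet above, while the function name `classify_run`, its `finished`/`stale` arguments, and the fallback "running" label are assumptions, not the webviewer's actual code.

```python
# Marker strings as listed in the release notes; everything else is hypothetical.
ERROR_MARKERS = (
    "FATAL_SYSTEM_ERROR",
    "EXECUTION_EXCEPTION",
    "RuntimeError",
    "OutOfMemoryError",
    "CUDA out of memory",
)


def classify_run(log_text: str, finished: bool, stale: bool) -> str:
    """Return a dashboard status, letting an error marker override stale/finished."""
    if any(marker in log_text for marker in ERROR_MARKERS):
        return "error"
    if finished:
        return "finished"
    if stale:
        return "stale"
    return "running"  # assumed label for healthy, active runs


if __name__ == "__main__":
    print(classify_run("step 12: CUDA out of memory", finished=True, stale=False))
```

Checking for errors first is what keeps a crashed run from being reported as merely "stale" or "finished".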

Training

  • Reduce EPOCHS_MAX_CAP default from 20 → 10, preventing excessive epochs on small sites (e.g. RUMC_1 with 22 samples was doing 20 epochs per round, now capped at 10). Override with EPOCHS_MAX_CAP env var.

Heartbeat / Live Sync

  • Hostname field added to heartbeat.json output
  • ANSI escape code stripping from RUN_NAME (fixes garbled names from colored terminal output)
  • Quote cleanup on kit_version field
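
A minimal sketch of the two cleanup steps, assuming the garbled names come from standard CSI colour sequences and that the stray quotes wrap the whole `kit_version` value. The helper names `clean_run_name` and `clean_kit_version` are hypothetical and not taken from the MediSwarm code.

```python
import re

# Matches ANSI CSI sequences such as "\x1b[32m" (colour) or "\x1b[0m" (reset).
ANSI_CSI = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")


def clean_run_name(raw: str) -> str:
    """Remove ANSI escape sequences and surrounding whitespace from a run name."""
    return ANSI_CSI.sub("", raw).strip()


def clean_kit_version(raw: str) -> str:
    """Drop surrounding whitespace and single or double quotes from a version string."""
    return raw.strip().strip("\"'")


if __name__ == "__main__":
    print(clean_run_name("\x1b[32mrun_01\x1b[0m"))
    print(clean_kit_version('"1.4.0"'))
```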

CI/CD

  • Deploy test workflow now triggers on release publish instead of weekly schedule (manual dispatch retained)

MediSwarm v1.3.0

05 Apr 14:11
2943ec2

Released: 2026-04-05

Major release adding the STAMP histopathology classification pipeline, FedProx aggregation strategy, comprehensive CI/CD infrastructure, Duke benchmark pipeline, and expanded documentation with architecture diagrams.


🔬 STAMP Classification Pipeline

Full support for KatherLab STAMP 2.4.0 histopathology classification in federated learning:

  • Separate Dockerfile_STAMP — Python 3.11, PyTorch 2.7.1, CUDA 12.6 (independent from ODELIA's Python 3.10/PyTorch 2.2.2 image)
  • Build flag — buildDockerImageAndStartupKits.sh now accepts -d / --dockerfile to select between Dockerfile_ODELIA and Dockerfile_STAMP
  • Synthetic dataset generator — Creates 2 sites × 15 patients with H5 feature files for integration testing
  • Integration tests — Preflight check, local training, and NVFlare simulation mode (3 rounds, 2 clients)
  • Per-round metrics CSV — STAMPMetricsCallback writes ground-truth/prediction probabilities and summary metrics per epoch

Two Docker Images

After v1.3.0, MediSwarm maintains two Docker images:

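| Image                | Python | PyTorch | Use Case                             |
|----------------------|--------|---------|--------------------------------------|
| jefftud/odelia:<ver> | 3.10   | 2.2.2   | 3D breast MRI classification         |
| jefftud/stamp:<ver>  | 3.11   | 2.7.1   | STAMP histopathology classification  |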

🔄 FedProx Aggregation Strategy

Alternative to FedAvg for improved convergence with non-IID medical data:

  • FedProxCallback — Lightning callback adds proximal term (μ/2) × ‖w_local − w_global‖² to gradient updates
  • Cross-pipeline — Compatible with both ODELIA (pytorch_lightning) and STAMP (lightning)
  • Configurable — Set FEDPROX_MU environment variable (default: 0 = disabled, recommended: 0.001–0.01)
  • Documentation — docs/AGGREGATION_STRATEGIES.md compares FedAvg, FedProx, Scaffold, and FedOpt with decision matrix
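
To make the proximal term concrete, here is a framework-free sketch that operates on flat lists of floats. It is not the `FedProxCallback` implementation; the function names are hypothetical, and only the formula and the `FEDPROX_MU` variable (default 0) come from the notes above.

```python
import os


def read_mu(env=os.environ) -> float:
    """Read FEDPROX_MU; unset or 0 means FedProx is disabled."""
    return float(env.get("FEDPROX_MU", "0"))


def fedprox_penalty(w_local, w_global, mu):
    """Proximal term (mu / 2) * ||w_local - w_global||^2 for flat weight lists."""
    return 0.5 * mu * sum((l - g) ** 2 for l, g in zip(w_local, w_global))


def fedprox_gradient(w_local, w_global, mu):
    """Gradient of the proximal term with respect to w_local: mu * (w_local - w_global)."""
    return [mu * (l - g) for l, g in zip(w_local, w_global)]


if __name__ == "__main__":
    w_local, w_global = [1.0, 2.0], [0.0, 0.0]
    print(fedprox_penalty(w_local, w_global, mu=0.01))
    print(fedprox_gradient(w_local, w_global, mu=0.01))
```

With mu set to 0 both functions return zeros, so the local update reduces to plain FedAvg training, which matches the "0 = disabled" default.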

🧪 CI/CD for STAMP

Expanded test infrastructure covering both pipelines:

  • Unit tests — test_stamp_training.py (465 lines), test_stamp_model_wrapper.py (257 lines), test_fedprox_callback.py (286 lines)
  • Integration tests — STAMP Docker build + preflight + local training + simulation in pr-test.yaml
  • Unified packages — unit-tests.yaml switched from pytorch-lightning to unified lightning package
  • Timeout — PR test timeout increased from 45 to 60 minutes
  • Cleanup — CI cleanup step now kills stamp and nvflare containers alongside odelia

📊 Duke Benchmark Pipeline

Automated end-to-end benchmarking on the Duke Breast MRI dataset:

  • run_duke_benchmark.sh — Orchestrates build → deploy → swarm training → result collection → local model comparison
  • Configurable deploy — deploy_and_test.sh reads SITES and SERVER_NAME from deploy_sites.conf (backward-compatible defaults)
  • deploy_sites.conf.example — Template with dl0/dl2/dl3 configuration for TUD compute cluster
  • Results template — docs/DUKE_BENCHMARK_RESULTS.md for recording benchmark outcomes

📐 Architecture Documentation

Expanded README from 46 lines to 214 lines:

  • System Architecture — Mermaid diagram showing site-to-server topology with NVFlare aggregation
  • Training Pipeline — Mermaid sequence diagram showing federated learning round lifecycle
  • Supported Pipelines — Comparison table (ODELIA 3D CNN vs STAMP Classification)
  • Key Features — Privacy, Docker reproducibility, multi-pipeline support
  • Project Structure — Annotated directory tree

🔐 Differential Privacy Assessment

Gap analysis and roadmap (documentation only — implementation deferred to v1.4.0):

  • docs/DIFFERENTIAL_PRIVACY.md — Current PercentilePrivacy is gradient clipping, NOT formal (ε,δ)-DP. Detailed analysis of Opacus/DP-SGD integration path, compatibility issues, and privacy budget accounting
  • docs/DIFFERENTIAL_PRIVACY_DECISION.md — Architecture decision record

Changed

  • deploy_and_test.sh container matching broadened to include stamp and nvflare alongside odelia
  • CI pr-test.yaml timeout increased from 45 to 60 minutes
  • CI cleanup step now kills stamp and nvflare containers

Stats

  • 31 files changed, 3,465 insertions, 162 deletions
  • 16 new files created
  • 9 pull requests (#252–#260)

Upgrade Notes

  • No breaking changes from v1.2.0
  • ODELIA pipeline users: no action required — Dockerfile_ODELIA is unchanged
  • STAMP pipeline users: build with ./buildDockerImageAndStartupKits.sh -d docker_config/Dockerfile_STAMP -p <project>
  • FedProx: opt-in via FEDPROX_MU env var — set to 0 or leave unset for standard FedAvg behavior

Full Changelog: v1.2.0...v1.3.0

MediSwarm v1.2.0

04 Apr 21:36
de4e5d3

Highlights

This release introduces STAMP classification support for swarm learning, a prediction workflow for external test data, significant code deduplication, improved training stability, and comprehensive documentation for making standalone training code MediSwarm-compatible.

New Features

STAMP Classification Job (#249)

  • New STAMP_classification job for swarm learning with STAMP's data pipeline (H5 features + clinical tables)
  • Supports VIT, MLP, TransMIL, and other STAMP model architectures
  • Configurable via STAMP_* environment variables
  • Stratified train/val split with STAMP's data loading pipeline

Prediction Workflow (#247)

  • New prediction workflow for evaluating trained swarm models on external test data
  • Supports both ODELIA 3D CNN and STAMP classification models
  • Configurable via environment variables for model path, data directory, and output format

Weighted Epochs Per Site (#251)

  • Replaces hardcoded per-site epoch dictionaries with a formula-based approach
  • Formula: epochs = base_epochs × (reference_size / num_train_samples), clamped to [1, max_cap]
  • Sites with fewer training samples get more local epochs per round, equalizing gradient updates across sites
  • Configurable via EPOCHS_PER_ROUND, EPOCHS_REFERENCE_DATASET_SIZE, EPOCHS_MAX_CAP env vars
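
The formula above can be sketched as follows. The rounding rule is not stated in the notes, so rounding to the nearest integer is an assumption, as are the function name and the example values for `base_epochs` and `reference_size` (presumably supplied by EPOCHS_PER_ROUND and EPOCHS_REFERENCE_DATASET_SIZE).

```python
def epochs_for_site(base_epochs: int, reference_size: int,
                    num_train_samples: int, max_cap: int) -> int:
    """epochs = base_epochs * (reference_size / num_train_samples), clamped to [1, max_cap]."""
    if num_train_samples <= 0:
        raise ValueError("num_train_samples must be positive")
    raw = base_epochs * (reference_size / num_train_samples)
    # Assumption: round to nearest integer before clamping.
    return max(1, min(max_cap, round(raw)))


if __name__ == "__main__":
    # Hypothetical settings: 1 base epoch, reference dataset of 1000 samples.
    for n in (22, 200, 5000):
        print(n, epochs_for_site(1, 1000, n, max_cap=10))
```

Under these illustrative settings a 22-sample site hits the cap, a 200-sample site gets 5 epochs, and a 5000-sample site falls back to the minimum of 1.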

Best + Last Model Checkpoints (#251)

  • finalize_training() now saves both best (by monitor metric) and latest checkpoints
  • Deployers can choose between peak-validation and final-aggregated models

Server Dashboard Enhancement (#240)

  • Enhanced server-side monitoring dashboard for real-time swarm training visibility

Client Stability Improvements (#245)

  • Systemd service for VPN with auto-reconnect and keepalive
  • GPU health check script for pre-training and Docker health checks
  • Docker container restart policies (--restart=on-failure:5)
  • VPN health monitor with automatic service restart after consecutive failures
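
The "restart after consecutive failures" behaviour of the VPN health monitor can be illustrated with a small counter. This is a sketch only: the class name, the threshold value, and the restart callback are hypothetical, and the real monitor is a script working with the systemd service rather than this Python code.

```python
from typing import Callable


class ConsecutiveFailureMonitor:
    """Trigger a restart once `threshold` health checks fail in a row."""

    def __init__(self, threshold: int, restart: Callable[[], None]):
        self.threshold = threshold
        self.restart = restart
        self.failures = 0

    def record(self, healthy: bool) -> bool:
        """Record one check result; return True if a restart was triggered."""
        if healthy:
            self.failures = 0  # any success resets the streak
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.restart()
            self.failures = 0
            return True
        return False


if __name__ == "__main__":
    monitor = ConsecutiveFailureMonitor(3, lambda: print("restarting VPN service"))
    for ok in (False, False, True, False, False, False):
        monitor.record(ok)
```

Resetting the counter on every successful check means that isolated packet loss does not restart the service; only a sustained outage does.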

Infrastructure & DevOps

Code Deduplication (#241)

  • Consolidated 5 duplicate challenge job directories into shared _shared/custom/ with symlinks
  • Moved build scripts to scripts/build/ and CI scripts to scripts/ci/
  • Single source of truth for training code across all ODELIA/challenge jobs

NVFlare Workflow Enhancements (#242)

  • Cross-site evaluation (CSE) workflow added to server and client configs
  • Tuned timeouts from 100-hour placeholders to practical values
  • Explicit metric comparator configuration
  • PercentilePrivacy filter for gradient quality control

Automated Tests (#243)

  • New unit test suite in tests/unit_tests/ (models_config, env_config, data_module)
  • GitHub Actions workflow for unit tests on PRs
  • Fixed hardcoded paths in test_challenge_models.py

Docker Build Optimization (#250)

  • Reordered Dockerfile layers: pip installs (expensive, stable) before apt installs (cheap, frequent CVE bumps)
  • Added --no-cache-dir flags to reduce image size
  • Consolidated RUN layers for better caching

NVFlare 2.7.2 Upgrade (#235, #236)

  • Upgraded from NVFlare 2.5.x to 2.7.2

Bug Fixes

  • Fix integration test printed icons (#224)
  • Fix site name argument ordering (#237, fixes #227)
  • Update CI Node.js version (#238, fixes #222)
  • Fix CI apt-get update permissions (#239)
  • Fix CLI flags for env vars lost when using sudo (#230)

Documentation

  • MediSwarm Compatibility Guide (#244, addresses #216) — step-by-step guide for making standalone training code MediSwarm-compatible
  • Updated README with correct repository links

Training Improvements (#246)

  • Class-weighted loss for imbalanced datasets
  • Gradient accumulation (effective batch size of 8)
  • Gradient clipping (val=1.0) to prevent explosion
  • 16-mixed precision for stability
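
As an illustration of class-weighted loss, the sketch below derives per-class weights from label counts using the common inverse-frequency ("balanced") heuristic. The notes do not say which weighting scheme MediSwarm uses, so the scheme and the function name are assumptions.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight for class c = N / (K * count_c), with N samples and K classes."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in sorted(counts.items())}


if __name__ == "__main__":
    # Hypothetical imbalanced split: 80 negatives, 20 positives.
    labels = [0] * 80 + [1] * 20
    print(inverse_frequency_weights(labels))
```

Rare classes receive larger weights, so each misclassified minority sample contributes more to the loss than a majority sample.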

Full Changelog: v1.1.0...v1.2.0

v1.1.0 — Challenge Models

02 Apr 21:11

MediSwarm v1.1.0 — Challenge Models Release

This release integrates five ODELIA challenge models into MediSwarm for federated swarm training, along with infrastructure improvements for deployment, testing, and CI.

New Challenge Models

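| Job                         | Model Architecture              |
|-----------------------------|---------------------------------|
| challenge_1DivideAndConquer | ResidualEncoder                 |
| challenge_2BCN_AIM          | SwinUNETR                       |
| challenge_3agaldran         | MViT v2                         |
| challenge_4abmil            | CrossModalAttentionABMIL + Swin |
| challenge_5pimed            | ResNet18                        |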

Each challenge job is a self-contained NVFlare application with its own model code, data pipeline, configs, and synthetic dataset generator.

Highlights

  • --job flag for docker.sh — Participants can now run preflight checks and local training for any challenge model:
    ./docker.sh --preflight_check --job challenge_5pimed --data_dir $DATADIR --scratch_dir $SCRATCHDIR --GPU device=0
  • Pretrained weight caching — Large model weights (checkpoint_final.pth, mvit_v2_s-ae3be167.pth) are stored outside job directories to prevent NVFlare from bundling them during job submission
  • MODEL_NAME env var fix — All challenge jobs hardcode their MODEL_NAME to prevent the docker.sh default (MST) from silently overriding the intended model
  • Deployment automation — New deploy_and_test.sh script for multi-site Docker image push, startup kit deployment, and swarm lifecycle management
  • Live sync — New kit_live_sync/ for startup kit synchronization with heartbeat monitoring
  • CI reliability — Fixed script permissions, auto-install of gdown, NVFlare submodule sync

Breaking Changes

None. The default behavior of docker.sh (without --job) remains unchanged and runs ODELIA_ternary_classification.

odelia-challenge-v1.0

10 Jul 11:40
e7a8509

Full Changelog: https://github.com/KatherLab/MediSwarm/commits/Odelia_Challenge