Turns RGB-D streams into a persistent, queryable 3D object map — objects get stable IDs, 3D positions, CLIP embeddings, and semantic labels, updated in real time.
```bash
pip install "rtsm[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
```
`rtsm demo` · 246 ms/frame · 74 objects tracked · Apache 2.0 · Python 3.12+ · RTX 3080–5090
Demo Video · Docs · PyPI
- Builds a live 3D object map from RGB-D + pose streams (ARKit, RealSense, or recorded sessions)
- Assigns persistent IDs to objects across viewpoints and time — not per-frame detection, real tracking
- Stores spatial, semantic, and temporal metadata per object (position, CLIP embedding, label confidence, view history)
- Supports semantic + spatial queries (e.g. "red bin near dock 3") via REST API and MCP
- SLAM-agnostic — sits above any perception stack that provides RGB-D + pose
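As a rough mental model of the per-object metadata listed above, here is an illustrative dataclass. Field names follow the ObjectState summary in the architecture diagram below; the class itself is a sketch, not rtsm's actual implementation:

```python
from dataclasses import dataclass, field

@dataclass
class ObjectState:
    """Illustrative per-object record; not rtsm's actual class."""
    id: int
    xyz_world: tuple[float, float, float]                # 3D position, world frame
    emb_mean: list[float] = field(default_factory=list)  # mean CLIP embedding (512-D)
    label_scores: dict[str, float] = field(default_factory=dict)  # EWMA label confidence
    hits: int = 0            # times the object has been re-observed
    stability: float = 0.0   # track stability score
    confirmed: bool = False  # promoted from proto-object once thresholds are met

# Each new observation would update the embedding, label scores, and hit count.
obj = ObjectState(id=1, xyz_world=(0.4, 1.2, 0.8))
obj.label_scores["mug"] = 0.9
```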
`rtsm demo` runs a pre-recorded 50-frame indoor scene through the full pipeline with 3D visualization:

```bash
rtsm demo           # full pipeline + 3D viewer (opens browser)
rtsm demo --no-viz  # headless — API only at localhost:8002
```

No hardware needed — replay uses a bundled recording.
Try searching for these objects (type in the search bar or use the API):
tissue box · doll · laptop · pillow · curtain · lamp · humidifier
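Those searches can also be scripted. A minimal standard-library sketch (the endpoint and parameters are the ones shown in the API examples below; port 8002 is the demo's API port, and the `semantic_search` helper is a hypothetical name, not part of rtsm):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "http://localhost:8002"  # demo API port; standalone pipeline serves on 8000

def search_url(query: str, top_k: int = 5) -> str:
    """Build the semantic-search URL with a properly escaped query string."""
    return f"{BASE}/search/semantic?{urlencode({'query': query, 'top_k': top_k})}"

def semantic_search(query: str, top_k: int = 5):
    """Requires a running rtsm demo; returns the decoded JSON response."""
    with urlopen(search_url(query, top_k)) as resp:
        return json.load(resp)

print(search_url("tissue box"))
# http://localhost:8002/search/semantic?query=tissue+box&top_k=5
```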
`rtsm demo` runs a short 50-frame clip. For the full room sweep (240 frames), clone the repo with `git lfs install && git clone`, then run `rtsm --replay recordings/session1`.
Watch the full demo on YouTube
```
┌──────────────────────────────────────────────────────────────────────────┐
│                  RTSM — Real-Time Spatio-Semantic Memory                 │
└──────────────────────────────────────────────────────────────────────────┘

 Inputs:
   Calabi Lens (ARKit iOS)  ──  WebSocket
   D435i + SLAM (RTAB-Map)  ──  ZeroMQ
   Recorded Session         ──  --replay
                                      │
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ I/O Layer                                                                │
│   WebSocket Receiver · ZMQ Bridge (sensors) · Replay Receiver            │
│   Recorder (--record)                                                    │
│   IngestQueue (buffer) ──▶ FramePacket (RGB, D, Pose)                    │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Ingest Gate (keyframe priority, sweep-based skip)                        │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Perception Pipeline                                                      │
│   Grounding DINO (detection + labels) ──▶ SAM2 (box-prompted masks)      │
│     (default backend: grounded_sam2 — Apache 2.0, no AGPL)               │
│   Mask Staging (heuristics) ──▶ Top-K Select (priority)                  │
│     ──▶ CLIP Encode (224x224 crop) ──▶ Embeddings (512-D, L2)            │
│     ──▶ Vocab Classify (label + conf)                                    │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Association                                                              │
│   Proximity Query ──▶ Embedding Cosine Sim ──▶ Score Fusion              │
│     (match existing object or create new)                                │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Working Memory                                                           │
│   ObjectState:                                                           │
│     - id, xyz_world (3D position)                                        │
│     - emb_mean, emb_gallery (CLIP embeddings)                            │
│     - view_bins (multi-view fusion)                                      │
│     - label_scores (EWMA label confidence)                               │
│     - stability, hits, confirmed                                         │
│     - image_crops (JPEG snapshots)                                       │
│   Proto -> Confirmed (hits >= 2, stability >= 0.55, views >= 1)          │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ Long-Term Memory (FAISS / Milvus)                                        │
│   Semantic Search: query(text) -> CLIP -> top-k objects                  │
└─────────────────────────────────────┬────────────────────────────────────┘
                                      ▼
┌──────────────────────────────────────────────────────────────────────────┐
│ API & Visualization                                                      │
│   REST API: /objects, /search                                            │
│   WebSocket: point clouds, objects_update                                │
│   3D Demo (Three.js)                                                     │
└──────────────────────────────────────────────────────────────────────────┘
```
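The association and promotion steps in the diagram can be approximated in a few lines. The promotion thresholds (hits >= 2, stability >= 0.55, views >= 1) come from the diagram above; the fusion weights and distance cutoff are invented for illustration and are not rtsm's actual parameters:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse_score(dist_m: float, emb_sim: float,
               max_dist: float = 0.5, w_spatial: float = 0.4,
               w_emb: float = 0.6) -> float:
    """Fuse spatial proximity and embedding similarity into one match score.
    Weights and max_dist are illustrative, not rtsm's tuned values."""
    spatial = max(0.0, 1.0 - dist_m / max_dist)  # 1 at the same spot, 0 beyond max_dist
    return w_spatial * spatial + w_emb * emb_sim

def is_confirmed(hits: int, stability: float, views: int) -> bool:
    """Proto -> Confirmed rule quoted in the diagram."""
    return hits >= 2 and stability >= 0.55 and views >= 1
```

A detection whose fused score clears a match threshold would update an existing object; otherwise a new proto-object is created.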
```bash
# GPU — permissive license (SAM2 + Grounding DINO, Apache 2.0)
pip install "rtsm[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128

# GPU — ultralytics backends (FastSAM + YOLOE, AGPL-3.0)
pip install "rtsm[gpu-ultralytics]" --extra-index-url https://download.pytorch.org/whl/cu128

# Everything (GPU + visualization)
pip install "rtsm[all]" --extra-index-url https://download.pytorch.org/whl/cu128
```

From source:

```bash
git clone https://github.com/calabi-inc/rtsm.git
cd rtsm
pip install ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
```

Fetch model weights:

```bash
python scripts/fetch_models.py              # all default models (SAM2, GDINO, CLIP)
python scripts/fetch_models.py --only sam2  # or individually
```

**License note:** `rtsm[gpu]` uses only Apache 2.0 / MIT dependencies. `rtsm[gpu-ultralytics]` adds the `ultralytics` package (AGPL-3.0) for FastSAM and YOLOE backends.

**CUDA version:** Use `cu128` for most GPUs (RTX 3080–5090). For Blackwell-only features use `cu130`. See the PyTorch install page for other options.
```bash
# Live (WebSocket, default)
rtsm                                  # starts pipeline + API + visualization

# Live (ZeroMQ) — set io.receiver: zeromq in config/rtsm.yaml first
rtsm

# Replay a recorded session
rtsm --replay recordings/session1

# Record only (no GPU needed — works with core-only install)
rtsm --record recordings/my_session --record-only

# Record while running pipeline
rtsm --record recordings/my_session

# Replay at original rate
rtsm --replay recordings/my_session
```

Recordings are self-contained directories with raw WebSocket data. Replay feeds the exact same bytes through the full pipeline, preserving all time-dependent behavior.
```bash
curl http://localhost:8000/objects                                    # list all objects
curl "http://localhost:8000/search/semantic?query=red%20mug&top_k=5"  # semantic search
curl http://localhost:8000/stats/detailed                             # system stats
curl http://localhost:8000/stats/analytics                            # runtime analytics
```

RTSM supports multiple segmentation backends via `segmentation.backend` in `config/rtsm.yaml`:
| Backend | License | Description | Seg time* | Pipeline total* | Labels |
|---|---|---|---|---|---|
| `grounded_sam2` | Apache 2.0 | Grounding DINO detect + SAM2 segment | 217 ms | 531 ms | Open-vocab (text-prompted) |
| `sam2` | Apache 2.0 | SAM2 auto-mask (segment everything) | ~860 ms | ~1000 ms | None (class-agnostic) |
| `fastsam` | AGPL-3.0 | FastSAM (segment everything) | ~50 ms | ~200 ms | None (class-agnostic) |
| `yoloe` | AGPL-3.0 | YOLOE detection + segmentation | ~60 ms | ~210 ms | Open-vocab / 1200+ built-in |
| `dual` | AGPL-3.0 | FastSAM + YOLOE with IoU merge | 100 ms | 246 ms | Dual-confirmed labels |

\* Mean on RTX 5090, 640x480 input.
Default: `grounded_sam2` — permissive license, open-vocabulary, no AGPL dependency.
```yaml
segmentation:
  backend: grounded_sam2  # or: sam2, fastsam, yoloe, dual
```

`fastsam`, `yoloe`, and `dual` require `pip install "rtsm[gpu-ultralytics]"`.
Benchmarked on RTX 5090 (32 GB), iPhone ARKit recording (240 frames, 76s indoor scene), 640x480 RGB input.
| Metric | dual (FastSAM + YOLOE) | grounded_sam2 (GDINO + SAM2) |
|---|---|---|
| Mean latency | 246 ms | 531 ms |
| P50 latency | 213 ms | 486 ms |
| P95 latency | 604 ms | 942 ms |
| Masks/frame | 25.7 | 11.3 |
| Objects confirmed | 74 | 42 |
| Confirmation rate | 65.4% | 59.2% |
| License | AGPL-3.0 | Apache-2.0 |
Full breakdown: Benchmarks · `reports/backend_comparison.md`
See config/rtsm.yaml for full configuration options:
- Camera intrinsics — focal length, resolution
- I/O endpoints — ZeroMQ addresses for camera and SLAM
- Pipeline tuning — mask filtering, association thresholds
- Memory settings — object promotion, expiry, vector store
```
rtsm/
├── core/            # Pipeline, association, ingest gate, data models
├── models/          # SAM2, Grounding DINO, FastSAM, YOLOE, CLIP adapters
├── stores/          # Working memory, proximity index, sweep cache, vector stores
├── io/              # WebSocket + ZeroMQ receivers, recorder, replayer
├── analytics/       # Runtime analytics (latency, segmentation, congestion)
├── api/             # REST API server (FastAPI)
├── visualization/   # WebSocket server, TSDF fusion, 3D demo
└── utils/           # Mask staging, transforms, helpers
config/
├── rtsm.yaml        # Main configuration
└── clip/vocab.yaml  # CLIP vocabulary
scripts/
├── fetch_models.py         # Download models
├── debug_segmentation.py   # A/B segmentation viewer
└── benchmark_backends.py   # Backend benchmark
```
- Dual-confirmation segmentation (FastSAM + YOLOE)
- AGPL-clean default (SAM2 + Grounding DINO, Apache 2.0)
- YOLOE prompt-free (1200+ LVIS categories)
- WebSocket receiver for Calabi Lens (ARKit iOS)
- Record/replay system for offline testing
- A/B segmentation debug tooling
- Real-time analytics dashboard
- Agent interface (MCP — 6 tools via SSE)
- Evaluation framework (ArUco ground truth, precision/recall)
- More protocols (ROS 2, gRPC)
- LLM integration for high-level queries
- Docker image
RTSM builds on excellent open-source work:
- SAM 2 — Ravi et al., 2024. arXiv:2408.00714 · GitHub
- Grounding DINO — Liu et al., 2023. arXiv:2303.05499 · GitHub
- FastSAM — Zhao et al., 2023. arXiv:2306.12156 · GitHub
- YOLOE — THU-MIG, ICCV 2025. GitHub · Ultralytics
- CLIP — Radford et al., 2021. arXiv:2103.00020 · GitHub
- RTAB-Map — Labbé & Michaud, 2019. Paper · GitHub
```bibtex
@software{chang2025rtsm,
  author = {Chang, Chi Feng},
  title  = {{RTSM}: Real-Time Spatio-Semantic Memory},
  year   = {2025},
  url    = {https://github.com/calabi-inc/rtsm},
  note   = {Object-centric queryable memory for spatial AI and robotics}
}
```

Apache-2.0
Built by Chi Feng Chang
