RTSM — Real-Time Spatial Memory for Robots


RTSM Demo

Turns RGB-D streams into a persistent, queryable 3D object map — objects get stable IDs, 3D positions, CLIP embeddings, and semantic labels, updated in real time.

pip install "rtsm[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128
rtsm demo

246 ms/frame · 74 objects tracked · Apache 2.0 · Python 3.12+ · RTX 3080–5090

Demo Video · Docs · PyPI


What It Does

  • Builds a live 3D object map from RGB-D + pose streams (ARKit, RealSense, or recorded sessions)
  • Assigns persistent IDs to objects across viewpoints and time — not per-frame detection, real tracking
  • Stores spatial, semantic, and temporal metadata per object (position, CLIP embedding, label confidence, view history)
  • Supports semantic + spatial queries (e.g. "red bin near dock 3") via REST API and MCP
  • SLAM-agnostic — sits above any perception stack that provides RGB-D + pose
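
The semantic-query path in the bullets above can be sketched in a few lines of NumPy. This is a hypothetical illustration, not RTSM's actual API: the real pipeline encodes text queries with CLIP, and the 4-D vectors below are stand-ins for the 512-D embeddings the system stores.

```python
import numpy as np

def l2_normalize(v):
    # Unit-normalize along the last axis so cosine similarity = dot product.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in object embeddings (real RTSM stores 512-D CLIP vectors).
object_embs = l2_normalize(np.array([
    [0.9, 0.1, 0.0, 0.1],   # obj_1: e.g. "red bin"
    [0.1, 0.9, 0.1, 0.0],   # obj_2: e.g. "laptop"
    [0.8, 0.2, 0.1, 0.0],   # obj_3: e.g. "red box"
]))
object_ids = ["obj_1", "obj_2", "obj_3"]

def search(query_emb, top_k=2):
    # Cosine similarity against every stored object, then take the top-k.
    sims = object_embs @ l2_normalize(np.asarray(query_emb, dtype=float))
    order = np.argsort(-sims)[:top_k]
    return [(object_ids[i], float(sims[i])) for i in order]

print(search([1.0, 0.0, 0.0, 0.0]))  # best match first
```

The same dot-product-over-unit-vectors trick is what vector stores like FAISS do at scale in the long-term memory layer.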

Try It

rtsm demo runs a pre-recorded 50-frame indoor scene through the full pipeline with 3D visualization:

rtsm demo              # full pipeline + 3D viewer (opens browser)
rtsm demo --no-viz     # headless — API only at localhost:8002

No hardware needed — replay uses a bundled recording.

Try searching for these objects (type in the search bar or use the API): tissue box · doll · laptop · pillow · curtain · lamp · humidifier

The demo clip is only 50 frames. For the full room sweep (240 frames), clone the repo with git lfs install && git clone, then run rtsm --replay recordings/session1.

Watch the full demo on YouTube


Architecture

+--------------------------------------------------------------------------+
|                 RTSM — Real-Time Spatio-Semantic Memory                  |
+--------------------------------------------------------------------------+

  +------------------+   +------------------+   +------------------+
  |  Calabi Lens     |   |  D435i + SLAM    |   |  Recorded        |
  |  (ARKit iOS)     |   |  (RTAB-Map)      |   |  Session         |
  +--------+---------+   +--------+---------+   +--------+---------+
           | WebSocket            | ZeroMQ               | --replay
           v                      v                      v
+--------------------------------------------------------------------------+
|  I/O Layer                                                               |
|                                                                          |
|  +-------------+  +-------------+  +--------------+  +--------------+    |
|  |  WebSocket  |  |  ZMQ Bridge |  |   Replay     |  |  Recorder    |    |
|  |  Receiver   |  |  (sensors)  |  |  Receiver    |  |  (--record)  |    |
|  +------+------+  +------+------+  +------+-------+  +--------------+    |
|         +----------------+----------------+                              |
|                          |                                               |
|                   +------v-------+     +--------------+                  |
|                   | IngestQueue  |---->| FramePacket  |                  |
|                   |  (buffer)    |     | (RGB,D,Pose) |                  |
|                   +--------------+     +------+-------+                  |
|                                               |                          |
+-----------------------------------------------+--------------------------+
                                                |
                          +---------------------v-----------------------+
                          |              Ingest Gate                    |
                          |   (keyframe priority, sweep-based skip)     |
                          +---------------------+-----------------------+
                                                |
+-----------------------------------------------v--------------------------+
|  Perception Pipeline                                                     |
|                                                                          |
|  +----------------+  +----------------+                                  |
|  | Grounding DINO |  |     SAM2       |    Default: grounded_sam2        |
|  | (detection +   |->| (box-prompted  |    GDINO detects -> SAM2 segments|
|  |  labels)       |  |  masks)        |    (Apache 2.0, no AGPL)         |
|  +----------------+  +-------+--------+                                  |
|                              |                                           |
|                              v                                           |
|  +---------------+     +--------------+     +--------------+             |
|  | Mask Staging  |---->| Top-K Select |---->| CLIP Encode  |             |
|  | (heuristics)  |     |  (priority)  |     |(224x224 crop)|             |
|  +---------------+     +--------------+     +------+-------+             |
|                                                    |                     |
|                     +---------------+        +-----v--------+            |
|                     | Vocab Classify|<-------|  Embeddings  |            |
|                     | (label + conf)|        |  (512-D L2)  |            |
|                     +-------+-------+        +--------------+            |
|                             |                                            |
+-----------------------------+--------------------------------------------+
                              |
                              v
+--------------------------------------------------------------------------+
|  Association                                                             |
|                                                                          |
|  +-------------+     +-------------+     +---------------+               |
|  |  Proximity  |---->|  Embedding  |---->|  Score Fusion |               |
|  |   Query     |     |  Cosine Sim |     | (match/create)|               |
|  +-------------+     +-------------+     +---------------+               |
|                                                                          |
+---------------+----------------------------------------------------------+
                |
                v
+--------------------------------------------------------------------------+
|  Working Memory                                                          |
|                                                                          |
|  ObjectState:                                                            |
|    - id, xyz_world (3D position)                                         |
|    - emb_mean, emb_gallery (CLIP embeddings)                             |
|    - view_bins (multi-view fusion)                                       |
|    - label_scores (EWMA label confidence)                                |
|    - stability, hits, confirmed                                          |
|    - image_crops (JPEG snapshots)                                        |
|                                                                          |
|  Proto -> Confirmed (hits >= 2, stability >= 0.55, views >= 1)           |
|                                                                          |
+---------------+----------------------------------------------------------+
                |
                v
+--------------------------------------------------------------------------+
|  Long-Term Memory (FAISS / Milvus)                                       |
|                                                                          |
|  Semantic Search: query(text) -> CLIP -> top-k objects                   |
|                                                                          |
+---------------+----------------------------------------------------------+
                |
                v
+--------------------------------------------------------------------------+
|  API & Visualization                                                     |
|                                                                          |
|  +-----------------+  +-----------------+  +-----------------+           |
|  |    REST API     |  |    WebSocket    |  |     3D Demo     |           |
|  |    /objects     |  |  point clouds   |  |    (Three.js)   |           |
|  |    /search      |  |  objects_update |  |                 |           |
|  +-----------------+  +-----------------+  +-----------------+           |
|                                                                          |
+--------------------------------------------------------------------------+
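
The Association and Working Memory stages above amount to a gate-then-fuse rule: candidates are fetched by proximity, scored by embedding similarity, then matched or created, and tracks are promoted from proto to confirmed once they accumulate evidence. A minimal sketch; the field names and the 0.5 m / 0.7 thresholds are illustrative guesses, while the promotion rule (hits >= 2, stability >= 0.55, views >= 1) is taken from the diagram:

```python
import math

def cosine(a, b):
    # Plain cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def associate(detection, tracks, radius=0.5, sim_thresh=0.7):
    """Return the best matching track, or None to signal 'create a new one'."""
    best, best_sim = None, sim_thresh
    for t in tracks:
        # Proximity gate: skip tracks farther than `radius` metres away.
        if math.dist(detection["xyz"], t["xyz_world"]) > radius:
            continue
        sim = cosine(detection["emb"], t["emb_mean"])
        if sim > best_sim:
            best, best_sim = t, sim
    return best

def confirmed(track):
    # Promotion rule from the Working Memory box in the diagram.
    return track["hits"] >= 2 and track["stability"] >= 0.55 and track["views"] >= 1

track = {"id": 1, "xyz_world": (1.0, 0.0, 0.5), "emb_mean": (1.0, 0.0),
         "hits": 2, "stability": 0.6, "views": 1}
det = {"xyz": (1.1, 0.0, 0.5), "emb": (0.9, 0.1)}
print(associate(det, [track]) is track, confirmed(track))
```

The real pipeline fuses the distance and similarity scores rather than gating then taking the argmax, but the structure of the decision is the same.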

Installation

From PyPI (recommended)

# GPU — permissive license (SAM2 + Grounding DINO, Apache 2.0)
pip install "rtsm[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128

# GPU — ultralytics backends (FastSAM + YOLOE, AGPL-3.0)
pip install "rtsm[gpu-ultralytics]" --extra-index-url https://download.pytorch.org/whl/cu128

# Everything (GPU + visualization)
pip install "rtsm[all]" --extra-index-url https://download.pytorch.org/whl/cu128

From Source

git clone https://github.com/calabi-inc/rtsm.git
cd rtsm
pip install ".[gpu]" --extra-index-url https://download.pytorch.org/whl/cu128

Download Models

python scripts/fetch_models.py                # all default models (SAM2, GDINO, CLIP)
python scripts/fetch_models.py --only sam2    # or individually

License note: rtsm[gpu] uses only Apache 2.0 / MIT dependencies. rtsm[gpu-ultralytics] adds the ultralytics package (AGPL-3.0) for FastSAM and YOLOE backends.

CUDA version: Use cu128 for most GPUs (RTX 3080–5090). For Blackwell-only features use cu130. See PyTorch install for other options.


Usage

Live — iPhone (ARKit over WebSocket)

rtsm                   # starts pipeline + API + visualization

Live — RealSense D435i + RTAB-Map

# Set io.receiver: zeromq in config/rtsm.yaml first
rtsm

Replay a Recorded Session

rtsm --replay recordings/session1

Record & Replay

# Record only (no GPU needed — works with core-only install)
rtsm --record recordings/my_session --record-only

# Record while running pipeline
rtsm --record recordings/my_session

# Replay at original rate
rtsm --replay recordings/my_session

Recordings are self-contained directories with raw WebSocket data. Replay feeds the exact same bytes through the full pipeline, preserving all time-dependent behavior.
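
Timestamp-faithful replay of raw bytes can be sketched like this. The reader below is hypothetical (RTSM's actual recording format is its own); it only illustrates the idea of pacing stored payloads by their original inter-frame gaps:

```python
import time

def replay(records, speed=1.0):
    """Yield recorded payloads, sleeping so inter-frame gaps match the recording.

    `records` is a list of (t_offset_seconds, payload_bytes) pairs, a stand-in
    for RTSM's raw WebSocket capture.
    """
    start = time.monotonic()
    for t_offset, payload in records:
        # Sleep just long enough to reach this frame's original offset.
        delay = t_offset / speed - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        yield payload

records = [(0.00, b"frame-0"), (0.01, b"frame-1"), (0.02, b"frame-2")]
out = list(replay(records))
print(len(out))
```

Because the consumer sees the exact same bytes at (approximately) the exact same times, rate-dependent behavior like queue backpressure and the ingest gate's sweep-based skipping is reproduced offline.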

API

curl http://localhost:8000/objects                                    # list all objects
curl "http://localhost:8000/search/semantic?query=red%20mug&top_k=5"  # semantic search
curl http://localhost:8000/stats/detailed                             # system stats
curl http://localhost:8000/stats/analytics                            # runtime analytics

Segmentation Backends

RTSM supports multiple segmentation backends via segmentation.backend in config/rtsm.yaml:

Backend         License      Description                            Seg time*   Pipeline total*   Labels
grounded_sam2   Apache 2.0   Grounding DINO detect + SAM2 segment   217 ms      531 ms            Open-vocab (text-prompted)
sam2            Apache 2.0   SAM2 auto-mask (segment everything)    ~860 ms     ~1000 ms          None (class-agnostic)
fastsam         AGPL-3.0     FastSAM (segment everything)           ~50 ms      ~200 ms           None (class-agnostic)
yoloe           AGPL-3.0     YOLOE detection + segmentation         ~60 ms      ~210 ms           Open-vocab / 1200+ built-in
dual            AGPL-3.0     FastSAM + YOLOE with IoU merge         100 ms      246 ms            Dual-confirmed labels

* Mean on RTX 5090, 640x480 input.

Default: grounded_sam2 — permissive license, open-vocabulary, no AGPL dependency.

segmentation:
  backend: grounded_sam2    # or: sam2, fastsam, yoloe, dual

fastsam, yoloe, and dual require pip install "rtsm[gpu-ultralytics]".


Performance

Benchmarked on RTX 5090 (32 GB), iPhone ARKit recording (240 frames, 76s indoor scene), 640x480 RGB input.

Metric              dual (FastSAM + YOLOE)   grounded_sam2 (GDINO + SAM2)
Mean latency        246 ms                   531 ms
P50 latency         213 ms                   486 ms
P95 latency         604 ms                   942 ms
Masks/frame         25.7                     11.3
Objects confirmed   74                       42
Confirmation rate   65.4%                    59.2%
License             AGPL-3.0                 Apache-2.0

Full breakdown: Benchmarks | reports/backend_comparison.md
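
Mean per-frame latency translates directly into pipeline throughput, which determines how stale the map can get during a sweep. A quick back-of-envelope using the table's numbers:

```python
# Frames per second implied by mean per-frame latency (values from the table above).
for name, mean_ms in [("dual", 246), ("grounded_sam2", 531)]:
    fps = 1000 / mean_ms
    print(f"{name}: {fps:.1f} fps")
```

So dual sustains roughly twice the frame rate of grounded_sam2, at the cost of its AGPL dependency.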


Configuration

See config/rtsm.yaml for full configuration options:

  • Camera intrinsics — focal length, resolution
  • I/O endpoints — ZeroMQ addresses for camera and SLAM
  • Pipeline tuning — mask filtering, association thresholds
  • Memory settings — object promotion, expiry, vector store
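
As an illustration, those sections might map onto a fragment like this. Only segmentation.backend and io.receiver appear verbatim elsewhere in this README; the remaining keys are guesses at the schema, not copied from config/rtsm.yaml:

```yaml
segmentation:
  backend: grounded_sam2      # or: sam2, fastsam, yoloe, dual
io:
  receiver: websocket         # or: zeromq (RealSense + RTAB-Map)
# Hypothetical keys below, shown only to indicate the kind of knobs available:
association:
  distance_threshold_m: 0.5
memory:
  promote_min_hits: 2
```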

Project Structure

rtsm/
├── core/           # Pipeline, association, ingest gate, data models
├── models/         # SAM2, Grounding DINO, FastSAM, YOLOE, CLIP adapters
├── stores/         # Working memory, proximity index, sweep cache, vector stores
├── io/             # WebSocket + ZeroMQ receivers, recorder, replayer
├── analytics/      # Runtime analytics (latency, segmentation, congestion)
├── api/            # REST API server (FastAPI)
├── visualization/  # WebSocket server, TSDF fusion, 3D demo
└── utils/          # Mask staging, transforms, helpers
config/
├── rtsm.yaml        # Main configuration
└── clip/vocab.yaml  # CLIP vocabulary
scripts/
├── fetch_models.py          # Download models
├── debug_segmentation.py    # A/B segmentation viewer
└── benchmark_backends.py    # Backend benchmark

Roadmap

  • Dual-confirmation segmentation (FastSAM + YOLOE)
  • AGPL-clean default (SAM2 + Grounding DINO, Apache 2.0)
  • YOLOE prompt-free (1200+ LVIS categories)
  • WebSocket receiver for Calabi Lens (ARKit iOS)
  • Record/replay system for offline testing
  • A/B segmentation debug tooling
  • Real-time analytics dashboard
  • Agent interface (MCP — 6 tools via SSE)
  • Evaluation framework (ArUco ground truth, precision/recall)
  • More protocols (ROS 2, gRPC)
  • LLM integration for high-level queries
  • Docker image

Acknowledgments

RTSM builds on excellent open-source work, including SAM2, Grounding DINO, CLIP, FastSAM, and YOLOE.


Cite

@software{chang2025rtsm,
  author       = {Chang, Chi Feng},
  title        = {{RTSM}: Real-Time Spatio-Semantic Memory},
  year         = {2025},
  url          = {https://github.com/calabi-inc/rtsm},
  note         = {Object-centric queryable memory for spatial AI and robotics}
}

License

Apache-2.0


Built by Chi Feng Chang
