Merge pull request #35 from tokk-nv/fix/asr-stream-resilience-and-riva-docs

tokk-nv · web-flow · commit 4cc4e38927d0 · 2026-03-14T14:33:17.000-07:00
fix(asr): auto-restart stream on unexpected death, prevent pipeline crash
diff --git a/docs/development/todo_asr_stream_stale_after_mute.md b/docs/development/todo_asr_stream_stale_after_mute.md
@@ -1,80 +1,90 @@
-# ASR stream goes stale after mute/unmute or long silence
+# ASR stream goes stale / dies prematurely
 
-After muting then unmuting the mic (or after a prolonged period where no speech reaches Riva), the ASR stream silently stops producing results even though PCM audio is still flowing.
+The Riva ASR gRPC stream can die mid-session — silently stopping result production or terminating entirely. This happens in multiple scenarios: after mute/unmute, after long idle periods, or even during normal operation with 0 results.
 
 ## Observed behavior
 
+### Scenario 1: Stale after long LLM block + mute/unmute
 - Session `c87be1b2` (2026-03-14): 9 turns completed successfully.
 - Turn 8 triggered a degenerate LLM reasoning loop (10,101 chars, **91.89 s** wall-clock).
 - During that wait, the user muted and later unmuted the mic.
-- After turn 9 completed, no further `asr_final` events appeared for ~1.5 min despite the green **user_amplitude** waveform being visible on the timeline (PCM capture was healthy).
-- Terminal showed no ASR errors; the stream ended normally at session close with `Stream task timeout, cancelling`.
+- After turn 9 completed, no further `asr_final` events appeared for ~1.5 min despite the green **user_amplitude** waveform being visible on the timeline.
+- Terminal showed no ASR errors; the stream ended normally at session close.
+
+### Scenario 2: Stream dies after ~2 min (normal operation, no mute)
+- Session on jat-4cbb47141bb7 (2026-03-14): 3 turns completed in ~2 min.
+- After turn 3, Riva ASR stream ended with 16 results total.
+- `_feed_pcm_to_pipeline` continued sending PCM but `send_audio()` raised `RuntimeError: Stream not started` — **crashing the entire pipeline**.
+
+### Scenario 3: Stream dies with 0 results (~23s)
+- Session `f9748641` on same device: ASR stream started, 0 results received, stream ended after ~23s.
+- Same `RuntimeError` crash.
+
+### Scenario 4: USB contention with Brio 4K camera
+- On Jetsons with Brio 4K (USB 3.0) + USB audio, severe bus contention causes:
+  - Camera `VIDIOC_REQBUFS: errno=19 (No such device)` — camera disappears from bus.
+  - `arecord: audio open error: Device or resource busy` — audio device locked by previous pipeline.
+  - ASR stream dies with 0 results; pipeline crashes before user even speaks.
 
 ## Why amplitude shows but ASR does not
 
-In `_feed_pcm_to_pipeline`, amplitude is always computed and sent to the client (lines 1005-1024) regardless of `mic_muted`. The ASR send is gated:
+In `_feed_pcm_to_pipeline`, amplitude is always computed and sent to the client regardless of `mic_muted`. The ASR send is gated:
 
 ```python
 if not mic_muted:
-    await asr.send_audio(pcm_bytes)
+    accepted = await asr.send_audio(pcm_bytes)
 ```
 
-So the timeline waveform looks alive, but if the Riva gRPC stream has internally timed out (or VAD state has gone stale after 90+ seconds of silence/mute), newly sent audio produces no results.
+So the timeline waveform looks alive, but if the Riva gRPC stream has internally timed out or died, newly sent audio is silently dropped (before this change, `send_audio()` raised `RuntimeError` and crashed the pipeline).
 
-## Probable root cause (needs confirmation)
+## Root causes
 
 Riva Streaming ASR has internal session limits:
 - **gRPC keepalive / idle timeout**: if no audio is sent for an extended period the server may silently close the stream.
-- **VAD state**: after a long silence gap, the VAD model may reset or require a fresh trigger to start detecting speech again.
-- **Maximum session duration**: Riva may cap single-stream duration; after that, the stream yields no more results even though it stays open.
-
-The exact Riva behavior here is unconfirmed — the stream appeared open (no error logged) but stopped producing finals.
+- **VAD state**: after a long silence gap, the VAD model may reset or require a fresh trigger.
+- **Maximum session duration**: Riva may cap single-stream duration (~2 min observed); after that, the stream yields no more results.
+- **USB bus contention**: on Jetson devices with multiple USB peripherals (especially high-bandwidth cameras like Brio 4K), the audio device can become temporarily unavailable, preventing `arecord` from opening.
 
-## What is already in place
+## Implemented fixes (2026-03-14)
 
-- `mic_muted` gates `asr.send_audio()` in the classic pipeline (line 1003).
-- On mute, 0.5 s of silence is injected (`b"\x00" * int(16000 * 2 * 0.5)`) to flush any pending VAD partial (line 1041-1044).
-- On unmute, `mic_muted = False` resumes sending PCM to ASR.
-- No stream-health monitoring or automatic restart exists today.
-
-## Proposed solutions (pick one or combine)
-
-### Option A: Keep-alive noise during mute
+### Fix 1: Graceful `send_audio` (riva.py)
+`send_audio()` now returns `bool` instead of raising `RuntimeError` when the stream is dead. Returns `False` if `_sync_audio_queue` is `None`, allowing the PCM feeder to continue without crashing.
 
-While `mic_muted` is True, instead of sending nothing, send **very low amplitude white noise** (e.g., ±10 out of ±32768) at normal cadence. This keeps the gRPC stream active and the VAD model warm without triggering false speech detection.
+### Fix 2: Log-once warning (_feed_pcm_to_pipeline)
+When `send_audio` returns `False`, a warning is logged once per dead-stream episode: `[asr] send_audio dropped — ASR stream not active (waiting for auto-restart)`. The flag resets on stream restart.
 
-Pros: Simplest change; no stream lifecycle management. \
-Cons: Assumes the Riva stream itself is still healthy; does not help if the stream has a hard session-duration cap.
+### Fix 3: Auto-restart in asr_consumer (voice_pipeline.py)
+`asr_consumer` now wraps the `async for result in asr.receive_results()` loop in a `while not stopped.is_set()` loop. When the inner iterator ends (stream died) and the pipeline is still running:
+1. Increments restart counter (max 10).
+2. Logs a WARNING with result count and restart number.
+3. Emits `asr_stream_restart` timeline event.
+4. Calls `asr.stop_stream()` → sleep with linear backoff, `min(2 s × restart count, 10 s)` (2s, 4s, 6s, 8s, then 10s) → `asr.start_stream()`.
+5. Resets the `send_audio` log-once flag and result counter.
+6. Re-enters the `async for` loop on the fresh stream.
 
-### Option B: Restart ASR stream after stale timeout
+## Remaining work
 
-Monitor elapsed time since the last `asr_final`. If no final arrives within a configurable window (e.g., 60 s while unmuted), tear down the current `RivaASRBackend` stream and create a fresh one.
-
-1. Track `_last_asr_final_time` in the turn executor; update it on every `asr_final`.
-2. In `server_capture_consumer` (or a watchdog task), check `time.time() - _last_asr_final_time > ASR_STALE_TIMEOUT`.
-3. If stale and `not mic_muted`: call `asr.stop()`, then `asr.start()` to open a fresh streaming session.
-4. Log `[asr] Stream restarted after stale timeout` at WARNING level.
-
-Pros: Covers all root causes (idle timeout, VAD reset, session-duration cap). \
-Cons: Slightly more complex; brief gap in ASR coverage during restart (~200 ms).
-
-### Option C: Proactive stream rotation
-
-After every turn (or every N turns), close and re-open the ASR stream. This preempts any session-duration limit and keeps the stream fresh.
-
-Pros: Eliminates stale state entirely. \
-Cons: Adds latency at turn boundaries; may lose a partial if speech is ongoing during rotation.
+### Option A: Keep-alive noise during mute
+While `mic_muted` is True, send very low amplitude white noise (e.g., ±10 out of ±32768) at normal cadence. This keeps the gRPC stream active and the VAD model warm.
 
-## Recommendation
+Pros: Prevents idle timeout during mute. \
+Cons: Does not help if the stream has a hard session-duration cap (auto-restart now covers that case).
 
-**Option A + B combined**: send keep-alive noise during mute (A) to prevent idle timeout, and add a stale-timeout watchdog (B) as a safety net for unexpected stream failures. Option C is heavier and only needed if Riva has a hard session cap that A+B cannot address.
+### Device contention mitigations
+- Investigate separating camera and audio onto different USB host controllers.
+- Consider CSI camera instead of USB to free USB bandwidth entirely.
+- Current `arecord` retry logic (8 attempts with backoff) helps, but persistent `Device or resource busy` across all retries indicates the previous pipeline's `arecord` process was not killed before the new one started.
 
-## Diagnosis checklist (before implementing)
+## Diagnosis checklist
 
-- [ ] Confirm Riva Streaming ASR session limits: check `riva_asr` service config for `max_duration_seconds`, keepalive settings, or gRPC deadline.
-- [ ] Add a log line in `RivaASRBackend` when the gRPC response iterator ends (to distinguish "server closed stream" from "no results but stream open").
-- [ ] Reproduce by muting for 60+ s mid-session and verifying ASR stops producing results on unmute.
+- [x] Add auto-restart when gRPC stream ends unexpectedly.
+- [x] Make `send_audio` graceful (no crash on dead stream).
+- [x] Emit timeline events for stream restarts.
+- [ ] Confirm Riva session limits: check `riva_asr` config for `max_duration_seconds`, keepalive settings, or gRPC deadline.
+- [ ] Implement keep-alive noise during mute (Option A).
+- [ ] Investigate cleanup of old `arecord` processes on WebSocket reconnect.
 
 ## Effort
 
-**Small–Medium**: Option A is ~30 min (noise generator in `_feed_pcm_to_pipeline`). Option B is ~1–2 hours (watchdog task + stream restart plumbing + tests).
+**Done**: Auto-restart (Fix 3) + graceful send_audio (Fix 1). \
+**Remaining**: Keep-alive noise ~30 min. Device cleanup investigation ~1–2 hours.
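
Review note on the restart delay: the `asr_consumer` change in this commit computes the delay as `min(2.0 * _asr_restart_count, 10.0)` with `_MAX_ASR_RESTARTS = 10`. The sketch below tabulates that schedule so the worst-case wait is easy to see. The helper name `restart_backoff` and its parameters are hypothetical and exist only for illustration; they are not part of the codebase.

```python
def restart_backoff(restart_count: int, step_s: float = 2.0, cap_s: float = 10.0) -> float:
    """Delay before the Nth restart, mirroring min(2.0 * n, 10.0) from the diff."""
    return min(step_s * restart_count, cap_s)


MAX_ASR_RESTARTS = 10  # same limit as _MAX_ASR_RESTARTS in the diff

# One delay per restart attempt; the 11th stream death gives up without sleeping.
schedule = [restart_backoff(n) for n in range(1, MAX_ASR_RESTARTS + 1)]
total_wait_s = sum(schedule)

print(schedule)      # per-restart delays in seconds
print(total_wait_s)  # cumulative sleep across all allowed restarts
```

The delay grows by a fixed 2 s step until it reaches the cap, so a session that exhausts all ten restarts sleeps for 80 s in total, not counting the time each stream stays alive.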
diff --git a/docs/setup_riva.md b/docs/setup_riva.md
@@ -268,6 +268,7 @@ riva_model_loc="riva-model-repo"  # Docker volume (default)
 # Language/model selection
 asr_acoustic_model="parakeet_1.1b"  # Default for ARM64 v2.24.0
 asr_language_code="en-US"           # ASR language
+asr_accessory_model="silero_diarizer"  # Adds Silero VAD + speaker diarization
 use_asr_streaming_throughput_mode=false  # false=low latency (recommended)
 
 tts_language_code=("multi")           # TTS language
@@ -280,9 +281,17 @@ tts_language_code=("multi")           # TTS language
 - Language codes available: `en-US`, `multi` (multilingual)
 - Pre-optimized for Jetson GPUs (no build step required)
 
+**ASR accessory model** (`asr_accessory_model`):
+- Set to `"silero_diarizer"` to deploy with **Silero VAD** and speaker diarization
+- This makes the `parakeet-1.1b-en-US-asr-streaming-silero-vad-sortformer` model available alongside the base `parakeet-1.1b-en-US-asr-streaming`
+- The Silero VAD variant provides better voice activity detection — without it, the base model often clips the beginning of utterances (e.g., "How many monitors do you see?" becomes "monitors do you see") because its default VAD reacts too late to speech onset
+- Only available when `asr_acoustic_model` is `"parakeet_1.1b"`
+- After changing this setting, re-run `riva_init.sh` and `riva_start.sh`
+
 **For Multi-modal AI Studio and Live RIVA WebUI**, recommended settings:
 - Enable ASR + TTS only (NLP/NMT not needed)
 - Use default `parakeet_1.1b` for ASR (best quality/latency balance)
+- Set `asr_accessory_model="silero_diarizer"` for Silero VAD support
 - Keep `use_asr_streaming_throughput_mode=false` for real-time voice apps
 - SSL/TLS can be added later for production deployments
 
diff --git a/src/multi_modal_ai_studio/backends/asr/riva.py b/src/multi_modal_ai_studio/backends/asr/riva.py
@@ -174,19 +174,21 @@ async def start_stream(self) -> None:
 
         self.logger.info("Riva ASR stream started")
 
-    async def send_audio(self, audio_chunk: bytes) -> None:
+    async def send_audio(self, audio_chunk: bytes) -> bool:
         """Send audio chunk for recognition.
 
         Args:
             audio_chunk: Raw PCM audio bytes (16kHz, 16-bit, mono)
 
-        Raises:
-            RuntimeError: If stream not started
+        Returns:
+            True if audio was queued, False if stream is not active (caller should
+            not treat this as fatal — the stream may be restarting).
         """
         if self._sync_audio_queue is None:
-            raise RuntimeError("Stream not started. Call start_stream() first.")
+            return False
 
         self._sync_audio_queue.put(audio_chunk)
+        return True
 
     async def receive_results(self) -> AsyncIterator[ASRResult]:
         """Yield recognition results as they become available.
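
Review note on the new `send_audio()` contract: a minimal, self-contained sketch of how a caller might consume the `bool` return value and warn only once per dead-stream episode. `FakeASR` and `DeadStreamWarner` are hypothetical stand-ins written for this note. Only the `_sync_audio_queue is None` check mirrors the diff, and the per-object flag is a variation on the function-attribute flag that `_feed_pcm_to_pipeline` uses in this commit.

```python
import asyncio
import logging
import queue
from typing import List, Optional, Tuple

logger = logging.getLogger("asr_sketch")


class FakeASR:
    """Hypothetical backend; only the queue check mirrors riva.py in this diff."""

    def __init__(self) -> None:
        self._sync_audio_queue: Optional[queue.Queue] = None

    async def start_stream(self) -> None:
        self._sync_audio_queue = queue.Queue()

    async def send_audio(self, audio_chunk: bytes) -> bool:
        if self._sync_audio_queue is None:
            return False  # stream not active; caller keeps running
        self._sync_audio_queue.put(audio_chunk)
        return True


class DeadStreamWarner:
    """Log-once flag held per object (assumption: one instance per session)."""

    def __init__(self) -> None:
        self.warned = False
        self.warnings_logged = 0

    def note(self, accepted: bool) -> None:
        if not accepted and not self.warned:
            self.warned = True
            self.warnings_logged += 1
            logger.warning("[asr] send_audio dropped; ASR stream not active")

    def reset(self) -> None:
        self.warned = False


async def demo() -> Tuple[List[bool], int]:
    asr, warner = FakeASR(), DeadStreamWarner()
    chunk = b"\x00" * 320  # 10 ms of 16 kHz, 16-bit mono silence
    results = []
    for _ in range(3):  # stream not started yet: every chunk is dropped
        accepted = await asr.send_audio(chunk)
        warner.note(accepted)
        results.append(accepted)
    await asr.start_stream()  # stands in for the auto-restart path
    warner.reset()
    results.append(await asr.send_audio(chunk))
    return results, warner.warnings_logged


results, warnings_logged = asyncio.run(demo())
```

Keeping the flag on a per-session object is one way to avoid sharing warning state between concurrent sessions; whether that matters here depends on how many pipelines run in one process.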
diff --git a/src/multi_modal_ai_studio/webui/voice_pipeline.py b/src/multi_modal_ai_studio/webui/voice_pipeline.py
@@ -1001,7 +1001,10 @@ async def _feed_pcm_to_pipeline(
             return (last_amplitude_time, False, 0.0, 0.0)
         now = time.time() - session.timeline.start_time
         if not mic_muted:
-            await asr.send_audio(pcm_bytes)
+            accepted = await asr.send_audio(pcm_bytes)
+            if not accepted and not getattr(_feed_pcm_to_pipeline, "_warned_dead_stream", False):
+                _feed_pcm_to_pipeline._warned_dead_stream = True
+                logger.warning("[asr] send_audio dropped — ASR stream not active (waiting for auto-restart)")
         amplitudes = _pcm_rms_slices(pcm_bytes, sample_rate=16000, window_s=_amplitude_window_s)
         did_send = False
         amp = 0.0
@@ -1151,13 +1154,19 @@ async def receive_loop() -> None:
     async def asr_consumer() -> None:
         """Independent ASR task: forward every partial/final to client immediately; enqueue finals for turn_executor.
         Enables barge-in (turn_executor can be cancelled when new final arrives) and avoids phantom partial at tts_complete.
-        On stream end, if we had a partial but no final (e.g. user stopped before VAD), enqueue a synthetic final so one turn runs."""
+        On stream end, if we had a partial but no final (e.g. user stopped before VAD), enqueue a synthetic final so one turn runs.
+
+        Auto-restart: if Riva's gRPC stream dies (idle timeout, server-side limit, or bus contention)
+        and the pipeline has not been stopped, the stream is restarted with linear backoff (2 s per restart, capped at 10 s)."""
         last_asr_final_text: Optional[str] = None
         last_asr_final_ts: Optional[float] = None
         last_partial_text: Optional[str] = None
         last_partial_ts: Optional[float] = None
         asr_received_count = 0
+        _MAX_ASR_RESTARTS = 10
+        _asr_restart_count = 0
         try:
+          while not stopped.is_set():
             async for result in asr.receive_results():
                 if stopped.is_set():
                     break
@@ -1242,16 +1251,51 @@ async def asr_consumer() -> None:
                     asr_consumer._finals_count = finals_count
                     logger.info("[asr] asr_final #%d enqueued for LLM/TTS: %r", finals_count, text[:80])
                     finals_queue.put_nowait(result)
+
+            # --- Inner async-for ended (stream died or returned None) ---
+            if stopped.is_set():
+                break
+
+            _asr_restart_count += 1
+            if _asr_restart_count > _MAX_ASR_RESTARTS:
+                logger.error("[asr] Exceeded max restarts (%d); giving up", _MAX_ASR_RESTARTS)
+                break
+
+            backoff = min(2.0 * _asr_restart_count, 10.0)
+            logger.warning(
+                "[asr] Stream died after %d result(s); restarting (%d/%d) in %.1fs",
+                asr_received_count, _asr_restart_count, _MAX_ASR_RESTARTS, backoff,
+            )
+            try:
+                now_ts = (time.time() - session.timeline.start_time) if session.timeline.start_time else 0
+                session.timeline.add_event(
+                    "asr_stream_restart", Lane.SPEECH,
+                    data={"restart": _asr_restart_count, "prev_results": asr_received_count},
+                )
+                await send_event({
+                    "event_type": "asr_stream_restart",
+                    "lane": "speech",
+                    "data": {"restart": _asr_restart_count, "prev_results": asr_received_count},
+                    "timestamp": now_ts,
+                })
+            except Exception:
+                logger.debug("[asr] Failed to emit asr_stream_restart event", exc_info=True)
+
+            await asr.stop_stream()
+            await asyncio.sleep(backoff)
+            if stopped.is_set():
+                break
+            await asr.start_stream()
+            _feed_pcm_to_pipeline._warned_dead_stream = False
+            asr_received_count = 0
+            logger.info("[asr] Stream restarted successfully (%d/%d)", _asr_restart_count, _MAX_ASR_RESTARTS)
+          # --- end while ---
         except asyncio.CancelledError:
             pass
         except Exception as e:
             logger.exception("asr_consumer error: %s", e)
         finally:
-            logger.info("[asr] Stream ended; received %d ASR result(s) total", asr_received_count)
-            # Only create a synthetic final when the stream had no final at all (e.g. user stopped before
-            # VAD sent a final). Do NOT create one when we already had a final and the last partial is
-            # different (e.g. Riva sent early final "How about computer?" then partials "joke") — that
-            # would create a phantom extra turn; the partial is the tail of the same utterance.
+            logger.info("[asr] Stream ended; received %d ASR result(s) since last restart (restarts=%d)", asr_received_count, _asr_restart_count)
             if last_partial_text and last_asr_final_text is None:
                 try:
                     now_ts = (time.time() - session.timeline.start_time) if session.timeline.start_time else 0
@@ -1264,9 +1308,7 @@ async def asr_consumer() -> None:
                     )
                     logger.info("[asr] Stream ended with only partial; enqueueing synthetic final for LLM/TTS: %r", last_partial_text[:80])
                     finals_queue.put_nowait(synthetic)
-                    # Add to timeline so replay/saved session has this final (use partial's time, not stream-end)
                     session.timeline.add_event("asr_final", Lane.SPEECH, data={"text": last_partial_text, "confidence": 1.0})
-                    # Send asr_final to client so UI shows final_transcript (otherwise only partials were sent)
                     await send_event({
                         "event_type": "asr_final",
                         "lane": "speech",