
[google] RealtimeModel + external VAD: generate_reply() conflicts with activity-based audio flow, STT transcript discarded #5408

@Robert-Kung

Description


Bug Description

When using google.realtime.RealtimeModel for multi-turn voice conversations, I observed response latency escalating from ~1s to 20–50s as conversations progress. Attempting to mitigate this by switching to external VAD (Silero + automatic_activity_detection.disabled=True) made things worse — introducing repeated generate_reply timed out errors on top of the existing latency escalation.

Investigation revealed two independent SDK-side issues:

Bug 1 — generate_reply() conflicts with the activity-based audio flow:

When external VAD is used (automatic_activity_detection.disabled=True), the SDK calls generate_reply() after Gemini has already begun processing audio delivered via activity_start/end signals. This injects a redundant ActivityEnd + a "." placeholder turn, causing a state conflict and a 5-second timeout.

The signal flow conflict:

```text
1. User speaks → Silero VAD → activity_start → push_audio → activity_end
   → Gemini receives audio via activity signals, BEGINS PROCESSING

2. STT + EOU completes → on_user_turn_completed → SDK default path:
   → commit_audio()         ← no-op in Google plugin (L1380), just logs warning
   → _generate_reply()      ← user_message=None (see Bug 2)
   → _realtime_reply_task() → rt_session.generate_reply():
     a. Sends ActivityEnd    ← REDUNDANT, already sent in step 1
     b. Sends send_client_content(text=".", role="user", turn_complete=True)
        ← CONFLICTS with step 1: Gemini is mid-processing the audio turn
     c. Waits 5s for generation_created event ← NEVER FIRES → timeout
```
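The conflict can be modeled with a toy state machine. This is a sketch, not SDK code: `FakeGeminiSession` is a made-up class that only encodes the assumption (from the observed behavior) that a mid-turn `client_content` with `turn_complete=True` does not start a second generation while an audio turn delivered via activity signals is already being processed.

```python
class FakeGeminiSession:
    """Toy model of the server-side behavior described above (an assumption,
    not the real Gemini Live API)."""

    def __init__(self):
        self.processing_audio_turn = False
        self.events = []

    def activity_start(self):
        self.processing_audio_turn = True

    def activity_end(self):
        # Gemini begins generating from the buffered audio turn
        self.events.append("generation_created(audio)")

    def send_client_content(self, text, turn_complete=True):
        if self.processing_audio_turn:
            return  # conflicts with the in-flight audio turn: ignored
        self.events.append("generation_created(text)")


sess = FakeGeminiSession()
sess.activity_start()          # step 1: external VAD path
sess.activity_end()
sess.send_client_content(".")  # step 2: SDK default path injects placeholder
# The SDK now waits 5 s for a *second* generation_created that never arrives.
assert sess.events == ["generation_created(audio)"]
```

Only one `generation_created` ever fires, which is why the SDK's 5-second wait for its own placeholder-triggered event always times out.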

Code references (v1.5.2):

  • realtime_api.py L730–733: generate_reply() sends ActivityEnd when _in_user_activity
  • realtime_api.py L743–745: sends send_client_content(text=".", turn_complete=True)
  • realtime_api.py L748–753: 5-second timeout → generate_reply timed out error

Bug 2 — STT transcript unconditionally discarded:

After on_user_turn_completed returns, the SDK hardcodes:

```python
# agent_activity.py L1832
if isinstance(self.llm, llm.RealtimeModel):
    user_message = None  # type: ignore
```

This discards the STT transcript unconditionally. _generate_reply() receives user_message=None, and _realtime_reply_task passes user_input=None — the full STT transcript is never forwarded to Gemini.

This means there is no supported way to use external STT text as input to a RealtimeModel, which blocks the most effective workaround for Gemini's audio token accumulation problem (server-side issue tracked separately via Google Issue Tracker #493438050).

Expected Behavior

  1. When external VAD is used with RealtimeModel, generate_reply() should not conflict with Gemini's activity-based audio processing — no redundant ActivityEnd or "." placeholder should be injected while Gemini is mid-processing an audio turn.

  2. When external STT is configured alongside a RealtimeModel, the SDK should provide an option to forward the STT transcript to the realtime model instead of unconditionally discarding it.

Reproduction Steps

**Step 1 — Observe latency escalation with native Gemini VAD:**


```python
session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # automatic_activity_detection defaults to True (native Gemini VAD)
    ),
)
```


Conduct a 5+ turn voice conversation. Response latency escalates (measured from OpenTelemetry traces: `user_speaking` END → `agent_speaking` START):

| Session | Turn gaps (s) | Avg | Max |
|---|---|---|---|
| Session A | 13.3 → 1.2 → 6.4 → **22.8** → **30.6** → 5.2 → 2.8 | 11.8 | 30.6 |
| Session B | 7.5 → 1.1 → **22.6** → 1.1 → **26.5** | 11.7 | 26.5 |
| Session C | 2.6 → 12.5 → 1.0 → 17.9 → 1.2 → 16.8 → **49.8** → 1.1 | 12.9 | 49.8 |

**Step 2 — Switch to external VAD → `generate_reply timed out`:**

Following the [Gemini turn detection docs](https://docs.livekit.io/agents/models/realtime/plugins/gemini/#turn-detection), switch to external VAD:


```python
session = AgentSession(
    vad=silero.VAD.load(...),
    stt=inference.STT(model="deepgram/nova-2", language="zh-TW"),
    turn_handling=TurnHandlingOptions(turn_detection=MultilingualModel()),
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        realtime_input_config=RealtimeInputConfig(
            automatic_activity_detection=AutomaticActivityDetection(disabled=True),
        ),
    ),
)
```


Result — latency gets **worse**, not better. Logs show:


```text
WARN  commit_audio is not supported by Gemini Realtime API.        (×5)
ERROR failed to generate a reply: generate_reply timed out         (×4)
       waiting for generation_created event.
```


| Session | Turn gaps (s) | Avg | Max | 2nd-half ratio |
|---|---|---|---|---|
| Session D (ext VAD) | 6.2 → 9.4 → **14.7** → **26.4** → **23.8** → **38.1** | 19.8 | 38.1 | **2.9×** |

External VAD is worse (avg 19.8s vs native VAD 11–13s) because the `generate_reply timed out` errors add 5s+ of wasted time per turn on top of the audio token accumulation.
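For reference, the summary columns can be recomputed from the raw turn gaps. This sketch assumes "2nd-half ratio" means the average of the second half of the gaps divided by the average of the first half (the original report does not define it):

```python
# Session D's reported turn gaps, in seconds
gaps = [6.2, 9.4, 14.7, 26.4, 23.8, 38.1]

avg = sum(gaps) / len(gaps)
mx = max(gaps)
half = len(gaps) // 2
# assumed definition: avg(second half) / avg(first half)
ratio = (sum(gaps[half:]) / half) / (sum(gaps[:half]) / half)

print(round(avg, 1), mx, round(ratio, 1))  # 19.8 38.1 2.9
```

A ratio well above 1.0 indicates latency is escalating within the session rather than holding steady.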

Operating System

Linux (LiveKit Cloud deployment); also reproduced locally on Ubuntu 24.04.

Models Used

gemini-2.5-flash-native-audio-preview-12-2025, deepgram/nova-2

Package Versions

livekit-agents==1.5.2
livekit-plugins-google==1.5.2
livekit-plugins-silero==1.5.2
livekit-plugins-turn-detector==1.5.2

Session/Room/Call IDs

No response

Proposed Solution

Two independent fixes:

**Fix 1: Prevent `generate_reply()` conflict with activity-based flow (`realtime_api.py`)**

When Gemini has already received audio via the `activity_start/end` path and is processing it, `generate_reply()` should not inject a redundant `ActivityEnd` and `"."` placeholder turn. Options:
- Track whether an active audio turn is being processed; if so, skip the placeholder and wait for the natural `generation_created` event from the audio path
- Or skip `generate_reply()` entirely when `_in_user_activity` was recently `True`
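The first option could be sketched like this. `Session` is a stand-in, and `_active_audio_turn` is a hypothetical new flag (set while audio delivered via activity signals is being processed); neither is an existing attribute of the Google plugin.

```python
class Session:
    """Stand-in for the plugin's realtime session (sketch only)."""

    def __init__(self, active_audio_turn):
        self._active_audio_turn = active_audio_turn  # hypothetical new flag
        self._in_user_activity = True
        self.sent = []

    def generate_reply(self):
        if self._active_audio_turn:
            # Audio turn already in flight: skip the redundant ActivityEnd
            # and "." placeholder; wait for the natural generation_created.
            return "await-audio-generation"
        if self._in_user_activity:
            self.sent.append("ActivityEnd")
        self.sent.append('content(".")')
        return "await-placeholder-generation"


s = Session(active_audio_turn=True)
assert s.generate_reply() == "await-audio-generation"
assert s.sent == []  # no redundant signals injected
```

With the guard, the external-VAD path never races Gemini's own audio-turn generation, so the 5-second timeout is avoided.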

**Fix 2: Support text-input mode for `RealtimeModel` (`agent_activity.py`)**

Provide an opt-in to forward the STT transcript to the realtime model instead of discarding it. The current hardcoded `user_message = None` at L1832 makes a text-input workaround impossible without monkey-patching:


```python
# Current (L1832):
if isinstance(self.llm, llm.RealtimeModel):
    user_message = None  # ignore stt transcription for realtime model

# Proposed: respect a capability flag or configuration option
if isinstance(self.llm, llm.RealtimeModel):
    if not getattr(self.llm.capabilities, 'text_input_mode', False):
        user_message = None  # existing behavior for native audio mode
    # else: keep user_message → _realtime_reply_task forwards it via update_chat_ctx
```


This would let users opt in to text-input mode for realtime models, enabling the effective workaround for Gemini's audio token accumulation without requiring internal monkey-patches.

Additional Context

Verified workaround (requires monkey-patching):

By monkey-patching push_audio and start_user_activity to no-ops (preventing audio from reaching Gemini) and forwarding the STT transcript via on_user_turn_completed + generate_reply(user_input=text) + StopResponse, latency becomes completely stable:

| Session | Transport | Turn gaps (s) | Avg | Max | 2nd-half ratio |
|---|---|---|---|---|---|
| Session E | WebRTC | 1.9 → 1.9 → 2.2 → 2.0 → 2.3 → 1.9 → 1.9 | 2.0 | 2.3 | 1.0× |
| Session F | SIP | 17.2 → 4.1 → 5.3 → 2.2 → 6.1 → 2.1 → 1.8 | 5.6 | 17.2 | 0.3× |

Session E (WebRTC): 7 turns over 126s, all within 1.9–2.3s with zero escalation. No generate_reply timed out errors in either session.

Workaround code:

```python
class TextInputRealtimeModel(google.realtime.RealtimeModel):
    """Intercept audio push; use text-input mode to avoid audio token accumulation."""

    def session(self):
        sess = super().session()
        sess.push_audio = lambda frame: None
        sess.start_user_activity = lambda: None
        return sess


# In Agent.on_user_turn_completed:
async def on_user_turn_completed(self, turn_ctx, new_message) -> None:
    user_text = new_message.text_content
    if user_text:
        self.session.generate_reply(user_input=user_text)
    raise StopResponse()  # block SDK default path (which would send "." placeholder)
```

Background — Gemini audio token accumulation:

The Gemini Live API accumulates audio tokens in the session context (~25 tokens/sec). Over a multi-turn conversation, the growing context causes Gemini's per-turn processing time to increase linearly. This is a server-side Gemini behavior (tracked separately via Google Issue Tracker #493438050), but it creates a strong need for a text-input mode to keep latency stable. The SDK currently blocks this approach due to Bugs #1 and #2 above.
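A back-of-envelope sketch of that accumulation, using the ~25 tokens/sec figure from above (an approximation, not a documented constant):

```python
TOKENS_PER_SEC = 25  # approximate audio token rate retained in session context


def context_tokens(elapsed_seconds):
    """Rough audio-token count accumulated after `elapsed_seconds` of session audio."""
    return TOKENS_PER_SEC * elapsed_seconds


# e.g. after Session E's 126-second, 7-turn conversation:
print(context_tokens(126))  # 3150 tokens of audio context
```

Because this context grows monotonically, per-turn processing time grows with it; a text-input mode keeps the context (and thus latency) roughly constant per turn.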


Screenshots and Recordings

No response
