
[google] RealtimeModel + external VAD: generate_reply() conflicts with activity-based audio flow, STT transcript discarded #5408

@Robert-Kung

Description


Bug Description

When using google.realtime.RealtimeModel for multi-turn voice conversations, I observed response latency escalating from ~1s to 20–50s as conversations progress. Attempting to mitigate this by switching to external VAD (Silero + automatic_activity_detection.disabled=True) made things worse — introducing repeated generate_reply timed out errors on top of the existing latency escalation.

Investigation revealed two independent SDK-side issues:

Bug 1 — generate_reply() conflicts with the activity-based audio flow:

When external VAD is used (automatic_activity_detection.disabled=True), the SDK calls generate_reply() after Gemini has already begun processing audio delivered via activity_start/end signals. This injects a redundant ActivityEnd + a "." placeholder turn, causing a state conflict and a 5-second timeout.

The signal flow conflict:

```text
1. User speaks → Silero VAD → activity_start → push_audio → activity_end
   → Gemini receives audio via activity signals, BEGINS PROCESSING

2. STT + EOU completes → on_user_turn_completed → SDK default path:
   → commit_audio()         ← no-op in Google plugin (L1380), just logs warning
   → _generate_reply()      ← user_message=None (see Bug 2)
   → _realtime_reply_task() → rt_session.generate_reply():
     a. Sends ActivityEnd    ← REDUNDANT, already sent in step 1
     b. Sends send_client_content(text=".", role="user", turn_complete=True)
        ← CONFLICTS with step 1: Gemini is mid-processing the audio turn
     c. Waits 5s for generation_created event ← NEVER FIRES → timeout
```
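The conflict can be modeled with a toy state machine. This is a sketch, not SDK code: `FakeGeminiSession` is a made-up class that only encodes the assumption (from the observed behavior) that a mid-turn `client_content` with `turn_complete=True` does not start a second generation while an audio turn delivered via activity signals is already being processed.

```python
class FakeGeminiSession:
    """Toy model of the server-side behavior described above (an assumption,
    not the real Gemini Live API)."""

    def __init__(self):
        self.processing_audio_turn = False
        self.events = []

    def activity_start(self):
        self.processing_audio_turn = True

    def activity_end(self):
        # Gemini begins generating from the buffered audio turn
        self.events.append("generation_created(audio)")

    def send_client_content(self, text, turn_complete=True):
        if self.processing_audio_turn:
            return  # conflicts with the in-flight audio turn: ignored
        self.events.append("generation_created(text)")


sess = FakeGeminiSession()
sess.activity_start()          # step 1: external VAD path
sess.activity_end()
sess.send_client_content(".")  # step 2: SDK default path injects placeholder
# The SDK now waits 5 s for a *second* generation_created that never arrives.
assert sess.events == ["generation_created(audio)"]
```

Only one `generation_created` ever fires, which is why the SDK's 5-second wait for its own placeholder-triggered event always times out.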

Code references (v1.5.2):

  • realtime_api.py L730–733: generate_reply() sends ActivityEnd when _in_user_activity
  • realtime_api.py L743–745: sends send_client_content(text=".", turn_complete=True)
  • realtime_api.py L748–753: 5-second timeout → generate_reply timed out error

Bug 2 — STT transcript unconditionally discarded:

After on_user_turn_completed returns, the SDK hardcodes:

```python
# agent_activity.py L1832
if isinstance(self.llm, llm.RealtimeModel):
    user_message = None  # type: ignore
```

This discards the STT transcript unconditionally. _generate_reply() receives user_message=None, and _realtime_reply_task passes user_input=None — the full STT transcript is never forwarded to Gemini.

This means there is no supported way to use external STT text as input to a RealtimeModel, which blocks the most effective workaround for Gemini's audio token accumulation problem (server-side issue tracked separately via Google Issue Tracker #493438050).

Expected Behavior

  1. When external VAD is used with RealtimeModel, generate_reply() should not conflict with Gemini's activity-based audio processing — no redundant ActivityEnd or "." placeholder should be injected while Gemini is mid-processing an audio turn.

  2. When external STT is configured alongside a RealtimeModel, the SDK should provide an option to forward the STT transcript to the realtime model instead of unconditionally discarding it.

Reproduction Steps

**Step 1 — Observe latency escalation with native Gemini VAD:**


```python
session = AgentSession(
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        # automatic_activity_detection defaults to True (native Gemini VAD)
    ),
)
```


Conduct a 5+ turn voice conversation. Response latency escalates (measured from OpenTelemetry traces: `user_speaking` END → `agent_speaking` START):

| Session | Turn gaps (s) | Avg | Max |
|---|---|---|---|
| Session A | 13.3 → 1.2 → 6.4 → **22.8** → **30.6** → 5.2 → 2.8 | 11.8 | 30.6 |
| Session B | 7.5 → 1.1 → **22.6** → 1.1 → **26.5** | 11.7 | 26.5 |
| Session C | 2.6 → 12.5 → 1.0 → 17.9 → 1.2 → 16.8 → **49.8** → 1.1 | 12.9 | 49.8 |

**Step 2 — Switch to external VAD → `generate_reply timed out`:**

Following the [Gemini turn detection docs](https://docs.livekit.io/agents/models/realtime/plugins/gemini/#turn-detection), switch to external VAD:


```python
session = AgentSession(
    vad=silero.VAD.load(...),
    stt=inference.STT(model="deepgram/nova-2", language="zh-TW"),
    turn_handling=TurnHandlingOptions(turn_detection=MultilingualModel()),
    llm=google.realtime.RealtimeModel(
        model="gemini-2.5-flash-native-audio-preview-12-2025",
        realtime_input_config=RealtimeInputConfig(
            automatic_activity_detection=AutomaticActivityDetection(disabled=True),
        ),
    ),
)
```


Result — latency gets **worse**, not better. Logs show:


```text
WARN  commit_audio is not supported by Gemini Realtime API.        (×5)
ERROR failed to generate a reply: generate_reply timed out         (×4)
       waiting for generation_created event.
```


| Session | Turn gaps (s) | Avg | Max | 2nd-half ratio |
|---|---|---|---|---|
| Session D (ext VAD) | 6.2 → 9.4 → **14.7** → **26.4** → **23.8** → **38.1** | 19.8 | 38.1 | **2.9×** |

External VAD is worse (avg 19.8s vs native VAD 11–13s) because the `generate_reply timed out` errors add 5s+ of wasted time per turn on top of the audio token accumulation.
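For reference, the summary columns can be recomputed from the raw turn gaps. This sketch assumes "2nd-half ratio" means the average of the second half of the gaps divided by the average of the first half (the original report does not define it):

```python
# Session D's reported turn gaps, in seconds
gaps = [6.2, 9.4, 14.7, 26.4, 23.8, 38.1]

avg = sum(gaps) / len(gaps)
mx = max(gaps)
half = len(gaps) // 2
# assumed definition: avg(second half) / avg(first half)
ratio = (sum(gaps[half:]) / half) / (sum(gaps[:half]) / half)

print(round(avg, 1), mx, round(ratio, 1))  # 19.8 38.1 2.9
```

A ratio well above 1.0 indicates latency is escalating within the session rather than holding steady.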

Operating System

Linux (LiveKit Cloud deployment); also reproduced locally on Ubuntu 24.04.

Models Used

gemini-2.5-flash-native-audio-preview-12-2025, deepgram/nova-2

Package Versions

livekit-agents==1.5.2
livekit-plugins-google==1.5.2
livekit-plugins-silero==1.5.2
livekit-plugins-turn-detector==1.5.2

Session/Room/Call IDs

No response

Proposed Solution

Two independent fixes:

**Fix 1: Prevent `generate_reply()` conflict with activity-based flow (`realtime_api.py`)**

When Gemini has already received audio via the `activity_start/end` path and is processing it, `generate_reply()` should not inject a redundant `ActivityEnd` and `"."` placeholder turn. Options:
- Track whether an active audio turn is being processed; if so, skip the placeholder and wait for the natural `generation_created` event from the audio path
- Or skip `generate_reply()` entirely when `_in_user_activity` was recently `True`
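The first option could be sketched like this. `Session` is a stand-in, and `_active_audio_turn` is a hypothetical new flag (set while audio delivered via activity signals is being processed); neither is an existing attribute of the Google plugin.

```python
class Session:
    """Stand-in for the plugin's realtime session (sketch only)."""

    def __init__(self, active_audio_turn):
        self._active_audio_turn = active_audio_turn  # hypothetical new flag
        self._in_user_activity = True
        self.sent = []

    def generate_reply(self):
        if self._active_audio_turn:
            # Audio turn already in flight: skip the redundant ActivityEnd
            # and "." placeholder; wait for the natural generation_created.
            return "await-audio-generation"
        if self._in_user_activity:
            self.sent.append("ActivityEnd")
        self.sent.append('content(".")')
        return "await-placeholder-generation"


s = Session(active_audio_turn=True)
assert s.generate_reply() == "await-audio-generation"
assert s.sent == []  # no redundant signals injected
```

With the guard, the external-VAD path never races Gemini's own audio-turn generation, so the 5-second timeout is avoided.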

**Fix 2: Support text-input mode for `RealtimeModel` (`agent_activity.py`)**

Provide an opt-in to forward the STT transcript to the realtime model instead of discarding it. The current hardcoded `user_message = None` at L1832 makes a text-input workaround impossible without monkey-patching:


```python
# Current (L1832):
if isinstance(self.llm, llm.RealtimeModel):
    user_message = None  # ignore stt transcription for realtime model

# Proposed: respect a capability flag or configuration option
if isinstance(self.llm, llm.RealtimeModel):
    if not getattr(self.llm.capabilities, 'text_input_mode', False):
        user_message = None  # existing behavior for native audio mode
    # else: keep user_message → _realtime_reply_task forwards it via update_chat_ctx
```


This would let users opt in to text-input mode for realtime models, enabling the effective workaround for Gemini's audio token accumulation without requiring internal monkey-patches.

Additional Context

Verified workaround (requires monkey-patching):

By monkey-patching push_audio and start_user_activity to no-ops (preventing audio from reaching Gemini) and forwarding the STT transcript via on_user_turn_completed + generate_reply(user_input=text) + StopResponse, latency becomes completely stable:

| Session | Transport | Turn gaps (s) | Avg | Max | 2nd-half ratio |
|---|---|---|---|---|---|
| Session E | WebRTC | 1.9 → 1.9 → 2.2 → 2.0 → 2.3 → 1.9 → 1.9 | 2.0 | 2.3 | 1.0× |
| Session F | SIP | 17.2 → 4.1 → 5.3 → 2.2 → 6.1 → 2.1 → 1.8 | 5.6 | 17.2 | 0.3× |

Session E (WebRTC): 7 turns over 126s, all within 1.9–2.3s with zero escalation. No generate_reply timed out errors in either session.

Workaround code:

```python
class TextInputRealtimeModel(google.realtime.RealtimeModel):
    """Intercept audio push; use text-input mode to avoid audio token accumulation."""

    def session(self):
        sess = super().session()
        sess.push_audio = lambda frame: None
        sess.start_user_activity = lambda: None
        return sess


# In Agent.on_user_turn_completed:
async def on_user_turn_completed(self, turn_ctx, new_message) -> None:
    user_text = new_message.text_content
    if user_text:
        self.session.generate_reply(user_input=user_text)
    raise StopResponse()  # block SDK default path (which would send "." placeholder)
```

Background — Gemini audio token accumulation:

The Gemini Live API accumulates audio tokens in the session context (~25 tokens/sec). Over a multi-turn conversation, the growing context causes Gemini's per-turn processing time to increase linearly. This is a server-side Gemini behavior (tracked separately via Google Issue Tracker #493438050), but it creates a strong need for a text-input mode to keep latency stable. The SDK currently blocks this approach due to Bugs #1 and #2 above.
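A back-of-envelope sketch of that accumulation, using the ~25 tokens/sec figure from above (an approximation, not a documented constant):

```python
TOKENS_PER_SEC = 25  # approximate audio token rate retained in session context


def context_tokens(elapsed_seconds):
    """Rough audio-token count accumulated after `elapsed_seconds` of session audio."""
    return TOKENS_PER_SEC * elapsed_seconds


# e.g. after Session E's 126-second, 7-turn conversation:
print(context_tokens(126))  # 3150 tokens of audio context
```

Because this context grows monotonically, per-turn processing time grows with it; a text-input mode keeps the context (and thus latency) roughly constant per turn.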


Screenshots and Recordings

No response
