
Voice conversation (STT → LLM → TTS pipeline)

Advanced · 12 min · By Ana Brainiall

The anatomy of a voice conversation

Voice conversation with AI is a chain of 3 APIs:

```
[You speak] → Microphone → STT (Whisper) → text
                                             ↓
                                     LLM (Claude/GPT)
                                             ↓
[You hear] ← Speaker ← TTS (pf_dora) ← text
```

Each step adds latency. For the experience to feel natural (like a human conversation), the total needs to stay under 1.5 seconds. In 2026, this is achievable — but it requires careful engineering.

[Figure: flow diagram with three colored blocks — STT (blue), LLM (purple), TTS (green)]

Realistic latency in 2026

Measured during real conversations on Brainiall:

Total first-token-to-speech: 1150–2150ms. Acceptable if the model starts "speaking" early (streaming).

Streaming is everything

Without streaming, each step waits for the previous one to finish: 600ms + 900ms + 500ms = 2000ms minimum.

With streaming:
- STT can start transcribing while you're still speaking (VAD — Voice Activity Detection)
- LLM starts generating tokens before STT finishes (with some intent prediction)
- TTS starts narrating the first words while the LLM is still generating the last ones

Effective latency drops to 400–700ms. It feels natural.
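One way to realize the LLM → TTS overlap is to flush the token stream to the TTS engine at sentence boundaries instead of waiting for the full reply. A minimal sketch — the `SentenceChunker` class and its API are illustrative, not a Brainiall interface:

```javascript
// Accumulates streamed LLM tokens and emits complete sentences,
// so TTS can start narrating while the LLM is still generating.
class SentenceChunker {
  constructor(onSentence) {
    this.buffer = '';
    this.onSentence = onSentence; // called once per complete sentence
  }

  push(token) {
    this.buffer += token;
    // Flush when sentence-ending punctuation is followed by whitespace
    const match = this.buffer.match(/^(.*?[.!?])\s+(.*)$/s);
    if (match) {
      this.onSentence(match[1]);
      this.buffer = match[2];
    }
  }

  // Call when the LLM stream ends, to emit the trailing sentence
  flush() {
    if (this.buffer.trim()) this.onSentence(this.buffer.trim());
    this.buffer = '';
  }
}

// Usage: each emitted sentence would be sent straight to /api/tts
const sentences = [];
const chunker = new SentenceChunker((s) => sentences.push(s));
for (const tok of ['Hello', ' there.', ' How', ' are', ' you?']) chunker.push(tok);
chunker.flush();
// sentences: ['Hello there.', 'How are you?']
```

This keeps TTS one sentence behind the LLM rather than one full reply behind it, which is where most of the perceived latency win comes from.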

VAD: knowing when to stop listening

The subtlest challenge: detecting when you've stopped speaking. Stop too early and it cuts off your sentence. Stop too late and it adds 500ms of latency.

Techniques:
- Absolute silence for 600ms: simple, but doesn't handle natural thinking pauses
- Silero VAD: a neural model that detects end-of-speech with ~95% accuracy in <50ms
- Confidence from STT: Whisper returns a confidence score; if it drops, you've likely finished speaking
- Interruption detection: user starts speaking again → cancels the ongoing TTS and restarts the cycle

Brainiall uses Silero VAD combined with a dynamic silence threshold (adjusts based on ambient noise).
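The dynamic-threshold idea can be sketched as a simple energy-based detector — note this heuristic is only an illustration of the concept; Silero VAD itself is a neural model, and the class and parameters below are invented for this sketch:

```javascript
// Heuristic end-of-speech detector: tracks a rolling noise floor and
// declares "stopped speaking" after enough consecutive quiet frames.
class EnergyVad {
  constructor({ frameMs = 30, silenceMs = 600, ratio = 2.0 } = {}) {
    this.noiseFloor = 0.01;   // running estimate of ambient RMS
    this.quietFrames = 0;
    this.framesNeeded = Math.ceil(silenceMs / frameMs);
    this.ratio = ratio;       // speech must exceed noiseFloor * ratio
  }

  // frame: Float32Array of PCM samples in [-1, 1]
  // returns true once end-of-speech is detected
  push(frame) {
    let sum = 0;
    for (const s of frame) sum += s * s;
    const rms = Math.sqrt(sum / frame.length);

    if (rms > this.noiseFloor * this.ratio) {
      this.quietFrames = 0;   // speech frame: reset the silence counter
    } else {
      this.quietFrames++;     // quiet frame
      // Adapt the noise floor only during silence, so loud speech
      // doesn't drag the threshold upward (the "dynamic" part)
      this.noiseFloor = 0.95 * this.noiseFloor + 0.05 * rms;
    }
    return this.quietFrames >= this.framesNeeded;
  }
}
```

The adaptive noise floor is what lets the same detector work in a quiet office and on a noisy street: the threshold rides on top of whatever the ambient level currently is.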

Choosing a model: latency vs. quality

In voice mode, it's usually worth trading a bit of LLM quality for speed.

For conversations where quality matters more than latency (e.g., a detailed language tutor), step up to Claude Sonnet 4.6 or full GPT-5.
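That trade-off can be made explicit in a small routing function. The mode labels, the `fast-small-model` placeholder, and the token budgets below are illustrative assumptions, not a Brainiall config; only the Claude Sonnet 4.6 and GPT-5 names come from the text above:

```javascript
// Pick an LLM based on whether latency or quality matters more.
// Model id strings and budgets are placeholders for this sketch.
function pickModel(mode) {
  switch (mode) {
    case 'voice':          // conversational: speed over depth
      return { model: 'fast-small-model', maxTokens: 300 };
    case 'voice-tutor':    // detailed explanations: quality over speed
      return { model: 'claude-sonnet-4.6', maxTokens: 1000 };
    default:               // text chat: latency is less critical
      return { model: 'gpt-5', maxTokens: 2000 };
  }
}
```

Capping `maxTokens` in voice mode also matters for its own sake: a long answer means a long TTS narration, regardless of how fast the tokens were generated.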

Use cases where voice mode truly shines

Common pitfalls

[Figure: pitfalls diagram — four common situations with icons and their fixes]

Basic browser implementation

For quick experimentation:

```javascript
// 1. Capture microphone audio
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);

// 2. Emit a chunk every 500ms
mediaRecorder.ondataavailable = async (e) => {
  const formData = new FormData();
  formData.append('file', e.data, 'chunk.webm'); // e.data is a Blob
  const r = await fetch('/api/transcribe', { method: 'POST', body: formData });
  const { text } = await r.json();
  // 3. Send the text to the LLM and receive a response
  // 4. Send the response to /api/tts and play the result
};
mediaRecorder.start(500);
```
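The interruption handling described in the VAD section comes down to cancelling in-flight TTS playback when new speech is detected. A sketch of that control flow — the `Turn` and `ConversationLoop` classes are invented for this illustration:

```javascript
// Barge-in: when the user starts speaking again, cancel the TTS that
// is still playing and begin a new turn.
class Turn {
  constructor() {
    this.cancelled = false;
    this.onCancel = null; // wire to playback, e.g. () => audioEl.pause()
  }
  cancel() {
    if (this.cancelled) return;
    this.cancelled = true;
    if (this.onCancel) this.onCancel();
  }
}

class ConversationLoop {
  constructor() { this.current = null; }

  // Called by the VAD the moment new speech starts
  onSpeechStart() {
    if (this.current) this.current.cancel(); // stop any ongoing TTS
    this.current = new Turn();
    return this.current;
  }
}
```

Keeping cancellation in one place like this avoids the classic bug where the old reply keeps narrating over the new one.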


Try it right now

In the Brainiall chat, click the microphone icon and press-and-hold. Speak, release, and receive a response in both text and audio. The Pro plan at $29 includes full voice support; Business unlocks premium voices and priority latency.

Enjoyed this course?

Unlock 17 Pro courses + 40+ AIs in chat + video, music and full Studio generation.

Go Pro · $5.99/mo

Cancel anytime · No commitment