Ana Brainiall

Transcribe Hours of Audio/Video in Any Language with Precision

Beginner · 8 min · By Ana Brainiall

Why Whisper Became the STT Standard

Whisper, released as open source by OpenAI in 2022, changed the game for Speech-to-Text. It was trained on 680,000 hours of transcribed multilingual audio, roughly 10x more data than any previous model. That gave it three advantages competitors still haven't matched:

1. Robust multilingual support: excellent across 99 languages, including PT-BR, PT-PT, and regional dialects
2. Noise tolerance: works on audio with background music, street noise, and overlapping conversations
3. Automatic punctuation: decides on its own where to place commas, periods, and paragraph breaks — no editing needed

At Brainiall we use Whisper Large v3 (the largest, most accurate), running on a dedicated GPU for latency under 15s on clips up to 10 minutes long.

[Figure: bar chart comparing transcription accuracy (Word Error Rate) in PT-BR, Whisper Large]

How the Model "Listens"

Whisper converts audio into Mel spectrograms — a visual representation of frequency vs. time. The model is a Transformer encoder-decoder that treats the spectrogram as input and generates text as output, very similar to how translation models work.
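As a toy illustration of that frequency-vs-time picture, here is a minimal pure-Python sketch of a mel spectrogram (a real pipeline uses an FFT and proper triangular mel filters; the 400-sample frame and 160-sample hop mirror the common 25 ms / 10 ms windowing at 16 kHz, but everything else here is simplified):

```python
import cmath
import math

def hz_to_mel(f):
    # Mel scale: compresses high frequencies, roughly matching human hearing
    return 2595.0 * math.log10(1.0 + f / 700.0)

def dft_magnitudes(frame):
    # Naive DFT magnitude spectrum (educational; real code uses an FFT)
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n)))
            for k in range(n // 2)]

def toy_mel_spectrogram(signal, sr=16000, frame_len=400, hop=160, n_mels=8):
    # Slice the waveform into overlapping frames (the time axis), take each
    # frame's spectrum, then pool the bins on the mel scale (the frequency axis).
    mel_max = hz_to_mel(sr / 2)
    spect = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        mags = dft_magnitudes(signal[start:start + frame_len])
        bins = [0.0] * n_mels
        for k, mag in enumerate(mags):
            freq = k * sr / frame_len
            b = min(int(hz_to_mel(freq) / mel_max * n_mels), n_mels - 1)
            bins[b] += mag
        spect.append([math.log10(1e-10 + e) for e in bins])
    return spect  # rows = frames (time), columns = mel bands (frequency)

# 0.2 s of a 440 Hz tone sampled at 16 kHz
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(3200)]
S = toy_mel_spectrogram(tone)
```

The resulting grid of log-energies is what the Transformer encoder actually "sees": a pure tone lights up a single mel band across every frame, while speech paints shifting patterns over time.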

The real magic is that Whisper was trained on a simultaneous multi-objective task:
- Transcribe in the same language (STT)
- Translate to English (STT + translation)
- Detect the spoken language automatically, with no hint given
- Segment with timestamps

This means a single model handles transcription, translation, language identification, and timestamping: tasks that previously required separate models.
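That flexibility is visible in the open-source `whisper` command line, where the same model switches jobs via a flag. A small sketch of building those invocations (the helper function is my own, not part of any SDK):

```python
def whisper_cmd(path, task="transcribe", language=None):
    """Build an argument list for the open-source `whisper` CLI."""
    cmd = ["whisper", path, "--task", task]
    if language is not None:
        # Omit --language entirely and the model auto-detects it
        cmd += ["--language", language]
    return cmd

# Same model, three jobs:
stt = whisper_cmd("interview.mp3", language="pt")     # plain transcription
xlt = whisper_cmd("interview.mp3", task="translate")  # translate to English
auto = whisper_cmd("mystery.mp3")                     # language auto-detection
```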

Supported Formats and Practical Limits

Brainiall accepts:
- Formats: mp3, mp4, wav, ogg, webm, m4a, flac, mpeg
- Maximum size: 25 MB per file
- Recommended duration: up to 10 minutes per request — for longer audio, split it up
- Sample rate: any — will be resampled to 16kHz internally
- Channels: mono or stereo — both work (stereo is converted to mono)
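Those limits are easy to pre-check before wasting an upload. A minimal client-side sketch (the constants mirror the list above; the helper is illustrative, not an official SDK):

```python
import os

ALLOWED_EXTS = {".mp3", ".mp4", ".wav", ".ogg", ".webm", ".m4a", ".flac", ".mpeg"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB cap per file

def check_upload(path, size_bytes):
    """Return a list of problems; an empty list means the file should pass."""
    problems = []
    ext = os.path.splitext(path)[1].lower()
    if ext not in ALLOWED_EXTS:
        problems.append(f"unsupported format: {ext or 'no extension'}")
    if size_bytes > MAX_BYTES:
        problems.append(f"file is {size_bytes / 1e6:.1f} MB, above the 25 MB limit")
    return problems
```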

To transcribe a 1-hour podcast, split it into 10-minute chunks using ffmpeg and concatenate the transcriptions afterward.
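One way to script that split-and-stitch workflow is with ffmpeg's segment muxer, which cuts audio into fixed-length chunks without re-encoding. A sketch (the chunk naming scheme is my own choice; the ffmpeg command is shown but not executed here):

```python
import glob
import subprocess

def split_cmd(src, chunk_seconds=600):
    # Segment muxer: fixed-length chunks, stream-copied (-c copy, no re-encode),
    # named chunk_000.mp3, chunk_001.mp3, ...
    return ["ffmpeg", "-i", src,
            "-f", "segment", "-segment_time", str(chunk_seconds),
            "-c", "copy", "chunk_%03d.mp3"]

def stitch_transcripts(pattern="chunk_*.txt"):
    # Concatenate per-chunk transcripts in filename (i.e., chronological) order
    parts = []
    for name in sorted(glob.glob(pattern)):
        with open(name, encoding="utf-8") as f:
            parts.append(f.read().strip())
    return "\n\n".join(parts)

# subprocess.run(split_cmd("podcast.mp3"), check=True)  # requires ffmpeg installed
```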

Quality by Audio Type

Excellent (>97% accuracy):
- Podcasts recorded with a dedicated microphone
- Corporate interviews in a quiet room
- Editorial video narration
- Speech on Zoom/Meet video calls

Good (90–95% accuracy):
- Meeting recordings via laptop
- Classes recorded on a smartphone
- Vlogs filmed in a calm outdoor environment

Challenging (<85% accuracy):
- Sung music (Whisper tries, but frequently gets lyrics wrong)
- Audio with multiple people speaking simultaneously
- Compressed phone calls (8kHz)
- Highly specific regionalisms and slang

[Figure: four-quadrant matrix with examples of each accuracy tier and its cause]

Prompt Tips

Whisper accepts an initial_prompt — a string that guides the transcription. Use it to spell out proper nouns, acronyms, and domain-specific jargon the audio is likely to contain, so the model reproduces them correctly instead of guessing.

This can boost accuracy by 3–5 percentage points on challenging audio.
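For example, a glossary of tricky names can be folded into that string before calling the model. A sketch using the open-source `whisper` package's `transcribe` method (the glossary content is invented for illustration):

```python
def build_initial_prompt(terms):
    # Whisper treats the prompt as preceding context, so a natural-sounding
    # sentence that simply mentions the tricky terms is enough.
    return "Glossary of terms that may appear: " + ", ".join(terms) + "."

prompt = build_initial_prompt(["Brainiall", "Whisper Large v3", "PT-BR"])

# With the open-source package (not run here):
# import whisper
# model = whisper.load_model("large-v3")
# result = model.transcribe("meeting.mp3", initial_prompt=prompt)
```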

Practical Use Cases

Try It Right Now

In the Brainiall chat, click the paperclip attachment icon, send an MP3 or MP4, and ask "transcribe this audio". Or use the API at the /api/transcribe route. The Pro plan at $5.99 includes generous usage; the Business plan includes API credits for external automation.
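A minimal sketch of calling that route over HTTP (the base host, auth header, and raw-body upload are all assumptions; check the API documentation for the real contract, which likely expects multipart form data):

```python
import urllib.request

API_URL = "https://example.com/api/transcribe"  # base host is hypothetical

def build_transcribe_request(audio_path, api_key):
    # Simplification: send the file as a raw request body rather than
    # multipart/form-data, just to show the shape of the call.
    with open(audio_path, "rb") as f:
        body = f.read()
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "audio/mpeg",
        },
    )

# resp = urllib.request.urlopen(build_transcribe_request("clip.mp3", KEY))
```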

Enjoyed this course?

Unlock 17 Pro courses + 40+ AIs in chat + video, music and full Studio generation.

Go Pro · $5.99/mo

Cancel anytime · No commitment