
How AI Transcription Actually Works: A Technical Deep Dive

Understand the technology behind modern speech recognition. From Mel spectrograms to Transformer architectures, learn how AI converts your voice to text.

Speakly Team · January 22, 2026 · 15 min read

Modern speech-to-text AI can transcribe your voice with near-human accuracy. But how does it actually work? In this deep dive, we'll explore the technology behind models like OpenAI Whisper, from audio processing to neural network architectures.

The Speech Recognition Pipeline

When you speak into a microphone, your voice goes through several transformation stages before becoming text:

  1. Audio Capture — Microphone converts sound waves to electrical signals
  2. Digital Sampling — Analog signals sampled at 16kHz (16,000 times per second)
  3. Feature Extraction — Audio converted to Mel spectrogram representation
  4. Neural Network Processing — Transformer model processes features
  5. Token Decoding — Output tokens converted to readable text

Step 1: Audio to Mel Spectrogram

Raw audio is just a sequence of amplitude values over time. To make it useful for machine learning, we convert it to a Mel spectrogram—a visual representation that shows which frequencies are present at each moment.

What is a Mel Spectrogram?
A Mel spectrogram uses the Mel scale, which approximates how humans perceive pitch. Low frequencies are spaced linearly while high frequencies are spaced logarithmically—matching how our ears work. This makes the representation more efficient for speech recognition.
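As a concrete illustration, the Hz-to-Mel mapping can be computed directly. This sketch uses the common O'Shaughnessy formula; the exact constants vary slightly between implementations:

```python
import math

def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to the Mel scale (O'Shaughnessy formula)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# Low frequencies map almost one-to-one; high frequencies get compressed.
print(hz_to_mel(100))    # ≈ 150 mel
print(hz_to_mel(1000))   # ≈ 1000 mel
print(hz_to_mel(8000))   # ≈ 2840 mel (an 8x jump in Hz, under 3x in mel)
```

Notice that 1000 Hz lands near 1000 mel while 8000 Hz lands under 3000 mel: equal steps on the Mel axis correspond to roughly equal perceived pitch differences.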

According to the Whisper paper, the model uses:

  • 16kHz sampling rate — Standard for speech (fundamental frequency 85-255 Hz; useful harmonics up to ~8kHz, the Nyquist limit at 16kHz sampling)
  • 80 Mel channels — Frequency resolution across the audible spectrum
  • 25ms windows — Each frame captures 25 milliseconds of audio
  • 10ms stride — Windows overlap by 15ms for smooth transitions

The result is a 2D image-like representation: time on the x-axis, frequency on the y-axis, and intensity as brightness. This is what the neural network actually "sees."
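A quick back-of-the-envelope check of those numbers: Whisper pads or trims audio to 30-second chunks, so with a 10ms stride the spectrogram the network sees always has the same fixed shape:

```python
SAMPLE_RATE = 16_000       # 16kHz sampling
WINDOW_MS, STRIDE_MS = 25, 10
CHUNK_S = 30               # Whisper pads/trims audio to 30-second chunks
N_MELS = 80                # 80 Mel channels

n_samples = SAMPLE_RATE * CHUNK_S            # 480,000 samples per chunk
win = SAMPLE_RATE * WINDOW_MS // 1000        # 400 samples per 25ms window
hop = SAMPLE_RATE * STRIDE_MS // 1000        # 160 samples per 10ms stride
n_frames = n_samples // hop                  # 3,000 time steps

print((N_MELS, n_frames))  # spectrogram shape: (80, 3000)
```

So every 30 seconds of speech becomes an 80 x 3000 "image" before the neural network ever sees it.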

Step 2: The Encoder-Decoder Transformer Architecture

Whisper uses the Transformer architecture, the same foundation behind GPT and other modern AI. It consists of two main components:

The Encoder

The encoder processes the Mel spectrogram and creates a rich internal representation of the audio. It uses self-attention to understand relationships between different parts of the audio—crucial for handling accents, background noise, and context.

  • Two convolutional layers first downsample the spectrogram
  • Sinusoidal positional embeddings add timing information
  • Multiple Transformer blocks apply self-attention
  • Output: A sequence of hidden states representing the audio
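The sinusoidal positional embeddings mentioned above can be generated in a few lines of NumPy. This sketch uses the 1500 encoder positions (the 3000 spectrogram frames halved by the strided convolution) and the 384-dimensional width of the tiny model; larger models use wider embeddings:

```python
import numpy as np

def sinusoidal_positions(n_positions: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional embeddings in the style of 'Attention Is All You Need'."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    dim = np.arange(d_model // 2)[None, :]         # (1, d_model/2)
    angle = pos / (10000 ** (2 * dim / d_model))   # one frequency per channel pair
    # First half sine, second half cosine, concatenated along the feature axis.
    return np.concatenate([np.sin(angle), np.cos(angle)], axis=-1)

pe = sinusoidal_positions(1500, 384)
print(pe.shape)  # (1500, 384)
```

Because these embeddings are fixed functions of position rather than learned weights, the encoder gets timing information for free, with a distinct pattern at every frame.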

The Decoder

The decoder generates text tokens one at a time, using cross-attention to focus on relevant parts of the encoded audio. It's similar to how GPT generates text, but conditioned on audio instead of previous text.

  • Learned positional embeddings track output position
  • Cross-attention attends to encoder output
  • Self-attention maintains coherence in generated text
  • Output: Probability distribution over possible next tokens
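The decoding loop itself is simple. Here is a toy greedy decoder where a stand-in scoring function replaces the real neural network; the token ids and the scoring schedule are invented purely for illustration:

```python
import math

END = 0  # hypothetical end-of-text token id

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def toy_logits(prefix):
    # Stand-in for the decoder: real logits come from self- and cross-attention.
    # This fake model just wants to emit tokens 3, 2, 1, then stop.
    schedule = [3, 2, 1, END]
    target = schedule[min(len(prefix), len(schedule) - 1)]
    return [5.0 if tok == target else 0.0 for tok in range(4)]

def greedy_decode(max_len=10):
    tokens = []
    for _ in range(max_len):
        probs = softmax(toy_logits(tokens))
        next_tok = probs.index(max(probs))  # greedy: always take the top token
        if next_tok == END:
            break
        tokens.append(next_tok)
    return tokens

print(greedy_decode())  # [3, 2, 1]
```

Real decoders refine this loop with beam search, temperature sampling, and repetition penalties, but the core structure (predict, pick, append, repeat) is the same.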

Step 3: Tokenization and Decoding

Whisper uses Byte Pair Encoding (BPE) tokenization, the same as GPT-2. Instead of predicting individual characters or whole words, it predicts subword units:

  • Common words become single tokens ("the" → [1169])
  • Rare words split into subwords ("transcription" → [trans] [cript] [ion])
  • Special tokens handle tasks (<|transcribe|>, <|translate|>, <|en|>)
  • Timestamp tokens enable word-level timing (<|0.00|>, <|2.50|>)
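To make BPE concrete, here is a toy tokenizer that starts from single characters and repeatedly applies the highest-priority learned merge. The merge table here is invented; real vocabularies like GPT-2's contain tens of thousands of learned merges:

```python
def bpe_tokenize(word, merges):
    """Toy BPE: start from characters, repeatedly merge the best-ranked pair."""
    tokens = list(word)
    while True:
        pairs = [(merges.get((a, b), float("inf")), i)
                 for i, (a, b) in enumerate(zip(tokens, tokens[1:]))]
        if not pairs:
            break
        rank, i = min(pairs)          # lowest rank = highest-priority merge
        if rank == float("inf"):
            break                     # no learnable merge applies; we're done
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens

# Hypothetical merge table (lower rank = learned earlier, applied first)
merges = {("t", "h"): 0, ("th", "e"): 1, ("c", "a"): 2, ("ca", "t"): 3}
print(bpe_tokenize("the", merges))     # ['the']    — common word, one token
print(bpe_tokenize("thecat", merges))  # ['the', 'cat'] — splits into subwords
```

The key property: frequent strings collapse into single tokens, while anything unseen still decomposes gracefully into smaller known pieces, so the vocabulary never hits an "unknown word".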

Model Sizes and Trade-offs

Whisper comes in multiple sizes, each balancing accuracy against speed and memory:

  • tiny — 39M parameters, ~1 GB VRAM, fastest
  • base — 74M parameters, ~1 GB VRAM
  • small — 244M parameters, ~2 GB VRAM
  • medium — 769M parameters, ~5 GB VRAM
  • large-v3 — 1.55B parameters, ~10 GB VRAM, most accurate
  • large-v3-turbo — 809M parameters, ~6 GB VRAM, near-large accuracy at several times the speed

Which Model Should You Use?
For real-time dictation, large-v3-turbo offers the best balance. It achieves near-large-v3 accuracy at 4x the speed. For offline batch processing where time doesn't matter, large-v3 provides the highest accuracy.

Training: The Secret Sauce

What makes Whisper special isn't the architecture—it's the training data. OpenAI trained it on 680,000 hours of labeled audio from the internet, covering:

  • 99 languages — From English to Welsh to Yoruba
  • Multiple accents — British, American, Indian English, etc.
  • Various audio quality — Podcasts, phone calls, meetings
  • Different domains — Technical, medical, legal, casual speech

The large-v3 model was further trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio using earlier Whisper versions. This massive scale is why it handles real-world audio so well.

Multi-Task Learning

Whisper isn't just a transcription model—it's trained on multiple tasks simultaneously:

  • Transcription — Convert speech to text in the same language
  • Translation — Convert non-English speech to English text
  • Language Detection — Identify the spoken language
  • Voice Activity Detection — Identify when speech is present
  • Timestamp Prediction — Align words to precise times

The model selects tasks via special tokens. For example, <|en|><|transcribe|> tells it to transcribe English, while <|es|><|translate|> tells it to translate Spanish to English.
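Assembling that control sequence is straightforward. This sketch builds the decoder's token prefix using the token names from the Whisper paper; the helper function itself is hypothetical, not part of any library:

```python
def build_prompt(language: str, task: str, timestamps: bool = True):
    """Assemble Whisper's control-token prefix for a given language and task."""
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # suppress timestamp token output
    return tokens

print(build_prompt("en", "transcribe", timestamps=False))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>', '<|notimestamps|>']
print(build_prompt("es", "translate"))
# ['<|startoftranscript|>', '<|es|>', '<|translate|>']
```

The decoder then generates text conditioned on this prefix, which is how one model serves as transcriber, translator, and language detector at once.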

Local vs Cloud: What's Different?

When you use local transcription (like Speakly's default mode), the entire model runs on your computer:

  • GPU acceleration — Apple Silicon uses Metal, NVIDIA uses CUDA
  • No network latency — Audio never leaves your device
  • Privacy — Your voice data stays completely local
  • Offline capable — Works without internet connection

Cloud APIs (OpenAI, Groq, Deepgram) run the same or similar models on powerful remote servers. The trade-off: cloud servers offer more compute and often faster turnaround, but you add network latency, pay per request, and send your audio off-device.

Beyond Whisper: Other Approaches

While Whisper dominates the open-source space, commercial providers use different approaches:

Deepgram Nova-2

Deepgram trains custom models from scratch on domain-specific data. They offer specialized variants for meetings, phone calls, and medical transcription. Their architecture is proprietary but optimized for streaming real-time use.

Google Cloud Speech-to-Text

Google's Chirp model uses their Universal Speech Model (USM), trained on 12 million hours across 300+ languages. It excels at low-resource languages where Whisper struggles.

ElevenLabs Scribe

ElevenLabs Scribe claims 96.7% accuracy for English—among the highest reported. It includes built-in speaker diarization (identifying who said what) and audio event detection.

The Future: Where Is This Going?

Speech recognition is rapidly improving. Key trends to watch:

  • Real-time streaming — Lower latency for instant transcription
  • Multimodal models — GPT-4o can process audio directly without separate transcription
  • On-device AI — Apple's Neural Engine, Qualcomm's AI accelerators
  • Personalization — Models that adapt to your voice and vocabulary
  • Context awareness — Understanding what application you're using


Try AI Transcription Locally

Experience Whisper running entirely on your device. No cloud required, complete privacy. Try Speakly free for 7 days.

Download Now
#whisper #AI #machine-learning #speech-recognition #technical