How AI Transcription Actually Works: A Technical Deep Dive
Understand the technology behind modern speech recognition. From Mel spectrograms to Transformer architectures, learn how AI converts your voice to text.

Modern speech-to-text AI can transcribe your voice with near-human accuracy. But how does it actually work? In this deep dive, we'll explore the technology behind models like OpenAI Whisper, from audio processing to neural network architectures.
The Speech Recognition Pipeline
When you speak into a microphone, your voice goes through several transformation stages before becoming text:
- Audio Capture — Microphone converts sound waves to electrical signals
- Digital Sampling — Analog signals sampled at 16kHz (16,000 times per second)
- Feature Extraction — Audio converted to Mel spectrogram representation
- Neural Network Processing — Transformer model processes features
- Token Decoding — Output tokens converted to readable text
Step 1: Audio to Mel Spectrogram
Raw audio is just a sequence of amplitude values over time. To make it useful for machine learning, we convert it to a Mel spectrogram—a visual representation that shows which frequencies are present at each moment.
According to the Whisper paper, the model uses:
- 16kHz sampling rate — Standard for speech (human speech is 85-255 Hz fundamental, harmonics to ~8kHz)
- 80 Mel channels — Frequency resolution across the audible spectrum
- 25ms windows — Each frame captures 25 milliseconds of audio
- 10ms stride — Windows overlap by 15ms for smooth transitions
The result is a 2D image-like representation: time on the x-axis, frequency on the y-axis, and intensity as brightness. This is what the neural network actually "sees."
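As a rough sketch, the conversion can be done in plain NumPy. This is a simplified triangular-filter filterbank for illustration, not Whisper's exact implementation (Whisper pads audio to 30 seconds and applies its own normalization), but the window, stride, and channel counts match the numbers above:

```python
import numpy as np

SR = 16_000   # 16 kHz sampling rate
N_FFT = 400   # 25 ms window (0.025 * 16000 samples)
HOP = 160     # 10 ms stride (0.010 * 16000 samples)
N_MELS = 80   # 80 Mel channels

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=N_MELS, n_fft=N_FFT, sr=SR):
    # Triangular filters spaced evenly on the Mel scale, 0 Hz to Nyquist
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(1, n_mels + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        if c > l:
            fb[i - 1, l:c] = (np.arange(l, c) - l) / (c - l)  # rising edge
        if r > c:
            fb[i - 1, c:r] = (r - np.arange(c, r)) / (r - c)  # falling edge
    return fb

def log_mel_spectrogram(audio):
    # Slice audio into overlapping 25 ms windows, 10 ms apart
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(audio) - N_FFT) // HOP
    frames = np.stack([audio[i * HOP : i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2  # (frames, 201) power spectrum
    mel = mel_filterbank() @ power.T                  # project onto 80 Mel channels
    return np.log10(np.maximum(mel, 1e-10))           # log compression, floored

# One second of a 440 Hz tone -> 80 Mel channels x ~98 frames
t = np.arange(SR) / SR
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(spec.shape)  # (80, 98)
```

One second of audio becomes just 98 frames of 80 values each, which is why this representation is so much more tractable for a neural network than the raw 16,000 samples.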
Step 2: The Encoder-Decoder Transformer Architecture
Whisper uses the Transformer architecture, the same foundation behind GPT and other modern AI. It consists of two main components:
The Encoder
The encoder processes the Mel spectrogram and creates a rich internal representation of the audio. It uses self-attention to understand relationships between different parts of the audio—crucial for handling accents, background noise, and context.
- Two convolutional layers first downsample the spectrogram
- Sinusoidal positional embeddings add timing information
- Multiple Transformer blocks apply self-attention
- Output: A sequence of hidden states representing the audio
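The sinusoidal positional embeddings fit in a few lines. The shape below assumes Whisper tiny's encoder (1500 frames after the 2x convolutional downsampling of a 30-second input, model width 384); larger model sizes use wider embeddings:

```python
import numpy as np

def sinusoidal_embeddings(length, dim):
    # Each position gets a unique pattern of sin/cos values at
    # geometrically spaced frequencies, so the model can tell
    # "early in the clip" from "late in the clip".
    pos = np.arange(length)[:, None]          # (length, 1)
    i = np.arange(dim // 2)[None, :]          # (1, dim/2)
    angles = pos / (10000 ** (2 * i / dim))   # (length, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

emb = sinusoidal_embeddings(1500, 384)
print(emb.shape)  # (1500, 384)
```

These embeddings are simply added to the downsampled spectrogram features before the Transformer blocks; because they are fixed rather than learned, they generalize to any position within the 30-second window.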
The Decoder
The decoder generates text tokens one at a time, using cross-attention to focus on relevant parts of the encoded audio. It's similar to how GPT generates text, but conditioned on audio instead of previous text.
- Learned positional embeddings track output position
- Cross-attention attends to encoder output
- Self-attention maintains coherence in generated text
- Output: Probability distribution over possible next tokens
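Both attention types reduce to the same scaled dot-product operation; only the source of the keys and values differs. A single-head NumPy sketch (omitting the learned projection matrices, masking, and multi-head splitting of the real model, and using random vectors as stand-ins for hidden states):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # similarity of each query to each key
    return softmax(scores) @ V      # weighted average of the values

rng = np.random.default_rng(0)
T, S, d = 5, 100, 64                       # 5 text positions, 100 audio frames
audio_states = rng.normal(size=(S, d))     # stand-in for encoder output
text_states = rng.normal(size=(T, d))      # stand-in for decoder hidden states

# Self-attention: text attends to text. Cross-attention: text attends to audio.
self_out = attention(text_states, text_states, text_states)
cross_out = attention(text_states, audio_states, audio_states)
print(self_out.shape, cross_out.shape)  # (5, 64) (5, 64)
```

Notice that the output always has one row per query: cross-attention lets each of the 5 text positions pull information from all 100 audio frames, which is how the decoder "listens" while it writes.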
Step 3: Tokenization and Decoding
Whisper uses Byte Pair Encoding (BPE) tokenization, the same as GPT-2. Instead of predicting individual characters or whole words, it predicts subword units:
- Common words become single tokens ("the" → [1169])
- Rare words split into subwords ("transcription" → [trans] [cript] [ion])
- Special tokens handle tasks (<|transcribe|>, <|translate|>, <|en|>)
- Timestamp tokens enable word-level timing (<|0.00|>, <|2.50|>)
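A toy version of BPE shows how "transcription" ends up as three subwords. The merge table below is invented for illustration; real BPE merges are learned from corpus statistics during tokenizer training, and the real vocabulary operates on bytes rather than characters:

```python
def bpe_tokenize(word, merges):
    # Start from individual characters, then apply each learned merge in order
    tokens = list(word)
    for a, b in merges:
        i, out = 0, []
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                out.append(a + b)   # merge the adjacent pair into one token
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

# Hypothetical merge table, ordered as it would be learned at training time
merges = [("t", "r"), ("tr", "a"), ("tra", "n"), ("tran", "s"),
          ("i", "p"), ("c", "r"), ("cr", "ip"), ("crip", "t"),
          ("i", "o"), ("io", "n")]
print(bpe_tokenize("transcription", merges))  # ['trans', 'cript', 'ion']
```

Because frequent character sequences get merged early, common words collapse to a single token while rare words fall back to shorter pieces, keeping the vocabulary compact without ever producing an "unknown word".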
Model Sizes and Trade-offs
Whisper comes in multiple sizes, each balancing accuracy against speed and memory:
- tiny — 39M parameters, fastest, lowest accuracy
- base — 74M parameters
- small — 244M parameters
- medium — 769M parameters
- large — 1,550M parameters, highest accuracy, slowest
Training: The Secret Sauce
What makes Whisper special isn't the architecture—it's the training data. OpenAI trained it on 680,000 hours of labeled audio from the internet, covering:
- 99 languages — From English to Welsh to Yoruba
- Multiple accents — British, American, Indian English, etc.
- Various audio quality — Podcasts, phone calls, meetings
- Different domains — Technical, medical, legal, casual speech
The large-v3 model was further trained on 1 million hours of weakly labeled audio plus 4 million hours of pseudo-labeled audio using earlier Whisper versions. This massive scale is why it handles real-world audio so well.
Multi-Task Learning
Whisper isn't just a transcription model—it's trained on multiple tasks simultaneously:
- Transcription — Convert speech to text in the same language
- Translation — Convert non-English speech to English text
- Language Detection — Identify the spoken language
- Voice Activity Detection — Identify when speech is present
- Timestamp Prediction — Align words to precise times
The model selects tasks via special tokens. For example, <|en|><|transcribe|> tells it to transcribe English, while <|es|><|translate|> tells it to translate Spanish to English.
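The decoder's prompt is literally a short sequence of these special tokens. A hedged sketch of how such a prompt might be assembled (<|startoftranscript|> and <|notimestamps|> are real Whisper special tokens, but this helper function is illustrative, not Whisper's API):

```python
def build_prompt(language: str, task: str, timestamps: bool = True) -> list[str]:
    # The decoder is conditioned on: start marker, language, then task.
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not timestamps:
        tokens.append("<|notimestamps|>")  # suppress timestamp tokens in output
    return tokens

print(build_prompt("en", "transcribe"))
# ['<|startoftranscript|>', '<|en|>', '<|transcribe|>']
print(build_prompt("es", "translate", timestamps=False))
# ['<|startoftranscript|>', '<|es|>', '<|translate|>', '<|notimestamps|>']
```

Swapping two tokens in the prompt is all it takes to turn the same weights from an English transcriber into a Spanish-to-English translator.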
Local vs Cloud: What's Different?
When you use local transcription (like Speakly's default mode), the entire model runs on your computer:
- GPU acceleration — Apple Silicon uses Metal, NVIDIA uses CUDA
- No network latency — Audio never leaves your device
- Privacy — Your voice data stays completely local
- Offline capable — Works without internet connection
Cloud APIs (OpenAI, Groq, Deepgram) run the same or similar models on powerful remote servers. The trade-off: more compute and often faster turnaround on large files, at the cost of privacy, network latency, and per-minute pricing.
Beyond Whisper: Other Approaches
While Whisper dominates the open-source space, commercial providers use different approaches:
Deepgram Nova-2
Deepgram trains custom models from scratch on domain-specific data. They offer specialized variants for meetings, phone calls, and medical transcription. Their architecture is proprietary but optimized for streaming real-time use.
Google Cloud Speech-to-Text
Google's Chirp model uses their Universal Speech Model (USM), trained on 12 million hours across 300+ languages. It excels at low-resource languages where Whisper struggles.
ElevenLabs Scribe
ElevenLabs Scribe claims 96.7% accuracy for English—among the highest reported. It includes built-in speaker diarization (identifying who said what) and audio event detection.
The Future: Where Is This Going?
Speech recognition is rapidly improving. Key trends to watch:
- Real-time streaming — Lower latency for instant transcription
- Multimodal models — GPT-4o can process audio directly without separate transcription
- On-device AI — Apple's Neural Engine, Qualcomm's AI accelerators
- Personalization — Models that adapt to your voice and vocabulary
- Context awareness — Understanding what application you're using
Further Reading
- Robust Speech Recognition via Large-Scale Weak Supervision — Original Whisper paper
- Whisper on GitHub — Open-source implementation
- Attention Is All You Need — Transformer architecture paper
- Hugging Face Whisper Guide — Using Whisper with Transformers library
Try AI Transcription Locally
Experience Whisper running entirely on your device. No cloud required, complete privacy. Try Speakly free for 7 days.
Download Now