Why Speech Recognition is Becoming an LLM Problem!
- Abhigyan Raman
(Founding Engineer, Sarvam.ai)
I am Full-Time Data Janitor & Part-Time Model Builder
Scales very well
Doesn't scale very well
And what doesn't scale, eventually dies...
(Kardashev law)
Why Speech Recognition is Becoming an LLM Problem!

Pre Scaling Era (Not End to End)
- Small Models (HMM-GMM, DeepSpeech, CTC, RNNT-like models)
- (Statistical LM, Normalization, Punctuation, LID, Spk-Dz)

Transition Era ("Figuring out Phase")
- Encoder-only (Wav2Vec2)
- Encoder-Decoder (Whisper)
- End to End, Multi-task
- Data Scaling (O(100k) hrs)
- Bigger Models

LLM Era
- Huge Data Scaling: O(10M) hrs
- LLM as Decoder
- Generalized Intelligence
- (Audio encoder remained the same)
History – The "Dark Ages"
🛑 The Pipeline Problem (Pre-End-to-End): a tangled web of disconnected boxes.
The Old Way: A Frankenstein monster of components.
- 🧩 Handcrafted Features: MFCCs (human-designed).
- 📊 Acoustic Model: HMM/GMM (phoneme probabilities).
- 📚 Language Model: N-gram (a totally separate brain).
- 🔤 Lexicon: G2P (rule-based).
- The Failure Mode: If one piece broke, the whole chain failed. No shared intelligence.
The First Revolution – "Frame Prediction"
A "frame-by-frame" guessing system.
- The Breakthrough: End-to-End Deep Learning.
- The Players: DeepSpeech, Wav2Vec2, Conformer.
- The Logic: Predict a letter for every split-second of audio.
The "But...": 😰
- Still needed CTC (Connectionist Temporal Classification).
- Output was raw: "hello how are u" (no caps, no punctuation).
- The Band-Aid: We still needed external LMs and normalizers to make it readable.
The Pivot (Whisper)
Whisper: Weak Supervision, Strong Decoder
- The Paradigm Shift: 680k hours of noisy, diverse data.
- Speech-as-Tokens: First to treat audio tasks (Translate, Transcribe, Timestamp) as just special tokens.
- The Lesson: A strong autoregressive decoder can "hallucinate correctly"—fixing acoustic errors using context.
- The Result: No more specialized tuning. Zero-shot performance became the standard.
The "Aha!" Moment – Audio is just a Token
Enter the Audio-LLM (The Generalist)
- The Concept: If the Decoder is the key... why not use the ultimate decoder? (The LLM).
-
The Mechanics:
- Audio features projected into the same vector space as text.
-
The Possibilities:
- 🎤 Speech Prompt: "Summarize this audio."
- 🗣️ Audio Context: "Who is the second speaker?"
- 💬 Mixed Modality: "Translate this song lyrics and explain the metaphor."
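The projection step above is the whole trick: a learned linear map takes audio-encoder features into the text-embedding space, so the LLM attends over audio frames exactly as it attends over text tokens. A minimal numpy sketch with illustrative shapes (not taken from any specific model):

```python
import numpy as np

rng = np.random.default_rng(0)
audio_feats = rng.standard_normal((1500, 1280))  # (frames, encoder dim)
W = rng.standard_normal((1280, 4096)) * 0.01     # learned projection (here random)
b = np.zeros(4096)

audio_tokens = audio_feats @ W + b               # now in LLM embedding space
text_tokens = rng.standard_normal((8, 4096))     # embedded text prompt

# One unified sequence: the decoder cannot tell which rows "were audio".
llm_input = np.concatenate([audio_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (1508, 4096)
```

In practice the projection is often a small MLP or a downsampling adapter rather than a single matrix, but the interface contract is the same: output vectors must match the LLM's embedding dimension.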
Traditional ASR: CTC-like / Conformer
[Architecture diagram: input audio frames feed stacked Transformer encoder blocks (self-attention + MLP); each frame position independently emits an output token ("The", "quick", "brown", "fox"), with a thin decoder layer on top in the CTC-like variant.]
Autoregressive ASR: Whisper Large v3
[Architecture diagram: Transformer encoder blocks (self-attention + MLP) encode the input audio; Transformer decoder blocks (self-attention + cross-attention + MLP) attend to the encoder output. Decoding starts from the special-token prefix SOT | EN | 0.0 | transcribe and generates "The quick brown fox" one token at a time, feeding each prediction back into the prefix.]
Voxtral Audio LLM
[Architecture diagram: an 81 s input is split into 30 s + 30 s + 21 s (padded) chunks; each chunk passes through Whisper audio-encoder blocks (self-attention + MLP) to produce audio embeddings; a temporal-downsampling adapter layer maps these to audio features, which are concatenated with the text features of the prompt ("please trans cribe") and fed to Mistral LLM decoder blocks, which autoregressively generate "The quick brown fox".]
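The chunking in the Voxtral diagram follows directly from the encoder's fixed 30-second receptive window: an 81 s clip becomes two full chunks plus a 21 s remainder padded out to 30 s. A minimal sketch (16 kHz sample rate as is standard for Whisper-style encoders; the bookkeeping format is illustrative):

```python
SR = 16_000           # samples per second
CHUNK = 30 * SR       # fixed 30 s encoder window

def chunk_and_pad(n_samples):
    """Split into 30 s windows; report (real samples, padding) per chunk."""
    chunks = []
    for start in range(0, n_samples, CHUNK):
        length = min(CHUNK, n_samples - start)
        chunks.append((length, CHUNK - length))
    return chunks

print(chunk_and_pad(81 * SR))
# [(480000, 0), (480000, 0), (336000, 144000)]  -> 30s, 30s, 21s + 9s PAD
```

Each padded chunk is encoded independently; the adapter then downsamples and concatenates the chunk embeddings so the LLM sees one continuous audio sequence.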
But there is a flaw with Audio-LLMs
General Audio-LLMs are too creative. They hallucinate content that isn't there and struggle with strict timestamp alignment.
Recent Shift (Specialized ASR-LLMs)
Drops of 2026:
- 🗣️ VibeVoice-ASR (Microsoft): The "Long-Form Beast." 60-min single-pass understanding. Unifies ASR + Diarization + Timestamps in one model.
- 🛡️ Qwen3-ASR (Alibaba): The "Secure Specialist." 1.7B params. Context understanding + Long form + Real time.
- ⚡ Voxtral Transcribe 2 (Mistral): The "Speed Demon." Real-time streaming with configurable delay (80–480 ms).
- 🇮🇳 Sarvam Audio: 22 Indian languages + English. User-controlled output styles (Verbatim vs. Normalized), long-context biasing, long-form support, etc.
Why Do We Need Them? Security, Precision, Control.
1. Instruction Restriction (Safety):
- Problem: Users can "jailbreak" general LLMs via voice.
- Solution: Qwen3-ASR is trained as "ASR-only." It explicitly ignores natural-language instructions to prevent injection and hallucinations.
2. Formatting Control (UX):
- Problem: Doctors need "verbatim" text; subtitles need "clean" text.
- Solution: Sarvam Audio allows user-defined output styles (Literal vs. Normalized) via prompt control.
3. Hotword Biasing:
- Solution: Voxtral allows injecting ~100 context words (domain terms) to bias the transcript, fixing the "rare word" problem.
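Hotword biasing of the kind described above can be sketched as simple prompt assembly: domain terms are placed in the context so the decoder is nudged toward the right spellings. The prompt format and `build_biased_prompt` helper below are hypothetical illustrations, not Voxtral's actual API:

```python
# Hypothetical hotword-biasing sketch: prepend domain terms to the prompt.
def build_biased_prompt(instruction, hotwords, limit=100):
    """Assemble a transcription prompt with up to `limit` bias terms."""
    if len(hotwords) > limit:
        raise ValueError(f"too many hotwords (max {limit})")
    return f"Context terms: {', '.join(hotwords)}\n{instruction}"

prompt = build_biased_prompt("Please transcribe the audio.",
                             ["Sarvam", "Voxtral", "diarization"])
print(prompt)
```

Because the decoder conditions on its full context, rare terms that appear in the prompt become far more likely in the output than they would be from acoustics alone.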
The Training "Secret Sauce"
The 5-Stage Recipe for SOTA Models
- Step 1: Alignment Pretraining: Freeze the LLM. Train first the Adapter, then Adapter + Encoder, on millions of hours of audio-text pairs.
- Step 2: Long-Context Expansion: Expand the context window to 32k tokens (~40 mins).
- Step 3: Supervised Fine-Tuning (Auxiliary): Train on translation, speaker ID, and QA to build "world model" understanding.
- Step 4: ASR-Specific SFT: The "Hardening" phase. Train with special tokens (<beg_spk>, <end_time>) and force strict transcription behavior.
- Step 5: DPO/GRPO (Alignment): Use reinforcement learning (Direct Preference Optimization) to penalize hallucinations and reward timestamp accuracy.
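The staged freezing in Step 1 can be sketched with plain trainability flags; in a real framework this corresponds to setting `requires_grad = False` on the frozen modules' parameters. The stage names below are illustrative labels for the two sub-phases of alignment pretraining:

```python
# Stage-1 alignment sketch: freeze the LLM throughout; train the adapter
# alone first, then adapter + encoder together. True = trainable.
def set_trainable(stage):
    if stage == "1a":   # adapter only, everything else frozen
        return {"encoder": False, "adapter": True, "llm": False}
    if stage == "1b":   # unfreeze the encoder, LLM still frozen
        return {"encoder": True, "adapter": True, "llm": False}
    raise ValueError(f"unknown stage: {stage}")

print(set_trainable("1a"))
print(set_trainable("1b"))
```

Keeping the LLM frozen in stage 1 forces the adapter (and later the encoder) to move audio features into the LLM's existing representation space, rather than letting the LLM drift toward the audio.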
Thank you!
PyTorch Day
By Sshubam Verma