- Abhigyan Raman
(Founding Engineer, Sarvam.ai)
Scales very well / Doesn't scale very well (Kardashev law)

Transition Era: Not End-to-End
Small Models (HMM-GMM, DeepSpeech, CTC, RNN-T-like models)
(Statistical LM, Normalization, Punctuation, LID, Speaker Diarization)

"Figuring out" Phase
Encoder-only (Wav2Vec2), Encoder-Decoder (Whisper)
End-to-End, Multi-task
Data Scaling (O(100k) hrs), Bigger Models

Huge Data Scaling: O(10M) hrs
LLM as Decoder
Generalized Intelligence (audio encoder remained the same)
🛑 The Pipeline Problem (Pre-End-to-End): a tangled web of disconnected boxes.
A "frame-by-frame" guessing system; its raw output reads "hello how are u" (no caps, no punctuation).

Whisper: Weak Supervision, Strong Decoder
Enter the Audio-LLM (The Generalist)
[Diagram: Traditional ASR, CTC-like. Input audio frames pass through a stack of Transformer encoder blocks (self-attention + MLP) topped by a single thin decoder layer.]

[Diagram: Traditional ASR, Conformer. Each input audio frame (1-4) yields one output token through the Transformer encoder blocks, frame-by-frame: "The", "quick", "brown", "fox".]
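To make the frame-by-frame behavior concrete, here is a minimal greedy CTC decoding sketch (illustrative only; the toy vocabulary and fake logits are assumptions, not any real model's output). The encoder emits one distribution per audio frame, and decoding collapses repeats and drops blanks, which is also why raw CTC output arrives without casing or punctuation.

import numpy as np

# Toy vocabulary for illustration; index 0 is the CTC blank token.
VOCAB = ["<blank>", "the", "quick", "brown", "fox"]

def ctc_greedy_decode(logits: np.ndarray) -> str:
    """Greedy CTC decode: argmax per frame, collapse repeats, drop blanks."""
    best = logits.argmax(axis=-1)   # one token id per audio frame
    out, prev = [], -1
    for t in best:
        if t != prev and t != 0:    # skip repeated ids and the blank
            out.append(VOCAB[t])
        prev = t
    return " ".join(out)            # raw output: no caps, no punctuation

# Fake per-frame logits (8 frames x 5 tokens) standing in for encoder output.
frames = np.zeros((8, len(VOCAB)))
frames[[0, 1], 1] = 10.0   # two frames of "the"
frames[2, 0]      = 10.0   # a blank frame separates tokens
frames[[3, 4], 2] = 10.0   # "quick"
frames[5, 3]      = 10.0   # "brown"
frames[[6, 7], 4] = 10.0   # "fox"
print(ctc_greedy_decode(frames))   # -> the quick brown fox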
[Diagram: Autoregressive ASR, Whisper Large v3. Input audio passes through Transformer encoder blocks (self-attention + MLP); Transformer decoder blocks (self-attention, cross-attention, MLP) attend back to the encoder output. Decoding starts from a prefix of special tokens (SOT, EN, 0.0, transcribe), and the transcript is generated autoregressively, each emitted token ("The", "quick", "brown", "fox") fed back as input for the next step.]
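A minimal sketch of this prefix-conditioned autoregressive decoding with the Hugging Face transformers Whisper API (the silent placeholder waveform is an assumption; substitute a real 16 kHz clip):

import numpy as np
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-large-v3")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3")

# Placeholder: 5 s of silence at 16 kHz; substitute a real waveform.
audio = np.zeros(16000 * 5, dtype=np.float32)

# Log-mel features play the role of "Input Audio" feeding the encoder blocks.
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# language/task select the <|en|> and <|transcribe|> prefix tokens; the decoder
# then emits one token at a time, cross-attending to the encoder output.
ids = model.generate(inputs.input_features, language="en", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])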
[Diagram: Voxtral Audio LLM. An 81 s input is split into 30 s + 30 s + 21 s chunks, the last padded to 30 s. Each chunk passes through the Whisper audio encoder blocks to produce an audio embedding; a temporal downsampling adapter layer compresses these into audio features. The audio features are concatenated (+) with the text features of the prompt tokens ("please", "trans", "cribe") and fed to the Mistral LLM decoder blocks, which generate the transcript autoregressively: "The", "quick", "brown", "fox".]
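A schematic PyTorch sketch of that data flow, with loud assumptions: whisper_encoder and llm stand in for the real Voxtral modules, the 4x frame-stacking adapter is illustrative, and no published Voxtral hyperparameters are claimed.

import torch
import torch.nn as nn

SR = 16000
CHUNK = 30 * SR  # 30 s encoder windows

def chunk_and_pad(wav: torch.Tensor) -> torch.Tensor:
    """Split a 1-D waveform into 30 s chunks, zero-padding the last one."""
    n = -(-wav.numel() // CHUNK)          # ceil division: 81 s -> 3 chunks
    padded = torch.zeros(n * CHUNK)
    padded[: wav.numel()] = wav
    return padded.view(n, CHUNK)

class AudioLLM(nn.Module):
    """Illustrative wiring only: Whisper encoder -> adapter -> LLM decoder."""
    def __init__(self, whisper_encoder, llm, d_audio=1280, d_model=4096):
        super().__init__()
        self.encoder = whisper_encoder    # pretrained audio encoder
        # Temporal downsampling: stack 4 neighboring frames, project to LLM width.
        self.adapter = nn.Linear(4 * d_audio, d_model)
        self.llm = llm                    # decoder-only LLM (e.g. Mistral)

    def forward(self, wav, text_embeds):
        chunks = chunk_and_pad(wav)                        # (n, CHUNK)
        h = self.encoder(chunks)                           # (n, T, d_audio)
        n, T, d = h.shape
        h = h.reshape(n, T // 4, 4 * d)                    # 4x fewer frames
        audio_feats = self.adapter(h).flatten(0, 1)        # (n*T/4, d_model)
        seq = torch.cat([audio_feats, text_embeds], dim=0) # audio + prompt
        return self.llm(inputs_embeds=seq.unsqueeze(0))    # autoregressive LM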
General Audio-LLMs are too creative. They hallucinate content that isn't there and struggle with strict timestamp alignment.
🇮🇳 Sarvam Audio: 22 Indian languages + English. User-controlled output styles (Verbatim vs. Normalized), long-context biasing, long-form support, etc.
Security, Precision, Control.
The 5-Stage Recipe for SOTA Models
Add special tokens (e.g. <beg_spk>, <end_time>) and force strict transcription behavior.
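A minimal sketch of registering such control tokens with the Hugging Face tokenizers API (the base checkpoint is a placeholder; only the token names come from the slide):

from transformers import AutoTokenizer, AutoModelForCausalLM

BASE = "mistralai/Mistral-7B-v0.1"        # placeholder base checkpoint
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Register the control tokens so they are never split into subwords; after
# fine-tuning, the model emits them at speaker/time boundaries, anchoring
# the decoder to strict transcription behavior instead of free generation.
tokenizer.add_special_tokens(
    {"additional_special_tokens": ["<beg_spk>", "<end_time>"]}
)
model.resize_token_embeddings(len(tokenizer))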