Why Speech Recognition is Becoming an LLM Problem!

- Abhigyan Raman

(Founding Engineer, Sarvam.ai)

I am Full-Time Data Janitor & Part-Time Model Builder

Scales very well
Doesn't scale very well

And what doesn't scale eventually dies…

(Kardashev law)

Pre-Scaling Era

  • Not end to end
  • Small models (HMM-GMM, DeepSpeech, CTC, RNN-T-like models)
  • Separate components: statistical LM, normalization, punctuation, LID, speaker diarization

Transition Era (the "Figuring-out Phase")

  • End to end, multi-task
  • Encoder-only (Wav2Vec2)
  • Encoder-Decoder (Whisper)
  • Data scaling: O(100k) hrs
  • Bigger models

LLM Era

  • Huge data scaling: O(10M) hrs
  • LLM as Decoder
  • Generalized intelligence
  • (Audio encoder remained the same)

History – The "Dark Ages"

🛑 The Pipeline Problem (Pre-End-to-End): a tangled web of disconnected boxes.

  • The Old Way: A Frankenstein monster of components.
    • 🧩 Handcrafted Features: MFCCs (Human-designed).
    • 📊 Acoustic Model: HMM/GMM (Phoneme probabilities).
    • 📚 Language Model: N-Gram (Totally separate brain).
    • 🔤 Lexicon: G2P (Rule-based).
  • The Failure Mode: If one piece broke, the whole chain failed. No shared intelligence.

The First Revolution – "Frame Prediction"

"Frame-by-Frame" guessing system.

  • The Breakthrough: End-to-End Deep Learning.
  • The Players: DeepSpeech, Wav2Vec2, Conformer.
  • The Logic: Predict a letter for every split-second of audio.
  • The "But...": 😰
    • Still needed CTC (Connectionist Temporal Classification).
    • Output was raw: hello how are u (No caps, no punctuation).
    • The Band-Aid: We still needed external LMs and Normalizers to make it readable.
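The CTC step above can be sketched in a few lines: greedy decoding takes the best symbol per frame, collapses consecutive repeats, and drops the blank token. The vocabulary and one-hot "logits" below are toy values for illustration, not any real model's output.

```python
import numpy as np

BLANK = 0
VOCAB = {0: "_", 1: "h", 2: "e", 3: "l", 4: "o"}   # toy vocabulary

def ctc_greedy_decode(logits: np.ndarray) -> str:
    """logits: (T, V) array of per-frame scores."""
    best = logits.argmax(axis=-1)            # best symbol for each frame
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:     # collapse repeats, drop blanks
            out.append(VOCAB[int(idx)])
        prev = idx
    return "".join(out)

# Frames predicting: h h _ e l l _ l o  ->  "hello"
frames = [1, 1, 0, 2, 3, 3, 0, 3, 4]
print(ctc_greedy_decode(np.eye(5)[frames]))  # -> hello
```

Note how the blank between the two "l" runs is what lets the double letter survive the repeat-collapse, which is exactly why the raw output still needed external cleanup for casing and punctuation.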

The Pivot (Whisper)

Whisper: Weak Supervision, Strong Decoder

  • The Paradigm Shift: 680k hours of noisy, diverse data.
  • Speech-as-Tokens: First to treat audio tasks (Translate, Transcribe, Timestamp) as just special tokens.
  • The Lesson: A strong autoregressive decoder can "hallucinate correctly"—fixing acoustic errors using context.
  • The Result: No more specialized tuning. Zero-shot performance became the standard.
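The "speech-as-tokens" idea boils down to a prefix of special tokens followed by an ordinary autoregressive loop. A minimal sketch, where `fake_decoder_step` is a stand-in for the real decoder conditioned on audio features:

```python
# Whisper-style decoding sketch: the task (language, transcribe vs. translate,
# timestamps) is specified entirely by a special-token prefix; text follows.

PREFIX = ["<|startoftranscript|>", "<|en|>", "<|transcribe|>", "<|0.00|>"]
EOT = "<|endoftext|>"

def fake_decoder_step(tokens):
    # Stand-in for decoder(audio_features, tokens): returns the next token.
    script = ["The", "quick", "brown", "fox", EOT]
    return script[len(tokens) - len(PREFIX)]

def greedy_decode(max_len=16):
    tokens = list(PREFIX)                  # the task prefix conditions decoding
    while len(tokens) < max_len:
        nxt = fake_decoder_step(tokens)    # would be an argmax over the vocab
        if nxt == EOT:
            break
        tokens.append(nxt)                 # feed the token back in next step
    return " ".join(tokens[len(PREFIX):])

print(greedy_decode())                     # -> The quick brown fox
```

Swapping `<|transcribe|>` for `<|translate|>` is the whole "multi-task" story: same model, different prefix.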

The "Aha!" Moment – Audio is just a Token

Enter the Audio-LLM (The Generalist)

  • The Concept: If the Decoder is the key... why not use the ultimate decoder? (The LLM).
  • The Mechanics:
    • Audio features projected into the same vector space as text.
  • The Possibilities:
    • 🎤 Speech Prompt: "Summarize this audio."
    • 🗣️ Audio Context: "Who is the second speaker?"
    • 💬 Mixed Modality: "Translate these song lyrics and explain the metaphor."
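The mechanics bullet is one matrix multiply plus a concatenation. A numpy sketch with illustrative dimensions (the widths and a random projection are assumptions, not any specific model's values):

```python
import numpy as np

d_audio, d_model = 1280, 4096                  # audio-encoder width -> LLM width
rng = np.random.default_rng(0)

W = rng.standard_normal((d_audio, d_model)) * 0.01   # learned projection (random here)
audio_feats = rng.standard_normal((500, d_audio))    # audio-encoder output frames
text_embeds = rng.standard_normal((8, d_model))      # embedded text-prompt tokens

audio_tokens = audio_feats @ W                       # now in the LLM's vector space
llm_input = np.concatenate([audio_tokens, text_embeds], axis=0)
print(llm_input.shape)                               # (508, 4096)
```

Once projected, the LLM cannot tell audio tokens from text tokens, which is what makes the mixed-modality prompts above possible.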

Traditional ASR: CTC-like / Conformer (architecture diagram, animated slide build)

  • Transformer/Conformer encoder blocks (repeated self attention + MLP layers) process the input audio frames.
  • A lightweight decoder layer (a single MLP) emits one prediction per frame, independently: Frame 1 → "The", Frame 2 → "quick", Frame 3 → "brown", Frame 4 → "fox".

Autoregressive ASR: Whisper Large v3 (architecture diagram, animated slide build)

  • Transformer encoder blocks (self attention + MLP) encode the input audio once.
  • Transformer decoder blocks (self attention + cross attention + MLP) attend to the encoder output via cross attention.
  • Decoding starts from a special-token prefix (SOT, EN, 0.0, transcribe) and runs autoregressively: each generated token ("The", "quick", "brown", "fox") is appended to the prefix and fed back in for the next step.

Voxtral Audio LLM (architecture diagram, animated slide build)

  • The input audio (81 s) is split into chunks (30 s + 30 s + 21 s + PAD), and each chunk is encoded by the Whisper audio encoder blocks (self attention + MLP).
  • The per-chunk audio embeddings pass through a temporal downsampling adapter layer to produce compact audio features.
  • The audio features are concatenated with the text features of the prompt ("please trans cribe") and fed into the Mistral LLM decoder blocks.
  • The decoder generates the transcript autoregressively: "The", "quick", "brown", "fox", with each emitted token fed back in as input.
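The chunking scheme in the diagram (30 s windows, last one padded) is straightforward to sketch; the 16 kHz sample rate is an assumption for illustration:

```python
# Split 81 s of audio into fixed 30 s windows, zero-padding the 21 s tail
# to full length before it goes through the audio encoder.
import numpy as np

SR, CHUNK_S = 16000, 30
audio = np.zeros(81 * SR)                 # stand-in for 81 s of audio

chunks = []
for start in range(0, len(audio), CHUNK_S * SR):
    chunk = audio[start:start + CHUNK_S * SR]
    pad = CHUNK_S * SR - len(chunk)       # 0 for full chunks, >0 for the tail
    chunks.append(np.pad(chunk, (0, pad)))

print([len(c) / SR for c in chunks])      # [30.0, 30.0, 30.0] -- 21 s tail padded
```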

But there is a flaw with Audio-LLMs

General Audio-LLMs are too creative: they hallucinate content that isn't there and struggle with strict timestamp alignment.

Recent Shift (Specialized ASR-LLMs)

  • Drops of 2026:
    • 🗣️ VibeVoice-ASR (Microsoft): The "Long-Form Beast." 60-min single-pass understanding; unifies ASR + diarization + timestamps in one model.
    • 🛡️ Qwen3-ASR (Alibaba): The "Secure Specialist." 1.7B params; context understanding + long form + real time.
    • Voxtral Transcribe 2 (Mistral): The "Speed Demon." Real-time streaming with configurable delay (80–480 ms).
    • 🇮🇳 Sarvam Audio: 22 Indian languages + English; user-controlled output styles (Verbatim vs. Normalized), long-context biasing, long-form support, etc.

Why Do We Need Them?

Security, Precision, Control.

  • 1. Instruction Restriction (Safety):
    • Problem: Users can "jailbreak" general LLMs via voice.
    • Solution: Qwen3-ASR is trained as "ASR-Only." It explicitly ignores natural language instructions to prevent injection and hallucinations.
  • 2. Formatting Control (UX):
    • Problem: Doctors need "Verbatim" text; Subtitles need "Clean" text.
    • Solution: Sarvam Audio allows user-defined output styles (Literal vs. Normalized) via prompt control.
  • 3. Hotword Biasing:
    • Solution: Voxtral allows injecting ~100 context words (domain terms) to bias the transcript, fixing the "rare word" problem.
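Hotword biasing can be pictured as a nudge to decode-time logits: tokens belonging to user-supplied domain terms get a small bonus before the argmax. This mimics the idea only; the vocabulary, bias value, and medical example are invented, and it is not Voxtral's actual mechanism.

```python
import numpy as np

vocab = ["the", "patient", "has", "nephritis", "nefritis"]
hotwords = {"nephritis"}                      # user-supplied domain terms
BIAS = 2.0

def biased_argmax(logits):
    logits = logits.copy()
    for i, tok in enumerate(vocab):
        if tok in hotwords:
            logits[i] += BIAS                 # nudge, don't force
    return vocab[int(np.argmax(logits))]

# Acoustics are ambiguous between the rare word and a lookalike:
step_logits = np.array([0.1, 0.2, 0.1, 1.0, 1.5])
print(biased_argmax(step_logits))             # -> nephritis
```

Because the bias is additive rather than a hard constraint, the model still wins when the acoustics clearly favor another word.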

The Training "Secret Sauce"

The 5-Stage Recipe for SOTA Models

  • Step 1: Alignment Pretraining: Freeze the LLM. First train the Adapter alone, then Adapter + Encoder, on millions of hours of audio-text pairs.
  • Step 2: Long-Context Expansion: Expand the context window to 32k tokens (~40 mins).
  • Step 3: Supervised Fine-Tuning (Auxiliary): Train on translation, speaker ID, and QA to build "world model" understanding.
  • Step 4: ASR-Specific SFT: The "Hardening" phase. Train with special tokens (<beg_spk>, <end_time>) and force strict transcription behavior.
  • Step 5: DPO/GRPO (Alignment): Use preference-based reinforcement learning (Direct Preference Optimization / GRPO) to penalize hallucinations and reward timestamp accuracy.
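Step 1's staged freezing is a `requires_grad` pattern in PyTorch. A minimal sketch with toy stand-in modules (a real model would use a Whisper-style encoder, a projection adapter, and a pretrained LLM decoder):

```python
import torch.nn as nn

class AudioLLM(nn.Module):
    # Toy stand-in modules with illustrative widths.
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(80, 256)
        self.adapter = nn.Linear(256, 512)
        self.llm = nn.Linear(512, 512)

def freeze(module, frozen=True):
    for p in module.parameters():
        p.requires_grad = not frozen

model = AudioLLM()
# Stage 1a: freeze LLM and encoder; train only the adapter.
freeze(model.llm); freeze(model.encoder); freeze(model.adapter, frozen=False)
# Stage 1b: unfreeze the encoder as well; the LLM stays frozen throughout.
freeze(model.encoder, frozen=False)

trainable = sorted({n.split(".")[0] for n, p in model.named_parameters()
                    if p.requires_grad})
print(trainable)   # ['adapter', 'encoder']
```

The optimizer then only sees parameters with `requires_grad=True`, so the frozen LLM costs no gradient memory during alignment pretraining.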

Thank you!

PyTorch Day

By Sshubam Verma
