An overview of the paper presented at INTERSPEECH 2023
20-24 August 2023, Dublin, Ireland
WhisperX: Time-Accurate Speech Transcription of Long-Form Audio by Max Bain, Jaesung Huh, Tengda Han, Andrew Zisserman
Visual Geometry Group, University of Oxford
Overview
The paper introduces WhisperX, a system designed to transcribe long-form audio (e.g. meetings, podcasts, videos) with accurate word-level timestamps, while remaining efficient (fast) and avoiding the pitfalls of previous approaches when dealing with long, continuous speech. WhisperX builds on OpenAI’s Whisper model and addresses two main problems:
- Timestamp inaccuracy in Whisper for utterances/words: out-of-the-box, Whisper gives timestamps for segments, but these are often quite coarse or error-prone; word-level timestamps aren’t directly provided or reliable.
- Scalability / efficiency for long audio: Whisper is trained on ~30-second chunks; naively feeding it longer audio, or using overlapping sliding windows, causes boundary errors and heavy compute, so transcribing very long inputs (minutes to hours) becomes slow and prone to drift and repetition.
WhisperX adds components to:
- Pre-segment the audio using Voice Activity Detection (VAD).
- Cut & merge segments using heuristics so that segments fed to Whisper match its training duration (≈30s) and do not cut through active speech.
- Force align phonemes (via an external phoneme recognition model) to get precise word-level timestamps.
They show that this combined approach yields state-of-the-art performance for long-audio transcription and word segmentation benchmarks, with large speedups (≈12×) when using batched inference enabled by the VAD + cut & merge strategy.
System Components
WhisperX comprises several stages.
- Voice Activity Detection (VAD)
- First, audio is passed through a VAD model to detect where speech is active vs inactive.
- The purpose is threefold: skip non-speech regions so ASR compute isn’t wasted on silence; place chunk boundaries where they won’t split active speech; and provide local speech boundaries that later constrain the phoneme alignment.
- VAD Cut & Merge
- Cut: If an active speech region (a segment) is longer than the maximum duration Whisper is trained on (~30 seconds), they split it (“min-cut”) at the point of minimum voice-activation score, i.e. where speech is least likely to be active, so that the boundary does not fall in the middle of a word or dense speech. This ensures each segment is no longer than Whisper’s input size.
- Merge: Conversely, very short speech segments are problematic (they lose context and are inefficient to process). After cutting, neighbouring segments are merged whenever their aggregate span stays below a threshold τ, bringing durations closer to the training duration (they report τ ≈ 30 seconds works best). This improves both speed (fewer segments → fewer ASR passes) and accuracy (more context). A minimal sketch of the cut & merge heuristic appears after this list.
- Whisper Transcription
- The processed speech segments (after cut & merge) are transcribed with Whisper in parallel (batched inference). Each segment is transcribed independently, without conditioning on previously decoded text, to prevent errors from propagating or drifting across segments. They also decode without Whisper’s timestamp tokens, since these are unreliable for word alignment.
- Forced Phoneme Alignment
- Once there is text for each audio segment, the words (and their constituent phonemes) need to be aligned to exact times. WhisperX uses an external phoneme recognition model (e.g. wav2vec2.0 fine-tuned for phoneme classification) to produce per-frame logits over phoneme classes, then uses Dynamic Time Warping (DTW) to align those frames with the phoneme sequence implied by the transcript. From that alignment it assigns start and end times to each word (via the first and last phonemes of the word). For words containing phonemes missing from the alignment model’s dictionary, it falls back to the timestamps of the nearest aligned phonemes. A sketch of this alignment step also appears after this list.
- Multilingual & Translation Settings
- They note that WhisperX can work in multilingual contexts, provided the VAD model is language-robust and a phoneme alignment model exists for the language. They also acknowledge that in Whisper’s translate mode (i.e. generating the transcript in a different target language), phoneme alignment isn’t possible, since the translated text no longer corresponds phonetically to the spoken audio.
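To make the cut & merge heuristic concrete, here is a minimal sketch in Python. It is not the authors’ released implementation: the (start, end) segment representation, the per-frame `vad_scores` array, the `frame_rate`, and the helper names are all assumptions for illustration; the paper only specifies cutting over-long segments at the point of minimum voice-activation score and merging short neighbours up to a maximum duration τ ≈ 30 s.

```python
# Illustrative sketch of VAD cut & merge (not the paper's code).
# Assumes `segments` is a list of (start, end) times in seconds from a VAD model,
# and `vad_scores` is a per-frame speech probability sampled at `frame_rate` Hz.

def cut_segment(start, end, vad_scores, frame_rate, max_dur=30.0):
    """Recursively split a segment longer than max_dur at the frame with the
    lowest voice-activation score ("min-cut"), keeping a margin from the edges."""
    if end - start <= max_dur:
        return [(start, end)]
    lo, hi = int(start * frame_rate), int(end * frame_rate)
    margin = int(0.5 * frame_rate)          # never cut right at a segment boundary
    window = vad_scores[lo + margin:hi - margin]
    cut_frame = lo + margin + min(range(len(window)), key=window.__getitem__)
    cut_time = cut_frame / frame_rate
    return (cut_segment(start, cut_time, vad_scores, frame_rate, max_dur)
            + cut_segment(cut_time, end, vad_scores, frame_rate, max_dur))

def merge_segments(segments, max_dur=30.0):
    """Greedily merge neighbouring segments while the merged span stays <= max_dur (τ)."""
    merged = []
    for start, end in segments:
        if merged and end - merged[-1][0] <= max_dur:
            merged[-1] = (merged[-1][0], end)   # extend the previous chunk to cover this one
        else:
            merged.append((start, end))
    return merged
```

Merging on the aggregate span (current end minus the chunk’s first start) rather than on pairwise gaps is one simple way to keep every chunk close to Whisper’s ~30-second training duration.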
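The forced-alignment step can be sketched in a similar spirit. The snippet below is an illustrative dynamic-programming alignment, not the released code: the `phoneme_log_probs` matrix (frames × phoneme classes) from a wav2vec2.0-style phoneme model, the `phoneme_ids` list obtained by looking the transcript’s words up in a pronunciation dictionary, and the fixed `frame_dur` are all assumptions for illustration.

```python
import numpy as np

def align_phonemes(phoneme_log_probs, phoneme_ids, frame_dur):
    """Monotonic DTW-style alignment of a transcript phoneme sequence to audio frames.

    phoneme_log_probs: (T, C) array of per-frame log-probabilities from a phoneme model.
    phoneme_ids:       length-N list of the transcript's phonemes as class indices.
    frame_dur:         duration of one frame in seconds.
    Returns one (start_time, end_time) pair per transcript phoneme.
    (Assumes at least as many frames as phonemes, T >= N.)
    """
    T, N = phoneme_log_probs.shape[0], len(phoneme_ids)
    cost = np.full((T, N), np.inf)
    cost[0, 0] = -phoneme_log_probs[0, phoneme_ids[0]]
    for t in range(1, T):
        for i in range(N):
            prev = cost[t - 1, i]                          # stay on the same phoneme
            if i > 0:
                prev = min(prev, cost[t - 1, i - 1])       # or advance to the next one
            cost[t, i] = prev - phoneme_log_probs[t, phoneme_ids[i]]
    # Backtrack: assign every frame to a phoneme index, moving monotonically.
    spans, i = {}, N - 1
    for t in range(T - 1, -1, -1):
        lo, hi = spans.get(i, (t, t))
        spans[i] = (min(lo, t), max(hi, t))
        if t > 0 and i > 0 and (i >= t or cost[t - 1, i - 1] <= cost[t - 1, i]):
            i -= 1
    return [(spans[i][0] * frame_dur, (spans[i][1] + 1) * frame_dur) for i in range(N)]
```

Word-level timestamps then follow directly: a word’s start time is the start of its first phoneme’s span, and its end time is the end of its last phoneme’s span.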
Experimental Setup
They conduct a range of experiments to evaluate performance. Key datasets:
- AMI Meeting Corpus (AMI-IHM): meeting audio, word-level alignments.
- Switchboard-1 (SWB): telephone conversations, with corrected word alignments.
- TEDLIUM-3: 11 TED talks, each roughly 20 minutes long, for long-form transcription evaluation.
- Kincaid46: videos from YouTube for more varied long-form audio.
Metrics they measure include:
- WER: Word Error Rate.
- Spd. (Speed): how fast the transcription is, relative to the Whisper baseline.
- IER: Insertion Error Rate, used as a proxy for hallucinated content.
- 5-Dup.: the number of duplicated 5-grams in the output, to detect repetition.
- For word segmentation (word-level timestamps): Precision and Recall, where a predicted word segment counts as correct if its string matches the ground-truth word exactly and its boundaries fall within a 200 ms collar of the ground-truth segment.
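To make the segmentation metric concrete, here is a small sketch of collar-based precision/recall. The exact matching rules of the paper’s evaluation script may differ; the 200 ms collar and exact string match come from the description above, while the data layout and greedy one-to-one matching are assumptions.

```python
def word_segmentation_pr(predicted, reference, collar=0.2):
    """Precision/recall of word-level timestamps under a boundary collar.

    predicted, reference: lists of dicts like {"word": str, "start": float, "end": float}.
    A prediction is a hit if some not-yet-matched reference word has the same string and
    both boundaries lie within `collar` seconds of the predicted boundaries.
    """
    matched = [False] * len(reference)
    hits = 0
    for p in predicted:
        for j, r in enumerate(reference):
            if (not matched[j]
                    and p["word"] == r["word"]
                    and abs(p["start"] - r["start"]) <= collar
                    and abs(p["end"] - r["end"]) <= collar):
                matched[j] = True
                hits += 1
                break
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```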
They also compare different variants: with/without VAD cut & merge; different Whisper model sizes; different phoneme models; batched vs non-batched transcription.
Key Results
Here are the main empirical findings:
- WhisperX vs Whisper / wav2vec2.0 in word segmentation & transcription:
- WhisperX substantially outperforms both Whisper and wav2vec2.0 in word-segmentation precision and recall (on AMI and SWB), and is better than or comparable to them on WER.
- On long-form datasets (TEDLIUM-3, Kincaid46), WhisperX achieves lower WER than Whisper, and much faster transcription speed when using batched inference (≈11-12× faster).
- Effect of VAD Cut & Merge:
- Pre-segmenting with VAD + cut & merge improves both transcription quality (lower WER) and word segmentation metrics (better precision & recall) compared to no preprocessing.
- Also dramatically improves speed: many segments can be batched together instead of processing sliding windows or the full long recording sequentially. The reported figure is a roughly twelve-fold speedup with batched inference when using VAD-CM (cut & merge) versus the baseline.
- Without VAD chunking, batched inference degrades WER due to boundary effects. Overlap windows don’t fully solve this.
- Hallucination & Repetition:
- WhisperX shows lower insertion error rates and fewer repeated 5-gram duplicates compared to Whisper (particularly on Kincaid46 & TEDLIUM benchmarks), indicating fewer hallucinations and repetition artifacts. The VAD cut & merge helps here, by eliminating long non-speech portions and managing segment boundaries more cleanly.
- wav2vec2.0, while worse in WER and segmentation, tends to be less prone to repetition than Whisper / WhisperX, but overall the trade-offs favour WhisperX.
- Effect of Model Choices:
- Using larger Whisper models improves word segmentation performance (precision & recall).
- The phoneme model also matters: different phoneme recognition models (trained on different data) yield different alignment quality. For example, the model trained on the VoxPopuli corpus gives strong performance on AMI (probably due to domain similarity).
- There are diminishing returns: a very large phoneme model doesn’t always lead to consistent gains, suggesting that more alignment-specific training data might help more.
Strengths and Contributions
WhisperX’s main contributions are:
- A practical pipeline combining VAD, segment boundary adjustments, and phoneme alignment to get accurate word-level timestamps on long audio, which is useful for subtitling, diarisation, indexing, etc.
- A demonstration that preprocessing with VAD Cut & Merge allows batched inference over segments without losing quality — so significantly faster real-world operation.
- Empirical results showing improvements not just in speed but in transcription accuracy, word segmentation, and robustness (less repetition / hallucination).
- Open-sourcing the code, allowing others to build on the work.
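For a sense of how the released pipeline is used in practice, the call pattern below follows the open-source repository’s README around the time of publication; model names, argument names, and output fields may have changed since, so treat this as a sketch rather than a definitive API reference.

```python
import whisperx

device = "cuda"  # or "cpu"
audio = whisperx.load_audio("podcast_episode.wav")  # hypothetical input file

# 1. Transcribe with Whisper over VAD-derived chunks, using batched inference.
model = whisperx.load_model("large-v2", device)
result = model.transcribe(audio, batch_size=16)

# 2. Force-align the transcript with a phoneme model to get word-level timestamps.
align_model, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Each aligned segment now carries per-word start/end times.
for segment in result["segments"]:
    for word in segment.get("words", []):
        print(word["word"], word["start"], word["end"])
```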
Limitations & Future Work
The authors also note some limitations / open areas:
- The current setup uses a multi-stage pipeline: Whisper for transcript + external phoneme recognizer + forced alignment. A single model that directly produces accurate word-level timestamps (and handles long audio) would be more elegant. They mention this as future work.
- For translation mode (and for languages without a suitable phoneme alignment model), phoneme alignment isn’t possible or straightforward, so word-level timestamps for translated output are not handled.
- The alignment model’s quality depends on domain similarity and phoneme coverage; for some languages/domains less well represented, performance may be worse. Also, for words whose phonemes are missing in the phoneme model, fallback heuristics reduce precision.
- The extra alignment stage adds compute overhead, though the reported overhead is small (≈10%).
Implications / Applications
The improvements made by WhisperX have several useful consequences:
- Subtitling / captions / transcripts: more accurate timestamps means better alignment to video, and finer-grained control.
- Search / indexing across long audio/video content: word-level time information lets you jump to the exact location of a word (a small example follows this list).
- Diarisation / speaker segmentation: combined with speaker identification, word-level alignment helps attribute individual words to the right speaker.
- Media / content production: podcasts, lectures etc., where long segments are common; fast-throughput transcription with high accuracy helps in production workflows.
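As a tiny example of the indexing point above, once a transcript carries word-level timestamps (in a shape like the aligned output sketched earlier, which is assumed here), jumping to every occurrence of a query word takes only a few lines:

```python
def find_word(aligned_segments, query):
    """Return (start, end) times of every occurrence of `query` in an aligned transcript.

    aligned_segments: list of segments, each with a "words" list of
    {"word": str, "start": float, "end": float} entries (assumed layout).
    """
    norm = lambda w: w.lower().strip(".,!?")
    return [(w["start"], w["end"])
            for seg in aligned_segments
            for w in seg.get("words", [])
            if norm(w["word"]) == norm(query)]
```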
Numerical Highlights
Here are some concrete numbers to give sense of scale:
- On TEDLIUM:
- WhisperX with VAD-CM (cut & merge) achieves WER ≈ 9.7% vs Whisper’s ~10.5%.
- Speed: ~11.8× faster than baseline Whisper.
- On AMI:
- Word segmentation precision/recall with WhisperX: Precision ~84.1%, Recall ~60.3%. For SWB, ~93.2% / ~65.4%. These surpass earlier baselines.
- Without VAD, batched inference is fast (many segments per batch), but WER and word segmentation degrade heavily. With VAD-CM and τ = 30 s, they get a speedup of ~11.8× with similar or better WER.
Conclusion
WhisperX advances the state of speech transcription for long-form audio by providing a system that is both accurate (especially for word-level timestamps) and efficient (via VAD preprocessing and segment batching). It demonstrates that with smart segmentation (cut & merge) and external phoneme alignment, many of the issues of previous methods (drift, boundary artifacts, repetition/hallucinations) can be mitigated. While there remain challenges (especially for multilinguality, translated transcripts, domain mismatch, and moving toward a single monolithic model), WhisperX seems like a valuable practical advance.
🔹 Whisper vs WhisperX Comparison Chart
| Feature / Aspect | Whisper (Baseline) | WhisperX (Proposed) |
|---|---|---|
| Input handling | Trained on 30s audio; long audio requires sliding windows (slow & error-prone). | Uses VAD + cut & merge: splits/merges into ~30s chunks at natural pauses. |
| Segmentation | Sliding windows cause overlaps, drift, and boundary errors. | Segments align to speech activity; avoids cutting through words; improves accuracy. |
| Timestamp accuracy | Provides coarse segment-level timestamps; word-level unreliable. | Adds phoneme-level forced alignment → precise word-level timestamps. |
| Speed / Efficiency | Sequential processing of windows; slow for long audio. | Parallel batched inference across segments → ~12× faster. |
| Accuracy (WER) | Good on short clips, but degrades on long recordings (e.g. TEDLIUM-3 ~10.5%). | Improves WER on long-form audio (e.g. TEDLIUM-3 ~9.7%). |
| Word segmentation (AMI) | Lower precision/recall for word-level boundaries. | Higher precision/recall (e.g. AMI: P ~84%, R ~60%). |
| Hallucinations & repetition | More prone to inserting or repeating text in long audio. | Lower insertion errors; fewer repeated n-grams. |
| Multilingual support | Full multilingual support, including translation mode. | Multilingual possible, but word alignment fails in translation mode. |
| Architecture | Single ASR model. | Pipeline: VAD + Whisper + phoneme aligner. |
| Open-source availability | Whisper model weights & code. | WhisperX released as open-source pipeline (extends Whisper). |
👉 In short: WhisperX = Whisper + smart segmentation + alignment → faster, more accurate, with reliable timestamps.