encoder-only models with a linear classification (CTC) head on top
A CTC model is essentially an “acoustic-only” model
Prone to phonetic spelling errors
It may transcribe the audio phonetically, for example CHRISTMAUS instead of CHRISTMAS
Wav2Vec 2.0 is pre-trained on 60,000 hours of unlabelled audio
Examples:
Wav2Vec2
HuBERT
XLSR
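A minimal sketch of greedy CTC decoding, to make the "acoustic-only" point concrete. The per-frame labels and the `_` blank symbol are made up for illustration and are not the output of a real model. The head picks one character per audio frame, repeated characters are merged, and blanks are dropped. No language model is consulted, so a phonetic spelling passes straight through.

```python
from itertools import groupby

BLANK = "_"  # assumed blank symbol; real CTC vocabularies define their own


def ctc_greedy_collapse(frame_labels):
    """Collapse per-frame CTC labels: merge repeats first, then drop blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return "".join(label for label in merged if label != BLANK)


# Hypothetical per-frame argmax labels for one short utterance.
frames = list("CC_HH_RR_I_SS_TT_MM_AA_U_SS")
decoded = ctc_greedy_collapse(frames)
print(decoded)
```

A blank between two identical characters is what lets a genuine double letter survive the merge step, which is why `HEL_LO` decodes with two Ls while `HELLO` without a blank does not.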
Sequence-to-sequence (Seq2Seq) Models
encoder-decoder models, with a cross-attention mechanism between the encoder and decoder
the encoder plays the same role as in a CTC model, and the decoder acts as a language model that corrects spelling mistakes on the fly
They are slower at decoding, because the decoder generates the transcription one token at a time
They require much more training data
Whisper was released in 2022 by OpenAI
Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise
117,000 hours of this pre-training data is multilingual (non-English)
Covers over 96 languages
~3% word error rate (WER) on the LibriSpeech test-clean subset
able to handle long-form audio samples
robust to input noise
able to predict cased and punctuated transcriptions
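For reference, word error rate is the number of word substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the number of reference words. The sketch below computes it with a word-level edit distance in plain Python. The function name and the example sentences are illustrative, and real evaluations usually normalise casing and punctuation first.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of five reference words.
wer = word_error_rate("merry christmas to you all", "merry christmaus to you all")
print(wer)
```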
Audio Classification
Any encoder-only audio transformer model can be turned into an audio classifier by adding a classification layer on top of the sequence of hidden states
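A toy sketch of what "a classification layer on top of the sequence of hidden states" can look like. It assumes the hidden states are mean-pooled over time before a single linear layer, which is one common choice. The numbers, weights, and helper names are made up for illustration.

```python
import math


def mean_pool(hidden_states):
    """Average a (time, hidden) sequence into a single hidden-size vector."""
    n_frames = len(hidden_states)
    return [sum(column) / n_frames for column in zip(*hidden_states)]


def linear(vector, weights, bias):
    """One linear layer: a logit per class."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical encoder output: 3 frames, hidden size 2.
hidden_states = [[1.0, 0.0], [3.0, 2.0], [2.0, 1.0]]
# Hypothetical classification layer for 2 classes.
weights = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]

pooled = mean_pool(hidden_states)
probs = softmax(linear(pooled, weights, bias))
predicted = max(range(len(probs)), key=probs.__getitem__)
print(pooled, probs, predicted)
```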
Keyword Spotting (KWS)
Intent classification
Dataset: MINDS-14
Recordings of people asking an e-banking system questions in several languages and dialects
Model: XLS-R fine-tuned on MINDS-14
Identifying Speech Commands
Dataset: Speech Commands
15 keyword classes
Model: Audio Spectrogram Transformer (AST) fine-tuned on Speech Commands
Language Identification (LID)
Dataset: FLEURS
Few-shot Learning Evaluation of Universal Representations of Speech
By Google
Used to evaluate speech recognition systems in 102 languages
Model: Whisper fine-tuned on FLEURS
Music Genre Classification
Dataset: GTZAN
1,000 thirty-second clips across 10 music genres
Model: HuBERT fine-tuned on GTZAN
Zero-Shot Audio Classification
Given an audio clip and a set of candidate labels, the model predicts the best-matching label
Dataset: Environmental Sound Classification (ESC)
2,000 environmental audio recordings
Model: CLAP
takes audio and text inputs and computes the similarity between them
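The similarity step can be sketched as follows. This assumes the model has already produced one embedding for the audio clip and one for each candidate label. The three-dimensional vectors and the label strings are made-up stand-ins, not real CLAP outputs. The predicted label is the one whose text embedding has the highest cosine similarity to the audio embedding.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_classify(audio_embedding, label_embeddings):
    """Return the best-matching label and the similarity score of every label."""
    scores = {label: cosine_similarity(audio_embedding, embedding)
              for label, embedding in label_embeddings.items()}
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Hypothetical embeddings standing in for the audio and text encoder outputs.
audio_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "Sound of a dog": [1.0, 0.0, 0.0],
    "Sound of vacuum cleaner": [0.0, 1.0, 0.0],
}

best_label, scores = zero_shot_classify(audio_embedding, label_embeddings)
print(best_label)
```

Because the labels are just text, the candidate set can be changed at inference time without retraining.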
Text-to-Speech
aka TTS or Speech Synthesis
It is a one-to-many problem
The same text can be synthesized in a variety of speaking styles
TTS Datasets
A good TTS dataset should
Contain diverse and representative speech samples from multiple speakers
Cover a wide range of speech patterns, accents, languages, and emotions
Contain different types of sentences, phrases, and words
Cover various topics, genres, and domains so the model can handle different linguistic contexts
Collecting such data is expensive
Datasets built for speech recognition are usually a poor fit for TTS because they often contain background noise
Examples
LJSpeech: 13,100 English audio clips of a single speaker reading passages from 7 non-fiction books, paired with their transcriptions
Multilingual LibriSpeech: a multilingual extension of LibriSpeech
Voice Cloning Toolkit (VCTK)
110 English speakers with various accents
Each speaker reads out about 400 sentences
LibriTTS / LibriTTS-R: multi-speaker English corpus of approximately 585 hours
SpeechT5 model
based on a regular Transformer encoder-decoder model
published by Microsoft
can be tailored for Text-to-Speech, Speech-to-Text, Speaker Identification, and Speech-to-Speech
It uses so-called speaker embeddings that capture a particular speaker’s voice characteristics.
Bark model
Transformer-based
published by Suno AI
highly controllable, with various settings
codebooks are integer representations (embeddings) of the audio
The last codebooks are predicted from the sum of the previous codebooks' embeddings
EncodecModel: decodes the predicted codebooks into the output audio array
Massive Multilingual Speech (MMS) model
based on VITS, a conditional variational autoencoder
supports over 1,100 languages
Speech-to-Speech
aka STS
Use cases
Speech Translation
Voice assistants such as Alexa and Siri
Speech Translation
aka S2ST
It involves translating speech from one language into speech in a different language
3-stage approach, previously used by Google Translate:
STT: Speech in X → Text in X
MT (Machine Translation): Text in X → Text in Y
TTS: Text in Y → Speech in Y
The 3-stage approach can accumulate more errors than the 2-stage approach, because each stage's mistakes propagate to the next
2-stage approach
Speech Translation (Speech in X → Text in Y)
Example: Whisper by OpenAI
TTS (Text in Y → Speech in Y)
Example: SpeechT5 by Microsoft
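A back-of-the-envelope illustration of why fewer stages can mean fewer errors. It assumes, purely for the sake of the arithmetic, that every stage handles an utterance correctly with the same probability and that stages fail independently. The 0.9 figure is invented and is not a measurement of any real system.

```python
import math


def cascade_success_rate(stage_success_rates):
    """Probability that every stage succeeds, assuming independent failures."""
    return math.prod(stage_success_rates)


# Hypothetical per-stage success rates.
three_stage = cascade_success_rate([0.9, 0.9, 0.9])  # STT -> MT -> TTS
two_stage = cascade_success_rate([0.9, 0.9])         # speech translation -> TTS
print(three_stage, two_stage)
```

Under these assumptions the 3-stage cascade succeeds roughly 73% of the time against roughly 81% for the 2-stage one. In practice the stages are not equally accurate, so this only shows the direction of the effect.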
Voice Assistant
Wake word detection
Start the assistant with a trigger word
use an audio classification model
Speech Transcription
use an on-device ASR model rather than a cloud one for faster inference
Language Model Query
use an LLM
since the query is likely small, we can send it to a cloud-hosted LLM to get the response
Synthesize Speech
use an on-device or cloud TTS model
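The four stages above can be wired together roughly as follows. This is only a structural sketch: each stage is passed in as a plain callable, and the stand-ins used in the example (strings in place of audio, an echoing "LLM") are hypothetical placeholders for real models. It also assumes the chunk right after the wake word contains the whole query.

```python
def run_assistant(audio_chunks, detect_wake_word, transcribe, query_llm, synthesize):
    """Run wake word detection, ASR, LLM query, and TTS in sequence."""
    chunks = iter(audio_chunks)

    # 1. Wake word detection: a small classifier checks every incoming chunk.
    for chunk in chunks:
        if detect_wake_word(chunk):
            break
    else:
        return None  # the stream ended without the trigger word

    # 2. Speech transcription: assume the next chunk holds the spoken query.
    query_audio = next(chunks, None)
    if query_audio is None:
        return None
    query_text = transcribe(query_audio)

    # 3. Language model query: small text payload, so it could go to the cloud.
    reply_text = query_llm(query_text)

    # 4. Synthesize speech from the reply.
    return synthesize(reply_text)


# Hypothetical stand-ins: strings play the role of audio chunks.
response = run_assistant(
    ["background noise", "hey assistant", "what time is it"],
    detect_wake_word=lambda chunk: chunk == "hey assistant",
    transcribe=lambda audio: audio,
    query_llm=lambda text: "echo: " + text,
    synthesize=lambda text: ("audio", text),
)
print(response)
```

Passing the stages in as arguments keeps the on-device versus cloud decision for each stage outside the control flow.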
Speaker Diarization
The task of taking an unlabelled audio input and predicting “who spoke when”
Transcription of a Meeting
ASR models can predict the text as well as the timestamps of multiple segments
Example: whisper-base
timestamp: [0.0, 3.56], text: The second and importance is as follows.
timestamp: [3.56, 7.84], text: Sovereignty may be defined...
timestamp: [7.84, 13.88], text: In France, the king really exercises...
timestamp: [13.88, 15.48], text: no weight.
timestamp: [15.48, 19.44], text: He was in a favored state of mind...
timestamp: [19.44, 21.28], text: cast upon his entire future.
Diarization models predict only the timestamps for each speaker, not the text
We can combine the two models to generate a speaker-attributed transcription of the audio
SPEAKER_01 (0.0, 15.5) The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.
SPEAKER_00 (15.5, 21.3) He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.
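One simple way to combine the two outputs is to assign each ASR chunk to the speaker segment it overlaps most in time, then merge consecutive chunks from the same speaker. The sketch below uses the timestamps and (truncated) texts from the notes above. The dictionary layout and function names are assumptions made for illustration, and this heuristic is not necessarily how any particular library aligns the two.

```python
def overlap(span_a, span_b):
    """Length in seconds of the intersection of two (start, end) spans."""
    return max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def assign_speakers(asr_chunks, speaker_segments):
    """Label each ASR chunk with its best-overlapping speaker and merge runs."""
    merged = []
    for chunk in asr_chunks:
        start, end = chunk["timestamp"]
        best = max(speaker_segments,
                   key=lambda seg: overlap((start, end), seg["span"]))
        if merged and merged[-1]["speaker"] == best["speaker"]:
            # Same speaker as the previous chunk: extend that entry.
            merged[-1]["end"] = end
            merged[-1]["text"] += " " + chunk["text"]
        else:
            merged.append({"speaker": best["speaker"], "start": start,
                           "end": end, "text": chunk["text"]})
    return merged


asr_chunks = [
    {"timestamp": (0.0, 3.56), "text": "The second and importance is as follows."},
    {"timestamp": (3.56, 7.84), "text": "Sovereignty may be defined..."},
    {"timestamp": (7.84, 13.88), "text": "In France, the king really exercises..."},
    {"timestamp": (13.88, 15.48), "text": "no weight."},
    {"timestamp": (15.48, 19.44), "text": "He was in a favored state of mind..."},
    {"timestamp": (19.44, 21.28), "text": "cast upon his entire future."},
]
speaker_segments = [
    {"speaker": "SPEAKER_01", "span": (0.0, 15.5)},
    {"speaker": "SPEAKER_00", "span": (15.5, 21.3)},
]

transcript = assign_speakers(asr_chunks, speaker_segments)
for turn in transcript:
    print(turn["speaker"], (turn["start"], turn["end"]), turn["text"])
```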