Audio

https://huggingface.co/learn/audio-course
Use cases
- Audio classification
- Speech-to-Text
- Text-to-Speech
- Speaker Diarization: identifying who is speaking

Audio Data

File Formats
- determine how the audio signal is compressed (for example .wav, .flac, .mp3)
Sampling Rate
- aka Sampling frequency
- number of samples taken in one second
- measured in Hz
- CD-quality audio: 44,100 Hz or 44.1 kHz
- High-Res audio: 192 kHz
Nyquist Limit
- exactly half the sampling rate
- determines highest frequency that can be captured from the signal
- Human Hearing: 20 Hz - 20 kHz
- Human speech: 3.4 kHz to 8 kHz
- By the Nyquist limit, a sampling rate of 2 * 8 kHz = 16 kHz is sufficient for training speech models
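The Nyquist arithmetic above can be sketched in a few lines. The function names are made up for illustration.

```python
def nyquist_limit(sampling_rate_hz: float) -> float:
    """Highest frequency (Hz) that can be captured at the given sampling rate."""
    return sampling_rate_hz / 2


def min_sampling_rate(max_frequency_hz: float) -> float:
    """Smallest sampling rate (Hz) whose Nyquist limit reaches max_frequency_hz."""
    return 2 * max_frequency_hz


print(nyquist_limit(44_100))     # 22050.0 -> CD audio covers the range of human hearing
print(min_sampling_rate(8_000))  # 16000   -> enough for speech up to 8 kHz
```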
Resampling
- converting audio to a different sampling rate during preprocessing, so that every example matches the rate the model expects
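A minimal sketch of the idea, assuming plain linear interpolation. Real resamplers also apply a low-pass (anti-aliasing) filter before reducing the rate, which this toy version omits. `resample_linear` is a hypothetical helper, not a library function.

```python
def resample_linear(samples, orig_sr, target_sr):
    """Resample by linear interpolation (illustrative only: no anti-aliasing filter)."""
    if not samples:
        return []
    n_out = int(len(samples) * target_sr / orig_sr)
    out = []
    for i in range(n_out):
        pos = i * orig_sr / target_sr            # fractional position in the input
        left = int(pos)
        right = min(left + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


ramp = [0.0, 1.0, 2.0, 3.0]                                  # pretend this is 8 kHz audio
up = resample_linear(ramp, orig_sr=8_000, target_sr=16_000)
print(len(up))  # 8: twice as many samples for the same duration
```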
Amplitude
- Sound pressure level at any given instant
- measured in decibels (dB)
- Human Hearing: 0 dB - 120 dB; prolonged exposure above 70 dB can damage hearing
- Human Speech: 55 dB - 65 dB
- Rock concert: 125 dB
Digital Amplitude
- 0 dB is the loudest possible amplitude; all other amplitudes are negative
- anything below about -60 dB is generally inaudible
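A small sketch of the digital decibel scale, assuming samples are normalised so that full scale is 1.0. The function name is hypothetical.

```python
import math


def amplitude_to_db(amplitude: float, full_scale: float = 1.0) -> float:
    """Convert a linear sample amplitude to decibels relative to full scale."""
    if amplitude == 0:
        return float("-inf")                  # digital silence
    return 20 * math.log10(abs(amplitude) / full_scale)


print(amplitude_to_db(1.0))               # 0.0: the loudest representable value
print(round(amplitude_to_db(0.5), 2))     # about -6.02: halving amplitude costs ~6 dB
print(round(amplitude_to_db(0.001), 2))   # -60.0: roughly the threshold of audibility
```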
Bit Depth
- Precision of recording amplitude
- Integer values: 16-bit, 24-bit
- Floating point values: 32-bit
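To make the bit-depth precision concrete, here is a sketch that counts the amplitude steps available at each integer bit depth and converts a floating-point sample in [-1.0, 1.0] to a 16-bit integer. Scaling by 32767 is one common convention, assumed here.

```python
def quantization_levels(bit_depth: int) -> int:
    """Number of distinct amplitude values an integer sample can take."""
    return 2 ** bit_depth


def float_to_int16(sample: float) -> int:
    """Map a float sample in [-1.0, 1.0] to a signed 16-bit integer."""
    clipped = max(-1.0, min(1.0, sample))   # guard against out-of-range input
    return int(round(clipped * 32767))


print(quantization_levels(16))  # 65536
print(quantization_levels(24))  # 16777216
print(float_to_int16(1.0))      # 32767
```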
Streaming Mode
- Audio Datasets support streaming mode
- It yields examples one at a time as you iterate, which avoids downloading the whole dataset

Audio Representation

The librosa library can be used to load audio and plot these representations
Waveform
- Time Domain
- Amplitude vs Time
Frequency Spectrum
- Frequency Domain
- Amplitude vs Frequency
- computed using Discrete Fourier Transform (DFT)
- a snapshot: computed over one short segment of the signal, so it does not show how frequencies change over time
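A naive DFT makes the time-to-frequency step concrete. This is an O(n²) teaching sketch using only the standard library. Real code would use an FFT routine. The 64 Hz sampling rate and 8 Hz tone are made-up values chosen to keep the example small.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude of each DFT bin from 0 Hz up to the Nyquist limit."""
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):   # for real input, higher bins mirror these
        total = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(samples))
        magnitudes.append(abs(total))
    return magnitudes


sampling_rate = 64   # Hz (hypothetical)
tone_hz = 8
samples = [math.sin(2 * math.pi * tone_hz * t / sampling_rate)
           for t in range(sampling_rate)]   # one second of a pure tone

magnitudes = dft_magnitudes(samples)
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin * sampling_rate / len(samples)
print(peak_hz)  # 8.0: the spectrum peaks at the tone's frequency
```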
Spectrogram
- frequency content of an audio signal as it changes over time
- computed using Short Time Fourier Transform (STFT)
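The STFT can be sketched as "slice the signal into overlapping frames, window each frame, take a DFT of each". The version below is a slow standard-library illustration. The frame length, hop length and test signal are assumptions chosen for readability.

```python
import cmath
import math


def stft_magnitudes(samples, frame_length, hop_length):
    """Return one magnitude spectrum per frame (rows = time, columns = frequency)."""
    # A Hann window tapers the frame edges to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_length)
              for n in range(frame_length)]
    frames = []
    for start in range(0, len(samples) - frame_length + 1, hop_length):
        frame = [samples[start + n] * window[n] for n in range(frame_length)]
        spectrum = []
        for k in range(frame_length // 2 + 1):
            total = sum(x * cmath.exp(-2j * math.pi * k * n / frame_length)
                        for n, x in enumerate(frame))
            spectrum.append(abs(total))
        frames.append(spectrum)
    return frames


sampling_rate = 64
frame_length, hop_length = 32, 16
# One second of a 4 Hz tone followed by one second of a 12 Hz tone.
signal = [math.sin(2 * math.pi * (4 if t < sampling_rate else 12) * t / sampling_rate)
          for t in range(2 * sampling_rate)]

spectrogram = stft_magnitudes(signal, frame_length, hop_length)
peak_hz_per_frame = [
    max(range(len(row)), key=row.__getitem__) * sampling_rate / frame_length
    for row in spectrogram
]
print(peak_hz_per_frame[0], peak_hz_per_frame[-1])  # 4.0 12.0
```

Unlike a single spectrum, the per-frame peaks show the frequency content changing over time.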
Mel Spectrogram
- Variation of Spectrogram based on Mel scale
- The Mel scale is a perceptual scale that approximates the non-linear frequency response of the human ear
- Human auditory system is more sensitive to changes in lower frequencies than higher frequencies
- sensitivity decreases logarithmically as frequency increases
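Several Mel formulas are in use. The sketch below assumes the common HTK-style one, mel = 2595 * log10(1 + f / 700). Libraries may default to a different variant.

```python
import math


def hz_to_mel(frequency_hz: float) -> float:
    """HTK-style Hz -> Mel conversion (one of several formulas in use)."""
    return 2595 * math.log10(1 + frequency_hz / 700)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700 * (10 ** (mel / 2595) - 1)


# The same 100 Hz step spans far more Mels at low frequencies than at high ones.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(8100) - hz_to_mel(8000)
print(low_step > high_step)  # True
```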
Log-Mel Spectrogram
- uses dB (log-scale) for Amplitude

Speech-to-Text

aka STT, speech transcription, or Automatic Speech Recognition (ASR)

Speech Datasets

Features
- Number of hours
- Domain: where the data is sourced from
  - Audiobook, Wikipedia, Podcast, YouTube
- Speaking Style
  - Narrated: read from a script
    - It tends to be spoken articulately and without any errors
  - Spontaneous: un-scripted, conversational speech
    - It has more colloquial style of speech
    - includes repetitions, hesitations and false-starts
- Transcription style
  - It refers to whether the target text has punctuation, casing or both
Examples
- LibriSpeech: Audiobook: Narrated, 960 hours
- Common Voice 11: Wikipedia: Narrated, 3000 hours
- VoxPopuli: European Parliament: Oratory, 540 hours
- TED-LIUM: TED Talks: Oratory, 450 hours
- GigaSpeech: Audiobook, Podcast, YouTube: Narrated, Spontaneous, 10,000 hours

Connectionist Temporal Classification (CTC) Models

encoder-only models with a linear classification (CTC) head on top
A CTC model is essentially an “acoustic-only” model
Prone to phonetic spelling errors
- It might transcribe the audio phonetically, for example CHRISTMAUS instead of CHRISTMAS
Wav2Vec2 is pre-trained on 60,000 hours of unlabelled audio
Examples:
- Wav2Vec2
- HuBERT
- XLSR
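The notes do not spell out how the CTC head's per-frame predictions become text. A common greedy rule is: take the most likely label per frame, collapse consecutive repeats, then drop the blank token. A sketch, assuming `_` stands for the blank:

```python
from itertools import groupby


def ctc_greedy_decode(frame_labels, blank="_"):
    """Collapse repeated labels, then remove blanks."""
    collapsed = [label for label, _group in groupby(frame_labels)]
    return "".join(label for label in collapsed if label != blank)


# One predicted label per audio frame. The blank between the two L runs is what
# lets the double letter survive the collapse step.
print(ctc_greedy_decode("HH_EE_LL_LLL_OO"))  # HELLO
```

Because each frame is labelled from acoustics alone, nothing in this step checks spelling, which is where the phonetic errors mentioned above come from.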

Sequence-to-sequence (Seq2Seq) Models

encoder-decoder models, with a cross-attention mechanism between the encoder and decoder
the encoder plays the same role as in a CTC model, while the decoder acts as a language model and can correct spelling mistakes on the fly
They are slower at decoding
They require much more data for training
Whisper was released in 2022 by OpenAI
- Whisper is pre-trained on a vast quantity of labelled audio-transcription data, 680,000 hours to be precise
- 117,000 hours of this pre-training data is multilingual data (non-english)
- Over 96 languages
- ~3% word error rate (WER) on the LibriSpeech test-clean subset
- able to handle long-form audio samples
- robust to input noise
- able to predict cased and punctuated transcriptions
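WER is the word-level edit distance between reference and hypothesis (substitutions + deletions + insertions) divided by the number of reference words. A minimal sketch with made-up sentences, with no text normalisation applied:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))   # distances from an empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```

Casing and punctuation count as differences here, so the transcription style of the dataset affects the score.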

Audio Classification

Any encoder-only audio transformer model can be turned into an audio classifier by adding a classification layer on top of the sequence of hidden states
Keyword Spotting (KWS)
- Intent classification
  - Dataset: MINDS-14
    - Recordings of people asking an e-banking system questions in several languages and dialects
  - Model: XLS-R model fine-tuned on MINDS-14
- Identifying Speech Commands
  - Dataset: Speech Commands
    - 15 classes for keywords
  - Model: Audio Spectrogram Transformer (AST) fine-tuned on Speech Commands
Language Identification (LID)
- Dataset: FLEURS
  - Few-shot Learning Evaluation of Universal Representations of Speech
  - By Google
  - Evaluate speech recognition systems in 102 languages
- Model: Whisper fine-tuned on FLEURS
Music Genre Classification
- Dataset: GTZAN
  - 1000 songs for music genre
- Model: HuBERT fine-tuned on GTZAN
Zero-Shot Audio Classification
- Given an audio clip and a set of candidate labels, the model picks the best-matching label without having been trained on those classes
- Dataset: Environmental Sound Classification (ESC)
  - 2000 environmental audio recordings
- Model: CLAP
  - embeds the audio and each text label, then computes their similarity
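The similarity step can be sketched with plain cosine similarity followed by a softmax. The 3-dimensional embeddings below are invented for illustration. A real model produces much larger vectors and typically scales the similarities before the softmax.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    peak = max(scores)                            # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, label_embeddings):
    """Score every candidate label against the audio embedding."""
    labels = list(label_embeddings)
    sims = [cosine_similarity(audio_embedding, label_embeddings[l]) for l in labels]
    return dict(zip(labels, softmax(sims)))


# Hypothetical embeddings standing in for the model's audio and text encoders.
audio_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "dog barking": [1.0, 0.0, 0.1],
    "vacuum cleaner": [0.0, 1.0, 0.3],
}
scores = zero_shot_classify(audio_embedding, label_embeddings)
best = max(scores, key=scores.get)
print(best)  # dog barking
```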

Text-to-Speech

aka TTS or Speech Synthesis
It is a one-to-many problem
The same text can be synthesized in a variety of speaking styles

TTS Datasets

Dataset should contain
- Diverse and representative speech samples from multiple speakers
- Cover a wide range of speech patterns, accents, languages, and emotions
- Contain different types of sentences, phrases, and words
- Cover various topics, genres, and domains to ensure the model’s ability to handle different linguistic contexts.
Collecting this data is an expensive task
Speech-to-Text datasets are not very useful for TTS since they often contain background noise
Examples
- LJSpeech: 13,100 English-language audio clips of a single speaker reading from 7 non-fiction books, paired with their corresponding transcriptions
- Multilingual LibriSpeech: LibriSpeech with multiple languages
- Voice Cloning Toolkit (VCTK):
  - 110 English speakers with various accents
  - Each speaker reads out about 400 sentences
- LibriTTS / LibriTTS-R: multi-speaker English corpus of approximately 585 hours

SpeechT5 model

based on regular Transformer encoder-decoder model
published by Microsoft
can be tailored for Text-to-Speech, Speech-to-Text, Speaker Identification, Speech-to-Speech
It uses so-called speaker embeddings that capture a particular speaker’s voice characteristics.

Bark model

based on Transformer
published by Suno AI
highly-controllable with various settings
codebooks are representations or embeddings of the audio in integer form
made of 4 models
- BarkSemanticModel: causal auto-regressive transformer
  - Text model
  - Text → Semantic tokens
- BarkCoarseModel: causal auto-regressive transformer
  - Coarse Acoustics model
  - Semantic tokens → first audio codebooks
- BarkFineModel: non-causal auto-encoder transformer
  - Fine Acoustics model
  - Predicts the last codebooks based on the sum of the previous codebooks' embeddings
- EncodecModel: decodes the predicted codebooks into the output audio array

Massive Multilingual Speech (MMS) model

based on VITS, a conditional variational auto-encoder
supports over 1,100 languages

Speech-to-Speech

aka STS
Use cases
- Speech Translation
- Voice assistants such as Alexa and Siri

Speech Translation

aka S2ST
It involves translating speech from one language into speech in a different language
3-stage approach used by Google Translate before:
- STT: Speech in X → Text in X
- MT (Machine Translation): Text in X → Text in Y
- TTS: Text in Y → Speech in Y
Errors can compound from one stage to the next, so the 3-stage approach tends to be more error-prone than the 2-stage approach
2-stage approach
- Speech Translation (Speech in X → Text in Y)
  - Example: Whisper by OpenAI
- TTS (Text in Y → Speech in Y)
  - Example: SpeechT5 by Microsoft

Voice Assistant

Wake word detection
- Start Assistant with Trigger word
- use Audio classification model
Speech Transcription
- use ASR model on device rather than on cloud for faster inference
Language Model Query
- use LLM
- since query is likely small, we can send to cloud LLM to get the response
Synthesize Speech
- use on device or cloud model

Speaker Diarization

task of taking an unlabelled audio input and predicting “who spoke when”

Transcription of Meeting

ASR models can predict text as well as the timestamp of multiple segments
- Example: whisper-base

timestamp: [0.0, 3.56],    text: The second and importance is as follows.
timestamp: [3.56, 7.84],   text: Sovereignty may be defined...
timestamp: [7.84, 13.88],  text: In France, the king really exercises...
timestamp: [13.88, 15.48], text: no weight.
timestamp: [15.48, 19.44], text: He was in a favored state of mind...
timestamp: [19.44, 21.28], text: cast upon his entire future.

Diarization models predict only the timestamps at which each speaker is talking, not the text
- Example: speaker-diarization from pyannote

segment: [0.49, 14.52],  track: B, label: SPEAKER_01
segment: [15.36, 21.37], track: A, label: SPEAKER_00

We can combine both models to generate a speaker-attributed transcription of the audio

SPEAKER_01 (0.0, 15.5) The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.

SPEAKER_00 (15.5, 21.3) He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.
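One simple way to combine the two outputs is to give each ASR chunk to the speaker whose segment overlaps it most, then merge consecutive chunks from the same speaker. This is an illustrative heuristic, not necessarily the alignment used to produce the transcript above. The timestamps and (truncated) texts are copied from the examples in this section.

```python
def overlap(a, b):
    """Length in seconds of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def assign_speakers(asr_chunks, speaker_segments):
    """Label each ASR chunk with the best-overlapping speaker and merge runs."""
    merged = []
    for span, text in asr_chunks:
        speaker = max(speaker_segments, key=lambda seg: overlap(span, seg[0]))[1]
        if merged and merged[-1]["speaker"] == speaker:
            merged[-1]["end"] = span[1]            # extend the current turn
            merged[-1]["text"] += " " + text
        else:
            merged.append({"speaker": speaker, "start": span[0],
                           "end": span[1], "text": text})
    return merged


asr_chunks = [
    ((0.0, 3.56), "The second and importance is as follows."),
    ((3.56, 7.84), "Sovereignty may be defined..."),
    ((7.84, 13.88), "In France, the king really exercises..."),
    ((13.88, 15.48), "no weight."),
    ((15.48, 19.44), "He was in a favored state of mind..."),
    ((19.44, 21.28), "cast upon his entire future."),
]
speaker_segments = [
    ((0.49, 14.52), "SPEAKER_01"),
    ((15.36, 21.37), "SPEAKER_00"),
]

turns = assign_speakers(asr_chunks, speaker_segments)
for turn in turns:
    print(turn["speaker"], (turn["start"], turn["end"]), turn["text"])
```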
