Audio

Audio Data

  • File Formats
    • determine compression of audio signal
  • Sampling Rate
    • aka Sampling frequency
    • number of samples taken in one second
    • measured in Hz
    • CD-quality audio: 44,100 Hz or 44.1 kHz
    • High-Res audio: 192 kHz
  • Nyquist Limit
    • exactly half the sampling rate
    • determines highest frequency that can be captured from the signal
    • Human Hearing: 20 Hz - 20 kHz
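    • Human speech: most frequency content lies below 8 kHz (narrowband telephony keeps only up to about 3.4 kHz)
    • By the Nyquist limit, speech models are typically trained on 16 kHz audio: 2 * 8 kHz = 16 kHz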
  • Resampling
    • process of converting audio to a different sampling rate, e.g. during preprocessing so that all examples match the rate the model expects
  • Amplitude
    • Sound pressure level at any given instant
    • measured in decibels (dB)
    • Human Hearing: 0 dB - 120 dB; prolonged exposure above 70 dB can damage hearing
    • Human Speech: 55 dB - 65 dB
    • Rock concert: 125 dB
  • Digital Amplitude
    • 0 dB is the loudest possible amplitude; all other values are negative
    • below about -60 dB is generally inaudible
  • Bit Depth
    • Precision of recording amplitude
    • Integer values: 16-bit, 24-bit
    • Floating point values: 32-bit
  • Streaming Mode
    • Audio datasets support streaming mode
    • Examples are loaded one at a time while iterating, which avoids downloading the whole dataset
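
The Nyquist, digital amplitude and bit depth notes above can be checked with a few lines of arithmetic. This is a minimal sketch using only the standard library; the function names are illustrative and do not come from any audio library.

```python
import math


def nyquist_limit(sampling_rate_hz):
    """Highest frequency that can be captured at this sampling rate."""
    return sampling_rate_hz / 2


def min_sampling_rate(max_frequency_hz):
    """Lowest sampling rate whose Nyquist limit reaches max_frequency_hz."""
    return 2 * max_frequency_hz


def amplitude_to_db(amplitude, full_scale=1.0):
    """Digital amplitude in dB relative to full scale (0 dB is the loudest)."""
    if amplitude == 0:
        return float("-inf")  # digital silence
    return 20 * math.log10(abs(amplitude) / full_scale)


def quantization_levels(bit_depth):
    """Number of distinct integer amplitude values at this bit depth."""
    return 2 ** bit_depth


print(nyquist_limit(44_100))       # 22050.0, above the 20 kHz limit of hearing
print(min_sampling_rate(8_000))    # 16000, the usual rate for speech models
print(amplitude_to_db(1.0))        # 0.0, full scale
print(round(amplitude_to_db(0.001), 1))  # -60.0, roughly the inaudible threshold
print(quantization_levels(16))     # 65536
```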

Audio Representation

  • The librosa library can be used to plot audio
  • Waveform
    • Time Domain
    • Amplitude vs Time
  • Frequency Spectrum
    • Frequency Domain
    • Amplitude vs Frequency
    • computed using Discrete Fourier Transform (DFT)
    • a snapshot of one short segment of the signal; it has no time axis
  • Spectrogram
    • frequency content of an audio signal as it changes over time
    • computed using Short Time Fourier Transform (STFT)
  • Mel Spectrogram
    • Variation of Spectrogram based on Mel scale
    • The Mel scale is a perceptual scale that approximates the non-linear frequency response of the human ear
    • Human auditory system is more sensitive to changes in lower frequencies than higher frequencies
    • sensitivity decreases logarithmically as frequency increases
  • Log-Mel Spectrogram
    • uses dB (log-scale) for Amplitude
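
To make the DFT, STFT and Mel scale notes above concrete, here is a small standard-library sketch. It assumes a naive O(N^2) DFT, a rectangular window with no overlap-add, and the HTK-style Mel formula (one common definition; librosa's default differs). In practice a library such as librosa would do this work; the function names here are illustrative.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Naive DFT of one frame; returns magnitudes for bins 0..N/2."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]


def spectrogram(signal, frame_size, hop):
    """STFT sketch: one frequency spectrum per frame, stacked over time."""
    return [
        dft_magnitudes(signal[start:start + frame_size])
        for start in range(0, len(signal) - frame_size + 1, hop)
    ]


def hz_to_mel(freq_hz):
    """HTK-style Hz to Mel conversion (assumed variant)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


sampling_rate = 8000
frame_size = 64
# A pure 1 kHz tone, 256 samples long.
signal = [math.sin(2 * math.pi * 1000 * t / sampling_rate) for t in range(256)]

frames = spectrogram(signal, frame_size, hop=32)
first = frames[0]
peak_bin = max(range(len(first)), key=first.__getitem__)
peak_hz = peak_bin * sampling_rate / frame_size
print(len(frames), peak_hz)  # 7 1000.0

# The same 100 Hz step covers far more Mels at low frequencies than at high ones.
low_step = hz_to_mel(200) - hz_to_mel(100)
high_step = hz_to_mel(8100) - hz_to_mel(8000)
print(low_step > high_step)  # True
```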

Speech-to-Text

  • aka STT or Automatic Speech Recognition (ASR)

Speech Datasets

  • Features
    • Number of hours
    • Domain: where the data is sourced from
      • Audiobook, Wikipedia, Podcast, YouTube
    • Speaking Style
      • Narrated: read from a script
        • It tends to be spoken articulately and without any errors
      • Spontaneous: un-scripted, conversational speech
        • It has a more colloquial style of speech
        • includes repetitions, hesitations and false-starts
    • Transcription style
      • It refers to whether the target text has punctuation, casing or both
  • Examples
    • LibriSpeech: Audiobook: Narrated, 960 hours
    • Common Voice 11: Wikipedia: Narrated, 3000 hours
    • VoxPopuli: European Parliament: Oratory, 540 hours
    • TED-LIUM: TED Talks: Oratory, 450 hours
    • GigaSpeech: Audiobook, Podcast, YouTube: Narrated, Spontaneous, 10,000 hours

Connectionist Temporal Classification (CTC) Models

  • encoder-only models with a linear classification (CTC) head on top
  • A CTC model is essentially an “acoustic-only” model
  • Prone to phonetic spelling errors
    • It might transcribe the audio phonetically, for example CHRISTMAUS instead of CHRISTMAS
  • Wav2Vec2 (wav2vec 2.0) is pre-trained on 60,000 hours of unlabelled audio
  • Examples:
    • Wav2Vec2
    • HuBERT
    • XLSR
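
A CTC head emits one token per audio frame, so its raw output has to be collapsed into text. The sketch below shows greedy CTC decoding: merge consecutive repeats, then drop the blank token. The "_" blank symbol and the frame sequences are made-up examples, not output from a real model.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real tokenizers use their own token


def ctc_greedy_decode(frame_predictions, blank=BLANK):
    """Collapse consecutive duplicate tokens, then remove blanks."""
    collapsed = [token for token, _ in groupby(frame_predictions)]
    return "".join(token for token in collapsed if token != blank)


# One predicted character per frame (illustrative values).
print(ctc_greedy_decode(list("CC_HH_RRI_SS_T_MM_AA_SS")))  # CHRISTMAS
# A blank between repeated letters keeps a genuine double letter.
print(ctc_greedy_decode(list("HE_LL_LLO")))  # HELLO
```

Because each frame is classified from acoustics alone, nothing in this decoding step checks spelling, which is why phonetic errors can slip through.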

Sequence-to-sequence (Seq2Seq) Models

  • encoder-decoder models, with a cross-attention mechanism between the encoder and decoder
  • The encoder plays the same role as in a CTC model; the decoder acts as a language model and corrects spelling mistakes on the fly
  • They are slower at decoding
  • They require much more data for training
  • Whisper was released in 2022 by OpenAI
    • Whisper is pre-trained on 680,000 hours of labelled audio-transcription data
    • 117,000 hours of this pre-training data is multilingual (non-English)
    • Covers over 96 languages
    • ~3% word error rate (WER) on the LibriSpeech test-clean subset
    • able to handle long-form audio samples
    • robust to input noise
    • able to predict cased and punctuated transcriptions
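
Word error rate, mentioned above, is the word-level edit distance between the reference and the predicted transcript, divided by the number of reference words. Below is a minimal sketch with no text normalisation; evaluation libraries usually normalise casing and punctuation first. The sentences are made-up examples.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(round(wer, 3))  # 0.333
```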

Audio Classification

  • Any encoder-only audio transformer model can be turned into an audio classifier by adding a classification layer on top of the sequence of hidden states
  • Keyword Spotting (KWS)
    • Intent classification
      • Dataset: MINDS-14
        • Recordings of people asking an e-banking system questions in several languages and dialects
      • Model: XLS-R model fine-tuned on MINDS-14
    • Identifying Speech Commands
      • Dataset: Speech Commands
        • 15 classes for keywords
      • Model: Audio Spectrogram Transformer (AST) fine-tuned on Speech Commands
  • Language Identification (LID)
    • Dataset: FLEURS
      • Few-shot Learning Evaluation of Universal Representations of Speech
      • By Google
      • Evaluate speech recognition systems in 102 languages
    • Model: Whisper fine-tuned on FLEURS
  • Music Genre Classification
    • Dataset: GTZAN
      • 1000 songs for music genre
    • Model: HuBERT fine-tuned on GTZAN
  • Zero-Shot Audio Classification
    • We give the model audio and a set of candidate labels; it picks the best-matching label without having been trained on those classes
    • Dataset: Environmental Sound Classification (ESC)
      • 2000 environmental audio recordings
    • Model: CLAP
      • takes audio and text and computes the similarity between them
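
Zero-shot classification as described above amounts to comparing one audio embedding with one text embedding per candidate label. The sketch below assumes cosine similarity followed by a softmax; the 3-dimensional embeddings, the labels and the `scale` factor are hypothetical stand-ins for what a model such as CLAP would produce.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores):
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(audio_embedding, label_embeddings, scale=10.0):
    """Return a probability per candidate label from scaled similarities."""
    labels = list(label_embeddings)
    scores = [scale * cosine_similarity(audio_embedding, label_embeddings[label])
              for label in labels]
    return dict(zip(labels, softmax(scores)))


# Made-up embeddings for illustration only.
audio_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "Sound of a dog": [1.0, 0.0, 0.1],
    "Sound of vacuum cleaner": [0.0, 1.0, 0.3],
}

probabilities = zero_shot_classify(audio_embedding, label_embeddings)
best_label = max(probabilities, key=probabilities.get)
print(best_label)  # Sound of a dog
```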

Text-to-Speech

  • aka TTS or Speech Synthesis
  • It is a one-to-many problem
  • The same text can be synthesized in a variety of speaking styles

TTS Datasets

  • Dataset should contain
    • Diverse and representative speech samples from multiple speakers
    • Cover a wide range of speech patterns, accents, languages, and emotions
    • Contain different types of sentences, phrases, and words
    • Cover various topics, genres, and domains to ensure the model’s ability to handle different linguistic contexts.
  • Collecting such data is expensive
  • Speech recognition (ASR) datasets are not well suited to TTS because they often contain background noise
  • Examples
    • LJSpeech: 13,100 English-language audio clips of a single speaker reading passages from 7 non-fiction books, paired with their transcriptions
    • Multilingual LibriSpeech: LibriSpeech with multiple languages
    • Voice Cloning Toolkit (VCTK):
      • 110 English speakers with various accents
      • Each speaker reads out about 400 sentences
    • LibriTTS / LibriTTS-R: multi-speaker English corpus of approximately 585 hours

SpeechT5 model

  • based on regular Transformer encoder-decoder model
  • published by Microsoft
  • can be tailored for Text-to-Speech, Speech-to-Text, Speaker Identification, Speech-to-Speech
  • It uses so-called speaker embeddings that capture a particular speaker’s voice characteristics.

Bark model

  • based on Transformer
  • published by Suno AI
  • highly-controllable with various settings
  • codebooks are representations or embeddings of the audio in integer form
  • made of 4 models
    • BarkSemanticModel: causal auto-regressive transformer
      • Text model
      • Text → Semantic text tokens
    • BarkCoarseModel: causal auto-regressive transformer
      • Coarse Acoustics model
      • Semantic text tokens → Audio codebooks
    • BarkFineModel: non-causal auto-encoder transformer
      • Fine Acoustics model
      • Predicts the last codebooks from the sum of the previous codebook embeddings
    • EncodecModel: decodes the output audio array from the predicted codebooks

Massive Multilingual Speech (MMS) model

  • based on conditional variational auto-encoder (VITS)
  • supports over 1,100 languages

Speech-to-Speech

  • aka STS
  • Use cases
    • Speech Translation
    • Voice assistants such as Alexa and Siri

Speech Translation

  • aka S2ST
  • It involves translating speech from one language into speech in a different language
  • 3-stage (cascaded) approach, used by Google Translate before:
    • STT: Speech in X → Text in X
    • MT (Machine Translation): Text in X → Text in Y
    • TTS: Text in Y → Speech in Y
  • The 3-stage approach can accumulate more errors than the 2-stage one, because each stage passes its mistakes on to the next
  • 2-stage approach
    • Speech Translation (Speech in X → Text in Y)
      • Example: Whisper by OpenAI
    • TTS (Text in Y → Speech in Y)
      • Example: SpeechT5 by Microsoft
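
A toy calculation shows why a longer cascade tends to make more errors. It assumes, purely for illustration, that every stage handles an utterance correctly with the same probability and that errors are independent; the 0.9 figure is hypothetical, not a measured accuracy of any model.

```python
import math


def cascade_success_rate(stage_accuracies):
    """Probability that every stage in the cascade gets the utterance right."""
    return math.prod(stage_accuracies)


# Hypothetical per-stage accuracy of 0.9 for each model in the pipeline.
three_stage = cascade_success_rate([0.9, 0.9, 0.9])  # STT, MT, TTS
two_stage = cascade_success_rate([0.9, 0.9])         # speech translation, TTS

print(round(three_stage, 3))  # 0.729
print(round(two_stage, 3))    # 0.81
```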

Voice Assistant

  • Wake word detection
    • The assistant is activated by a trigger word
    • use Audio classification model
  • Speech Transcription
    • use ASR model on device rather than on cloud for faster inference
  • Language Model Query
    • use LLM
    • since the query is likely short, it can be sent to a cloud-hosted LLM to get the response
  • Synthesize Speech
    • use an on-device or cloud model

Speaker Diarization

  • task of taking an unlabelled audio input and predicting “who spoke when”

Transcription of Meeting

  • ASR models can predict the text together with a timestamp for each segment
    • Example: whisper-base
timestamp: [0.0, 3.56],    text: The second and importance is as follows.
timestamp: [3.56, 7.84],   text: Sovereignty may be defined...
timestamp: [7.84, 13.88],  text: In France, the king really exercises...
timestamp: [13.88, 15.48], text: no weight.
timestamp: [15.48, 19.44], text: He was in a favored state of mind...
timestamp: [19.44, 21.28], text: cast upon his entire future.
  • Diarization models predict only the time span of each speaker, not the text
    • Example: speaker-diarization from pyannote
segment: [0.49, 14.52],  track: B, label: SPEAKER_01
segment: [15.36, 21.37], track: A, label: SPEAKER_00
  • We can combine both models to produce a speaker-labelled transcription of the audio
SPEAKER_01 (0.0, 15.5) The second and importance is as follows. Sovereignty may be defined to be the right of making laws. In France, the king really exercises a portion of the sovereign power, since the laws have no weight.

SPEAKER_00 (15.5, 21.3) He was in a favored state of mind, owing to the blight his wife's action threatened to cast upon his entire future.
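
One simple way to combine the two outputs is to give each ASR segment to the speaker whose turn overlaps it most, then merge neighbouring segments from the same speaker. The sketch below uses the timestamps listed above. It is an overlap heuristic written for illustration and is not necessarily how the tool that produced the combined transcript aligns segments; the function names are hypothetical.

```python
def overlap(span_a, span_b):
    """Length in seconds of the intersection of two (start, end) spans."""
    return max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))


def assign_speakers(asr_chunks, speaker_turns):
    """Label each ASR chunk with the best-overlapping speaker and merge runs."""
    merged = []
    for chunk in asr_chunks:
        best_turn = max(speaker_turns,
                        key=lambda turn: overlap(chunk["timestamp"], turn["segment"]))
        start, end = chunk["timestamp"]
        if merged and merged[-1]["speaker"] == best_turn["label"]:
            # Same speaker as the previous chunk: extend that entry.
            merged[-1]["timestamp"] = (merged[-1]["timestamp"][0], end)
            merged[-1]["text"] += " " + chunk["text"]
        else:
            merged.append({"speaker": best_turn["label"],
                           "timestamp": (start, end),
                           "text": chunk["text"]})
    return merged


asr_chunks = [
    {"timestamp": (0.0, 3.56), "text": "The second and importance is as follows."},
    {"timestamp": (3.56, 7.84), "text": "Sovereignty may be defined..."},
    {"timestamp": (7.84, 13.88), "text": "In France, the king really exercises..."},
    {"timestamp": (13.88, 15.48), "text": "no weight."},
    {"timestamp": (15.48, 19.44), "text": "He was in a favored state of mind..."},
    {"timestamp": (19.44, 21.28), "text": "cast upon his entire future."},
]
speaker_turns = [
    {"segment": (0.49, 14.52), "label": "SPEAKER_01"},
    {"segment": (15.36, 21.37), "label": "SPEAKER_00"},
]

transcript = assign_speakers(asr_chunks, speaker_turns)
for entry in transcript:
    print(entry["speaker"], entry["timestamp"], entry["text"])
```

With these inputs the fourth segment (13.88 to 15.48) overlaps SPEAKER_01 more than SPEAKER_00, so the speaker change lands at 15.48 s, in line with the combined transcript above.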