Automatic Speech Recognition (ASR)

Automatic Speech Recognition (ASR) is the task of converting speech signals into the corresponding text. The quality of an ASR system is measured by how closely its recognized word sequences match human transcriptions of the same audio. This is quantified using metrics such as Word Error Rate (WER) 1, which is the word-level edit distance (substitutions, deletions, and insertions) between the automatically generated hypothesis and the ground-truth human transcription, divided by the number of words in that transcription.
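As an illustration, WER can be computed with a standard dynamic-programming edit distance over words. The following is a minimal sketch, not the scoring procedure of any particular toolkit: the function name `word_error_rate` is chosen here for illustration, and the sketch assumes whitespace tokenization with no text normalization (casing, punctuation), which real evaluations usually apply first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.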

Traditionally, ASR systems consist of four key components: the acoustic model, the pronunciation dictionary, the language model, and the search decoder 2. These components work in tandem so that the ASR output is consistent with both the audio and the statistical structure of the language. The language model assigns probabilities to word sequences based on their likelihood in a given context, which steers the output towards what is likely to be spoken 3. The pronunciation dictionary decomposes words into smaller sound units known as phonemes, while the acoustic model maps the temporal and spectral characteristics of the audio signal to the phonemes of the language 2 4. The search decoder combines the probabilities assigned by these components to find the most likely output hypothesis.
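The way the decoder combines the component scores can be sketched with a toy example. This is a simplified illustration under stated assumptions, not how a production decoder works: the acoustic model and pronunciation dictionary are replaced by a hand-written table of hypothetical acoustic log-likelihoods (`ACOUSTIC_SCORES`), the language model is an add-one-smoothed bigram model trained on a three-sentence made-up corpus, and the search is an exhaustive comparison of two candidate transcriptions rather than a search over a large hypothesis space. All names and numbers are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical training text for the toy language model.
CORPUS = [
    "recognize speech",
    "recognize speech today",
    "wreck a nice beach",
]

# Hypothetical acoustic log-likelihoods log P(audio | words) for two
# similar-sounding candidates; the acoustics slightly favor the wrong one.
ACOUSTIC_SCORES = {
    "wreck a nice beach": -10.0,
    "recognize speech": -10.5,
}


def train_bigram_lm(sentences):
    """Return a function giving add-one-smoothed bigram log P(words)."""
    history_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        history_counts.update(words[:-1])
        bigram_counts.update(zip(words[:-1], words[1:]))
    vocab_size = len(vocab)

    def log_prob(sentence):
        words = ["<s>"] + sentence.split() + ["</s>"]
        return sum(
            math.log((bigram_counts[(a, b)] + 1) / (history_counts[a] + vocab_size))
            for a, b in zip(words[:-1], words[1:])
        )

    return log_prob


def decode(acoustic_scores, lm_log_prob, lm_weight=1.0):
    """Pick the hypothesis maximizing acoustic score + weighted LM score."""
    return max(
        acoustic_scores,
        key=lambda hyp: acoustic_scores[hyp] + lm_weight * lm_log_prob(hyp),
    )


lm = train_bigram_lm(CORPUS)
print(decode(ACOUSTIC_SCORES, lm))                 # language model included
print(decode(ACOUSTIC_SCORES, lm, lm_weight=0.0))  # acoustic score only
```

With the language model switched off, the candidate with the higher acoustic score wins; with it switched on, the candidate whose word sequence is more probable under the toy corpus wins. The `lm_weight` parameter stands in for the language model scaling factor that is typically tuned on held-out data.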

Over time, different statistical modeling techniques have been employed for each ASR component. Acoustic models initially used Hidden Markov Models (HMMs), with Gaussian Mixture Models (GMMs) representing the output distribution of each state 2. With the advent of deep learning, neural networks (NNs) gained traction, delivering improved performance through deeper architectures, larger sets of acoustic units, and training on more data 5 6 7. Language modeling likewise shifted from n-gram models to recurrent neural network language models, which can capture dependencies beyond the fixed context window of an n-gram 8.

More recently, the research community has moved towards end-to-end ASR systems, which integrate all four components into a single neural network that is trained jointly 9 10 11. Recurrent neural networks (RNNs) were initially the preferred choice for modeling temporal dependencies in audio sequences 12 13. The Transformer architecture, based on self-attention, has since gained widespread adoption due to its ability to capture long-distance interactions and its high training efficiency 14 15. The Conformer architecture, which combines convolutional layers with Transformers, has outperformed previous approaches that use only Transformers or only convolutional neural networks 16.
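The reason self-attention handles long-distance interactions well is that every frame attends to every other frame directly, regardless of how far apart they are in time. The following is a minimal sketch of scaled dot-product self-attention in plain Python. It is deliberately simplified and is not the full Transformer layer: it assumes a single attention head and uses the input frames themselves as queries, keys, and values, whereas real models apply learned projection matrices first. The function names and the toy input are chosen for illustration.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(frames):
    """Single-head scaled dot-product self-attention without projections.

    `frames` is a list of equal-length feature vectors, one per time step.
    Returns (outputs, weights), where weights[t][s] is how much frame t
    attends to frame s.
    """
    dim = len(frames[0])
    outputs, weights = [], []
    for query in frames:
        # Similarity of this frame to every frame, scaled by sqrt(dim).
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in frames
        ]
        row = softmax(scores)
        weights.append(row)
        # Output is the attention-weighted average of all frames.
        outputs.append(
            [sum(w * value[j] for w, value in zip(row, frames)) for j in range(dim)]
        )
    return outputs, weights


# Frames 0 and 2 are identical; frame 1 differs.
toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
toy_outputs, toy_weights = self_attention(toy_frames)
print(toy_weights[0])
```

In the toy input, frame 0 attends to the distant but similar frame 2 exactly as strongly as to itself, and less to the adjacent but dissimilar frame 1: the attention weight depends on content, not on distance in time.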

A very good starting point for learning about ASR is the HTK Book 1. For building an ASR system in practice, Kaldi 17 is also a valuable resource: it is a speech recognition toolkit written in C++ and licensed under the Apache License v2.0.


Footnotes

  1. Steve Young, et al. "The HTK book"
  2. Mark Gales, Steve Young. "The Application of Hidden Markov Models in Speech Recognition" (2007)
  3. Andreas Stolcke. "SRILM - an Extensible Language Modeling Toolkit" (2002)
  4. Hervé Bourlard, Nelson Morgan. "Connectionist Speech Recognition: A Hybrid Approach" (1994)
  5. Geoffrey Hinton, et al. "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups" (2012)
  6. Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton. "Acoustic Modeling using Deep Belief Networks" (2010)
  7. George Dahl, Dong Yu, Li Deng, Alex Acero. "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition" (2010)
  8. Tomas Mikolov, et al. "RNNLM - Recurrent Neural Network Language Modeling Toolkit" (2010)
  9. Alex Graves, Navdeep Jaitly. "Towards End-To-End Speech Recognition with Recurrent Neural Networks" (2014)
  10. Dzmitry Bahdanau, et al. "End-to-End Attention-based Large Vocabulary Speech Recognition" (2015)
  11. William Chan, et al. "Listen, Attend and Spell" (2015)
  12. C.-C. Chiu, et al. "State-of-the-art speech recognition with sequence-to-sequence models" (2018)
  13. K. Rao, et al. "Exploring architectures, data, and units for streaming end-to-end speech recognition with RNN-transducer" (2017)
  14. A. Vaswani, et al. "Attention is all you need" (2017)
  15. Q. Zhang, et al. "Transformer transducer: A streamable speech recognition model with transformer encoders and RNN-t loss" (2020)
  16. A. Gulati, et al. "Conformer: Convolution-augmented Transformer for Speech Recognition" (2020)
  17. Kaldi toolkit