Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the process of converting a speech signal into its corresponding text. The quality of an ASR system is measured by how close its recognized word sequences are to the word sequences recognized by humans. More formally, ASR quality is measured by the Word Error Rate (WER): the word-level edit distance between an automatically generated hypothesis and the ground-truth human transcription, divided by the number of words in the human transcription.
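The edit-distance view of WER can be made concrete with a short sketch. The function name `word_error_rate` and the example sentences are illustrative assumptions; the code uses the common convention of dividing the word-level edit distance by the number of reference words and does no text normalization beyond splitting on whitespace.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("walks" -> "walk") and one deletion ("his") over 5 reference words.
wer = word_error_rate("ali walks to his school", "ali walk to school")
print(f"WER = {wer:.2f}")  # WER = 0.40
```

Because insertions are counted as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.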
Traditionally, ASR systems are split into four components: the acoustic model, the pronunciation dictionary, the language model, and the search decoder. Since ASR output hypotheses need to adhere to the statistical structure of language, the language model ensures that the output sequence matches what is likely to be said. For example, words like "school" or "work" are more likely than "oil" or "yellow" to be the next word in the sequence "Ali walks to his ...". The pronunciation dictionary is used to decompose words into small units of sound, known as phonemes. The acoustic model represents the mapping between the audio signal, with its temporal (time-related) and spectral (frequency-related) characteristics, and the phonemes of the language. Each model assigns probabilities to the different choices it makes, and the decoder then searches over all these alternatives, weighing their probabilities, to come up with the best output hypothesis. A very good starting point for learning about ASR is the HTK Book.
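As a rough illustration of how the decoder weighs the models against each other, the sketch below scores four candidate words for the blank in "Ali walks to his ...". All numbers are made up for illustration, and the names `decode` and `lm_weight` are assumptions rather than part of any particular toolkit; a real decoder searches over whole word sequences, not single words.

```python
import math

# Hypothetical scores for the blank in "Ali walks to his ...".
# Every value below is an invented illustrative number, not a measurement.
acoustic_likelihood = {"school": 0.30, "work": 0.20, "oil": 0.35, "yellow": 0.15}
language_model_prob = {"school": 0.40, "work": 0.45, "oil": 0.10, "yellow": 0.05}


def decode(acoustic, language_model, lm_weight=1.0):
    """Return the candidate with the highest combined log-score."""
    def score(word):
        return math.log(acoustic[word]) + lm_weight * math.log(language_model[word])
    return max(acoustic, key=score)


print(decode(acoustic_likelihood, language_model_prob, lm_weight=0.0))  # oil
print(decode(acoustic_likelihood, language_model_prob, lm_weight=1.0))  # school
```

With the language model switched off, the acoustically closest word wins; once its evidence is added, the more plausible continuation is selected.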
Different statistical modeling techniques have been used for the different components of an ASR system. For acoustic models, Hidden Markov Models (HMMs) with Gaussian Mixture Model (GMM) state representations were used, as were neural networks. With the deep learning revolution, neural networks got a boost in performance by going deeper, by using a large set of acoustic units in their output (obtained by combining phonemes), and by training on very large volumes of data. Language modeling followed a similar trend: n-gram language models were dethroned by recurrent neural network language models.
More recently, the research community has been moving towards a more holistic approach that combines all four components into one end-to-end ASR system, where the input is the acoustic signal representation and the output is the word sequence, without building four distinct components [10, 11, 12].
For building an ASR system in practice, you can also learn a lot from Kaldi. It is a toolkit for speech recognition written in C++ and licensed under the Apache License v2.0.
[1] Steve Young, et al., "The HTK Book", https://www.danielpovey.com/files/htkbook.pdf
[2] Mark Gales, Steve Young, "The Application of Hidden Markov Models in Speech Recognition", 2007, https://mi.eng.cam.ac.uk/~mjfg/mjfg_NOW.pdf
[3] Hervé Bourlard, Nelson Morgan, "Connectionist Speech Recognition: A Hybrid Approach", 1994
[4] Geoffrey Hinton, et al., "Deep Neural Networks for Acoustic Modeling in Speech Recognition: The shared views of four research groups", 2012
[5] Abdel-rahman Mohamed, George Dahl, Geoffrey Hinton, "Acoustic Modeling using Deep Belief Networks", 2010
[6] George Dahl, Dong Yu, Li Deng, Alex Acero, "Context-Dependent Pre-trained Deep Neural Networks for Large Vocabulary Speech Recognition", 2010
[7] Frank Seide, Gang Li, Xie Chen, Dong Yu, "Feature engineering in context-dependent deep neural networks for conversational speech transcription", 2011
[8] Andreas Stolcke, "SRILM - an Extensible Language Modeling Toolkit", 2002
[9] Tomas Mikolov, et al., "RNNLM - Recurrent Neural Network Language Modeling Toolkit", 2010
[10] Alex Graves, Navdeep Jaitly, "Towards End-To-End Speech Recognition with Recurrent Neural Networks", 2014
[11] Dzmitry Bahdanau, et al., "End-to-End Attention-based Large Vocabulary Speech Recognition", 2015
[12] William Chan, et al., "Listen, Attend and Spell", 2015
Arabic Dialect Identification
The task of dialect identification (DID) is a special case of the more general problem of language identification (LID). LID refers to the process of automatically identifying the language class of a given speech segment or text document. The Arabic language has several spoken dialects. There are four major Arabic dialects, Egyptian, Gulf, Levantine and North African, in addition to Modern Standard Arabic (MSA), which is the official language in Arabic-speaking countries.
Arabic dialect identification is arguably a more challenging problem than LID, since it consists of identifying the different dialects within the same language class. Automatically identifying the input dialect from the speech signal has thus been an interesting research problem, both in its own right and as a way to improve automatic speech recognition (ASR).
Approaches to Arabic dialect identification (ADI) are closely related to those of language recognition. These include Gaussian mixture models, the phonotactic approach and phone recognition, the i-vector combined with dimensionality reduction, and more recently deep learning techniques [4-7]. Arabic dialect identification has also been closely associated with improving dialectal Arabic ASR; interesting work has been done in the context of the GALE project and in a recent thesis. In spite of these advances, Arabic dialect recognition remains a challenging problem, and several special sessions and contests have been organized around the subject. These include good pointers to many techniques and data sets. There are also various repositories [11-13] that can be a good starting point for building an experimental setup.
[1] A. Ali, et al. "Automatic dialect detection in Arabic broadcast speech." in Interspeech 2016.
[2] Marc A. Zissman, “A comparison of four approaches to automatic language identification of telephone speech,” in IEEE Transactions on Speech and Audio Processing, vol. 4, no. 1, Jan 1996.
[3] N. Dehak, P.A. Torres-Carrasquillo, D. Reynolds and R. Dehak, “Language recognition via i-vectors and dimensionality reduction,” in Interspeech 2011.
[4] O. Ghahabi, A. Bonafonte, J. Hernando and A. Moreno, “Deep neural networks for i-vector language identification of short utterances in cars,” in Interspeech 2016.
[5] S. Shon, A. Ali, and J. Glass. "MIT-QCRI Arabic dialect identification system for the 2017 multi-genre broadcast challenge." Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[6] M. Najafian, et al. "Exploiting convolutional neural networks for phonotactic based dialect identification." in ICASSP 2018.
[7] S. Shon, A. Ali, and J. Glass. "Convolutional Neural Network and Language Embeddings for End-to-End Dialect Recognition." Proc. Odyssey 2018 The Speaker and Language Recognition Workshop. 2018.
[8] F. Biadsy, J. Hirschberg and N. Habash, “Spoken Arabic dialect identification using phonotactic modeling,” in Proceedings of EACL workshop on computational approaches to Semitic languages, 2009.
[9] A. Ali. Multi-dialect Arabic broadcast speech recognition. PhD thesis, The University of Edinburgh, 2018.
[10] Zampieri, Marcos, et al. "Language Identification and Morphosyntactic Tagging: The Second VarDial Evaluation Campaign." Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects. Association for Computational Linguistics, 2018.
Text To Speech
Text-to-speech (TTS) technology aims to convert a sequence of words into speech. Traditional TTS pipelines, or engines, consist of a few steps to generate the speech:
- The text normalization or tokenization step aims to convert raw text containing symbols like numbers and abbreviations into the equivalent words.
- The text-to-phoneme, or grapheme-to-phoneme, conversion step assigns a phonetic transcription to each word.
- The prosodic phrasing step divides and marks the text into prosodic units, such as phrases and sentences.
- The prosody prediction step predicts the target prosody (pitch contour, phoneme durations), which is used to generate and control the output speech.
- Finally, the synthesizer is used to convert the symbolic linguistic representation into sound.
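The first step of the pipeline above, text normalization, can be sketched as follows. This is a minimal illustration under stated assumptions: the lookup tables are hypothetical and deliberately tiny, only integers from 0 to 99 are spelled out, and English is used purely for readability (an Arabic normalizer would need its own number grammar and abbreviation list).

```python
import re

# Hypothetical, deliberately tiny lookup tables; a real system needs far larger ones.
ABBREVIATIONS = {"dr.": "doctor", "st.": "street", "kg": "kilograms"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def normalize(text: str) -> str:
    """Lowercase the text, then expand known abbreviations and 1-2 digit numbers."""
    words = []
    for token in text.lower().split():
        if token in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d{1,2}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)
    return " ".join(words)


print(normalize("Dr. Ali lives at 42 Nile St."))
# doctor ali lives at forty two nile street
```

The output of this step is plain words only, which is what the later grapheme-to-phoneme step expects as input.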
Synthesized speech can be created by concatenating units of recorded speech that are stored in a database. Common units used in concatenative synthesizers are phones or diphones. Alternatively, statistical parametric synthesizers, also known as HMM-based synthesizers (based on hidden Markov models), can be used to create the synthesized speech. In these systems, the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. More recently, neural networks have been used as acoustic models for statistical parametric synthesizers. In addition, end-to-end DNN-based speech synthesizers such as Tacotron by Google and Deep Voice from Baidu are an active area of research. A state-of-the-art synthesizer based on Tacotron, developed for the Arabic language, is available on GitHub.
Since Modern Standard Arabic (MSA) is written without diacritics, the first step in developing an Arabic TTS engine is to restore the diacritics of each word in the text. The diacritized text is then passed to a phonetic transcription module to generate the phoneme sequence for each phrase. Finally, a synthesizer (e.g., concatenative, parametric, or neural network based) can be used to synthesize the speech.
[1] A. Hunt and A. Black, "Unit selection in a concatenative speech synthesis system using a large speech database". In ICASSP-96, volume 1, pages 373--376, Atlanta, Georgia, 1996.
[2] Hifny, Yasser, et al. "ArabTalk®: An Implementation for Arabic Text To Speech System." The proceedings of the 4th Conference on Language Engineering. 2004.
[3] HMM/DNN-based Speech Synthesis System (HTS), http://hts.sp.nitech.ac.jp/
[4] Abdel-Hamid, Ossama, Sherif Mahdy Abdou, and Mohsen Rashwan. "Improving Arabic HMM based speech synthesis quality." Ninth International Conference on Spoken Language Processing. 2006.
[5] Merlin: The Neural Network (NN) based Speech Synthesis System, https://github.com/CSTR-Edinburgh/merlin
[6] Ping, Wei, et al. "Deep voice 3: Scaling text-to-speech with convolutional sequence learning." (2018).
[7] Wang, Yuxuan, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135
[8] Rashwan, Mohsen AA, et al. "A stochastic Arabic diacritizer based on a hybrid of factorized and unfactorized textual features." IEEE Transactions on Audio, Speech, and Language Processing 19.1 (2011): 166-175.
[9] Darwish, Kareem, Hamdy Mubarak, and Ahmed Abdelali. "Arabic diacritization: Stats, rules, and hacks." Proceedings of the Third Arabic Natural Language Processing Workshop. 2017.
[10] Hifny, Yasser. "Hybrid LSTM/MaxEnt Networks for Arabic Syntactic Diacritics Restoration." IEEE Signal Processing Letters 25.10 (2018): 1515-1519.
Language Modeling
Language modeling aims at accurately estimating the probability distribution of word sequences or sentences produced in a natural language such as Arabic. Having a way to estimate the relative likelihood of different word sequences is useful in many natural language processing applications, especially those where natural text is generated, as is the case in speech recognition. The goal of a speech recognizer is to match input speech sounds with word sequences. To accomplish this goal, the speech recognizer leverages the language model to distinguish between words and phrases that sound similar. These ambiguities are easier to resolve when evidence from the language model is combined with the pronunciation model and the acoustic model.
Language models rely heavily on the context, or history, to estimate the probability distribution. The context can be long or short, knowledge-rich or knowledge-poor. We may base the estimation on a single preceding word (e.g., a bigram), or potentially on all words from the start of the passage up to the word in question. Knowledge-rich models can incorporate information about morphology, syntax or semantics to inform the estimation of the probability distribution of a word sequence, whereas knowledge-poor models rely solely on the words as they appear in the text. It is reasonable to state that current language modeling techniques can be split into two categories: count-based and continuous-space language models.
The count-based approaches represent the traditional techniques and usually involve the estimation of n-gram probabilities, where the goal is to accurately predict the next word in a sequence of words. In a model that estimates probabilities for two-word sequences (bigrams), it is unclear whether a given bigram has a count of zero because it is not a valid sequence in the language, or because it simply does not occur in the training data. As the length of the modeled sequences grows, this sparsity issue grows with it. Of all possible 5-gram combinations in a language, very few are likely to appear at all in a given text, and even fewer will repeat often enough to provide reliable frequency statistics. Therefore, as the language model tries to predict the next word, the challenge is to find appropriate, reliable estimates of word sequence probabilities to enable the prediction. Approaches to this challenge are threefold: smoothing techniques are used to offset zero-probability sequences and spread probability mass across a model [2-4]; enhanced modeling techniques that incorporate machine learning or complex algorithms are used to create models that can best incorporate additional linguistic information [5-6]; and, particularly for Arabic language modeling, morphological information is extracted and provided to the models in place of or in addition to lexical information [7-8].
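The zero-count problem and the effect of smoothing can be seen in a small bigram sketch. Note the assumptions: the three-sentence corpus is invented, the function names are illustrative, and add-one (Laplace) smoothing is used only because it is the simplest technique to show; it is a stand-in for, not an implementation of, the more refined smoothing methods cited above.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Count unigrams and bigrams, padding each sentence with <s> and </s>."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def bigram_prob(prev, word, unigrams, bigrams):
    """P(word | prev) with add-one smoothing over the observed vocabulary."""
    vocab_size = len(unigrams)
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


# Invented toy corpus, for illustration only.
corpus = [
    "ali walks to his school",
    "ali walks to his work",
    "sara walks to her school",
]
unigrams, bigrams = train_bigram_model(corpus)

seen = bigram_prob("his", "school", unigrams, bigrams)  # bigram occurs once
unseen = bigram_prob("his", "ali", unigrams, bigrams)   # bigram never occurs
print(f"P(school | his) = {seen:.3f}")  # P(school | his) = 0.167
print(f"P(ali | his)    = {unseen:.3f}")  # P(ali | his)    = 0.083
```

Without smoothing the unseen bigram would receive probability zero; with it, some probability mass is moved from seen to unseen events, at the cost of underestimating the seen ones.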
Continuous-space language modeling approaches use neural networks to estimate the probability distribution of a word sequence [9-10]. These approaches, also called neural language models, are based on feed-forward neural networks or recurrent neural networks [11-13], and have achieved state-of-the-art performance. Recently, techniques based on transformers (BERT) have also started to be explored for language modeling. Initially, the feed-forward neural network based LM efficiently tackled the problem of data sparsity, but not necessarily the problem of limited context, since it uses a fixed-length context. Every word in the vocabulary is associated with a distributed word feature vector, and the joint probability function of a word sequence is expressed as a function of the feature vectors of the words in the sequence [9-10].
The recurrent neural network based LM was able, to a certain degree, to address the problem of limited context. It does not use a fixed-length context: its internal memory is able to remember important things about the input it has received. In this type of architecture, neurons with input from recurrent connections are assumed to represent short-term memory, which enables the network to better leverage the history, or context [9, 14, 15]. Subsequent research has also focused on sub-word modeling and corpus-level modeling based on recurrent neural networks and their variants, such as the long short-term memory network (LSTM). However, very long training times and the need for large amounts of data are still the main limitations. It is also reasonable to say that sub-word modeling and large-context language models are still interesting challenges to solve, which is very important for a language such as Arabic.
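How recurrent connections carry context can be illustrated with a single Elman-style update, h_t = tanh(W_in x_t + W_rec h_{t-1}). The sketch below is only the recurrence, not a full language model: there is no output layer or training, and the two-dimensional embeddings and weight matrices are hypothetical hand-picked values.

```python
import math


def rnn_step(x, h_prev, w_in, w_rec):
    """One Elman-style update: h_t = tanh(W_in x_t + W_rec h_{t-1})."""
    return [
        math.tanh(
            sum(w * xi for w, xi in zip(w_in[k], x))
            + sum(w * hi for w, hi in zip(w_rec[k], h_prev))
        )
        for k in range(len(h_prev))
    ]


def encode(words, embeddings, w_in, w_rec):
    """Run the recurrence over a word sequence and return the final hidden state."""
    h = [0.0] * len(w_rec)
    for word in words:
        h = rnn_step(embeddings[word], h, w_in, w_rec)
    return h


# Hypothetical hand-picked parameters; a trained model would learn these.
embeddings = {"ali": [1.0, 0.0], "sara": [0.0, 1.0], "walks": [0.5, 0.5]}
w_in = [[0.8, -0.3], [0.2, 0.7]]
w_rec = [[0.5, 0.1], [-0.4, 0.6]]

h_a = encode(["ali", "walks"], embeddings, w_in, w_rec)
h_b = encode(["sara", "walks"], embeddings, w_in, w_rec)
print(h_a)
print(h_b)
```

Both sequences end in the same word, yet the final hidden states differ, because each state depends on everything seen before it rather than on a fixed-length window.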
The toolkits listed in [18-22] are a good starting point for building your own language models.
[1] I. Zitouni (Ed.), Natural language processing of Semitic languages, theory and applications of natural language processing, Chapter 5. Springer, Berlin, Heidelberg (2014)
[2] Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 181–184.
[3] Ciprian Chelba and Johan Schalkwyk, 2013. Empirical Exploration of Language Modeling for the google.com Query Stream as Applied to Mobile Voice Search, pages 197–229. Springer, New York
[4] Stanley Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August.
[5] P.F. Brown, V.J. Della Pietra, P.V. deSouza, J.C. Lai, R.L. Mercer, "Class-based n-gram models of natural language," Computational Linguistics, vol. 18, no. 4, pp. 467-479, 1992.
[6] R. A. Solsona, E. Fosler-Lussier, H. J. Kuo, A. Potamianos and I. Zitouni, "Adaptive language models for spoken dialogue systems," 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, 2002, pp. I-37-I-40. doi: 10.1109/ICASSP.2002.5743648
[7] G. Choueiter, D. Povey, S. F. Chen and G. Zweig, "Morpheme-Based Language Modeling for Arabic Lvcsr," 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings, Toulouse, 2006, pp. I-I.
[8] K. Kirchhoff, D. Vergyri, J. Bilmes, K. Duth, A. Stolcke, “Morphology-based language modeling for conversational Arabic speech recognition,” Computer Speech & Language. Vol. 20 no. 4 pp. 589-608 Oct 2006.
[9] Mikolov, T. Statistical Language Models based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
[10] W. Mulder, S. Bethard, M.F. Moens. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech & Language. Vol. 30 no. 1 pp. 61-98 March 2015.
[11] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-Aware Neural Language Models. CoRR, abs/1508.06615.
[12] Bengio, Y., Ducharme, R., Vincent, P., and Janvin, C. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
[13] Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.
[14] Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. Trans. Audio, Speech and Lang. Proc. 23, 3 (March 2015), 517-529. DOI: https://doi.org/10.1109/TASLP.2015.2400218
[15] S. Yousfi, S.A. Berrani, C. Garcia. Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos. Pattern Recognition. Vol. 64 pp. 245-254 April 2017.
[16] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. ArXiv e-prints.
[17] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint, 1602.02410, 2016. arxiv.org/abs/1602.02410.
[18] CMU Statistical Language Modeling Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit.html
[19] HTK Toolkit: http://htk.eng.cam.ac.uk/download.shtml
[20] SRILM - The SRI Language Modeling Toolkit: http://www.speech.sri.com/projects/srilm/
[21] Stanford CoreNLP – Natural language software: https://stanfordnlp.github.io/CoreNLP/
[22] The Berkeley NLP Group: http://nlp.cs.berkeley.edu/software.shtml