Text-to-Speech (TTS) Technology

Text-to-Speech (TTS) technology converts a sequence of words into speech. Traditional TTS pipelines or engines consist of several steps to generate the speech:

  1. Text Normalization or Tokenization: This step converts raw text containing symbols, such as numbers and abbreviations, into their spoken-word equivalents.
  2. Text-to-Phoneme or Grapheme-to-Phoneme Conversion: In this step, phonetic transcriptions are assigned to each word.
  3. Prosodic Phrasing: This step aims to divide and mark the text into prosodic units, such as phrases and sentences.
  4. Prediction of Target Prosody: The target prosody, including pitch contour and phoneme durations, is determined to generate or control the output speech.
  5. Synthesizer: The synthesizer converts the symbolic linguistic representation into sound.
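The front-end steps above (normalization and grapheme-to-phoneme conversion) can be sketched as follows. This is a minimal illustration with hypothetical lookup tables; a real engine uses full normalization rules and a large pronunciation lexicon.

```python
import re

# Hypothetical, tiny stand-ins for real normalization rules and a lexicon.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}
LEXICON = {"one": ["W", "AH", "N"], "cat": ["K", "AE", "T"]}

def normalize(text):
    """Step 1: tokenize and expand digits into equivalent words."""
    tokens = re.findall(r"[A-Za-z]+|\d", text.lower())
    return [NUMBER_WORDS.get(t, t) for t in tokens]

def to_phonemes(words):
    """Step 2: assign a phonetic transcription to each word.
    Out-of-vocabulary words fall back to naive letter-by-letter spelling."""
    return [LEXICON.get(w, list(w.upper())) for w in words]

words = normalize("1 cat")
print(words)               # ['one', 'cat']
print(to_phonemes(words))  # [['W', 'AH', 'N'], ['K', 'AE', 'T']]
```

Steps 3 and 4 (prosodic phrasing and prosody prediction) would then annotate this phoneme sequence with phrase breaks, durations, and a pitch contour before it reaches the synthesizer.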

Synthesized speech can be created by concatenating units of recorded speech stored in a database [1] [2]. Common units used in concatenative synthesizers are phones or diphones. Alternatively, statistical parametric synthesizers, also known as HMM-based synthesizers (based on hidden Markov models), can be used [3] [4]. In these systems, the frequency spectrum (vocal tract), fundamental frequency (voice source), and duration (prosody) of speech are modeled simultaneously by HMMs, and speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. More recently, neural networks have been used as acoustic models for statistical parametric synthesizers [5]. In addition, end-to-end DNN-based speech synthesizers, such as Tacotron [6] by Google and Deep Voice [7] by Baidu, are an active area of research. A state-of-the-art synthesizer based on Tacotron, developed for the Arabic language, is available on GitHub [8].
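The concatenative approach can be illustrated with a toy sketch: recorded units are looked up in a database and their waveforms joined. The unit names and sample values here are hypothetical; a real unit-selection system chooses among many candidate recordings per unit and smooths the joins.

```python
# Hypothetical unit database: each phone maps to a stored "waveform"
# (a short list of samples stands in for real recorded audio).
UNIT_DB = {
    "K":  [0.1, 0.2, 0.1],
    "AE": [0.5, 0.6, 0.5],
    "T":  [0.0, 0.3, 0.0],
}

def concatenate(phonemes):
    """Join the stored waveform of each unit end-to-end."""
    wave = []
    for p in phonemes:
        wave.extend(UNIT_DB[p])
    return wave

print(concatenate(["K", "AE", "T"]))
# [0.1, 0.2, 0.1, 0.5, 0.6, 0.5, 0.0, 0.3, 0.0]
```

Unit selection [1] generalizes this by scoring candidate units from a large database for target and join cost, which is what makes concatenative output sound natural.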

Since Modern Standard Arabic (MSA) is written without diacritics, the first step in developing an Arabic TTS engine [2] is to restore the diacritics of each word in the text [9] [10] [11]. The diacritized text is then passed to a phonetic transcription module to generate the phoneme sequence for each phrase [12]. Finally, a synthesizer (e.g., concatenative, parametric, or neural) can be used to synthesize the speech.
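The Arabic front-end described above can be sketched as a two-stage lookup. The mappings here are hypothetical single-entry stand-ins; real systems use statistical diacritizers (e.g., the hybrid and LSTM/MaxEnt models cited above) and a full phonetiser.

```python
# Hypothetical stand-ins for a diacritization model and a phonetiser.
DIACRITIZED = {"كتب": "كَتَبَ"}                       # undiacritized -> diacritized
PHONEMES = {"كَتَبَ": ["k", "a", "t", "a", "b", "a"]}  # diacritized -> phonemes

def arabic_tts_frontend(text):
    """Restore diacritics per word, then transcribe to phonemes."""
    phrase = []
    for word in text.split():
        diacritized = DIACRITIZED.get(word, word)      # step 1: diacritic restoration
        phrase.extend(PHONEMES.get(diacritized, []))   # step 2: phonetic transcription
    return phrase  # this phoneme sequence is what the synthesizer consumes

print(arabic_tts_frontend("كتب"))  # ['k', 'a', 't', 'a', 'b', 'a']
```

Diacritization has to come first because the phonetic transcription of an Arabic word is ambiguous without its short vowels: the same undiacritized letters can map to several different phoneme sequences.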


Footnotes

  1. A. Hunt and A. Black. "Unit selection in a concatenative speech synthesis system using a large speech database." ICASSP-96, vol. 1, pp. 373–376, Atlanta, Georgia, 1996.
  2. Yasser Hifny, et al. "ArabTalk®: An Implementation for Arabic Text To Speech System." Proceedings of the 4th Conference on Language Engineering, 2004.
  3. Ossama Abdel-Hamid, Sherif Mahdy Abdou, and Mohsen Rashwan. "Improving Arabic HMM based speech synthesis quality." Ninth International Conference on Spoken Language Processing, 2006.
  4. HMM/DNN-based Speech Synthesis System (HTS). Link
  5. Wei Ping, et al. "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning." 2018.
  6. Yuxuan Wang, et al. "Tacotron: Towards end-to-end speech synthesis." arXiv preprint arXiv:1703.10135, 2017.
  7. Deep Voice from Baidu.
  8. Arabic Tacotron TTS. GitHub Link
  9. Mohsen AA Rashwan, et al. "A stochastic Arabic diacritizer based on a hybrid of factorized and unfactorized textual features." IEEE Transactions on Audio, Speech, and Language Processing, 19.1 (2011): 166–175.
  10. Kareem Darwish, Hamdy Mubarak, and Ahmed Abdelali. "Arabic diacritization: Stats, rules, and hacks." Proceedings of the Third Arabic Natural Language Processing Workshop, 2017.
  11. Yasser Hifny. "Hybrid LSTM/MaxEnt Networks for Arabic Syntactic Diacritics Restoration." IEEE Signal Processing Letters, 25.10 (2018): 1515–1519.
  12. Arabic Phonetiser. GitHub Link