Automatic Speech Recognition
Automatic Speech Recognition (ASR) is the process of converting a speech signal into its corresponding text. The quality of an ASR system is measured by how close its recognized word sequences are to those transcribed by humans. More formally, ASR quality is measured by Word Error Rate (WER): the word-level edit distance between the automatically generated hypothesis and the ground-truth human transcription, normalized by the number of words in the reference.
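As a minimal sketch of the WER computation described above, the following function counts word-level substitutions, insertions, and deletions with a standard dynamic-programming edit distance and divides by the reference length. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; production scoring tools typically also normalize punctuation and casing before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()   # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # aligning i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the): 2 errors / 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Because insertions are counted as errors while the denominator is the reference length, WER can exceed 1.0 when the hypothesis is much longer than the reference.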
Traditionally, ASR systems are split into four components: the acoustic model, the pronunciation model, the language model, and the search decoder. Since ASR output hypotheses need to adhere to the statistical structure of language...
Arabic Dialect Identification
The task of Arabic dialect identification (ADI) is a special case of the more general problem of language identification (LID). LID refers to the process of automatically identifying the language class of a given speech segment or text document. Arabic has several spoken dialects. The four major ones are Egyptian, Gulf, Levantine, and North African, in addition to Modern Standard Arabic (MSA), which is the official language in Arabic-speaking countries.
Approaches to ADI are closely related to those of language recognition. These include...
Text To Speech
Text to Speech (TTS) technology aims to convert a sequence of words into speech. Synthesized speech can be created by concatenating units of recorded speech that are stored in a database. Common units used in concatenative synthesizers are phones or diphones. Alternatively, statistical parametric synthesizers, also known as HMM-based synthesizers (based on hidden Markov models), can be used to create the synthesized speech. In this approach, speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion. Recently, neural networks have been used as acoustic models for statistical parametric synthesizers. In addition, end-to-end DNN-based speech synthesizers...
Language Modeling
Language Modeling aims at accurately estimating the probability distribution of word sequences or sentences produced in a natural language such as Arabic. Having a way to estimate the relative likelihood of different word sequences is useful in many natural language processing applications, especially those where natural text is generated, as in speech recognition. The goal of a speech recognizer is to match input speech sounds with word sequences. To accomplish this goal, the speech recognizer leverages the language model to distinguish between words and phrases that sound similar. These ambiguities are easier to resolve when evidence from the language model is combined with evidence from the pronunciation model and the acoustic model.
Language models rely heavily on the context, or history, to estimate the probability distribution. The context can be long or short, knowledge-rich or knowledge-poor. We may base the estimation on a single preceding word (e.g., a bigram model), or potentially on all words from the start of the passage up to the word in question. Knowledge-rich models can incorporate information about morphology, syntax, or semantics to inform the estimation of the probability distribution of word sequences, whereas knowledge-poor models rely solely on the words as they appear in the text. It is reasonable to state that current language modeling techniques can be split into two categories: count-based and continuous-space language models...
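To make the count-based, single-word-history case concrete, here is a minimal sketch of a bigram language model with add-one (Laplace) smoothing. The function names, the toy corpus, and the choice of smoothing are assumptions made for illustration only; they are not taken from the text above, and practical systems generally use more refined smoothing and much larger corpora.

```python
import math
from collections import Counter


def train_bigram(sentences):
    """Collect history counts, bigram counts, and vocabulary size."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens[1:])            # words that can be predicted
        history_counts.update(tokens[:-1])  # words that act as history
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    return history_counts, bigram_counts, len(vocab)


def log_prob(sentence, history_counts, bigram_counts, vocab_size):
    """Add-one smoothed log-probability of a sentence under the bigram model."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    total = 0.0
    for prev, word in zip(tokens[:-1], tokens[1:]):
        numerator = bigram_counts[(prev, word)] + 1
        denominator = history_counts[prev] + vocab_size
        total += math.log(numerator / denominator)
    return total


if __name__ == "__main__":
    # Hypothetical toy corpus, chosen so that two similar-sounding
    # phrases ("ice cream" / "I scream") receive different scores.
    corpus = ["we eat ice cream", "they like ice cream", "I scream loudly"]
    model = train_bigram(corpus)
    for candidate in ["we eat ice cream", "we eat I scream"]:
        print(candidate, log_prob(candidate, *model))
```

On this toy corpus the model scores "we eat ice cream" higher than "we eat I scream", which illustrates how language model evidence can help separate hypotheses that are acoustically similar. Note that this sketch assigns smoothed probability to out-of-vocabulary words without renormalizing, so it is suitable only for comparing candidates built from the training vocabulary.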