Language Models

Language modeling aims at accurately estimating the probability distribution of word sequences or sentences produced in a natural language such as Arabic [1]. Having a way to estimate the relative likelihood of different word sequences is useful in many natural language processing applications, especially those in which natural text is generated, as in speech recognition. The goal of a speech recognizer is to match input speech sounds with word sequences. To accomplish this goal, the speech recognizer leverages the language model to distinguish between words and phrases that sound similar. These ambiguities are easier to resolve when evidence from the language model is combined with the pronunciation model and the acoustic model.

Language models rely heavily on the context, or history, to estimate the probability distribution. The context can be long or short, knowledge-rich or knowledge-poor. We may base the estimation on a single preceding word (e.g., a bigram model), or on all words from the start of the passage preceding the word in question. Knowledge-rich models can incorporate information about morphology, syntax, or semantics to inform the estimation of the probability distribution of word sequences, whereas knowledge-poor models rely solely on the words as they appear in the text. It is reasonable to state that current language modeling techniques can be split into two categories: count-based and continuous-space language models.
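In the standard formulation, a language model factors the probability of a word sequence with the chain rule, and an n-gram model such as the bigram truncates the conditioning history to the most recent word:

```latex
% Chain rule over a word sequence w_1, ..., w_n
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1})

% Bigram approximation: condition only on the single preceding word
P(w_1, \dots, w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
```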

The count-based approaches represent the traditional techniques and usually involve the estimation of n-gram probabilities, where the goal is to accurately predict the next word in a sequence of words. In a model that estimates probabilities for two-word sequences (bigrams), it is unclear whether a given bigram has a count of zero because it is not a valid sequence in the language, or because it is not in the training data. As the modeled sequences grow longer, this sparsity issue also grows. Of all possible combinations of 5-grams in a language, very few are likely to appear at all in a given text, and even fewer will repeat often enough to provide reliable frequency statistics. Therefore, as the language model tries to predict the next word, the challenge is to find appropriate, reliable estimates of word sequence probabilities to enable the prediction. Approaches to this challenge are three-fold: smoothing techniques are used to offset zero-probability sequences and spread probability mass across the model [2, 3]; enhanced modeling techniques that incorporate machine learning or more complex algorithms are used to create models that can best incorporate additional linguistic information [4, 5]; and, particularly for Arabic language modeling, morphological information is extracted and provided to the models in place of, or in addition to, lexical information [6, 7].
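As an illustration of the count-based idea, the following is a minimal sketch (not taken from the cited papers) of a bigram model with add-one smoothing in plain Python; the toy corpus and function names are purely illustrative:

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count unigrams and bigrams from tokenized sentences padded with <s>/</s>."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def bigram_prob(word, prev, unigrams, bigrams, vocab_size):
    """P(word | prev) with add-one (Laplace) smoothing, so bigrams unseen in
    the training data still receive a small, non-zero probability."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy corpus (illustrative only).
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
uni, bi = train_bigram_lm(corpus)
V = len(uni)
print(bigram_prob("cat", "the", uni, bi, V))   # seen bigram
print(bigram_prob("dog", "cat", uni, bi, V))   # unseen bigram, non-zero thanks to smoothing
```

In practice, more elaborate smoothing schemes such as the back-off method of [2] are preferred over add-one smoothing, but the structure of the estimation is the same.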

The continuous-space language modeling approach is based on the use of neural networks to estimate the probability distribution of a word sequence [8, 9]. This approach, also referred to as neural language modeling, relies on feed-forward [9] or recurrent [10, 11] neural networks, which have achieved state-of-the-art performance. Recently, a technique based on transformers (BERT) has also started to be explored for language modeling [12]. The feed-forward neural network-based LM initially tackled the problem of data sparsity efficiently, but not necessarily the problem of context, since it uses a fixed-length context. Every word in the vocabulary is associated with a distributed word feature vector, and the joint probability function of the word sequence is expressed in terms of the feature vectors of the words in the sequence [8, 9].
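A minimal sketch of such a fixed-context feed-forward LM, assuming PyTorch as the framework (the layer sizes, names, and toy inputs below are illustrative, not the exact architecture of [8, 9]):

```python
import torch
import torch.nn as nn

class FeedForwardLM(nn.Module):
    """Fixed-context neural LM: each of the previous `context_size` words is
    mapped to a feature vector; their concatenation predicts the next word."""
    def __init__(self, vocab_size, embed_dim=64, context_size=3, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)          # distributed word feature vectors
        self.hidden = nn.Linear(context_size * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)               # a score for every word in the vocabulary

    def forward(self, context_ids):                                # (batch, context_size)
        feats = self.embed(context_ids).flatten(start_dim=1)       # concatenate the context vectors
        logits = self.out(torch.tanh(self.hidden(feats)))
        return torch.log_softmax(logits, dim=-1)                   # log P(next word | fixed-length context)

# Illustrative usage: one context of 3 word ids drawn from a 1000-word vocabulary.
model = FeedForwardLM(vocab_size=1000)
log_probs = model(torch.tensor([[12, 7, 404]]))
print(log_probs.shape)   # torch.Size([1, 1000])
```

The concatenation of exactly `context_size` embeddings is what ties this model to a fixed-length context window.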

The recurrent neural network-based LM was able, to a certain degree, to address the problem of limited context. It does not use a fixed-length context, as its internal memory is able to remember important things about the input it has received. In this type of architecture, neurons with input from recurrent connections are assumed to represent short-term memory, which enables the model to better leverage the history or context [8, 13, 14]. Subsequent research has focused on sub-word modeling and corpus-level modeling based on recurrent neural networks and their variants, such as the long short-term memory (LSTM) network [14]. However, very long training times and the large amounts of data required are still the main limitations. It is also reasonable to say that sub-word modeling and large-context language models are still interesting challenges to solve, which is very important for a language such as Arabic [15].
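For contrast with the fixed-window sketch above, here is a similarly minimal LSTM-based LM, again assuming PyTorch; the recurrent state that is returned and fed back in is what lets the context extend beyond any fixed window (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Recurrent LM: the LSTM's hidden state carries context of arbitrary length,
    so the model is not tied to a fixed context window."""
    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):                      # (batch, seq_len)
        outputs, state = self.lstm(self.embed(token_ids), state)
        return torch.log_softmax(self.out(outputs), dim=-1), state  # per-position next-word log-probs

# Illustrative usage: the returned `state` can be passed back in to extend the context.
model = LSTMLanguageModel(vocab_size=1000)
log_probs, state = model(torch.tensor([[3, 17, 92, 5]]))
print(log_probs.shape)   # torch.Size([1, 4, 1000])
```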

The reader can also refer to [16, 17] as a starting point for building their own language models; a small example using NLTK is sketched below.
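For instance, NLTK [17] ships a language-modeling module (nltk.lm) that covers the count-based setup described above; the toy corpus here is purely illustrative:

```python
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
train_ngrams, vocab = padded_everygram_pipeline(2, corpus)   # bigram training data + vocabulary

lm = MLE(2)                  # maximum-likelihood bigram model
lm.fit(train_ngrams, vocab)

print(lm.score("cat", ["the"]))                            # P(cat | the)
print(lm.perplexity([("the", "cat"), ("cat", "sat")]))     # perplexity over a sequence of bigrams
```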


Footnotes

  1. I. Zitouni (Ed.), Natural Language Processing of Semitic Languages, Theory and Applications of Natural Language Processing, Chapter 5. Springer, Berlin, Heidelberg (2014).
  2. Reinhard Kneser and Hermann Ney. 1995. Improved backing-off for m-gram language modeling. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pages 181–184.
  3. Stanley Chen and Joshua Goodman. 1998. An empirical study of smoothing techniques for language modeling. Technical Report TR-10-98, Harvard University, August.
  4. P. F. Brown, V. J. Della Pietra, P. V. de Souza, J. C. Lai, R. L. Mercer. "Class-based n-gram models of natural language." Computational Linguistics, vol. 18, no. 4, pp. 467-479, 1992.
  5. R. A. Solsona, E. Fosler-Lussier, H. J. Kuo, A. Potamianos and I. Zitouni, "Adaptive language models for spoken dialogue systems," 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing, Orlando, FL, 2002, pp. I-37–I-40. doi: 10.1109/ICASSP.2002.5743648
  6. G. Choueiter, D. Povey, S. F. Chen and G. Zweig, "Morpheme-based language modeling for Arabic LVCSR," 2006 IEEE International Conference on Acoustics, Speech and Signal Processing Proceedings, Toulouse, 2006. doi: 10.1109/ICASSP.2006.1660205
  7. K. Kirchhoff, D. Vergyri, J. Bilmes, K. Duh, A. Stolcke, "Morphology-based language modeling for conversational Arabic speech recognition." Computer Speech & Language, vol. 20, no. 4, pp. 589-608, Oct 2006.
  8. T. Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis, Brno University of Technology, 2012.
  9. W. De Mulder, S. Bethard, M.-F. Moens. A survey on the application of recurrent neural networks to statistical language modeling. Computer Speech & Language, vol. 30, no. 1, pp. 61-98, March 2015.
  10. Yoon Kim, Yacine Jernite, David Sontag, and Alexander M. Rush. 2015. Character-aware neural language models. CoRR, abs/1508.06615.
  11. T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur. Recurrent neural network based language model. In INTERSPEECH, pp. 1045–1048, 2010.
  12. J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv e-prints.
  13. Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517-529, March 2015. doi: 10.1109/TASLP.2015.2400218
  14. S. Yousfi, S.-A. Berrani, C. Garcia. Contribution of recurrent connectionist language models in improving LSTM-based Arabic text recognition in videos. Pattern Recognition, vol. 64, pp. 245-254, April 2017.
  15. R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410, 2016. arxiv.org/abs/1602.02410
  16. CMU Statistical Language Modeling Toolkit: http://www.speech.cs.cmu.edu/SLM/toolkit.html
  17. NLTK - Natural Language Toolkit: https://www.nltk.org/