Lexicon (Pronouncing Dictionary): QCRI collected a news archive from many news websites. Through a collaboration with Aljazeera, they also had access to more than ten years of news articles from the Arabic news website Aljazeera.net. They processed the text with MADA to restore vowelization. The collected text is mostly MSA (Modern Standard Arabic), with occasional colloquial words. They selected all words that occurred more than once in the news archive and used them to create the QCRI phoneme ASR lexicon. The lexicon has 526K unique grapheme words and 2M pronunciations, an average of 3.8 pronunciations per grapheme word.
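The figures above (unique words, total pronunciations, average per word) could be recomputed from a lexicon file with a short script. The following is a minimal sketch, not the tooling QCRI used. It assumes a Kaldi-style text layout in which each line holds one word followed by its space-separated phonemes, and a word with several pronunciations appears on several lines. The actual file layout is not stated here, and the function name `lexicon_stats` and the toy entries are hypothetical.

```python
from collections import defaultdict


def lexicon_stats(lines):
    """Return (unique words, total pronunciations, average per word).

    Assumes one "word phone1 phone2 ..." entry per line (an assumption,
    not a documented property of the QCRI lexicon).
    """
    prons = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines and entries with no phonemes
        word, phones = parts[0], tuple(parts[1:])
        prons[word].add(phones)  # a set ignores duplicated entries
    n_words = len(prons)
    n_prons = sum(len(p) for p in prons.values())
    avg = n_prons / n_words if n_words else 0.0
    return n_words, n_prons, avg


# Toy entries for illustration only; not taken from the actual lexicon.
sample = [
    "ktb k a t a b a",
    "ktb k u t i b a",
    "ktb k u t u b",
    "qlm q a l a m",
]

n_words, n_prons, avg = lexicon_stats(sample)
print(n_words, n_prons, avg)
```

To run it on a real file under the same layout assumption, pass an open file handle, for example `lexicon_stats(open(path, encoding="utf-8"))`, where `path` points to the downloaded lexicon.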
- The phoneme-based Arabic speech lexicon has roughly 3.8 pronunciations per word. You can find it here. You can find two more grapheme lexicons here: