Speech Resources
We aim to make a difference in Arabic speech resources.
We share and support Arabic speech resources, and we encourage everyone to contribute to them.
Considerable effort has gone into producing spoken Arabic datasets. From the CallHome task (1996/97 NIST benchmark) to
the Global Autonomous Language Exploitation (GALE) program [2006-2009], many resources have been created.
Here we list some publicly available speech corpora.
Hours of Arabic Speech Data
Arabic Speech Recognition Resources
Benchmarks
MGB-2: More than 1,200 hours collected from Aljazeera TV, along with 130 million words from Aljazeera.net. The programs have been manually captioned with no timing information.
MGB-3: Egyptian Arabic speech recognition in the wild. Every sentence was annotated by four annotators. More than 15 hours have been collected from YouTube.
MGB-5: Moroccan Arabic speech recognition in the wild. We release 14 hours transcribed from YouTube along with 90 hours genre-labeled with no transcription.
QASR: To date, the largest transcribed Arabic speech corpus, with around 2,000 hours of multi-dialect and code-switching speech and multi-layer annotation.
ESCWA code switching: Collected over two days of meetings of the United Nations Economic and Social Commission for Western Asia (ESCWA) in 2019.
Dialectal Arabic Code-Switching Dataset: Includes the annotated two-hour Egyptian dataset from the ADI-5 development split in the MGB-3 challenge.
Arabic Dialect Identification Resources
ADI-5: More than 50 hours collected from Aljazeera TV. Covers four regional dialects, Egyptian (EGY), Levantine (LAV), Gulf (GLF), and North African (NOR), plus Modern Standard Arabic (MSA). This dataset is a part of the MGB-3 challenge.
ADI-17: More than 3,000 hours of multi-genre speech data collected from YouTube and labeled with one of 17 countries. This dataset is a part of the MGB-5 challenge.
Lexicon
The grapheme-based Arabic speech lexicon is a 1:1 word-to-grapheme mapping.
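As a rough illustration of what a 1:1 word-to-grapheme mapping can look like, the sketch below spells each word out as its sequence of characters. The function names and the "word followed by space-separated graphemes" line layout are assumptions made for this example; the released lexicon may use a different file format and may normalize or strip diacritics, which this sketch does not do.

```python
from typing import Iterable, List


def grapheme_entry(word: str) -> str:
    """Return one lexicon line: the word, then its characters separated by spaces.

    Assumption: every Unicode code point counts as one grapheme, so a
    combining diacritic (if present) becomes its own symbol.
    """
    return f"{word} {' '.join(word)}"


def build_lexicon(words: Iterable[str]) -> List[str]:
    """Build sorted, de-duplicated lexicon lines from a word list (hypothetical helper)."""
    return [grapheme_entry(w) for w in sorted(set(words))]


if __name__ == "__main__":
    # Example vocabulary; any word list extracted from transcripts would do.
    for line in build_lexicon(["كتاب", "قلم", "كتاب"]):
        print(line)
```

Because the mapping is purely orthographic, it needs no pronunciation dictionary: any word seen in the transcripts gets exactly one entry.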
Text to Speech
Broadcast News Arabic TTS data from QASR
Planning to contribute to the ArabicSpeech community?
Looking to make ArabicSpeech great, full of resources, and supportive of open source?
Join our community!