Automatic Speech Recognition
The Multi-Genre Broadcast (MGB) Challenge is an evaluation of speech recognition, speaker diarization, dialect detection and lightly supervised alignment using TV recordings in English and Arabic. More details about the MGB Challenge can be found here.
The 1,200 hours of Aljazeera TV programs were manually captioned without timing information. The QCRI Arabic ASR system was used to recognize all programs, and its output was used to align the manual captions and produce speech segments for training speech recognition systems. More than 20 hours from 2015 programs were transcribed verbatim and manually segmented. This data is split into a development set of 10 hours and an evaluation set of similar size (10 hours). Both the development and evaluation data were released in the 2016 MGB Challenge. You can download it here.
MGB-3 consists of 16 hours of multi-genre data collected from different YouTube channels, all of it manually transcribed. The Arabic dialect chosen for MGB-3 is Egyptian. Because dialectal Arabic has no orthographic rules, each program was transcribed by four different transcribers using this transcription guideline. The MGB-3 data is split into three groups: adaptation, development, and evaluation data, the last of which was shared at the evaluation. You can download it here.
More details about the Arabic version of the MGB can be found here.
The grapheme-based Arabic speech lexicon is a 1:1 word-to-grapheme mapping. You can find it here.
The phoneme-based Arabic speech lexicon is roughly a 3.8:1 word-to-phoneme mapping. You can find it here.
The grapheme-based Egyptian Arabic speech lexicon can be found here.
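To illustrate what an entry in a grapheme-based lexicon might look like, the sketch below spells each word out as its sequence of characters, with exactly one entry per word. The function names and the whitespace-separated output format are assumptions made for this example; the released lexicons may use a different layout or apply normalization not shown here.

```python
from typing import Iterable, List


def grapheme_pronunciation(word: str) -> List[str]:
    """Return the word's characters as its 'pronunciation' (hypothetical helper)."""
    return list(word)


def build_grapheme_lexicon(words: Iterable[str]) -> List[str]:
    """Build one 'word g1 g2 ... gn' line per unique word, in sorted order.

    The line format is an assumption for illustration only.
    """
    lines = []
    for word in sorted(set(words)):
        graphemes = " ".join(grapheme_pronunciation(word))
        lines.append(f"{word} {graphemes}")
    return lines


if __name__ == "__main__":
    # Each unique word yields exactly one entry (a 1:1 mapping).
    for line in build_grapheme_lexicon(["كتاب", "قلم", "كتاب"]):
        print(line)
```

Because the pronunciation is derived directly from the spelling, a lexicon of this kind needs no phonetic knowledge and has no out-of-vocabulary pronunciation problem, at the cost of leaving unwritten short vowels for the acoustic model to absorb.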
Arabic Dialect Identification
In this task, we classify Arabic speech into five dialects:
- Egyptian Arabic (EGY) covers the dialects of the Nile valley: Egypt and Sudan.
- Levantine Arabic (LAV) includes the dialects of Lebanon, Syria, Jordan and Palestine.
- Gulf Arabic (GLF) includes the dialects of Kuwait, United Arab Emirates, Bahrain, and Qatar. Saudi Arabia is typically included, although there is a wide range of sub-dialects within it. Omani Arabic is sometimes included as well.
- North African Arabic (NOR) - also known as Maghrebi - covers the dialects of Morocco, Algeria, Tunisia, and Mauritania. Libyan Arabic is sometimes included too.
- Modern Standard Arabic (MSA), which constitutes formal speech.
The Arabic Dialect Identification (ADI) classification assumes that each speech segment corresponds to one dialect.
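The single-label assumption can be sketched as follows: given one score per dialect for a segment, exactly one of the five labels is returned. The `Dialect` enum, the `pick_dialect` function, and the score dictionary are hypothetical names introduced for this example and are not part of any released ADI tooling.

```python
from enum import Enum
from typing import Mapping


class Dialect(Enum):
    """The five ADI classes, keyed by the codes used in the list above."""
    EGY = "Egyptian Arabic"
    LAV = "Levantine Arabic"
    GLF = "Gulf Arabic"
    NOR = "North African Arabic"
    MSA = "Modern Standard Arabic"


def pick_dialect(scores: Mapping[str, float]) -> Dialect:
    """Return the single highest-scoring dialect for one speech segment.

    `scores` maps each dialect code (e.g. "EGY") to a model score.
    A score must be present for every one of the five classes.
    """
    missing = {d.name for d in Dialect} - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return max(Dialect, key=lambda d: scores[d.name])


if __name__ == "__main__":
    segment_scores = {"EGY": 0.10, "LAV": 0.20, "GLF": 0.50, "NOR": 0.05, "MSA": 0.15}
    print(pick_dialect(segment_scores).name)
```

Taking the arg-max reflects the one-dialect-per-segment assumption; segments that mix dialects or switch between a dialect and MSA are not modeled by this scheme.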
There are three editions of the ADI challenge data: