MGB-3-Dataset

The Arabic track of the 2017 multi-dialect, multi-genre evaluation (speech recognition in the wild) extends the 2016 evaluation (MGB-2). In addition to the 1,200 hours of Aljazeera TV programs used in 2016, MGB-3 explores multi-genre data: comedy, cooking, cultural, environment, family-kids, fashion, movies-drama, sports, and science talks (TEDx). The MGB-3 Arabic data comprises 16 hours of multi-genre data collected from different YouTube channels, all of it manually transcribed. The Arabic dialect chosen for this year is Egyptian. Because dialectal Arabic has no orthographic rules, each program has been transcribed by four different transcribers following these transcription guidelines. The MGB-3 data is split into three groups: adaptation, development, and evaluation.

Because dialectal Arabic does not have a clearly defined orthography, different people tend to write the same word in slightly different forms. Rather than impose strict guidelines to enforce a standardized orthography, the transcription process allows variations in spelling: transcribers wrote the transcripts as they deemed correct, which yields multiple transcriptions per file. Every file has been segmented and transcribed by four different Egyptian annotators. The 80 YouTube clips have been manually labeled for speech and non-speech segments, and about 12 minutes from each program were selected for transcription. The resulting 16 hours of speech segments were then distributed into adaptation, development, and evaluation sets as follows:

Adaptation: 12 minutes * 24 programs
Development: 12 minutes * 24 programs
Evaluation: 12 minutes * 31 programs
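
The split above can be sanity-checked with a few lines of arithmetic. The sketch below assumes exactly 12 minutes per program (the text says "about 12 minutes"); the names `SPLITS`, `MINUTES_PER_PROGRAM`, and `split_minutes` are illustrative and not part of any MGB-3 tooling.

```python
# Program counts per set, as listed above.
SPLITS = {"adaptation": 24, "development": 24, "evaluation": 31}

# Assumption: every program contributes exactly 12 minutes.
MINUTES_PER_PROGRAM = 12


def split_minutes(splits, minutes_per_program):
    """Return the number of transcribed minutes in each set."""
    return {name: count * minutes_per_program for name, count in splits.items()}


minutes = split_minutes(SPLITS, MINUTES_PER_PROGRAM)
total_programs = sum(SPLITS.values())
total_hours = sum(minutes.values()) / 60

for name, value in minutes.items():
    print(f"{name}: {value} min ({value / 60:.1f} h)")
print(f"total: {total_programs} programs, {total_hours:.1f} h")
```

Under that assumption the listed counts sum to 79 programs and about 15.8 hours, close to the 80 clips and 16 hours quoted above.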

You can find samples here: audio, segmentation, transcription in Arabic, and transcription in Buckwalter.
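
For readers unfamiliar with Buckwalter transliteration, it maps each Arabic character to a single ASCII character. The sketch below is illustrative only: the table covers the base letters of the standard Buckwalter scheme (no diacritics), and `to_buckwalter` is a hypothetical helper rather than part of the MGB-3 release, whose files may follow a variant of the scheme.

```python
# Arabic -> Buckwalter mapping for the base letters (standard scheme).
# Assumption: diacritics and other symbols are left out for brevity.
ARABIC_TO_BUCKWALTER = {
    "ء": "'", "آ": "|", "أ": ">", "ؤ": "&", "إ": "<", "ئ": "}",
    "ا": "A", "ب": "b", "ة": "p", "ت": "t", "ث": "v", "ج": "j",
    "ح": "H", "خ": "x", "د": "d", "ذ": "*", "ر": "r", "ز": "z",
    "س": "s", "ش": "$", "ص": "S", "ض": "D", "ط": "T", "ظ": "Z",
    "ع": "E", "غ": "g", "ف": "f", "ق": "q", "ك": "k", "ل": "l",
    "م": "m", "ن": "n", "ه": "h", "و": "w", "ى": "Y", "ي": "y",
}


def to_buckwalter(text):
    """Transliterate Arabic letters; pass any other character through unchanged."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


print(to_buckwalter("مصر"))   # mSr
print(to_buckwalter("كتاب"))  # ktAb
```

Characters missing from the table (spaces, digits, Latin text) pass through unchanged, so a mixed-script line keeps its layout.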

You can find the MGB-3 ASR baseline system here.