DACS-Dataset

This release contains the annotated two-hour Egyptian dataset drawn from the ADI-5 development split of the MGB-3 challenge. The released MGB-3 data includes speech features and textual features extracted from ASR transcription.

Unlike MGB-3:EGY, the audio in this dataset was manually segmented into smaller utterances (split at silences of 500 msec or more), and the speech was transcribed verbatim by a lay native Egyptian speaker.

The transcribed data was then annotated with word-level Code-Switching (CS) information by 3 annotators. Following the guidelines described in the paper, the annotators were asked to classify each word into one of the following four categories: (i) MSA: MSA word with MSA pronunciation; (ii) EGY: Egyptian word; (iii) MIX: MSA word with dialectal pronunciation; and (iv) FRN: Foreign word, i.e., not Arabic. In addition, a 'NULL' tag was assigned when a word was unintelligible or could not be assigned to any of the four categories.
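To make the tag set concrete, the following is a minimal Python sketch of how word-level tags could be checked and summarised for one utterance. The in-memory layout (a list of (word, tag) pairs), the function names tag_distribution and switch_points, and the placeholder words are all assumptions made for illustration; they do not describe the file format of this release. Treating any change of tag between consecutive words as a switch point is likewise a simplification, not the definition used in the paper.

```python
from collections import Counter

# The four categories plus the fallback tag described above.
TAGS = {"MSA", "EGY", "MIX", "FRN", "NULL"}


def tag_distribution(tagged_words):
    """Count how often each tag occurs in one utterance.

    `tagged_words` is a list of (word, tag) pairs. This layout is a
    hypothetical one chosen for the example, not the release format.
    """
    counts = Counter()
    for word, tag in tagged_words:
        if tag not in TAGS:
            raise ValueError(f"unknown tag {tag!r} for word {word!r}")
        counts[tag] += 1
    return counts


def switch_points(tagged_words):
    """Return indices of words whose tag differs from the previous tagged word.

    Words tagged NULL are skipped, so they neither start nor end a switch.
    """
    points = []
    previous = None
    for index, (_, tag) in enumerate(tagged_words):
        if tag == "NULL":
            continue
        if previous is not None and tag != previous:
            points.append(index)
        previous = tag
    return points


# Placeholder utterance: w1..w6 stand in for real transcribed words.
example = [
    ("w1", "MSA"),
    ("w2", "MSA"),
    ("w3", "EGY"),
    ("w4", "NULL"),
    ("w5", "FRN"),
    ("w6", "EGY"),
]

distribution = tag_distribution(example)
switches = switch_points(example)
print(dict(distribution))
print(switches)
```

Skipping NULL words in switch_points is a design choice for this sketch: an unintelligible word carries no language information, so it is not counted as a switch in either direction.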

More details can be found here