QCRI Aljazeera Speech Resources: QASR

DATA DOWNLOAD will be available soon. Thank you for your interest.

QASR is, to date, the largest transcribed Arabic speech corpus, with around 2,000 hours of multi-dialect and code-switching speech and multi-layer annotation. The data was crawled from the Aljazeera news channel and comes with lightly supervised transcriptions and linguistically motivated segmentation. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data.
