QCRI Aljazeera Speech Resources: QASR

DATA DOWNLOAD will be available from the end of August 2021. Thank you for your interest.

QASR is, to date, the largest transcribed Arabic speech corpus, with around 2,000 hours of multi-dialect and code-switching speech and multi-layer annotation. The data is crawled from the Aljazeera news channel, with lightly supervised transcriptions and linguistically motivated segmentation. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data.
Some benchmark results using QASR data can be found in [1] and [2].
More details about QASR can be found in [1].