QASR-Dataset

QASR is a large transcribed Arabic speech corpus of around 2,000 hours with multi-layer annotation, covering multi-dialect and code-switching speech. The data was crawled from the Aljazeera news channel and comes with lightly supervised transcriptions and linguistically motivated segmentation. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other natural language processing modules for spoken data.

QASR Corpus Statistics and Annotation View
Total Hours: 2,041
No. of Episodes: 3,545
Segments: 1.6M
Avg. Episode Length: 32 min
Avg. Segment Length: 4 sec
No. of Speakers: 27,977
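
As a rough sanity check on how these figures relate to one another, the short sketch below derives the average episode and segment lengths implied by the totals. It is only an illustration: the constant and function names are hypothetical, the segment count is taken as exactly 1,600,000 even though the page reports a rounded "1.6M", and nothing here reflects how the corpus authors computed their averages.

```python
# Figures copied from the statistics above.
TOTAL_HOURS = 2041
NUM_EPISODES = 3545
NUM_SEGMENTS = 1_600_000  # assumption: "1.6M" is a rounded count


def implied_averages(total_hours, num_episodes, num_segments):
    """Return (minutes per episode, seconds per segment, segments per episode)
    implied by the corpus totals."""
    minutes_per_episode = total_hours * 60 / num_episodes
    seconds_per_segment = total_hours * 3600 / num_segments
    segments_per_episode = num_segments / num_episodes
    return minutes_per_episode, seconds_per_segment, segments_per_episode


ep_min, seg_sec, seg_per_ep = implied_averages(
    TOTAL_HOURS, NUM_EPISODES, NUM_SEGMENTS
)
print(f"Implied episode length: {ep_min:.1f} min")
print(f"Implied segment length: {seg_sec:.1f} s")
print(f"Implied segments per episode: {seg_per_ep:.0f}")
```

The implied values land in the same range as the reported 32 min and 4 sec averages. The published figures are rounded and the method behind the averages is not stated here, so an exact match should not be expected.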
Features
Learn more about QASR from our ACL 2021 presentation.