QASR-Dataset

QASR is a large transcribed Arabic speech corpus of around 2,000 hours with multi-layer annotation, covering multi-dialect and code-switching speech. The data is crawled from the Aljazeera news channel and comes with lightly supervised transcriptions and linguistically motivated segmentation. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other natural language processing modules for spoken data.

QASR Corpus Statistics and Annotation View

Total Hours: 2,041

No. of Episodes: 3,545

Segments: 1.6M

Avg. Episode Length: 32 min

Avg. Segment Length: 4 sec

No. of Speakers: 27,977

Features

Learn more about QASR from our ACL 2021 presentation.