QCRI Aljazeera Speech Resources: QASR
The data download will be available soon. Thank you for your interest.
QASR is, to date, the largest transcribed Arabic speech corpus: around 2,000 hours of multi-dialect and code-switching speech with multi-layer annotation. The data was crawled from the Aljazeera news channel and comes with lightly supervised transcriptions and linguistically motivated segmentation. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data.
Some benchmark results using QASR data can be found in [1] [2].
More details about QASR can be found in [1].
- [1] QASR: QCRI Aljazeera Speech Resource. A Large Scale Annotated Arabic Speech Corpus.
- [2] Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR.