QASR-Dataset

QASR is a large transcribed Arabic speech corpus of around 2,000 hours with multi-layer annotation, covering multi-dialect and code-switching speech. The data is crawled from the Aljazeera news channel and comes with lightly supervised transcriptions and linguistically motivated segmentation. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics-based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other natural language processing modules for spoken data.

QASR Corpus Statistics and Annotation View
Total Hours: 2,041
No. of Episodes: 3,545
Segments: 1.6M
Avg. Episode Length: 32 min
Avg. Segment Length: 4 sec
No. of Speakers: 27,977
Features

Segmentation

Linguistically motivated speech segments!

Speech segmentation is the process of dividing a continuous stream of speech into meaningful units, such as sentences, words, or syllables. For broadcast-domain speech, it is crucial that the audio is segmented into meaningful sentences. Since the main programs from Aljazeera were not segmented, we aligned the given transcription with the ASR words for the whole episode, taking into account many factors that we believe lead to better and more logical segmentation. As Fig. 1 shows, the final outcome is a human-like, linguistically motivated sentence segmentation.
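
The alignment step can be illustrated with a minimal sketch. It is not the QASR pipeline: the page does not say which alignment algorithm or which factors were used, so Python's standard `difflib.SequenceMatcher` stands in here, and the function name `transfer_timestamps`, the `(word, start, end)` tuple layout, and the sample times are all assumptions made for illustration. The idea is that transcript words which match ASR words inherit the ASR timestamps, which a segmenter could then use to place sentence boundaries.

```python
from difflib import SequenceMatcher


def transfer_timestamps(transcript_words, asr_words):
    """Give each transcript word the (start, end) time of its matching ASR word.

    asr_words is a list of (word, start_sec, end_sec) tuples (assumed layout).
    Transcript words without an exact match keep None.
    """
    asr_tokens = [word for word, _, _ in asr_words]
    # autojunk=False so frequent words are not discarded on long episodes.
    matcher = SequenceMatcher(a=transcript_words, b=asr_tokens, autojunk=False)
    times = [None] * len(transcript_words)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            _, start, end = asr_words[block.b + k]
            times[block.a + k] = (start, end)
    return times


# Hypothetical data: the ASR output missed the third transcript word.
transcript = ["مرحبا", "بكم", "في", "نشرة", "الأخبار"]
asr = [("مرحبا", 0.0, 0.5), ("بكم", 0.5, 0.9),
       ("نشرة", 1.2, 1.6), ("الأخبار", 1.6, 2.3)]
aligned = transfer_timestamps(transcript, asr)
```

In this sketch the unmatched word is left without a timestamp; a real system would have to interpolate or otherwise handle such gaps before cutting segments.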


Punctuation

First Spoken Arabic Corpus for Punctuation Restoration.

QASR is the first corpus for spoken Arabic punctuation restoration. We segment the utterances from the same speaker with a maximum window of 120 tokens. We then remove utterances with 6 words and no punctuation in the segment. For the task, we only keep the top 3 punctuation classes (‘,’, ‘?’ and ‘.’) and the rest are mapped to class ‘O’ representing no punctuation. The distribution of punctuation in QASR is highly imbalanced, which is expected of a spoken corpus.
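
The label mapping described above can be sketched as follows. This is an illustrative reading, not the authors' preprocessing code: it assumes punctuation arrives as standalone tokens, that the Arabic comma and question mark are folded into their ASCII counterparts, and that the 120-token window is a plain fixed-size split. The names `label_words` and `split_windows` are hypothetical.

```python
import unicodedata

MAX_WINDOW = 120                        # maximum window size stated in the text
KEPT_CLASSES = {",", "?", "."}          # top 3 punctuation classes
ARABIC_TO_ASCII = {"،": ",", "؟": "?"}  # assumption: fold Arabic marks


def is_punctuation(token):
    """True if every character of the token is a Unicode punctuation mark."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def label_words(tokens):
    """Return (word, label) pairs; the label is the kept mark after the word, else 'O'."""
    pairs = []
    for tok in tokens:
        if not is_punctuation(tok):
            pairs.append((tok, "O"))
            continue
        mark = ARABIC_TO_ASCII.get(tok, tok)
        if pairs and mark in KEPT_CLASSES:
            pairs[-1] = (pairs[-1][0], mark)
        # Any other mark is dropped, so the preceding word keeps its label.
    return pairs


def split_windows(pairs, max_tokens=MAX_WINDOW):
    """Split one speaker's labelled words into windows of at most max_tokens."""
    return [pairs[i:i + max_tokens] for i in range(0, len(pairs), max_tokens)]


labelled = label_words(["نعم", "،", "لماذا", "؟", "حسنا", "!", "شكرا", "."])
```

Because most words carry the label 'O', a model trained on such pairs sees the class imbalance noted above directly in its targets.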


Code-Switching

Spoken Arabic Corpus for Modelling Code-switching.

Code-switching is the practice of shifting between two or more languages or dialects within a single utterance or interaction. The QASR corpus has over 6,000 segments with intrasentential code-switching, where alternation between Arabic and English or French is seen. Even though code-switching occurs in only 0.4% of the full dataset, we notice 968 very short segments with frequent language alternation, such as: "عندي duplex جوا building بجنينه". These segments are useful for exploring the effect of such code-switching on the performance of speech and NLP models jointly.
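
One rough way to spot segments like the example above is to count script changes between consecutive words. The sketch below is a simple proxy and not the method used to annotate QASR: it assumes foreign words are transcribed in Latin script, as in the example, and the function names `script_of` and `count_switches` are hypothetical.

```python
import unicodedata


def script_of(word):
    """Return 'ar' or 'lat' based on the first Arabic or Latin letter, else None."""
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                return "ar"
            if name.startswith("LATIN"):
                return "lat"
    return None


def count_switches(segment):
    """Count Arabic/Latin script changes between consecutive words of a segment."""
    scripts = [s for s in map(script_of, segment.split()) if s is not None]
    return sum(1 for prev, cur in zip(scripts, scripts[1:]) if prev != cur)


example = "عندي duplex جوا building بجنينه"
switches = count_switches(example)  # 4: the script changes at every word boundary
```

A script-based count cannot tell English from French, nor catch foreign words written in Arabic script, so it only approximates the intrasentential code-switching described here.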

Learn more about QASR from our ACL 2021 presentation.