The QASR TTS challenge promotes research in Text-to-Speech (TTS) for Arabic, a language known for its rich morphology and its many dialects. In addition, Arabic is written in a consonantal script in which short vowels (diacritics) are usually omitted, adding a further layer of complexity to the task.
Arabic is well resourced for Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) tasks. However, resources for training and designing TTS systems, particularly high-quality recordings, remain scarce. This competition addresses the following challenges:
- How can large volumes of publicly available, yet imperfectly transcribed, speech be leveraged to train TTS? Here we use the QASR corpus, an open-source dataset of 2,000 hours of broadcast news. A significant portion of this data comes from anchor speakers recorded with a high-quality studio setup. We will provide metadata identifying the anchor speakers.
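As a minimal sketch of how anchor-speaker metadata might be used, the snippet below selects speakers with enough studio-quality speech for TTS training. The field names (`speaker`, `is_anchor`, `duration_s`) and the 5-hour threshold are illustrative assumptions; the released metadata may use a different schema.

```python
from collections import defaultdict

def anchors_with_min_hours(segments, min_hours=5.0):
    """Return anchor speakers whose total recorded speech meets the
    duration threshold. Field names are hypothetical placeholders for
    whatever schema the QASR metadata release actually uses."""
    totals = defaultdict(float)
    for seg in segments:
        if seg.get("is_anchor"):  # keep only studio anchor segments
            totals[seg["speaker"]] += seg["duration_s"]
    return {spk for spk, secs in totals.items() if secs / 3600 >= min_hours}

# Toy usage: speaker "A" is an anchor with 10 hours; "B" is not an anchor.
segments = [
    {"speaker": "A", "is_anchor": True, "duration_s": 36000},
    {"speaker": "B", "is_anchor": False, "duration_s": 36000},
]
selected = anchors_with_min_hours(segments)
```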
- How can less accurately transcribed broadcast-domain data be used? Broadcast transcripts often diverge from the audio, owing to edits made for clarity, paraphrasing, the removal of hesitations and disfluencies, and summarisation of passages such as overlapping speech. We release lightly supervised transcriptions together with confidence scores derived from alignment against several ASR systems.
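One simple way participants might exploit these confidence scores is to filter out utterances whose transcripts likely diverge from the audio. The sketch below assumes a per-utterance confidence in [0, 1]; the field names and the 0.9 threshold are illustrative, not part of the official release format.

```python
def filter_by_confidence(utterances, threshold=0.9):
    """Keep utterances whose transcript-audio alignment confidence meets
    the threshold, discarding segments where the broadcast transcript is
    likely edited, paraphrased, or summarised relative to the audio.
    The dict keys here are hypothetical placeholders."""
    return [u for u in utterances if u["confidence"] >= threshold]

# Toy usage: the second utterance is dropped as a likely paraphrase.
utterances = [
    {"utt_id": "seg_001", "text": "...", "confidence": 0.97},
    {"utt_id": "seg_002", "text": "...", "confidence": 0.55},
]
kept = filter_by_confidence(utterances)
```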
- How can the phoneme sequence of spoken text be restored? Since the original text carries no vowelization, we will provide two vowelized subsets: manual vowelization, which matches the exact pronunciation, and automatic vowelization, generated by Farasa, which may be grammatically correct but does not necessarily match what was actually spoken.
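Since the two subsets differ only in their diacritics, a natural sanity check is that both reduce to the same undiacritized surface text. The sketch below strips the Arabic short-vowel marks (harakat, U+064B through U+0652) so manual and automatic vowelizations of the same sentence can be compared; it is a simplified illustration, not the official evaluation procedure.

```python
# Arabic harakat (fathatan .. sukun) occupy U+064B-U+0652 as combining marks.
HARAKAT = {chr(c) for c in range(0x064B, 0x0653)}

def strip_diacritics(text):
    """Remove short-vowel diacritics, recovering the undiacritized
    surface form shared by manual and automatic vowelizations."""
    return "".join(ch for ch in text if ch not in HARAKAT)

# Toy usage: two differently vowelized forms of the same word.
manual = "كَتَبَ"   # manually vowelized ("kataba")
auto = "كَتْبَ"     # hypothetical automatic output with a different vowel pattern
same_surface = strip_diacritics(manual) == strip_diacritics(auto)
```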