QASR TTS Challenge 1.0
Semi-Supervised Broadcast News Text to Speech Challenge

The first edition of the QASR TTS challenge (QASR TTS 1.0) is designed to motivate research in multi-dialectal and multi-speaker TTS. The training data is collected from broadcast news. In this challenge, we target two voices, male and female; both are anchor speakers on the Aljazeera Arabic news channel. The challenge features two tracks, constrained and semi-constrained, based on TV data from Aljazeera.

QASR TTS 1.0 is an official challenge at the 2023 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2023).


Important Dates


May 16 2023: Team registration opens
June 26 2023: Team registration closes
June 26 2023: Test sentences released
June 26 2023: Open system submissions
Jul 1 2023: Deadline for participants to submit synthetic speech (Tentative)
Jul 3 2023: ASRU paper submission
Jul 10 2023: ASRU paper revision
Sep 18 2023: Release evaluation results
Dec 16-20 2023: ASRU workshop presentation

Challenge Background

 


The QASR TTS challenge promotes research in Text to Speech (TTS) for Arabic, a language known for its rich morphology and the presence of multiple dialects. Additionally, the consonantal nature of Arabic, where vowelization is often absent in written form, adds an extra layer of complexity to the task.


The Arabic language is well-resourced for Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) tasks. However, there are limited resources for training and designing TTS systems, especially in terms of high-quality recordings. This competition addresses the following challenges:



  • How to leverage large, publicly available – yet less accurate – transcriptions to train TTS? Here we plan to use the QASR corpus, an open-source dataset with 2,000 hours of broadcast news. A significant portion of this dataset comes from anchor speakers, whose recordings were made with a high-quality recording setup. We will provide anchor-speaker metadata.

  • How to use less accurately transcribed broadcast-domain data? Broadcast transcription does not match the audio in many cases, owing to edits made to enhance clarity, paraphrasing, the removal of hesitations and disfluencies, and summarization of passages such as overlapping speech. We release lightly supervised transcription with corresponding confidence scores based on alignment feedback from various ASR systems (see the data-selection sketch after this list).

  • How to restore the phoneme sequence for spoken text? Given that the original text has no vowelization, we will provide two vowelized subsets of the data: manual vowelization, which matches the exact pronunciation, and automatic vowelization generated by Farasa, which may be grammatically correct but does not necessarily match the pronounced speech.
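The exact release format for the lightly supervised transcriptions and their confidence scores will be documented with the data. As a minimal, non-authoritative sketch in Python (the tab-separated layout and file name below are hypothetical), confidence-based selection of training segments could look like:

import csv

def select_utterances(metadata_path, min_confidence=0.9):
    # Keep only segments whose ASR-alignment confidence is high enough.
    selected = []
    with open(metadata_path, encoding="utf-8") as f:
        for utt_id, confidence, text in csv.reader(f, delimiter="\t"):
            # Hypothetical columns: utterance ID, confidence score, transcription.
            if float(confidence) >= min_confidence:
                selected.append((utt_id, text))
    return selected

# Example call (the file name is a placeholder, not part of the official release):
# training_utts = select_utterances("track_b_metadata.tsv", min_confidence=0.95)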


 
Challenge Tracks

 


While we invite general submissions on this topic, the 1st QASR TTS challenge features a set of tracks focused on fully supervised and semi-supervised data selection for low-resource speech synthesis in a morphologically complex language.


Our challenge offers the following two tracks for building TTS systems:

Track A – Constrained Condition: Systems in this track can be trained using the dataset provided by the organizers. The provided dataset is manually transcribed and vowelized to match the pronunciation of the anchor speakers (roughly one hour each).


Track B – Semi-Constrained Condition: Systems in this track can be trained using the lightly supervised ASR transcription provided with the audio for the two anchor speakers. Participants in this track will also be given speaker metadata, including cross-episode speaker linking (roughly 10 hours each).


Optional Tracks

We also invite contributions on any relevant work in low-resource TTS problems, including, but not limited to:


  1. Transfer learning from high-resource languages

  2. Data selection/curation for under-resourced languages

  3. Multilingual TTS

  4. Dialectal TTS

  5. TTS for code-switching and mixed languages (e.g., pidgin and creole languages)

  6. TTS for endangered or extinct languages

  7. TTS for languages with no standard orthography

  8. TTS for languages with no orthographic rules

Challenge Data

 

Recordings are selected from two randomly chosen anchors as our main speakers, one male and one female. Anchor speakers typically speak clearly, and their recordings are generally made in high-quality studios with minimal surrounding noise. The original version of the QASR data is optimized for ASR with a 16 kHz sampling rate; for this TTS challenge, however, we will release the audio at a 22 kHz sampling rate.
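As a quick sanity check (a sketch assuming the released audio comes as WAV files; the soundfile package and the file name below are illustrative choices, not part of the official release), participants can confirm they are working with the 22 kHz TTS release rather than the 16 kHz ASR version:

import soundfile as sf

# Inspect one released recording; a 22 kHz release is typically 22,050 Hz.
info = sf.info("anchor_utterance.wav")  # placeholder file name
print(info.samplerate, info.channels, info.duration)
assert info.samplerate != 16000, "this looks like the 16 kHz ASR version of QASR"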

 

Track A – Constrained Condition data: This track uses one hour of data per anchor speaker, manually revised to ensure high-quality recordings as well as corrected transcription with full manual vowelization. More about the Constrained Condition data can be found here

 

Transcription: In addition to the audio, we provide verbatim transcription with manual vowelization that matches the pronounced speech rather than the grammar of the language.

 

Track B – Semi-Constrained Condition data: This track uses all the recordings of the two speakers in the QASR data. Speaker identification and speaker linking were labeled using the provided metadata and semi-supervised methods, as described here

Transcription: The transcription may not be verbatim. This is common in broadcast news for numerous reasons: dropped words; repetition or correction; spelling and grammar mistakes; code-switching across Arabic dialects or between languages; and, finally, overlapping speech segments. The text in this track is not vowelized. Participants are encouraged to consider open-source vowelization tools such as Camel or Farasa.
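As an example of automatic vowelization for the Track B text, the unofficial farasapy wrapper around Farasa could be used roughly as follows (a minimal sketch; the module and class names are assumptions based on the farasapy package and should be verified against its documentation, and camel_tools offers comparable utilities):

# Assumes `pip install farasapy`; verify module/class names against the farasapy docs.
from farasa.diacratizer import FarasaDiacritizer

diacritizer = FarasaDiacritizer()
# Track B sample (see below): non-verbatim, undiacritized broadcast transcription.
vowelized = diacritizer.diacritize("أشكر الأستاذ وائل قنديل الكاتب الصحفي")
print(vowelized)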

 

Audio and Transcription Samples

The sample shown from Track A is accurately transcribed and manually vowelized.
Transcription Sample from Track A:

نَشْكُرُ الأُسْتَاذْ وَائِلْ قَنْدِيلْ الكَاتِبْ الصَّحَفِيّْ

Transcription Sample from Track B:

  أشكر الأستاذ وائل قنديل الكاتب الصحفي ،  

The sample shown from Track B is as received from the metadata: the utterances are non-verbatim transcriptions, with no short vowels in the text.
Challenge Baseline, Evaluation and Rules

 

Baseline Systems

Most current TTS systems consist of three modules: a text front-end, an acoustic model, and a vocoder. We release software packages to enable research groups unfamiliar with the Arabic language or with TTS to build baseline systems.

Text front-end: The transcription is in Arabic UTF-8 script. The following notebook has multiple examples of Arabic text preprocessing, including text normalization, automatic text vowelization using the Farasa tools (which is essential for Track B), and an Arabic phonetizer.
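For illustration (a minimal sketch, not the notebook's exact code), typical Arabic normalization steps such as removing tatweel, unifying alef variants, and stripping diacritics can be written in plain Python:

import re

DIACRITICS = re.compile(r"[\u064B-\u0652]")  # short vowels, tanween, shadda, sukun
TATWEEL = "\u0640"                           # elongation character

def normalize_arabic(text):
    text = text.replace(TATWEEL, "")                         # drop tatweel
    text = re.sub(r"[\u0622\u0623\u0625]", "\u0627", text)   # unify alef variants
    return re.sub(r"\s+", " ", text).strip()                 # collapse whitespace

def remove_diacritics(text):
    # Useful, e.g., for comparing Track A (vowelized) and Track B (plain) text.
    return DIACRITICS.sub("", text)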

Acoustic models and vocoder: The following ESPnet recipe is optimized for the data in Track A using character-based models. This approach fine-tunes a pre-trained model using a one-hour dataset.
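For reference, running inference with a trained or fine-tuned ESPnet2 TTS model can be done directly from Python. The sketch below assumes an ESPnet2 installation; the model tag is a placeholder rather than an official challenge model:

import soundfile as sf
from espnet2.bin.tts_inference import Text2Speech

# Load a trained/fine-tuned model; replace the tag or config/checkpoint paths with your own.
tts = Text2Speech.from_pretrained("your_model_tag")  # placeholder

# Synthesize one vowelized Arabic sentence (the Track A sample) and write it to disk.
output = tts("نَشْكُرُ الأُسْتَاذْ وَائِلْ قَنْدِيلْ الكَاتِبْ الصَّحَفِيّْ")
sf.write("synthesized.wav", output["wav"].numpy(), tts.fs)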

 

Evaluation tasks

The output from each system will be evaluated using both objective and subjective evaluation metrics. The primary objective of this challenge is to assess the intelligibility and naturalness of the submitted systems' speech for the target speaker, which requires systematic subjective tests by human listeners. The naturalness of the synthesized speech will be judged by Mean Opinion Score (MOS), and its intelligibility by Word Error Rate (WER). More details will be available soon!
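As an illustration of the objective side, WER can be computed between the reference text and an ASR transcript of the synthesized audio, for example with the jiwer package (the ASR step itself is left out here; the hypothesis string is only an example):

import jiwer

reference = "أشكر الأستاذ وائل قنديل الكاتب الصحفي"
hypothesis = "اشكر الاستاذ وائل قنديل الكاتب الصحفي"  # e.g., ASR output on the synthesized audio

# Word Error Rate between the reference transcription and the recognized synthesized speech.
print(jiwer.wer(reference, hypothesis))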

 

Rules for all tasks

QASR TTS 1.0 is a closed competition, and only the provided data should be used for the submitted systems.

Pre-trained models: We encourage participants using pre-trained models to consider open-source models; we will list commonly used models. Participants may propose models that are not listed and should explain the details in their system paper.

 

Download Instructions

If you want to participate in this challenge, you need to register using this link. Upon completion of the form, a link to download the dataset will be sent.

 

License

We will use the same license agreement as the QASR corpus.

 

Ethics Statement

Since the developed models could synthesize speech that maintains speaker identity, they carry potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker. Prospective participants in this challenge are advised to ensure that synthesized speech is not used to impersonate the target speakers.

 
Organizers
Ahmed Ali (Qatar Computing Research Institute)
Soumi Maiti (Carnegie Mellon University)
Shinnosuke Takamichi (University of Tokyo)
Shammur Chowdhury (Qatar Computing Research Institute)
Hamdy Mubarak (Qatar Computing Research Institute)
Ahmed Abdel Ali (Qatar Computing Research Institute)
Massa Baali (Carnegie Mellon University)
Bhiksha Ramakrishnan (Carnegie Mellon University)

Advisors
Shinji Watanabe (Carnegie Mellon University)
Simon King (University of Edinburgh)

Contacts
info@arabicspeech.org