
Broadcast News Arabic Text to Speech

 

Speech synthesis for low-resource languages

 

Speech synthesis, also known as text to speech (TTS), is one of the most active areas in artificial intelligence. It generates natural, human-like speech from text input. Many studies have built TTS systems for high-resource languages. In contrast, low-resource languages, including Arabic, have very few TTS systems because of the lack of resources. In our study we propose a method for building TTS in such a low-resource scenario, covering data collection and pre-training/fine-tuning strategies for TTS training, with broadcast news as a case study. We fine-tune a pre-trained English Tacotron2 model on one hour of broadcast recordings. We then build a FastSpeech2-based Conformer model, using this fine-tuned Arabic TTS model as a teacher model. Figure 1 shows the whole pipeline.

 

 

Fig 1. TTS Pipeline

Data Selection 

 

Building a corpus is a challenging task, and a corpus is only useful if models trained on it perform well. To identify the good recordings in the MGB2 dataset, on which we ran our experiments, we hired a professional linguist to listen to all segments from the two best anchor speakers we selected and classify them into six classes, as shown in figure 2.

 

This process is demanding and time-consuming. We therefore explored MOSNet, a deep learning-based assessment model that predicts human ratings of converted speech.

 

Fig 2. Manual Classification for speech segments
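
As an illustration of how predicted quality scores could replace manual listening, here is a minimal sketch that keeps only the segments whose predicted MOS reaches a threshold. The segment ids, the scores, and the 3.5 threshold are hypothetical values chosen for the example. The scores are assumed to come from an assessment model such as MOSNet, which is not called here.

```python
from typing import Dict, List


def select_segments(predicted_mos: Dict[str, float], threshold: float = 3.5) -> List[str]:
    """Return ids of segments whose predicted MOS is >= threshold, best first.

    predicted_mos maps a segment id to a score on the 1-5 MOS scale.
    The threshold of 3.5 is an illustrative assumption, not a value from the study.
    """
    kept = [(score, seg_id) for seg_id, score in predicted_mos.items() if score >= threshold]
    # Sort by descending score; break ties by segment id for a stable order.
    kept.sort(key=lambda pair: (-pair[0], pair[1]))
    return [seg_id for _, seg_id in kept]


# Hypothetical scores for three segments.
scores = {"seg_001": 4.1, "seg_002": 2.7, "seg_003": 3.6}
selected = select_segments(scores, threshold=3.5)
print(selected)
```

Raising the threshold trades corpus size for quality, so in practice it would be tuned against the amount of data needed for fine-tuning.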

Vowelization 

Our text had no vowelization (also known as diacritization), which makes the problem very challenging. Moreover, diacritizing transcribed speech differs from diacritizing normal written text: transcribed speech contains corrections, hesitations, and repetitions, which need more attention from current text diacritizers. Additionally, the diacritized text should match the speakers' actual pronunciation of words, even when that pronunciation is not grammatically correct. We ran experiments with vowelization and found that it significantly improved the performance of the model.

Fig 3. Applying vowelization
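
To make the difference between vowelized and unvowelized text concrete, the sketch below strips Arabic diacritics from a string and counts them. It is not the diacritizer used in the study. It only shows what information is missing from unvowelized input. The helper names and the example word are assumptions for illustration, and the diacritic set is limited to the Unicode range U+064B to U+0652.

```python
# Arabic diacritics: tanwin forms, fatha, damma, kasra, shadda, sukun (U+064B..U+0652).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}


def strip_diacritics(text: str) -> str:
    """Remove Arabic diacritic marks, leaving the base letters untouched."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


def count_diacritics(text: str) -> int:
    """Count how many diacritic marks the text carries."""
    return sum(1 for ch in text if ch in ARABIC_DIACRITICS)


# Example word "kataba" (he wrote): kaf, ta, ba, each followed by a fatha.
vowelized = "\u0643\u064e\u062a\u064e\u0628\u064e"
unvowelized = strip_diacritics(vowelized)

print(vowelized, count_diacritics(vowelized))
print(unvowelized, count_diacritics(unvowelized))
```

Without the three fatha marks, the same three letters can be read as several different words, which is the ambiguity a TTS model trained on unvowelized text has to resolve on its own.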

Measuring TTS quality

The quality of a text to speech system is usually measured by the Mean Opinion Score (MOS), a scoring technique for subjective speech quality evaluation. We used 100 sentences of evaluation data for the subjective evaluation. Each of the ten subjects evaluated 100 samples from each model (400 samples in total) and rated the intelligibility and naturalness of each sample on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. Our FastSpeech2-based Conformer model, which uses the fine-tuned Arabic Transformer TTS model as a teacher model, achieved a MOS of 4.4 for intelligibility and 4.2 for naturalness.
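
A MOS is the arithmetic mean of the individual ratings. With ten subjects each rating 100 samples per system, each system would receive 1,000 ratings per criterion. The sketch below computes a MOS from a list of ratings together with a 95% confidence half-width based on a normal approximation. The confidence interval is an addition for illustration and is not reported in the study. The function name and the sample ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List, Tuple


def mos_with_ci(ratings: List[int], z: float = 1.96) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval).

    ratings must be integers on the 5-point scale (1 = bad ... 5 = excellent).
    The interval uses a normal approximation, which is an assumption of this sketch.
    """
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    score = mean(ratings)
    # The sample standard deviation needs at least two ratings.
    half_width = z * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return score, half_width


# Hypothetical intelligibility ratings for one system.
example_ratings = [5, 4, 4, 5, 4]
score, half_width = mos_with_ci(example_ratings)
print(f"MOS = {score:.2f} +/- {half_width:.2f}")
```

Reporting the interval alongside the mean helps judge whether a difference between two systems is larger than the rater noise.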

Model list

 

  • Ground truth: natural speech
  • FastSpeech2 with the fine-tuned Transformer as the teacher model, with vowelization and reduction factor = 1
  • FastSpeech2 with the fine-tuned Transformer as the teacher model, without vowelization and reduction factor = 1
  • FastSpeech2 with the fine-tuned Transformer as the teacher model, with vowelization, with Parallel WaveGAN (PWG) and reduction factor = 1
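text: وأشكر ضيفنا في الأستوديو الكاتب الصحفي الأستاذ محمد القدوسي (English: "And I thank our guest in the studio, the journalist and writer Mr. محمد القدوسي")
text: من خلال أربعة محاور رئيسية (English: "through four main themes")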

Code: https://github.com/espnet/espnet/tree/master/egs2/qasr_tts/tts1

 
