Several high-resource text-to-speech (TTS) systems now produce natural, human-like speech. In contrast, low-resource languages, including Arabic, have very few TTS systems because of the lack of resources. We propose a fully unsupervised method for building TTS systems, including automatic data selection and pre-training/fine-tuning strategies for TTS training, using broadcast news as a case study. We show that a carefully selected, smaller dataset can yield a TTS system that generates more natural speech than a system trained on a larger dataset. Our approach has two parts: 1) data: we apply automatic annotation using DNSMOS, automatic vowelization, and automatic speech recognition (ASR) to fix transcription errors; 2) model: we apply transfer learning from a TTS model trained on a high-resource language, fine-tune it on one hour of broadcast recordings, and then use this model to guide the duration modeling of a FastSpeech2-based Conformer model. In our objective evaluation, the system reaches a 3.9% character error rate (CER), compared with 1.3% CER for the ground truth. In the subjective evaluation, on a scale where 1 is bad and 5 is excellent, our FastSpeech2-based Conformer model achieved a mean opinion score (MOS) of 4.4 for intelligibility and 4.2 for naturalness, and many annotators recognized the voice of the broadcaster, which supports the effectiveness of our proposed unsupervised method.



Model Overview: 



Overview of our model. The input is Arabic text, which we feed to the Transformer TTS model to produce a mel-spectrogram. We then train FastSpeech2 from scratch on 1 hour of data, guided by this Transformer as the teacher model. Finally, we train a Parallel WaveGAN vocoder to generate speech from the predicted mel-spectrogram.
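As a rough illustration of how a teacher model can guide duration, the sketch below derives per-token durations from a teacher attention matrix by assigning each mel frame to the input token it attends to most. This follows the approach of the original FastSpeech work and is only an assumption about our setup; the function name `durations_from_attention` and the toy attention values are hypothetical.

```python
from typing import List


def durations_from_attention(attention: List[List[float]]) -> List[int]:
    """Count, for each input token, how many mel frames attend to it most.

    `attention` has one row per mel frame and one column per input token
    (an assumed layout for the teacher's encoder-decoder attention).
    """
    if not attention:
        return []
    num_tokens = len(attention[0])
    durations = [0] * num_tokens
    for frame_weights in attention:
        # Assign the frame to the token with the largest attention weight.
        best_token = max(range(num_tokens), key=lambda i: frame_weights[i])
        durations[best_token] += 1
    return durations


# Toy attention matrix: 6 mel frames over 3 input tokens (made-up values).
toy_attention = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.1, 0.7, 0.2],
    [0.0, 0.2, 0.8],
    [0.0, 0.1, 0.9],
    [0.0, 0.3, 0.7],
]
toy_durations = durations_from_attention(toy_attention)
print(toy_durations)  # [2, 1, 3]
```

The durations always sum to the number of mel frames, which is what a duration predictor needs as a training target.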






Fig 1. TTS Pipeline

Data Selection:



Building a corpus is a challenging task, and a corpus is only useful if models trained on it perform well. We ran our experiments on the MGB2 dataset. To identify its good recordings, we hired a professional linguist to listen to all segments from the two best anchor speakers we selected and classify them into six classes, as shown in Fig 2.


This manual process is demanding and time-consuming. We therefore explored MOSNet, a deep learning-based assessment model that predicts human ratings of converted speech. We then explored other automatic methods, such as wvMOS and DNSMOS, to select better speech samples.
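The selection step can be sketched as a simple filter over predicted quality scores. The example below assumes each segment already has a score from an automatic predictor such as DNSMOS; the `Segment` class, the threshold of 3.8, and the 25-second budget are hypothetical values chosen for illustration, not the settings used in our experiments.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    segment_id: str
    duration_sec: float
    predicted_mos: float  # assumed to be precomputed by an automatic predictor


def select_segments(
    segments: List[Segment], min_mos: float, max_total_sec: float
) -> List[Segment]:
    """Keep the best-scoring segments above `min_mos` within a time budget."""
    kept: List[Segment] = []
    total_sec = 0.0
    for seg in sorted(segments, key=lambda s: s.predicted_mos, reverse=True):
        if seg.predicted_mos < min_mos:
            break  # all remaining segments score lower
        if total_sec + seg.duration_sec > max_total_sec:
            continue  # skip segments that would exceed the budget
        kept.append(seg)
        total_sec += seg.duration_sec
    return kept


# Made-up segments for illustration only.
candidates = [
    Segment("a", 10.0, 4.1),
    Segment("b", 8.0, 3.2),
    Segment("c", 12.0, 4.5),
    Segment("d", 9.0, 3.9),
]
selected = select_segments(candidates, min_mos=3.8, max_total_sec=25.0)
print([seg.segment_id for seg in selected])  # ['c', 'a']
```

Sorting by score before applying the budget favors quality over quantity, in line with the observation that a smaller, cleaner dataset can outperform a larger one.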







Fig 2. Manual Classification for speech segments



Our text had no vowelization (also known as diacritization), which makes the problem very challenging. Moreover, transcribed speech differs from ordinary written text: it contains corrections, hesitations, and repetitions, which need more attention from current text diacritizers. Additionally, the diacritized text should match the speakers' actual pronunciation of words, even when that pronunciation is not grammatically correct. In our experiments, vowelization significantly improved the performance of the model.
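To make the difference between the two input conditions concrete, the sketch below removes Arabic diacritic marks from a vowelized string, which is one way to derive an unvowelized version of the same text. It is an illustration only, not our diacritization pipeline; the helper name `strip_diacritics` is hypothetical, and the character range covers the common marks U+064B to U+0652 plus the superscript alef U+0670.

```python
import re

# Common Arabic diacritics: tanween, short vowels, shadda, sukun (U+064B-U+0652)
# and the superscript alef (U+0670).
ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with Arabic diacritic marks removed."""
    return ARABIC_DIACRITICS.sub("", text)


def count_diacritics(text: str) -> int:
    """Count diacritic marks, a quick check for whether text is vowelized."""
    return len(ARABIC_DIACRITICS.findall(text))


# "kataba" written with a fatha on each of its three letters.
vowelized = "\u0643\u064E\u062A\u064E\u0628\u064E"
unvowelized = strip_diacritics(vowelized)
print(vowelized, unvowelized, count_diacritics(vowelized))
```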



Fig 3. Applying vowelization


Measuring TTS quality:


The quality of a text-to-speech system is usually measured by the mean opinion score (MOS), a scoring technique for speech quality evaluation. We used 100 evaluation sentences for the subjective evaluation. Each of the ten subjects evaluated 100 samples from each model (400 samples in total) and rated the intelligibility and naturalness of each sample on a 5-point scale: 5 for excellent, 4 for good, 3 for fair, 2 for poor, and 1 for bad. Our FastSpeech2-based Conformer model, which uses the fine-tuned Arabic Transformer TTS model as a teacher model, achieved a MOS of 4.4 for intelligibility and 4.2 for naturalness.
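For reference, a MOS is the arithmetic mean of the individual ratings. The sketch below also reports an approximate 95% confidence interval using a normal approximation, which is our assumption and not necessarily how the reported scores were analyzed. The ratings in the example are made up and are not data from our evaluation.

```python
import math
from statistics import mean, stdev
from typing import List, Tuple


def mos_with_ci(ratings: List[int]) -> Tuple[float, float]:
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if not ratings:
        raise ValueError("at least one rating is required")
    score = mean(ratings)
    if len(ratings) < 2:
        return score, 0.0
    # Normal approximation: 1.96 standard errors on each side of the mean.
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings))
    return score, half_width


# Hypothetical ratings on the 1-5 scale.
example_ratings = [5, 4, 3, 4, 4]
example_mos, example_half_width = mos_with_ci(example_ratings)
print(f"MOS = {example_mos:.2f} +/- {example_half_width:.2f}")
```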


Model list:

  1. Ground truth: natural speech
  2. FastSpeech2 with fine-tuned Transformer as the teacher model, with vowelization and reduction factor = 1
  3. FastSpeech2 with fine-tuned Transformer as the teacher model, without vowelization and reduction factor = 1
  4. FastSpeech2 with fine-tuned Transformer as the teacher model, with vowelization, with PWG and reduction factor = 1
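Alongside MOS, the abstract reports an objective character error rate (CER). A minimal sketch of the metric is given below: the character-level edit distance between a reference text and a hypothesis, divided by the reference length. We assume here that the hypothesis would be an ASR transcript of the synthesized speech; the function names and the English example strings are illustrative only.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            current.append(
                min(
                    previous[j] + 1,  # deletion
                    current[j - 1] + 1,  # insertion
                    previous[j - 1] + (ref_char != hyp_char),  # substitution
                )
            )
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of reference characters."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


print(character_error_rate("kitten", "sitting"))  # 0.5
```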



This video shows the speech generated by our model followed by the real speech. 


Ethics Statement



Because our models can synthesize speech that preserves speaker identity, they carry a risk of misuse, such as spoofing voice identification or impersonating a specific speaker. We conducted the experiments under the assumption that the user agrees to be the target speaker in speech synthesis. If the model is generalized to unseen speakers in the real world, it should be accompanied by a protocol to ensure that speakers approve the use of their voice, as well as a synthesized speech detection model.