Abstract:

The presentation will focus on an important research topic developed over the last decade by the Speech Processing Research Group at the Technical University of Cluj-Napoca: text-to-speech synthesis for Romanian. While earlier achievements in this field relied on diphone concatenation or statistical methods, much effort has also been dedicated to speech synthesis using Deep Neural Networks (DNNs). Starting from a successful end-to-end approach, that is, training the network only on text-audio pairs without any other text annotation, we show that there is still room to improve speech quality from two perspectives: the text processing modules and the acoustic modelling. First, improvements were obtained through several text annotations (phonetic transcription, syllabification, lexical stress positioning, POS tagging) and even through higher-level representations such as text style information. The methods and performance of these text processing modules are presented. Second, a number of investigations experimented with various neural network architectures for acoustic modelling in order to enhance speech quality, for example Tacotron 2 for expressive speech synthesis, an improved DCTTS implementation for speaker adaptation, and Tacotron for speech synthesis trained with imperfect data. The presentation concludes with directions for further work.