Voice Conversion System Based on Deep Neural Network Capable of Parallel Computation
PubDate: August 2018
Teams: The University of Tokyo
Writers: Kunihiko Sato; Jun Rekimoto
Voice conversion (VC) algorithms modify the speech of one speaker so that it resembles that of another. Many existing virtual reality (VR) and augmented reality (AR) systems let users change their appearance; adding VC would let them change their voice as well. State-of-the-art VC methods employ recurrent neural networks (RNNs), including long short-term memory (LSTM) networks, to generate converted speech. However, RNNs are difficult to parallelize because the computation at each timestep depends on the result of the previous timestep, which prevents them from operating in real time. In contrast, we propose a novel VC approach based on a dilated convolutional neural network (Dilated CNN), a deep neural network model that allows for parallel computation. We adapted the Dilated CNN to perform convolutions in both the forward and reverse directions so that training succeeds. In addition, to ensure that the model can be parallelized during both training and inference, we developed an architecture that predicts all output values directly from the input speech and does not feed predicted values back in as the next input. The results demonstrate that the proposed VC approach converts speech faster than state-of-the-art methods while slightly improving speech quality and maintaining speaker similarity.
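To make the parallelism argument concrete, the following is a minimal pure-Python sketch of a dilated convolution that looks in both the forward and reverse directions. It is an illustration only, not the authors' architecture: the function names `dilated_conv1d` and `receptive_field`, the single channel, the kernel size of 3, and the dilation schedule (1, 2, 4) are all assumptions chosen for brevity. The point it shows is that every output sample is computed from input samples alone, never from a previously predicted output, so all timesteps could be evaluated independently.

```python
def dilated_conv1d(x, weights, dilation):
    """Non-causal dilated 1-D convolution with zero padding.

    x        : list of floats (a single input channel, assumed for brevity)
    weights  : odd-length list of kernel taps, centred on the current timestep
    dilation : spacing between neighbouring taps
    """
    half = len(weights) // 2
    n = len(x)
    out = []
    # Each iteration reads only from x, never from out, so the iterations
    # are independent and could run in parallel (unlike an RNN recurrence).
    for t in range(n):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t + (k - half) * dilation  # taps reach backward and forward
            if 0 <= idx < n:                 # out-of-range taps act as zeros
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Number of input samples that can influence one output sample."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Feed a unit impulse through a hypothetical three-layer stack.
x = [0.0] * 16
x[8] = 1.0
h = x
for d in (1, 2, 4):  # assumed dilation schedule
    h = dilated_conv1d(h, [1.0, 1.0, 1.0], d)

# Indices where the impulse is still visible after the stack.
support = [i for i, v in enumerate(h) if v != 0.0]
```

With these assumed settings the impulse at index 8 spreads symmetrically to indices 1 through 15, which matches a receptive field of 15 samples. Doubling the dilation per layer grows the context exponentially with depth while each layer stays a fixed-size, parallelizable operation.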