Title

A Study of Japanese Mixed Emotional Speech Synthesis Based on an End-to-End Emotional Speech Synthesis Model

2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Authors

Issei Sakata and Tetsuo Kosaka
Graduate School of Science and Engineering, Yamagata University, Yonezawa, Japan

Abstract

In this study, we investigated mixed-emotion speech synthesis for Japanese using Emotional-VITS, an emotional speech synthesis model built on VITS, an end-to-end speech synthesis model. With respect to the pretraining data, we compared a model trained on two languages (Japanese and Chinese) with a model trained only on Japanese. In both subjective and objective evaluations of synthesized speech quality, the Japanese-only model achieved higher naturalness than the bilingual model and synthesized the input text more accurately. We also examined the emotional features supplied to the synthesizer and found that the average of the emotional features of speakers other than the target speaker yields a higher emotion recognition rate than the average of the emotional features extracted from the target speech or the average of the emotional features of the target speaker's own speech. Furthermore, in mixed-emotion synthesis, in which two emotions are blended, the emotion perceived by listeners changed consistently with the mixing ratio, indicating that mixed-emotion expression close to human perception is possible.
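
To make the three emotion-feature configurations examined in the paper (Target Speech, Speaker's Ave., and Other's Ave.) concrete, the sketch below shows one way the averaged embeddings could be formed; the data layout and function names are assumptions, not the authors' implementation.

    import numpy as np

    # Assumed layout: embeddings[speaker][emotion] is an array of per-utterance
    # emotion embeddings with shape (num_utterances, dim).

    def target_speech_feature(embeddings, speaker, emotion, utt_index):
        # "Target Speech": the emotional feature taken from the target
        # (reference) utterance itself.
        return embeddings[speaker][emotion][utt_index]

    def speakers_average(embeddings, speaker, emotion):
        # "Speaker's Ave.": mean over the target speaker's own utterances
        # of the given emotion.
        return embeddings[speaker][emotion].mean(axis=0)

    def others_average(embeddings, target_speaker, emotion):
        # "Other's Ave.": mean over all other speakers' utterances of the
        # given emotion.
        others = [embeddings[s][emotion] for s in embeddings if s != target_speaker]
        return np.concatenate(others, axis=0).mean(axis=0)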

Model Architecture

Model structure overview
Fig. 1: Overall architecture of Emotional-VITS

Speech Synthesis Evaluation Results

Results of MOS, UTMOSv2, MCD and CER evaluation experiments
Model                      Configuration    MOS    UTMOSv2  MCD [dB]  CER [%]
Ground Truth               -                4.909  2.514    -         -
Japanese + Chinese model   Target Speech    2.485  1.298    8.071     11.38
Japanese + Chinese model   Speaker's Ave.   2.523  1.419    7.889     10.29
Japanese + Chinese model   Other's Ave.     2.644  1.313    8.259     11.55
Japanese model             Target Speech    3.000  2.068    7.805      8.078
Japanese model             Speaker's Ave.   3.083  2.397    7.732      7.176
Japanese model             Other's Ave.     3.295  2.017    7.884      6.771
(MOS and UTMOSv2: higher is better; MCD and CER: lower is better.)
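
For reference, the sketch below shows one common way to compute the two objective metrics, assuming the mel-cepstral sequences are already extracted and time-aligned (e.g., via dynamic time warping) and that the CER hypothesis is an ASR transcript of the synthesized speech; the function names and details are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def mel_cepstral_distortion(mcep_ref, mcep_syn):
        # Frame-averaged MCD in dB between two time-aligned mel-cepstral
        # sequences of shape (frames, dims); the 0th (energy) coefficient
        # is excluded, as is common practice.
        diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
        per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return (10.0 / np.log(10.0)) * float(np.mean(per_frame))

    def character_error_rate(ref, hyp):
        # CER = Levenshtein edit distance between the reference transcript
        # and the recognized transcript, divided by the reference length.
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=np.int32)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i, j] = min(d[i - 1, j] + 1,          # deletion
                              d[i, j - 1] + 1,          # insertion
                              d[i - 1, j - 1] + cost)   # substitution
        return d[len(ref), len(hyp)] / len(ref)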

Audio Samples: Japanese+Chinese model vs. Japanese model

Select an emotion vector and a speaker to compare playback between the two models.

Japanese+Chinese Model

Japanese Model

Mixed-Emotion Speech Synthesis

Select a combination of two emotions (e.g., Joy with Surprise or Sadness with Surprise) and a speaker to listen to synthesized mixed-emotion speech at mixing ratios α = 0, 0.25, 0.5, 0.75, and 1.0.
The mixed emotional feature is calculated as Emotion1 × α + Emotion2 × (1 − α). Selecting an emotion pair lists every α together with its corresponding audio and mixing coefficients.
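
As an illustration of the mixing formula above, here is a minimal sketch that interpolates two emotion embeddings; the embedding dimensionality, variable names, and the commented-out synthesis call are assumptions rather than the actual Emotional-VITS interface.

    import numpy as np

    # Hypothetical per-emotion embeddings from the emotion encoder
    # (the 1024-dimensional size is an assumption for illustration).
    emotion1 = np.random.randn(1024).astype(np.float32)  # e.g., Joy
    emotion2 = np.random.randn(1024).astype(np.float32)  # e.g., Surprise

    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Mixing rule from above: Emotion1 * alpha + Emotion2 * (1 - alpha)
        mixed = alpha * emotion1 + (1.0 - alpha) * emotion2
        # The mixed vector would replace the single emotion embedding at
        # synthesis time, e.g. (hypothetical call):
        # wav = synthesizer.infer(text, speaker_id, emotion_embedding=mixed)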