Title
A Study of Japanese Mixed Emotional Speech Synthesis Based on an End-to-End Emotional Speech Synthesis Model
2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)
Abstract
In this study, we examine mixed-emotion speech synthesis for Japanese using Emotional-VITS, an emotional speech synthesis model built on VITS, an end-to-end speech synthesis model. With respect to the pretraining data, we compare a model trained on two languages (Japanese and Chinese) against a model trained on Japanese only. In both subjective and objective evaluations, the Japanese-only model produces more natural speech than the bilingual model and renders the input text more accurately. We also investigate the choice of emotional features used for synthesis: averaging the emotional features of speakers other than the target speaker yields a higher emotion recognition rate than using the features extracted from the target utterance itself or the average over the target speaker's own utterances. Finally, in mixed-emotion synthesis, where two emotions are blended, the emotion perceived by listeners shifts consistently with the mixing ratio, indicating that the synthesized speech can convey emotions close to human perception.
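As a concrete illustration of the three emotion-feature configurations compared in the paper, the sketch below assumes each utterance yields a fixed-dimensional emotion embedding; the speaker names, the 1024-dimensional feature size, and the random data are hypothetical, not the paper's exact setup.

```python
import numpy as np

# Hypothetical data: per-speaker emotion features for one emotion category,
# each row being the embedding of one utterance from an emotion encoder.
feats_by_speaker = {
    "spk_A": np.random.randn(20, 1024),
    "spk_B": np.random.randn(20, 1024),
    "spk_C": np.random.randn(20, 1024),
}
target_utt_feat = np.random.randn(1024)  # feature of the utterance to resynthesize

# (1) "Target Speech": use the target utterance's own feature as-is.
target_speech = target_utt_feat

# (2) "Speaker's Ave.": average over the target speaker's own utterances.
speakers_ave = feats_by_speaker["spk_A"].mean(axis=0)

# (3) "Other's Ave.": average over all speakers except the target speaker;
#     the paper reports this configuration gives the highest emotion
#     recognition rate.
others = np.concatenate(
    [f for s, f in feats_by_speaker.items() if s != "spk_A"], axis=0
)
others_ave = others.mean(axis=0)
```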
Model Architecture
Speech Synthesis Evaluation Results
| Model | Emotion Feature | MOS (higher is better) | UTMOSv2 (higher is better) | MCD [dB] (lower is better) | CER [%] (lower is better) |
|---|---|---|---|---|---|
| Ground Truth | – | 4.909 | 2.514 | – | – |
| Japanese + Chinese model | Target Speech | 2.485 | 1.298 | 8.071 | 11.38 |
| | Speaker's Ave. | 2.523 | 1.419 | 7.889 | 10.29 |
| | Other's Ave. | 2.644 | 1.313 | 8.259 | 11.55 |
| Japanese model | Target Speech | 3.000 | 2.068 | 7.805 | 8.078 |
| | Speaker's Ave. | 3.083 | 2.397 | 7.732 | 7.176 |
| | Other's Ave. | 3.295 | 2.017 | 7.884 | 6.771 |
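Of the objective metrics above, mel-cepstral distortion (MCD) can be sketched as follows. This is a minimal sketch assuming the mel-cepstral sequences of the reference and synthesized utterances have already been time-aligned (e.g., by DTW) and that the 0th (energy) coefficient is excluded, a common convention that may differ from the paper's exact setup.

```python
import numpy as np

def mel_cepstral_distortion(ref_mcep: np.ndarray, syn_mcep: np.ndarray) -> float:
    """MCD in dB between two time-aligned mel-cepstral sequences.

    ref_mcep, syn_mcep: arrays of shape (num_frames, num_coeffs),
    assumed already aligned frame-by-frame (e.g., via DTW) and with
    the 0th (energy) coefficient already dropped.
    """
    diff = ref_mcep - syn_mcep                             # (T, D)
    frame_dist = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))  # per-frame distance
    return (10.0 / np.log(10.0)) * frame_dist.mean()       # average over frames
```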
Audio Samples: Japanese+Chinese model vs. Japanese model
Select an emotion vector and a speaker to compare playback between the two models.
Mixed-Emotion Speech Synthesis
Select a combination of two emotions (e.g., Joy with Surprise, or Sadness with Surprise) and a speaker to listen to synthesized mixed-emotion speech at mixing ratios α = 0, 0.25, 0.5, 0.75, and 1.0.
The mixed emotion feature is computed as Emotion1 × α + Emotion2 × (1 − α); selecting an emotion pair lists every α together with its corresponding audio and mixing coefficients, as sketched below.
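A minimal sketch of this interpolation; the variable names and the 1024-dimensional feature size are assumptions for illustration, not the paper's exact configuration.

```python
import numpy as np

# Hypothetical example vectors; in practice these would be the emotion
# features (e.g., averaged encoder embeddings) of the two emotions to mix.
joy_feat = np.random.randn(1024)
surprise_feat = np.random.randn(1024)

def mix_emotions(emotion1: np.ndarray, emotion2: np.ndarray, alpha: float) -> np.ndarray:
    """Emotion1 * alpha + Emotion2 * (1 - alpha)."""
    return alpha * emotion1 + (1.0 - alpha) * emotion2

# Enumerate the mixing ratios used on this page.
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    mixed = mix_emotions(joy_feat, surprise_feat, alpha)
    # `mixed` would be passed to the synthesizer as its emotion conditioning.
```

At α = 1.0 the feature is purely Emotion1 and at α = 0 purely Emotion2, so sweeping α traces a straight line between the two emotion features.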