Title

A Study of Japanese Mixed Emotional Speech Synthesis Based on an End-to-End Emotional Speech Synthesis Model

2025 Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)

Authors

Issei Sakata and Tetsuo Kosaka
Graduate School of Science and Engineering, Yamagata University, Yonezawa, Japan

Abstract

In this study, we investigated mixed-emotion speech synthesis for Japanese using Emotional-VITS, an emotional speech synthesis model built on VITS, an end-to-end speech synthesis model. With respect to the pretraining data, we compared a model trained on two languages (Japanese and Chinese) with a model trained only on Japanese. In both subjective and objective evaluations of synthesized speech quality, the Japanese-only model achieved higher naturalness than the bilingual model and synthesized the input text more accurately. We also examined the emotional features supplied to the synthesizer and found that the average of the emotional features of speakers other than the target speaker yields a higher emotion recognition rate than the average of the emotional features extracted from the target speech or the average of the emotional features of the target speaker's own speech. Furthermore, in mixed-emotion synthesis, in which two emotions are blended, the emotion perceived by listeners changed consistently with the mixing ratio, indicating that mixed-emotion expression close to human perception is possible.
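
To make the three emotion-feature configurations examined in the paper (Target Speech, Speaker's Ave., and Other's Ave.) concrete, the sketch below shows one way the averaged embeddings could be formed; the data layout and function names are assumptions, not the authors' implementation.

    import numpy as np

    # Assumed layout: embeddings[speaker][emotion] is an array of per-utterance
    # emotion embeddings with shape (num_utterances, dim).

    def target_speech_feature(embeddings, speaker, emotion, utt_index):
        # "Target Speech": the emotional feature taken from the target
        # (reference) utterance itself.
        return embeddings[speaker][emotion][utt_index]

    def speakers_average(embeddings, speaker, emotion):
        # "Speaker's Ave.": mean over the target speaker's own utterances
        # of the given emotion.
        return embeddings[speaker][emotion].mean(axis=0)

    def others_average(embeddings, target_speaker, emotion):
        # "Other's Ave.": mean over all other speakers' utterances of the
        # given emotion.
        others = [embeddings[s][emotion] for s in embeddings if s != target_speaker]
        return np.concatenate(others, axis=0).mean(axis=0)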

Model Architecture

Model structure overview
Fig. 1: Overall architecture of Emotional-VITS

Speech Synthesis Evaluation Results

Results of MOS, UTMOSv2, MCD and CER evaluation experiments
Model                      Configuration    MOS    UTMOSv2  MCD [dB]  CER [%]
Ground Truth               -                4.909  2.514    -         -
Japanese + Chinese model   Target Speech    2.485  1.298    8.071     11.38
Japanese + Chinese model   Speaker's Ave.   2.523  1.419    7.889     10.29
Japanese + Chinese model   Other's Ave.     2.644  1.313    8.259     11.55
Japanese model             Target Speech    3.000  2.068    7.805      8.078
Japanese model             Speaker's Ave.   3.083  2.397    7.732      7.176
Japanese model             Other's Ave.     3.295  2.017    7.884      6.771
(MOS and UTMOSv2: higher is better; MCD and CER: lower is better.)
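
For reference, the sketch below shows one common way to compute the two objective metrics, assuming the mel-cepstral sequences are already extracted and time-aligned (e.g., via dynamic time warping) and that the CER hypothesis is an ASR transcript of the synthesized speech; the function names and details are illustrative assumptions, not the paper's exact implementation.

    import numpy as np

    def mel_cepstral_distortion(mcep_ref, mcep_syn):
        # Frame-averaged MCD in dB between two time-aligned mel-cepstral
        # sequences of shape (frames, dims); the 0th (energy) coefficient
        # is excluded, as is common practice.
        diff = mcep_ref[:, 1:] - mcep_syn[:, 1:]
        per_frame = np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
        return (10.0 / np.log(10.0)) * float(np.mean(per_frame))

    def character_error_rate(ref, hyp):
        # CER = Levenshtein edit distance between the reference transcript
        # and the recognized transcript, divided by the reference length.
        d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=np.int32)
        d[:, 0] = np.arange(len(ref) + 1)
        d[0, :] = np.arange(len(hyp) + 1)
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i, j] = min(d[i - 1, j] + 1,          # deletion
                              d[i, j - 1] + 1,          # insertion
                              d[i - 1, j - 1] + cost)   # substitution
        return d[len(ref), len(hyp)] / len(ref)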

Audio Samples: Japanese+Chinese model vs. Japanese model

Select an emotion vector and a speaker to compare playback between the two models.

Japanese+Chinese Model

Japanese Model

Mixed-Emotion Speech Synthesis

Select a combination of two emotions (e.g., Joy with Surprise or Sadness with Surprise) and a speaker to listen to synthesized mixed-emotion speech at mixing ratios α = 0, 0.25, 0.5, 0.75, and 1.0.
The mixed emotional feature is calculated as Emotion1 × α + Emotion2 × (1 − α). Selecting an emotion pair lists every α together with its corresponding audio and mixing coefficients.
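
As an illustration of the mixing formula above, here is a minimal sketch that interpolates two emotion embeddings; the embedding dimensionality, variable names, and the commented-out synthesis call are assumptions rather than the actual Emotional-VITS interface.

    import numpy as np

    # Hypothetical per-emotion embeddings from the emotion encoder
    # (the 1024-dimensional size is an assumption for illustration).
    emotion1 = np.random.randn(1024).astype(np.float32)  # e.g., Joy
    emotion2 = np.random.randn(1024).astype(np.float32)  # e.g., Surprise

    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        # Mixing rule from above: Emotion1 * alpha + Emotion2 * (1 - alpha)
        mixed = alpha * emotion1 + (1.0 - alpha) * emotion2
        # The mixed vector would replace the single emotion embedding at
        # synthesis time, e.g. (hypothetical call):
        # wav = synthesizer.infer(text, speaker_id, emotion_embedding=mixed)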