Department of Electronic Engineering, University of Rome Tor Vergata, 00133 Rome, Italy.
Institute of Computational Perception, Johannes Kepler University, 4040 Linz, Austria.
Sensors (Basel). 2022 Mar 23;22(7):2461. doi: 10.3390/s22072461.
Machine Learning (ML) algorithms within a human-computer framework are the leading force in speech emotion recognition (SER). However, few studies explore cross-corpora aspects of SER; this work aims to explore the feasibility and characteristics of cross-linguistic, cross-gender SER. Three ML classifiers (SVM, Naïve Bayes and MLP) are applied to acoustic features obtained through a procedure based on Kononenko's discretization and correlation-based feature selection. The system covers five emotions (disgust, fear, happiness, anger and sadness) and uses the Emofilm database, which comprises short clips of English movies and their Italian and Spanish dubbed versions, for a total of 1115 annotated utterances. MLP proves the most effective classifier, with accuracies above 90% for single-language approaches, while the cross-language classifier still yields accuracies above 80%. Cross-gender tasks prove more difficult than those involving two languages, suggesting greater differences between emotions expressed by male versus female subjects than between different languages. Four feature domains, namely RASTA, F0, MFCC and spectral energy, are algorithmically assessed as the most effective, refining existing literature and approaches based on standard sets. To our knowledge, this is one of the first studies encompassing cross-gender and cross-linguistic assessments on SER.
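The correlation-based feature selection step named in the abstract can be illustrated with a small sketch. It is not the authors' implementation, which the abstract does not describe. It assumes the commonly used CFS merit heuristic, merit = k * r_cf / sqrt(k + k(k-1) * r_ff), where k is the subset size, r_cf the mean feature-label correlation and r_ff the mean feature-feature correlation. It also assumes a greedy forward search and uses absolute Pearson correlation as a simple stand-in for a measure defined on discretized features. The feature names and values are hypothetical toy data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cfs_merit(columns, labels, subset):
    """CFS merit: rewards correlation with the label, penalizes redundancy."""
    k = len(subset)
    if k == 0:
        return 0.0
    # Mean absolute feature-label correlation.
    r_cf = sum(abs(pearson(columns[f], labels)) for f in subset) / k
    # Mean absolute correlation over all feature pairs in the subset.
    pairs = [(a, b) for i, a in enumerate(subset) for b in subset[i + 1:]]
    r_ff = 0.0
    if pairs:
        r_ff = sum(abs(pearson(columns[a], columns[b])) for a, b in pairs) / len(pairs)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)


def greedy_cfs(columns, labels):
    """Forward selection: add the feature that most improves the merit, stop when none does."""
    selected, best_merit = [], 0.0
    remaining = list(columns)
    while remaining:
        best_feature = None
        for feature in remaining:
            merit = cfs_merit(columns, labels, selected + [feature])
            if merit > best_merit:  # strict: ties keep the earlier candidate
                best_merit, best_feature = merit, feature
        if best_feature is None:
            break
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected, best_merit


# Hypothetical toy data: two emotion classes encoded as 0/1, three binary features.
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_columns = {
    "f0_mean": [1, 0, 0, 0, 0, 1, 1, 1],
    "mfcc_1": [0, 0, 1, 0, 1, 1, 0, 1],
    "mfcc_1_copy": [0, 0, 1, 0, 1, 1, 0, 1],  # exact duplicate of mfcc_1
}

selected, merit = greedy_cfs(toy_columns, toy_labels)
print(selected, round(merit, 4))
```

On this toy data the sketch keeps `f0_mean` and `mfcc_1`, which are equally informative and mutually uncorrelated, and rejects the duplicate because adding it lowers the merit. Four retained feature domains out of a larger candidate pool, as reported in the abstract, is the kind of outcome such a redundancy-penalizing criterion produces.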