Publications
(†) denotes equal contribution
(*) denotes correspondence
2024
- (Preprint) FleSER: Multimodal Emotion Recognition Via Dynamic Fuzzy Membership and Attention Fusion. Nhut Minh Nguyen, Minh Trung Nguyen, Thanh Trung Nguyen, and 6 more authors. 2024.
Multimodal learning has been demonstrated to improve classification outcomes in speech emotion recognition (SER). Despite this advantage, multimodal approaches in SER often face key challenges such as limited robustness in handling uncertainty, difficulties in generalizing across diverse emotional contexts, and inefficiencies in integrating heterogeneous modalities. To overcome these constraints, we propose FleSER, a multimodal emotion recognition framework that utilizes dynamic fuzzy membership and attention fusion. In this architecture, we introduce a rule-based dynamic fuzzy membership mechanism that adaptively transforms features. The FleSER architecture leverages audio and textual modalities, employing self-modality and cross-modality attention mechanisms with α interpolation to capture complementary emotional cues. The α interpolation-based feature fusion mechanism adaptively emphasizes the more informative modality in varying contexts, ensuring robust multimodal integration. This comprehensive design improves the model’s recognition accuracy. We evaluate the FleSER architecture on three benchmark datasets: IEMOCAP, ESD, and MELD. FleSER surpasses the previous state-of-the-art (SOTA) by 1.92% on IEMOCAP and an impressive 6.37% on ESD, demonstrating its superior effectiveness in enhancing emotion recognition accuracy across various datasets. Ablation studies further validate each key component, examining the effects of unimodal versus multimodal inputs, fuzzy membership functions, fusion strategies, and projection dimension on the performance of the FleSER architecture. Our source code is publicly available at https://github.com/nhut-ngnn/FleSER.
@article{nguyen2025fleser,
  title = {FleSER: Multimodal Emotion Recognition Via Dynamic Fuzzy Membership and Attention Fusion},
  author = {Nguyen, Nhut Minh and Nguyen, Minh Trung and Nguyen, Thanh Trung and Tran, Phuong-Nam and Pham, Nhat Truong and Le, Linh and Othmani, Alice and Saddik, Abdulmotaleb El and Dang, Duc Ngoc Minh},
  journal = {},
  year = {2024},
  ssrn = {5316634},
  preprint = {true},
  google_scholar_id = {2osOgNQ5qMEC},
}
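For readers who want a concrete picture of the α interpolation fusion described in the entry above, the sketch below shows one way an input-conditioned weight can blend cross-attended audio and text features. This is a minimal, hypothetical PyTorch illustration: the module names, dimensions, and gating network are assumptions (self-modality attention is omitted for brevity), not the authors' released implementation; see https://github.com/nhut-ngnn/FleSER for the official code.

```python
import torch
import torch.nn as nn


class AlphaFusion(nn.Module):
    """Toy alpha-interpolation fusion of audio and text embeddings.

    A learned, input-conditioned weight alpha in [0, 1] decides how much
    each modality contributes to the fused representation. All sizes are
    illustrative placeholders.
    """

    def __init__(self, dim: int = 256, num_classes: int = 4):
        super().__init__()
        # Cross-modality attention: each modality attends to the other.
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Small gate that predicts alpha from the concatenated pooled features.
        self.alpha_gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # audio, text: (batch, seq_len, dim) projected unimodal features
        a_att, _ = self.audio_to_text(audio, text, text)   # audio queries text
        t_att, _ = self.text_to_audio(text, audio, audio)  # text queries audio
        a_pool, t_pool = a_att.mean(dim=1), t_att.mean(dim=1)
        alpha = self.alpha_gate(torch.cat([a_pool, t_pool], dim=-1))  # (batch, 1)
        fused = alpha * a_pool + (1.0 - alpha) * t_pool    # convex combination
        return self.classifier(fused)


if __name__ == "__main__":
    model = AlphaFusion()
    logits = model(torch.randn(2, 100, 256), torch.randn(2, 20, 256))
    print(logits.shape)  # torch.Size([2, 4])
```

The point of the sketch is that alpha is predicted per sample, so the fused representation can lean on whichever modality happens to be more informative for that utterance, which is the behavior the abstract attributes to the α interpolation-based fusion.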
- (Preprint) Multi-modal fusion in speech emotion recognition: A comprehensive review of methods and technologies. Nhut Minh Nguyen, Thanh Trung Nguyen, Phuong-Nam Tran, and 3 more authors. 2024.
Speech emotion recognition (SER) plays a crucial role in human-computer interaction, enhancing numerous applications such as virtual assistants, healthcare monitoring, and customer support by identifying and interpreting emotions conveyed through spoken language. While single-modality SER systems demonstrate notable simplicity and computational efficiency, excelling in extracting critical features like vocal prosody and linguistic content, there is a pressing need to improve their performance in challenging conditions, such as noisy environments and the handling of ambiguous expressions or incomplete information. These challenges underscore the necessity of transitioning to multi-modal approaches, which integrate complementary data sources to achieve more robust and accurate emotion detection. With advancements in artificial intelligence, especially in neural networks and deep learning, many studies have employed advanced deep learning and feature fusion techniques to enhance SER performance. This review provides a comprehensive synthesis of publications from 2020 to 2024, exploring prominent multi-modal fusion strategies, including early fusion, late fusion, deep fusion, and hybrid fusion methods, while also examining data representation, data translation, attention mechanisms, and graph-based fusion technologies. We assess the effectiveness of various fusion techniques across standard SER datasets, highlighting their performance in diverse tasks and addressing challenges related to data alignment, noise management, and computational demands. Additionally, we explore potential future directions for enhancing multi-modal SER systems, emphasizing scalability and adaptability in real-world applications. This survey aims to contribute to the advancement of multi-modal SER and to inform researchers about effective fusion strategies for developing more responsive and emotion-aware systems.
@article{nguyen5063214multi,
  title = {Multi-modal fusion in speech emotion recognition: A comprehensive review of methods and technologies},
  author = {Nguyen, Nhut Minh and Nguyen, Thanh Trung and Tran, Phuong-Nam and Lim, Chee Peng and Pham, Nhat Truong and Dang, Duc Ngoc Minh},
  journal = {},
  year = {2024},
  ssrn = {5063214},
  preprint = {true},
  google_scholar_id = {9yKSN-GCB0IC},
}
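As a quick illustration of the fusion taxonomy covered in the review above, the toy sketch below contrasts early (feature-level) fusion with late (decision-level) fusion. The feature dimensions, the stand-in classifier, and the equal decision weights are illustrative assumptions for exposition only, not methods or results from the survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-utterance features for two modalities (dimensions are illustrative).
audio_feat = rng.normal(size=(8, 40))   # e.g. pooled acoustic features
text_feat = rng.normal(size=(8, 300))   # e.g. pooled word embeddings


def toy_classifier(x: np.ndarray, num_classes: int = 4) -> np.ndarray:
    """Stand-in for any trained classifier: returns class probabilities."""
    logits = x @ rng.normal(size=(x.shape[1], num_classes))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)


# Early (feature-level) fusion: concatenate modality features, classify once.
early_probs = toy_classifier(np.concatenate([audio_feat, text_feat], axis=1))

# Late (decision-level) fusion: classify each modality, then combine the scores.
late_probs = 0.5 * toy_classifier(audio_feat) + 0.5 * toy_classifier(text_feat)

print(early_probs.argmax(axis=1), late_probs.argmax(axis=1))
```

Deep and hybrid fusion, also surveyed in the review, differ in that the combination happens inside intermediate network layers (or across several of these stages) rather than purely at the feature input or the final decision.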
- Voice-Based Age and Gender Recognition: A Comparative Study of LSTM, RezoNet and Hybrid CNNs-BiLSTM Architecture. Nhut Minh Nguyen, Thanh Trung Nguyen, Hua Hiep Nguyen, and 2 more authors. In 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), 2024.
In this study, we compared three architectures for the task of age and gender recognition from voice data: Long Short-Term Memory networks (LSTM), Hybrid of Convolutional Neural Networks and Bidirectional Long Short-Term Memory (CNNs-BiLSTM), and the recently released RezoNet architecture. The dataset used in this study was sourced from Mozilla Common Voice in Japanese. Features such as pitch, magnitude, Mel-frequency cepstral coefficients (MFCCs), and filter-bank energies were extracted from the voice data for signal processing, and the three architectures were evaluated. Our evaluation revealed that LSTM was slightly less accurate than RezoNet (83.1%); for gender recognition, the hybrid CNNs-BiLSTM reached 93.1% and LSTM achieved the highest accuracy at 93.5%. However, the hybrid CNNs-BiLSTM architecture outperformed the other models in age recognition, achieving an accuracy of 69.75%, compared to 64.25% and 44.88% for LSTM and RezoNet, respectively. Using Japanese language data and the extracted characteristics, the hybrid CNNs-BiLSTM architecture model demonstrated the highest accuracy in both tests, highlighting its efficacy in voice-based age and gender detection. These results suggest promising avenues for future research and practical applications in this field.
@inproceedings{nguyen2024age-gender,
  bibtex_show = {true},
  title = {Voice-Based Age and Gender Recognition: A Comparative Study of LSTM, RezoNet and Hybrid CNNs-BiLSTM Architecture},
  author = {Nguyen, Nhut Minh and Nguyen, Thanh Trung and Nguyen, Hua Hiep and Tran, Phuong-Nam and Dang, Duc Ngoc Minh},
  booktitle = {2024 15th International Conference on Information and Communication Technology Convergence (ICTC)},
  pages = {1--1},
  year = {2024},
  publisher = {IEEE},
  doi = {10.1109/ICTC62082.2024.10827387},
  dimensions = {true},
  google_scholar_id = {d1gkVwhDpl0C},
}
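To make the hybrid CNNs-BiLSTM idea from the entry above concrete, here is a minimal PyTorch sketch of a 1-D convolutional front end followed by a bidirectional LSTM over a sequence of MFCC frames. The layer sizes, kernel widths, and the two-class output head are placeholder assumptions and do not reflect the configuration or results reported in the paper.

```python
import torch
import torch.nn as nn


class HybridCNNBiLSTM(nn.Module):
    """Illustrative hybrid CNN + BiLSTM over a sequence of MFCC frames.

    Layer sizes are placeholders, not the configuration reported in the paper.
    """

    def __init__(self, n_mfcc: int = 40, hidden: int = 128, num_classes: int = 2):
        super().__init__()
        # 1-D convolutions over time learn local spectral patterns.
        self.cnn = nn.Sequential(
            nn.Conv1d(n_mfcc, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
        )
        # BiLSTM captures longer-range temporal context in both directions.
        self.bilstm = nn.LSTM(64, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, time), e.g. produced by an MFCC extractor
        x = self.cnn(mfcc).transpose(1, 2)        # -> (batch, time, 64)
        _, (h_n, _) = self.bilstm(x)              # final hidden states
        h = torch.cat([h_n[0], h_n[1]], dim=-1)   # forward + backward states
        return self.head(h)


if __name__ == "__main__":
    model = HybridCNNBiLSTM(num_classes=2)        # e.g. a binary gender head
    print(model(torch.randn(4, 40, 200)).shape)   # torch.Size([4, 2])
```

In practice the MFCC input would come from a feature extractor such as librosa.feature.mfcc applied to the Common Voice recordings, with separate output heads (or separate models) for the gender and age tasks.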