Nhut Minh Nguyen

Student of Artificial Intelligence, FPT University

Short Bio: My name is Nguyen Minh Nhut (Nguyễn Minh Nhựt - Vietnamese), and I am currently an undergraduate student pursuing a Bachelor of Science degree in Artificial Intelligence at FPT University, Ho Chi Minh Campus. I am deeply passionate about exploring the theoretical foundations and practical applications of artificial intelligence, with a strong academic and research focus on machine learning, deep learning, speech processing, and computer vision. Since the beginning of my academic journey, I have been committed to understanding how intelligent systems can be designed to better perceive, interpret, and interact with human users. As a research assistant, I actively engage in developing AI-driven solutions that enhance human-computer interaction, aiming to bridge the gap between computational intelligence and real-world communication systems.

I am currently working under the guidance of Dr. Duc Ngoc Minh Dang, a respected researcher in the field of artificial intelligence. Under his mentorship, I have been involved in multiple research projects that investigate multimodal emotion recognition, graph-based neural networks, and advanced learning techniques for speech and visual understanding. This experience has not only sharpened my technical skills in deep neural architectures and data modeling but has also nurtured a deeper appreciation for interdisciplinary AI research.

Research Interests: Deep Learning, Speech and Audio Processing, Human-Centered AI, Multimodal Emotion Recognition, Human-Computer Interaction, Graph Neural Networks.

news

Aug 01, 2025	📰 03 papers have been accepted at the 25th Asia-Pacific Network Operations and Management Symposium, Kaohsiung, Taiwan.
Jun 27, 2025	🗞️ 01 manuscript entitled FleSER: Multimodal Emotion Recognition via Dynamic Fuzzy Membership and Attention Fusion has been submitted to SSRN.
Apr 13, 2025	🏆 1st Prize in Student Research Competition at FPT University!
Dec 28, 2024	🎓 2nd Prize in Student Research Competition at FPT University!
Dec 18, 2024	🗞️ 01 manuscript entitled Multi-modal fusion in speech emotion recognition: A comprehensive review of methods and technologies has been submitted to SSRN.

selected publications

Preprint
FleSER: Multimodal Emotion Recognition via Dynamic Fuzzy Membership and Attention Fusion

Nhut Minh Nguyen, Minh Trung Nguyen, Thanh Trung Nguyen, and 6 more authors

2025

Multimodal learning has been demonstrated to improve classification outcomes in speech emotion recognition (SER). Despite this advantage, multimodal approaches in SER often face key challenges such as limited robustness in handling uncertainty, difficulties in generalizing across diverse emotional contexts, and inefficiencies in integrating heterogeneous modalities. To overcome these constraints, we propose FleSER, a multimodal emotion recognition framework that utilizes dynamic fuzzy membership and attention fusion. In this architecture, we introduce a rule-based dynamic fuzzy membership mechanism that adaptively transforms features. The FleSER architecture leverages audio and textual modalities, employing self-modality and cross-modality attention mechanisms with α interpolation to capture complementary emotional cues. The α interpolation-based feature fusion mechanism adaptively emphasizes the more informative modality in varying contexts, ensuring robust multimodal integration. This design improves the model’s recognition accuracy. We evaluate the FleSER architecture on three benchmark datasets: IEMOCAP, ESD, and MELD. FleSER surpasses the previous state-of-the-art (SOTA) by 1.92% on IEMOCAP and by 6.37% on ESD, demonstrating its effectiveness in enhancing emotion recognition accuracy across datasets. Ablation studies further validate the contribution of each key component, covering unimodal versus multimodal input, fuzzy membership functions, fusion strategies, and projection dimension. Our source code is publicly available at https://github.com/nhut-ngnn/FleSER.
@article{nguyen2025fleser,
  title = {{FleSER}: Multimodal Emotion Recognition via Dynamic Fuzzy Membership and Attention Fusion},
  author = {Nguyen, Nhut Minh and Nguyen, Minh Trung and Nguyen, Thanh Trung and Tran, Phuong-Nam and Pham, Nhat Truong and Le, Linh and Othmani, Alice and El Saddik, Abdulmotaleb and Dang, Duc Ngoc Minh},
  journal = {},
  year = {2025},
  ssrn = {5316634},
  preprint = {true},
  google_scholar_id = {2osOgNQ5qMEC},
}

Preprint
Multi-modal fusion in speech emotion recognition: A comprehensive review of methods and technologies

Nhut Minh Nguyen, Thanh Trung Nguyen, Phuong-Nam Tran, and 3 more authors

2024

Speech emotion recognition (SER) plays a crucial role in human-computer interaction, enhancing numerous applications such as virtual assistants, healthcare monitoring, and customer support by identifying and interpreting emotions conveyed through spoken language. While single-modality SER systems demonstrate notable simplicity and computational efficiency, excelling in extracting critical features like vocal prosody and linguistic content, there is a pressing need to improve their performance in challenging conditions, such as noisy environments and the handling of ambiguous expressions or incomplete information. These challenges underscore the necessity of transitioning to multi-modal approaches, which integrate complementary data sources to achieve more robust and accurate emotion detection. With advancements in artificial intelligence, especially in neural networks and deep learning, many studies have employed advanced deep learning and feature fusion techniques to enhance SER performance. This review synthesizes comprehensive publications from 2020 to 2024, exploring prominent multi-modal fusion strategies, including early fusion, late fusion, deep fusion, and hybrid fusion methods, while also examining data representation, data translation, attention mechanisms, and graph-based fusion technologies. We assess the effectiveness of various fusion techniques across standard SER datasets, highlighting their performance in diverse tasks and addressing challenges related to data alignment, noise management, and computational demands. Additionally, we explore potential future directions for enhancing multi-modal SER systems, emphasizing scalability and adaptability in real-world applications. This survey aims to contribute to the advancement of multi-modal SER and to inform researchers about effective fusion strategies for developing more responsive and emotion-aware systems.
@article{nguyen5063214multi,
  title = {Multi-modal fusion in speech emotion recognition: A comprehensive review of methods and technologies},
  author = {Nguyen, Nhut Minh and Nguyen, Thanh Trung and Tran, Phuong-Nam and Lim, Chee Peng and Pham, Nhat Truong and Dang, Duc Ngoc Minh},
  journal = {},
  year = {2024},
  ssrn = {5063214},
  preprint = {true},
  google_scholar_id = {9yKSN-GCB0IC},
}

ICTC
Voice-Based Age and Gender Recognition: A Comparative Study of LSTM, RezoNet and Hybrid CNNs-BiLSTM Architecture

Nhut Minh Nguyen, Thanh Trung Nguyen, Hua Hiep Nguyen, and 2 more authors

In 2024 15th International Conference on Information and Communication Technology Convergence (ICTC), 2024

In this study, we compared three architectures for age and gender recognition from voice data: Long Short-Term Memory networks (LSTM), a hybrid of Convolutional Neural Networks and Bidirectional Long Short-Term Memory (CNNs-BiLSTM), and the recently released RezoNet architecture. The dataset was sourced from Mozilla Common Voice in Japanese. Features such as pitch, magnitude, Mel-frequency cepstral coefficients (MFCCs), and filter-bank energies were extracted from the voice data, and the three architectures were evaluated on them. For gender recognition, LSTM achieved the highest accuracy (93.5%), closely followed by the hybrid CNNs-BiLSTM (93.1%), while RezoNet was less accurate (83.1%). For age recognition, the hybrid CNNs-BiLSTM architecture outperformed the other models with an accuracy of 69.75%, compared to 64.25% and 44.88% for LSTM and RezoNet, respectively. On the Japanese-language data and the extracted features, the hybrid CNNs-BiLSTM model therefore showed the strongest overall performance across the two tasks, highlighting its efficacy in voice-based age and gender detection. These results suggest promising avenues for future research and practical applications in this field.
@inproceedings{nguyen2024age-gender,
  bibtex_show = true,
  title = {Voice-Based Age and Gender Recognition: A Comparative Study of LSTM, RezoNet and Hybrid CNNs-BiLSTM Architecture},
  author = {Nguyen, Nhut Minh and Nguyen, Thanh Trung and Nguyen, Hua Hiep and Tran, Phuong-Nam and Dang, Duc Ngoc Minh},
  booktitle = {2024 15th International Conference on Information and Communication Technology Convergence (ICTC)},
  pages = {1--1},
  year = {2024},
  publisher = {IEEE},
  doi = {10.1109/ICTC62082.2024.10827387},
  dimensions = {true},
  google_scholar_id = {d1gkVwhDpl0C},
}