Student of Artificial Intelligence, FPT University
Short Bio: My name is Nguyen Minh Nhut (Vietnamese: Nguyễn Minh Nhựt). I am an undergraduate student pursuing a B.Sc. degree in Artificial Intelligence at FPT University, Ho Chi Minh Campus. As a research assistant in the field of artificial intelligence, I am passionate about advancing technologies that enhance human-computer interaction. My primary research interests lie in machine learning and deep learning, with a particular focus on their applications in speech processing and computer vision.
I am currently working as an undergraduate AI researcher under the guidance of Dr. Duc Ngoc Minh Dang, where I continue to explore and contribute to cutting-edge developments in AI.
Research Interests: Deep Learning, Speech Processing, Computer Vision.
Speech emotion recognition (SER) plays a crucial role in human-computer interaction, enhancing numerous applications such as virtual assistants, healthcare monitoring, and customer support by identifying and interpreting emotions conveyed through spoken language. While single-modality SER systems demonstrate notable simplicity and computational efficiency, excelling in extracting critical features like vocal prosody and linguistic content, there is a pressing need to improve their performance in challenging conditions, such as noisy environments and the handling of ambiguous expressions or incomplete information. These challenges underscore the necessity of transitioning to multi-modal approaches, which integrate complementary data sources to achieve more robust and accurate emotion detection. With advancements in artificial intelligence, especially in neural networks and deep learning, many studies have employed advanced deep learning and feature fusion techniques to enhance SER performance. This review synthesizes comprehensive publications from 2020 to 2024, exploring prominent multi-modal fusion strategies, including early fusion, late fusion, deep fusion, and hybrid fusion methods, while also examining data representation, data translation, attention mechanisms, and graph-based fusion technologies. We assess the effectiveness of various fusion techniques across standard SER datasets, highlighting their performance in diverse tasks and addressing challenges related to data alignment, noise management, and computational demands. Additionally, we explore potential future directions for enhancing multi-modal SER systems, emphasizing scalability and adaptability in real-world applications. This survey aims to contribute to the advancement of multi-modal SER and to inform researchers about effective fusion strategies for developing more responsive and emotion-aware systems.
@article{nguyen5063214multi,
  title={Multi-modal fusion in speech emotion recognition: A comprehensive review of methods and technologies},
  author={Nguyen, Nhut Minh and Nguyen, Thanh Trung and Tran, Phuong-Nam and Lim, Chee Peng and Pham, Nhat Truong and Dang, Duc Ngoc Minh},
  journal={},
  year={2024},
  ssrn={5063214},
  preprint={true},
  google_scholar_id={9yKSN-GCB0IC}
}
In this study, we compared three architectures for age and gender recognition from voice data: Long Short-Term Memory (LSTM) networks, a hybrid of Convolutional Neural Networks and Bidirectional Long Short-Term Memory (CNNs-BiLSTM), and the recently released RezoNet architecture. The dataset was sourced from the Japanese portion of Mozilla Common Voice. Features such as pitch, magnitude, Mel-frequency cepstral coefficients (MFCCs), and filter-bank energies were extracted from the voice data, and the three architectures were evaluated on them. For gender recognition, LSTM achieved the highest accuracy (93.5%), slightly ahead of the hybrid CNNs-BiLSTM (93.1%), while RezoNet trailed at 83.1%. For age recognition, however, the hybrid CNNs-BiLSTM architecture outperformed the other models, achieving an accuracy of 69.75%, compared to 64.25% and 44.88% for LSTM and RezoNet, respectively. On the Japanese data and the extracted features, the hybrid CNNs-BiLSTM model was the most consistent performer across both tasks, leading in age recognition and coming within 0.4 percentage points of LSTM in gender recognition, which highlights its efficacy in voice-based age and gender detection. These results suggest promising avenues for future research and practical applications in this field.
@inproceedings{nguyen2024age-gender,
  bibtex_show={true},
  title={Voice-Based Age and Gender Recognition: A Comparative Study of LSTM, RezoNet and Hybrid CNNs-BiLSTM Architecture},
  author={Nguyen, Nhut Minh and Nguyen, Thanh Trung and Nguyen, Hua Hiep and Tran, Phuong-Nam and Dang, Duc Ngoc Minh},
  booktitle={2024 15th International Conference on Information and Communication Technology Convergence (ICTC)},
  pages={1--1},
  year={2024},
  publisher={IEEE},
  doi={10.1109/ICTC62082.2024.10827387},
  dimensions={true},
  google_scholar_id={d1gkVwhDpl0C}
}
Feel free to reach out for any inquiries, collaborations, or just to say hello! I'm always open to discussing new projects, creative ideas, or opportunities to be part of your vision. Let's connect and create something amazing together.