Speech Emotion Recognition with Hybrid CNN-LSTM and Transformers Models: Evaluating the Hybrid Model Using Grad-CAM

Authors

  • Lihini Sangeetha Kumari Herath Mudiyanselage, Computing Center, Faculty of Engineering, University of Peradeniya
  • HMNS Kumari Faculty of Information Technology and Communication Sciences, Tampere University
  • UMMPK Nawarathne Faculty of Computing, Sri Lanka Institute of Information Technology

Keywords:

Convolutional neural network, Grad-CAM, Hybrid model, Long Short-Term Memory, Speech emotion recognition, Image transformers

Abstract

Emotion recognition and classification using artificial intelligence (AI) techniques play a crucial role in human-computer interaction (HCI), enabling the prediction of human emotions from audio signals with broad applications in psychology, medicine, education, and entertainment, among other fields. This research focused on speech emotion recognition (SER) by employing classification methods and transformer models on the Toronto Emotional Speech Set (TESS). Initially, acoustic features were extracted from the audio dataset using several feature extraction techniques, including chroma features, Mel-scaled spectrograms, spectral contrast features, and Mel Frequency Cepstral Coefficients (MFCCs). This study then employed a Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and a hybrid CNN-LSTM model to classify emotions. To compare the performance of these models, classical image transformer models, namely ViT (Vision Transformer) and BEiT (Bidirectional Encoder representation from Image Transformers), were applied to Mel-spectrograms derived from the same dataset. Evaluation metrics such as accuracy, precision, recall, and F1-score were calculated for each model to ensure a comprehensive performance comparison. According to the results, the hybrid model outperformed the other models, achieving an accuracy of 99.01%, while the CNN, LSTM, ViT, and BEiT models demonstrated accuracies of 95.37%, 98.57%, 98%, and 98.3%, respectively. To interpret the output of the hybrid model and provide visual explanations of its predictions, Gradient-weighted Class Activation Mapping (Grad-CAM) was applied. This technique reduces the black-box character of deep models, making them more reliable for use in clinical and other sensitive contexts. In conclusion, the hybrid CNN-LSTM model showed strong performance in audio-based emotion classification.
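
The abstract summarizes the pipeline without implementation details. As a minimal sketch only, the following Python example illustrates how such a pipeline is commonly assembled, assuming librosa for the listed acoustic features (chroma, Mel-scaled spectrogram, spectral contrast, and MFCCs) and Keras for the hybrid CNN-LSTM classifier; the feature dimensions, layer sizes, and seven-class output are illustrative assumptions, not the authors' exact configuration.

    # Hypothetical sketch: per-file acoustic features with librosa, then a
    # hybrid CNN-LSTM classifier in Keras. Shapes and layer sizes are assumed.
    import numpy as np
    import librosa
    from tensorflow.keras import layers, models

    def extract_features(path):
        """Mean chroma, Mel-spectrogram, spectral contrast, and MFCC features."""
        y, sr = librosa.load(path, sr=None)
        mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40), axis=1)
        chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
        mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
        contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)
        return np.concatenate([mfcc, chroma, mel, contrast])  # 40 + 12 + 128 + 7 values

    def build_hybrid_model(num_features, num_classes=7):
        """Conv1D blocks capture local spectral patterns; the LSTM models their sequence."""
        model = models.Sequential([
            layers.Input(shape=(num_features, 1)),
            layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
            layers.MaxPooling1D(pool_size=2),
            layers.Conv1D(128, kernel_size=5, padding="same", activation="relu",
                          name="conv1d_last"),
            layers.MaxPooling1D(pool_size=2),
            layers.LSTM(128),
            layers.Dense(64, activation="relu"),
            layers.Dropout(0.3),
            layers.Dense(num_classes, activation="softmax"),
        ])
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        return model

    # Hypothetical usage with a list of TESS wav paths and integer emotion labels:
    # X = np.array([extract_features(p) for p in wav_paths])[..., np.newaxis]
    # model = build_hybrid_model(num_features=X.shape[1])
    # model.fit(X, y_labels, epochs=50, batch_size=32, validation_split=0.2)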

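To make the Grad-CAM step concrete, the sketch below follows the standard recipe of weighting a convolutional layer's feature maps by the gradients of the predicted class score, then rectifying and normalizing the result. It assumes the hypothetical Keras model above (with its Conv1D layer named "conv1d_last") and is an illustration of the technique, not the authors' code.

    # Hypothetical Grad-CAM for a 1D-convolutional Keras model.
    import numpy as np
    import tensorflow as tf

    def grad_cam_1d(model, sample, conv_layer_name, class_index=None):
        """Importance of each position along the chosen conv layer's axis."""
        grad_model = tf.keras.Model(
            inputs=model.input,
            outputs=[model.get_layer(conv_layer_name).output, model.output],
        )
        with tf.GradientTape() as tape:
            conv_out, preds = grad_model(sample[np.newaxis, ...])
            if class_index is None:
                class_index = int(tf.argmax(preds[0]))  # explain the predicted class
            class_score = preds[:, class_index]
        grads = tape.gradient(class_score, conv_out)         # d(score) / d(feature maps)
        weights = tf.reduce_mean(grads, axis=1)              # average gradients per channel
        cam = tf.reduce_sum(conv_out[0] * weights, axis=-1)  # gradient-weighted feature maps
        cam = tf.nn.relu(cam)                                # keep only positive evidence
        return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()   # normalize to [0, 1]

    # Hypothetical usage, overlaying the heat map on the input feature axis:
    # heatmap = grad_cam_1d(model, X[0], "conv1d_last")
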
References

W. Zheng, W. Zheng, and Y. Zong, “Multi-scale discrepancy adversarial network for cross-corpus speech emotion recognition,” Virtual Reality & Intelligent Hardware, vol. 3, no. 1, pp. 65–75, Feb. 2021, doi: https://doi.org/10.1016/j.vrih.2020.11.006.

G. A. Koduru, H. B. Valiveti, and A. K. Budati, “Feature extraction algorithms to improve the speech emotion recognition rate,” International Journal of Speech Technology, vol. 23, no. 1, pp. 45–55, Jan. 2020, doi: https://doi.org/10.1007/s10772-020-09672-4.

J. H. L. Hansen and D. A. Cairns, “ICARUS: Source generator based real-time recognition of speech in noisy stressful and Lombard effect environments,” Speech Communication, vol. 16, no. 4, pp. 391–422, Jun. 1995, doi: https://doi.org/10.1016/0167-6393(95)00007-b.

C. Spencer et al., “A Comparison of Unimodal and Multimodal Measurements of Driver Stress in Real-World Driving Conditions,” PsyArXiv (OSF Preprints), Jun. 2020, doi: https://doi.org/10.31234/osf.io/en5r3.

B. Schuller, G. Rigoll, and M. Lang, “Speech emotion recognition combining acoustic features and linguistic information in a hybrid support vector machine-belief network architecture,” IEEE Xplore, May 01, 2004. https://ieeexplore.ieee.org/document/1326051

D. J. France, R. G. Shiavi, S. Silverman, M. Silverman, and D. M. Wilkes, “Acoustical properties of speech as indicators of depression and suicidal risk,” IEEE Transactions on Biomedical Engineering, vol. 47, no. 7, pp. 829–837, Jul. 2000, doi: https://doi.org/10.1109/10.846676.

M. Young, The Technical Writer’s Handbook. Mill Valley, CA: University Science, 1989.

“Speech and Multimedia Transmission Quality (STQ); Requirements for Emotion Detectors used for Telecommunication Measurement Applications; Detectors for written text and spoken speech,” ETSI Technical Specification. Accessed: Sep. 01, 2024. [Online]. Available: https://www.etsi.org/deliver/etsi_ts/103200_103299/103296/01.01.01_60/ts_103296v010101p.pdf

S. G. Koolagudi and K. S. Rao, “Emotion recognition from speech: a review,” International Journal of Speech Technology, vol. 15, no. 2, pp. 99–117, Jan. 2012, doi: https://doi.org/10.1007/s10772-011-9125-1.

M. El Ayadi, M. S. Kamel, and F. Karray, “Survey on speech emotion recognition: Features, classification schemes, and databases,” Pattern Recognition, vol. 44, no. 3, pp. 572–587, Mar. 2011, doi: https://doi.org/10.1016/j.patcog.2010.09.020.

M. Ren, W. Nie, A. Liu, and Y. Su, “Multi-modal Correlated Network for emotion recognition in speech,” Visual Informatics, vol. 3, no. 3, pp. 150–155, Sep. 2019, doi: https://doi.org/10.1016/j.visinf.2019.10.003.

“Index of /class/archive/cs/cs224n/cs224n.1214/reports,” Stanford.edu, 2021. https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1214/reports/ (accessed Sep. 01, 2024).

M. M. Rezapour Mashhadi and K. Osei-Bonsu, “Speech emotion recognition using machine learning techniques: Feature extraction and comparison of convolutional neural network and random forest,” PLoS ONE, vol. 18, no. 11, p. e0291500, 2023, doi: https://doi.org/10.1371/journal.pone.0291500.

N. P. Tigga and S. Garg, “Speech Emotion Recognition for multiclass classification using Hybrid CNN-LSTM,” International Journal of Microsystems and IoT, vol. 1, pp. 9–17, 2023.

H. Qazi and B. N. Kaushik, “A hybrid technique using CNN+LSTM for speech emotion recognition,” International Journal of Engineering and Advanced Technology (IJEAT), vol. 9, no. 5, pp. 1126–1130, 2020.

L. Kerkeni, Y. Serrestou, M. Mbarki, K. Raoof, M. A. Mahjoub, and C. Cleder, “Automatic Speech Emotion Recognition Using Machine Learning,” Social Media and Machine Learning, Mar. 2019, doi: https://doi.org/10.5772/intechopen.84856.

H. S. Kumbhar and S. U. Bhandari, “Speech Emotion Recognition using MFCC features and LSTM network,” Sep. 2019, doi: https://doi.org/10.1109/iccubea47591.2019.9129067.

Y. Yu and Y.-J. Kim, “Attention-LSTM-Attention Model for Speech Emotion Recognition and Analysis of IEMOCAP Database,” Electronics, vol. 9, no. 5, p. 713, Apr. 2020, doi: https://doi.org/10.3390/electronics9050713.

“Speech Emotion Recognition Using CNN-LSTM and Vision Transformer,” Dntb.gov.ua, 2023. https://ouci.dntb.gov.ua/en/works/7WQ2BrPl/ (accessed Sep. 01, 2024).

“Toronto emotional speech set (TESS),” www.kaggle.com. https://www.kaggle.com/datasets/ejlok1/toronto-emotional-speech-set-tess

L. Toledo, A. Luiz, and J. Fiais, “A Deep Learning Approach for Speech Emotion Recognition Optimization Using Meta-Learning,” Electronics, vol. 12, no. 23, p. 4859, Dec. 2023, doi: https://doi.org/10.3390/electronics12234859.

V. Vielzeuf, S. Pateux, and F. Jurie, “Temporal multimodal fusion for video emotion classification in the wild,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, 2017, pp. 569–576.

S. Waldekar and G. Saha, “Wavelet Transform Based Mel-scaled Features for Acoustic Scene Classification,” in INTERSPEECH, 2018, vol. 2018, pp. 3323–3327.

A. Dosovitskiy et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.

“Baeldung on CS,” www.baeldung.com, Mar. 19, 2021. https://www.baeldung.com/cs/.

H. Bao, L. Dong, S. Piao, and F. Wei, “BEiT: BERT pre-training of image transformers,” arXiv preprint arXiv:2106.08254, 2021.

B. Zhou et al., “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

S.-M. Jeong et al., “Exploring Spectrogram-Based Audio Classification for Parkinson’s Disease: A Study on Speech Classification and Qualitative Reliability Verification,” Sensors, vol. 24, no. 14, p. 4625, 2024.

Published

07/01/2025

How to Cite

Herath Mudiyanselage, L. S. K., HMNS Kumari, & UMMPK Nawarathne. (2025). Speech Emotion Recognition with Hybrid CNN-LSTM and Transformers Models: Evaluating the Hybrid Model Using Grad-CAM. International Journal of Research in Computing, 4(II), 56–66. Retrieved from https://ijrcom.org/index.php/ijrc/article/view/159