Abstract
The application of deep learning in multimodal systems has shown significant progress, particularly in streamlining gesture recognition and facilitating sign language interpretation for the hearing impaired. This paper explores the integration of gesture and emotion analysis, using convolutional neural networks (CNNs) for facial expression recognition and long short-term memory (LSTM) networks for temporal gesture analysis. To evaluate the effectiveness of these algorithms, the multimodal systems were tested on specialized datasets such as iMiGUE, which contains accurately annotated emotion videos. These datasets enabled evaluation of the models' performance on real-life tasks as well as comparison between different models.