Authors: S. Kanimozhi, G. Pooja Sri, N. Suganya, R. Vishnu Priya
Abstract: Mental health monitoring has become increasingly important owing to the growing prevalence of stress, anxiety, and emotional disorders in modern society. Multimodal Emotion Detection using Voice and Text aims to improve emotion recognition accuracy by analyzing multiple forms of human communication simultaneously. The proposed system integrates speech signals and textual data to detect emotional states such as happiness, sadness, anger, fear, and neutrality. Voice inputs are processed by extracting acoustic features, including pitch, tone, speech rate, and intensity, while textual inputs are analyzed with Natural Language Processing (NLP) techniques to identify semantic meaning and sentiment patterns. Machine learning and deep learning algorithms perform multimodal feature fusion and classify emotions more effectively than single-modal approaches. The framework comprises data acquisition, preprocessing, feature extraction, multimodal fusion, and emotion classification stages. By accurately identifying emotional conditions, the system supports mental well-being monitoring and enables early detection of stress or negative emotional states. This technology can be applied in healthcare systems, intelligent virtual assistants, counseling platforms, and educational environments to provide timely emotional insights, personalized support, and improved human–computer interaction.
DOI: https://doi.org/10.5281/zenodo.19184746
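The abstract's pipeline (acoustic feature extraction, text feature extraction, multimodal fusion, and classification) could be sketched as follows. This is a minimal illustration only, not the authors' implementation: the feature proxies (RMS energy and zero-crossing rate standing in for intensity and pitch), the sentiment lexicons, and the nearest-centroid classifier are all simplified assumptions in place of the real DSP, NLP, and deep learning components.

```python
import math

def acoustic_features(samples):
    """Toy acoustic features: RMS energy (intensity proxy) and
    zero-crossing rate (a crude stand-in for pitch tracking)."""
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    zcr = sum(1 for a, b in zip(samples, samples[1:]) if a * b < 0) / len(samples)
    return [rms, zcr]

# Hypothetical sentiment lexicons, standing in for a trained NLP model.
POSITIVE = {"happy", "great", "good", "calm"}
NEGATIVE = {"sad", "angry", "afraid", "stressed"}

def text_features(text):
    """Toy text features: counts of positive and negative lexicon hits."""
    words = text.lower().split()
    return [sum(w in POSITIVE for w in words),
            sum(w in NEGATIVE for w in words)]

def fuse(samples, text):
    """Early fusion: concatenate the two modality feature vectors."""
    return acoustic_features(samples) + text_features(text)

def classify(features, centroids):
    """Nearest-centroid classifier in place of the ML/DL stage."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sqdist(features, centroids[label]))

# Usage: fuse a short waveform with its transcript, then classify
# against illustrative per-emotion centroids.
waveform = [0.1, -0.1, 0.1, -0.1]
vec = fuse(waveform, "I feel sad and stressed")          # 4-dim fused vector
centroids = {"negative": [0.1, 0.75, 0, 2],
             "positive": [0.1, 0.75, 2, 0]}
label = classify(vec, centroids)                          # → "negative"
```

In a real system, the same fusion step would concatenate learned embeddings (e.g. spectral features from the audio branch and contextual text embeddings) before a joint classifier, which is what lets the multimodal model outperform either modality alone.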