This repository contains the implementation of MAVEN (Multi-modal Attention for Valence-Arousal Emotion Network), a novel architecture for dynamic emotion recognition through dimensional modeling of affect. The model integrates visual, audio, and textual modalities via a bi-directional cross-modal attention mechanism, enabling comprehensive interactions between all modality pairs. MAVEN predicts emotions in polar coordinate form (theta and intensity), aligning with psychological models of the emotion circumplex.
MAVEN is designed to recognize emotions in conversational videos by using multi-modal data (visual, audio, and textual). The proposed model employs modality-specific encoders (Swin Transformer for video, HuBERT for audio, and RoBERTa for text) to extract rich feature representations. Our work focuses on the bidirectional cross-modal attention mechanism, which refines each modality's representation through weighted attention from other modalities, followed by self-attention refinement.
The model is trained and evaluated on the Aff-Wild2 dataset, an audiovisual (A/V) dataset containing 594 videos with approximately 3 million frames from 584 subjects. Each frame is annotated with continuous valence and arousal values, representing emotional states along the dimensions of pleasantness (valence) and intensity (arousal).
MAVEN consists of the following components:

- Modality-Specific Encoders:
  - Visual: Swin Transformer for capturing local and global visual patterns.
  - Audio: HuBERT for extracting acoustic features from raw audio waveforms.
  - Text: RoBERTa for linguistic analysis and semantic understanding.

- Cross-Modal Attention Mechanism:
  - Six distinct attention pathways (video-to-audio, video-to-text, audio-to-video, audio-to-text, text-to-video, and text-to-audio) enable bidirectional information flow between modalities.

- BEiT Multi-Headed Attention:
  - After cross-modal fusion, the enhanced features are refined using BEiT-based self-attention to capture global dependencies.

- Emotion Prediction:
  - The final output predicts emotions in polar coordinates (theta and intensity), which are then transformed into valence and arousal values.
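The pipeline above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's code: the token counts, embedding size, additive fusion, and the polar-to-Cartesian mapping (valence = r·cos θ, arousal = r·sin θ) are all assumptions made for clarity.

```python
# Minimal sketch of bidirectional cross-modal attention plus the polar
# emotion mapping. All dimensions and the fusion rule are illustrative
# assumptions, not the paper's exact implementation.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, keys, d):
    """Attend from one modality (queries) to another (keys/values)."""
    scores = queries @ keys.T / np.sqrt(d)      # (Tq, Tk) affinities
    return softmax(scores, axis=-1) @ keys      # (Tq, d) weighted summary

rng = np.random.default_rng(0)
d = 16                                  # shared embedding size (assumed)
video = rng.standard_normal((8, d))     # e.g. 8 video tokens
audio = rng.standard_normal((12, d))    # e.g. 12 audio tokens
text  = rng.standard_normal((5, d))     # e.g. 5 text tokens

# Six pathways: each modality queries the other two, and the attended
# summaries refine its representation (simple additive fusion here).
video_ref = video + cross_attend(video, audio, d) + cross_attend(video, text, d)
audio_ref = audio + cross_attend(audio, video, d) + cross_attend(audio, text, d)
text_ref  = text  + cross_attend(text,  video, d) + cross_attend(text,  audio, d)

# Pool the refined streams, then map a polar prediction to valence/arousal.
fused = np.concatenate([video_ref.mean(0), audio_ref.mean(0), text_ref.mean(0)])
theta, r = 0.5, 0.8                     # placeholder polar-coordinate output
valence, arousal = r * np.cos(theta), r * np.sin(theta)
```

In the actual model the refined features are further processed by the BEiT self-attention stage before prediction; the sketch collapses that into mean pooling for brevity.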
The model is trained using the following setup:
- Optimizer: Adam with a learning rate of 1e-4 and weight decay of 1e-4.
- Learning Rate Scheduler: ReduceLROnPlateau with a factor of 0.1 and patience of 5.
- Batch Size: 8.
- Training Duration: 100 epochs, with early stopping (patience of 10 epochs).
Pre-trained feature extractors (Swin, HuBERT, RoBERTa, and BEiT-3) are frozen during training to focus optimization on the fusion and prediction layers.
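The scheduler logic above can be illustrated with a small standalone helper. This is a hypothetical sketch of ReduceLROnPlateau-style behavior, not the actual PyTorch scheduler: when validation loss fails to improve for more than `patience` consecutive epochs, the learning rate is multiplied by `factor`.

```python
# Hypothetical sketch of ReduceLROnPlateau-style scheduling
# (lr=1e-4, factor=0.1, patience=5, matching the setup above).
class PlateauScheduler:
    def __init__(self, lr=1e-4, factor=0.1, patience=5):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")   # best validation loss seen so far
        self.bad_epochs = 0        # epochs since last improvement

    def step(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:  # plateau exceeded: decay lr
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr

sched = PlateauScheduler()
losses = [1.0] * 8                     # validation loss never improves
lrs = [sched.step(loss) for loss in losses]
```

After the patience window is exhausted, the learning rate drops from 1e-4 to 1e-5 and the counter resets.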
The performance of the model is evaluated using the Concordance Correlation Coefficient (CCC) for both valence and arousal. The overall performance measure is the mean of the valence and arousal CCCs.
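The CCC rewards both correlation and agreement in scale and location between predictions and labels. A minimal NumPy implementation, with the combined score taken as the mean of the two CCCs (the averaging convention is assumed here):

```python
# Sketch of the Concordance Correlation Coefficient (CCC) used to
# evaluate valence and arousal; the overall score averages the two.
import numpy as np

def ccc(y_true, y_pred):
    mu_t, mu_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mu_t) * (y_pred - mu_p)).mean()
    # Penalizes both decorrelation and mean/scale mismatch.
    return 2 * cov / (var_t + var_p + (mu_t - mu_p) ** 2)

t = np.linspace(-1, 1, 100)            # toy ground-truth trace
perfect = ccc(t, t)                    # identical prediction -> 1.0
shrunk = ccc(t, 0.5 * t)               # right shape, wrong scale -> penalized
score = 0.5 * (perfect + shrunk)       # mean of valence/arousal CCCs
```

Note that a prediction with the right temporal shape but shrunken scale still loses CCC, unlike plain Pearson correlation.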
A baseline model (pre-trained ResNet-50) is used for comparison on the validation set.
MAVEN demonstrates superior performance in capturing the complex and nuanced nature of emotional expressions in conversational videos. The model achieves SOTA results on the Aff-Wild2 dataset, significantly outperforming the baseline.
To train and evaluate the MAVEN model, follow these steps:

- Clone the Repository:

  ```bash
  git clone https://github.com/Vrushank-Ahire/MAVEN_8th_ABAW.git
  cd MAVEN
  ```

- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the Aff-Wild2 Dataset:
  - Ensure you have access to the Aff-Wild2 dataset and place it in the `data/` directory.

- Train the Model:

  ```bash
  python embeddings.py
  python TrainBEiT.py
  python TrainMLP.py
  ```

- Evaluate the Model:

  ```bash
  python Test.py
  ```
Citation

If you like our work, please cite our paper:
```bibtex
@InProceedings{Ahire_2025_CVPR,
  author    = {Ahire, Vrushank and Shah, Kunal and Khan, Mudasir and Pakhale, Nikhil and Sookha, Lownish and Ganaie, Mudasir and Dhall, Abhinav},
  title     = {MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {5789-5799}
}
```

Paper Access

📄 Paper Link: Full Paper Access
📁 Repository: The paper PDF is also available in this repository: MAVEN_Multi-modal_Attention_for_Valence-Arousal_Emotion_Network_CVPRW_2025_paper.pdf
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or issues, please open an issue on GitHub or contact the authors.

