Do language models understand time? In the kitchen arena, where burritos are rolled, rice waits patiently, and sauce steals the spotlight, LLMs try their best to keep up. Captions flow like a recipe, precise and tempting, but can they truly tell the difference between prepping, cooking, and eating? After all, in cooking, timing isn't just everything; it's the secret sauce!
A collection of papers and resources related to Large Language Models in the video domain.
For more details, please refer to our paper.
Please let us know if you find a mistake or have any suggestions by e-mail: Xi.Ding1@anu.edu.au
If you find our work useful for your research, please cite the following paper:
@inproceedings{10.1145/3701716.3717744,
  author    = {Ding, Xi and Wang, Lei},
  title     = {Do Language Models Understand Time?},
  year      = {2025},
  isbn      = {9798400713316},
  publisher = {Association for Computing Machinery},
  address   = {New York, NY, USA},
  url       = {https://doi.org/10.1145/3701716.3717744},
  doi       = {10.1145/3701716.3717744},
  pages     = {1855--1868},
  numpages  = {14},
  keywords  = {interaction, large language models, temporal, videos},
  location  = {Sydney NSW, Australia},
  series    = {WWW '25}
}
[10/02/2025] The GitHub repository for our paper has been released.
[27/01/2025] Our paper has been accepted for oral presentation in the Companion Proceedings of The Web Conference 2025 (WWW 2025).
Performance comparison of visual encoders. (Left): Image classification accuracy for various image encoders pretrained and fine-tuned on the ImageNet-1K dataset. (Right): Action recognition accuracy for different video encoders pretrained and fine-tuned on the Kinetics-400 and Something-Something V2 datasets.
Models with Image Encoder
The tables below present summaries of the latest multimodal video-LLMs with image encoders and their interaction and fusion mechanisms.
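Many of the models listed in these tables fuse vision and language with a simple projection module (see the "Linear layer", "MLP", and "MM projector" entries). As a rough illustration of that pattern, here is a minimal PyTorch sketch of an MLP projector that maps frozen image-encoder frame features into the LLM's token embedding space. The class name, dimensions, and layer sizes are hypothetical assumptions for illustration, not taken from any specific model below.

```python
import torch
import torch.nn as nn

class FrameProjector(nn.Module):
    """Illustrative two-layer MLP projector (hypothetical dimensions):
    maps per-frame features from a frozen image encoder into the
    LLM's token embedding space."""
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames * tokens_per_frame, vision_dim)
        return self.proj(frame_feats)  # (batch, tokens, llm_dim)

# The projected visual tokens are typically prepended to (or interleaved
# with) the text token embeddings before being fed to the LLM.
visual_tokens = FrameProjector()(torch.randn(1, 8 * 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 2048, 4096])
```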
Click to expand Table 1
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Flamingo | NeurIPS 2022 | Text: Chinchilla | Perceiver Resampler & Gated XATTN-DENSE | Visual-language model. | GitHub |
Click to expand Table 2
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| mPLUG-2 | ICML 2023 | Text: BERT | Universal layers & cross-attention modules | Modularized multi-modal foundation model. | GitHub |
| Vid2Seq | CVPR 2023 | Text: T5-Base | Cross-modal attention | Sequence-to-sequence video-language model. | GitHub |
| Video-LLaMA | EMNLP 2023 | Text: Vicuna, Audio: ImageBind | Aligned via Q-Formers for video and audio | Instruction-tuned multimodal model. | GitHub |
| Video-ChatGPT | ACL 2023 | Text: Vicuna-v1.1 | Spatiotemporal features projected via linear layer | Integration of vision and language for video understanding. | GitHub |
| Valley | arXiv 2023 | Text: StableVicuna | Projection layer | LLM for video assistant tasks. | GitHub |
| Macaw-LLM | arXiv 2023 | Text: LLAMA-7B, Audio: Whisper | Alignment module unifies multi-modal representations | Multimodal integration using image, audio, and video inputs. | GitHub |
| Auto-AD II | CVPR 2023 | Text: BERT | Cross-attention layers | Movie description using vision and language. | GitHub |
| GPT4Video | ACMMM 2023 | Text: LLaMA 2 | Transformer-based cross-attention layer | Video understanding with LLM-based reasoning. | - |
| LLaMA-VID | ECCV 2023 | Text: Vicuna | Context attention and linear projector | LLaMA-VID for visual-textual alignment in video. | GitHub |
| COSMO | arXiv 2024 | Text: OPT-IML/RedPajama/Mistral | Gated cross-attention | Contrastive-streamlined multimodal model. | - |
| VTimeLLM | CVPR 2024 | Text: Vicuna | Linear layer | Temporal video understanding enhanced with LLMs. | GitHub |
| VILA | CVPR 2024 | Text: LLaMA-2-7B/13B | Linear layer | Vision-language model. | GitHub |
| PLLaVA | arXiv 2024 | Text: LLAMA-7B | MM projector with adaptive pooling | Parameter-free extension for video captioning tasks. | GitHub |
| V2Xum-LLaMA | arXiv 2024 | Text: LLaMA 2 | Vision adapter | Video summarization using temporal prompt tuning. | GitHub |
| VideoGPT+ | arXiv 2024 | Text: Phi-3-Mini-3.8B | MLP | Enhanced video understanding. | GitHub |
| EmoLLM | arXiv 2024 | Text: Vicuna-v1.5, Audio: Whisper | Multi-perspective visual projection | Multimodal emotional understanding with improved reasoning. | GitHub |
| ShareGPT4Video | arXiv 2024 | Text: Mistral-7B-Instruct-v0.2 | MLP | Precise and detailed video captions with hierarchical prompts. | GitHub |
| VideoLLaMA 2 | arXiv 2024 | Text: LLAMA 1.5, Audio: BEATs | Spatial-Temporal Convolution (STC) connector | Advancing spatial-temporal modeling and audio understanding. | GitHub |
| VideoLLM-online | CVPR 2024 | Text: Llama-2-Chat/Llama-3-Instruct | MLP projector | Online video large language model for streaming video. | GitHub |
| LongVA | arXiv 2024 | Text: Qwen2-Extended | MLP | Long-context video understanding. | GitHub |
| InternLM-XComposer-2.5 | arXiv 2024 | Text: InternLM2-7B, Audio: Whisper | MLP | Long-context LVLM supporting ultra-high-resolution video tasks. | GitHub |
| Qwen2-VL | arXiv 2024 | Text: Qwen2-7B | Cross-attention modules | Vision-language model for multimodal tasks. | GitHub |
| Video-XL | arXiv 2024 | Text: Qwen2-7B | Visual-language projector | Long-context video understanding model. | GitHub |
| SlowFocus | NeurIPS 2024 | Text: Vicuna-7B v1.5 | Visual adapter (projector layer) | Fine-grained temporal understanding in video LLM. | GitHub |
| VideoStudio | ECCV 2024 | Text: CLIP ViT-H/14 | Cross-attention modules | Multi-scene video generation. | GitHub |
| VideoINSTA | arXiv 2024 | Text: Llama-3-8B-Instruct | Self-reflective spatial-temporal fusion | Zero-shot long video understanding model. | GitHub |
| TRACE | arXiv 2024 | Text: Mistral-7B | Task-interleaved sequence modeling & Adaptive head-switching | Video temporal grounding via causal event modeling. | GitHub |
Click to expand Table 3
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoChat | arXiv 2023 | Text: StableVicuna, Audio: Whisper | Q-Former bridges visual features to LLMs for reasoning | Chat-centric model for video analysis. | GitHub |
| VAST | NeurIPS 2023 | Text: BERT, Audio: BEATs | Cross-attention layers | Omni-modality foundational model. | GitHub |
| VTG-LLM | arXiv 2024 | Text: LLaMA-2-7B | Projection layer | Enhanced video temporal grounding. | GitHub |
| AutoAD III | CVPR 2024 | Text: GPT-3.5-turbo | Shared Q-Former | Video description enhancement with LLMs. | GitHub |
| MA-LMM | CVPR 2024 | Text: Vicuna | A trainable Q-Former | Memory-augmented large multimodal model. | GitHub |
| MiniGPT4-Video | arXiv 2024 | Text: LLaMA 2 | Concatenates visual tokens and projects into LLM space | Video understanding with visual-textual token interleaving. | GitHub |
| Vriptor | arXiv 2024 | Text: ST-LLM, Audio: Whisper | Scene-level sequential alignment | Vriptor for dense video captioning. | GitHub |
| Kangaroo | arXiv 2024 | Text: Llama-3-8B-Instruct | Multi-modal projector | Video-language model supporting long-context video input. | - |
Click to expand Table 4
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| LAVAD | CVPR 2024 | Text: Llama-2-13b-chat | Converts video features into textual prompts for LLMs | Training-free video anomaly detection using LLMs. | GitHub |
Click to expand Table 5
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Video-CCAM | arXiv 2024 | Text: Phi-3-4k-instruct / Yi-1.5-9B-Chat | Cross-attention-based projector | Causal cross-attention masks for short and long videos. | GitHub |
| Apollo | arXiv 2024 | Text: Qwen2.5-7B | Perceiver Resampler & Token Integration with Timestamps | Video understanding model. | - |
Click to expand Table 6
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Oryx | arXiv 2024 | Text: Qwen2-7B/32B | Cross-attention | Spatial-temporal model for high-resolution understanding. | GitHub |
Models with Video Encoder
The tables below present summaries of the latest multimodal video-LLMs with video encoders and their interaction and fusion mechanisms.
Traditional (e.g., I3D, SlowFast)
Click to expand Table 7
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoLLM | arXiv 2023 | Text: e.g., BERT, T5 | Semantic translator aligns visual and text encodings | Video sequence modeling using LLMs. | GitHub |
| Loong | arXiv 2024 | Text: Standard text tokenizer | Decoder-only autoregressive LLM with causal attention | Autoregressive LLM for long video generation. | - |
Click to expand Table 8
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| LaViLa | CVPR 2022 | Text: 12-layer Transformer | Cross-attention modules | Large-scale language model. | GitHub |
| Video ReCap | CVPR 2024 | Text: GPT-2 | Cross-attention layers | Recursive hierarchical captioning model. | GitHub |
Click to expand Table 9
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| OmniViD | CVPR 2024 | Text: BART | MQ-Former | Generative model for universal video understanding. | GitHub |
Click to expand Table 10
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| VideoChat2 | CVPR 2024 | Text: Vicuna | Linear projection | A comprehensive multi-modal video understanding benchmark. | GitHub |
Click to expand Table 11
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| Video-LLaVA | arXiv 2023 | Text: Vicuna v1.5 | MLP projection layer | Unified visual representation learning for video. | GitHub |
| MotionLLM | arXiv 2024 | Text: Vicuna | Motion / Video translator | Understanding human behaviors from human motions and videos. | GitHub |
| Holmes-VAD | arXiv 2024 | Text: LLaMA3-Instruct-70B | Temporal sampler | Multimodal LLM for video anomaly detection. | GitHub |
Click to expand Table 12
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| InternVideo2 | ECCV 2023 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
Click to expand Table 13
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| InternVideo2 | ECCV 2023 | Text: BERT-Large, Audio: BEATs | Q-Former aligns multi-modal embeddings | Foundation model for multimodal video understanding. | GitHub |
| VITA | arXiv 2024 | Text: Mixtral-8x7B, Audio: Mel Filter Bank | MLP | Open-source interactive multimodal LLM. | GitHub |
Click to expand Table 14
| Model | Venue | Other modality encoders | Interaction / Fusion mechanism | Description | Code |
|---|---|---|---|---|---|
| ChatVideo | arXiv 2023 | Text: ChatGPT, Audio: e.g., Whisper | Tracklet-centric with ChatGPT reasoning | Chat-based video understanding system. | Coming soon |
The distributions of interaction/fusion mechanisms and data modalities in 66 closely related video-LLMs from January 2024 to December 2024. (Left): Fusion mechanisms are classified into five categories: Cross-attention (e.g., cross-attention modules, gated cross-attention), Projection layers (e.g., linear projection, MLP projection), Q-Former-based methods (e.g., Q-Former aligns multi-modal embeddings, trainable Q-Former), Motion/Temporal-specific mechanisms (e.g., temporal samplers, scene-level sequential alignment), and Other methods (e.g., tracklet-centric, Perceiver Resampler, MQ-Former). (Right): The distribution of data modalities used in these video-LLMs, with text modalities appearing across all models. Note that a model may use multiple fusion methods and/or data modalities.
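For the cross-attention category above, the gating idea popularized by Flamingo is a useful reference point. The sketch below is a minimal, illustrative PyTorch block (dimensions, head count, and initialization are assumptions, not any listed model's actual code): text hidden states attend to visual tokens, and zero-initialized tanh gates let a pretrained LLM start out unchanged and gradually learn to use the visual stream.

```python
import torch
import torch.nn as nn

class GatedCrossAttentionBlock(nn.Module):
    """Illustrative Flamingo-style gated cross-attention block
    (hypothetical sizes): text tokens attend to visual tokens."""
    def __init__(self, dim=4096, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        # Zero-initialized gates: the block is a no-op at the start of training.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_tokens):
        # text_hidden: (B, T_text, dim); visual_tokens: (B, T_visual, dim)
        attn_out, _ = self.attn(text_hidden, visual_tokens, visual_tokens)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x

block = GatedCrossAttentionBlock()
out = block(torch.randn(2, 16, 4096), torch.randn(2, 64, 4096))
print(out.shape)  # torch.Size([2, 16, 4096])
```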
The tables below provide a comprehensive overview of video datasets across various tasks.
Click to expand Table 15
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| HMDB51 | 2011 | YouTube | 6,766 | Video | 3~4 | No | Daily human actions |
| UCF101 | 2012 | YouTube | 13,320 | Video+Audio | 7.21 | No | Human actions (e.g., sports, daily activities) |
| ActivityNet | 2015 | YouTube | 27,801 | Video+Text | 300~1200 | Temporal extent provided | Human-centric activities |
| Charades | 2016 | Crowdsourced | 9,848 | Video+Text | 30.1 | Start and end timestamps provided | Household activities |
| Kinetics-400 | 2017 | YouTube | 306,245 | Video | 10 | No | Human actions (e.g., sports, tasks) |
| AVA | 2018 | Movies | 430 | Video | Variable | Start and end timestamps provided | Action localization in movie scenes |
| Something-Something V2 | 2018 | Crowdsourced | 220,847 | Video | 2~6 | Weak | Human-object interactions |
| COIN | 2019 | YouTube | 11,827 | Video+Text | 141.6 | Start and end timestamps provided | Comprehensive instructional tasks (e.g., cooking, repair) |
| Kinetics-700 | 2019 | YouTube | 650,317 | Video | 10 | No | Expanded version of Kinetics-400 and Kinetics-600 |
| EPIC-KITCHENS | 2020 | Participant kitchens | 432 | Video+Text+Audio | ~458 | Start and end timestamps provided | Largest egocentric video dataset |
| Ego4D | 2021 | Wearable cameras | 3,850 hours | Video+Text+Audio | Variable | Start and end timestamps provided | First-person activities and interactions |
| VidSitu | 2021 | YouTube | 29,000 | Video+Text | ~10 | Temporal extent for events provided | Event-centric and causal activity annotations |
Click to expand Table 16
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| MovieQA | 2016 | Multiple platforms | 408 | Video+Text | 202.7 | Start and end timestamps provided | QA for movie scenes |
| TGIF-QA | 2016 | Tumblr GIFs | 56,720 | Video+Text | 3~5 | Action timestamps provided | QA over social media GIFs |
| MSVD-QA | 2017 | YouTube | 1,970 | Video+Text | 27.5 | Start and end timestamps provided | QA over action descriptions |
| MSRVTT-QA | 2017 | YouTube | 10,000 | Video+Text | 15~30 | Weak | QA across diverse scenes |
| TVQA | 2019 | TV Shows | 21,793 | Video+Text | 60~90 | Start and end timestamps provided | QA over medical dramas, sitcoms, crime shows |
| ActivityNet-QA | 2019 | YouTube | 5,800 | Video+Text | 180 | Implicit (derived from ActivityNet) | QA for human-annotated videos |
| How2QA | 2020 | HowTo100M (YouTube) | 22,000 | Video+Text | 60 | Temporal extent provided | QA over instructional videos |
| YouCookQA | 2021 | YouCook2 (YouTube) | 2,000 | Video+Text | 316.2 | Temporal boundaries provided | Cooking-related instructional QA |
| STAR | 2021 | Human activity datasets | 22,000 | Video+Text | Variable | Action-level boundaries provided | QA over human-object interactions |
| MVBench | 2023 | Public datasets | 3,641 | Video+Text | 5~35 | Start and end timestamps provided | Multi-domain QA (e.g., sports, indoor scenes) |
| EgoSchema | 2023 | Ego4D (wearable cameras) | 5,063 | Video+Text | 180 | Timestamped narrations provided | Long-form egocentric activities |
Click to expand Table 17
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| YouCook | 2013 | YouTube | 88 | Video+Text | 180~300 | Weak | Cooking instructional videos |
| MSR-VTT | 2016 | YouTube | 7,180 | Video+Text+Audio | 10~30 | Weak | General scenarios (e.g., sports, transport) |
| ActivityNet Captions | 2017 | YouTube | 20,000 | Video+Text | 180 | Start and end timestamps provided | Dense captions for human-centered activities |
| VATEX | 2019 | YouTube | 41,250 | Video+Text | ~10 | Weak | Multilingual descriptions with English-Chinese parallel captions |
| HowTo100M | 2019 | YouTube | 1.22M | Video+Text+Audio | 390 | Subtitle timestamps provided | Instructional video captions |
| TVC | 2020 | TV Shows | 108,965 | Video+Text | 76.2 | Start and end timestamps provided | Multimodal video captioning dataset |
Click to expand Table 18
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| LSMDC | 2015 | Movies | 118,114 | Video+Text | 4.8 | Start and end timestamps provided | Large-scale dataset for movie description tasks |
| DiDeMo | 2017 | Flickr (YFCC100M) | 10,464 | Video+Text | 27.5 | Start and end timestamps provided | Moment localization in diverse, unedited personal videos |
| FIVR-200K | 2019 | YouTube | 225,960 | Video | ~120 | Start and end timestamps provided | Large-scale incident video retrieval dataset with diverse news events |
| TVR | 2020 | TV Shows | 21,793 | Video+Text | 76.2 | Start and end timestamps provided | Video-subtitle multimodal moment retrieval dataset |
| TextVR | 2023 | YouTube | 10,500 | Video+Text | 15 | Weak | Cross-modal video retrieval with text reading comprehension |
| EgoCVR | 2024 | Ego4D | 2,295 | Video+Text | 3.9~8.1 | Weak | Egocentric dataset for fine-grained composed video retrieval |
Click to expand Table 19
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| Subway Entrance | 2008 | Surveillance cameras | 1 | Video | 4,800 | No | Crowd monitoring for unusual event detection at subway entrances |
| Subway Exit | 2008 | Surveillance cameras | 1 | Video | 5,400 | No | Crowd monitoring for unusual event detection at subway exits |
| CUHK Avenue | 2013 | Surveillance cameras | 15 | Video | 120 | No | Urban avenue scenes with anomalies such as running and loitering |
| Street Scene | 2020 | Urban street surveillance | 81 | Video | 582 | Spatial and temporal bounding boxes | Urban street anomalies (e.g., jaywalking, loitering, illegal parking) |
| XD-Violence | 2020 | Movies and in-the-wild scenes | 4,754 | Video+Audio | ~180 | Start and end timestamps provided | Multimodal violence detection covering six violence types |
| CUVA | 2024 | YouTube, Bilibili | 1,000 | Video+Text | ~117 | Start and end timestamps provided | Causation-focused anomaly understanding across 42 anomaly types |
| MSAD | 2024 | Online surveillance | 720 | Video | ~20 | Frame-level annotations in test set | Multi-scenario dataset with 14 scenarios |
Click to expand Table 20
| Dataset | Year | Source | # Videos | Modality | Avg. length (s) | Temporal annotation | Description |
|---|---|---|---|---|---|---|---|
| VIDAL-10M | 2023 | Multiple platforms | 10M | Video+Infrared+Depth+Audio+Text | ~20 | Weak | Multi-domain retrieval dataset |
| Video-MME | 2024 | YouTube | 900 | Video+Text+Audio | 1017.9 | Temporal ranges via certificate length | Comprehensive evaluation benchmark across many domains |
Left: Performance (accuracy) comparison of recent video-LLMs on the Video-MME benchmark. Right: Performance comparison of recent video-LLMs on video QA benchmarks. Models using pretrained video encoders (e.g., Video-LLaVA and VideoChat2) are marked with squares, while models using pretrained image encoders are represented by circles.
Performance comparison of recent video-LLMs on (a) video retrieval and (b) video captioning benchmarks.
Contribution
We warmly invite everyone to contribute to this repository and help enhance its quality and scope. Feel free to submit pull requests to add new methods, datasets, or other useful resources, or to correct any errors you discover. To ensure consistency, please format your pull requests to match the structure of the existing tables. We greatly appreciate your valuable contributions and support!