Luis Denninger1 · Sina Mokhtarzadeh Azar1 · Jürgen Gall1,2
1University of Bonn, Germany 2Lamarr Institute for Machine Learning and Artificial Intelligence
All results are reported after 50K training steps, using 25 DDIM steps and a guidance scale of 7.5 as used by the baselines; our model performs best at a guidance scale of 3.5. Lower is better for all metrics.
| Method | FVD (VideoGPT) | FVD (StyleGAN) | MSE | TransErr | RotErr | CamMC |
|---|---|---|---|---|---|---|
| MotionCtrl | 78.30 | 64.47 | 3654.54 | 2.89 | 2.04 | 4.34 |
| CameraCtrl | 71.22 | 58.05 | 3130.63 | 2.54 | 1.84 | 3.85 |
| CamI2V | 71.01 | 57.90 | 2692.84 | 1.79 | 1.16 | 2.58 |
| Ours | 53.90 | 45.36 | 2579.96 | 1.53 | 1.09 | 2.29 |
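The FVD columns above compare I3D features extracted from real and generated videos via the Fréchet distance between Gaussians fitted to each feature set. A minimal numpy-only sketch of that distance computation (illustrative only; the evaluation repository's actual implementation may differ):

```python
import numpy as np

def frechet_distance(feats_real, feats_fake):
    """Frechet distance between Gaussians fitted to two feature sets.

    feats_*: (N, D) arrays, one row of I3D features per video.
    """
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    # Tr((S1 S2)^{1/2}) via eigenvalues: S1 S2 is similar to a PSD matrix,
    # so its spectrum is real and non-negative (clip numerical noise).
    eigvals = np.linalg.eigvals(s1 @ s2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1) + np.trace(s2) - 2.0 * tr_sqrt)
```

Identical feature sets give a distance of zero; shifting one set moves the distance by the squared mean difference.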
Initialize your Python environment and install PyTorch:

```bash
conda create -n camcontexti2v python=3.10
conda activate camcontexti2v
conda install -y pytorch==2.4.1 torchvision==0.19.1 pytorch-cuda=12.1 -c pytorch -c nvidia
conda install -y xformers -c xformers
```

Install all other requirements using:

```bash
pip install -r requirements.txt
```

Finally, download all required checkpoints and place them as follows:
| Model | Location |
|---|---|
| CamContextI2V | ./ckpts/256_camcontexti2v.pt |
| DynamiCrafter | ./ckpts/dynamicrafter/model.ckpt |
| CamI2V | ./ckpts/256_cami2v.pt |
| CameraCtrl | ./ckpts/256_cameractrl.pt |
| MotionCtrl | ./ckpts/256_motionctrl.pt |
| I3D (VideoGPT) | ./ckpts/videogpt/i3d_pretrained_400.pt |
| I3D (StyleGAN) | ./ckpts/stylegan/i3d_torchscript.pt |
For the evaluation pipeline, you additionally need to clone and install the following repository:

```bash
git clone [email protected]:LDenninger/FVD.git evaluation
pip install -e evaluation/FVD
```

To install GLOMAP, first install COLMAP using the provided instructions, then follow the installation guide in the repository.
This project uses the RealEstate10K dataset, which needs to be downloaded from YouTube. First, obtain the metadata for the videos:

```bash
wget https://storage.cloud.google.com/realestate10k-public-files/RealEstate10K.tar.gz
```

How you download and unpack the video clips is up to you, but we recommend following the guide proposed here. Additionally, you will need the video captions generated by CameraCtrl.
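Each metadata txt file follows the public RealEstate10K format: the first line is the YouTube URL, and every following line holds a frame timestamp in microseconds, four normalized intrinsics, two unused fields, and a row-major 3x4 world-to-camera pose. A minimal parser sketch (a hypothetical helper, not part of this repository):

```python
import numpy as np

def parse_realestate10k_meta(path):
    """Parse one RealEstate10K metadata file into url, timestamps, intrinsics, poses."""
    with open(path) as f:
        lines = f.read().strip().splitlines()
    url = lines[0]
    timestamps, intrinsics, poses = [], [], []
    for line in lines[1:]:
        vals = line.split()
        timestamps.append(int(vals[0]))                    # frame timestamp in microseconds
        intrinsics.append([float(v) for v in vals[1:5]])   # fx, fy, cx, cy (normalized)
        poses.append(np.array(vals[7:19], dtype=np.float64).reshape(3, 4))  # world-to-camera [R|t]
    return url, timestamps, np.array(intrinsics), np.stack(poses)
```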
The final dataset should have the following structure:

```
─┬─ RealEstate10K
 ├─┬─ valid_meta             # Directories holding txt files containing all metadata
 │ ├─── train
 │ └─── test
 ├─┬─ video_clips            # Directories holding the video clips
 │ ├─── train
 │ └─── test
 ├─── test_captions.json     # Test captions
 ├─── train_captions.json    # Train captions
 ├─── train_valid_list.txt   # File listing all train video names
 └─── test_valid_list.txt    # File listing all test video names
```
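Before training, you can sanity-check this layout with a short script (illustrative only, not part of the repository):

```python
import os

# Entries expected under the RealEstate10K root, per the structure above.
EXPECTED = [
    "valid_meta/train", "valid_meta/test",
    "video_clips/train", "video_clips/test",
    "test_captions.json", "train_captions.json",
    "train_valid_list.txt", "test_valid_list.txt",
]

def check_dataset(root):
    """Return the list of expected entries missing under the dataset root."""
    return [p for p in EXPECTED if not os.path.exists(os.path.join(root, p))]
```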
This project defines the directory and machine setup in CamContextI2V/utils/meta.py.
Before running anything, please adjust this file to your setup.
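The exact contents of meta.py are specific to this repository; purely as an illustration, a per-machine setup could map a machine name to local paths like this (all field names and paths here are hypothetical, not the actual file):

```python
# Hypothetical sketch of a per-machine setup; the actual fields in
# CamContextI2V/utils/meta.py may differ -- adapt names and paths to your system.
MACHINES = {
    "local": {
        "data_root": "/path/to/RealEstate10K",  # dataset root from the structure above
        "ckpt_dir": "./ckpts",                  # checkpoint directory from the table above
        "output_dir": "./outputs",              # where runs and generated videos go
    },
}

def get_machine(name):
    """Look up the path configuration for one machine name."""
    return MACHINES[name]
```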
Training your model:
```bash
python CamContextI2V/01_train.py -r <run name> -c <config file> -m <machine to run on>
```

For detailed information on the command-line arguments, run `python CamContextI2V/01_train.py -h`.
Running inference:
```bash
python CamContextI2V/02_generate_videos.py <run name>
```

For detailed information on the command-line arguments, run `python CamContextI2V/02_generate_videos.py -h`.
Evaluation:
```bash
python CamContextI2V/03_evaluation.py -p <video path> -o <output path> --max-videos-in-mem <videos to keep in RAM> [--fvd/--extended/--glomap]
```

Visualization: To start an interactive Gradio visualization, run:

```bash
python CamContextI2V/04_visualize.py
```

We thank the authors of CamI2V for their implementation of the camera pose conditioning and the authors of DynamiCrafter for the implementation of the base model.
```bibtex
@article{denninger2025camcontexti2v,
  title={CamContextI2V: Context-aware Controllable Video Generation},
  author={Denninger, Luis and Mokhtarzadeh Azar, Sina and Gall, Juergen},
  journal={},
  year={2025}
}
```
