简体中文 | English
EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation
¹Core Contributor  ²Corresponding Authors
- EchoMimicV1: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning. GitHub
- EchoMimicV2: Towards Striking, Simplified, and Semi-Body Human Animation. GitHub
- EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation. GitHub
- [2026.01.22] 🔥 We have updated EchoMimicV3-Flash-Pro on Huggingface.
- 🚀 8-step High-quality Generation.
- 🧩 No Face Mask required.
- 💾 12G VRAM Requirement.
- ✅ Supports up to 768×768 Resolution.
- [2025.11.09] 🔥 EchoMimicV3 is accepted by AAAI 2026.
- [2025.08.21] 🔥 The EchoMimicV3 Gradio demo on ModelScope is ready.
- [2025.08.12] 🔥🚀 12G VRAM is All You Need to Generate Video. Please use this Gradio UI. Check out the tutorial from @gluttony-10. Thanks for the contribution.
- [2025.08.12] 🔥 EchoMimicV3 can run on 16G VRAM using ComfyUI. Thanks to @smthemex for the contribution.
- [2025.08.09] 🔥 We release our models on ModelScope.
- [2025.08.08] 🔥 We release our codes on GitHub and models on Huggingface.
- [2025.07.08] 🔥 Our paper is publicly available on arXiv.
Demo videos: teaser_github.mp4 | hoi_github.mp4 | 01.mp4 | 02.mp4 | 03.mp4 | 04.mp4
For more demo videos, please refer to the project page.
- Tested System Environment: CentOS 7.2 / Ubuntu 22.04, CUDA >= 12.1
- Tested GPUs: A100 (80G) / RTX 4090D (24G) / V100 (16G)
- Tested Python Version: 3.10 / 3.11
To get started quickly with the quantized version, please use the one-click installation package (access code: glut).
conda create -n echomimic_v3 python=3.10
conda activate echomimic_v3
pip install -r requirements.txt
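Before downloading the weights, you can run a quick sanity check to confirm that PyTorch sees your GPU. This is a minimal sketch; it assumes nothing beyond the packages installed by requirements.txt:

```python
# Minimal environment sanity check (illustrative, not part of the repo).
import sys
import torch

print(f"Python:  {sys.version.split()[0]}")          # tested: 3.10 / 3.11
print(f"PyTorch: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    free, total = torch.cuda.mem_get_info()
    print(f"GPU:  {torch.cuda.get_device_name(0)}")
    print(f"VRAM: {free / 1024**3:.1f} GB free / {total / 1024**3:.1f} GB total")
```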
| Models | Download Link | Notes |
|---|---|---|
| Wan2.1-Fun-V1.1-1.3B-InP | 🤗 Huggingface | Base model |
| wav2vec2-base | 🤗 Huggingface | Audio encoder for preview |
| chinese-wav2vec2-base | 🤗 Huggingface | Audio encoder for flash-pro |
| EchoMimicV3-preview | 🤗 Huggingface | preview weights |
| EchoMimicV3-preview | 🤗 ModelScope | preview weights |
| EchoMimicV3-flash-pro | 🤗 Huggingface | Flash-Pro weights |
The weights of EchoMimicV3-flash-pro are organized as follows:
./flash-pro/
├── Wan2.1-Fun-V1.1-1.3B-InP
├── chinese-wav2vec2-base
└── transformer
└── diffusion_pytorch_model.safetensors
The weights of EchoMimicV3-preview are organized as follows:
./preview/
├── Wan2.1-Fun-V1.1-1.3B-InP
├── wav2vec2-base-960h
└── transformer
└── diffusion_pytorch_model.safetensors
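The weights can also be fetched with the huggingface_hub client. The snippet below is a sketch of downloading into the preview layout above; the repository IDs are assumptions based on the model names in the table, so please verify them against the actual Huggingface model cards:

```python
# Illustrative download script. Repo IDs below are assumptions -- verify on Huggingface.
from huggingface_hub import snapshot_download

snapshot_download(repo_id="alibaba-pai/Wan2.1-Fun-V1.1-1.3B-InP",   # base model (assumed ID)
                  local_dir="./preview/Wan2.1-Fun-V1.1-1.3B-InP")
snapshot_download(repo_id="facebook/wav2vec2-base-960h",            # audio encoder for preview
                  local_dir="./preview/wav2vec2-base-960h")
snapshot_download(repo_id="antgroup/EchoMimicV3",                   # preview weights (assumed ID)
                  local_dir="./preview/transformer")
```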
bash run_flash_pro.sh
python infer_preview.py
For the quantized Gradio UI version of EchoMimicV3-preview:
python app_mm.py
Images, audios, masks, and prompts are provided in datasets/echomimicv3_demos.
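To see which demo assets are bundled before running inference, a quick listing such as the following can be used (a small sketch, assuming only the datasets/echomimicv3_demos directory mentioned above):

```python
# List the bundled demo assets (illustrative).
from pathlib import Path

demo_root = Path("datasets/echomimicv3_demos")
for path in sorted(demo_root.rglob("*")):
    if path.is_file():
        print(path.relative_to(demo_root))
```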
- Audio CFG: `audio_guidance_scale` works optimally between 1.8~2. Increase the audio CFG value for better lip synchronization; decrease it to improve visual quality.
- Text CFG: `guidance_scale` works optimally between 3~6. Increase the text CFG value for better prompt following; decrease it to improve visual quality.
- TeaCache: the optimal range for `teacache_threshold` is 0~0.1.
- Sampling steps: 5 steps for talking head, 15~25 steps for talking body.
- Long video generation: if you want to generate a video longer than 138 frames, use Long Video CFG.
- Try setting `partial_video_length` to 81, 65, or smaller to reduce VRAM usage.
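As a quick reference, the tips above can be summarized as a set of starting values. The dictionary below is purely illustrative; the key names mirror the parameters mentioned above and may not match the exact argument names used by infer_preview.py or the Gradio UI:

```python
# Illustrative starting values based on the tuning tips above.
# Key names are assumptions, not necessarily the exact CLI/config keys of the scripts.
starting_settings = {
    "audio_guidance_scale": 2.0,   # 1.8~2: higher = better lip sync, lower = better visuals
    "guidance_scale": 4.5,         # 3~6: higher = better prompt following, lower = better visuals
    "teacache_threshold": 0.05,    # optimal range 0~0.1
    "num_inference_steps": 5,      # 5 for talking head, 15~25 for talking body
    "partial_video_length": 81,    # try 81, 65, or smaller to reduce VRAM usage
}
```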
If you find our work useful for your research, please consider citing the paper:
@misc{meng2025echomimicv3,
title={EchoMimicV3: 1.3B Parameters are All You Need for Unified Multi-Modal and Multi-Task Human Animation},
  author={Rang Meng and Yan Wang and Weipeng Wu and Ruobing Zheng and Yuming Li and Chenguang Ma},
year={2025},
eprint={2507.03905},
archivePrefix={arXiv}
}
- Wan2.1: https://github.com/Wan-Video/Wan2.1/
- VideoX-Fun: https://github.com/aigc-apps/VideoX-Fun/
The models in this repository are licensed under the Apache 2.0 License. We claim no rights over your generated content; you are free to use it, provided your usage complies with the provisions of this license. You are fully accountable for your use of the models, which must not involve sharing any content that violates applicable laws, causes harm to individuals or groups, disseminates personal information intended for harm, spreads misinformation, or targets vulnerable populations.


