AnimateZero:
Video Diffusion Models are Zero-Shot Image Animators

Jiwen Yu¹ Xiaodong Cun²* Chenyang Qi³ Yong Zhang² Xintao Wang² Ying Shan² Jian Zhang¹*
*Corresponding Author.
¹Peking University ²Tencent AI Lab ³HKUST

[Arxiv] [Github]

Generated Image

Output Video


Our Unique Insights and Proposed Zero-Shot Methodology

Vanilla Video Diffusion Models (VDMs) suffer from the following issues, illustrated in (a) of the following figure:

Black box: the generation process remains a black box
Inefficient and uncontrollable: obtaining satisfactory results requires a large amount of trial and error
Domain gap: generation is limited by the domain of the video dataset used during training

We propose a step-by-step video generation pipeline to address the above issues, illustrated in (b) of the following figure:

Decoupling: the video generation process is decoupled into appearance (T2I) and motion (I2V)
Efficient and controllable: T2I generation is more controllable and efficient than T2V, so a satisfactory image can be obtained first and then animated with I2V
Mitigating the domain gap: the T2I model can be fine-tuned to match the target domain, which is more efficient than tuning the entire video model
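The decoupled pipeline above can be sketched as a small driver loop. This is only an illustration under stated assumptions: `t2i`, `i2v`, and `is_satisfactory` are hypothetical callables supplied by the caller (they are not AnimateZero APIs), and the retry-over-seeds policy is one plausible way to "obtain a satisfactory image first".

```python
from typing import Callable, List, TypeVar

Image = TypeVar("Image")
Frame = TypeVar("Frame")


def generate_video(
    prompt: str,
    t2i: Callable[[str, int], Image],           # hypothetical: (prompt, seed) -> image
    i2v: Callable[[Image, str], List[Frame]],   # hypothetical: (image, prompt) -> frames
    is_satisfactory: Callable[[Image], bool],   # hypothetical: user or automatic check
    max_tries: int = 8,
) -> List[Frame]:
    """Pick an acceptable image with the cheap T2I step, then animate it once."""
    for seed in range(max_tries):
        image = t2i(prompt, seed)       # appearance: cheap to retry
        if is_satisfactory(image):
            return i2v(image, prompt)   # motion: run only on the accepted image
    raise RuntimeError("no satisfactory image found within max_tries")
```

The point of the structure is that trial and error happens in the image stage, so the more expensive video stage runs once per accepted image.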

Moreover, we found that zero-shot modifications alone are enough to turn pre-trained T2V models into I2V models, which means that Video Diffusion Models are Zero-Shot Image Animators!

Gallery

Below, we showcase the videos generated by AnimateZero on a variety of personalized T2I models.

Model: ToonYou

Model: CarDos Anime

Model: Anything V5

Model: Counterfeit V3.0

Model: Realistic Vision V5.1

Model: Photon

Model: helloObject

We also demonstrate an approach to control the motion of generated videos using text. In the following examples, we control the final state of the video by interpolating the text embeddings, thus achieving text-controlled motion.

Generated Image

Output Video

+ "happy and smile"

+ "angry and serious"

+ "open mouth"

+ "very sad"
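The text-controlled motion described above relies on interpolating text embeddings. A minimal sketch follows, assuming plain linear interpolation with one embedding per frame; the page does not state the exact schedule, and the function name and the nested-list representation of embeddings are illustrative choices, not the authors' implementation.

```python
from typing import List, Sequence, Union

Nested = Union[float, Sequence["Nested"]]


def _lerp(a: Nested, b: Nested, w: float) -> Nested:
    """Element-wise linear interpolation over numbers or nested sequences."""
    if isinstance(a, (int, float)):
        return (1.0 - w) * a + w * b
    if len(a) != len(b):
        raise ValueError("embeddings must have the same shape")
    return [_lerp(x, y, w) for x, y in zip(a, b)]


def interpolate_text_embeddings(
    start_emb: Nested, end_emb: Nested, num_frames: int
) -> List[Nested]:
    """Return one embedding per frame, moving from start_emb to end_emb.

    Assumption: frame 0 uses the original prompt embedding and the last
    frame uses the embedding of the edited prompt (e.g. with "very sad").
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to interpolate")
    return [
        _lerp(start_emb, end_emb, i / (num_frames - 1))
        for i in range(num_frames)
    ]
```

Each frame would then be conditioned on its own embedding, so the first frame stays faithful to the original prompt while the final frame reflects the added phrase.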

Application: Video Editing

AnimateZero can also improve video editing for both generated and real videos. AnimateDiff (AD) is commonly used together with ControlNet (CN) for video editing, but this combination still suffers from a domain gap. AnimateZero (AZ) has a clear advantage here: it generates videos with higher subjective quality that match the given text prompt more closely.

Original Video

CN+AD

CN+AZ (ours)


"A girl swimming in lava"

"A girl is running in the forest, grassland"

"A candle is burning purple flames by the seaside"

"cute cat, colorful fur"

"A girl is dancing"

"a driving red car"

"A woman with red glasses"

"A young woman"

Application: Frame Interpolation

By extending the technique proposed in AnimateZero, a generated first frame and a generated last frame can be inserted simultaneously, producing a gradual transition between the two, i.e., frame interpolation.

First Frame

Last Frame

Output Video


Application: Looped Video Generation

Looped video generation is a special case of frame interpolation in which the inserted first and last frames are the same.
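The frame insertion behind these two applications can be pictured with a small sketch. It assumes that "inserting" a frame means overwriting the corresponding per-frame latent in the video latent sequence; the page does not specify at which denoising steps this happens or what else is modified, so the code shows only the overwrite itself, and both function names are hypothetical.

```python
from typing import List, Sequence, TypeVar

Latent = TypeVar("Latent")


def insert_keyframes(
    video_latents: Sequence[Latent], first: Latent, last: Latent
) -> List[Latent]:
    """Return a copy of the per-frame latents with both endpoints replaced."""
    if len(video_latents) < 2:
        raise ValueError("need at least two frames")
    out = list(video_latents)   # copy so the caller's sequence is untouched
    out[0] = first              # pin the first frame
    out[-1] = last              # pin the last frame
    return out


def insert_loop_keyframe(
    video_latents: Sequence[Latent], frame: Latent
) -> List[Latent]:
    """Looped video: the same frame is pinned at both ends."""
    return insert_keyframes(video_latents, frame, frame)
```

With both endpoints pinned, the remaining frames are left for the video model to fill in, which is what yields the gradual transition (or a loop when the two endpoints coincide).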

Generated Image

Output Video


Application: Real Image Animation

AnimateZero has the potential to animate real images, although a domain gap remains: results are limited by the domain of the T2I model used.

Real Image

Output Video


BibTeX

 @misc{yu2023animatezero,
    title={AnimateZero: Video Diffusion Models are Zero-Shot Image Animators},
    author={Yu, Jiwen and Cun, Xiaodong and Qi, Chenyang and Zhang, Yong and Wang, Xintao and Shan, Ying and Zhang, Jian},
    booktitle={arXiv preprint arXiv:2312.03793},
    year={2023}
  }

Project page template is borrowed from AnimateDiff.