AnimateZero:
Video Diffusion Models are Zero-Shot Image Animators

Jiwen Yu¹, Xiaodong Cun²*, Chenyang Qi³, Yong Zhang², Xintao Wang², Ying Shan², Jian Zhang¹*
*Corresponding author.
¹Peking University, ²Tencent AI Lab, ³HKUST

[arXiv]     [GitHub]

Generated Image

Output Video

Our Unique Insights and Proposed Zero-Shot Methodology

Vanilla video diffusion models (VDMs) have the issues illustrated in panel (a) of the following figure:

We propose a step-by-step video generation pipeline that addresses these issues, as illustrated in panel (b) of the following figure:

Moreover, we found that zero-shot modifications alone are enough to turn pre-trained text-to-video (T2V) models into image-to-video (I2V) models. In other words, video diffusion models are zero-shot image animators!


Gallery

Below, we showcase videos generated by AnimateZero on a variety of personalized text-to-image (T2I) models.

Model: ToonYou

Model: CarDos Anime

Model: Anything V5

Model: Counterfeit V3.0

Model: Realistic Vision V5.1

Model: Photon

Model: helloObject

We also demonstrate an approach to control the motion of generated videos using text. In the following examples, we control the final state of the video by interpolating the text embeddings, thus achieving text-controlled motion.

Generated Image

Output Video

+ "happy and smile"

+ "angry and serious"

+ "open mouth"

+ "very sad"
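
The text-controlled motion described above can be illustrated with a minimal sketch. It assumes that the interpolation is linear and per frame, with the first frame conditioned on the original prompt embedding and the last frame on the embedding of the prompt with the added text; the page does not state the exact schedule. The function name `interpolate_prompt_embeddings` and the array shapes are hypothetical, and the random arrays stand in for the output of a real text encoder.

```python
import numpy as np


def interpolate_prompt_embeddings(start_emb, end_emb, num_frames):
    """Blend two text embeddings linearly across the frames of a video.

    Assumption: frame 0 uses `start_emb`, the last frame uses `end_emb`,
    and frames in between use evenly spaced linear mixtures.
    """
    start_emb = np.asarray(start_emb, dtype=np.float32)
    end_emb = np.asarray(end_emb, dtype=np.float32)
    if start_emb.shape != end_emb.shape:
        raise ValueError("both embeddings must have the same shape")
    if num_frames < 2:
        raise ValueError("need at least two frames to interpolate")

    # One blending weight per frame, from 0 (start) to 1 (end).
    weights = np.linspace(0.0, 1.0, num_frames, dtype=np.float32)
    # Reshape so the weights broadcast over the embedding dimensions.
    weights = weights.reshape(-1, *([1] * start_emb.ndim))
    return (1.0 - weights) * start_emb + weights * end_emb


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Placeholder embeddings (tokens x channels); a real pipeline would
    # obtain these from its text encoder.
    base = rng.standard_normal((77, 768)).astype(np.float32)
    base_plus_text = rng.standard_normal((77, 768)).astype(np.float32)
    per_frame = interpolate_prompt_embeddings(base, base_plus_text, 16)
    print(per_frame.shape)
```

Each slice `per_frame[i]` would then condition frame `i` during sampling, so the video drifts from the original prompt toward the edited one.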

Application: Video Editing

AnimateZero can also improve video editing, for both generated and real videos. AnimateDiff (AD) is commonly paired with ControlNet (CN) for video editing, but this combination still suffers from a domain gap. AnimateZero (AZ) has a clear advantage here: its videos have higher subjective quality and match the given text prompt more closely.

Original Video

CN+AD

CN+AZ (ours)

"A girl swimming in lava"

"A girl is running in the forest, grassland"

"A candle is burning purple flames by the seaside"

"cute cat, colorful fur"

"A girl is dancing"

"a driving red car"

"A woman with red glasses"

"A young woman"

Application: Frame Interpolation

By extending the technique proposed in AnimateZero, the generated first and last frames can be inserted simultaneously, yielding a gradual transition between the two frames, i.e., frame interpolation.
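
As a toy illustration of the insertion step only, the sketch below overwrites the first and last entries of a per-frame latent array with two given latents. It assumes that "inserting" a frame means replacing the latent at that position; the function name `insert_keyframes` and the array shapes are hypothetical, and the rest of the pipeline (the denoising loop and any attention changes) is not shown.

```python
import numpy as np


def insert_keyframes(latents, first, last):
    """Return a copy of `latents` whose first and last frames are replaced.

    Assumption: `latents` has shape (frames, ...) and `first` / `last`
    match the per-frame shape.
    """
    latents = np.array(latents, dtype=np.float32, copy=True)
    if latents.shape[0] < 2:
        raise ValueError("need at least two frames")
    latents[0] = first    # pin the opening frame
    latents[-1] = last    # pin the closing frame
    return latents


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video_latents = rng.standard_normal((16, 4, 32, 32)).astype(np.float32)
    first_latent = rng.standard_normal((4, 32, 32)).astype(np.float32)
    last_latent = rng.standard_normal((4, 32, 32)).astype(np.float32)
    pinned = insert_keyframes(video_latents, first_latent, last_latent)
    print(pinned.shape)
```

Passing the same latent for both `first` and `last` corresponds to the looped case, where the video starts and ends on the same frame.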

First Frame

Last Frame

Output Video

Application: Looped Video Generation

Looped video generation is a special case of frame interpolation in which the inserted first and last frames are the same.

Generated Image

Output Video

Application: Real Image Animation

AnimateZero has the potential to animate real images, although a domain gap remains: the results are limited by the domain of the T2I model used.

Real Image

Output Video

BibTeX

@misc{yu2023animatezero,
  title={AnimateZero: Video Diffusion Models are Zero-Shot Image Animators},
  author={Yu, Jiwen and Cun, Xiaodong and Qi, Chenyang and Zhang, Yong and Wang, Xintao and Shan, Ying and Zhang, Jian},
  howpublished={arXiv preprint arXiv:2312.03793},
  year={2023}
}

Project page template is borrowed from AnimateDiff.