i2vGen-XL (Image to Video)

I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models

Open Source Image to Video Model

i2vGen-XL - Image to Video Generation XL

We've been seeing a lot of amazing text-to-video and image-to-video models lately from companies like Runway, Pika, and Stability AI (Stable Video Diffusion), and recently Leonardo entered the game as well.

Now Alibaba Group is entering the game with i2vGen-XL, or Image to Video Generation XL.

This model is capable of generating higher-resolution videos (up to 1280x720 after its refinement stage) as well as longer videos.

It requires an image input: it is an image-to-video model, not a text-to-video model, although its refinement stage also takes a brief text description.

Unlike most of the other video models, i2vGen-XL is available as open source on Alibaba's GitHub page.

All of the code is currently available, along with instructions for installing it locally or on a cloud computer.


i2vGen-XL - Image to Video Generation XL details

Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models. However, it still encounters challenges in terms of semantic accuracy, clarity and spatio-temporal continuity.

They primarily arise from the scarcity of well-aligned text-video data and the complex inherent structure of videos, making it difficult for the model to simultaneously ensure semantic and qualitative excellence. In this report, we propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors and ensures the alignment of the input data by utilizing static images as a form of crucial guidance. I2VGen-XL consists of two stages:

i) the base stage guarantees coherent semantics and preserves content from input images by using two hierarchical encoders, and

ii) the refinement stage enhances the video's details by incorporating an additional brief text and improves the resolution to 1280x720.
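The two-stage cascade described above can be pictured as a simple data flow. The sketch below is only an illustration of that flow, not the authors' implementation: it tracks video shapes and conditioning inputs rather than pixels, and the names (`Video`, `base_stage`, `refinement_stage`, `i2vgen_xl_cascade`), the 16-frame count, and the 448x256 base resolution are assumptions made for the example. Only the 1280x720 output resolution and the use of a brief text in the second stage come from the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Video:
    """Shape-only stand-in for a generated video (no pixel data)."""
    frames: int
    width: int
    height: int
    conditioned_on: Tuple[str, ...]  # inputs that guided generation so far


def base_stage(image_name: str, frames: int = 16,
               width: int = 448, height: int = 256) -> Video:
    """Stage i: image-guided, low-resolution video.

    The frame count and base resolution are assumed placeholder values.
    """
    return Video(frames, width, height, conditioned_on=(image_name,))


def refinement_stage(video: Video, brief_text: str,
                     width: int = 1280, height: int = 720) -> Video:
    """Stage ii: add a brief text as guidance and raise resolution to 1280x720."""
    return Video(video.frames, width, height,
                 conditioned_on=video.conditioned_on + (brief_text,))


def i2vgen_xl_cascade(image_name: str, brief_text: str) -> Video:
    """Chain the two stages: semantics first, then detail and resolution."""
    return refinement_stage(base_stage(image_name), brief_text)


# Hypothetical inputs, used only to show the flow through both stages.
result = i2vgen_xl_cascade("cat.png", "a cat walking on grass")
print(result)
```

Keeping the stages as separate functions mirrors the decoupling the abstract describes: the first stage is responsible only for content and semantics from the image, and the second only for detail and resolution.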

To improve the diversity, we collect around 35 million single-shot text-video pairs and 6 billion text-image pairs to optimize the model. By this means, I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.

Through extensive experiments, we have investigated the underlying principles of I2VGen-XL and compared it with current top methods, which can demonstrate its effectiveness on diverse data. The source code and models will be publicly available here:

I2VGen-XL: https://i2vgen-xl.github.io/

Shiwei Zhang*, Jiayu Wang*, Yingya Zhang*, Kang Zhao, Hangjie Yuan, Zhiwu Qing, Xiang Wang, Deli Zhao, Jingren Zhou
