Rerender A Video

This page presents "Rerender A Video", a zero-shot framework for text-guided video-to-video translation.


Here, "zero-shot" refers to a model's ability to generate videos from a textual description without having been trained on any videos matching that description.

Large text-to-image diffusion models have exhibited impressive proficiency in generating high-quality images. However, when applying these models to the video domain, ensuring temporal consistency across video frames remains a formidable challenge.

This project proposes a novel 'Zero-shot text-guided video-to-video translation framework' to adapt image models to videos. The framework includes two parts: key frame translation and full video translation.

  1. Key Frame Translation: This part uses an adapted diffusion model to generate key frames, with hierarchical cross-frame constraints applied to enforce coherence in shapes, textures and colors.

  2. Key Frame Propagation: This part propagates the key frames to the remaining frames with temporal-aware patch matching and frame blending.
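The two-stage pipeline can be sketched as follows. This is a minimal illustration only: `select_keyframes`, the string placeholders, and the fixed key-frame interval are assumptions for exposition, not the project's actual code.

```python
# Illustrative sketch of the two-stage pipeline (hypothetical helpers).

def select_keyframes(frames, interval=10):
    """Pick every `interval`-th frame index as a key frame."""
    return list(range(0, len(frames), interval))

def rerender(frames, interval=10):
    key_idx = select_keyframes(frames, interval)
    # Stage 1: key frame translation (diffusion with cross-frame constraints).
    translated = {i: f"translated({frames[i]})" for i in key_idx}
    # Stage 2: propagate each translated key frame forward to the frames
    # that follow it (temporal-aware patch matching + frame blending).
    out = []
    for i, frame in enumerate(frames):
        if i in translated:
            out.append(translated[i])
        else:
            prev_key = max(k for k in key_idx if k < i)
            out.append(f"propagated({translated[prev_key]}, {frame})")
    return out
```

Only the key frames pay the full diffusion cost; in-between frames reuse the nearest preceding key frame, which is what keeps the method cheap.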

The framework achieves global style and local texture temporal consistency at a low cost (without re-training or optimization). The adaptation is compatible with existing image diffusion techniques, allowing our framework to take advantage of them, such as customizing a specific subject with LoRA, and introducing extra spatial guidance with ControlNet.

Extensive experimental results demonstrate the effectiveness of our proposed framework over existing methods in rendering high-quality and temporally-coherent videos.


Rerender-A-Video proposes novel hierarchical cross-frame constraints for pre-trained image diffusion models to produce coherent video frames. The key idea is to use optical flow to apply dense cross-frame constraints: the previous rendered frame serves as a low-level reference for the current frame, while the first rendered frame acts as an anchor that regulates the rendering process and prevents deviation from the initial appearance.
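As a minimal illustration of the flow-based constraint, the sketch below warps a previously rendered frame into the current frame's coordinates using a dense flow field (nearest-neighbor sampling, pure NumPy). In practice the flow would come from an off-the-shelf optical flow estimator; `warp_with_flow` is an illustrative helper, not the project's API.

```python
import numpy as np

def warp_with_flow(prev_frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Warp prev_frame into the current frame's coordinates.

    prev_frame: (H, W, C) image; flow: (H, W, 2) per-pixel (dx, dy)
    mapping each current-frame pixel to its source in the previous frame.
    Nearest-neighbor sampling; out-of-bounds sources are clamped to the border.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.rint(xs + flow[..., 0]), 0, w - 1).astype(int)
    src_y = np.clip(np.rint(ys + flow[..., 1]), 0, h - 1).astype(int)
    return prev_frame[src_y, src_x]
```

The warped result can then act as a dense pixel-level reference when rendering the current frame.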

Hierarchical cross-frame constraints are realized at different stages of diffusion sampling. In addition to global style consistency (cross-frame attention), the method enforces consistency in shapes (shape-aware cross-frame latent fusion), textures (pixel-aware cross-frame latent fusion) and colors (color-aware adaptive latent adjustment) at early, middle and late stages, respectively. This innovative and lightweight modification achieves both global and local temporal consistency.
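The staging of these constraints over the sampling trajectory can be sketched roughly as below; the thresholds and the `active_constraints` helper are illustrative assumptions, not the paper's exact schedule.

```python
def active_constraints(t: float) -> list:
    """Return which cross-frame constraints apply at denoising progress t,
    where t runs from 1.0 (start, pure noise) down to 0.0 (final image).
    Thresholds here are illustrative, not the paper's exact schedule.
    """
    constraints = ["cross-frame attention"]          # global style, all steps
    if t > 0.7:                                      # early: coarse shapes
        constraints.append("shape-aware latent fusion")
    elif t > 0.3:                                    # middle: textures
        constraints.append("pixel-aware latent fusion")
    else:                                            # late: colors
        constraints.append("color-aware AdaIN adjustment")
    return constraints
```

The point is that each constraint targets the property the diffusion process is resolving at that stage: layout early, texture in the middle, color at the end.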

This space provides the function of key frame translation. Full code for full video translation will be released upon the publication of the paper.

To avoid overload, we limit the maximum number of frames to 8 and the maximum frame resolution to 512x768.
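These caps can be expressed as a small validation step. `check_limits`, and the reading of 512x768 as a cap on the shorter and longer sides, are assumptions for illustration, not the Space's actual code.

```python
# Hypothetical input validation mirroring the demo's stated limits.
MAX_FRAMES = 8
MAX_RES = (512, 768)  # assumed to cap the (shorter, longer) side

def check_limits(num_frames: int, width: int, height: int) -> None:
    """Reject inputs above the demo's frame-count and resolution caps."""
    if num_frames > MAX_FRAMES:
        raise ValueError(f"at most {MAX_FRAMES} frames, got {num_frames}")
    short, long_side = sorted((width, height))
    if short > MAX_RES[0] or long_side > MAX_RES[1]:
        raise ValueError(f"resolution {width}x{height} exceeds 512x768")
```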

For a video of size 512x640, the running time is about 1 minute per key frame on a T4 GPU.

How to use:

  1. Run 1st Key Frame: only translate the first frame, so you can adjust the prompts, models, and parameters to find your ideal output appearance before running the whole video.

  2. Run Key Frames: translate all the key frames based on the settings of the first frame

  3. Run All: Run 1st Key Frame and Run Key Frames

  4. Run Propagation: propagate the key frames to the other frames for full video translation. This part will be released upon the publication of the paper.
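The button workflow above can be sketched as plain function calls; all names here are hypothetical stand-ins for the Space's buttons, and the log strings are placeholders for the actual work.

```python
# Hypothetical stand-ins for the Space's buttons.
log = []

def run_first_keyframe():
    log.append("translate frame 0")      # tune prompt/model/params here

def run_keyframes():
    log.append("translate remaining key frames")  # reuses frame-0 settings

def run_all():
    # "Run All" is simply the two steps above in sequence.
    run_first_keyframe()
    run_keyframes()
```

Iterating on the first frame alone is the cheap feedback loop; the remaining key frames inherit its settings.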


Tips and limitations:

  1. This method cannot handle large or fast motions, where optical flow is hard to estimate. Videos with stable motion are preferred.

  2. Pixel-aware fusion may not work for large or quick motions.

  3. Try different color-aware AdaIN settings, or even disable it, to avoid color jittering.

  4. Use the revAnimated_v11 model for non-photorealistic styles and the realisticVisionV20_v20 model for photorealistic styles.

  5. To use your own SD/LoRA model, you may clone the space and specify your model with

  6. This method is based on the original SD model format. You may need to convert Diffusers/Automatic1111 models to that format.
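The color-aware AdaIN mentioned in tip 3 can be illustrated with a standard AdaIN statistic-matching step; `adain` below is a simplified stand-in operating on channel-last arrays, not the project's implementation.

```python
import numpy as np

def adain(x: np.ndarray, ref: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Match the per-channel mean/std of x to those of ref.

    A simplified stand-in for color-aware adaptive latent adjustment:
    aligning statistics with the anchor frame suppresses color jitter.
    x, ref: (H, W, C) arrays.
    """
    mu_x = x.mean(axis=(0, 1), keepdims=True)
    std_x = x.std(axis=(0, 1), keepdims=True)
    mu_r = ref.mean(axis=(0, 1), keepdims=True)
    std_r = ref.std(axis=(0, 1), keepdims=True)
    return (x - mu_x) / (std_x + eps) * std_r + mu_r
```

If this correction itself causes jitter (e.g. when the anchor's statistics are a poor fit), weakening or disabling the adjustment, as tip 3 suggests, is the fallback.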

Note: This code is for research purpose and non-commercial use only.
