Generative Video

The following architectures are commonly cited in connection with generative video modeling from text and video inputs:


  • Video Transformer - Combines a transformer with 3D convolutional layers to generate video, optionally conditioned on text. VideoGPT is one example: it pairs a 3D-convolutional VQ-VAE with a GPT-style transformer.

  • VQ-VAE-2 - A hierarchical vector-quantized autoencoder introduced for image generation. Video models that borrow the idea compress frames into discrete latent codes and train a transformer to predict those codes.

  • CTRL-V - Combines transformers, object detection and retrieval networks to synthesize video from text captions.

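  • DALL-E - A transformer that generates images, not video, from text by autoregressively predicting discrete image tokens. The same token-prediction idea carries over to text-to-video models.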
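
To make the "latent code" idea in the VQ-VAE entry concrete, here is a minimal sketch of the vector-quantization step: each continuous latent vector is replaced by the index of its nearest codebook entry, and those indices are what a transformer would later learn to predict. The names `nearest_code`, `quantize`, `CODEBOOK`, and `LATENTS` are hypothetical, and the tiny 2-D codebook is made up for illustration; a real model learns its codebook and works on much larger tensors.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`."""
    def sq_dist(a, b):
        # Squared Euclidean distance between two equal-length vectors.
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def quantize(latents, codebook):
    """Map every latent vector to a discrete code index."""
    return [nearest_code(v, codebook) for v in latents]


# Illustrative values only (assumption): three 2-D codebook entries.
CODEBOOK = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]
LATENTS = [(0.1, -0.1), (0.9, 1.2), (-0.8, 0.7), (0.6, 0.6)]

codes = quantize(LATENTS, CODEBOOK)
print(codes)  # [0, 1, 2, 1]
```

The resulting list of integers plays the role of the token sequence; the generative model operates on these indices rather than on raw pixels.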


  • VideoGAN - A GAN with spatio-temporal convolutions that generates short video clips from noise, using separate streams for the static background and the moving foreground.

  • Vid2Vid - A conditional GAN that translates a source video, such as a sequence of segmentation maps, edge maps, or poses, into photorealistic video.

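  • MoCoGAN - Decomposes video into a content code that stays fixed across a clip and a sequence of motion codes produced by an RNN, and trains the generator adversarially. This allows the same content to be rendered with different motion, and vice versa.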

  • Recycle-GAN - Performs unpaired video-to-video retargeting by adding temporal prediction ("recycle") losses to cycle-consistent adversarial training.

  • SlowFast Networks - Two-pathway 3D convolutional networks that process video at a low and a high frame rate. They were designed for video recognition rather than generation.
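
The motion and content decomposition described in the MoCoGAN entry can be sketched as follows. This is a toy illustration under stated assumptions, not the published model: the function name `sample_latents`, the dimensions, and the simple linear recurrence standing in for the RNN are all hypothetical. It only shows the structure of the latent sequence, where the content part is sampled once per clip and the motion part changes from frame to frame.

```python
import random


def sample_latents(num_frames, content_dim=4, motion_dim=2, seed=0):
    """Build one latent vector per frame: shared content + per-frame motion."""
    rng = random.Random(seed)
    # Content code: sampled once and reused for every frame of the clip.
    content = [rng.gauss(0, 1) for _ in range(content_dim)]
    # Motion state: updated at each step by a toy recurrence (assumption;
    # this stands in for a learned recurrent network).
    motion = [0.0] * motion_dim
    frames = []
    for _ in range(num_frames):
        motion = [0.9 * m + rng.gauss(0, 1) for m in motion]
        frames.append(content + motion)
    return frames


clip = sample_latents(num_frames=5)
for t, z in enumerate(clip):
    # The first 4 values are identical in every row; the last 2 vary.
    print(t, [round(v, 2) for v in z])
```

In a full model each per-frame latent vector would be passed to an image generator, and discriminators would judge both single frames and whole clips.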

In summary, generative video models tend to pair a sequence model, such as a transformer or an RNN, with convolutional networks that handle spatio-temporal detail, and several of them are trained adversarially.
