Generative Video
Here are some of the leading AI architectures for generative video modeling from text and video inputs:
Text-to-Video:
Video Transformer - A transformer architecture combined with 3D convolutional networks to generate video from text, a line of work represented by transformer-based text-to-video models such as CogVideo and Phenaki.
VQ-VAE + Transformer - Uses a VQ-VAE to compress video into discrete latent codes and an autoregressive transformer to predict those codes from text, an approach used by models such as GODIVA (a minimal sketch of this code-prediction pipeline follows the list).
CTRL-V - Combines transformers, object detection and retrieval networks to synthesize video from text captions.
DALL-E - OpenAI's text-to-image transformer; it generates still images rather than video, but its discrete-token approach paved the way for later text-to-video models.
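The VQ-VAE-plus-transformer recipe above boils down to two pieces: an encoder that compresses a clip into a grid of discrete latent codes, and an autoregressive transformer that, conditioned on the text, predicts those codes one at a time. Below is a minimal PyTorch sketch of that idea; the module names, layer sizes, and toy inputs are illustrative assumptions rather than any particular model's architecture, and positional embeddings plus the VQ training losses are omitted for brevity.

```python
# Minimal sketch of the VQ-VAE + transformer idea for text-to-video.
# All module names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class VideoVQEncoder(nn.Module):
    """3D-conv encoder that maps a clip (B, C, T, H, W) to discrete code indices."""
    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv3d(dim, dim, kernel_size=4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, video):
        z = self.conv(video)                            # (B, dim, T', H', W')
        b, d, t, h, w = z.shape
        z = z.permute(0, 2, 3, 4, 1).reshape(b, -1, d)  # (B, N, dim), N = T'*H'*W'
        # Nearest-codebook-entry quantization (straight-through estimator omitted).
        dists = torch.cdist(z, self.codebook.weight.unsqueeze(0).expand(b, -1, -1))
        return dists.argmin(dim=-1)                     # (B, N) discrete code indices

class TextToVideoPrior(nn.Module):
    """Transformer that autoregressively predicts video code indices from text tokens."""
    def __init__(self, codebook_size=1024, text_vocab=5000, dim=256, layers=4):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        self.code_emb = nn.Embedding(codebook_size, dim)
        block = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(block, num_layers=layers)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, text_tokens, code_tokens):
        memory = self.text_emb(text_tokens)             # text conditioning via cross-attention
        tgt = self.code_emb(code_tokens)
        n = tgt.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.head(out)                           # logits over the next code index

# Toy forward pass: an 8-frame 64x64 clip and a 10-token caption.
video = torch.randn(1, 3, 8, 64, 64)
caption = torch.randint(0, 5000, (1, 10))
codes = VideoVQEncoder()(video)                         # (1, 512) code indices
logits = TextToVideoPrior()(caption, codes)             # (1, 512, 1024)
print(codes.shape, logits.shape)
```

At generation time the transformer would sample codes sequentially from these logits, and a matching VQ-VAE decoder (not shown) would map the sampled code grid back to pixels.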
Video-to-Video:
VideoGAN - Uses GAN architectures with spatio-temporal convolutional networks to generate short video clips, modeling a moving foreground against a static background.
Vid2Vid - A conditional GAN with an encoder-decoder generator that translates input videos such as segmentation maps, edge maps, or poses into photorealistic output video, using optical flow for temporal consistency; a minimal sketch of this conditional spatio-temporal GAN pattern follows the list.
MoCoGAN - Decomposes video into content and motion, modeling the motion trajectory with an RNN inside a GAN framework, which enables generating videos whose content and motion can be varied independently.
Recycle-GAN - Extends cycle-consistent unpaired translation to video by adding temporal predictors and a recycle loss, enabling unsupervised video retargeting from one domain to another.
SlowFast Networks - Two-pathway 3D convolutional networks that process video at different frame rates: a slow pathway for spatial semantics and a fast pathway for motion. Designed primarily for video recognition, they are more commonly used as spatio-temporal feature backbones than as generators.
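Several of the video-to-video models above share a common skeleton: an encoder-decoder generator built from 3D (spatio-temporal) convolutions that maps an input clip to an output clip, trained adversarially against a spatio-temporal convolutional discriminator. The following is a minimal PyTorch sketch of that pattern; the layer sizes, loss setup, and module names are assumptions for illustration, and real systems such as Vid2Vid add optical-flow warping, multi-scale discriminators, and perceptual losses on top.

```python
# Minimal sketch of a conditional video-to-video GAN with 3D convolutions.
# Layer choices and sizes are illustrative assumptions, not any published model.
import torch
import torch.nn as nn

class VideoGenerator(nn.Module):
    """Encoder-decoder over (B, C, T, H, W) clips: source video in, translated video out."""
    def __init__(self, in_ch=3, out_ch=3, dim=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(in_ch, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(dim, dim * 2, 4, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(dim * 2, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose3d(dim, out_ch, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

class VideoDiscriminator(nn.Module):
    """Spatio-temporal PatchGAN-style critic producing patch-level real/fake logits."""
    def __init__(self, in_ch=3, dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(dim, dim * 2, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv3d(dim * 2, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

# One illustrative step on random tensors (e.g., segmentation-style input -> RGB output).
gen, disc = VideoGenerator(), VideoDiscriminator()
bce = nn.BCEWithLogitsLoss()
src = torch.randn(2, 3, 8, 64, 64)   # source clip (stand-in for segmentation maps)
real = torch.randn(2, 3, 8, 64, 64)  # paired target clip

fake = gen(src)
real_logits = disc(real)
fake_logits = disc(fake.detach())
d_loss = bce(real_logits, torch.ones_like(real_logits)) + \
         bce(fake_logits, torch.zeros_like(fake_logits))
gen_logits = disc(fake)
g_loss = bce(gen_logits, torch.ones_like(gen_logits))    # generator tries to fool the critic
print(fake.shape, d_loss.item(), g_loss.item())
```

Because the discriminator convolves over time as well as space, it penalizes flicker and implausible motion, which is the main reason these models prefer 3D convolutions over per-frame 2D networks.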
So in summary, transformer-based architectures combined with deep convolutional nets have proven very effective for high-quality generative video modeling from both text and video inputs.