# Generative Video

Here are some of the leading architectures for generative video modeling from text and video inputs:

**Text-to-Video:**

* Video Transformer - A transformer architecture, often combined with 3D convolutional nets, that generates video from text; this approach underpins models such as CogVideo.
* VQ-VAE + Transformer - Compresses frames into discrete latent codes with a VQ-VAE, then trains a transformer to predict those codes autoregressively; used by models such as VideoGPT.
* CTRL-V - Combines transformers, object detection, and retrieval networks to synthesize video from text captions.
* DALL-E - A transformer-based text-to-image model; it generates still images rather than video, but its discrete-token approach influenced later text-to-video systems.
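The VQ-VAE-plus-transformer recipe above can be sketched in two steps: encode frames into discrete codebook indices, then model those indices autoregressively. Below is a minimal NumPy illustration of the quantization step only; the codebook and latent vectors are random toy values, not taken from any real model:

```python
import numpy as np

# Toy codebook: 8 embedding vectors of dimension 4. Real models use
# thousands of entries, learned jointly with the encoder.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))

def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # latents: (n, d); codebook: (k, d) -> pairwise distances (n, k)
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)

# Pretend these are encoder outputs for six patches of a video frame.
latents = rng.normal(size=(6, 4))
codes = quantize(latents, codebook)
print(codes)  # six integer token ids in [0, 8)
```

A transformer would then be trained to predict each code from the preceding ones, so sampling codes and decoding them back through the VQ-VAE yields new frames.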

**Video-to-Video:**

* VideoGAN - Uses a GAN with spatio-temporal convolutional nets to generate short video clips, separating moving foreground from a static background.
* Vid2Vid - Employs a sequential encoder-decoder generator with temporal-consistency constraints to translate an input video (e.g., semantic segmentation maps) into a photorealistic target video.
* MoCoGAN - Decomposes video into a fixed content code and a per-frame motion code produced by an RNN, combined through a GAN generator; supports controllable generation such as changing motion while keeping identity.
* Recycle-GAN - Extends cycle-consistency with temporal "recycle" losses to perform unpaired video-to-video retargeting between domains.
* SlowFast Networks - Two-pathway 3D convolutional networks that process video at different frame rates; designed for video recognition rather than generation, they are often used as feature backbones in video pipelines.

In summary, transformer-based architectures combined with deep convolutional nets have proven effective for generative video modeling from both text and video inputs, with GAN-based designs dominating the video-to-video setting.
