# Generative Video

The architectures below are commonly cited in generative video modeling, grouped by input type (text or video):

**Text-to-Video:**

* Video Transformer - A transformer, sometimes combined with 3D convolutional encoders, that generates video from text by predicting spatio-temporal tokens. Examples include CogVideo and Phenaki.
* VQ-VAE-2 - A hierarchical vector-quantized autoencoder originally proposed for image generation. Video variants such as VideoGPT pair a VQ-VAE with a transformer that predicts the discrete latent codes autoregressively.
* CTRL-V - Combines transformers, object detection and retrieval networks to synthesize video from text captions.
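* DALL-E - A text-to-image model that pairs a discrete VAE with an autoregressive transformer. It does not generate video itself, but transformer-based text-to-video models reuse the same tokenize-then-predict design.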
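
The "predicting latent codes" step mentioned in the VQ-VAE entry above can be sketched in a few lines: each latent vector from an encoder is snapped to its nearest codebook entry, which turns a clip into a sequence of integer tokens that a transformer can model. This is a minimal illustration in plain Python, not the implementation of any particular model. The function names `nearest_code` and `quantize_clip`, the random stand-in latents, and all sizes are assumptions chosen for the example.

```python
import random

def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))

def quantize_clip(latents, codebook):
    """Turn a clip's latent vectors into a flat sequence of integer tokens.

    latents: list of frames, each frame a list of latent vectors
             (one vector per spatial position).
    returns: list of token ids in frame-major order, the kind of sequence
             an autoregressive transformer would be trained to predict.
    """
    return [nearest_code(v, codebook) for frame in latents for v in frame]

rng = random.Random(0)
# Hypothetical sizes: 8 codes of dimension 4, 3 frames with 2 positions each.
codebook = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
# Random vectors stand in for the output of a trained encoder (assumption).
latents = [[[rng.gauss(0, 1) for _ in range(4)] for _ in range(2)] for _ in range(3)]

tokens = quantize_clip(latents, codebook)
print(len(tokens))  # 6 tokens: 3 frames x 2 positions
```

In a real system the codebook and encoder are learned jointly, and a decoder maps predicted tokens back to pixels; the lookup itself is the only part shown here.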

**Video-to-Video:**

* VideoGAN - Uses a GAN with spatio-temporal convolutions to generate short clips from noise, modeling a static background separately from the moving foreground.
* Vid2Vid - A conditional GAN that translates a source video, such as a sequence of segmentation masks, edge maps, or poses, into a temporally coherent photorealistic video.
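* MoCoGAN - Decomposes video into a content code that stays fixed across a clip and a motion code produced frame by frame by an RNN, both trained adversarially. This allows the same content to be generated with different motions, and vice versa.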
* Recycle-GAN - Combines adversarial losses with spatio-temporal constraints (a "recycle" loss) to retarget video from one domain to another without paired training data.
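* SlowFast Networks - Two-pathway 3D convolutional networks that process video at a low frame rate (spatial semantics) and a high frame rate (motion). They were designed for video recognition rather than generation.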
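
The motion/content decomposition described in the MoCoGAN entry above can be illustrated with a small sketch of how per-frame latent vectors might be assembled: a content code is sampled once per clip, while a motion code is updated at every frame. The function `sample_clip_latents`, the dimensions, and the toy `tanh` recurrence (standing in for a learned RNN cell) are all assumptions made for illustration, not the published model.

```python
import math
import random

def sample_clip_latents(num_frames, content_dim, motion_dim, rng):
    """Build per-frame latents as [content | motion_t].

    The content code is drawn once and shared by every frame; the motion
    code evolves over time through a toy recurrent update driven by noise.
    """
    content = [rng.gauss(0, 1) for _ in range(content_dim)]
    hidden = [0.0] * motion_dim
    latents = []
    for _ in range(num_frames):
        # Toy recurrence standing in for a learned RNN cell (assumption).
        hidden = [math.tanh(0.9 * h + rng.gauss(0, 1)) for h in hidden]
        latents.append(content + hidden)
    return latents

rng = random.Random(0)
clip = sample_clip_latents(num_frames=5, content_dim=3, motion_dim=2, rng=rng)

# The first 3 entries (content) are identical across frames;
# the last 2 entries (motion) change from frame to frame.
print(all(frame[:3] == clip[0][:3] for frame in clip))  # True
```

A generator network would then map each per-frame latent to an image, so swapping the content code while keeping the motion sequence changes what appears in the clip without changing how it moves.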

In summary, transformer-based models over learned latent codes are a common choice for text-to-video generation, while the video-to-video methods listed here rely mainly on GANs with spatio-temporal convolutions.
