
# Generative Images

Here are some leading AI architectures for generative image modeling from text and image inputs:

**Text-to-Image:**

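* DALL-E 2 - Uses a two-stage design: a prior maps a CLIP text embedding to a CLIP image embedding, and a diffusion decoder generates the image from that embedding. (The original DALL-E paired a discrete VAE with an autoregressive transformer.)
* Imagen - Encodes the prompt with a frozen T5 text encoder and generates photorealistic images with a cascade of diffusion models (a 64x64 base model followed by super-resolution stages).
* GLIDE - A text-conditional diffusion model that uses classifier-free guidance, followed by a diffusion upsampler, for high-fidelity image generation from text prompts.
* Parti - An autoregressive encoder-decoder transformer that treats text-to-image generation as sequence-to-sequence modeling, predicting image tokens produced by a ViT-VQGAN tokenizer.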
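
Diffusion-based text-to-image models such as Imagen commonly steer sampling toward the prompt with classifier-free guidance. The sketch below shows only the guidance step, on toy noise predictions; the function name, the list-based vectors, and the numbers are illustrative assumptions, not code from any of the models above.

```python
def classifier_free_guidance(eps_uncond, eps_cond, guidance_scale):
    """Blend unconditional and text-conditional noise predictions.

    guidance_scale = 0 returns the unconditional prediction,
    guidance_scale = 1 returns the conditional prediction, and larger
    values push the result further toward the text condition.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions (hypothetical values standing in for network outputs).
eps_uncond = [0.10, -0.20, 0.30]
eps_cond = [0.30, -0.10, 0.20]

guided = classifier_free_guidance(eps_uncond, eps_cond, guidance_scale=3.0)
print(guided)
```

In a real sampler this blend would be computed at every denoising step, with both predictions coming from the same network evaluated with and without the text conditioning.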

**Image-to-Image:**

* Pix2Pix - A conditional GAN with a U-Net generator and a PatchGAN discriminator, trained on paired images for translation tasks such as turning sketches into photos.
* CycleGAN - Trains two generators and two discriminators with a cycle consistency loss to translate images between domains without paired training examples.
* SPADE - A semantic image synthesis model that injects the segmentation map through spatially-adaptive normalization layers to generate photorealistic images from label maps.
* Deep Image Analogy - Finds dense semantic correspondences between an input and an exemplar image by matching deep features from a pre-trained CNN, then uses them to transfer visual attributes.
* U-GAT-IT - A GAN for unsupervised image-to-image translation that combines an attention module with Adaptive Layer-Instance Normalization (AdaLIN).
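
The cycle consistency loss that CycleGAN relies on can be sketched in a few lines. This is a minimal illustration, assuming images are flat lists of floats and using hypothetical stand-in functions `G` (domain X to Y) and `F` (domain Y to X) in place of trained generator networks; the adversarial losses are omitted.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized 'images'."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def cycle_consistency_loss(x, y, G, F):
    """F(G(x)) should reconstruct x, and G(F(y)) should reconstruct y."""
    forward_cycle = l1_distance(F(G(x)), x)
    backward_cycle = l1_distance(G(F(y)), y)
    return forward_cycle + backward_cycle


# Hypothetical toy "generators": G doubles pixel values, F halves them.
G = lambda img: [2.0 * p for p in img]
F_exact = lambda img: [p / 2.0 for p in img]
F_biased = lambda img: [p / 2.0 + 0.1 for p in img]  # imperfect inverse of G

x = [0.2, 0.4, 0.6]
y = [0.4, 0.8, 1.2]

loss_exact = cycle_consistency_loss(x, y, G, F_exact)    # exact inverses
loss_biased = cycle_consistency_loss(x, y, G, F_biased)  # penalized mismatch
print(loss_exact, loss_biased)
```

Because the loss only compares each image with its own round-trip reconstruction, it needs no paired examples across the two domains.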

In summary, diffusion models and autoregressive transformers, conditioned on pretrained text encoders, have proven very effective for text-to-image generation, while convolutional GAN architectures are widely used for image-to-image tasks.

CLIP, BigGAN and StyleGAN are other notable models in this space. BigGAN and StyleGAN are generative models, while CLIP is a vision-language model often used to guide or condition generators:

* CLIP (Contrastive Language-Image Pre-Training) - Developed by OpenAI, CLIP trains an image encoder (a ResNet or vision transformer) and a text transformer encoder with a contrastive objective, which enables zero-shot image classification guided by natural language.
* BigGAN - A class-conditional GAN from DeepMind that uses residual blocks, orthogonal regularization, and the truncation trick to synthesize high-fidelity ImageNet images at up to 512x512 resolution.
* StyleGAN - Developed by Nvidia, StyleGAN uses a style-based generator: a mapping network turns the latent code into an intermediate style vector that modulates each layer through adaptive instance normalization (AdaIN), with per-layer noise inputs, providing fine-grained control over image synthesis results.
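
CLIP-style zero-shot classification reduces to comparing one image embedding against the embeddings of candidate captions. The sketch below assumes the embeddings have already been produced by the two encoders; the 3-dimensional vectors, the captions, and the `logit_scale` value are made-up illustrations rather than real CLIP outputs.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_classify(image_emb, text_embs, logit_scale=100.0):
    """Return a softmax distribution over captions for one image embedding."""
    img = normalize(image_emb)
    logits = [
        logit_scale * sum(a * b for a, b in zip(img, normalize(t)))
        for t in text_embs
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embeddings standing in for encoder outputs.
labels = ["a photo of a cat", "a photo of a dog"]
text_embs = [[1.0, 0.0, 0.2], [0.0, 1.0, 0.2]]
image_emb = [0.9, 0.1, 0.2]

probs = zero_shot_classify(image_emb, text_embs)
best = labels[probs.index(max(probs))]
print(best)
```

No classifier is trained for the label set: changing the task only requires swapping the captions, which is what makes the approach zero-shot.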

Some key points on their applications:

* CLIP provides a shared image-text embedding space that other generative models use for text-guided synthesis: the original DALL-E used CLIP to rerank its samples, and DALL-E 2 conditions generation on CLIP embeddings.
* BigGAN demonstrated highly realistic conditional image generation guided by class labels, but it cannot be directly controlled using text.
* StyleGAN introduced style-based editing of faces and objects by separating high-level attributes and stochastic variation in the GAN framework.
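
StyleGAN's style-based control is built on adaptive instance normalization (AdaIN), which normalizes each feature channel and then rescales it with a style-dependent scale and bias. The sketch below applies AdaIN to a single channel represented as a flat list; the function name and the numeric values are illustrative assumptions, and in the real generator the scale and bias would be derived from the learned style vector.

```python
import math


def adain(channel, style_scale, style_bias, eps=1e-5):
    """Normalize one channel to zero mean and unit variance, then apply the style."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((x - mean) ** 2 for x in channel) / n
    std = math.sqrt(var + eps)
    return [style_scale * (x - mean) / std + style_bias for x in channel]


# Toy feature channel and hypothetical style parameters.
channel = [1.0, 2.0, 3.0, 4.0]
styled = adain(channel, style_scale=2.0, style_bias=0.5)

styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((x - styled_mean) ** 2 for x in styled) / len(styled))
print(styled_mean, styled_std)
```

After the operation the channel's statistics are set by the style parameters rather than by the incoming features, which is why swapping styles at different layers changes attributes at different scales.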

In summary, CLIP, BigGAN and StyleGAN have driven major advances in controllable image synthesis and editing through vision-language models, class-conditional generation, and style-based disentangled representations, respectively.
