# Generative Images

Here are some leading AI architectures for generative image modeling from text and image inputs:

**Text-to-Image:**

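* DALL-E 2 - Conditions generation on CLIP embeddings: a prior maps the text embedding to an image embedding, and a diffusion decoder generates the image from it.
* Imagen - Encodes the prompt with a frozen T5 transformer text encoder and generates photorealistic images with a cascade of diffusion models (a base model followed by super-resolution stages).
* GLIDE - A text-conditioned diffusion model, sampled with classifier-free guidance and followed by a diffusion upsampler, for high-fidelity image generation from text prompts.
* Parti - An autoregressive encoder-decoder transformer that treats text-to-image generation as sequence-to-sequence prediction over discrete image tokens produced by a ViT-VQGAN tokenizer.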
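
Diffusion-based generators such as Imagen are trained to reverse a gradual noising process. As a rough illustration of the forward (noising) half only, the sketch below follows the standard DDPM formulation with a linear noise schedule. The schedule values, function names and the four-value toy "image" are assumptions chosen for illustration, not details of any model listed above.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances, linearly spaced (assumed schedule)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def add_noise(x0, t, alpha_bars, rng):
    """Sample x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * noise."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * n
           for x, n in zip(x0, noise)]
    return x_t, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

pixels = [0.5, -0.25, 1.0, 0.0]  # toy "image" with values in [-1, 1]
noisy_early, _ = add_noise(pixels, 10, alpha_bars, rng)    # still close to pixels
noisy_late, _ = add_noise(pixels, 999, alpha_bars, rng)    # almost pure noise
```

A denoising network would be trained to predict `noise` from `x_t` and `t` (plus the text embedding for text-to-image models); generation then runs the learned reverse process starting from pure noise.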

**Image-to-Image:**

* Pix2Pix - A conditional GAN (U-Net generator, PatchGAN discriminator) trained on paired examples for image-to-image translation tasks such as turning sketches into photos.
* CycleGAN - Trains two GANs with a cycle consistency loss to translate images between domains without paired training examples.
* SPADE - A semantic image synthesis model that injects the segmentation map through spatially-adaptive normalization layers to generate photorealistic images from label maps.
* Deep Image Analogy - Matches deep features from a pre-trained CNN to find dense correspondences between the input and exemplar images, then transfers visual attributes along them.
* U-GAT-IT - A generative adversarial network with an attention module and adaptive layer-instance normalization (AdaLIN) for unsupervised image-to-image translation.
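
The cycle consistency loss mentioned for CycleGAN can be sketched in a few lines: translating an image to the other domain and back should return the original. The version below is a minimal illustration using an L1 distance; the `brighten` and `darken` functions are hypothetical stand-ins for the two trained generators, and images are plain lists of floats.

```python
def l1_distance(a, b):
    """Mean absolute difference between two equally sized images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_consistency_loss(g_xy, f_yx, x_batch, y_batch):
    """L1 cycle loss: F(G(x)) should recover x, and G(F(y)) should recover y."""
    forward = sum(l1_distance(f_yx(g_xy(x)), x) for x in x_batch) / len(x_batch)
    backward = sum(l1_distance(g_xy(f_yx(y)), y) for y in y_batch) / len(y_batch)
    return forward + backward

# Hypothetical generators: X -> Y brightens, Y -> X darkens.
def brighten(image):
    return [p + 0.2 for p in image]

def darken(image):
    return [p - 0.2 for p in image]

def darken_too_much(image):
    return [p - 0.5 for p in image]

x_batch = [[0.1, 0.4], [0.3, 0.6]]  # unpaired samples from domain X
y_batch = [[0.5, 0.8]]              # unpaired samples from domain Y

perfect = cycle_consistency_loss(brighten, darken, x_batch, y_batch)
imperfect = cycle_consistency_loss(brighten, darken_too_much, x_batch, y_batch)
```

Here `perfect` is essentially zero because the two mappings invert each other, while `imperfect` is larger. In training, this term is added to the usual adversarial losses of both GANs.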

In summary, diffusion models and autoregressive transformers, typically conditioned on a transformer text encoder, drive the text-to-image systems listed above, while convolutional GAN architectures are widely used for image-to-image tasks.

CLIP, BigGAN and StyleGAN are other notable models in this space. CLIP is not itself a generator, but it is widely used to guide or condition one:

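* CLIP (Contrastive Language-Image Pre-Training) - Developed by OpenAI, CLIP trains an image encoder (a ResNet or vision transformer) and a transformer text encoder with a contrastive objective, so that matching image-text pairs land close together in a shared embedding space. This enables zero-shot image classification guided by natural language.
* BigGAN - A class-conditional GAN from DeepMind that scales up batch size and model width, using residual blocks, orthogonal regularization and the truncation trick to synthesize high-fidelity images at up to 512x512 resolution.
* StyleGAN - Developed by Nvidia, StyleGAN uses a style-based generator: a mapping network turns the latent code into per-layer styles applied through adaptive instance normalization (AdaIN), and per-layer noise inputs add stochastic detail, giving fine-grained control over image synthesis results.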
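
To make CLIP's zero-shot classification concrete, the sketch below scores one image embedding against several text-prompt embeddings by cosine similarity and converts the scores to probabilities with a softmax. It assumes the embeddings have already been produced by the two encoders; the three-dimensional vectors and the `logit_scale` value are made up for illustration (CLIP learns its temperature during training).

```python
import math

def normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Return one probability per text prompt for the given image embedding."""
    image = normalize(image_embedding)
    logits = []
    for text_embedding in text_embeddings:
        text = normalize(text_embedding)
        cosine = sum(a * b for a, b in zip(image, text))
        logits.append(logit_scale * cosine)  # assumed temperature scaling
    return softmax(logits)

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Hypothetical embeddings; real ones come from the text and image encoders.
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
image_embedding = [0.8, 0.3, 0.1]

probabilities = zero_shot_classify(image_embedding, text_embeddings)
best_label = labels[probabilities.index(max(probabilities))]
```

No classifier is trained for the label set: changing the prompts in `labels` (and re-encoding them) is enough to classify against a different set of categories.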

Some key points on their applications:

* CLIP provides the joint image-text embedding used to steer other generative models with text, for example as the conditioning signal in DALL-E 2 or as a ranking or guidance signal for candidate images.
* BigGAN demonstrated highly realistic conditional image generation guided by class labels, but it cannot be directly controlled using text.
* StyleGAN introduced style-based editing of faces and objects by separating high-level attributes (such as pose and identity) from stochastic variation (such as hair placement and freckles) in the GAN framework.
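
The split between high-level attributes and stochastic variation in StyleGAN can be illustrated with its two per-layer ingredients: noise injection and adaptive instance normalization (AdaIN). The sketch below applies both to a single feature channel represented as a list of floats. The style scale and bias would normally come from a learned mapping of the latent code; here they are hard-coded assumptions, as are the function names.

```python
import math
import random

def inject_noise(channel, noise_strength, rng):
    """Add scaled Gaussian noise: the source of stochastic detail."""
    return [v + noise_strength * rng.gauss(0.0, 1.0) for v in channel]

def adain(channel, style_scale, style_bias, eps=1e-8):
    """Normalize a channel to zero mean and unit variance, then apply the style."""
    mean = sum(channel) / len(channel)
    variance = sum((v - mean) ** 2 for v in channel) / len(channel)
    std = math.sqrt(variance + eps)
    return [style_scale * (v - mean) / std + style_bias for v in channel]

rng = random.Random(0)
features = [0.2, 1.5, -0.7, 0.9]  # one toy feature channel

noisy = inject_noise(features, noise_strength=0.1, rng=rng)
styled = adain(noisy, style_scale=2.0, style_bias=0.5)  # assumed style values

styled_mean = sum(styled) / len(styled)
styled_std = math.sqrt(sum((v - styled_mean) ** 2 for v in styled) / len(styled))
```

After `adain`, the channel statistics are set by the style (`styled_mean` is about 0.5 and `styled_std` about 2.0) regardless of the input statistics, while the noise only perturbs fine detail within the channel.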

In summary, CLIP, BigGAN and StyleGAN have driven major advances in controllable image synthesis and editing through vision-language models, class-conditional generation, and style-based disentangled representations, respectively.

