Generative Images

Here are some leading AI architectures for generative image modeling from text and image inputs:

Text-to-Image:

  • DALL-E 2 - Uses a prior to map a text caption to a CLIP image embedding, then a diffusion decoder to turn that embedding into an image (the original DALL-E instead paired an autoregressive transformer with a discrete VAE).

  • Imagen - Feeds embeddings from a large frozen T5 text encoder into a cascade of diffusion models to generate photorealistic images from text.

  • GLIDE - A text-conditional diffusion model that uses classifier-free guidance (or CLIP guidance) for high-fidelity image generation from text prompts; a minimal guidance step is sketched after this list.

  • Parti - Treats text-to-image generation as sequence-to-sequence modeling: an autoregressive transformer predicts image tokens from a ViT-VQGAN tokenizer, which are then decoded back into an image.
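
GLIDE and Imagen both rely on classifier-free guidance at sampling time, as referenced above. Below is a minimal PyTorch sketch of that single guidance step; the denoiser, embedding width, and guidance scale are illustrative stand-ins rather than any model's real interface.

```python
import torch

def cfg_noise_prediction(denoiser, x_t, t, cond_emb, uncond_emb, guidance_scale=7.5):
    """Classifier-free guidance: blend conditional and unconditional noise estimates."""
    eps_cond = denoiser(x_t, t, cond_emb)      # prediction given the text prompt
    eps_uncond = denoiser(x_t, t, uncond_emb)  # prediction given a null/empty prompt
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Toy stand-ins so the sketch runs end to end; a real system uses a trained
# U-Net denoiser and a text encoder (T5 in Imagen, a transformer in GLIDE).
denoiser = lambda x, t, emb: 0.1 * x + emb.mean()
x_t = torch.randn(1, 3, 64, 64)     # noisy image (or latent) at timestep t
cond_emb = torch.randn(1, 512)      # hypothetical embedding of the prompt
uncond_emb = torch.zeros(1, 512)    # embedding of the empty prompt
eps = cfg_noise_prediction(denoiser, x_t, torch.tensor([10]), cond_emb, uncond_emb)
print(eps.shape)  # torch.Size([1, 3, 64, 64])
```

Raising the guidance scale pushes samples toward the prompt at the cost of diversity, which is why implementations typically expose it as a tunable parameter.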

Image-to-Image:

  • Pix2Pix - A conditional convolutional GAN (U-Net generator, PatchGAN discriminator) for paired image-to-image translation tasks such as turning sketches into photos.

  • CycleGAN - Employs a cycle consistency loss between two GANs to translate images across unpaired domains without paired supervision; the loss is sketched after this list.

  • SPADE - A semantic image synthesis model that injects segmentation maps through spatially-adaptive normalization layers to generate photorealistic images from label maps.

  • Deep Image Analogy - Matches filter responses between input and exemplar images using a pre-trained CNN to transfer visual attributes.

  • U-GAT-IT - A generative adversarial network with a novel attention module for unsupervised image-to-image translation.
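
The cycle consistency idea behind CycleGAN is compact enough to show directly. Below is a minimal PyTorch sketch of just that loss term; the toy generators and the weight lam are illustrative, and a full training loop would add the adversarial losses of both GANs.

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_ab, G_ba, real_a, real_b, lam=10.0):
    """Cycle loss: translating A -> B -> A (and B -> A -> B) should
    reconstruct the original images; lam weights it against the GAN losses."""
    l1 = nn.L1Loss()
    recon_a = G_ba(G_ab(real_a))   # A -> B -> back to A
    recon_b = G_ab(G_ba(real_b))   # B -> A -> back to B
    return lam * (l1(recon_a, real_a) + l1(recon_b, real_b))

# Toy single-conv "generators" so the sketch runs; real CycleGAN generators
# are ResNet-based encoder-decoders.
G_ab = nn.Conv2d(3, 3, kernel_size=3, padding=1)
G_ba = nn.Conv2d(3, 3, kernel_size=3, padding=1)
real_a = torch.randn(2, 3, 128, 128)   # unpaired batch from domain A
real_b = torch.randn(2, 3, 128, 128)   # unpaired batch from domain B
print(cycle_consistency_loss(G_ab, G_ba, real_a, real_b).item())
```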

In summary, transformer text encoders paired with diffusion models or autoregressive token decoders have proven very effective for text-to-image generation, while convolutional GAN architectures remain widely used for image-to-image tasks.

CLIP, BigGAN and StyleGAN are some other notable AI architectures for generative image modeling:

  • CLIP (Contrastive Language-Image Pre-Training) - Developed by OpenAI, CLIP jointly trains an image encoder (a ResNet or Vision Transformer) and a text transformer with a contrastive objective, enabling zero-shot image classification from natural-language prompts; a minimal scoring step is sketched after this list.

  • BigGAN - A large-scale GAN from DeepMind that uses residual blocks, self-attention, and orthogonal regularization to synthesize high-fidelity class-conditional images at resolutions up to 512x512.

  • StyleGAN - Developed by Nvidia, StyleGAN uses a style-based generator in which a mapping network produces latent style codes that modulate every layer through adaptive instance normalization (AdaIN), giving fine-grained control over image synthesis results.
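
The mechanism CLIP uses for zero-shot classification (and for guiding generative models) is a similarity score between image and text embeddings. Here is a minimal PyTorch sketch of that scoring step, with random tensors standing in for the real encoder outputs and an assumed embedding width of 512.

```python
import torch
import torch.nn.functional as F

def zero_shot_logits(image_emb, text_embs, temperature=100.0):
    """CLIP-style scoring: cosine similarity between one image embedding and a
    bank of prompt embeddings ("a photo of a <class>"), scaled by a temperature."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    return temperature * image_emb @ text_embs.T

# Hypothetical embeddings standing in for CLIP's image and text encoder outputs.
image_emb = torch.randn(1, 512)
text_embs = torch.randn(3, 512)        # e.g. prompts for "cat", "dog", "car"
probs = zero_shot_logits(image_emb, text_embs).softmax(dim=-1)
print(probs)  # probability assigned to each candidate label
```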

Some key points on their applications:

  • CLIP provides the image-text modeling that enables zero-shot control over other generative models such as DALL-E 2 and GLIDE for text-guided image synthesis.

  • BigGAN demonstrated highly realistic conditional image generation guided by class labels. But it cannot be directly controlled using text.

  • StyleGAN introduced style-based editing of faces and objects by separating high-level attributes from stochastic variation in the GAN framework, with styles injected through per-layer adaptive instance normalization (sketched below).
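
The style injection that makes this separation possible is adaptive instance normalization. The sketch below shows the operation in PyTorch; the feature and latent sizes and the single to_style layer are illustrative stand-ins for StyleGAN's per-layer learned affine transforms.

```python
import torch
import torch.nn as nn

def adain(features, style_scale, style_bias, eps=1e-5):
    """Adaptive instance normalization: normalize each feature map per sample,
    then re-scale and shift it with parameters derived from the style code."""
    mean = features.mean(dim=(2, 3), keepdim=True)
    std = features.std(dim=(2, 3), keepdim=True) + eps
    normalized = (features - mean) / std
    return style_scale[:, :, None, None] * normalized + style_bias[:, :, None, None]

# Toy affine layer mapping a latent w to per-channel scale/bias; in StyleGAN this
# transform is learned separately for every generator layer, and w comes from
# an 8-layer mapping network rather than being sampled directly.
features = torch.randn(1, 64, 16, 16)   # intermediate generator feature maps
w = torch.randn(1, 512)                 # latent style code
to_style = nn.Linear(512, 2 * 64)       # hypothetical learned affine layer
scale, bias = to_style(w).chunk(2, dim=-1)
print(adain(features, scale, bias).shape)  # torch.Size([1, 64, 16, 16])
```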

So in summary, CLIP, BigGAN and StyleGAN have driven major advances in controllable image synthesis and editing using vision-language models, conditional generation, and style-based disentangled representations respectively.
