Generative Images
Here are some leading AI architectures for generative image modeling from text and image inputs:
Text-to-Image:
DALL-E 2 - Combines a CLIP-based prior with a diffusion decoder to generate images from text descriptions (the original DALL-E instead paired a transformer with a discrete VAE image tokenizer).
Imagen - Pairs a frozen T5 text encoder with a cascade of diffusion models to generate photorealistic images from text (a sketch of this text-conditioned denoising pattern follows this list).
GLIDE - A text-conditioned diffusion model that uses classifier-free guidance (or CLIP guidance) for high-fidelity image generation from text prompts.
Parti - An autoregressive encoder-decoder transformer that generates image tokens, produced by a ViT-VQGAN tokenizer, conditioned on the text prompt.
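The diffusion-based systems above (DALL-E 2's decoder, Imagen, GLIDE) share a common sampling pattern: encode the prompt, then repeatedly denoise a noisy image conditioned on that encoding. The sketch below only illustrates that pattern; TextEncoder and DenoiserUNet are placeholder modules and the update rule is simplified, so it is not any one paper's implementation.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Placeholder for a frozen language model (e.g. T5 in Imagen)."""
    def __init__(self, vocab=32000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)

    def forward(self, token_ids):
        return self.embed(token_ids)          # (batch, seq, dim)

class DenoiserUNet(nn.Module):
    """Placeholder for the text-conditioned network that predicts noise."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Conv2d(3, 3, kernel_size=3, padding=1)
        self.cond = nn.Linear(dim, 3)

    def forward(self, x_t, t, text_emb):
        # Real models also mix in timestep embeddings and cross-attention over text tokens.
        cond = self.cond(text_emb.mean(dim=1))[:, :, None, None]
        return self.proj(x_t) + cond

@torch.no_grad()
def sample(prompt_tokens, steps=50, size=64):
    text_emb = TextEncoder()(prompt_tokens)
    denoiser = DenoiserUNet()
    x = torch.randn(prompt_tokens.size(0), 3, size, size)   # start from pure noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t, text_emb)        # predict the noise at step t
        x = x - eps / steps                   # toy update; real samplers follow a noise schedule
    return x

image = sample(torch.randint(0, 32000, (1, 8)))
print(image.shape)  # torch.Size([1, 3, 64, 64])
```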
Image-to-Image:
Pix2Pix - A conditional GAN with a U-Net generator and PatchGAN discriminator for paired image-to-image translation tasks such as turning sketches into photos.
CycleGAN - Employs a cycle consistency loss with two GANs to translate images between unpaired domains without paired supervision (see the loss sketch after this list).
SPADE - A semantic image synthesis model that injects segmentation maps through spatially-adaptive normalization layers to generate photorealistic images from label maps.
Deep Image Analogy - Matches filter responses between input and exemplar images using a pre-trained CNN to transfer visual attributes.
U-GAT-IT - A generative adversarial network with a novel attention module for unsupervised image-to-image translation.
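To make the unpaired setup concrete, the snippet below sketches CycleGAN's cycle-consistency term: translate A to B and back, then penalize the reconstruction error, and symmetrically for B. The generators and the L1 weight here are simple stand-ins, not the reference implementation.

```python
import torch
import torch.nn as nn

# Stand-in generators; the paper uses ResNet-based encoder-decoder networks.
G_ab = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # maps domain A -> domain B
G_ba = nn.Conv2d(3, 3, kernel_size=3, padding=1)   # maps domain B -> domain A
l1 = nn.L1Loss()

def cycle_consistency_loss(real_a, real_b, lam=10.0):
    """L1 penalty for mapping an image to the other domain and back again."""
    rec_a = G_ba(G_ab(real_a))      # A -> B -> A
    rec_b = G_ab(G_ba(real_b))      # B -> A -> B
    return lam * (l1(rec_a, real_a) + l1(rec_b, real_b))

loss = cycle_consistency_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64))
loss.backward()   # trained jointly with the two adversarial losses
```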
In summary, transformer and diffusion models (often paired with VQ-VAE-style image tokenizers) have proven very effective for text-to-image generation, while convolutional GAN architectures are widely used for image-to-image tasks.
CLIP, BigGAN and StyleGAN are some other notable AI architectures for generative image modeling:
CLIP (Contrastive Language-Image Pre-Training) - Developed by OpenAI, CLIP contrastively trains an image encoder (ResNet or vision transformer) alongside a text transformer, enabling zero-shot image classification from natural-language prompts (a short usage sketch follows this list).
BigGAN - A GAN architecture from DeepMind that uses residual blocks, self-attention, and orthogonal regularization to synthesize high-fidelity images (up to 512x512) conditioned on ImageNet class labels.
StyleGAN - Developed by Nvidia, StyleGAN uses a style-based generator, in which a mapping network and adaptive instance normalization (AdaIN) inject styles at every layer, to provide fine-grained control over image synthesis results.
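To make CLIP's zero-shot classification concrete, the sketch below scores an image against a few candidate text prompts using the Hugging Face transformers wrappers. The checkpoint name, image path, and label prompts are just examples, and the snippet assumes the transformers and Pillow packages are installed.

```python
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Example checkpoint; other CLIP checkpoints on the Hugging Face hub work the same way.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")                    # replace with your own image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, turned into a probability over the candidate labels.
probs = outputs.logits_per_image.softmax(dim=-1)
for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```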
Some key points on their applications:
CLIP provides the shared image-text embedding space that enables text-guided control of other generative models, such as DALL-E 2's prior and CLIP-guided diffusion.
BigGAN demonstrated highly realistic conditional image generation guided by class labels. But it cannot be directly controlled using text.
StyleGAN introduced style-based editing of faces and objects by separating high-level attributes from stochastic variation in the GAN framework (the AdaIN sketch below shows the core style-injection step).
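The "style-based" part of StyleGAN comes down to adaptive instance normalization: each layer's feature maps are normalized and then rescaled and shifted by an affine transform predicted from the style vector. The snippet below is a minimal stand-alone version of that operation; the layer sizes and affine mapping are illustrative, not the published network.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: modulate feature maps with a style vector."""
    def __init__(self, channels, style_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.affine = nn.Linear(style_dim, channels * 2)   # predicts per-channel scale and bias

    def forward(self, features, style):
        scale, bias = self.affine(style).chunk(2, dim=1)
        scale = scale[:, :, None, None]
        bias = bias[:, :, None, None]
        return self.norm(features) * (1 + scale) + bias

# A latent z would first pass through StyleGAN's mapping network to produce the style w.
w = torch.randn(2, 512)
features = torch.randn(2, 256, 16, 16)
print(AdaIN(256, 512)(features, w).shape)   # torch.Size([2, 256, 16, 16])
```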
So in summary, CLIP, BigGAN and StyleGAN have driven major advances in controllable image synthesis and editing using vision-language models, conditional generation, and style-based disentangled representations respectively.