Stable Diffusion (StabilityAI)

Stable Diffusion

http://github.com/Stability-AI/stablediffusion

‘Stable Diffusion’ is software that uses a ‘Diffusion Model’ combined with a Variational AutoEncoder (VAE), one of the four well-known Deep Learning-based Image Generative Models, for text2image creation. It is primarily used to generate detailed images conditioned on text descriptions, though it can also be applied to other tasks such as inpainting, outpainting, and image-to-image translation guided by a text prompt.

Stable Diffusion was developed by Stability AI in collaboration with a number of academic researchers and non-profit organizations.

Stable Diffusion is a Latent Diffusion Model, a kind of Deep Generative Neural Network. Latent diffusion models are machine learning models designed to learn the underlying structure of a dataset by mapping it to a lower-dimensional latent space. This latent space represents the data in a form in which the relationships between different data points are more easily understood and analyzed.

Stable Diffusion’s Training Data

Stable Diffusion was trained on pairs of images and captions taken from LAION-5B, a publicly available dataset derived from Common Crawl data scraped from the web, in which 5 billion image-text pairs were classified by language and filtered into separate datasets by resolution, predicted likelihood of containing a watermark, and predicted "aesthetic" score (i.e. subjective visual quality).

The dataset was created by LAION, a German non-profit which receives funding from Stability AI. The Stable Diffusion model was trained on three subsets of LAION-5B: laion2B-en, laion-high-resolution, and laion-aesthetics v2 5+. A third-party analysis of the model's training data found that, out of a smaller subset of 12 million images taken from the original wider dataset, approximately 47% of the sampled images came from 100 different domains, with Pinterest accounting for 8.5% of the subset, followed by websites such as WordPress, Blogspot, Flickr, DeviantArt and Wikimedia Commons.
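
As an illustration of the kind of metadata filtering described above (by resolution, predicted watermark likelihood, and predicted aesthetic score), here is a minimal sketch in Python. The column names and thresholds are assumptions chosen for illustration; the actual cut-offs used for the training subsets are described in the next section.

```python
# Hypothetical sketch: filtering LAION-style image-text metadata the way the
# curated training subsets are described (aesthetic score, watermark probability,
# minimum resolution). Column names and file paths are assumptions.
import pandas as pd

df = pd.read_parquet("laion_metadata.parquet")  # one local metadata shard

filtered = df[
    (df["aesthetic"] >= 5.0)       # predicted "how much would a human like this" score
    & (df["pwatermark"] < 0.8)     # predicted probability of carrying a watermark
    & (df["WIDTH"] >= 512)         # drop low-resolution images
    & (df["HEIGHT"] >= 512)
]

# Keep only the image URLs and captions needed to build the training pairs.
filtered[["URL", "TEXT"]].to_parquet("filtered_subset.parquet")
```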

Training Procedures

The model was initially trained on the laion2B-en and laion-high-resolution subsets, with the last few rounds of training done on LAION-Aesthetics v2 5+, a subset of 600 million captioned images for which the LAION-Aesthetics Predictor V2 predicted that humans would, on average, give a score of at least 5 out of 10 when asked how much they liked them. The LAION-Aesthetics v2 5+ subset also excluded low-resolution images and images which LAION-5B-WatermarkDetection identified as carrying a watermark with greater than 80% probability. The final rounds of training additionally dropped 10% of the text conditioning to improve Classifier-Free Diffusion Guidance.
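
The practical effect of dropping text conditioning during training is that the same network also learns an unconditional model, which classifier-free guidance then mixes with the conditional prediction at sampling time. A minimal sketch of the idea, where `denoiser`, `text_embedding`, and `null_embedding` are hypothetical stand-ins:

```python
def classifier_free_guidance(denoiser, latents, t, text_embedding, null_embedding,
                             guidance_scale=7.5):
    """Combine conditional and unconditional noise predictions.

    `denoiser` is a stand-in for the U-Net; `null_embedding` is the embedding of
    the empty prompt, which the model has seen because text conditioning was
    dropped for roughly 10% of training examples.
    """
    eps_uncond = denoiser(latents, t, null_embedding)   # prediction without the prompt
    eps_cond = denoiser(latents, t, text_embedding)     # prediction with the prompt
    # Push the result away from the unconditional prediction, toward the prompt.
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```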

The model was trained using 256 Nvidia A100 GPUs on Amazon Web Services for a total of 150,000 GPU-hours, at a cost of $600,000.

The creators of Stable Diffusion acknowledge the potential for algorithmic bias, as the model was primarily trained on images with English descriptions. As a result, generated images reinforce social biases and reflect a Western perspective; the creators note that the model lacks data from other communities and cultures. The model gives more accurate results for prompts written in English than for those written in other languages, with Western or white cultures often being the default representation.

Q: How is text2image generation using prompts achieved in ‘Stable Diffusion’?

Diffusion models do not inherently have the capability to generate images from text prompts directly. Instead, they are typically used to generate images by simulating a diffusion process. However, they can be combined with other models or techniques to generate images based on text prompts.

As described above, ‘Stable Diffusion’ pairs a ‘Diffusion Model’ with a Variational AutoEncoder (VAE) and, crucially for text2image creation, a text encoder.

In the case of Stable Diffusion, text-to-image synthesis is achieved by combining the diffusion model with an encoder that translates the input text into a meaningful representation, which then guides the diffusion process during image generation. The text encoder is a separate model, typically a Transformer (Stable Diffusion uses CLIP's text Transformer); in principle an RNN (Recurrent Neural Network) could also be used to encode the text before the diffusion model performs image synthesis. The VAE, by contrast, handles compressing images into, and decoding them from, the latent space in which the diffusion process runs.
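
A minimal sketch of that text-encoding step, assuming the Hugging Face transformers library and the CLIP ViT-L/14 checkpoint mentioned later in this page; the resulting per-token embeddings are what condition the denoising network:

```python
# Sketch: turning a prompt into the embedding sequence that conditions the denoiser.
# Assumes the `transformers` library and the openai/clip-vit-large-patch14 checkpoint
# (the CLIP ViT-L/14 text encoder used by Stable Diffusion v1).
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

prompt = "a gingerbread house, diorama, in focus, white background"
tokens = tokenizer(prompt, padding="max_length", max_length=77,
                   truncation=True, return_tensors="pt")

with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

print(text_embeddings.shape)  # (1, 77, 768): one 768-d vector per token position
```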

The primary difference between Stable Diffusion and DALL-E for text-to-image synthesis lies in the underlying architecture and approach for generating images from text prompts:

Stable Diffusion Architecture

Stable Diffusion is a Generative Modeling Method based on the ‘Denoising Diffusion Probabilistic Model (DDPM)’ framework. It doesn't have a specific architecture on its own but rather refers to the method or process used for generating photo-realistic images given text input.

When implementing a Stable Diffusion model, you would typically use a neural network architecture to perform the denoising step during the diffusion process. The choice of architecture depends on the nature of the data and the task at hand. For image-related tasks, a Convolutional Neural Network (CNN) is often a suitable choice, as it is designed to handle grid-like data and can efficiently capture local spatial patterns and hierarchical features in images. For other tasks, such as text generation or processing, a Transformer Network could be employed.
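
To make the denoising step concrete, here is a minimal sketch of one DDPM-style training step in PyTorch; `model` stands in for whatever denoiser is chosen (e.g. a CNN/U-Net for images), and the linear noise schedule is an assumption chosen for brevity:

```python
# Sketch of a single DDPM training step: the network is trained to predict the
# Gaussian noise that was added to a clean (latent) image at a random timestep.
# `model` is a stand-in for the chosen denoiser; inputs are (batch, channels, H, W).
import torch
import torch.nn.functional as F

T = 1000                                         # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(model, x0):
    """One step of the denoising objective on a clean batch x0."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                        # random timestep per sample
    noise = torch.randn_like(x0)                         # the noise to be predicted
    a_bar = alphas_cumprod[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # noised input
    noise_pred = model(x_t, t)                           # denoiser predicts the noise
    return F.mse_loss(noise_pred, noise)                 # simple L2 objective
```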

Stable Diffusion uses a kind of diffusion model (DM), called a Latent Diffusion Model (LDM), developed by the CompVis group at LMU Munich. Introduced in 2015, diffusion models are trained with the objective of removing successive applications of Gaussian noise from training images, and can be thought of as a sequence of denoising autoencoders.

Stable Diffusion consists of 3 parts:

· The Variational Autoencoder (VAE),

· The U-Net,

· An optional text encoder.

In summary, Stable Diffusion is a method based on the DDPM framework and does not have a specific architecture on its own. Instead, it relies on a suitable neural network architecture, such as a CNN or a Transformer Network, to perform the denoising step during the diffusion process. The choice of architecture depends on the specific task and the nature of the input data.

Looking at those three parts in more detail: the VAE encoder compresses the image from pixel space to a smaller-dimensional latent space, capturing a more fundamental semantic meaning of the image. Gaussian noise is iteratively applied to the compressed latent representation during forward diffusion. The U-Net block, composed of a ResNet backbone, denoises the output of forward diffusion backwards to obtain a latent representation. Finally, the VAE decoder generates the final image by converting that representation back into pixel space.

The denoising step can be flexibly conditioned on a string of text, an image, or another modality. The encoded conditioning data is exposed to the denoising U-Net via a cross-attention mechanism. For conditioning on text, the fixed, pretrained CLIP ViT-L/14 text encoder is used to transform text prompts into an embedding space. Researchers point to increased computational efficiency for training and generation as an advantage of LDMs.
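
A sketch of that pipeline at inference time, assuming the Hugging Face diffusers and transformers libraries and the publicly released v1.5 weights (the model id below is an assumption); classifier-free guidance and negative prompts are omitted for brevity:

```python
# Sketch of the latent-diffusion inference loop: encode the prompt, denoise random
# latents with the U-Net (conditioned via cross-attention), decode with the VAE.
import torch
from diffusers import AutoencoderKL, UNet2DConditionModel, DDIMScheduler
from transformers import CLIPTextModel, CLIPTokenizer

model_id = "runwayml/stable-diffusion-v1-5"
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
scheduler = DDIMScheduler.from_pretrained(model_id, subfolder="scheduler")
tokenizer = CLIPTokenizer.from_pretrained(model_id, subfolder="tokenizer")
text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")

# 1. Encode the prompt with the CLIP text encoder.
tokens = tokenizer("a gingerbread house, diorama, in focus",
                   padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids).last_hidden_state

# 2. Start from pure Gaussian noise in the 4x64x64 latent space (512x512 pixels).
latents = torch.randn(1, unet.config.in_channels, 64, 64) * scheduler.init_noise_sigma

# 3. Iteratively denoise, conditioning the U-Net on the text via cross-attention.
scheduler.set_timesteps(30)
for t in scheduler.timesteps:
    with torch.no_grad():
        noise_pred = unet(scheduler.scale_model_input(latents, t), t,
                          encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample

# 4. Decode the denoised latents back to pixel space with the VAE decoder.
with torch.no_grad():
    image = vae.decode(latents / vae.config.scaling_factor).sample  # (1, 3, 512, 512)
```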

Diffusion models like the ones used in deep learning for tasks such as image synthesis can be trained on various datasets depending on the specific task at hand. For example, a diffusion model for image synthesis could be trained on datasets such as ImageNet, CIFAR-10, or a custom dataset of images.

In layman’s terms: Stable Diffusion is an AI model that generates images from text input. You give the model prompts such as:

gingerbread house, diorama, in focus, white background, toast, crunch cereal

The AI model would generate images that match the prompt.
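
For example, a few lines with the Hugging Face diffusers library (one of several ways to run the publicly released weights; the model id below is an assumption) are enough to turn a prompt like the one above into an image:

```python
# Minimal text2image sketch using the Hugging Face `diffusers` library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")  # a consumer GPU with a few GB of VRAM is enough

prompt = "gingerbread house, diorama, in focus, white background, toast, crunch cereal"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("gingerbread_house.png")
```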

There are similar text-to-image generation services such as DALL-E and Midjourney. Why Stable Diffusion? The advantages of Stable Diffusion are:

· Open-source: Many enthusiasts have created free and powerful tools.

· Designed for low-power computers: It’s free or cheap to run.

Advanced GUI Option for Stable Diffusion

You can use a more advanced GUI (Graphical User Interface) if you outgrow the free online Stable Diffusion services (text2image generators), since their functionality is pretty limited.

AUTOMATIC1111 is a powerful and popular GUI choice.

(See the Quick Start Guide for setting up AUTOMATIC1111 on a Google Colab cloud server.)

You can run AUTOMATIC1111 on your PC as well if you have a decent NVIDIA GPU with at least 4GB of VRAM.

Why use an advanced GUI? A whole array of tools is at your disposal:

· Advanced prompting techniques

· Regenerate a small part of an image with Inpainting

· Generate images based on an input image (Image-to-image; see the sketch after this list)

· Edit an image by giving it an instruction.
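
As a concrete example of the image-to-image tool mentioned in the list above, here is a minimal sketch using the Hugging Face diffusers library (an assumption; AUTOMATIC1111 exposes the same capability through its img2img tab):

```python
# Image-to-image sketch: an input picture plus a text prompt guides the generation.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("rough_sketch.png").convert("RGB").resize((512, 512))
prompt = "gingerbread house, diorama, in focus, white background"

# `strength` controls how much of the original image is replaced (0 = keep, 1 = replace).
result = pipe(prompt=prompt, image=init_image, strength=0.6, guidance_scale=7.5).images[0]
result.save("gingerbread_from_sketch.png")
```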

Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input; in other words, it is a generative model that converts textual descriptions into photo-realistic images. This type of model is often referred to as a "text-to-image synthesis" model.

Text-to-image synthesis models are designed to generate images based on textual input, often using deep learning techniques. They typically involve two key components:

  • a text encoder, which processes the input text and extracts meaningful features, and

  • an image generator, which takes these features and generates an image accordingly.

Latent, by definition, means “hidden”. The concept of “latent space” is important because its utility is at the core of 'deep learning': learning the features of data and simplifying data representations for the purpose of finding patterns. A latent space, or embedding space, is an embedding of a set of items within a manifold in which items resembling each other are positioned closer to one another. ‘Latent space’ refers specifically to the space from which the low-dimensional representation is drawn, while ‘embedding’ refers to the way the low-dimensional data is mapped to (“embedded in”) the original higher-dimensional space. If it seems that this process is 'hidden' from you, it's because it is.

Stable Diffusion’s code and model weights have been released publicly, and it can run on most consumer hardware equipped with a modest GPU with at least 8 GB of VRAM. This marked a departure from previous proprietary text-to-image models such as DALL-E (OpenAI), Imagen (Google) and Midjourney, which were accessible only via cloud services.
