LoRA (Low-Rank Adaptation of Large Language Models)

What is LoRA and how does it work in Stable Diffusion?

LoRA stands for Low-Rank Adaptation of Large Language Models (LLMs). It is a technique proposed by Microsoft Research to address the challenge of optimizing and fine-tuning large language models. Models such as GPT-3, with billions of parameters (and GPT-4, reportedly even larger), are prohibitively expensive to adapt to specific tasks or domains through full fine-tuning.
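
The core idea is to freeze the pretrained weight matrix W and learn only a small low-rank update ΔW = B·A, scaled by alpha/r. Below is a minimal PyTorch sketch of that idea (not the official implementation); the layer size (768) and rank (4) are arbitrary examples chosen for illustration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W_eff = W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 4, alpha: int = 4):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)      # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)  # r x in
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))        # out x r, zero init -> no change at start
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Example: a 768x768 projection, roughly the size of an attention projection in SD 1.x.
layer = LoRALinear(nn.Linear(768, 768), r=4)
full = 768 * 768                                                    # parameters in the frozen weight
lora = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(full, lora, round(full / lora))  # 589824, 6144 -> ~96x fewer trainable parameters
```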

In the Stable Diffusion ecosystem, LoRA models are small add-on files that apply tiny changes to standard checkpoint models. They are usually 10 to 100 times smaller than checkpoint models, which makes them very attractive to people who maintain extensive collections of models. LoRA’s adaptation technique maintains model quality while significantly reducing the number of trainable parameters for downstream tasks, with no increase in inference time.
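
In practice, a LoRA file is applied on top of a full checkpoint at load time. Here is a hedged sketch using the diffusers library (assuming a recent version that provides `load_lora_weights`); the model ID, directory, and file name are placeholders, not a specific recommended model.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the full base checkpoint (a multi-GB download), then patch in a small LoRA file on top.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Placeholder directory/file name: any LoRA trained against the same base model should work here.
pipe.load_lora_weights("path/to/lora_dir", weight_name="my_style_lora.safetensors")

image = pipe("a portrait photo, my_style", num_inference_steps=30).images[0]
image.save("lora_example.png")
```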

One can “teach” new concepts to a Stable Diffusion model using Dreambooth. Dreambooth and LoRA are compatible, and the procedure is similar to regular fine-tuning, with a few benefits (see the sketch after this list):

  • Training is more rapid.

  • Only a few pictures of the subject are required (5-10 are usually enough).

  • The text encoder can also be adjusted, if desired, for additional subject-specific fidelity.
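
A minimal sketch of such a Dreambooth-LoRA setup using the peft library is shown below. The attention module names (`to_q`, `to_k`, `to_v`, `to_out.0` for the UNet; `q_proj`, `k_proj`, `v_proj`, `out_proj` for the CLIP text encoder) follow diffusers/transformers naming; the rank, alpha, and model ID are arbitrary example values, and the actual training loop over the 5-10 subject images is omitted.

```python
import torch
from diffusers import StableDiffusionPipeline
from peft import LoraConfig, get_peft_model

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float32
)

# Attach low-rank adapters to the UNet's attention projections; everything else stays frozen.
unet_config = LoraConfig(r=4, lora_alpha=4,
                         target_modules=["to_q", "to_k", "to_v", "to_out.0"])
unet = get_peft_model(pipe.unet, unet_config)
unet.print_trainable_parameters()   # only a tiny fraction of the UNet is trainable

# Optional: also adapt the text encoder for extra subject fidelity.
text_config = LoraConfig(r=4, lora_alpha=4,
                         target_modules=["q_proj", "k_proj", "v_proj", "out_proj"])
text_encoder = get_peft_model(pipe.text_encoder, text_config)

# A Dreambooth-style run would now train only these adapter weights on the
# handful of instance images and their prompt (e.g. "a photo of sks dog").
```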

Easy fine-tuning has long been a goal. Besides Dreambooth, Textual Inversion is another popular technique for introducing new concepts to a trained Stable Diffusion model. One of its key benefits is that the trained embedding weights are small, portable, and easy to share. However, a Textual Inversion embedding covers a single subject (or a small number of subjects), whereas LoRA can be used for general-purpose fine-tuning, which allows it to be adapted to different domains or datasets.
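
For comparison, a Textual Inversion embedding is just a tiny learned token embedding that can be dropped into a pipeline. A hedged sketch using diffusers’ `load_textual_inversion` (the embedding path and token name are placeholders):

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# A Textual Inversion embedding is typically only a few KB and binds a new token to one concept.
pipe.load_textual_inversion("path/to/my_concept_embedding", token="<my-concept>")

image = pipe("a painting of <my-concept> in a forest").images[0]
```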

Even though LoRA was initially proposed for large language models and demonstrated on Transformer blocks, the technique can also be applied elsewhere.

LoRA applies small changes to the most critical part of Stable Diffusion models: the cross-attention layers. This is the part of the model where the image and the prompt meet, i.e. the layers that relate the image representations to the prompts that describe them. The original researchers found that fine-tuning only the Transformer attention blocks of large language models with LoRA gave quality on par with full-model fine-tuning while being much faster and requiring less compute; likewise, in Stable Diffusion it is sufficient to fine-tune this part of the model to achieve good training results.
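
To make “cross-attention layers” concrete: in the diffusers UNet for Stable Diffusion, each Transformer block contains a self-attention module (`attn1`) and a cross-attention module (`attn2`) whose `to_q`/`to_k`/`to_v`/`to_out` projections attend over the text-prompt embeddings, and these projections are where LoRA deltas are typically injected. A small inspection sketch (the model ID is an example; exact module counts depend on the checkpoint):

```python
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Collect the cross-attention projection layers ("attn2" attends over the prompt embeddings).
cross_attn_proj = [name for name, _ in unet.named_modules()
                   if "attn2" in name and name.endswith(("to_q", "to_k", "to_v", "to_out.0"))]

print(len(cross_attn_proj), "cross-attention projections, e.g.:")
print("\n".join(cross_attn_proj[:3]))
# These Linear layers are where a Stable Diffusion LoRA adds its low-rank B @ A updates.
```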
