LLaVA 1.5

LLaVA: Large Language and Vision Assistant

LLaVA is a large multimodal model that connects a vision encoder to a large language model for general-purpose visual and language understanding. Trained end-to-end, it aims to deliver impressive chat abilities in the spirit of the multimodal GPT-4.
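
As a quick illustration of what this looks like in practice, the sketch below queries a LLaVA-1.5 checkpoint about an image. The checkpoint name "llava-hf/llava-1.5-7b-hf" and the transformers-based calls are assumptions for illustration, not the official project's own inference tooling; the official repository ships its own CLI and serving scripts.

```python
# A minimal usage sketch, assuming the community Hugging Face port of LLaVA-1.5;
# the checkpoint name and library calls below are assumptions for illustration,
# not the official project's own inference scripts.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# LLaVA-1.5 prompts place an <image> placeholder where the visual tokens go.
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
image = Image.open("example.jpg")  # any local image

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```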

The key focus of LLaVA is visual instruction tuning: using machine-generated instruction-following data to extend instruction tuning of large language models to the multimodal domain. The project leverages the language-only GPT-4 to generate multimodal language-image instruction-following data, bridging the gap between language and vision.
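
The resulting data pairs an image reference with a multi-turn conversation. The snippet below is an illustrative sketch of that shape; the field names and values are assumptions based on this description, not an excerpt from the released dataset.

```python
# Illustrative sketch of a single language-image instruction-following sample:
# an image reference plus a conversation whose responses were produced by the
# language-only GPT-4 from textual descriptions of the image. Field names and
# values here are assumptions, not copied from the released dataset.
sample = {
    "id": "000000123456",
    "image": "coco/train2017/000000123456.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is unusual about this scene?"},
        {"from": "gpt", "value": "A man is ironing clothes on a board attached to the back of a moving taxi."},
    ],
}
```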

With LLaVA, users get an AI-powered assistant that excels at chat and responds accurately to a wide range of visual instructions. It sets a new state-of-the-art accuracy on science question answering and generalizes well to unseen images and instructions.

Key Features of LLaVA:

  • Multimodal Instruction Generation: LLaVA leverages language-only models to generate language-image instruction pairs, enabling effective instruction following in the multimodal domain.

  • Large Language and Vision Model: LLaVA connects a vision encoder with a powerful language model, allowing it to reason about visual content and respond in natural language.

  • Fine-tuning Capabilities: LLaVA can be fine-tuned on specific tasks, such as science question answering, to enhance its performance in domain-specific applications (see the fine-tuning sketch after this list).

  • Open-Source Availability: The GPT-4 generated visual instruction tuning data, LLaVA model, and code base are made publicly available, promoting research and collaboration in the field of multimodal AI.
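
The sketch below shows one way such task-specific fine-tuning could look, using LoRA adapters through the Hugging Face port of the model. The checkpoint name, the peft-based setup, and the hyperparameters are assumptions for illustration; the official repository provides full and LoRA fine-tuning scripts of its own.

```python
# A rough sketch of task-specific fine-tuning with LoRA adapters, assuming the
# Hugging Face port of LLaVA-1.5 and the peft library; treat this as an outline
# of the idea rather than the project's own training recipe.
import torch
from transformers import AutoProcessor, LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16)

# Attach low-rank adapters to the language model's attention projections so
# that only a small fraction of parameters is updated on the downstream task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# From here, build a dataset of (image, instruction, answer) examples with the
# processor and train with the standard transformers Trainer.
```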

LLaVA is a significant advancement in the field of multimodal AI, giving researchers, developers, and AI enthusiasts a powerful open tool for studying and building state-of-the-art models that bridge language and vision.

LLaVA-1.5 achieves state-of-the-art results on 11 benchmarks with only simple modifications to the original LLaVA. It uses only publicly available data, completes training in about one day on a single node of 8 A100 GPUs, and surpasses methods that rely on billion-scale data.

LLaVA is a novel end-to-end trained large multimodal model that combines a vision encoder and Vicuna for general-purpose visual and language understanding, achieving impressive chat capabilities mimicking the spirit of the multimodal GPT-4 and setting a new state-of-the-art accuracy on Science QA.
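
Conceptually, the bridge between the two components is small: patch features from the vision encoder (CLIP ViT-L/14 in the paper) are projected into the language model's word-embedding space and consumed alongside the text tokens. The sketch below illustrates that connector; the class name and the exact dimensions are assumptions for illustration, though LLaVA-1.5 does use a two-layer MLP here, where the original LLaVA used a single linear projection.

```python
# A minimal conceptual sketch of the vision-language connector: patch features
# from a frozen vision encoder are projected into the language model's
# word-embedding space and treated as "visual tokens". Class name and
# dimensions below are assumptions for illustration.
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):  # hypothetical module name
    def __init__(self, vision_dim: int = 1024, hidden_dim: int = 4096):
        super().__init__()
        # Two-layer MLP projector, as used in LLaVA-1.5.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, hidden_dim)
        return self.proj(patch_features)

# The projected visual tokens are concatenated with the embedded text prompt
# and fed to the language model (Vicuna) for autoregressive generation.
patch_features = torch.randn(1, 576, 1024)  # e.g. 24x24 patches from CLIP ViT-L/14 at 336px
visual_tokens = VisionLanguageProjector()(patch_features)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```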

Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. NeurIPS 2023 (Oral). University of Wisconsin-Madison, Microsoft Research, Columbia University.

To learn more about LLaVA and access the resources related to the project, including the code, model, and dataset, visit the LLaVA website.
