[LLaVA] Visual Instruction Tuning #39

Author

  • Haotian Liu∗, Chunyuan Li∗, Qingyang Wu, Yong Jae Lee (∗ equal contribution)

Abstract

  • We present the first attempt to use language-only GPT-4 to generate multimodal language-image
    instruction-following data.
  • By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.

Introduction

  • In this paper, we present visual instruction-tuning
  • Multimodal instruction-following data
    • present a data reformation perspective and pipeline to convert image-text pairs into an appropriate instruction-following format, using ChatGPT/GPT-4.
  • Large multimodal models
    • develop a large multimodal model (LMM), by connecting the open-set visual encoder of CLIP [40] with the language decoder Vicuna [9], and fine-tuning end-to-end on our generated instructional vision-language data.
  • Multimodal instruction-following benchmark
    • present LLaVA-Bench with two challenging benchmarks, with a diverse selection of paired images, instructions and detailed annotations.
  • Open-source
    • release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo

GPT-assisted Visual Instruction Data Generation

  • public multimodal data such as image-text pairs ranges from CC [8] to LAION [45]
  • However, when it comes to multimodal instruction-following data, the available amount is limited
    • propose to leverage ChatGPT/GPT-4 for multimodal instruction-following data collection, based on the widely existing image-pair data.
  • Given an image and its caption, the data is constructed by generating questions grounded in the caption (or even generating full QA pairs directly)
  • we use two types of symbolic representations (not only captions but also bounding boxes; see the prompt sketch after this list):
    • (i) Captions typically describe the visual scene from various perspectives;
    • (ii) Bounding boxes usually localize the objects in the scene, and each box encodes the object concept
      and its spatial location.
  • We use COCO images [31] and generate three types of instruction-following data.
    • Conversation: multi-turn QA about the visual content of the image, including the object types, counting the objects, object actions, object locations, and relative positions between objects.
    • Detailed description: randomly sample one question (from a list created by the authors) and ask GPT-4 to generate a detailed description of the image
    • Complex reasoning: The answers typically require a step-by-step reasoning process by following rigorous logic
  • collect 158K unique language-image instruction-following samples in total, including 58K in conversations, 23K in detailed description, and 77K in complex reasoning, respectively.
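
A minimal sketch of how the caption + bounding-box symbolic representation could be serialized into a text prompt for a language-only GPT-4. The helper names, prompt wording, and toy annotations are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch: serialize COCO-style annotations into the symbolic text context
# that a text-only LLM sees when generating instruction-following data.
# The task instructions and helper names below are illustrative only.

def build_symbolic_context(captions, boxes):
    """captions: list of str; boxes: list of (category, x1, y1, x2, y2), normalized coords."""
    cap_lines = "\n".join(captions)
    box_lines = "\n".join(
        f"{cat}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for cat, x1, y1, x2, y2 in boxes
    )
    return f"Captions:\n{cap_lines}\n\nObjects (bounding boxes):\n{box_lines}"

TASK_INSTRUCTIONS = {
    "conversation": "Design a dialog between a user asking about the image "
                    "and an assistant answering as if it can see the image.",
    "detailed_description": "Describe the image in detail.",
    "complex_reasoning": "Create a question that requires step-by-step "
                         "reasoning about the image, then answer it.",
}

def build_generation_prompt(captions, boxes, task="conversation"):
    # The returned text would be sent to a text-only ChatGPT/GPT-4.
    return f"{TASK_INSTRUCTIONS[task]}\n\n{build_symbolic_context(captions, boxes)}"

# Toy usage:
prompt = build_generation_prompt(
    captions=["A man rides a motorcycle down a dirt road."],
    boxes=[("person", 0.30, 0.20, 0.55, 0.80), ("motorcycle", 0.25, 0.40, 0.70, 0.95)],
    task="complex_reasoning",
)
print(prompt)
```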

Visual Instruction Tuning

Architecture

  • The image is first encoded by the vision encoder into features Z, which are then projected into the language embedding space H (see the projection sketch below)
    • For Z, the grid features before and after the last Transformer layer are considered
    • CLIP ViT-L/14 is used (the paper notes that other vision encoders could be used as well)
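
A minimal PyTorch sketch of the projection described above: grid features Z_v from the frozen CLIP encoder are mapped by a single trainable linear layer W into the LLM word-embedding space, giving the visual tokens H_v. The class name and dimensions are illustrative assumptions (e.g. 1024 for ViT-L/14 grid features, 4096 for a 7B-scale LLM hidden size).

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the projection H_v = W * Z_v (a single trainable linear map)."""

    def __init__(self, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.W = nn.Linear(d_vision, d_llm, bias=False)  # trainable projection matrix W

    def forward(self, Z_v: torch.Tensor) -> torch.Tensor:
        # Z_v: (batch, num_patches, d_vision) grid features from the frozen CLIP encoder
        return self.W(Z_v)  # H_v: (batch, num_patches, d_llm) visual "tokens" for the LLM

# Toy usage: 256 grid features per image, projected into the LLM embedding space.
Z_v = torch.randn(2, 256, 1024)
H_v = VisualProjector()(Z_v)
print(H_v.shape)  # torch.Size([2, 256, 4096])
```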

Training

  • Training is done on multi-turn conversation data
  • In the first turn, the order of the image and the question is randomized: the question may come before the image, or the image may come first followed by the question
  • Stage 1: Pre-training for Feature Alignment (only the adapter, i.e. the projection matrix W, is trained)
    • To strike a balance between concept coverage and training efficiency, we filter CC3M to 595K image-text pairs.
    • In training, we keep both the visual encoder and LLM weights frozen, and maximize the likelihood of (3) with trainable parameters θ = W (the projection matrix) only. In this way, the image features H_v can be aligned with the pre-trained LLM word embedding. This stage can be understood as training a compatible visual tokenizer for the frozen LLM.
  • Stage 2: Fine-tuning End-to-End (only the LLM and the adapter are trained; the visual encoder stays frozen; see the freezing sketch after this list)
    • We always keep the visual encoder weights frozen, and continue to update both the pre-trained weights of the projection layer and LLM in LLaVA; i.e., the trainable parameters are θ = {W, φ} in (3)
    • Two use case scenarios
      • Multimodal Chatbot: develop a Chatbot by fine-tuning on the 158K language-image instruction-following data in Section 3
      • Science QA: Each question is provided a context in the form of natural language or an image. The assistant provides the reasoning process in natural language and selects the answer among multiple choices. For training in (2), we organize the data as a single turn conversation
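
A minimal sketch of the two-stage trainable-parameter selection described above, assuming a hypothetical model object with vision_encoder, projector, and llm submodules (the attribute names are assumptions, not the released training code).

```python
# Sketch of which parameters receive gradients in each stage.
# Assumes a model with .vision_encoder (CLIP), .projector (W), and .llm (Vicuna);
# these attribute names are illustrative.

def set_trainable(model, stage: int):
    # The vision encoder stays frozen in both stages.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # The projection matrix W is trained in both stages.
    for p in model.projector.parameters():
        p.requires_grad = True
    # The LLM is only updated in stage 2 (end-to-end fine-tuning).
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
    return [p for p in model.parameters() if p.requires_grad]

# trainable_params = set_trainable(llava_model, stage=1)  # θ = {W}
# trainable_params = set_trainable(llava_model, stage=2)  # θ = {W, φ}
```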

Experiments

  • train all models with 8×A100s, following Vicuna’s hyperparameters [9].
  • pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128
  • fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32 (both stages are collected in the config summary below)
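
For reference, the stage-wise hyperparameters listed above, gathered into a small config dictionary; the values come from the paper, while the dictionary layout itself is just an illustration.

```python
# Stage-wise training hyperparameters reported in the paper.
TRAINING_CONFIG = {
    "stage1_pretrain": {
        "data": "CC-595K (filtered CC3M)",
        "epochs": 1,
        "learning_rate": 2e-3,
        "batch_size": 128,
    },
    "stage2_finetune": {
        "data": "LLaVA-Instruct-158K",
        "epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": 32,
    },
    "hardware": "8x A100",  # otherwise following Vicuna's hyperparameters [9]
}
```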

Multimodal Chatbot

  • Surprisingly, although LLaVA is trained with a small multimodal instruction-following dataset (∼80K unique images), it demonstrates quite similar reasoning results with multimodal GPT-4 on these examples
  • Note that while these images are out-of-domain for LLaVA, LLaVA is still able to understand the scenes and follow the question instruction to provide a reasonable response

Quantitative Evaluation

  • Specifically, we create triplets consisting of image, ground-truth textual descriptions, and question. The candidate models (e.g., LLaVA) predict the answers based on the question and the image. To provide an approximate theoretical upper bound, we create a reference prediction based on the question and the ground-truth textual descriptions, using the text-only GPT-4.
  • It evaluates the helpfulness, relevance, accuracy, and level of detail of the responses from the assistants, and gives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance (this judging setup is sketched below)
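
A minimal sketch of the text-only GPT-4 judging setup described above; the prompt wording and the hypothetical gpt4(...) helper are assumptions, not the paper's exact evaluation prompt.

```python
# Sketch of the GPT-4-based evaluation triplet: (image description, question, answers).
# `gpt4` stands in for an API call and is intentionally not defined here.

def build_judge_prompt(question, gt_description, answer_reference, answer_candidate):
    return (
        "You are a helpful and fair evaluator.\n"
        f"[Image description]\n{gt_description}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant 1 answer]\n{answer_reference}\n\n"
        f"[Assistant 2 answer]\n{answer_candidate}\n\n"
        "Rate the helpfulness, relevance, accuracy, and level of detail of each answer "
        "on a scale of 1 to 10, then briefly explain your rating."
    )

# Assistant 1 is the text-only GPT-4 reference (approximate upper bound),
# Assistant 2 is the candidate model (e.g. LLaVA):
# judge_output = gpt4(build_judge_prompt(q, gt_desc, gpt4_ref_answer, llava_answer))
```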

Limitations

  • When asked whether strawberry-flavored yogurt is in the fridge, the model sees strawberries and yogurt separately and still answers Yes; the authors point out this looks like treating the image as a bag of patches
  • observed an interesting failure of LLaVA, as it responds with yes when asked if strawberry-flavored yogurt is present, even though the fridge contains only yogurt and strawberries. This indicates that, at times, LLaVA perceives the image as a “bag of patches”, failing to grasp the complex semantics within the image.

ScienceQA

  • ScienceQA [34] contains 21k multimodal multiple choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills.
  • We compare against representative existing methods: the GPT-3.5 model (text-davinci-002) with and without chain-of-thought (CoT), LLaMA-Adapter [59], and multimodal chain-of-thought (MM-CoT) [61], which is the current SoTA method on this dataset
  • We consider two schemes to combine the outcomes from our model and GPT-4 (both are sketched below). (i) A GPT-4 complement: whenever GPT-4 fails to provide an answer, we use the prediction from our method. This scheme yields 90.97% accuracy, which is almost the same as applying our method alone. (ii) GPT-4 as the judge: whenever GPT-4 and LLaVA produce different answers, we prompt GPT-4 again, asking it to provide its own final answer based on the question and the two outcomes. The spirit is similar to CoT, but with external knowledge from the other model.
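
A minimal sketch of the two combination schemes described above; ask_gpt4_judge is a hypothetical helper standing in for the extra GPT-4 call and is not defined here.

```python
# Sketch of the two LLaVA + GPT-4 ensembling schemes on ScienceQA multiple choice.

def combine_complement(gpt4_answer, llava_answer):
    # (i) GPT-4 complement: fall back to LLaVA only when GPT-4 fails to answer.
    return llava_answer if gpt4_answer is None else gpt4_answer

def combine_judge(question, choices, gpt4_answer, llava_answer, ask_gpt4_judge):
    # (ii) GPT-4 as the judge: when the two models disagree, ask GPT-4 once more
    # to pick a final answer given the question and both candidate answers.
    if gpt4_answer == llava_answer:
        return llava_answer
    return ask_gpt4_judge(question, choices, gpt4_answer, llava_answer)
```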

Ablations


Conclusion

  • This paper demonstrated the effectiveness of visual instruction tuning.
  • We presented an automatic pipeline to create language-image instruction-following data, based on which we train LLaVA, a multimodal model to follow human intent to complete visual tasks
