[LLaVA] Visual Instruction Tuning #39

Author

  • Haotian Liu∗, Chunyuan Li∗, Qingyang Wu, Yong Jae Lee (∗ equal contribution)

Abstract

  • We present the first attempt to use language-only GPT-4 to generate multimodal language-image
    instruction-following data.
  • By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding.

Introduction

  • In this paper, we present visual instruction-tuning
  • Multimodal instruction-following data
    • present a data reformation perspective and pipeline to convert image-text pairs into an appropriate instruction-following format, using ChatGPT/GPT-4.
  • Large multimodal models
    • develop a large multimodal model (LMM), by connecting the open-set visual encoder of CLIP [40] with the language decoder Vicuna [9], and fine-tuning end-to-end on our generated instructional vision-language data.
  • Multimodal instruction-following benchmark
    • present LLaVA-Bench with two challenging benchmarks, with a diverse selection of paired images, instructions and detailed annotations.
  • Open-source
    • release the following assets to the public: the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo

GPT-assisted Visual Instruction Data Generation

  • public multimodal data such as image-text pairs ranges from CC [8] to LAION [45]
  • However, when it comes to multimodal instruction-following data, the available amount is limited
    • propose to leverage ChatGPT/GPT-4 for multimodal instruction-following data collection, based on the widely existing image-pair data.
  • Given an image and its caption, the data is constructed by generating questions grounded in the caption (or even generating full QA pairs directly)
  • we use two types of symbolic representations (not only captions but also bounding boxes; see the prompt sketch after this list):
    • (i) Captions typically describe the visual scene from various perspectives;
    • (ii) Bounding boxes usually localize the objects in the scene, and each box encodes the object concept
      and its spatial location.
  • We use COCO images [31] and generate three types of instruction-following data.
    • Conversation: multi-turn QA about the visual content of the image, including the object types, counting the objects, object actions, object locations, and relative positions between objects.
    • Detailed description: randomly sample one question (from a list created by the authors) and ask GPT-4 to generate a detailed description of the image
    • Complex reasoning: The answers typically require a step-by-step reasoning process by following rigorous logic
  • collect 158K unique language-image instruction-following samples in total, including 58K in conversations, 23K in detailed description, and 77K in complex reasoning, respectively.
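
A minimal sketch of how the caption + bounding-box symbolic representation could be serialized into a text prompt for a language-only GPT-4. The helper names, prompt wording, and toy annotations are illustrative assumptions, not the paper's actual prompts.

```python
# Sketch: serialize COCO-style annotations into the symbolic text context
# that a text-only LLM sees when generating instruction-following data.
# The task instructions and helper names below are illustrative only.

def build_symbolic_context(captions, boxes):
    """captions: list of str; boxes: list of (category, x1, y1, x2, y2), normalized coords."""
    cap_lines = "\n".join(captions)
    box_lines = "\n".join(
        f"{cat}: [{x1:.3f}, {y1:.3f}, {x2:.3f}, {y2:.3f}]"
        for cat, x1, y1, x2, y2 in boxes
    )
    return f"Captions:\n{cap_lines}\n\nObjects (bounding boxes):\n{box_lines}"

TASK_INSTRUCTIONS = {
    "conversation": "Design a dialog between a user asking about the image "
                    "and an assistant answering as if it can see the image.",
    "detailed_description": "Describe the image in detail.",
    "complex_reasoning": "Create a question that requires step-by-step "
                         "reasoning about the image, then answer it.",
}

def build_generation_prompt(captions, boxes, task="conversation"):
    # The returned text would be sent to a text-only ChatGPT/GPT-4.
    return f"{TASK_INSTRUCTIONS[task]}\n\n{build_symbolic_context(captions, boxes)}"

# Toy usage:
prompt = build_generation_prompt(
    captions=["A man rides a motorcycle down a dirt road."],
    boxes=[("person", 0.30, 0.20, 0.55, 0.80), ("motorcycle", 0.25, 0.40, 0.70, 0.95)],
    task="complex_reasoning",
)
print(prompt)
```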

Visual Instruction Tuning

Architecture

  • The image is first encoded by the vision encoder into features Z, which are then projected into the language embedding space H (see the projection sketch below)
    • For Z, the grid features before and after the last Transformer layer are considered
    • CLIP ViT-L/14 is used (the paper notes that other vision encoders could be used as well)
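
A minimal PyTorch sketch of the projection described above: grid features Z_v from the frozen CLIP encoder are mapped by a single trainable linear layer W into the LLM word-embedding space, giving the visual tokens H_v. The class name and dimensions are illustrative assumptions (e.g. 1024 for ViT-L/14 grid features, 4096 for a 7B-scale LLM hidden size).

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Sketch of the projection H_v = W * Z_v (a single trainable linear map)."""

    def __init__(self, d_vision: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.W = nn.Linear(d_vision, d_llm, bias=False)  # trainable projection matrix W

    def forward(self, Z_v: torch.Tensor) -> torch.Tensor:
        # Z_v: (batch, num_patches, d_vision) grid features from the frozen CLIP encoder
        return self.W(Z_v)  # H_v: (batch, num_patches, d_llm) visual "tokens" for the LLM

# Toy usage: 256 grid features per image, projected into the LLM embedding space.
Z_v = torch.randn(2, 256, 1024)
H_v = VisualProjector()(Z_v)
print(H_v.shape)  # torch.Size([2, 256, 4096])
```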

Training

  • Training is done on multi-turn conversation data
  • In the first turn, the order of the image and the question is randomized: the question may come before the image, or the image may come first followed by the question
  • Stage 1: Pre-training for Feature Alignment (only the adapter, i.e. the projection matrix W, is trained)
    • To strike a balance between concept coverage and training efficiency, we filter CC3M to 595K image-text pairs.
    • In training, we keep both the visual encoder and LLM weights frozen, and maximize the likelihood of (3) with trainable parameters θ = W (the projection matrix) only. In this way, the image features H_v can be aligned with the pre-trained LLM word embedding. This stage can be understood as training a compatible visual tokenizer for the frozen LLM.
  • Stage 2: Fine-tuning End-to-End (only the LLM and the adapter are trained; the visual encoder stays frozen; see the freezing sketch after this list)
    • We always keep the visual encoder weights frozen, and continue to update both the pre-trained weights of the projection layer and LLM in LLaVA; i.e., the trainable parameters are θ = {W, φ} in (3)
    • Two use case scenarios
      • Multimodal Chatbot: develop a Chatbot by fine-tuning on the 158K language-image instruction-following data in Section 3
      • Science QA: Each question is provided a context in the form of natural language or an image. The assistant provides the reasoning process in natural language and selects the answer among multiple choices. For training in (2), we organize the data as a single turn conversation
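
A minimal sketch of the two-stage trainable-parameter selection described above, assuming a hypothetical model object with vision_encoder, projector, and llm submodules (the attribute names are assumptions, not the released training code).

```python
# Sketch of which parameters receive gradients in each stage.
# Assumes a model with .vision_encoder (CLIP), .projector (W), and .llm (Vicuna);
# these attribute names are illustrative.

def set_trainable(model, stage: int):
    # The vision encoder stays frozen in both stages.
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    # The projection matrix W is trained in both stages.
    for p in model.projector.parameters():
        p.requires_grad = True
    # The LLM is only updated in stage 2 (end-to-end fine-tuning).
    for p in model.llm.parameters():
        p.requires_grad = (stage == 2)
    return [p for p in model.parameters() if p.requires_grad]

# trainable_params = set_trainable(llava_model, stage=1)  # θ = {W}
# trainable_params = set_trainable(llava_model, stage=2)  # θ = {W, φ}
```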

Experiments

  • train all models with 8×A100s, following Vicuna’s hyperparameters [9].
  • pre-train our model on the filtered CC-595K subset for 1 epoch with a learning rate of 2e-3 and a batch size of 128
  • fine-tune on the proposed LLaVA-Instruct-158K dataset for 3 epochs, with a learning rate of 2e-5 and a batch size of 32 (both stages are collected in the config summary below)
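
For reference, the stage-wise hyperparameters listed above, gathered into a small config dictionary; the values come from the paper, while the dictionary layout itself is just an illustration.

```python
# Stage-wise training hyperparameters reported in the paper.
TRAINING_CONFIG = {
    "stage1_pretrain": {
        "data": "CC-595K (filtered CC3M)",
        "epochs": 1,
        "learning_rate": 2e-3,
        "batch_size": 128,
    },
    "stage2_finetune": {
        "data": "LLaVA-Instruct-158K",
        "epochs": 3,
        "learning_rate": 2e-5,
        "batch_size": 32,
    },
    "hardware": "8x A100",  # otherwise following Vicuna's hyperparameters [9]
}
```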

Multimodal Chatbot

  • Surprisingly, although LLaVA is trained with a small multimodal instruction-following dataset (∼80K unique images), it demonstrates quite similar reasoning results with multimodal GPT-4 on these examples
  • Note that while these images are out-of-domain for LLaVA, LLaVA is still able to understand the scenes and follow the question instruction to provide a reasonable response

Quantitative Evaluation

  • Specifically, we create triplets consisting of image, ground-truth textual descriptions, and question. The candidate models (e.g., LLaVA) predict the answers based on the question and the image. To provide an approximate theoretical upper bound, we create a reference prediction based on the question and the ground-truth textual descriptions, using the text-only GPT-4.
  • It evaluates the helpfulness, relevance, accuracy, and level of detail of the responses from the assistants, and gives an overall score on a scale of 1 to 10, where a higher score indicates better overall performance (this judging setup is sketched below)
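
A minimal sketch of the text-only GPT-4 judging setup described above; the prompt wording and the hypothetical gpt4(...) helper are assumptions, not the paper's exact evaluation prompt.

```python
# Sketch of the GPT-4-based evaluation triplet: (image description, question, answers).
# `gpt4` stands in for an API call and is intentionally not defined here.

def build_judge_prompt(question, gt_description, answer_reference, answer_candidate):
    return (
        "You are a helpful and fair evaluator.\n"
        f"[Image description]\n{gt_description}\n\n"
        f"[Question]\n{question}\n\n"
        f"[Assistant 1 answer]\n{answer_reference}\n\n"
        f"[Assistant 2 answer]\n{answer_candidate}\n\n"
        "Rate the helpfulness, relevance, accuracy, and level of detail of each answer "
        "on a scale of 1 to 10, then briefly explain your rating."
    )

# Assistant 1 is the text-only GPT-4 reference (approximate upper bound),
# Assistant 2 is the candidate model (e.g. LLaVA):
# judge_output = gpt4(build_judge_prompt(q, gt_desc, gpt4_ref_answer, llava_answer))
```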

Limitations

  • When asked whether strawberry-flavored yogurt is in the fridge, the model sees strawberries and yogurt separately and still answers Yes; the authors point out this looks like treating the image as a bag of patches
  • observed an interesting failure of LLaVA, as it responds with yes when asked if strawberry-flavored yogurt is present, even though the fridge contains only yogurt and strawberries. This indicates that, at times, LLaVA perceives the image as a “bag of patches”, failing to grasp the complex semantics within the image.

ScienceQA

  • ScienceQA [34] contains 21k multimodal multiple choice questions with rich domain diversity across 3 subjects, 26 topics, 127 categories, and 379 skills.
  • We compare against representative existing methods: the GPT-3.5 model (text-davinci-002) with and without chain-of-thought (CoT), LLaMA-Adapter [59], and multimodal chain-of-thought (MM-CoT) [61], which is the current SoTA method on this dataset
  • We consider two schemes to combine the outcomes from our model and GPT-4 (both are sketched below). (i) A GPT-4 complement: whenever GPT-4 fails to provide an answer, we use the prediction from our method. This scheme yields 90.97% accuracy, which is almost the same as applying our method alone. (ii) GPT-4 as the judge: whenever GPT-4 and LLaVA produce different answers, we prompt GPT-4 again, asking it to provide its own final answer based on the question and the two outcomes. The spirit is similar to CoT, but with external knowledge from the other model.
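
A minimal sketch of the two combination schemes described above; ask_gpt4_judge is a hypothetical helper standing in for the extra GPT-4 call and is not defined here.

```python
# Sketch of the two LLaVA + GPT-4 ensembling schemes on ScienceQA multiple choice.

def combine_complement(gpt4_answer, llava_answer):
    # (i) GPT-4 complement: fall back to LLaVA only when GPT-4 fails to answer.
    return llava_answer if gpt4_answer is None else gpt4_answer

def combine_judge(question, choices, gpt4_answer, llava_answer, ask_gpt4_judge):
    # (ii) GPT-4 as the judge: when the two models disagree, ask GPT-4 once more
    # to pick a final answer given the question and both candidate answers.
    if gpt4_answer == llava_answer:
        return llava_answer
    return ask_gpt4_judge(question, choices, gpt4_answer, llava_answer)
```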

Ablations


Conclusion

  • This paper demonstrated the effectiveness of visual instruction tuning.
  • We presented an automatic pipeline to create language-image instruction-following data, based on which we train LLaVA, a multimodal model to follow human intent to complete visual tasks
