
[LLaVA 1.5] Improved Baselines with Visual Instruction Tuning #42


  • Haotian Liu¹, Chunyuan Li², Yuheng Li¹, Yong Jae Lee¹
    • ¹University of Wisconsin–Madison, ²Microsoft Research, Redmond

Abstract

  • We show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks.
  • Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node.

Introduction

  • we present the first systematic study to investigate the design choices of LMMs in a controlled setting
  • First, we unveil that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient, and we establish stronger and more feasible baselines built upon the LLaVA framework.
  • LLaVA uses one of the simplest architecture designs for LMMs and requires only training a simple fully-connected projection layer on merely 600K image-text pairs (i.e., it works well with much less data than Qwen-VL or InstructBLIP).
  • Our final model can finish training in ~1 day on a single 8-A100 machine and achieves state-of-the-art results.
    • A single 8-A100 node is not a particularly demanding hardware requirement.
    • Moreover, unlike Qwen-VL [3] that includes in-house data in training, LLaVA utilizes only publicly available data.
      • The paper repeatedly emphasizes that LLaVA is open source and trained only on publicly available data.
  • Next, we delve into an early exploration of other open problems of large multimodal models. Our findings include:
    • (1) Scaling to high-resolution image inputs. We show that LLaVA’s architecture is versatile in scaling to higher resolutions by simply dividing images into grids and maintains its data efficiency; with the increased resolution, it improves the model’s detailed perception capabilities and reduces hallucination.
    • (2) Compositional capabilities. We find that large multimodal models are capable of generalizing to compositional capabilities. For example, training on long-form language reasoning together with shorter visual reasoning can improve the model’s writing capability for multimodal questions.
    • (3) Data efficiency. We show that randomly downsampling LLaVA’s training data mixture by up to 75% does not significantly decrease the model’s performance, suggesting that the possibility of a more sophisticated dataset compression strategy can further improve LLaVA’s already efficient training pipeline.
    • (4) Data scaling. We provide empirical evidence that the scaling of data granularity in conjunction with the model’s capability is crucial for an improved capability without introducing artifacts like hallucination.
  • Our improved baseline, LLaVA-1.5, uses only public data, achieves state-of-the-art results on a broad range of 11 tasks, and is significantly more data-efficient than previous approaches.


Approach

Preliminaries

  • Existing methods such as InstructBLIP also used academic-task-oriented datasets like VQA-v2, but they have limitations.
  • as shown in Table 1a, it can overfit to VQA training sets with short-answers, even on requests that require detailed responses.

[Table 1 from the paper]

Response Format Prompting

  • We find that the inability [7] of approaches like InstructBLIP [14], which leverage instruction-following data that includes both natural responses and short answers, to balance between short- and long-form VQA is mainly due to the following reasons.
  • First, ambiguous prompts on the response format. For example, Q: {Question} A: {Answer}. Such prompts do not clearly indicate the desired output format, and can overfit an LLM behaviorally to short-form answers even for natural visual conversations.
  • Second, not finetuning the LLM. The first issue is worsened by InstructBLIP only finetuning the Qformer for instruction-tuning. It requires the Qformer’s visual output tokens to control the length of the LLM’s output to be either long-form or short-form, as in prefix tuning [33], but Qformer may lack the capability of properly doing so, due to its limited capacity compared with LLMs like LLaMA.
  • We propose to use a single response formatting prompt that clearly indicates the output format. It is appended at the end of VQA questions when prompting short answers: "Answer the question using a single word or phrase." (see the sketch after this list)
  • As shown in Table 2, by merely including VQAv2 [19] in training, LLaVA’s performance on MME significantly improves (1323.8 vs 809.6) and outperforms InstructBLIP by 111 points.
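The snippet below is a minimal sketch of how such a response formatting prompt could be appended to short-answer VQA questions when building the training mixture. Only the prompt string itself comes from the paper; the helper name `format_vqa_question` and the `short_answer` flag are assumptions for illustration.

```python
# Hypothetical helper illustrating the response formatting prompt described above.
# Only SHORT_ANSWER_PROMPT is quoted from the paper; the rest is an illustrative assumption.
SHORT_ANSWER_PROMPT = "Answer the question using a single word or phrase."

def format_vqa_question(question: str, short_answer: bool) -> str:
    """Append the formatting prompt only for short-answer (academic VQA) samples."""
    if short_answer:
        return f"{question}\n{SHORT_ANSWER_PROMPT}"
    return question  # natural visual conversations keep their original instruction

print(format_vqa_question("What color is the umbrella?", short_answer=True))
# What color is the umbrella?
# Answer the question using a single word or phrase.
```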

[Table 2 from the paper]

Additional scaling

  • We further scale up the input image resolution to 336^2 to allow the LLM to clearly “see” the details of images, by swapping the vision encoder to CLIP-ViT-L-336px (the highest resolution available for CLIP).
  • The results show the most significant improvement when scaling the LLM to 13B.
  • Due to the increased image input resolution to 336^2, the training of LLaVA-1.5 is ~2× as long as LLaVA: ~6 hours of pretraining and ~20 hours of visual instruction tuning, using 8× A100s.
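Besides the resolution and LLM scaling above, the abstract also mentions switching LLaVA's projection to an MLP as the vision-language connector. Below is a minimal PyTorch sketch of such a two-layer connector; the module name, the GELU choice, and the dimensions (1024-d CLIP-ViT-L features, 5120-d Vicuna-13B hidden size) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Sketch of a two-layer MLP connector between the vision encoder and the LLM."""
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 5120):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(patch_features)

# Example: 576 patch tokens from a 336x336 image with 14x14 ViT patches (24 x 24 = 576).
tokens = VisionLanguageProjector()(torch.randn(1, 576, 1024))
print(tokens.shape)  # torch.Size([1, 576, 5120])
```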

Scaling to Higher Resolutions

  • the image resolution of the existing open source CLIP vision encoders is limited to 336^2, preventing the support of higher resolution images by simply replacing the vision encoder as we did in Sec. 3.3. In this section, we present an early exploration of scaling the LMM to higher resolutions, while maintaining the data efficiency of LLaVA-1.5.
    • Because the open-source CLIP encoders only go up to 336^2, higher resolutions cannot be supported by simply swapping the encoder, so a different approach is needed.
  • to scale up the resolution, previous approaches mostly choose to perform positional embedding interpolation [3, 32] and adapt the ViT backbone to the new resolution during finetuning
    • Interpolating the positional embeddings and finetuning the ViT backbone is another option, but it requires additional training, so it is not used here.
  • Instead, as shown in Fig. 2, we overcome this by dividing the image into smaller image patches of the resolution that the vision encoder is originally trained for, and encoding them independently. After obtaining the feature maps of individual patches, we then combine them into a single large feature map of the target resolution and feed that into the LLM. To provide the LLM with the global context and to reduce the artifact of the split-encode-merge operation, we additionally concatenate the feature of a downsampled image to the merged features (see the sketch below). We call the resulting model LLaVA-1.5-HD.
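A minimal sketch of this split-encode-merge idea, under stated assumptions: `encode` below is a stand-in for the frozen CLIP vision encoder, the 224^2 crop size follows the HD setup in the appendix, and the tensor layout is one plausible choice rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

PATCH_RES = 224          # resolution the vision encoder was originally trained on (HD setup)
TOKENS_PER_SIDE = 16     # 224 / 14 patch tokens per side for a ViT-L/14 encoder
FEAT_DIM = 1024          # CLIP-ViT-L feature dimension

def encode(image: torch.Tensor) -> torch.Tensor:
    """Stand-in for the frozen CLIP encoder: (3, 224, 224) -> (16, 16, 1024)."""
    assert image.shape == (3, PATCH_RES, PATCH_RES)
    return torch.randn(TOKENS_PER_SIDE, TOKENS_PER_SIDE, FEAT_DIM)

def split_encode_merge(image: torch.Tensor) -> torch.Tensor:
    """Divide a padded high-resolution image into encoder-resolution crops, encode each
    crop independently, and merge the per-crop features into one large feature map."""
    _, h, w = image.shape
    rows, cols = h // PATCH_RES, w // PATCH_RES
    merged = torch.zeros(rows * TOKENS_PER_SIDE, cols * TOKENS_PER_SIDE, FEAT_DIM)
    for r in range(rows):
        for c in range(cols):
            crop = image[:, r * PATCH_RES:(r + 1) * PATCH_RES,
                            c * PATCH_RES:(c + 1) * PATCH_RES]
            merged[r * TOKENS_PER_SIDE:(r + 1) * TOKENS_PER_SIDE,
                   c * TOKENS_PER_SIDE:(c + 1) * TOKENS_PER_SIDE] = encode(crop)
    return merged

# Global context: the full image downsampled to the encoder resolution, encoded, and
# concatenated with the merged high-resolution features before feeding the LLM.
image = torch.randn(3, 448, 672)                          # a 2x3 grid of 224^2 crops
hi_res = split_encode_merge(image).reshape(-1, FEAT_DIM)  # (2*16) * (3*16) = 1536 tokens
low_res = F.interpolate(image[None], size=(PATCH_RES, PATCH_RES), mode="bilinear")[0]
visual_tokens = torch.cat([hi_res, encode(low_res).reshape(-1, FEAT_DIM)], dim=0)
print(visual_tokens.shape)  # torch.Size([1792, 1024]) = 1536 grid tokens + 256 global tokens
```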

A. Implementation Details

A.1. LLaVA-1.5-HD

A.1.1 Preprocessing

Overview

  • use CLIP-ViT-L-14 (224^2) as the base image encoder.
  • We first select and pad the input image to a target resolution that effectively captures its details, and split the image into 224^2 grids.
  • All 224^2 image patches are encoded by the CLIP image encoder separately and their features are merged back to a single large feature map.
  • then post-process the resulting feature map to a flattened list of features. We additionally concatenate the features of a fixed-resolution image to provide the model with a global context.

Target resolution selection

  • predefine a set of resolutions to support up to six grids (1x1, 1x2, 1x3, 1x4, 1x5, 1x6, 2x2, 2x3, and their transpose). This system allows for a maximum resolution of 672x448 (or 448x672)
    • Here, "up to six grids" does not mean there are six kinds of grid configurations; it means the image is split into at most six patches in total. For example, 1x1, 1x2, ..., 1x6 produce 1, 2, ..., 6 patches respectively, while 2x2 produces 4 and 2x3 produces 6. "Their transpose" refers to flipping a configuration such as 1x2 into 2x1 or 1x3 into 3x1, which leaves the total patch count unchanged: 1x6 and its transpose 6x1 both yield 6 patches.
  • Two criteria are enforced in the target resolution selection:
    • (1) Detail preservation: the selected resolution preserves as much detail from the original image as possible;
    • (2) Resource efficiency: the resolution should not be excessively large to avoid unnecessary consumption of pixels and memory (e.g., it should not select 448^2 for a 224^2 input image, i.e., never choose a target resolution larger than the input actually needs). A sketch of this selection logic follows the list.
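One way this two-criteria selection could be implemented over the grid configurations listed above is sketched below; the scoring heuristics (effective resolution for detail preservation, wasted padded area for resource efficiency) are assumptions, since the paper does not give an exact formula.

```python
# Hypothetical target-resolution selection over the predefined grid configurations.
BASE = 224  # per-grid-cell resolution of the CLIP-ViT-L-14 encoder used in LLaVA-1.5-HD

GRIDS = [(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (2, 2), (2, 3)]
GRIDS += [(c, r) for (r, c) in GRIDS if r != c]          # add transposes, e.g. 2x1, 3x2, 6x1

def select_target_resolution(img_w: int, img_h: int) -> tuple[int, int]:
    best, best_key = None, None
    for rows, cols in GRIDS:
        tw, th = cols * BASE, rows * BASE
        scale = min(tw / img_w, th / img_h)              # fit the image inside the candidate
        # (1) Detail preservation: effective resolution after resizing (capped at the original).
        effective = min(scale, 1.0) ** 2 * img_w * img_h
        # (2) Resource efficiency: penalize wasted (padded) area in the candidate resolution.
        wasted = tw * th - effective
        key = (effective, -wasted)                       # maximize detail, then minimize waste
        if best_key is None or key > best_key:
            best, best_key = (tw, th), key
    return best

print(select_target_resolution(640, 480))  # a landscape input -> (672, 448)
```

With this tie-breaking, a 224^2 input selects the 1x1 grid rather than a larger padded resolution, which matches the example in criterion (2).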

Postprocessing

  • perform three steps of postprocessing to ensure that the final features can be processed effectively and efficiently by the language model (a rough sketch follows this list)
    • (1) Padding removal. Features corresponding exclusively to the paddings are discarded. This reduces the number of visual tokens processed by the language model and improves efficiency.
      • Since each patch is encoded separately after splitting, padding is inevitably introduced; removing it before flattening is indeed more efficient.
    • (2) Row-end tokens. We append a special token to the end of each row of features to provide an explicit indication of the shape of the image. Unlike the original LLaVA and LLaVA-1.5, which use a fixed resolution, LLaVA-1.5-HD uses a variable resolution for the image features; such an indication allows the language model to capture the exact shape and size of the image for each sample.
      • Because LLaVA-1.5-HD supports various resolutions, a token marking where each row of image features ends gives the model an explicit indication of the image shape.
    • (3) Flattening. Finally, we flatten the image feature map and feed it into the language model along with the language token features.
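A rough sketch of these three postprocessing steps applied to a merged feature map, assuming the padding positions are available as a boolean mask and the row-end marker is a single learned embedding; the function and argument names are illustrative, not the paper's identifiers.

```python
import torch

def postprocess_features(feature_map: torch.Tensor,
                         valid_mask: torch.Tensor,
                         row_end_embed: torch.Tensor) -> torch.Tensor:
    """feature_map: (H, W, D) merged patch features; valid_mask: (H, W), True where the
    feature comes from real image content (False for padding); row_end_embed: (D,)."""
    rows = []
    for r in range(feature_map.shape[0]):
        kept = feature_map[r][valid_mask[r]]                 # (1) drop padding-only features
        if kept.shape[0] == 0:                               # row consists entirely of padding
            continue
        rows.append(torch.cat([kept, row_end_embed[None]]))  # (2) append a row-end token
    return torch.cat(rows, dim=0)                            # (3) flatten to a token sequence

# Example: a 32x48 feature map (448x672 target resolution) whose bottom 8 rows are padding.
feats = torch.randn(32, 48, 1024)
mask = torch.ones(32, 48, dtype=torch.bool)
mask[24:, :] = False
tokens = postprocess_features(feats, mask, torch.randn(1024))
print(tokens.shape)  # torch.Size([1176, 1024]) = 24 rows x (48 features + 1 row-end token)
```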

A.1.2 Training

  • Since we compute the visual features on the original 224^2 resolution that the vision encoder is trained on, we do not perform additional pretraining. We also do not perform additional high-resolution pretraining for the visual projectors, and perform visual instruction tuning directly on the higher-resolution images.
    • Neither the existing vision encoder nor the visual projector gets any extra pretraining; training jumps straight to the visual instruction tuning stage.

----- End of LLaVA-1.5-HD -----

A.3. Hyperparameters

[Hyperparameter tables from the paper]
