
[LLaVA OneVision] Easy Visual Task Transfer #43


Abstract

  • LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video scenarios
  • Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities

Introduction

  • The first LLaVA model [83] demonstrates impressive multimodal chat abilities, sometimes exhibiting behaviors similar to GPT-4V on previously unseen images and instructions for the first time
    • The first work to show GPT-4V-like chat capability
  • LLaVA-1.5 [81] significantly expands and improves the capabilities by incorporating more academic-related instruction data, achieving SoTA performance on dozens of benchmarks with a data-efficient recipe
    • Achieved strong performance by tuning on academic-related instruction data
  • LLaVA-NeXT [82] inherits this property, further pushing performance boundaries through three key techniques: AnyRes for handling high-resolution images, expanding high-quality instruction data, and utilizing the best open LLM available at the time.
    • Handles high-resolution images via AnyRes while achieving strong performance with high-quality data (NeXT could already cover video to some extent)
      • The Video blog [169] shows that the image-only-trained LLaVA-NeXT model is surprisingly strong on video tasks with zero-shot modality transfer, due to the design of AnyRes to digest any vision signals as a sequence of images.
        • It likely worked well because, thanks to AnyRes, any vision signal is digested as a sequence of images
      • There are four LLaVA-NeXT blog posts in total: Video, Stronger, Ablation, and Interleave
  • contributions:
    • Large multimodal models. We develop LLaVA-OneVision, a family of open large multimodal models (LMMs) that improves the performance boundaries of open LMMs in three important vision settings, including single-image, multi-image, and video scenarios
    • Emerging Capabilities with Task Transfer. Our design in modeling and data representations allows task transfer across different scenarios, suggesting a simple approach to yield new emerging capabilities. In particular, LLaVA-OneVision demonstrates strong video understanding through task transfer from images.
    • Open-source. We release the generated multimodal instruction data, the codebase, the model checkpoints, and a visual chat demo

Modeling

Image

Network Architecture

  • LLM: Qwen-2, Vision Encoder: SigLIP, Projector: 2-layer MLP
  • In the conditional probability formulation, the vision signal X_v is always conditioned on, which expresses that every answer is grounded in the visual features
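The equation screenshot is missing here; as a hedged reconstruction, the standard LLaVA-style autoregressive objective the note refers to looks like (notation mine, not copied verbatim from the paper):

$$
p(X_a \mid X_v, X_q) = \prod_{i=1}^{L} p\big(x_i \mid X_v, X_q, X_{a,<i}\big)
$$

where $X_v$ is the visual input, $X_q$ the instruction, and $X_a = (x_1, \dots, x_L)$ the answer tokens; conditioning on $X_v$ at every step is what "grounded in the visual features" means above.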
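A minimal sketch of how the three components are wired together; the module names and dimensions are placeholders (a toy linear "encoder" and a tiny transformer stand in for SigLIP and Qwen-2), so this shows the data flow only, not the released implementation:

```python
import torch
import torch.nn as nn

class TwoLayerProjector(nn.Module):
    """2-layer MLP that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

class ToyLMM(nn.Module):
    """Toy wiring: vision encoder -> projector -> concat with text embeddings -> LLM.
    Stand-in modules only; dimensions are arbitrary toy values."""
    def __init__(self, patch_dim=588, vision_dim=64, llm_dim=128, vocab=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(patch_dim, vision_dim)   # stand-in for SigLIP
        self.projector = TwoLayerProjector(vision_dim, llm_dim)
        self.text_embed = nn.Embedding(vocab, llm_dim)
        layer = nn.TransformerEncoderLayer(llm_dim, nhead=4, batch_first=True)
        self.llm = nn.TransformerEncoder(layer, num_layers=2)    # stand-in for Qwen-2

    def forward(self, patches: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
        v = self.projector(self.vision_encoder(patches))  # (B, N_vis, llm_dim)
        t = self.text_embed(text_ids)                     # (B, N_txt, llm_dim)
        # Visual tokens are always part of the context, so responses stay grounded in X_v.
        return self.llm(torch.cat([v, t], dim=1))

if __name__ == "__main__":
    model = ToyLMM()
    out = model(torch.randn(1, 8, 588), torch.randint(0, 1000, (1, 16)))
    print(out.shape)  # torch.Size([1, 24, 128])
```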

Visual Representations

  • It relates to two factors, the resolution in the raw pixel space and the number of tokens in the feature space
    • the visual input representation configuration (resolution, #token)
    • we observe that the scaling of resolution is more effective than that of token numbers, and recommend an AnyRes strategy with pooling (resolution scaling matters more than token count, hence pooling on top of AnyRes)
  • For AnyRes with a configuration of width a and height b crops, the image is divided into a×b crops, each at a resolution suitable for the vision encoder. Assuming there are T tokens per crop, the total number of visual tokens is L = (a×b + 1)×T
    • Each crop is encoded by the vision encoder; if each crop yields T tokens, you get a×b×T tokens from the crops plus T tokens from the resized overview of the whole image
  • We consider a threshold τ and reduce the number of tokens per crop, using bilinear interpolation if needed
    • If the total token count exceeds the threshold, the per-crop token count is reduced via bilinear interpolation
    • For example, with 30 tokens per crop, (a, b) = (3, 1), and a threshold of 100, the total would be L = (3×1+1)×30 = 120 > 100, so the per-crop token count is reduced to 100/(3×1+1) = 100/4 = 25, giving 25×4 = 100 tokens, i.e. capped at the threshold. Among the predefined (a, b) configurations, the one requiring the fewest crops is chosen (see the Python sketch after the encoding-strategy bullets below)
  • We illustrate the configuration in Figure 3, describe the details in Section C.1, and provide the high-level encoding strategies below
  • Single-image
    • consider a large maximum spatial configuration (a,b) for single-image representation to maintain the original image resolution without resizing
    • By representing an image with a long sequence that mimics video representation, we facilitate a smoother capability transfer from image to video understanding [169, 64]
  • Multi-image
    • Only the base image resolution is considered
    • eliminating the need for multi-crop of high resolution image and thus saving computational resources
  • Video
    • Each frame of the video is resized to the base image resolution and processed by the vision encoder to generate feature maps. Bilinear interpolation is employed to reduce the number of tokens, allowing the consideration of a larger number of frames by reducing tokens per frame
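A minimal Python sketch of the token budgeting described above. Function names, the covering heuristic in `pick_config`, and the video numbers are illustrative assumptions, not taken from the released codebase; the worked example from the notes is reproduced in the asserts.

```python
def tokens_for_config(a: int, b: int, tokens_per_crop: int) -> int:
    """Total visual tokens for AnyRes: a*b crops plus one resized overview image."""
    return (a * b + 1) * tokens_per_crop

def reduce_tokens_per_crop(a: int, b: int, tokens_per_crop: int, threshold: int) -> int:
    """If the total exceeds the threshold, shrink the per-crop token count
    (in the model this is done by bilinear interpolation over the feature map)."""
    total = tokens_for_config(a, b, tokens_per_crop)
    if total <= threshold:
        return tokens_per_crop
    return threshold // (a * b + 1)

def pick_config(image_w: int, image_h: int, crop_size: int, candidates):
    """Among predefined (a, b) grids, pick the one requiring the fewest crops.
    The 'must cover the image' filter is an assumption for this sketch."""
    covering = [(a, b) for a, b in candidates
                if a * crop_size >= image_w and b * crop_size >= image_h]
    pool = covering if covering else list(candidates)
    return min(pool, key=lambda ab: ab[0] * ab[1])

# Worked example from the notes: 30 tokens/crop, (a, b) = (3, 1), threshold 100.
assert tokens_for_config(3, 1, 30) == 120                      # exceeds the threshold
assert reduce_tokens_per_crop(3, 1, 30, threshold=100) == 25   # 25 * (3*1+1) = 100 tokens

# Video: each frame uses only the base resolution (a single "crop"), and bilinear
# interpolation shrinks tokens per frame so more frames fit into the context.
frames, tokens_per_frame_pooled = 32, 196  # illustrative numbers only
print("video tokens:", frames * tokens_per_frame_pooled)
```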

Data

  • quality over quantity

4.1 High-Quality Knowledge

  • The web-scale public image-text data is often of low quality, rendering the data scaling of multimodal pre-training less efficient
  • Instead, we recommend focusing on high-quality knowledge learning, given a limited compute budget. This approach acknowledges that the pre-trained LLMs and ViTs already possess a substantial knowledge base, and the goal is to refine and enhance this knowledge with carefully curated data.
  • three major categories for high-quality knowledge learning
    • Re-Captioned Detailed Description Data:
      • We used the model to generate new captions for the images from the following datasets: COCO118K, BLIP558K, and CC3M, and combined them to form the Re-Captioned Detailed Description Data, totaling 3.5M samples (a toy re-captioning loop is sketched after this list)
    • Document / OCR Data
      • Text Reading subset from the UReader dataset, totaling 100K, which is easily accessible through PDF rendering
      • We used this text reading data along with the SynDOG EN/CN, to form the Document / OCR Data, totaling 1.1M samples
    • Chinese and Language Data
      • used the original ShareGPT4V [20] images and utilized GPT-4V provided by the Azure API to generate 92K detailed Chinese caption data, aiming to improve the model’s capability in Chinese.
      • We collected 143K samples from the Evo-Instruct dataset
  • almost all (accounting for 99.8%) of the high-quality knowledge data is synthetic.
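As referenced in the re-captioning bullet above, here is a toy sketch of such a data-generation loop. The helper `caption_image` is hypothetical (plug in whichever captioning LMM you use), and the conversation layout follows the common LLaVA-style instruction format; this is an assumed illustration, not the paper's pipeline.

```python
import json
from pathlib import Path

def caption_image(image_path: str) -> str:
    """Hypothetical helper: call your captioning LMM with a
    'describe this image in detail' prompt and return the caption."""
    raise NotImplementedError("plug in your own LMM inference call here")

def recaption_dataset(image_dir: str, out_jsonl: str) -> None:
    """Re-caption a source image set (e.g. COCO118K / BLIP558K / CC3M) and store
    (image, detailed caption) pairs as instruction-style samples."""
    with open(out_jsonl, "w", encoding="utf-8") as f:
        for img in sorted(Path(image_dir).glob("*.jpg")):
            sample = {
                "image": str(img),
                # LLaVA-style conversation format (assumed here for illustration)
                "conversations": [
                    {"from": "human", "value": "<image>\nDescribe this image in detail."},
                    {"from": "gpt", "value": caption_image(str(img))},
                ],
            }
            f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```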

4.2 Visual Instruction Tuning Data

(figures omitted)

5 Training Strategies

(figure omitted)
