Sign in

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture

By Xidong Wang and others
Expanding the long-context capabilities of Multi-modal Large Language Models~(MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In... Show more
September 4, 2024
=
0
Loading PDF…
Loading full text...
Similar articles
Loading recommendations...
=
0
x1
LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Click on play to start listening