ViLA: Efficient Video-Language Alignment for Video Question Answering

By Xijun Wang and others at the University of Maryland
In this work, we propose an efficient Video-Language Alignment (ViLA) network. Our ViLA model addresses both efficient frame sampling and effective cross-modal alignment in a unified way. In our ViLA network, we design a new learnable text-guided Frame-Prompter together with a new cross-modal distillation (QFormer-Distiller) module. Pre-trained large image-language models...
April 29, 2024
