Efficient Vision-Language Pre-training by Cluster Masking

By Zihao Wei and others at the University of Michigan
We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra...
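
To make the masking idea above concrete, here is a minimal NumPy sketch of one way to mask clusters of visually similar patches. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says that similarity is measured on raw pixel intensities, so the anchor-based clustering, the cosine similarity, and the names and defaults `patchify`, `cluster_mask`, `num_anchors`, and `threshold` are all hypothetical choices made for this example.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop any remainder so the image divides evenly into patches.
    image = image[:gh * patch_size, :gw * patch_size]
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def cluster_mask(image, patch_size=16, num_anchors=3, threshold=0.9, rng=None):
    """Return a boolean keep-mask over patches (True = patch stays visible).

    Assumed procedure (not taken from the paper): pick random anchor
    patches, then mask every patch whose raw-pixel cosine similarity to
    any anchor is at least `threshold`.
    """
    rng = np.random.default_rng() if rng is None else rng
    patches = patchify(np.asarray(image, dtype=np.float64), patch_size)

    # L2-normalize raw pixel vectors so a dot product is a cosine similarity.
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    unit = patches / np.maximum(norms, 1e-8)

    # Sample anchors without replacement; each anchor seeds one cluster.
    k = min(num_anchors, len(unit))
    anchors = rng.choice(len(unit), size=k, replace=False)

    sim = unit[anchors] @ unit.T            # shape: (k, num_patches)
    masked = (sim >= threshold).any(axis=0)
    masked[anchors] = True                  # anchors are always masked
    return ~masked


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((64, 64, 3))         # stand-in for a real image
    keep = cluster_mask(image, patch_size=16, rng=rng)
    print(keep.shape, int(keep.sum()), "patches kept")
```

In a training loop, only the patches where the keep-mask is `True` would be passed to the image encoder, which is where a speed-up from masking would come from. The mask is resampled on every call, matching the per-iteration random masking described above.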
May 14, 2024