Fine-grained Multi-Modal Self-Supervised Learning