Vela: Scalable Embeddings with Voice Large Language Models for Multimodal Retrieval