Interpretable Multimodal Emotion Recognition using Hybrid Fusion of Speech and Image Data