Learning to Evaluate Performance of Multi-modal Semantic Localization