Multi-modal Representation Learning for Video Advertisement Content Structuring