Masked Modeling Duo: Towards a Universal Audio Pre-training Framework