Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning