Variance Reduction based Experience Replay for Policy Optimization