Audio-Visual Transformer Based Crowd Counting