Recognizing actions from a limited set of labeled videos remains a challenge as annotating visual data is not only tedious but also can be expensive due to classified nature. Moreover, handling spatio-temporal data using deep 3D transformers for this can introduce significant computational complexity. In this paper, our objective is... Show more