In this paper, we study Tracking by Language, the task of localizing a target box sequence in a video based on a language query. We propose a framework called GTI that decomposes the problem into three sub-tasks: Grounding, Tracking, and Integration. The three sub-task modules operate simultaneously and predict the box sequence...