Abstract: Natural language tracking aims to localize the target object referred to by a language description using a sequence of bounding boxes in video frames. Compared with traditional visual single ...