Hierarchical Instruction-aware Embodied Visual Tracking
Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong
User-centric embodied visual tracking (UC-EVT) requires embodied agents to follow dynamic, natural language instructions that specify not only which target to track but also how to track it, including distance, angle, and directional constraints. This dual requirement for robust language understanding and low-latency control poses significant challenges: current approaches based on end-to-end RL, VLM/VLA models, or LLMs fail to balance instruction comprehension with real-time tracking. In this paper, we introduce Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT), which decomposes the problem into two levels: on-demand instruction understanding with spatial goal generation (high level) and asynchronous, continuous goal-conditioned control (low level). HIEVT employs an LLM-based Semantic-Spatial Goal Aligner to parse diverse human instructions into spatial goals that directly specify the desired target positioning, coupled with an RL-based Adaptive Goal-Aligned Policy that positions the target in real time according to the generated spatial goals. We establish a comprehensive UC-EVT benchmark using over 1.7 million training trajectories and evaluate performance in one seen environment and nine challenging unseen environments. Extensive experiments and real-world deployments demonstrate HIEVT’s superior robustness, generalizability, and long-horizon tracking capabilities across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at [1].
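The two-level decomposition described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the `SpatialGoal` fields, the keyword-based `parse_instruction` (a stand-in for the LLM-based Semantic-Spatial Goal Aligner), and the proportional `control_step` (a stand-in for the RL-based Adaptive Goal-Aligned Policy). The loop is also synchronous for simplicity, whereas the abstract describes the low-level control as asynchronous.

```python
from dataclasses import dataclass


@dataclass
class SpatialGoal:
    """Desired target placement relative to the agent (hypothetical format)."""
    distance_m: float  # desired distance to the target
    angle_deg: float   # desired bearing of the target, 0 = straight ahead


def parse_instruction(instruction: str, current: SpatialGoal) -> SpatialGoal:
    """Keyword stand-in for an LLM-based goal aligner (illustration only)."""
    text = instruction.lower()
    distance, angle = current.distance_m, current.angle_deg
    if "closer" in text:
        distance = max(1.0, distance - 1.0)
    if "farther" in text or "further" in text:
        distance += 1.0
    if "left" in text:
        angle = -30.0
    if "right" in text:
        angle = 30.0
    return SpatialGoal(distance, angle)


def control_step(target_distance_m, target_angle_deg, goal, gain=0.5):
    """Proportional stand-in for a learned policy: returns (forward, turn)."""
    forward = gain * (target_distance_m - goal.distance_m)
    turn = gain * (target_angle_deg - goal.angle_deg)
    return forward, turn


def run_episode(observations, instructions, initial_goal):
    """Run the low level every step; call the high level only on demand.

    observations: list of (target_distance_m, target_angle_deg) per step.
    instructions: dict mapping a step index to an instruction string.
    """
    goal = initial_goal
    actions = []
    for step, (dist, ang) in enumerate(observations):
        if step in instructions:  # high level runs only when a new instruction arrives
            goal = parse_instruction(instructions[step], goal)
        actions.append(control_step(dist, ang, goal))  # low level runs every step
    return goal, actions
```

In this sketch, the expensive language step is invoked only when an instruction arrives, while the controller acts on the most recent spatial goal at every step. This is one plausible reading of how such a split keeps control latency independent of language-model latency.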
Applications to robotics / Autonomy / Planning
Higher Education Press 2026