Hierarchical Instruction-aware Embodied Visual Tracking

Kui Wu; Hao Chen; Churan Wang; Fakhri Karray; Zhoujun Li; Yizhou Wang; Fangwei Zhong

doi:10.1007/s11704-026-60179-8

Front. Comput. Sci., DOI: 10.1007/s11704-026-60179-8

RESEARCH ARTICLE

Abstract

User-centric embodied visual tracking (UC-EVT) requires embodied agents to follow dynamic natural-language instructions that specify not only which target to track but also how to track it, including distance, angle, and directional constraints. This dual requirement for robust language understanding and low-latency control poses significant challenges: existing end-to-end RL, VLM/VLA, and LLM-based approaches fail to adequately balance instruction comprehension with low-latency tracking. In this paper, we introduce Hierarchical Instruction-aware Embodied Visual Tracking (HIEVT), which decomposes the problem into two levels: on-demand instruction understanding with spatial goal generation (high-level) and asynchronous, continuous goal-conditioned control execution (low-level). HIEVT employs an LLM-based Semantic-Spatial Goal Aligner to parse diverse human instructions into spatial goals that directly specify the desired target positioning, coupled with an RL-based Adaptive Goal-Aligned Policy that positions the target in real time according to the generated spatial goals. We establish a comprehensive UC-EVT benchmark using over 1.7 million training trajectories and evaluate performance in one seen environment and nine challenging unseen environments. Extensive experiments and real-world deployments demonstrate HIEVT’s superior robustness, generalizability, and long-horizon tracking capabilities across diverse environments, varying target dynamics, and complex instruction combinations. The complete project is available at [1].
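
To make the two-level decomposition concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical `SpatialGoal` format (desired distance and bearing), replaces the LLM-based Semantic-Spatial Goal Aligner with simple keyword rules, and replaces the RL-based Adaptive Goal-Aligned Policy with a proportional controller. All names, gains, the sign convention for angles, and the toy dynamics are assumptions made for illustration. The asynchronous execution described above is collapsed into a single loop in which the goal changes only at the step where an instruction arrives.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass(frozen=True)
class SpatialGoal:
    """Target placement relative to the tracker (hypothetical format)."""
    distance_m: float  # range to the target, in metres
    angle_deg: float   # bearing of the target; 0 = straight ahead, negative = left


def align_goal(instruction: str, current: SpatialGoal) -> SpatialGoal:
    """High level. Keyword rules standing in for an LLM-based aligner."""
    text = instruction.lower()
    distance, angle = current.distance_m, current.angle_deg
    match = re.search(r"(\d+(?:\.\d+)?)\s*(?:m|meters?|metres?)\b", text)
    if match:
        distance = float(match.group(1))
    elif "closer" in text:
        distance = max(0.5, distance - 1.0)
    elif "farther" in text or "further" in text:
        distance += 1.0
    if "left" in text:
        angle = -30.0
    elif "right" in text:
        angle = 30.0
    elif "center" in text or "centre" in text:
        angle = 0.0
    return SpatialGoal(distance, angle)


def policy(observed: SpatialGoal, goal: SpatialGoal) -> tuple[float, float]:
    """Low level. Proportional controller standing in for a learned policy.

    Returns (forward_speed, turn_rate); both shrink the goal error.
    """
    forward = 0.8 * (observed.distance_m - goal.distance_m)
    turn = 0.8 * (observed.angle_deg - goal.angle_deg)
    return forward, turn


def run(instructions: dict[int, str], steps: int = 200, dt: float = 0.1) -> SpatialGoal:
    """Toy rollout: the goal is updated on demand, control runs every step."""
    goal = SpatialGoal(3.0, 0.0)       # default goal before any instruction
    observed = SpatialGoal(6.0, 20.0)  # toy initial placement of the target
    for t in range(steps):
        if t in instructions:          # high level runs only when asked
            goal = align_goal(instructions[t], goal)
        forward, turn = policy(observed, goal)  # low level runs every step
        # Toy dynamics: moving forward closes distance, turning shifts bearing.
        observed = SpatialGoal(
            observed.distance_m - forward * dt,
            observed.angle_deg - turn * dt,
        )
    return observed
```

For example, `run({0: "keep 2 meters away", 100: "put the target on the left"})` should settle close to a distance of 2 m and a bearing of -30 degrees under these toy dynamics. The point of the split is that the slow language step touches only the goal, so the control loop never waits on it.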

Keywords

Applications to robotics / Autonomy / Planning

Cite this article

Kui Wu, Hao Chen, Churan Wang, Fakhri Karray, Zhoujun Li, Yizhou Wang, Fangwei Zhong. Hierarchical Instruction-aware Embodied Visual Tracking. Front. Comput. Sci. DOI:10.1007/s11704-026-60179-8

RIGHTS & PERMISSIONS

Higher Education Press 2026
