MarsInfer: A Precision-Aware Token Pruning Framework for Accelerating Long-Context LLM Inference

Zhao XU, Guo-Hao XU, Jingzhen DING, Houqiang LI

Front. Comput. Sci. DOI: 10.1007/s11704-026-60400-8
RESEARCH ARTICLE

Abstract

Excessive time-to-first-token (TTFT) latency during the prefilling stage remains the primary bottleneck in long-context LLM inference, as existing methods either neglect the dominant Feed-Forward Network (FFN) computations (which consume > 60% of inference time) or sacrifice prefilling efficiency for decoding-stage gains. To bridge this gap, we propose MarsInfer, a precision-aware token pruning framework that targets prefilling latency via two key mechanisms: (1) layer-adaptive token selection, which dynamically identifies non-critical tokens from attention score distributions before FFN execution, and (2) residual semantic preservation, which retains pruned tokens' information via skip connections without deferring computation. Unlike prior approaches, MarsInfer achieves hardware-agnostic acceleration by eliminating redundant FFN operations at their source. Extensive validation across diverse long-context benchmarks (LongBench-E, L-Eval, etc.) demonstrates that MarsInfer achieves a superior trade-off between efficiency and accuracy: on Qwen2-7B-Instruct with a 32K context, it delivers a 1.24× TTFT speedup with merely 1.30% average accuracy degradation. The gains scale robustly to larger models, yielding a 1.39× TTFT speedup on Qwen1.5-32B-Chat (< 1.5% accuracy drop). Our work establishes a new paradigm for low-latency long-context inference in which prefilling acceleration no longer compromises semantic fidelity.
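The two mechanisms named in the abstract can be pictured with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the paper's method: the function names, the fixed keep_ratio, and the "sum of received attention" importance heuristic merely stand in for MarsInfer's actual layer-adaptive selection policy, which the abstract does not specify.

import torch


def select_tokens(attn_weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Mechanism (1), approximated: rank tokens by the attention they
    receive (summed over heads and query positions) and keep the top
    fraction. Returns a boolean keep-mask over the sequence.

    attn_weights: [num_heads, seq_len, seq_len], post-softmax.
    """
    importance = attn_weights.sum(dim=(0, 1))            # [seq_len]
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.zeros_like(importance, dtype=torch.bool)
    keep[importance.topk(k).indices] = True
    return keep


def ffn_with_token_pruning(hidden, attn_weights, ffn, keep_ratio=0.7):
    """Mechanism (2), approximated: run the FFN only on tokens judged
    critical; pruned tokens pass through unchanged on the residual (skip)
    path, so their representations remain available downstream and no
    computation is deferred to the decoding stage.

    hidden: [seq_len, d_model]; ffn: any [*, d_model] -> [*, d_model] module.
    """
    keep = select_tokens(attn_weights, keep_ratio)
    out = hidden.clone()                                 # skip connection for every token
    out[keep] = hidden[keep] + ffn(hidden[keep])         # FFN + residual for kept tokens only
    return out

In this reading, the FFN cost during prefilling scales with keep_ratio × seq_len rather than seq_len, which is where the TTFT savings would come from.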

Keywords

Large Language Models / Token Pruning / Inference Optimization / Long Context / FFN Acceleration / Time-To-First-Token (TTFT)

Cite this article

Zhao XU, Guo-Hao XU, Jingzhen DING, Houqiang LI. MarsInfer: A Precision-Aware Token Pruning Framework for Accelerating Long-Context LLM Inference. Front. Comput. Sci. DOI: 10.1007/s11704-026-60400-8


RIGHTS & PERMISSIONS

© Higher Education Press 2026
