MarsInfer: A Precision-Aware Token Pruning Framework for Accelerating Long-Context LLM Inference

Zhao XU, Guo-Hao XU, Jingzhen DING, Houqiang LI

Front. Comput. Sci. DOI: 10.1007/s11704-026-60400-8
RESEARCH ARTICLE

Abstract

Excessive time-to-first-token (TTFT) latency during the prefilling stage remains the primary bottleneck in long-context LLM inference, as existing methods either neglect the dominant Feed-Forward Network (FFN) computations (which consume > 60% of inference time) or sacrifice prefilling efficiency for decoding-stage gains. To bridge this gap, we propose MarsInfer, a precision-aware token pruning framework that targets prefilling latency via two key mechanisms: (1) layer-adaptive token selection, which dynamically identifies non-critical tokens from attention score distributions before FFN execution, and (2) residual semantic preservation, which retains pruned tokens' information via skip connections without deferring computation. Unlike prior approaches, MarsInfer achieves hardware-agnostic acceleration by eliminating redundant FFN operations at their source. Extensive validation across diverse long-context benchmarks (LongBench-E, L-Eval, etc.) demonstrates that MarsInfer achieves a superior trade-off between efficiency and accuracy: on Qwen2-7B-Instruct with a 32K context, it delivers a 1.24× TTFT speedup with merely 1.30% average accuracy degradation. The gains scale robustly to larger models, yielding a 1.39× TTFT speedup on Qwen1.5-32B-Chat (< 1.5% accuracy drop). Our work establishes a new paradigm for low-latency long-context inference in which prefilling acceleration no longer compromises semantic fidelity.
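The two mechanisms named in the abstract can be pictured with a minimal PyTorch sketch. Everything below is an illustrative assumption rather than the paper's method: the function names, the fixed keep_ratio, and the "sum of received attention" importance heuristic merely stand in for MarsInfer's actual layer-adaptive selection policy, which the abstract does not specify.

import torch


def select_tokens(attn_weights: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Mechanism (1), approximated: rank tokens by the attention they
    receive (summed over heads and query positions) and keep the top
    fraction. Returns a boolean keep-mask over the sequence.

    attn_weights: [num_heads, seq_len, seq_len], post-softmax.
    """
    importance = attn_weights.sum(dim=(0, 1))            # [seq_len]
    k = max(1, int(keep_ratio * importance.numel()))
    keep = torch.zeros_like(importance, dtype=torch.bool)
    keep[importance.topk(k).indices] = True
    return keep


def ffn_with_token_pruning(hidden, attn_weights, ffn, keep_ratio=0.7):
    """Mechanism (2), approximated: run the FFN only on tokens judged
    critical; pruned tokens pass through unchanged on the residual (skip)
    path, so their representations remain available downstream and no
    computation is deferred to the decoding stage.

    hidden: [seq_len, d_model]; ffn: any [*, d_model] -> [*, d_model] module.
    """
    keep = select_tokens(attn_weights, keep_ratio)
    out = hidden.clone()                                 # skip connection for every token
    out[keep] = hidden[keep] + ffn(hidden[keep])         # FFN + residual for kept tokens only
    return out

In this reading, the FFN cost during prefilling scales with keep_ratio × seq_len rather than seq_len, which is where the TTFT savings would come from.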

Keywords

Large Language Models / Token Pruning / Inference Optimization / Long Context / FFN Acceleration / Time-To-First-Token (TTFT)

Cite this article

Zhao XU, Guo-Hao XU, Jingzhen DING, Houqiang LI. MarsInfer: A Precision-Aware Token Pruning Framework for Accelerating Long-Context LLM Inference. Front. Comput. Sci. DOI: 10.1007/s11704-026-60400-8


RIGHTS & PERMISSIONS

© Higher Education Press 2026
