BEFI: Balanced and Efficient Federated Inference of Large Language Models

Lulu Zhang, Qian Tao, Zimu Zhou, Yuanyuan Zhang, Yuxiang Wang, Yongxin Tong

Front. Comput. Sci. DOI: 10.1007/s11704-026-60010-4
RESEARCH ARTICLE

Abstract

Large language models (LLMs) have demonstrated strong performance across a wide range of tasks. This paper focuses on federated LLM inference, where the context generated by each silo must be protected, leading to both inefficient and imbalanced communication overhead. To address this issue, we propose BEFI, which enables Balanced and Efficient Federated LLM Inference through carefully designed mechanisms for information sharing among silos. For important tokens, we propose a fine-grained, fixed-size cache sharing mechanism that enables direct and balanced sharing of their KV cache. For less critical tokens, we propose a context-free intermediate-state sharing mechanism that shares tokens of arbitrary length with constant communication overhead. Finally, we evaluate BEFI across various LLMs and privacy mechanisms, demonstrating that it reduces communication overhead by up to 96.0% while keeping communication balanced.
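The fixed-size sharing idea can be illustrated with a minimal sketch: each silo ranks its tokens by an importance score, then transmits the KV-cache entries of only the top-scoring tokens, so the payload size is fixed by a budget rather than by the silo's context length. This is purely illustrative; the function names, the use of a generic per-token score, and the toy data below are assumptions for exposition, not BEFI's actual selection criterion or wire format.

```python
def select_important_tokens(scores, budget):
    # Rank token indices by importance score (descending) and keep the
    # `budget` highest-scoring ones, returned in original token order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

def pack_shared_cache(kv_cache, scores, budget):
    # Build the payload a silo would transmit: the selected indices plus
    # their KV entries. Payload size depends only on `budget`, not on the
    # context length, so communication stays balanced across silos.
    idx = select_important_tokens(scores, budget)
    return idx, [kv_cache[i] for i in idx]

# Toy example: 5 tokens, 2-dimensional KV entries, budget of 2.
scores = [0.1, 0.9, 0.3, 0.8, 0.2]
kv = [[float(i)] * 2 for i in range(5)]
idx, payload = pack_shared_cache(kv, scores, budget=2)
print(idx, len(payload))  # → [1, 3] 2
```

Because the budget caps the payload, a silo holding a long context transmits no more than one holding a short context, which is the balance property the abstract highlights.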

Keywords

Large Language Model / Inference Optimization / Federated Learning

Cite this article

Download citation ▾
Lulu Zhang, Qian Tao, Zimu Zhou, Yuanyuan Zhang, Yuxiang Wang, Yongxin Tong. BEFI: Balanced and Efficient Federated Inference of Large Language Models. Front. Comput. Sci. DOI:10.1007/s11704-026-60010-4



RIGHTS & PERMISSIONS

Higher Education Press 2026
