BEFI: Balanced and Efficient Federated Inference of Large Language Models
Lulu Zhang , Qian Tao , Zimu Zhou , Yuanyuan Zhang , Yuxiang Wang , Yongxin Tong
Large language models (LLMs) have achieved remarkable performance across a wide range of tasks. This paper focuses on federated LLM inference, in which the context generated by each silo must be protected, leading to both inefficiency and imbalance in communication overhead. To address this issue, we propose BEFI, which enables Balanced and Efficient Federated LLM Inference through carefully designed mechanisms for information sharing among silos. For important tokens, we propose a fine-grained, fixed-size cache sharing mechanism that enables direct and balanced sharing of these tokens' KV cache. For less critical tokens, we propose a context-free intermediate-state sharing mechanism, which allows tokens of arbitrary length to be shared with constant communication overhead. Finally, we evaluate the effectiveness of BEFI across various LLMs and privacy mechanisms, demonstrating that BEFI reduces communication overhead by up to 96.0% while maintaining balanced communication.
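The fixed-size cache sharing idea described above can be illustrated with a minimal sketch. The snippet below is a hypothetical illustration, not the paper's actual algorithm: it assumes token importance is measured by an attention-derived score and keeps only a constant-size budget of KV-cache entries per silo, so every silo transmits the same amount of cache data regardless of its context length.

```python
import numpy as np

def select_shared_kv(keys, values, importance, budget):
    """Pick a fixed-size subset of the KV cache to share.

    Hypothetical sketch: rank tokens by an importance score
    (e.g. accumulated attention weight) and keep only the
    top-`budget` entries, yielding constant, balanced
    communication across silos.
    """
    # keys/values: (seq_len, head_dim); importance: (seq_len,)
    top = np.argsort(importance)[-budget:]  # indices of the most important tokens
    top = np.sort(top)                      # preserve original sequence order
    return keys[top], values[top], top

rng = np.random.default_rng(0)
seq_len, head_dim, budget = 16, 8, 4
k = rng.standard_normal((seq_len, head_dim))
v = rng.standard_normal((seq_len, head_dim))
scores = rng.random(seq_len)

shared_k, shared_v, idx = select_shared_kv(k, v, scores, budget)
print(shared_k.shape)  # (4, 8)
```

Whatever the scoring rule, the key property is that the shared payload size depends only on `budget`, not on `seq_len`, which is what makes the per-silo communication cost uniform.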
Large Language Model / Inference Optimization / Federated Learning
Higher Education Press 2026