A study and formal framework of the composability of LLM compression techniques
Gansen HU, Zhaoguo WANG
Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009616
Large language models (LLMs) show impressive capabilities across many NLP tasks, but their enormous size creates major deployment challenges. While single compression methods provide limited solutions, combining approaches such as pruning, quantization, knowledge distillation, and low-rank approximation might be essential for both higher compression rates and better model performance.
This paper studies the synergistic effects of combining multiple LLM compression techniques. Our findings reveal that strategic combinations, such as contextual pruning followed by quantization, can reduce model size by more than 90% while maintaining performance. We also observe that the order in which techniques are applied can affect the outcome, and that joint optimization of compression methods can outperform sequential combination. Although promising, existing combination approaches rely on manual design choices and lack a systematic framework for multi-technique compression. To address this, we prototype a formal framework for automated, multi-technique LLM compression that optimizes the combination sequence. Finally, we discuss remaining challenges and outline future research directions for more efficient large language models.
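As a rough illustration of the kind of sequential composition discussed above, the sketch below applies magnitude pruning and then post-training dynamic quantization to a PyTorch model. It is not the paper's framework: the function name, the 50% sparsity level, and the choice of pruning/quantization primitives are illustrative assumptions only.

```python
# Minimal sketch (illustrative, not the paper's method): compose two
# compression techniques sequentially -- magnitude pruning, then
# post-training dynamic quantization of the remaining dense weights.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def prune_then_quantize(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    # Step 1: unstructured L1 (magnitude) pruning of every Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    # Step 2: post-training dynamic quantization of the (now sparse) Linear layers.
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```

Reversing the order (quantizing first, then pruning) would force the pruning step to operate on already-quantized weights, which is one concrete reason the application order can change the final accuracy/size trade-off.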
artificial intelligence / natural language processing / model compression / model checking and program correctness