A study and formal framework of the composability of LLM compression techniques
Gansen HU, Zhaoguo WANG
Front. Comput. Sci., 2026, Vol. 20, Issue 9: 2009616
Large language models (LLMs) show impressive capabilities across many NLP tasks, but their enormous size creates major deployment challenges. While single compression methods provide limited solutions, combining approaches such as pruning, quantization, knowledge distillation, and low-rank approximation might be essential for both higher compression rates and better model performance.
This paper studies the synergistic effects of combining multiple LLM compression techniques. Our findings reveal that strategic combinations, such as contextual pruning followed by quantization, can reduce model size by more than 90% while maintaining performance. We also observe that the order in which techniques are applied can affect the outcome, and that joint optimization of compression methods can outperform sequential combination. Although promising, existing combination approaches rely on manual design choices and lack a systematic framework for multi-technique compression. To address this, we prototype a formal framework for automated, multi-technique LLM compression that optimizes the combination sequence. Finally, we discuss remaining challenges and outline future research directions for more efficient large language models.
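As a rough illustration of the kind of sequential composition discussed above, the sketch below applies magnitude pruning and then post-training dynamic quantization to a PyTorch model. It is not the paper's framework: the function name, the 50% sparsity level, and the choice of pruning/quantization primitives are illustrative assumptions only.

```python
# Minimal sketch (illustrative, not the paper's method): compose two
# compression techniques sequentially -- magnitude pruning, then
# post-training dynamic quantization of the remaining dense weights.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def prune_then_quantize(model: nn.Module, sparsity: float = 0.5) -> nn.Module:
    # Step 1: unstructured L1 (magnitude) pruning of every Linear layer.
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=sparsity)
            prune.remove(module, "weight")  # bake the pruning mask into the weights
    # Step 2: post-training dynamic quantization of the (now sparse) Linear layers.
    return torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
```

Reversing the order (quantizing first, then pruning) would force the pruning step to operate on already-quantized weights, which is one concrete reason the application order can change the final accuracy/size trade-off.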
artificial intelligence / natural language processing / model compression / model checking and program correctness