The proliferation of Large Language Models (LLMs) is often constrained by their significant computational and memory requirements, limiting their deployment to large data centers. Small Language Models (SLMs) offer a feasible alternative for on-device applications, yet they must be optimized to run well on resource-constrained hardware. This study investigates strategies for improving the computational efficiency of SLMs, comparing the effects of two primary approaches, post-training optimization and architectural innovation, through a quantitative and observational study. Using a standardized suite of benchmarks measuring accuracy, reasoning, and inference performance, a baseline is established with state-of-the-art SLMs such as Phi-3 and Llama 3. Post-training techniques were then evaluated, including 4-bit quantization (GPTQ) and knowledge distillation from a stronger teacher model. Finally, these optimized Transformers were compared against a custom-trained, non-Transformer model based on the Mamba (State-Space Model) architecture. Results show that 4-bit quantization is the most effective compression strategy among those evaluated: it reduces peak inference memory footprint by 71% and increases throughput by 83%, with minimal accuracy degradation. Within the controlled experimental space evaluated in this study, the 4-bit quantized Phi-3-mini model occupies a Pareto-optimal position in memory-normalized accuracy. Knowledge distillation proved most effective for transferring targeted skills, while novel architectures such as Mamba offer a different trade-off, delivering the highest raw throughput on streaming workloads. The findings indicate that quantizing existing Transformer-based SLMs is the most practical path for general-purpose deployment, whereas distillation and specialized architectures are better suited to narrow, high-performance applications. This research offers a practical evaluation framework and pragmatic recommendations for advancing the next generation of robust, efficient, and accessible language models.
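
For readers who wish to reproduce the compression step, the sketch below shows how a Phi-3-class checkpoint could be quantized to 4 bits with GPTQ via the Hugging Face transformers integration. This is a minimal illustration, not the study's exact pipeline: the model identifier, calibration dataset, and group size shown here are assumed defaults.

```python
# Minimal GPTQ 4-bit quantization sketch (requires: transformers, optimum, auto-gptq).
# Model id and calibration settings are illustrative, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates layer by layer on a small text corpus ("c4" here)
# and stores weights in 4-bit groups of 128.
gptq_config = GPTQConfig(bits=4, dataset="c4", group_size=128, tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                 # place layers on available GPUs/CPU
    quantization_config=gptq_config,   # quantize during loading
)
model.save_pretrained("phi-3-mini-gptq-4bit")  # reusable quantized checkpoint
```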
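The knowledge-distillation objective referenced above is, in its standard Hinton-style form, a temperature-scaled KL term blended with the usual cross-entropy loss. The sketch below illustrates that combined loss; the temperature and mixing weight are illustrative defaults rather than the study's tuned values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,  # assumed default
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
    # Soften both distributions; scale by T^2 to keep gradient magnitudes stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Standard next-token cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return alpha * soft + (1.0 - alpha) * hard
```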
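Finally, the memory and throughput figures reported above depend on how peak usage and tokens per second are measured. A simple harness along the following lines, a hedged sketch assuming a CUDA device and a Hugging Face `generate()` interface, makes such comparisons reproducible.

```python
import time
import torch

def profile_generation(model, tokenizer, prompt: str, max_new_tokens: int = 128):
    """Return (tokens per second, peak GiB) for one greedy generation run."""
    torch.cuda.reset_peak_memory_stats()  # clear the peak-allocation counter
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
    return new_tokens / elapsed, peak_gib
```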
Funding
National University (NU-ERC-2025-3T-14-IFRP-MNL)