The proliferation of Large Language Models (LLMs) is often constrained by their significant computational and memory requirements, limiting their deployment to large data centers. Small Language Models (SLMs) offer a feasible alternative for on-device applications, yet they must be optimized to run well on resource-constrained hardware. This study investigates strategies for improving the computational efficiency of SLMs, comparing the effects of two primary approaches, post-training optimization and architectural innovation, through a quantitative and observational study. Using a standardized suite of benchmarks measuring accuracy, reasoning, and inference performance, a baseline is established with state-of-the-art SLMs such as Phi-3 and Llama 3. Post-training techniques were then evaluated, including 4-bit quantization (GPTQ) and knowledge distillation from a stronger teacher model. Finally, these optimized Transformers were compared against a custom-trained, non-Transformer model based on the Mamba (State-Space Model) architecture. Results show that 4-bit quantization is the most effective compression strategy among those evaluated: it reduces peak inference memory footprint by 71% and increases throughput by 83%, with minimal accuracy degradation. Within the controlled experimental space evaluated in this study, the 4-bit quantized Phi-3-mini model occupies a Pareto-optimal position in memory-normalized accuracy. Knowledge distillation proved most effective for transferring targeted skills, while novel architectures such as Mamba offer a different trade-off, delivering the highest raw throughput on streaming workloads. The findings indicate that quantizing existing Transformer-based SLMs is the most practical path for general-purpose deployment, whereas distillation and specialized architectures are better suited to narrow, high-performance applications. This research offers a practical evaluation framework and pragmatic recommendations for advancing the next generation of robust, efficient, and accessible language models.
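
For readers who wish to reproduce the compression step, the sketch below shows how a Phi-3-class checkpoint could be quantized to 4 bits with GPTQ via the Hugging Face transformers integration. This is a minimal illustration, not the study's exact pipeline: the model identifier, calibration dataset, and group size shown here are assumed defaults.

```python
# Minimal GPTQ 4-bit quantization sketch (requires: transformers, optimum, auto-gptq).
# Model id and calibration settings are illustrative, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "microsoft/Phi-3-mini-4k-instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)

# GPTQ calibrates layer by layer on a small text corpus ("c4" here)
# and stores weights in 4-bit groups of 128.
gptq_config = GPTQConfig(bits=4, dataset="c4", group_size=128, tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                 # place layers on available GPUs/CPU
    quantization_config=gptq_config,   # quantize during loading
)
model.save_pretrained("phi-3-mini-gptq-4bit")  # reusable quantized checkpoint
```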
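The knowledge-distillation objective referenced above is, in its standard Hinton-style form, a temperature-scaled KL term blended with the usual cross-entropy loss. The sketch below illustrates that combined loss; the temperature and mixing weight are illustrative defaults rather than the study's tuned values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      temperature: float = 2.0,  # assumed default
                      alpha: float = 0.5) -> torch.Tensor:
    """Blend soft-target KL (teacher -> student) with hard-label cross-entropy."""
    # Soften both distributions; scale by T^2 to keep gradient magnitudes stable.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Standard next-token cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
    )
    return alpha * soft + (1.0 - alpha) * hard
```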
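Finally, the memory and throughput figures reported above depend on how peak usage and tokens per second are measured. A simple harness along the following lines, a hedged sketch assuming a CUDA device and a Hugging Face `generate()` interface, makes such comparisons reproducible.

```python
import time
import torch

def profile_generation(model, tokenizer, prompt: str, max_new_tokens: int = 128):
    """Return (tokens per second, peak GiB) for one greedy generation run."""
    torch.cuda.reset_peak_memory_stats()  # clear the peak-allocation counter
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    start = time.perf_counter()
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    elapsed = time.perf_counter() - start

    new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
    peak_gib = torch.cuda.max_memory_allocated() / (1024 ** 3)
    return new_tokens / elapsed, peak_gib
```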
Funding
National University (NU-ERC-2025-3T-14-IFRP-MNL)