Exploring dark knowledge under various teacher capacities and addressing capacity mismatch

Wen-Shu FAN, Xin-Chun LI, De-Chuan ZHAN

Front. Comput. Sci. ›› 2026, Vol. 20 ›› Issue (6) : 2006333 DOI: 10.1007/s11704-025-41434-w
Artificial Intelligence
RESEARCH ARTICLE

Abstract

Knowledge Distillation (KD) can transfer the “dark knowledge” of a well-performing yet large neural network to a weaker but lightweight one. From the perspective of output logits and softened probabilities, this paper goes deeper into the dark knowledge provided by teachers with different capacities. Two fundamental observations are: (1) a larger teacher tends to produce probability vectors with less distinction among the non-ground-truth classes; (2) teachers with different capacities are largely consistent in their perception of relative class affinity. Through extensive experimental studies, we verify these observations and provide in-depth empirical explanations for them. We argue that the distinctness among incorrect classes embodies the essence of dark knowledge. A larger and more accurate teacher lacks this distinctness, which hampers its teaching ability compared to a smaller teacher and ultimately leads to the peculiar phenomenon known as “capacity mismatch”. Building on this insight, this paper explores several simple yet effective ways to address capacity mismatch, achieving superior experimental results compared to previous approaches.
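To make the two notions in the abstract concrete, the following minimal PyTorch sketch shows the standard temperature-scaled distillation loss of Hinton et al. together with a simple variance-based proxy for how distinct a teacher's softened probabilities are across the non-ground-truth classes. This is an illustration only, not the paper's proposed remedy for capacity mismatch; the function names and the variance proxy are assumptions introduced here for exposition.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def non_target_distinctness(teacher_logits, targets, temperature=4.0):
    """Variance of the teacher's softened probabilities over non-ground-truth classes,
    averaged over the batch; a rough proxy for how much the teacher differentiates
    among incorrect classes (an illustrative measure, not the paper's definition)."""
    probs = F.softmax(teacher_logits / temperature, dim=1)
    mask = torch.ones_like(probs, dtype=torch.bool)
    mask.scatter_(1, targets.unsqueeze(1), False)  # drop the ground-truth column
    non_target = probs[mask].view(probs.size(0), -1)
    return non_target.var(dim=1).mean()

# Toy usage with random logits for a 100-class problem
torch.manual_seed(0)
s_logits, t_logits = torch.randn(8, 100), torch.randn(8, 100)
targets = torch.randint(0, 100, (8,))
print(kd_loss(s_logits, t_logits).item(), non_target_distinctness(t_logits, targets).item())

Under this proxy, a teacher whose softened non-ground-truth probabilities are nearly uniform yields a value close to zero, matching the abstract's claim that larger teachers provide less distinct dark knowledge.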

Keywords

knowledge distillation / dark knowledge / capacity mismatch / non-ground-truth class / temperature scaling

Cite this article

Wen-Shu FAN, Xin-Chun LI, De-Chuan ZHAN. Exploring dark knowledge under various teacher capacities and addressing capacity mismatch. Front. Comput. Sci., 2026, 20(6): 2006333 DOI:10.1007/s11704-025-41434-w

RIGHTS & PERMISSIONS

Higher Education Press
