1 Introduction
Fig.2 Training pipeline. We collect knowledge concepts and instructions under the guidance of textbooks, Bloom's Taxonomy, and strong LLMs, which serve as instruction-tuning data to transform general LLMs into educational LLMs. During inference, we construct a local knowledge base from the textbook and incorporate search-engine capabilities for retrieval enhancement.
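As a concrete illustration of the inference-time retrieval step in the pipeline above, the following is a minimal Python sketch: it indexes textbook passages with TF-IDF as a local knowledge base, retrieves the best-matching passage for a query, and falls back to a stubbed search-engine call when the local match is weak. The passage list, the `retrieve_context` and `build_prompt` helpers, and the 0.2 similarity threshold are illustrative assumptions, not WisdomBot's released implementation.

```python
# Minimal sketch (assumptions noted above): TF-IDF stands in for whatever
# retriever is actually used; the control flow is what matters here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Local knowledge base: passages split from the course textbook.
TEXTBOOK_PASSAGES = [
    "A binary search tree keeps keys ordered so lookups take O(log n) time on average.",
    "Bloom's Taxonomy orders cognitive skills from remembering up to creating.",
    "Newton's second law states that force equals mass times acceleration.",
]

vectorizer = TfidfVectorizer()
passage_matrix = vectorizer.fit_transform(TEXTBOOK_PASSAGES)


def search_engine(query: str) -> str:
    """Placeholder for an external search-engine call (hypothetical)."""
    return f"[web search result for: {query}]"


def retrieve_context(query: str, threshold: float = 0.2) -> str:
    """Return the best-matching textbook passage, or a web result if none is close enough."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, passage_matrix)[0]
    best = int(scores.argmax())
    if scores[best] >= threshold:
        return TEXTBOOK_PASSAGES[best]
    return search_engine(query)


def build_prompt(query: str) -> str:
    """Prepend the retrieved context so the tuned model answers grounded in it."""
    return (
        "Reference material:\n"
        f"{retrieve_context(query)}\n\n"
        f"Question: {query}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt("What does Newton's second law say?"))
```

In practice a dense retriever and a real search API would replace the TF-IDF index and the stub, but the flow, preferring the local textbook index and falling back to web search, matches the pipeline described in the caption.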
2 Related Work
2.1 Large Language Models
2.2 Bloom’s Taxonomy
3 Methods
3.1 Knowledge Concepts
3.2 Knowledge-Based Instruction Tuning
3.3 Retrieval Enhancement
4 Experiments
4.1 Baselines
4.2 Experimental Details
4.3 Test Set Details
4.3.1 Self-Constructed Dataset
4.3.2 Public Dataset: C-Eval
4.4 Results on Self-Constructed Dataset
4.5 Results on C-Eval
Tab.1 Results on the validation set of the C-Eval benchmark (accuracy, %)

| Model | STEM | Social science | Humanities | Other | Hard | Average |
|---|---|---|---|---|---|---|
| Chinese-Alpaca-7B | 35.45 | 51.53 | 47.67 | 41.87 | 28.28 | 42.49 |
| Qwen-7B-Chat | 51.61 | 72.64 | 66.94 | 53.83 | 35.14 | 59.37 |
| WisdomBot | 59.17 | 72.01 | 65.38 | 54.96 | 49.26 | 62.06 |
Tab.2 Results on the STEM subset within the validation set of the C-Eval benchmark (accuracy, %)

| Model | Computer network | Operating system | Computer architecture | College programming |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 36.84 | 52.63 | 38.10 | 43.24 |
| Qwen-7B-Chat | 42.11 | 42.11 | 52.38 | 64.86 |
| WisdomBot | 52.63 | 57.89 | 57.14 | 62.16 |

| Model | College physics | College chemistry | Advanced mathematics | Probability and statistics |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 31.58 | 16.67 | 21.05 | 33.33 |
| Qwen-7B-Chat | 31.58 | 54.17 | 10.53 | 22.22 |
| WisdomBot | 57.89 | 58.33 | 26.32 | 33.33 |

| Model | Discrete mathematics | Electrical engineer | Metrology engineer | High school mathematics |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 43.75 | 37.84 | 50.00 | 16.67 |
| Qwen-7B-Chat | 18.75 | 24.32 | 75.00 | 33.33 |
| WisdomBot | 37.50 | 35.14 | 70.83 | 33.33 |

| Model | High school physics | High school chemistry | High school biology | Middle school mathematics |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 31.58 | 31.58 | 42.11 | 21.05 |
| Qwen-7B-Chat | 57.89 | 52.63 | 73.68 | 63.16 |
| WisdomBot | 78.95 | 68.42 | 68.42 | 63.16 |

| Model | Middle school biology | Middle school physics | Middle school chemistry | Veterinary medicine |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 47.62 | 47.37 | 40.00 | 26.09 |
| Qwen-7B-Chat | 85.71 | 84.21 | 100.00 | 43.48 |
| WisdomBot | 90.48 | 84.21 | 95.00 | 52.17 |
Tab.3 Results on the social science subset within the validation set of the C-Eval benchmark (accuracy, %)

| Model | College economics | Business administration | Marxism | Mao Zedong Thought |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 32.73 | 45.45 | 52.63 | 54.17 |
| Qwen-7B-Chat | 45.45 | 54.55 | 73.68 | 75.00 |
| WisdomBot | 38.18 | 54.55 | 84.21 | 62.50 |

| Model | Education science | Teacher qualification | High school politics | High school geography |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 37.93 | 59.09 | 57.89 | 42.11 |
| Qwen-7B-Chat | 65.52 | 84.09 | 94.74 | 63.16 |
| WisdomBot | 72.41 | 81.82 | 94.74 | 57.89 |

| Model | Middle school politics | Middle school geography |
|---|---|---|
| Chinese-Alpaca-7B | 66.67 | 66.67 |
| Qwen-7B-Chat | 95.24 | 75.00 |
| WisdomBot | 90.48 | 83.33 |
Tab.4 Results on the humanities subset within the validation set of the C-Eval benchmark (accuracy, %)

| Model | Modern Chinese history | Ideological and moral cultivation | Logic | Law |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 52.17 | 52.63 | 54.55 | 20.83 |
| Qwen-7B-Chat | 78.26 | 84.21 | 36.36 | 41.67 |
| WisdomBot | 69.57 | 94.74 | 59.09 | 37.50 |

| Model | Chinese language and literature | Art studies | Professional tour guide | Legal professional |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 34.78 | 48.48 | 51.72 | 39.13 |
| Qwen-7B-Chat | 56.52 | 66.67 | 79.31 | 43.48 |
| WisdomBot | 47.83 | 69.70 | 68.97 | 43.48 |

| Model | High school Chinese | High school history | Middle school history |
|---|---|---|---|
| Chinese-Alpaca-7B | 47.37 | 50.00 | 72.73 |
| Qwen-7B-Chat | 78.95 | 80.00 | 90.91 |
| WisdomBot | 57.89 | 75.00 | 95.45 |
Tab.5 Results on the other subset within the validation set of the C-Eval benchmark (accuracy, %)

| Model | Civil servant | Sports science | Plant protection | Basic medicine |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 40.43 | 57.89 | 36.36 | 47.37 |
| Qwen-7B-Chat | 48.94 | 47.37 | 68.18 | 63.16 |
| WisdomBot | 53.19 | 52.63 | 59.09 | 68.42 |

| Model | Clinical medicine | Urban and rural planner | Accountant | Fire engineer |
|---|---|---|---|---|
| Chinese-Alpaca-7B | 36.36 | 52.17 | 36.73 | 38.71 |
| Qwen-7B-Chat | 45.45 | 63.04 | 51.02 | 48.39 |
| WisdomBot | 50.00 | 60.87 | 53.06 | 45.16 |

| Model | Environmental impact assessment engineer | Tax accountant | Physician |
|---|---|---|---|
| Chinese-Alpaca-7B | 45.16 | 34.69 | 34.69 |
| Qwen-7B-Chat | 48.39 | 53.06 | 55.10 |
| WisdomBot | 58.06 | 44.90 | 59.18 |
4.6 Advanced Cognitive Ability Comparisons
Tab.6 Comparisons on three advanced cognitive abilities

| Model | Creativity | Personalized ability | Logical reasoning (%) |
|---|---|---|---|
| Chinese-Alpaca-7B | 2.78 | 3.56 | 8 |
| Qwen-7B-Chat | 2.86 | 3.34 | 46 |
| WisdomBot | 3.28 | 3.80 | 52 |
4.7 Experiments on Retrieval Enhancement
Tab.7 Comparisons on retrieval enhancement

| Model | Local knowledge base (%) | Search engine (%) |
|---|---|---|
| w/o retrieval | 30 | 35 |
| w/ retrieval | 70 | 93 |