1 Introduction
Technology can become the “wings” that will allow the educational world to fly farther and faster than ever before—if we allow it.
— Jenny Arledge
Over the past few decades, education has undergone a profound transformation driven by the advancement of technology, resulting in a significant shift in teaching and learning paradigms (
Batool et al., 2023;
Chen et al., 2020;
Chiu et al., 2023;
Latif et al., 2023;
Li et al., 2024;
Maghsudi et al., 2021). The journey began with the electrification phase (
Sovacool & Ryan, 2016), which introduced audio-visual tools to enhance teaching, marking the dawn of modern educational technology. Following this, the integration of computers and multimedia further revolutionized education (
Romero & Ventura, 2013), enabling the use of digital resources, online learning, and resource sharing. Entering the 2010s, artificial intelligence (AI) began to capture significant attention and was increasingly applied in various educational settings (
Alam, 2023;
Ercikan & McCaffrey, 2022;
Tapalova & Zhiyenbayeva, 2022). The history of AI in education dates back several decades. In the 1960s, the first intelligent tutoring system (ITS), known as “SAINT,” was developed (
Alkhatlan & Kalita, 2018). Later, AI gained prominence in educational games and simulations, an era exemplified by IBM’s Deep Blue defeating world chess champion Garry Kasparov in 1997. With the recent widespread availability of generative AI tools, AI has gradually permeated the education sector (
Denny et al., 2024;
Romero & Ventura, 2020;
Xiong et al., 2024). Fig.1 illustrates the evolution of educational technology, highlighting its clear progression through five distinct stages.
The integration of AI into education has long been a topic of speculation and anticipation. However, it was the launch of ChatGPT (
Shafik, 2024) by OpenAI that brought the world a significant step closer to realizing the transformative potential of AI technology. ChatGPT’s introduction was widely recognized, with Science (
2022) naming it one of the top ten scientific breakthroughs of the year. The technology behind it lies in large-scale foundation models (LFMs), particularly large language models (LLMs). These models have dramatically reduced the labor time required in various fields, serving as prime examples of newly emerging productive forces. Their impact extends far beyond the realm of education, influencing industries such as healthcare, media, and transportation. For instance, in natural language processing (NLP), LLMs have shown exceptional capabilities in tasks requiring reasoning and reading comprehension (
Bhagavatula et al., 2019;
Hendrycks et al., 2020;
Rajpurkar et al., 2018). In protein structure prediction, AI has also outperformed human experts in CASP evaluations (
Zheng et al., 2023). In the legal field, GPT-4 (
OpenAI, 2023) achieved an impressive score of 297 on the Uniform Bar Examination, surpassing the high threshold of 273 set by Arizona. Furthermore, OpenAI’s reasoning model, o1 (Zhong et al., 2024), ranked among the top 500 students in the American Invitational Mathematics Examination qualifying round and outperformed human doctoral-level experts on benchmark tests in physics, biology, and chemistry. Such applications have become increasingly widespread across industries, and education is no exception.
The education field, with its wealth of data resources, growing demand for personalized learning, and substantial social value, has become an ideal landscape for the application of LLM technology. As a result, numerous online education companies and educational institutions have actively invested in creating LFMs specifically designed to address the unique challenges and requirements of the education domain. For example, iFLYTEK’s Spark LFM (
iFLYTEK, 2024) demonstrates its versatility in education by offering multimodal interaction, coding, text generation, mathematical problem-solving, and knowledge-based question-answering capabilities. Another notable example is EduChat (
Dan et al., 2023), which integrates Socratic teaching methods during its instruction fine-tuning phase. Furthermore, TAL’s MathGPT model (
Liu, 2025), with a parameter scale of 100 billion, is capable of processing formula image inputs and providing both solutions and detailed explanations, demonstrating its potential to assist students in complex mathematical learning.
Research and empirical evidence suggest that LLMs not only perform exceptionally well on benchmark tests across various disciplines (
OpenAI, 2023), but also provide valuable support and enhancement in diverse educational scenarios. LLMs can function as writing and reading assistants, helping students improve both their language expression and comprehension skills (
Susnjak & McIntosh, 2024). They also play a crucial role in teacher–student collaboration, personalized learning, and automated assessment (
Tan et al., 2023). Furthermore, LLMs are adept at handling tasks such as responding to student inquiries (
Lazaridou et al., 2022) and assisting with learning planning (
Chen et al., 2023b;
Wang et al., 2024a), thereby supporting students’ learning processes on a broader scale. These models are fundamentally transforming the education landscape, empowering all levels of the educational process toward a new era of intelligence and personalization.
Despite their huge potential, the application of LLMs in education remains in its early stages, with much of their transformative power yet to be realized. This is particularly true in K-12 education, where students’ learning needs are highly personalized and where the emphasis falls on holistic student development, including cognitive growth, social-emotional learning, and literacy acquisition. Specifically, several fundamental principles of K-12 education demand a distinct approach to the integration of LLMs.
First, cognitive development varies significantly across age groups, requiring instructional materials to be developmentally appropriate. Vygotsky’s (1978) zone of proximal development suggests that optimal learning occurs when students engage with tasks that are slightly beyond their independent ability but achievable with guidance. Piaget’s (1952) stages of cognitive development highlight age-specific learning needs: younger children in the preoperational stage (ages 2–7) benefit from hands-on, visually supported instruction, while older students in the formal operational stage (ages 11+) require opportunities for abstract reasoning and problem-solving. Effective AI-assisted learning must scaffold student understanding in alignment with these developmental principles.
Second, literacy development is central to K-12 education, serving as the foundation for all other learning domains. According to Chall’s (1983) stages of reading development, literacy acquisition follows a progression from phonemic awareness in early grades to reading fluency and comprehension in later years. LLMs have the potential to support this process by offering personalized reading interventions, adaptive writing feedback, and automated text simplification aligned with different literacy levels (
Wolf, 2007). However, without careful calibration, these models may reinforce surface-level comprehension rather than fostering deep engagement with texts, highlighting the need for structured integration with established literacy frameworks.
Third, K-12 education prioritizes real-world learning experiences that cultivate problem-solving skills, creativity, and social awareness. Constructivist theories of learning, particularly Dewey’s (1986) experiential learning model, stress that meaningful learning occurs through engagement with authentic, real-world problems rather than passive information absorption. In this context, LLMs should not merely function as knowledge repositories but should actively facilitate problem-based learning, inquiry-based instruction, and project-based learning (
Hmelo-Silver, 2004). For instance, an AI-powered learning platform could simulate historical events, guide students through scientific investigations, or generate context-based math problems that reflect real-world applications.
Fourth, student engagement is a key determinant of learning success. Self-determination theory highlights the importance of autonomy, competence, and relatedness in fostering intrinsic motivation (
Ryan & Deci, 2000). LLMs can enhance motivation by providing adaptive learning paths, gamified educational experiences, and personalized feedback mechanisms. However, an over-reliance on AI-generated assistance may reduce students’ opportunities for productive struggle—a critical component of deep learning (
Kapur, 2008). A balanced approach should ensure that AI tools complement rather than supplant student-led inquiry and critical thinking. The convergence of large foundation models and K-12 education is an inevitable trend. Given this trajectory, it is crucial to address the question: how can LFMs be effectively integrated into K-12 education? To move forward, it is essential to clarify the key issues surrounding the promotion of these technologies within the K-12 educational context.
We observe that no existing literature has systematically summarized the application of LLMs in K-12 education from a technological perspective. To bridge this gap, this paper provides a comprehensive technological review of LLMs in K-12 education, offering an in-depth analysis of the current state and emerging trends within the global K-12 LLM-supported educational ecosystem. The review also offers valuable insights for educators on effective teaching strategies and for students on enhancing their self-directed learning practices. The cases discussed in this paper represent a selection of the most relevant examples collected to date, providing a foundational understanding of the landscape of LLM applications in K-12 education. We hope that this work will stimulate further innovation and foster a deeper understanding of the evolving role of LLMs in shaping the future of K-12 education.
The paper is structured as follows. Section 2 provides a detailed introduction to the emergence and current state of LFMs for education. Section 3 analyzes the application of these models in the K-12 education sector through a survey of current practices, focusing on typical scenarios. Furthermore, Section 4 highlights the current challenges and explores future potential opportunities. Section 5 presents the conclusions.
2 LFM for Education
2.1 Large-Scale Foundation Model
LLMs, the typical LFMs, are trained on massive datasets at scaled model sizes and have ushered in a new era of possibilities in AI. Through pre-training on large-scale unlabelled data, instruction tuning (
Chung et al., 2024), and reinforcement learning from human feedback (RLHF) (
Ouyang et al., 2022), LLMs have acquired rich knowledge of the world (
Brown et al., 2020;
Zhao et al., 2023), possess strong text comprehension and generation capabilities, can solve a variety of complex tasks effectively, and excel at accurately tracking context in multi-turn dialogues, ensuring alignment with human values for safe and responsible use (
Bubeck et al., 2023;
Wei, 2022a).
The most powerful LLMs are built on the transformer architecture (
Devlin et al., 2019;
Vaswani et al., 2017), which use stacked self-attention in the encoder and cross-attention in the decoder within an encoder–decoder structure. The encoder transforms an input sequence of symbols x = (x_1, x_2, ..., x_n) into a sequence of continuous representations y = (y_1, y_2, ..., y_n), which the decoder then uses to generate the output sequence of symbols (i.e., human-like text), producing one element at a time. ChatGPT (
Shafik, 2024), developed by OpenAI, has gained widespread attention since its release. ChatGPT is a typical decoder-only transformer model, in which the GPT-series models (
Brown et al., 2020;
Radford et al., 2019 &
2020) serve as decoders that accurately predict the next word. It is a conversational model that can understand and generate human-like text. Subsequently, GPT-4 and GPT-4o (
OpenAI, 2023) were released, introducing multimodal input capabilities and demonstrating significant performance improvements across various evaluation tasks compared to ChatGPT. With ChatGPT, GPT-4, and GPT-4o as milestones of language models, the AI-generated content era began (
Bubeck et al., 2023).
Several key techniques drive the success of LLMs. First, training LLMs is challenging due to their massive size; distributed training algorithms with various parallel strategies, such as DeepSpeed (
Rasley et al., 2020) and Megatron-LM (
Shoeybi et al., 2019), have been proposed to learn the parameters of LLMs. In addition, optimization frameworks with training tricks have emerged, e.g., restarting to overcome loss spikes (
Chowdhery et al., 2023) and mixed precision training (
Le Scao et al., 2023) for training stability. Second, the general-purpose problem-solving abilities that LLMs acquire from pre-training on large-scale corpora may not always be immediately evident on specific tasks. Eliciting these abilities through appropriately designed task instructions or specific in-context learning strategies is therefore necessary. For example, complex reasoning tasks can be approached through chain-of-thought (CoT) prompting, which incorporates intermediate reasoning steps. Third, alignment tuning is proposed to align LLMs with human values, ensuring that they produce high-quality, harmless responses and avoid toxic, biased, or harmful content. InstructGPT (
Ouyang et al., 2022) introduced the RLHF technique to generate helpful, honest, and harmless content. Various other technical advancements have also played a significant role in the success of LLMs.
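As an illustration of ability elicitation, the sketch below applies zero-shot CoT prompting through an OpenAI-compatible chat client; the model name and tutoring framing are illustrative assumptions, not a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint and API key are configured

def cot_answer(question: str) -> str:
    """Elicit step-by-step reasoning by appending a chain-of-thought trigger."""
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative model choice
        messages=[
            {"role": "system", "content": "You are a careful math tutor."},
            # The trailing instruction is the CoT trigger: it asks the model to
            # produce intermediate reasoning steps before the final answer.
            {"role": "user", "content": f"{question}\nLet's think step by step."},
        ],
    )
    return response.choices[0].message.content

print(cot_answer("A class has 3 rows of 8 desks and 5 spare desks. How many desks in total?"))
```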
Recently, a large number of LFMs have emerged to advance the capabilities of artificial intelligence in areas such as NLP (
Touvron et al., 2023a &
2023b), computer vision (
Awadalla et al., 2023;
Li et al., 2023b;
Zhu et al., 2023a), and multimodal learning (
Radford et al., 2021;
Wang et al., 2022a). These advanced LFMs can be broadly classified into LLMs and multimodal LLMs (MLLMs): LLMs have shown surprising performance on NLP tasks, while MLLMs can receive and reason over multimodal inputs. The LLaMA series, spanning LLaMA and LLaMA-2 (
Touvron et al., 2023a &
2023b) to LLaMA-3 and LLaMA-3.1 (
AI Meta, 2024a &
2024b), comprises open-source LLMs pre-trained on up to 15 trillion tokens, delivering performance competitive with leading closed-source models like GPT-4. The Mistral series (
Jiang et al., 2023 &
2024), including Mistral (7B), Mistral NeMo (12B), Mistral Large 2 (123B), and Mixtral, delivers strong performance on benchmarks like MMLU and GSM8k, with Mistral Large 2 supporting 11 languages and over 80 programming languages. The Gemma series comprises lightweight, high-performing open models, such as Gemma-1 (2B and 7B) and Gemma-2 (2B, 9B, and 27B) (
Gemma Team, 2024a &
2024b), trained on up to 13 trillion English tokens, where Gemma-2 demonstrates exceptional performance across benchmarks like ARC-c, MMLU, and GSM8k. The Qwen series (
Qwen Team, 2024;
Yang et al., 2024) is an open-source collection of large-scale models supporting both English and Chinese, including Qwen (7B to 72B), Qwen1.5 (0.5B to 110B), Qwen2 (0.5B to 72B), and the latest Qwen2.5 (0.5B to 72B), pre-trained on up to 18 trillion tokens and demonstrating significant improvements in knowledge retention, coding, mathematics, and instruction following. Moreover, Baichuan (
Yang et al., 2023a) and GLM (
GLM Team, 2024) are open-source bilingual LLMs supporting both Chinese and English, excelling in various evaluations across semantics, mathematics, reasoning, coding, and knowledge domains. Despite their impressive performance on various NLP tasks, LLMs are constrained to processing discrete text and lack the ability to understand multimodal input and produce multimodal output, such as images, videos, and audio. In contrast, MLLMs are designed to achieve this capability. Generally, MLLMs consist of a pre-trained modality encoder that compresses modalities such as images or audio into a more compact representation, a pre-trained LLM, a modality interface that matches different modalities, and a modality generator for producing outputs in modalities other than text (
Yin et al., 2023a). In the field of MLLMs, CLIP (
Radford et al., 2021) integrates a visual encoder that aligns semantically with text by leveraging large-scale pre-training on image-text pairs. Other works adopt different encoders: MiniGPT-4 (
Zhu et al., 2023a) adopts EVA-CLIP (
Fang et al., 2023;
Sun et al., 2023) as an encoder, Osprey (
Yuan et al., 2024) adopts the convolution-based ConvNeXt-L (
Cherti et al., 2023) as its encoder, and Qwen-VL (
Bai et al., 2023) adopts the aligned image-caption-box visual receptor as an encoder to support flexible image resolution and parameter size input. Encoders for other modalities are also available (
Deshmukh et al., 2023;
Elizalde et al., 2023;
Han et al., 2023;
Lin et al., 2023). For example, DouBao (
Bai et al., 2024) employs a representation-learning, generator, and renderer pipeline to encode multimodal inputs, including lyrics, style descriptions, audio references, and voice prompts, for vocal music generation.
Bridging the gap between different modalities in large multimodal models is necessary but highly resource-intensive. A more practical approach is to design the modality interface as a learnable connector between the encoder and the LLM, a module that projects multimodal information into a space the LLM can process efficiently. To this end, token-level and feature-level fusion of multimodal information are embedded in the modality interface. In token-level fusion, the features generated by the encoders are converted into tokens and combined with text tokens, which are then fed into the LLM for further processing. For example, the BLIP-2 series (
Chen et al., 2023a;
Dai et al., 2023;
Li et al., 2023a;
Zhang et al., 2023a) are typical approaches that introduce learnable query tokens and Q-Former network to extract information from multimodal input. In addition, some methods such as LLaVA (
Liu et al., 2023 &
2024) use an MLP-based interface to align multimodal features (
Pi et al., 2023;
Su et al., 2023;
Zhang, 2023c). Feature-level fusion approaches mainly focus on deep interaction and fusion between multimodal features (
Guo et al., 2023;
Wang et al., 2023;
Yin et al., 2023b;
Zhang et al., 2023b). For example, Flamingo (
Alayrac et al., 2022) integrates external visual cues into language features by adding cross-attention layers within the frozen transformer layers of LLMs. As datasets have become larger and model architectures more sophisticated, LLMs have developed remarkable emergent capabilities, including instruction following (
Chung et al., 2024;
Wei et al., 2021), in-context learning (
Liu et al., 2021;
Rubin et al., 2021), CoT (
Fu et al., 2022;
Zhang et al., 2022) and planning (
Chen et al., 2023d;
Yao et al., 2023), which enable them to perform complex tasks with increasing skill and adaptability.
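To illustrate token-level fusion, the following PyTorch sketch projects visual encoder features into the LLM embedding space with an MLP connector, in the spirit of LLaVA's interface, and concatenates them with text token embeddings; all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MLPConnector(nn.Module):
    """Projects modality-encoder features into the LLM token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(vision_feats)  # (batch, n_patches, llm_dim)

# Illustrative shapes: 256 visual patch features of size 1024, LLM hidden size 4096.
vision_feats = torch.randn(1, 256, 1024)   # stand-in for CLIP-style encoder output
text_embeds = torch.randn(1, 32, 4096)     # stand-in for embedded text tokens
connector = MLPConnector(vision_dim=1024, llm_dim=4096)

# Token-level fusion: projected visual tokens are prepended to the text tokens,
# and the combined sequence is fed into the LLM as ordinary input embeddings.
fused = torch.cat([connector(vision_feats), text_embeds], dim=1)
print(fused.shape)  # torch.Size([1, 288, 4096])
```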
Research on LFMs is not only a hot topic in academia; industry has also pursued practical applications and released a variety of products for real-life scenarios. For example, Moonshot AI has developed Kimi 3, which can handle long conversations of up to 200,000 characters of input and output and can also take documents as input for users to chat over. OpenAI's Sora generates new videos from text, image, and video inputs. Qwen, Alibaba Group's LLM, works as an assistant for scientific question answering, writing, creative work, coding, and more. Today, more and more LLM products can parse web pages and URLs to capture up-to-date information and provide users with what they need. However, current LLMs remain limited: they struggle to process long-context multimodal information, their agency for interacting with the real world is weak, and they can be tricked into giving biased or undesirable responses.
2.2 LFM of Education
The ability of LLMs to explain, generalize, and create has attracted considerable interest and led to extensive research and discussion in domain applications such as education (
Kasneci et al., 2023;
Li et al., 2023c), healthcare (
Thirunavukarasu et al., 2023;
Yang et al., 2023c), and finance (
Wu et al., 2023a;
Yang et al., 2023b). Focusing on education, LLMs show potential in teacher–student collaboration, personalized learning, and assessment automation (
Wang et al., 2024b).
The research and development of educational LLMs has received great attention; many research institutions and enterprises have released educational LLMs to enhance educational quality. For example, iFLYTEK Spark helps students with various tasks such as grading homework, practicing spoken language, answering encyclopedic questions, and writing code. MathGPT is an LLM specialized in mathematics that can perform mathematical calculation and reasoning, code understanding and generation, and math knowledge question answering. Ziyue is a specialized English LLM that supports speaking coaching, AI writing, and AI translation. OpenAI o1 can reason through complex tasks by spending more time thinking before responding, achieving a higher ability than previous models to solve hard problems in science, coding, and math. Developing educational LLMs is an effective way to support the digital transformation of education. More educational LLMs are summarized in Tab.1.
3 Current Trends
This paper focuses on K-12 Education and explores how LLMs can enhance educational quality. The application of LLMs in K-12 education requires tailored approaches, such as providing guided responses, staying within students’ knowledge, and adhering to curriculum standards. These applications should encourage critical thinking, match students’ cognitive levels, and avoid overly complex or irrelevant information to ensure effective and reliable support for learning. Therefore, the successful integration of LLMs into K-12 education requires a comprehensive understanding of students’ learning characteristics, cognitive development, and curriculum requirements. From an application needs perspective, LLMs must support differentiated instruction and adaptive learning while ensuring pedagogical alignment with established educational frameworks. Given the diverse learning paces and abilities of K-12 students, LLMs should facilitate personalized learning by providing scaffolded guidance, timely feedback, and explanations that align with students’ cognitive development stages. Currently, LLMs demonstrate strong capabilities in automating content creation and other related areas. However, challenges still exist in many aspects, such as ensuring content accuracy and mitigating biases. Therefore, LLMs should be considered as auxiliary tools rather than replacements for traditional teaching strategies in order to maximize their educational benefits.
This section reviews the broad applications of LLMs in K-12 education scenarios in support of students and teachers, highlighting the advanced technologies, cases, and risks in each area.
3.1 Teaching Resource Recommendation
Lesson planning is the process in which teachers design and organize the content, methods, and materials for a specific lesson to ensure effective teaching and learning (
Iqbal et al., 2021;
Milkova, 2025). It involves setting clear objectives, selecting appropriate teaching strategies, preparing resources, and outlining activities to engage students and achieve desired learning outcomes. Importantly, finding high-quality resources that cover the topics of the lesson is a key step and a common part of pedagogical practice (
Ding & Carlson, 2013). Educational resource recommendation technologies (
Manouselis et al., 2010;
Wu et al., 2020) offer a practical solution by leveraging teacher-specific requirements to identify and suggest relevant teaching and learning materials from educational databases. The previous approaches (
Karpicke & Grimaldi, 2012;
Urdaneta-Ponte et al., 2021) usually either used deep neural retrieval networks (
Dou et al., 2007;
Zhu et al., 2023b) to return a list of resources relevant to the user's submitted query, or explored the collaborative filtering approach (
Herlocker et al., 2004;
Schafer et al., 2007), the content-based approach (
Zhu et al., 2019;
Zhu et al., 2020), and the hybrid approach (
Deschênes, 2020;
Zhu et al., 2022) to recommend useful information that is relevant to their individual needs. With the advent of LLMs, exploring the great potential of LLMs to help teachers create high-quality teaching materials has attracted more attention (
Wang, 2024b;
Xu et al., 2024;
Zhu et al., 2023c). Leiker et al. (
2023) used an LLM with a human-in-the-loop prompt engineering process to find high-quality content. Koraishi (
2023) used ChatGPT with a zero-shot prompting strategy to optimize English teaching materials.
Here we give a real example to help readers understand how to use LLMs for resource recommendation. Fig.2 gives the details, where we use Qwen2.5 to perform the recommendation task. When users input their queries, Qwen2.5 searches the web for information and uses its summarization and refinement abilities to return the relevant resources and their URLs to the user. Although LLMs can effectively reduce teachers' resource preparation burden, improve efficiency, and support high-quality lesson planning, they cannot yet effectively analyze teaching and learning situations or recommend personalized learning content. That is, they cannot provide personalized recommendations for teachers based on a full understanding of students' learning abilities and prior knowledge.
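The flow in Fig.2 can be approximated programmatically: retrieve candidate resources for the teacher's query and let the model summarize and rank them. The sketch below assumes an OpenAI-compatible endpoint serving a Qwen-class model; the model identifier, candidate list, and prompt wording are illustrative.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible endpoint serving a Qwen-class model

def recommend_resources(query: str, candidates: list[dict]) -> str:
    """Ask the LLM to select and justify the most relevant teaching resources."""
    listing = "\n".join(f"- {c['title']}: {c['url']}" for c in candidates)
    prompt = (
        f"A teacher is preparing a lesson on: {query}\n"
        f"Candidate resources retrieved from the web:\n{listing}\n"
        "Pick the most relevant resources, summarize each in one sentence, "
        "and return the title and URL for each."
    )
    response = client.chat.completions.create(
        model="qwen2.5-72b-instruct",  # illustrative model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Illustrative candidates, e.g., from a search API or a school resource database.
candidates = [
    {"title": "Photosynthesis lab worksheet", "url": "https://example.org/photo-lab"},
    {"title": "Grade 7 biology slides", "url": "https://example.org/bio-slides"},
]
print(recommend_resources("photosynthesis for grade 7", candidates))
```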
3.2 Teaching Content Creation
Educational content, such as lesson plans, teaching designs, syllabi, presentation slides, and questions, is tied to educational goals and objectives and constitutes necessary material in pedagogical practice (
Bell, 1993;
Davis, 2017). How to generate such materials is a popular research topic in the application of LLMs to education.
Lesson plans, teaching designs, and syllabi are the main operational documents that guide the specific implementation of a lesson, a course, or a unit (
Jury et al., 2024;
Wang et al., 2024b). Traditional methods rely on education specialists or teachers to determine the teaching or learning objectives, activities, and strategies for checking student understanding, and to write the lesson plan, teaching design, and syllabus (
Milkova, 2025), which lacks flexibility and individualization. In the era of Big Data, a large amount of data on student learning and teacher instruction has accumulated, giving rise to tools and systems that utilize deep learning models (
Han et al., 2021) to mine the data and help teachers produce instructional content (
Yu et al., 2019). However, these tools and systems lacked human–machine interaction, separating lesson preparation from the teaching process and its adaptive adjustment, which seriously reduced teaching efficiency (
Jia et al., 2018). LLMs have strong comprehension, generation, and interaction capabilities. Teachers can provide them with customized templates, content, and background knowledge, enabling the LLMs to generate lesson plans, teaching designs, and syllabi that encompass teaching objectives, learning points, and key knowledge points, among other elements, thus broadening teachers' teaching design ideas.
Here we give a real example to help readers understand how to use LLMs to generate lesson plans, teaching designs, and syllabi. Fig.3 gives the details, where we use Qwen2.5 to perform the task. When a user submits the query, Qwen2.5 notes that teaching land reforms in ancient China, particularly the Equal-Field System of Emperor Xiaowen of the Northern Wei Dynasty and Wang Anshi's Equal-Field Taxation Law, can be both enlightening and engaging for high school students, and it produces a lesson plan composed of "introduction," "brief historical context," "detailed examination of reforms," "interactive activities," and "homework assignment."
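The interaction in Fig.3 can be reproduced with a structured prompt that supplies the template, content scope, and background knowledge, as in the minimal sketch below; the template fields are assumptions chosen to mirror the elements listed above.

```python
LESSON_PLAN_TEMPLATE = """You are an experienced {subject} teacher.
Write a lesson plan for grade {grade} on the topic: {topic}.
Background knowledge to draw on: {background}
The plan must contain these sections: teaching objectives, key knowledge
points, introduction, brief historical context, detailed examination,
interactive activities, and homework assignment.
"""

prompt = LESSON_PLAN_TEMPLATE.format(
    subject="history",
    grade="10",
    topic="land reforms in ancient China",
    background=("the Equal-Field System of Emperor Xiaowen of the Northern Wei "
                "Dynasty and Wang Anshi's Equal-Field Taxation Law"),
)
# The filled-in prompt is then sent to an LLM (e.g., Qwen2.5) as a user message.
print(prompt)
```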
Presentation slides are a series of visual aids created to support teachers in delivering information, ideas, and concepts to students. The traditional method is to manually search or create videos, pictures, and text, then organize such materials to create slides (
Strauss et al., 2011). This reliance on manual labor to integrate different media types requires extra effort and technical skill and is time-consuming and inefficient. MLLMs can generate various slides from input conditioning, such as a customized syllabus, teaching content, and background knowledge. Fig.4 gives an example in which an MLLM transforms the content of a lesson plan into slides. However, most of the generated slides are text-based and lack graphics, so they cannot be used directly in pedagogical practice.
Question generation has become one of the most popular research topics in LLMs’ application for education (
Du et al., 2017;
Duan et al., 2017). Manually creating questions is a complex process that demands specialized training and practical experience. Automatic question generation (AQG) techniques (Kurdi et al., 2023) can reduce the costs of manual question creation and meet the demand for a continuous supply of new questions. AQG aims to generate questions of controlled difficulty, enrich question forms and structures, and automate template construction (
Zhang et al., 2021). For example, Heilman and Smith (
2010) proposed a statistical ranking network to generate and rank fact-based questions based on the content of a reading practice or assessment. The highest-ranked questions were either revised by educators or directly provided to students for practice. Recently, many LLM-based approaches have adopted various strategies, such as fine-tuning LLMs with supplemental reading materials (
Xiao et al., 2023), aligning with specific learning objectives (
Doughty et al., 2024), and applying implicit diversity controls over the equations underlying the questions (
Zhou et al., 2023b) to generate questions. Fig.5 provides an example of generating a math exercise on the knowledge point "opposite vertex." At present, simple questions can be generated reliably, while the generation of complex questions remains limited.
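As a concrete illustration of controlled generation, the sketch below composes a constrained AQG prompt and applies a cheap well-formedness check before a generated item would enter a question bank; the prompt fields and regular expressions are illustrative assumptions, not any cited system's implementation.

```python
import re

def build_aqg_prompt(knowledge_point: str, difficulty: str, n_options: int = 4) -> str:
    """Compose a generation prompt with explicit difficulty and form controls."""
    return (
        "Generate one multiple-choice math exercise for middle school students.\n"
        f"Knowledge point: {knowledge_point}\nDifficulty: {difficulty}\n"
        f"Provide exactly {n_options} options labeled A-D and end with 'Answer: <letter>'."
    )

def is_well_formed(generated: str, n_options: int = 4) -> bool:
    """Cheap post-generation screen before an item enters the question bank."""
    has_options = len(re.findall(r"^[A-D][).]", generated, re.MULTILINE)) == n_options
    has_answer = re.search(r"Answer:\s*[A-D]\s*$", generated.strip()) is not None
    return has_options and has_answer

prompt = build_aqg_prompt("opposite vertex", "easy")
# `prompt` is sent to an LLM; the raw output is then screened, e.g.:
sample_output = "What ...?\nA) ...\nB) ...\nC) ...\nD) ...\nAnswer: B"
print(is_well_formed(sample_output))  # True
```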
3.3 Intelligent Question Retrieval
Intelligent question retrieval aims to analyze the content of a given problem, and rapidly match relevant questions and answers from a vast database. This technology provides solutions, detailed explanations, and reference materials for the question input, helping students consolidate their knowledge and enhance their learning efficiency.
Most current efforts still rely on traditional AI methods due to the high cost and slow processing speed of LFMs. Furthermore, the majority of existing solutions have been developed by online education companies, such as Readboy, ZuoYeBang, TAL, and iFLYTEK, which have access to extensive question datasets. In addition to accumulating large question datasets, these approaches primarily focus on several key technical aspects: (1) enhancing the effectiveness and efficiency of retrieval strategies; (2) improving the accuracy of retrieval for complex illustrations and diagrams; (3) integrating factors such as question subject, type, and difficulty level to customize search results to meet individual needs; (4) ensuring that incomplete or defective questions are still retrieved with high precision.
An intelligent question retrieval system typically operates in several stages. First, question input processing accepts the user's question, either as text or as an image for text recognition. The text undergoes preliminary processing, including word segmentation, automatic error correction, and stop word filtering. In the query semantic understanding phase, the system predicts the subject category, question type, and difficulty level by analyzing both the image and the text. It also conducts word weight analysis, emphasizing high-weight words in the recall process while excluding irrelevant or low-weight terms. Synonyms are rewritten according to predefined rules to expand the recall scope. In the retrieval and reranking phase, relevant documents are first recalled from the question gallery, then reranked by a refined score that considers factors such as relevance, quality, difficulty, and popularity to select the optimal results. Fig.6 illustrates retrieval examples in an intelligent question retrieval system.
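The following is a minimal sketch of this recall-then-rerank flow, using TF-IDF recall and a toy rerank score over question metadata; the weights and gallery are illustrative assumptions, not a production formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy question gallery with metadata consumed by the rerank stage.
gallery = [
    {"text": "Solve 2x + 3 = 11 for x.", "quality": 0.9, "popularity": 0.7},
    {"text": "Find x if 2x + 3 = 7.", "quality": 0.8, "popularity": 0.9},
    {"text": "What is the area of a circle with radius 3?", "quality": 0.9, "popularity": 0.5},
]

query = "solve 2x + 3 = 11"

# Recall stage: lexical similarity between the query and every gallery item.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform([q["text"] for q in gallery])
relevance = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]

# Rerank stage: refine the score with quality and popularity (illustrative weights).
def rerank_score(rel: float, item: dict) -> float:
    return 0.7 * rel + 0.2 * item["quality"] + 0.1 * item["popularity"]

ranked = sorted(zip(gallery, relevance), key=lambda p: rerank_score(p[1], p[0]), reverse=True)
for item, rel in ranked:
    print(f"{rerank_score(rel, item):.3f}  {item['text']}")
```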
Although intelligent question retrieval systems have become a mature application widely adopted by online education enterprises, they still face several significant challenges. First, recognition accuracy is suboptimal when dealing with illegible handwriting. Second, retrieval performance is severely limited when questions contain images, as extracting key information and performing effective matching proves difficult. Third, for new types of interdisciplinary questions, where the question database coverage is insufficient, providing precise answers remains a challenge. These issues underscore the need for further advancements in both technology and methodology to improve the reliability of intelligent question retrieval systems.
3.4 Intelligent Problem Solving
Intelligent problem solving is defined as the systematic generation of problem-solving processes based on the content of a question, providing accurate and effective solutions to support personalized learning. Recent advancements in intelligent problem solving have been driven by LFMs, which, having been pre-trained on vast amounts of text data, possess robust natural language understanding and generation capabilities. These models can effectively parse complex questions, identify relevant knowledge points, and generate systematic problem-solving steps and final answers.
LFMs, when not specifically optimized for intelligent problem-solving, often struggle with tasks requiring rigorous logical reasoning, such as mathematics. To improve their performance in this domain, various strategies have been adopted, which can be broadly classified into three categories. The first category involves prompting LLMs. Wu et al. (
2023b) pioneered the use of various prompting techniques, including vanilla prompts, program of thoughts prompts, and program synthesis prompts, to enhance problem-solving capabilities. Wei et al. (
2022b) introduced CoT prompting, a method that guides LLMs through step-by-step reasoning, later extended by sampling multiple CoT reasoning paths (
Wang et al., 2022b). Zhou et al. (
2023a) proposed self-verification based on explicit code, a novel prompting method that allows models to verify their answers through code. The second category focuses on improving problem-solving performance through external strategies. An et al. (
2023) enhanced LLM performance by converting mathematical expressions into English, while Yamauchi et al. (
2023) employed external tools, particularly the Python REPL, to correct errors in CoT reasoning and boost performance. The third approach involves fine-tuning LFMs. Some studies (
Yang et al., 2023d) enhance the model’s understanding of relevant concepts by introducing intermediate steps during the fine-tuning process, thereby improving performance. Other research incorporates additional relevant datasets, including those generated by the model itself, for further training.
The intelligent problem-solving system begins by semantically parsing the input question to identify the relevant knowledge points and concepts, drawing on the extensive knowledge base built during the model’s pre-training phase. Using this understanding, the model generates a structured sequence of problem-solving steps, including the application of formulas, logical derivations, and intermediate calculations, ensuring the rigor and coherence of the solution process. These steps and the final answer undergo multiple rounds of internal validation and optimization to improve accuracy and reliability, often incorporating heuristic searches or optimization algorithms to verify the reasonableness of the results. Fig.7 illustrates examples of problem solving using LLMs.
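The code-based verification idea cited above can be illustrated as follows: execute a model-produced solution program and check it against the model's stated answer. In this simplified sketch, the model output is hard-coded for demonstration, and a real system would sandbox the execution.

```python
# Simplified illustration of verifying a CoT answer with executable code,
# in the spirit of the code-based self-verification approaches cited above.

question = "A train travels 60 km/h for 2.5 hours. How far does it go?"

# Stand-in for model output: a final answer plus a short verification program.
model_answer = 150.0
verification_code = "result = 60 * 2.5"

# Execute the verification program in an isolated namespace.
# (A real system must sandbox this step; untrusted code should never run directly.)
namespace: dict = {}
exec(verification_code, namespace)

if abs(namespace["result"] - model_answer) < 1e-6:
    print(f"Answer {model_answer} km verified by code execution.")
else:
    print(f"Mismatch: code gives {namespace['result']}, model said {model_answer}.")
```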
LLMs still face significant limitations in intelligent problem-solving. First, while current research has largely focused on organizing large-scale datasets (
He et al., 2024;
Zhang et al., 2024), there remains a lack of robust generalization across different datasets, grade levels, and question types. This highlights the need for continual learning approaches, similar to the process of human skill acquisition, to improve the generalization capabilities of LLMs. Second, LLMs exhibit substantial reasoning shortcomings, showing inconsistent performance across textual formats (e.g., numbers written as words versus digits) and giving different answers to the same question through varying reasoning paths across multiple attempts. Third, current reasoning methods, such as CoT, fail to adequately address students' needs and comprehension levels. Studies (
Gattupalli et al., 2023;
Yen & Hsu, 2023) indicated that LFMs often misinterpret students’ questions in dialogues and fail to provide adaptive feedback. Moreover, they tend to overlook the comprehension abilities of younger students, generating overly complex responses that may cause confusion. Thus, it is essential to incorporate human-factors design into AI research to better meet the nuanced needs of K-12 education.
3.5 Intelligent Question Answering
Intelligent question answering involves the automated and immediate response to student inquiries. Its core function is to comprehend the student’s question, take into account their individual characteristics, and generate contextually relevant answers.
In K-12 education, LLMs must offer precise responses tailored to specific subjects and tasks. To meet this demand, fine-tuning techniques have become crucial for enhancing the effectiveness of intelligent question answering systems (
Chen et al., 2023c). In addition, methods such as prompt engineering, CoT, and the Socratic method have been employed to enhance model interaction and learning outcomes (
Westerlund & Shcherbakov, 2024). Retrieval-augmented generation techniques have further improved LLM functionality by incorporating external databases (
Miladi et al., 2024;
Sequeda et al., 2024), thereby alleviating errors and hallucinations, an important benefit in educational contexts where accuracy is essential. As technology evolves, intelligent question answering systems are increasingly incorporating multimodal capabilities (
Lim et al., 2024;
Luo et al., 2023). This expansion goes beyond text-based inputs and outputs, enabling the simultaneous processing of text, voice, images, and other forms of data, thus offering richer and more diverse support for students’ learning experiences.
Fig.8 presents two practical applications of intelligent question–answering technology in education (
Dan et al., 2023). Fig.8(a) involves an open-domain question-answering system based on the retrieval-augmented generation method. In this scenario, the system retrieves the correct fact, that Cao Cao is the author of the poem, and generates a detailed response that not only provides the author's information but also explains the poem's creation background and literary value, including a reference link for further reading. Fig.8(b) demonstrates an interactive learning system using the Socratic method for intelligent question answering. When a question is raised, the system does not provide a direct answer but instead guides the user through a series of probing questions, helping them progressively understand the concept.
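The open-domain flow of Fig.8(a) reduces to retrieve-then-generate, sketched below with a toy word-overlap retriever; production systems would use dense embeddings or BM25, and the passages here are illustrative.

```python
def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Rank passages by word overlap with the question (toy retriever;
    real systems use dense embeddings or BM25)."""
    q_words = set(question.lower().split())
    scored = sorted(passages, key=lambda p: len(q_words & set(p.lower().split())), reverse=True)
    return scored[:k]

passages = [
    "The poem 'Though the Tortoise Lives Long' was written by Cao Cao.",
    "Li Bai was a Tang dynasty poet known as the Immortal of Poetry.",
]
question = "Who wrote 'Though the Tortoise Lives Long'?"

context = "\n".join(retrieve(question, passages))
# The retrieved context is prepended to the prompt so the LLM answers from
# verified facts instead of hallucinating; a reference link can be attached too.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```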
LFMs have demonstrated impressive capabilities in intelligent question answering, yet they continue to face significant challenges in practical applications, particularly in handling complex mathematical calculations and teaching methodologies. Despite notable advancements, most developments have focused on higher education, with customized models tailored for K-12 education still lacking widespread adoption. As a result, the coverage and depth of these models fall short of addressing the diverse and personalized needs of K-12 learners. Furthermore, existing educational LFMs have not systematically incorporated the deeper knowledge and principles of pedagogy, limiting their ability to fully support personalized learning, provide emotional feedback, or deliver in-depth subject knowledge for both teachers and students.
3.6 Automated Scoring
Automated scoring is a revolutionary application in the field of education, as shown in Fig.9. This technology not only alleviates the workload of teachers but also enhances grading efficiency and objectivity. Automatic scoring is typically divided into scoring for objective questions and for subjective questions. The scoring process for objective questions is well-established and is typically carried out by comparing responses to standardized answers, enabling rapid and accurate grading.
However, the challenges encountered by automated scoring technology in scoring subjective questions are more complex. The main reason lies in the higher demands for semantic analysis, contextual reasoning, creativity, and critical-thinking assessment. Traditional scoring for subjective questions relies on human evaluators, who, while capable of providing detailed feedback and personalized assessments, tend to be less efficient and are easily influenced by subjective factors. In recent years, automatic grading technologies based on LLMs have begun to emerge. Kundu & Barbosa (
2024) evaluated the effectiveness of LLMs such as ChatGPT and Llama in the automated essay scoring task, with a particular focus on their consistency with human scores. They found that LLMs often scored more stringently than human raters and exhibited weak correlation with human grading. Additionally, they explored the ability of LLMs to identify spelling and grammatical errors, discovering that LLMs could reliably detect these mistakes and take them into account during scoring. Kim & Jo (
2024) introduced a novel hybrid method that combines LLMs and comparative judgment (CJ). This method used zero-shot prompting to enable LLMs to select the better of two essays, simulating the human evaluator’s comparative judgment process. The study results indicated that the CJ approach outperformed traditional rubric-based scoring methods in LLM-driven essay assessments. Rationale-based multiple trait scoring (
Chu et al., 2024) combined prompt-based LLMs with fine-tuned small LLMs to accurately predict multi-faceted scores based on feature-specific rationales generated by LLMs linked to scoring criteria.
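The comparative-judgment idea can be sketched as pairwise LLM preferences aggregated into a ranking, as below; the preference function is a stand-in heuristic here, whereas the cited work obtains it from zero-shot LLM prompting.

```python
from itertools import combinations

def llm_prefers(essay_a: str, essay_b: str) -> str:
    """Stand-in for a zero-shot LLM call that returns the better essay,
    as in the comparative-judgment approach cited above. A toy heuristic
    (longer = better) replaces the real model call for illustration."""
    return essay_a if len(essay_a) > len(essay_b) else essay_b

essays = {
    "s1": "Short answer.",
    "s2": "A developed argument with evidence and a clear conclusion.",
    "s3": "Some evidence, partially organized response.",
}

# Comparative judgment: every pair is compared, and wins induce a ranking.
wins = {sid: 0 for sid in essays}
for a, b in combinations(essays, 2):
    winner = a if llm_prefers(essays[a], essays[b]) == essays[a] else b
    wins[winner] += 1

for sid, w in sorted(wins.items(), key=lambda kv: kv[1], reverse=True):
    print(sid, "wins:", w)
```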
These studies indicate that while LLMs demonstrate potential in automated grading tasks, there is still room for improvement in their understanding of scoring criteria and their ability to generate consistent scores. Future research must further explore how to optimize LLMs to better mimic the behavior of human evaluators and enhance the accuracy and reliability of automated grading systems. Moreover, in the scoring of subjective questions, the results generated by LLMs may be influenced by factors such as race, gender, and culture, which can lead to biased outcomes.
3.7 Exam Paper Generation
Exam paper generation refers to the process of automatically generating exam papers that meet educational requirements. Traditionally, teachers need to manually select appropriate questions from a large pool of items in the question bank. This process is often cumbersome and time-consuming, while automated methods effectively address this issue. As shown in Fig.10, the model integrates various constraints, including students’ knowledge levels, teaching objectives, and question types, to identify and extract relevant questions from the question bank, subsequently organizing them into a coherent and structured exam paper.
Commonly used methods include random question selection algorithms, backtracking algorithms, and genetic algorithms (
Cen et al., 2010;
Joshi et al., 2016;
Umardand & Gaikwad, 2017). Random selection algorithms are typically used in certification exams. Backtracking algorithms are commonly applied in adaptive assessments, such as the TOEFL language proficiency test. Genetic algorithms, which iteratively optimize an initial set of question combinations, progressively generate the most optimal test set that satisfies the constraints, making them particularly effective in solving high-dimensional, multi-constraint problems.
In recent years, with the advancement of deep learning, neural network-based exam paper generation algorithms have begun to emerge. Cen et al. (
2010) employed an adaptive genetic algorithm based on artificial intelligence and information processing to optimize the question selection and paper organization process. Joshi et al. (
2016) suggested that the use of NLP in the exam generation process could directly extract textual information from questions, preventing the frequent selection of certain questions. Wang et al. (
2020) proposed a question selection strategy for cognitive diagnosis, integrating artificial intelligence recognition models to enhance the efficiency and accuracy of question selection. The advent of LFMs has brought new breakthroughs to exam paper generation systems. Unlike traditional methods, LFMs are capable of processing unstructured raw question text, automatically analyzing and understanding the semantics, context, and underlying key points within the questions, without the need for prior complex labeling or manual feature extraction. This capability allows LFMs to generate exam papers that more flexibly meet specific educational needs, further enhancing the intelligence of the exam paper generation process.
However, there are still several limitations. First, the performance heavily depends on the accurate modeling of the generation objectives. If the objective modeling is flawed, such as inaccurate estimates of difficulty distribution, it could result in exam papers that are unreasonable in terms of difficulty control. Second, in addition to basic requirements for difficulty and key points coverage, real-world educational scenarios may involve more complex exam paper requirements, such as assessing creative thinking and critical analysis abilities. Therefore, exam paper generation systems need to integrate additional dimensions to ensure that the generated papers are comprehensive and reasonable.
3.8 Question Quality Evaluation
Test items and examination papers are essential tools widely used in evaluating students’ academic performance and teachers’ instructional quality. They not only directly reflect students’ learning outcomes but also serve as important instruments for assessing teaching effectiveness and guiding instructional improvements. Therefore, the quality assessment of test items and examination papers is a critical process in educational measurement. High-quality test items and examination papers need to balance several key aspects.
(1) Fairness. All candidates are tested under the same conditions.
(2) Reliability. Students get similar scores when tested at different times.
(3) Validity. The questions should cover the necessary knowledge points, and maintain an appropriate level of difficulty.
(4) Teaching improvement. By analyzing students’ answers, teachers can understand the strengths and weaknesses of each student, and adjust teaching strategies.
The quality assessment of test items and exam papers mainly includes single-item assessment and whole-paper assessment. In single-item assessment, the focus is on ensuring that each question effectively measures specific knowledge and skills. The assessment dimensions typically include content relevance, difficulty compatibility, discrimination power, design reasonableness, and uniqueness of answers. In whole-paper assessment, the focus is on thoroughly evaluating the entire test paper to ensure its logical structure, appropriate difficulty level, comprehensive content coverage, and alignment with curriculum requirements.
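Two of the single-item dimensions above, difficulty compatibility and discrimination power, have standard definitions in classical test theory, sketched below on toy binary response data; the top/bottom-27% grouping is a common convention, and the data are illustrative.

```python
# Classical test theory on toy data: rows = students, columns = items (1 = correct).
responses = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
]
n_students, n_items = len(responses), len(responses[0])
totals = [sum(row) for row in responses]

# Sort students by total score to form high and low groups (top/bottom 27%).
order = sorted(range(n_students), key=lambda s: totals[s], reverse=True)
g = max(1, round(0.27 * n_students))
high, low = order[:g], order[-g:]

for item in range(n_items):
    correct = [responses[s][item] for s in range(n_students)]
    difficulty = sum(correct) / n_students                # proportion answering correctly
    discrimination = (sum(responses[s][item] for s in high) -
                      sum(responses[s][item] for s in low)) / g
    print(f"item {item}: difficulty={difficulty:.2f} discrimination={discrimination:.2f}")
```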
Currently, most quality assessments rely heavily on manual review. While manual review ensures high accuracy, it suffers from low efficiency and strong subjectivity, making it unsuitable for large-scale assessments. Metrics commonly used in machine translation tasks, such as bilingual evaluation understudy (BLEU), recall-oriented understudy for gisting evaluation (ROUGE), and the metric for evaluation of translation with explicit ordering (METEOR), have been applied to quality assessment. However, these metrics rely on manually crafted test items as benchmarks and merely compare structural similarities, lacking deeper correlations. Consequently, they offer limited practical value for high-quality evaluation of test items and examination papers (
Görgün & Bulut, 2024). Nowadays, LFMs are being applied in this field to aid in improving assessments. Student profiles can be built from educational data using machine learning methods such as clustering, decision trees, and knowledge graphs; these profiles, combined with test items and examination papers, serve as inputs for LFMs, which can produce multi-dimensional evaluation results (see Fig.11).
However, there are still several limitations. First, LFMs may lack in-depth understanding of specialized knowledge, leading to biases when assessing domain-specific items. Second, current evaluation standards lack strong associations and integration with curriculum standards, making it difficult to conduct a comprehensive evaluation of quality.
3.9 Discussion
LFMs for K-12 education are typically developed by fine-tuning pre-trained models on extensive educational datasets, such as exam questions and textbooks. While these models capitalize on robust comprehension and generation capabilities to deliver promising results in various educational scenarios, their underlying approach remains rooted in data fitting. They depend heavily on massive datasets, extensive parameters, and immense computational power, epitomizing a brute-force methodology. However, this approach falls short of true intelligence in the strictest sense and continues to face notable limitations and challenges.
3.9.1 Lack of Multimodal Capabilities
Currently, LFMs mainly focus on breakthroughs in NLP, and their abilities in text-to-image and text-to-video generation are still far from practical use. This makes it hard to meet the multimodal needs of education. For example, LFMs are not good at generating teaching materials that combine text and visuals. Educational data naturally include rich and complex multimodal information, in which the different modalities complement one another and cannot substitute for each other. LFMs for K-12 education must be able to accurately understand and reason across multiple modalities.
3.9.2 Complex Reasoning Remains a Weakness
LFMs rely on the accumulation of massive correlations, which is essentially akin to the “Clever Hans” effect. They use a data-driven paradigm to identify correlations but fail to achieve reasoning that requires causal relationships. Xu et al. (
2023) showed that as the number of reasoning steps increases, the reasoning performance of LFMs (such as ChatGPT and text-davinci-003) drops significantly.
3.9.3 Poor Understanding of Diagrams
Diagrams are a unique type of visual content widely used in K-12 education, differing significantly from natural images in terms of sample quantity, low-level features, and high-level semantics. LFMs are primarily trained on massive datasets of natural images, fitting well to the distribution of these images. However, their performance remains poor on diagrams, which have distinct characteristics. This limitation affects their ability to achieve satisfactory results in solving problems in subjects like mathematics and physics.
3.9.4 Poor Personalization Capability
Currently, educational applications based on LFMs mainly focus on personalization concepts inherited from prior computer-assisted learning experiences. These applications are typically limited to a few aspects, such as recommending learning paths. However, effective teaching requires consideration of each student’s unique characteristics and personalized optimization based on their profiles.
3.9.5 Lack of Educational Expertise
Educational activities are highly specialized knowledge transfer and talent development processes that must follow systematic and rigorous educational theories. However, some LFMs fail to integrate educational and pedagogical theories into the entire process of model construction and application. As a result, issues still arise, such as inaccuracies in the knowledge and facts used for reasoning and reasoning processes that do not comply with the standards of educational applications.
4 Future Prospects
According to the research above, we summarize several directions for improvement and development in the field of LFMs for K-12 education.
(1) Knowledge understanding. To construct LFMs for K-12 education, it is essential to build multimodal databases tailored to educational Big Data and create diverse knowledge graphs. This would provide a foundation for deep understanding, complex reasoning, and accurate generation. Additionally, based on the multimodal databases and knowledge graphs, designing appropriate training tasks can help the LFMs better understand subject-specific knowledge and educational content.
(2) Teaching expertise. Integrating educational and teaching principles and theories into the design and optimization of LFMs can enable them to better assist teachers. One example is incorporating learning motivation theory and emotional factors from educational psychology into model training and application, so as to better meet the diverse needs of students in K-12 education. On this basis, accurate profiles can be generated that reflect students' learning behaviors, knowledge levels, and emotional states. Based on these profiles, adaptive teaching content planning algorithms should be developed to dynamically generate personalized teaching paths. These paths would consider both the student's cognitive level and emotional state, thereby better supporting the implementation of teaching activities by LFMs.
(3) Studying assistance. Education has shifted from a knowledge-transfer, lecture-based model to a more heuristic approach that focuses on developing core competencies. Therefore, LFMs should not only assist students with learning, but also emphasize skill development. For example, models could incorporate Socratic-style guiding interactions and empathetic response mechanisms to meet the needs of exploratory, interactive, and personalized learning.
(4) Privacy and ethics. Recently, privacy and ethical concerns have gained increasing attention, yet they remain insufficiently addressed in current research. To enhance data security, institutions can adopt differential privacy techniques to prevent data leakage (see the sketch after this list). Additionally, they can implement homomorphic encryption for secure computation on sensitive educational data and leverage federated learning to reduce the risks associated with centralized data storage. To address ethical challenges in model-generated content, such as bias, discrimination, and factual inaccuracies, techniques like adversarial debiasing and RLHF can be integrated. These technical methods are essential for improving the trustworthiness and security of LLMs in educational applications while ensuring alignment with pedagogical principles and ethical standards.
(5) Evaluation benchmarks and methods. Since LFMs are intended for students whose cognitive abilities are still developing, the reliability and safety of their outputs become even more critical. However, many existing LFMs still present significant application risks. Issues such as inaccurate content and the generation of biased, discriminatory, abusive, or unlawful statements remain unresolved. Therefore, it is crucial to establish reasonable benchmarks, which would involve rigorous testing of the model’s outputs to ensure their safety and reliability. Building on these benchmarks, evaluation methods for various application scenarios should be tailored to align with educational goals and students’ needs. For instance, in intelligent question-answering systems, metrics such as students’ comprehension levels and their improvement in problem-solving skills can serve as key evaluation indicators. This approach ensures a more precise assessment of the practical impact and effectiveness of LFMs in educational settings.
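As one concrete instance of the privacy techniques mentioned in point (4), the sketch below releases a class mean under epsilon-differential privacy via the Laplace mechanism; the score bound, epsilon, and data are illustrative assumptions.

```python
import random

def dp_mean(scores: list[float], epsilon: float, upper: float = 100.0) -> float:
    """Release the class mean with epsilon-differential privacy via the Laplace
    mechanism. For scores bounded in [0, upper], the sensitivity of the mean
    is upper / n. Laplace noise is sampled as the difference of two
    independent exponential variables."""
    sensitivity = upper / len(scores)
    noise = (sensitivity / epsilon) * (random.expovariate(1.0) - random.expovariate(1.0))
    return sum(scores) / len(scores) + noise

scores = [78, 85, 92, 61, 74, 88]  # illustrative student scores
print(f"true mean:    {sum(scores) / len(scores):.2f}")
print(f"private mean: {dp_mean(scores, epsilon=1.0):.2f}")
```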
5 Conclusions
The rapid advancement of LFMs is reshaping education and opening up new possibilities in various educational applications. This review provides an in-depth exploration of LFM applications, categorized into student and teacher support, such as recommending instructional resources, generating instructional content, solving problems, and answering questions. Along with a discussion of the challenges and limitations, we suggest directions for future research to guide and inspire further progress in using LFMs to improve education.