1 Drexel University, 3230 Market Street, Philadelphia, PA, USA
  adit.gupta@drexel.edu
2 Georgia Institute of Technology, North Avenue, Atlanta, GA, USA
  {jreddig,dweitekamp,cmaclellan}@gatech.edu
3 Politecnico di Torino, Torino, Italy
  tommaso.calo@polito.it
Beyond Final Answers:
Evaluating Large Language Models for Math Tutoring
Abstract
Researchers have made notable progress in applying Large Language Models (LLMs) to solve math problems, as demonstrated through efforts like GSM8k, ProofNet, AlphaGeometry, and MathOdyssey.
This progress has sparked interest in their potential use for tutoring students in mathematics.
However, the reliability of LLMs in tutoring contexts—where correctness and instructional quality are crucial—remains underexplored.
Moreover, LLM problem-solving capabilities may not necessarily translate into effective tutoring support for students.
In this work, we present two novel approaches to evaluate the correctness and quality of LLMs in math tutoring contexts.
The first approach uses an intelligent tutoring system for college algebra as a testbed to assess LLM problem-solving capabilities.
We generate benchmark problems using the tutor, prompt a diverse set of LLMs to solve them, and compare the solutions to those generated by the tutor.
The second approach evaluates LLM as tutors rather than problem solvers.
We employ human evaluators, who act as students seeking tutoring support from each LLM.
We then assess the quality and correctness of the support provided by the LLMs via a qualitative coding process.
We applied these methods to evaluate several ChatGPT models, including 3.5 Turbo, 4, 4o, o1-mini, and o1-preview.
Our findings show that when used as problem solvers, LLMs generate correct final answers for 85.5% of the college algebra problems tested.
When employed interactively as tutors, 90% of LLM dialogues show high-quality instructional support; however, many contain errors—only 56.6% are entirely correct.
We conclude that, despite their potential, LLMs are not yet suitable as intelligent tutors for math without human oversight or additional mechanisms to ensure correctness and quality.
Keywords:
Intelligent Tutors, Large Language Models, Generative AI, Math Education
1 Introduction
Large Language Models (LLMs) have started to exhibit moderate proficiency at mathematical problem solving. For example, GPT-4 correctly solves over 90% of the problems in the GSM8K benchmark [4] and approximately 80% of the problems in the MATH benchmark [6] using advanced prompting techniques [21]. Although these results indicate progress, there are still many limitations. Findings from the GSM-Symbolic benchmark [19] suggest that LLMs struggle with perturbed or novel problem formulations that are easily solved by humans, indicating that their relatively high performance on standard benchmarks is partially due to memorization. Furthermore, LLM performance remains inconsistent across different problem classes, in contrast to traditional intelligent tutors, which are developed to provide 100% accurate support. These inconsistencies warrant a deeper investigation into the capabilities, limitations, and implications of LLMs for education.
Companies such as Duolingo and Khan Academy have started to leverage LLMs to offer personalized learning experiences, facilitate interactive problem-solving, and provide real-time feedback to learners. However, significant challenges remain to ensure the accuracy, reliability, and adaptability of LLMs in tutoring settings. Despite their remarkable capabilities, studies have shown that LLMs frequently produce plausible yet incorrect solutions to complex mathematical problems, especially in areas that require precise calculations and multi-step reasoning [19, 10]. In mathematics, not only is the correctness of the final answer crucial, but also the quality of stepwise guidance that fosters effective learning. One recent classroom study comparing LLM tutoring to traditional classroom instruction showed positive results [12]. However, considering that LLMs likely produce errors in around 10% of responses—using the best GSM8k performance as an optimistic measure—there is a possibility that they may do more harm than good. The manner in which LLMs confidently “hallucinate” incorrect, yet seemingly plausible information is a recipe for several possible negative effects [11]. In the best case, students’ trust in LLM tutors may be rightfully eroded upon recognizing mistakes. In the worst case, LLM hallucinations may lead students to form misconceptions that compromise future learning.
LLM tutors mark an unusual inflection in the history of intelligent tutoring systems. It has been known for decades that automated computer-delivered tutors produce learning gains comparable to or greater than those of human tutors [25, 16], who famously provide learning gains up to two standard deviations higher than those from traditional classroom instruction [2]. The original artificial intelligence (AI) tutors, hard-coded intelligent tutoring systems, found success by taking a cognitivist approach to tutoring: tracking and quantifying student knowledge by comparison to an expert model [5], and then adapting instruction accordingly [20, 24]. As compelling as LLMs’ generative capabilities are, when used as standalone tutors they arguably mark a regression in actual AI tutoring capabilities compared to traditional ITSs, since they are not consistently accurate and lack the cognition-oriented adaptivity of prior approaches.
This study evaluates the potential of LLMs in educational contexts by systematically assessing their performance on structured algebra tasks. We selected algebra for this study, given its long-standing use in previous research on intelligent tutoring systems [15, 8, 1]. This study aims to investigate the following questions:
- RQ1: How accurately can LLMs generate solutions to the kinds of algebra problems currently supported by intelligent tutoring systems?
- RQ2: What is the accuracy and quality of the tutoring support provided by LLMs (e.g., scaffolding, hints, and feedback) on these algebra problems?
We employ two techniques to explore these questions: (1) an automated approach that uses an existing algebra tutor as a testbed for evaluating LLM problem solving and (2) a qualitative approach to assess the quality and correctness of LLM dialogues generated by having evaluators interactively prompt an LLM for tutor support.
For the second method, we also conducted a thematic analysis [3] to identify and categorize observations about LLM tutoring behaviors.
The findings of this study contribute to the field of intelligent tutoring systems by providing empirical evidence on the strengths and limitations of LLMs in math tutoring contexts, thus enriching the ongoing discourse on the role of AI in supporting learning. Specifically, our study makes the following contributions:
- We introduce a novel method that uses intelligent tutors as testbeds for evaluating LLM problem solving.
- We introduce a second method for interactively evaluating LLM tutoring correctness and quality.
- We show that while LLMs largely generate responses aligned with pedagogical best practices, they frequently contain mistakes and inaccuracies, suggesting they are not yet ready for direct in-class deployment.
- We offer actionable guidelines for developers, emphasizing how LLMs can support aspects of tutoring, such as question generation and hint production, rather than serving as comprehensive, standalone tutoring solutions.
2 Related Works
Researchers have begun exploring the use of LLMs in educational applications. The existing literature shows that LLMs can generate worked examples and guide structured problem solving. For example, WorkedGen [27] uses prompt chaining and one-shot learning to produce interactive programming examples. Although user studies indicate that 77% of students found WorkedGen helpful, such self-reported feedback does not necessarily confirm improved learning outcomes. Similarly, Jamplate [14] harnesses AI-powered templates for idea generation and provides reflection-based scaffolding, though the authors note a tendency toward reduced critical thinking among students.
Although these studies highlight the potential of LLMs to create structured examples and facilitate reflective engagement, researchers must develop a consistent, stepwise evaluation framework for algebraic or multi-step reasoning tasks. Existing benchmarks, such as GSM8K [4], assess the accuracy of the final answer rather than examining the detailed intermediate steps or the iterative feedback necessary for model-tracing [9] in a typical tutoring context. As a result, there is still a need for a more systematic methodology that tests how effectively LLMs handle multi-step problems and adapt to the pedagogical requirements of a tutoring environment.
Another growing area of research investigates the use of LLMs for tutoring non-fluent English speakers across various domains. For example, a comparative study of models such as GPT-4, Llama-2-ko-DPO-13B, and eT5-chat reveals trade-offs between individualization and correctness [22]. Smaller models provided more personalized interactions, while GPT-4 exhibited greater correctness but less personalized assistance. Tutoring is an immensely personal activity, and both correctness and individualization are needed. These studies demonstrate the need for more investigation into LLM shortcomings in stepwise instruction, as well as how to better integrate LLMs into existing intelligent tutoring platforms. Moreover, current studies often prioritize correctness of the final answer, overlooking the quality of intermediate steps that are crucial for meaningful learning [26]. For example, in mathematics education, breaking problems down into their steps ensures that students grasp foundational concepts rather than simply arriving at the correct solution.
The work in this paper aims to fill this gap by introducing a novel method that evaluates LLM performance on a wide range of math questions from college algebra, generated from the Apprentice Tutors platform [7]. This platform was designed as a web-based intelligent tutoring platform to support personalized learning in mathematics. The platform supports more than ten tutors including topics like radicals, factoring polynomials, and solving logarithmic equations.
3 Methodology
We employ two complementary approaches to evaluate LLMs. First, we developed an automated approach that uses an existing intelligent tutoring system to assess LLM problem-solving accuracy. We generate problems from the tutor and submit them to multiple LLMs. We then use the tutor expert model to evaluate the correctness of the LLM responses. Second, to evaluate LLMs as tutors, we had evaluators interactively engage with the LLMs to request tutoring guidance as if they were students. We then qualitatively evaluated the tutor support generated by the LLMs.
We collected and analyzed data from both approaches to analyze the strengths and limitations of LLMs in structured problem-solving tasks. Figures 1 and 2 illustrate the workflows for these methodologies, which we describe in further detail below.
3.1 Evaluating LLMs Using Intelligent Tutors as Testbeds
Figure 1: Our proposed tutor-based LLM evaluation pipeline. The process begins with the testbed setup, hosted on Google Colab (A), followed by problem and solution generation using the intelligent tutor, which we call Apprentice Tutors (B). During evaluation, we submit 22 types of problems to LLMs such as GPT-3.5 Turbo and GPT-4 (C). Each LLM’s response is checked by submitting it, along with the tutor’s answer, to another LLM (D). We then perform human verification to confirm the accuracy of the second model’s assessments (E). Finally, all results are logged in a performance-tracking spreadsheet (F).
We developed our evaluation system in Python to automate tutor problem generation, LLM interaction, and response evaluation. This allowed us to systematically test multiple LLMs on a variety of educational tasks. For this study, we evaluated GPT-3.5 Turbo, GPT-4, GPT-4o, o1 Mini, and o1 Preview. Although these models were the focus of this analysis, the benchmarking tool is designed to be extensible to other tutor content and can be easily adapted to test other models, such as Google’s Gemini, Anthropic’s Claude series, or Deepseek’s open models.¹

¹ To maintain anonymity, we have not shared the link to the automated benchmark or data. However, upon acceptance, we will provide the GitHub link to all our code and data.
The workflow for the tutor-based evaluation process is outlined in Figure 1. The process begins with the testbed setup (A), where we define parameters for the evaluation, including the number and type of algebra problems to be tested. For this study, we identified 22 problem types from the Apprentice Tutors platform and generated five problems of each type. The problems and their corresponding step-by-step solutions were generated directly from the Apprentice Tutors software (B). The generated problems were then submitted to each LLM (C), which was prompted to produce a solution. The exact prompt provided to each LLM was:
The benchmarking system processed the response from the initial LLM prompt and extracted the output, generating a structured list of questions paired with their corresponding answers produced by the LLM. The LLM responses were then evaluated by submitting the original problem and the generated solution to a second LLM (D) to verify accuracy and logical consistency. We used GPT-4 as the evaluation model in all tests. Each LLM was evaluated sequentially and the results were recorded and analyzed before proceeding to the next model. The exact prompt used by the second LLM to evaluate each answer was:
To further validate the quality of the LLM evaluation, we performed manual human verification (E). During this process, reviewers compared the ground-truth responses generated by the Apprentice Tutors platform with the responses produced by the LLM and the correctness assessments provided by the second LLM. In certain cases, discrepancies arose due to differences in interpretation, such as when the second LLM marked an unsimplified but equivalent expression as incorrect because it expected the fully simplified answer. These instances were noted, and the human reviewer marked the answer as correct if it was mathematically accurate in its final form. However, stepwise solutions were also considered, ensuring that intermediate simplifications (e.g., distinguishing an unsimplified form from its simplified value of 2 when necessary) aligned with expected problem-solving conventions.
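Such equivalence discrepancies could, in principle, be reduced with symbolic checking. The snippet below is a minimal sketch, not part of our actual pipeline (which relied on human review), of how a tutor answer and an LLM answer might be compared with sympy; the example answer strings are purely illustrative.

```python
# A minimal sketch of automated answer-equivalence checking with sympy.
# This was NOT part of our evaluation pipeline (we used human review);
# it only illustrates how simplification differences could be reconciled.
from sympy import simplify, sympify, SympifyError

def answers_equivalent(tutor_answer: str, llm_answer: str) -> bool:
    """Return True if the two answer strings are symbolically equal."""
    try:
        diff = simplify(sympify(tutor_answer) - sympify(llm_answer))
        return diff == 0
    except SympifyError:
        # Fall back to a strict string comparison if parsing fails.
        return tutor_answer.strip() == llm_answer.strip()

# Illustrative examples: unsimplified vs. simplified forms agree.
print(answers_equivalent("2*sqrt(9)", "6"))            # True
print(answers_equivalent("1/2", "0.5"))                # True
print(answers_equivalent("x**2 - 1", "(x-1)*(x+1)"))   # True
```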
Finally, all data collected, including LLM responses, human evaluations, and any discrepancies identified during the validation process, were systematically recorded in a performance tracker spreadsheet (F). This structured logging approach facilitated detailed analysis and allowed for a robust comparison between different LLM models and problem types. The goal was to gain insights into their performance and limitations.
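To make the pipeline concrete, the sketch below shows a condensed version of the benchmarking loop covering steps C, D, and F; the problem list is assumed to come from the tutor (step B), and human verification (step E) happens afterwards. The prompt wording and the tutor data structures here are placeholders, not the exact prompts or interfaces used in our system.

```python
# A condensed sketch of the tutor-based benchmarking loop (steps C, D, F).
# Prompt wording and the problem representation are placeholders; the real
# system uses the Apprentice Tutors generators and the prompts described above.
import csv
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def solve(model: str, question: str) -> str:
    """Ask the candidate LLM to solve one tutor-generated problem."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Solve this algebra problem: {question}"}],
    )
    return resp.choices[0].message.content

def grade(question: str, tutor_answer: str, llm_answer: str) -> str:
    """Use a second LLM (GPT-4) to judge correctness against the tutor answer."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (f"Question: {question}\nReference answer: {tutor_answer}\n"
                        f"Candidate answer: {llm_answer}\nReply 'correct' or 'incorrect'."),
        }],
    )
    return resp.choices[0].message.content.strip().lower()

def run_benchmark(models, problems, out_path="performance_tracker.csv"):
    """problems: list of (problem_type, question, tutor_answer) tuples."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "problem_type", "question", "llm_answer", "grade"])
        for model in models:
            for ptype, question, tutor_answer in problems:
                answer = solve(model, question)
                verdict = grade(question, tutor_answer, answer)
                writer.writerow([model, ptype, question, answer, verdict])
```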
3.2 Evaluating LLMs via Interactive Prompting
We conducted a second study to assess how the LLMs perform when interacting with learners as tutors (see Figure 2). This study was designed to provide a qualitative perspective on the educational capabilities of the model, in contrast to the previous automated evaluation of problem-solving capabilities. We performed manual evaluation of the LLMs, using a standardized variation of the prompt from Salman Khan’s widely cited ChatGPT interview [13] to ensure consistency between sessions. By comparing these human-guided interactions with the outputs of the intelligent tutor, we investigated how well the step-by-step guidance of LLMs align with real user queries and misconceptions in a math tutoring context. Here is the prompt that was used:
Figure 2: Our proposed method for evaluating LLMs via interactive prompting, illustrated using the ChatGPT chat interface. The workflow begins with an evaluator prompting ChatGPT with a tutoring problem as a student would (A). The evaluator interactively submits queries and receives step-by-step guidance from ChatGPT (B). Each chat dialogue is systematically logged in a spreadsheet tracker for analysis (C).
The problems were taken directly from the Apprentice Tutors and entered into ChatGPT. The evaluators interacted with the model as it guided them through problem-solving, responding to ChatGPT’s hints and prompts as they progressed. After each session, they logged a link to the chat history, the final answers provided by ChatGPT, and whether the responses matched the correct answers generated by the Apprentice Tutors. An example of this recording is shown in part B of Figure 2.
After collecting all responses, two independent reviewers assessed the interactions to answer the following questions about each tutoring dialogue:
- Quality: Do the steps represent a high-quality tutoring interaction?
- Correctness: Were all the LLM responses in the dialogue correct?
| Criterion | 1 (Poor) | 2 (Fair) | 3 (Good) | 4 (Excellent) |
|---|---|---|---|---|
| Clarity of Explanation | Explanations are frequently unclear, leading to student frustration. | Explanations are sometimes unclear or overly complex. | Explanations are generally clear, with occasional confusion. | Explanations are consistently clear, logical, and easy to understand. |
| Feedback | Feedback is rare or unhelpful, lacking specificity. | Feedback is provided, but is sometimes vague or untimely. | Feedback is generally constructive, specific, and timely. | Feedback is consistently constructive, specific, timely, and enhances student understanding. |
| Scaffolded Support | The tutor provides too little or too much support, hindering learning. | Support is inconsistent, with occasional mismatches to the student’s needs. | Support is generally well-matched to the student’s needs, with appropriate scaffolding. | Support is perfectly calibrated, with scaffolding gradually reduced as the student gains confidence. |
| Problem-Solving Strategies | Problem-solving strategies are not discussed or modeled. | Some strategies are mentioned, but with limited discussion or modeling. | Effective problem-solving strategies are discussed and modeled. | The tutor effectively models and teaches problem-solving strategies, encouraging independent thinking. |
| Encouragement and Reinforcement | The tutor provides little to no encouragement, leading to a negative learning atmosphere. | Encouragement is sporadic, sometimes failing to motivate the student. | The tutor provides positive reinforcement, generally maintaining a supportive atmosphere. | The tutor consistently encourages and motivates the student, fostering a highly positive learning environment. |

Table 1: The five criteria used to evaluate the quality of interactive tutoring dialogues (scores of 1-2 are considered low, 3-4 high).
To answer the first question and classify response quality, each dialogue was evaluated with respect to the structured rubric shown in Table 1. Each criterion was scored on a scale from 1 to 4. The rubric was designed to measure several key aspects of tutoring quality. Specifically, it evaluated (A) the correctness of explanations, (B) the depth of scaffolding, and (C) the alignment with instructional best practices. This structured approach aimed to reflect established learning science principles. We summed the scores across all criteria. If the total score was 10 or below, the response was categorized as “No” (not good quality); if the score was above 10, it was categorized as “Yes” (good quality). Once the scores were converted into Yes/No labels, we measured inter-rater reliability using Cohen’s Kappa [18] to assess agreement between reviewers and confirm the robustness of our classifications.
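For concreteness, the sketch below shows how the rubric totals map to Yes/No labels and how reviewer agreement could be computed with scikit-learn’s cohen_kappa_score. The rating values in the example are hypothetical and are not drawn from our study data.

```python
# A minimal sketch of the rubric-to-label conversion and the inter-rater
# agreement calculation. Each dialogue is rated on the five criteria in
# Table 1 (1-4 each, so totals range from 5 to 20); scores here are invented.
from sklearn.metrics import cohen_kappa_score

def quality_label(criterion_scores: list[int]) -> str:
    """Sum the five rubric scores; totals above 10 count as high quality."""
    return "Yes" if sum(criterion_scores) > 10 else "No"

# Hypothetical ratings from the two reviewers for three dialogues.
reviewer1 = [quality_label(s) for s in [[4, 3, 4, 3, 4], [2, 2, 1, 2, 2], [3, 3, 3, 2, 3]]]
reviewer2 = [quality_label(s) for s in [[4, 4, 3, 3, 4], [2, 1, 2, 2, 2], [3, 3, 2, 3, 3]]]

kappa = cohen_kappa_score(reviewer1, reviewer2)
print(f"Cohen's kappa: {kappa:.2f}")
```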
To answer the second question and assess response correctness, each reviewer also independently assessed whether all of the LLM-generated content was correct. These evaluations involved reviewing each LLM response in the dialogue and noting any errors. If there were any errors, the dialogue was coded as incorrect; otherwise it was recorded as correct. As with the first question, we measured inter-rater reliability of the two evaluations using Cohen’s Kappa. We also conducted a thematic analysis [3] of the LLM responses to identify recurring patterns in tutoring interactions. The goal was to identify and categorize common patterns, counting their frequency and noting whether they corresponded to positive or negative tutoring behaviors.
4 Results and Analysis
We present the results of the two evaluation methods used to assess the performance of LLMs in math tutoring contexts. We first present the results from our tutor-based evaluation method and then the results from evaluators interactively prompting the LLMs as if they were students.
4.1 Tutor-Based LLM Evaluation Results
Table 2 summarizes the results from the tutor-based evaluation. The Apprentice Tutors platform had 22 problem types, and we sampled 5 problems of each type to produce a test set containing 110 problem-answer pairs.
Table 2: Results of the tutor-based LLM evaluation, in which only final answers were assessed.

| Model | # Problem Types | # Problems | # Correct | Accuracy |
|---|---|---|---|---|
| GPT-3.5 Turbo | 22 | 110 | 85 | 77.3% |
| GPT-4 | 22 | 110 | 83 | 74.5% |
| GPT-4o | 22 | 110 | 107 | 97.3% |
| o1-mini | 22 | 110 | 101 | 91.8% |
| o1-preview | 22 | 110 | 94 | 85.5% |
| Overall Avg. | 22 | 110 | 94 | 85.5% |
We identified twenty-five instances (6.3% of total responses) where the second LLM marked answers incorrectly.
From our observations, the second LLM would incorrectly grade an answer when comparing the tutor-generated response to the LLM-generated response for the following reasons: (1) a mismatch in the order of terms or operations, (2) differences in simplification (e.g., a fractional form vs. its decimal equivalent of 0.5), and (3) differences in operator notation (e.g., multiplication represented by “x” vs. “*”).
4.2 Interactive Prompting-Based LLM Evaluation
The evaluators prompted each of the five models to provide tutoring support on the same set of 30 problems, resulting in a total of 150 LLM dialogues. The two reviewers independently analyzed each dialogue to classify whether it was high quality (using the rubric from Table 1) and fully correct. We also evaluated the final answer accuracy from each dialogue by comparing it to the tutor solution.
Table 3 shows the results of these assessments, breaking out the accuracy of final LLM answers alongside reviewer assessments of the quality and correctness of the LLM dialogues.
We calculated Cohen’s Kappa (κ) to evaluate reviewer agreement for both the quality and correctness ratings. This score, which ranges from 0 to 1, represents agreement after correcting for chance, and a score greater than 0.7 is typically viewed as strong agreement. The κ values for both the quality ratings and the assessment of whether the LLM dialogues were entirely correct exceeded this threshold, indicating very strong agreement between the independent reviewers.
Table 3: Assessments of the quality and correctness of LLM tutoring interactions. Columns report final-answer accuracy alongside the percentage of LLM dialogues judged high quality and fully correct by reviewers R1 and R2; the number of dialogues in each case is shown in parentheses.

| Model | # Problems | Final Accuracy | High Quality (R1) | High Quality (R2) | Fully Correct (R1) | Fully Correct (R2) |
|---|---|---|---|---|---|---|
| GPT-3.5 Turbo | 30 | 90.0% (27) | 90.0% (27) | 90.0% (27) | 46.7% (14) | 53.3% (16) |
| GPT-4 | 30 | 83.3% (25) | 93.3% (28) | 93.3% (28) | 43.3% (13) | 50.0% (15) |
| GPT-4o | 30 | 93.3% (28) | 90.0% (27) | 90.0% (27) | 70.0% (21) | 80.0% (24) |
| o1-mini | 30 | 86.7% (26) | 86.7% (26) | 80.0% (24) | 56.7% (17) | 43.3% (13) |
| o1-preview | 30 | 90.0% (27) | 90.0% (27) | 96.7% (29) | 73.3% (22) | 50.0% (15) |
| Overall Avg. | 30 | 88.6% | 90.0% (avg. of both reviewers) | | 56.6% (avg. of both reviewers) | |
Table 4: Summary of key observations from the interactive tutoring evaluation.

| Observation | Occurrences | Sentiment |
|---|---|---|
| The final answer was correct, even though there were inaccuracies in the sub-steps. | 6 | Negative |
| Even though the prompt was not to share an answer, it was possible to obtain the answer by manipulating responses (via yes/no questions). | 4 | Negative |
| For topics like factoring, there was an overemphasis on teaching basics (e.g., multiplying) instead of demonstrating specific methods (e.g., “slip and slide”). | 4 | Negative |
| LLM over-indexes on ensuring the final answer is correct rather than emphasizing the student’s step-by-step skill acquisition. | 3 | Negative |
| LLM occasionally produces an incorrect conclusion and refuses to accept a correct student answer. | 4 | Negative |
| Sub-steps are sometimes flagged as incorrect even though they are actually correct. | 3 | Negative |
| Difficult math notation (e.g., quadratic expressions) can be challenging to input from a standard keyboard. | 2 | Negative |
| LLM is flexible about answer formats, accepting multiple notational styles. | 3 | Positive |
| LLM excels at generating hints and extra worked examples to support instruction. | 2 | Positive |
| LLM provides encouraging feedback and positive reinforcement, which could benefit student well-being. | 7 | Positive |
| LLM nudges students to answer queries in sequence when they attempt to skip ahead. | 2 | Positive |
Finally, the reviewers evaluated each model’s performance, documenting key behavioral patterns and noting any common issues. Table 4 summarizes these findings, classifying observations as positive or negative based on their potential impact on learners. This analysis highlights the strengths and weaknesses of each model’s tutoring approach, offering insight into their effectiveness in guiding students through problem-solving.
5 Discussion
Both of our evaluation methods suggest that the LLMs show reasonable final-answer accuracy. Our tutor-based evaluation showed that GPT-4o had the highest final-answer accuracy at 97.3%, and in the interactive prompting-based evaluation GPT-4o again achieved the highest accuracy at 93.3%.
Although these accuracies seem reasonable, they still mean that these models will generate incorrect final answers for about 1 in 18 problems. We were also surprised to find that GPT-4 performed the worst and that model rankings changed depending on the evaluation method.
Newer models often performed worse than older ones, despite expected improvements. This suggests that LLM performance can no longer be expected to improve with each new model release and that there are continued gaps in how LLMs process multi-step math questions.
In terms of final answer accuracy, we also observed that the interactive prompt-based evaluation results were higher than those from the tutor-based evaluation, probably because human testers engaged in multi-turn interactions, allowing LLMs to refine responses. In contrast, the tutor-based evaluation provided only a single prompt, requiring the model to solve problems correctly in one step.
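One way to see the difference between the two evaluation modes is in the message structure sent to the chat API: the tutor-based benchmark sends a single user message with one chance to answer, while the interactive evaluation accumulates a growing conversation in which errors can be surfaced and corrected. The sketch below illustrates this contrast; the tutoring prompt and the example problem are illustrative and are not the exact wording used in our study.

```python
# Sketch contrasting single-shot vs. multi-turn evaluation. Prompt wording
# and the example problem are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

# Tutor-based evaluation: a single prompt, one chance to answer.
single_shot = [
    {"role": "user", "content": "Solve: log_2(x) + log_2(x - 2) = 3"},
]

# Interactive evaluation: the model tutors across turns and can refine itself.
multi_turn = [
    {"role": "system", "content": "You are a tutor. Guide the student step by step; do not give the final answer outright."},
    {"role": "user", "content": "Help me solve log_2(x) + log_2(x - 2) = 3."},
    {"role": "assistant", "content": "Start by combining the logs. What is log_2(x) + log_2(x - 2) as a single logarithm?"},
    {"role": "user", "content": "log_2(x*(x - 2))?"},
    # ...the dialogue continues, so mistakes can be caught and corrected.
]

for messages in (single_shot, multi_turn):
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(resp.choices[0].message.content)
```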
During our interactive prompting-based evaluation, we found that LLMs generate high-quality tutoring support most of the time.
Although GPT-4 had the lowest final answer accuracy, it scored near the top in terms of quality, with 93.3% of its chat dialogues being classified as having high pedagogical quality.
Although the LLMs achieved reasonable final answer accuracies, we found that when considering their entire tutoring dialogues they often would make mistakes. Along this dimension, GPT-4o achieved the best performance, with 75% of its dialogues being classified as fully correct (averaging across the two reviewers).
Across all five of the LLMs only 56.6% of the tutor dialogues were entirely correct.
This suggests that nearly 1 in 2 interactive LLM tutoring sessions with a student will likely contain errors.
These results suggest that LLM error rates are likely much higher than would be suggested by benchmarks that evaluate only final answers. This raises concerns about their use as standalone tutors, as less-than-perfect accuracy can harm learners. If one in two dialogues contains errors, students may lose trust in the tutor and, worse, develop misconceptions that hinder future learning.
Our results also suggest that future evaluations must consider the correctness of the entire tutoring dialogue, not just the final answer accuracy.
Unlike intelligent tutors, which are often developed through meticulous cognitive task analysis [17] to ensure that sub-steps are carefully designed for effective learning, LLMs tend to prioritize arriving at the final answer rather than ensuring students understand the intermediate steps. For example, in factoring problems, the LLMs frequently provided overly general guidance, such as basic multiplication rules, instead of focusing on specific strategies such as the “slip-and-slide” method explicitly requested in the task. Although such a response would still be marked correct in our evaluation, not using the specified method reduces the overall quality of the tutoring guidance.
Table 4 summarizes reviewer observations of tutoring interactions. Reviewers noted that the LLMs sometimes refused to accept correct answers, miscalculated sub-steps, or overemphasized fundamentals at the expense of specialized techniques. However, the LLMs also provided flexible response formats, detailed hints, and encouraging feedback. Manual review of chat logs revealed inconsistencies in how LLMs handled intermediate steps. While they often produced correct final answers, sub-steps were occasionally miscalculated or erroneously flagged as incorrect (8 instances, as shown in Table 4). These errors fell into three categories: (1) simplified vs. unsimplified answers, (2) differences in term order, and (3) formatting mismatches (e.g., missing required LaTeX tags). These issues highlight inconsistencies in how LLMs evaluate semantic equivalence.
Though LLMs may fall short of ITSs when tutoring, recent work demonstrates how LLMs could support ITSs in hint generation. By leveraging the expert model of an ITS, LLMs can generate correctness feedback personalized to student responses without needing the LLM to perform any mathematical calculations [23].
If integrated with other educational technologies, they could offer several potential benefits. Their ability to generate hints, provide alternative explanations, and accommodate various answer formats makes them flexible and adaptive. However, our finding that LLMs sometimes mark correct answers as incorrect, even when provided with the solutions, suggests that LLM integrations will need to be carefully evaluated before deployment.
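The sketch below illustrates this division of labor under our own assumptions, in the spirit of [23] but not its implementation: a hypothetical expert_model_check function stands in for the tutor’s expert model, which determines correctness, while the LLM is asked only to phrase encouraging feedback rather than to do any math.

```python
# A hedged sketch of pairing an ITS expert model with an LLM for feedback.
# expert_model_check and the prompt wording are hypothetical placeholders;
# the correctness judgment never depends on the LLM.
from openai import OpenAI

client = OpenAI()

def expert_model_check(step: str, expected_step: str) -> bool:
    """Placeholder for the tutor's expert model, which already knows the
    correct next step; no LLM math is involved in this judgment."""
    return step.replace(" ", "") == expected_step.replace(" ", "")

def personalized_feedback(problem: str, student_step: str, expected_step: str) -> str:
    correct = expert_model_check(student_step, expected_step)
    prompt = (
        f"Problem: {problem}\n"
        f"Student's step: {student_step}\n"
        f"The tutoring system has determined this step is "
        f"{'correct' if correct else 'incorrect'}; the expected step is {expected_step}.\n"
        "Write one short, encouraging feedback message. Do not re-derive the math."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Example usage with a hypothetical factoring step.
print(personalized_feedback("Factor x^2 + 5x + 6", "(x + 2)(x + 3)", "(x+2)(x+3)"))
```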
Additionally, their use of positive reinforcement, such as motivational nudges and encouraging feedback, could foster an engaging learning environment, as long as all learners receive equal encouragement. For example, several chat logs included statements like, “Way to go, you are close to the answer!” or “That’s not right, but let’s keep trying.” These reinforcements might help motivate learners to persist and promote sustained engagement with educational tools. This approach aligns with adult learners’ need for constructive feedback and encouragement [7]. Future research should systematically evaluate the motivational potential of LLM interactions and whether they translate into improved learning outcomes.
6 Limitations and Future Work
One limitation of our LLM evaluation methods is that they did not directly evaluate against actual students. We chose our approach because we knew that LLMs have reliability issues and we did not want to cause harm to students by giving them incorrect tutoring guidance during our experiments. Although our approach provides a means of safe, controlled evaluation, it may not fully capture the unique ways in which real students would engage with LLM tutors. A future iteration of this work could involve deploying the system in real-world educational settings and analyzing chat logs generated from authentic student interactions to gain a more comprehensive understanding of LLM performance. However, our work suggests that LLMs make mistakes in nearly half of their student tutoring dialogues, so future research must account for the risks to students that this poses.
Furthermore, this study was conducted using an earlier version of the Apprentice Tutors platform, which focused exclusively on math-related questions. The Apprentice Tutors platform has since been expanded to include other types of questions, such as those related to nursing education. Future research could explore how LLMs, including ChatGPT, perform in this and other domains, extending the scope of evaluation to understand their domain-specific adaptability and effectiveness.
Finally, another limitation is that the analysis presented was restricted to a set of LLMs within the ChatGPT family. With the rapid development of new LLMs, such as Google’s Gemini, Anthropic’s Claude, and several open-source LLMs, there is an opportunity to expand this study to evaluate these emerging models too. Comparing performance across a broader range of LLMs would provide a more holistic view of their strengths and weaknesses in educational contexts. In addition, this study used the publicly available versions of the ChatGPT models accessible online. In practice, commercial production environments often deploy models that are fine-tuned to specific domains or tasks. Evaluating a custom-tuned LLM tailored to specific educational needs could offer a more accurate representation of how these tools would perform in real-world applications.
7 Conclusion
In this study, we evaluated the ability of various LLMs to solve college algebra problems and to interactively provide step-by-step tutor guidance. We evaluated multiple models, including GPT-3.5 Turbo, GPT-4, GPT-4o, o1 Mini and o1 Preview, identifying both their strengths and limitations. The performance results presented in this study, though commendable, are significantly lower than the 100% accuracy achieved by traditional intelligent tutors on the same set of problems. While we saw an overall final accuracy of 85.5% with the automated tutor-based evaluation and 88.6% with our interactive prompting-based evaluation, our analysis of the entire LLM tutoring dialogues showed that only 56.6% were entirely correct. This discrepancy suggests a core limitation of using LLMs as tutors: while they often generate correct final answers, ensuring the pedagogical soundness and accuracy of intermediate steps remains challenging.
Despite these limitations, LLMs exhibit several capabilities that have the potential to improve learning outcomes. Their flexibility in accepting diverse answer formats, the ability to generate hints and alternative problem explanations, and the use of positive reinforcement, such as motivational nudges, could help foster a more supportive and engaging learning environment. However, there are risks associated with the deployment of LLMs in educational settings. For example, biases within the models may lead to inflexibility in pedagogical approaches, such as internal biases that favor some methods of solving problems over others. Furthermore, inaccuracies in responses—with around one in two dialogues containing errors—can undermine the trust of students in the guidance of the tutor and reduce their confidence in the system. To address these challenges, future work might explore how to leverage their independent capabilities, such as problem generation, hint generation, and positive reinforcement. By balancing these strengths with strategies to manage and mitigate their limitations, LLMs could effectively supplement other educational technologies, such as intelligent tutoring systems.
Acknowledgments
This work was generously funded in part by several sources, including the NSF National AI Institutes program (#2247790 and #2112532). The views, opinions and/or findings expressed are those of the authors and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. We also thank the members of the Teachable Artificial Intelligence (TAIL) Lab for their feedback and suggestions.
References
- [1] V. Aleven, B. McLaren, J. Sewall, and K. Koedinger. The cognitive tutor authoring tools (ctat): Preliminary evaluation of efficiency gains. In M. Ikeda, K. Ashley, and T. Chan, editors, Intelligent Tutoring Systems, volume 4053 of Lecture Notes in Computer Science, pages 61–70, Berlin, Heidelberg, 2006. Springer.
- [2] B. S. Bloom. The 2 sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational researcher, 13(6):4–16, 1984.
- [3] V. Braun and V. Clarke. Using thematic analysis in psychology. Qualitative Research in Psychology, 3(2):77–101, 2006.
- [4] K. Cobbe, V. Kosaraju, M. Bavarian, J. H. Chen, C. Raffel, and J. Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [5] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User modeling and user-adapted interaction, 4:253–278, 1994.
- [6] M. Fang, X. Wan, F. Lu, F. Xing, and K. Zou. Mathodyssey: Benchmarking mathematical problem-solving skills in large language models using odyssey math data. arXiv preprint arXiv:2406.18321, 2024.
- [7] A. Gupta, M. Siddiqui, G. Smith, J. Reddig, and C. MacLellan. Intelligent tutors for adult learners: An analysis of needs and challenges. arXiv preprint, Nov. 2024.
- [8] N. Heffernan and C. Heffernan. The assistments ecosystem: Building a platform that brings scientists and teachers together for minimally invasive research on human learning and teaching. Int J Artif Intell Educ, 24:470–497, 2014.
- [9] N. T. Heffernan, K. R. Koedinger, and L. Razzaq. Expanding the model-tracing architecture: A 3rd generation intelligent tutor for algebra symbolization. International Journal of Artificial Intelligence in Education, 18(2):153–178, 2008.
- [10] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring Mathematical Problem Solving With the MATH Dataset. arXiv preprint, arXiv:2103.03874, 2021.
- [11] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. 2023. Accepted by ACM Transactions on Information Systems (TOIS); revised version (v2) submitted on November 19, 2024.
- [12] G. Kestin, K. Miller, A. Klales, et al. AI Tutoring Outperforms Active Learning. Research Square, Preprint (Version 1), 2024.
- [13] S. Khan. Sal khan explores chatgpt in education. YouTube video, March 2023. Published on Khan Academy’s official channel.
- [14] J. S. Kim and Lee. Jamplate: Exploring llm-enhanced templates for idea reflection. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, page 1–14, New York, NY, USA, 2023. Association for Computing Machinery.
- [15] K. R. Koedinger, J. R. Anderson, W. H. Hadley, and M. A. Mark. Intelligent tutoring goes to school in the big city. International Journal of Artificial Intelligence in Education, 8:30–43, 1997.
- [16] J. A. Kulik and J. Fletcher. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of educational research, 86(1):42–78, 2016.
- [17] M. C. Lovett. Cognitive task analysis in service of intelligent tutoring systems design: A case study in statistics. In B. P. Goettl, H. M. Halff, C. L. Redfield, and V. J. Shute, editors, Intelligent Tutoring Systems, volume 1452 of Lecture Notes in Computer Science, pages 234–243. Springer, New York, 1998.
- [18] M. L. McHugh. Interrater reliability: The kappa statistic. Biochemia Medica, 22(3):276–282, 2012.
- [19] I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, Oct. 2024.
- [20] H. S. Nwana. Intelligent tutoring systems: an overview. Artificial Intelligence Review, 4(4):251–277, 1990.
- [21] OpenAI. Gpt-4 technical report, 2023. https://arxiv.org/abs/2303.08774.
- [22] J. W. Park, M. J. Kim, and S. W. Lee. Developing conversational intelligent tutoring for speaking skills in foreign language education. In F. Koch, L. Zhang, F. Chen, and P. Dillenbourg, editors, Artificial Intelligence in Education, page 123–134. Springer International Publishing, Cham, 2023.
- [23] J. M. Reddig, A. Arora, and C. MacLellan. Generating in-context, personalized feedback for intelligent tutors with large language models. Preprint provided by authors, 2024.
- [24] K. VanLehn. The behavior of tutoring systems. International journal of artificial intelligence in education, 16(3):227–265, 2006.
- [25] K. VanLehn. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist, 46(4):197–221, 2011.
- [26] S. Xia and Others. Evaluating mathematical reasoning beyond accuracy. arXiv preprint arXiv:2404.05692, 2024.
- [27] Y. Zhang, H. Li, Z. Wang, Y. Li, Y. Wang, Z. Wang, Y. Li, Y. Li, Y. Li, and Y. Li. Workedgen: Using llms to generate interactive worked programming examples. In Proceedings of the 2023 Conference on Learning at Scale, page 123–134, New York, NY, USA, 2023. Association for Computing Machinery.