Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study (2024)

Wenhan Lyu, Yimeng Wang, Tingting (Rachel) Chung, Yifan Sun, and Yixuan Zhang
William & Mary, Williamsburg, VA, USA
wlyu@wm.edu, ywang139@wm.edu, rachel.chung@mason.wm.edu, ysun25@wm.edu, yzhang104@wm.edu


Abstract.

The integration of AI assistants, especially through the development of Large Language Models (LLMs), into computer science education has sparked significant debate, highlighting both their potential to augment student learning and the risks associated with their misuse. An emerging body of work has looked into using LLMs in education, primarily focusing on evaluating the performance of existing models or conducting short-term human subject studies. However, very little work has examined the impact of LLM-powered assistants on students in entry-level programming courses, particularly in real-world contexts and over extended periods. To address this research gap, we conducted a semester-long, between-subjects study with 50 students using CodeTutor, an LLM-powered assistant developed by our research team. Our results show that students who used CodeTutor (the “CodeTutor group,” serving as the experimental group) achieved statistically significant improvements in their final scores compared to peers who did not use the tool (the control group). Within the CodeTutor group, those without prior experience with LLM-powered tools demonstrated significantly greater performance gains than their counterparts. Students expressed positive feedback regarding CodeTutor’s capability to comprehend their queries and assist in learning programming language syntax, but voiced concerns about its limited role in developing critical thinking skills. Over the course of the semester, students’ agreement with CodeTutor’s suggestions decreased, with a growing preference for support from traditional human teaching assistants. Our findings also show that students turned to CodeTutor for different tasks, including programming task completion, syntax comprehension, and debugging, particularly seeking help with programming assignments. Our analysis further reveals that the quality of user prompts was significantly correlated with the effectiveness of CodeTutor’s responses. Building upon these results, we discuss the need to integrate Generative AI literacy into curricula to foster critical thinking skills and examine the temporal dynamics of user engagement with LLM-powered tools. We further discuss the discrepancy between the anticipated functions of these tools and students’ actual capabilities, which sheds light on the need for tailored strategies to improve educational outcomes.

Field study, Large Language Models, Tutoring

journalyear: 2024; copyright: rightsretained; conference: Proceedings of the Eleventh ACM Conference on Learning @ Scale, July 18–20, 2024, Atlanta, GA, USA; booktitle: Proceedings of the Eleventh ACM Conference on Learning @ Scale (L@S ’24), July 18–20, 2024, Atlanta, GA, USA; doi: 10.1145/3657604.3662036; isbn: 979-8-4007-0633-2/24/07; ccs: Human-centered computing, Human computer interaction (HCI)

1. Introduction

Recent advancements in Generative AI and Large Language Models (LLMs), exemplified by GitHub Copilot (GitHub, Inc., 2024) and ChatGPT (OpenAI, 2024), have demonstrated their capacity to tackle complex problems with human-like proficiency. These innovations raise significant concerns within the educational domain, particularly because students might misuse these tools, thereby compromising the quality of education and breaching academic integrity norms (Perkins et al., 2023). Entry-level computer science education is directly affected by the progress in LLMs (Zhou et al., 2024): because LLMs can handle programming tasks, they can complete many assignments typically given in introductory courses, making them highly appealing to students looking for easy solutions.

Despite these challenges, LLM-powered tools offer great opportunities to enrich computer science education (Kumar et al., 2023). When used ethically and appropriately, they can serve as powerful educational resources. For instance, LLMs can provide students with instant feedback on their coding assignments or generate diverse code examples that help demonstrate programming concepts (Pankiewicz and Baker, 2023). Moreover, as Generative AI becomes commonplace in production environments, familiarizing students with these technologies is increasingly a crucial aspect of computer science education.

The unique challenges posed by LLMs stem from the difficulty of detecting the use of AI tools (Wu et al., 2023b; Zhou et al., 2023). Traditional approaches, such as plagiarism detection software, fall short in determining the originality of student submissions (Meyer et al., 2023). Given the challenges of identifying LLM usage and the potential advantages of these technologies, we consider integrating LLMs into computer science education inevitable. Yet, even though students have already started using such tools, the impact of LLMs on computer science education remains unknown. A growing body of research has begun to explore the application of LLMs within educational settings, primarily focusing on assessing the capabilities of current models with existing datasets or previous assignments from students (Hicke et al., 2023; Mehta et al., 2023). However, there is still a research gap in understanding how students interact with LLM-powered tools in introductory programming classes, particularly regarding their engagement in genuine learning settings over extended periods. Furthermore, while previous studies have shown individual differences in the effectiveness of intelligent tutoring systems (Kulik and Fletcher, 2016), research into how these differences apply to LLM-powered tools is lacking. Investigating these variations is important for tailoring educational strategies to diverse student needs. In short, understanding students’ nuanced attitudes toward and interactions with LLM-powered tools in CS education over extended periods is crucial for identifying the evolving challenges and opportunities LLMs introduce.

To address the research gap, we asked the following research questions (RQs) in this work:
RQ1. Does the integration of LLM-powered tools in introductory programming courses enhance or impair students’ learning outcomes, compared to traditional teaching methods? How are individual differences associated with students’ learning outcomes using LLM-powered tools?
RQ2. What are students’ attitudes towards LLM-powered tools, how do they change over time, and which factors might influence these attitudes?
RQ3. How do students engage with LLM-powered tools, and how do they respond to their programming needs?

We believe that addressing these research questions is critical for enabling educators and researchers to make informed decisions about incorporating LLMs into their courses and for guiding students on the optimal and responsible use of LLM-powered tools. To answer them, we conducted a longitudinal, between-subjects field study with 50 students over the fall semester, from September to December 2023, using a web-based tool we developed called CodeTutor.

The contributions of this work are: 1) We conducted a semester-long longitudinal field study to assess the effectiveness of an LLM-powered tool (CodeTutor) on students’ learning outcomes in an introductory programming course. By comparing the performance of students who used CodeTutor against those who did not, our study contributes new empirical evidence regarding the role of LLM-powered tools in the programming learning experience; 2) We characterized patterns of student engagement with CodeTutor and analyzed the ways in which it can meet students’ learning needs. Through the analysis of conversational interactions and feedback loops between students and the tool, we contributed new knowledge regarding how CodeTutor facilitates or impedes learning; and 3) We offered insights and outlined design implications for future research.

2. Related Work

2.1. Intelligent Tutoring Systems

Using computerized tools to assist education is not a new idea. The concept of using computers to support learning emerged as early as the 1950s (Nwana, 1990), and, once artificial intelligence was brought into the picture, it evolved into Intelligent Tutoring Systems (ITS) (Sleeman and Brown, 1982). ITS leverage artificial intelligence to provide personalized learning experiences in computer science education, adapting instruction and feedback to individual student needs (Anderson et al., 1985; Elsom-Cook, 1984). These systems have enhanced student engagement, comprehension, and problem-solving skills by offering tailored support and immediate feedback, similar to one-on-one tutoring (VanLehn, 2011; Demszky and Liu, 2023). Research has demonstrated that ITS can significantly improve understanding of complex concepts in programming courses compared to traditional teaching methods, leading to higher student satisfaction due to the personalized learning environment (Corbett et al., 1997; Ritter et al., 2007). The Internet further enabled ITS to offer more interactivity and adaptivity (Brusilovsky et al., 1996, 1998; Butz et al., 2006), paving the way for later advances powered by natural language processing techniques (ElSaadawi et al., 2008; Hooshyar et al., 2015).

However, prior work has shown that as the granularity of tutoring decreases, its effectiveness increases (VanLehn, 2011). Significant limitations of ITS include the complexity and cost of building them, their inability to answer questions and handle tasks outside their programmed domains, and the difficulty of developing systems that can be used productively by individuals without expertise (Graesser et al., 2018). Even though the Generalized Intelligent Framework for Tutoring (GIFT) (Sottilare et al., 2012) was proposed and has evolved to support developing ITS at scale, these limitations mostly remain unresolved.

2.2. Large Language Models in CS Education

The release of ChatGPT and other Generative AI applications brought LLMs into public view and attracted enormous attention (Achiam et al., 2023; Sun et al., 2024). LLMs offer researchers and users the flexibility to employ a single tool across various tasks (Wei et al., 2022), such as medical research (Thirunavukarasu et al., 2023; Clusmann et al., 2023), finance (Wu et al., 2023a), and education (Kasneci et al., 2023). The adoption of LLM-powered tools in educational settings is facilitated by their broad accessibility and cost-free nature (Zamfirescu-Pereira et al., 2023). Recent studies have looked into the potential of AI assistants to enhance student learning by helping with problem-solving (Ahmed et al., 2022; Leinonen et al., 2023b; Phung et al., 2023a) and generating computer science content (Sarsa et al., 2022; Denny et al., 2022). Current research on the use of LLMs in education has primarily examined their performance and capabilities (Prather et al., 2023) compared to humans, such as generating code for programming tasks (Leinonen et al., 2023a; Poldrack et al., 2023), answering general inquiries (Savelka et al., 2023; Phung et al., 2023b), and addressing textbook questions (Jalil et al., 2023) and exam questions (Dobslaw and Bergh, 2023).

Despite the growing interest in examining the capabilities of LLMs in education, very few empirical studies have examined the emerging concerns regarding their impact. There is therefore an urgent need for research into the long-term effects of LLMs in CS education and for strategies to counteract potential negative consequences. One notable exception is the work by Liffiton et al. (Liffiton et al., [n. d.]), who developed a tool called CodeHelp to assist students with their debugging needs in an undergraduate course over 12 weeks. Their follow-up study (Sheese et al., 2023) categorized the message history from their tool and found a positive relationship between tool usage and course performance. However, their study specifically focused on debugging issues and did not compare the outcomes with those achieved through traditional TA support.

Furthermore, prior research has demonstrated that individual differences, such as gender, race, and prior experiences with technologies, significantly influence the effectiveness of intelligent tutoring systems (Kulik and Fletcher, 2016). However, work that examines how individual differences affect interactions with and perceptions of LLM-powered tools in educational settings is sparse, even though understanding the role of demographic and individual variability is crucial (Zhang et al., 2024). This is particularly important for developing inclusive and effective educational tools that suit the diverse needs of students.

Our work seeks to address these research gaps by conducting a field study that evaluates the use of LLM-powered tools over an extended period of time. In particular, our study not only aims to evaluate the practicality of LLMs in programming education contexts, but also intends to contribute a more nuanced understanding of their long-term implications for learning and teaching methodologies.

3. Method

In this section, we describe the design of CodeTutor (Section 3.1), an overview of our participants (Section 3.2), our study procedure and data collection (Section 3.3), and our quantitative and qualitative data analysis (Section 3.4). The source code of CodeTutor, the pre-test questions, and the data analysis code are available at osf.io/e3zgh.

3.1. Design of CodeTutor

[Figure 1: The main interface of CodeTutor.]

We developed CodeTutor, a browser-based web application, using TypeScript and front-end frameworks (e.g., SolidJS and Astro, and libraries such as Zag) for a responsive and interactive user interface. CodeTutor integrates the OpenAI API to access the GPT-3.5 model offered by OpenAI. The main interface is shown in Figure 1.
Login. Students log in to CodeTutor using their email addresses, with a randomly generated unique identifier (UID) that tracks their activities anonymously.
User Interface. The CodeTutor interface features a navigation sidebar and a central chat area. The sidebar enables easy navigation, with a button for starting new conversations and a chronological listing of existing ones for quick access.
User Feedback Structure. Feedback is central to CodeTutor for understanding user engagement and students’ attitudes towards the tool. CodeTutor provides two feedback mechanisms: 1) conversation-level and 2) message-level feedback.
Data Storage. CodeTutor stores data locally in the user’s browser with IndexedDB and uploads only essential information to our secure server for research purposes, where each conversation is identified by a unique ID for anonymous tracking. To protect privacy, CodeTutor cannot read stored data back from our server.
API Usage. OpenAI offered only limited configuration options for its API at the time we started our experiment, so we carefully crafted the system role text in our implementation to instruct the model to answer questions as a teaching assistant in an entry-level Python class. This keeps the answers returned by the OpenAI API consistent even when the length of a conversation exceeds the model’s token limit.
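To make this setup concrete, the sketch below shows how a system-role message can pin a chat model to a teaching-assistant persona. CodeTutor itself is a TypeScript application and its exact prompt text is not reproduced here; the Python client, model name, prompt wording, and helper function are illustrative assumptions only.

```python
# Minimal sketch (not CodeTutor's actual implementation) of pinning a
# teaching-assistant persona via a system-role message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_ROLE = (  # hypothetical wording; the paper does not publish its prompt
    "You are a teaching assistant for an entry-level Python programming course. "
    "Answer questions clearly, explain reasoning step by step, and prefer simple, "
    "human-readable code over clever one-liners."
)

def ask_codetutor(history: list[dict], user_message: str) -> str:
    """Send the running conversation plus a new user message to the model.

    Re-sending the system message with every request keeps the assistant's
    behavior consistent even when older turns are truncated to fit the
    model's token limit.
    """
    messages = [{"role": "system", "content": SYSTEM_ROLE}] + history + [
        {"role": "user", "content": user_message}
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content
```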

3.2. Participants

Upon approval from our institution’s Institutional Review Board (IRB), we conducted a field evaluation study with 50 participants. The study took place in the Computer Science Department of a four-year university in the United States. Our criteria for participation were that participants be 18 years or older, able to speak and write in English, and enrolled as entry-level undergraduate computer science students at our institution. Table 1 presents an overview of our participants’ demographic information.

Table 1. Overview of participant demographics.

Characteristic                    Option                       Number of participants
Gender                            Woman                        22
                                  Man                          25
                                  Non-binary                   1
                                  Prefer not to say            2
Major                             Computer Science             18
                                  Data Science                 9
                                  Biology                      5
                                  Mathematics                  4
                                  Economics                    3
                                  Others                       10
                                  Not reported                 1
Year of Study                     Freshman                     37
                                  Sophomore                    5
                                  Junior                       6
                                  Senior                       1
                                  Not reported                 1
Race                              African American or Black   1
                                  Asian                        17
                                  Multiracial                  3
                                  White                        26
                                  Not reported                 3
Ethnicity                         Latino/Hispanic              3
Prior Experience with LLM tools   Only ChatGPT                 28
                                  ChatGPT and other tools      11
                                  Never used                   11

3.3. Study Procedure & Data Collection

Our field study lasted from September 27 (after the course add-drop period) to December 11, 2023 (the final exam due date). Below, we describe each component of our study.

3.3.1. Pre-test

Participants were first asked to provide their consent to participate after being informed about the study’s objectives, procedures, and their rights as participants, including the right to withdraw at any time without penalty. Following the consent process, a pre-test assessment was administered to evaluate students’ existing knowledge of Python programming, providing a baseline for subsequent analysis.

This pre-test included three sections with Python questions, with a total of 22 questions that varied in difficulty for an evaluation of participant skills. The first section featured eight questions (Questions 1-8, for example, “What is the output of the following code: print(3+4)?” ), the second section included seven questions of medium difficulty (Questions 9-15, for example, “If I wanted a function to return the product of two numbers a and b, what should the return statement look like?”), and the third section presented seven challenging questions (Questions 16-22, for example, “What will be the output of the following code? [Multiple lines of code]”). The total score of the three sections was 100 points. Pre-test submissions were graded by our researchers with Computer Science backgrounds, using predetermined scoring criteria.

This pre-test also asked about participants’ prior experience with LLMs, specifically asking, “Which of the following Large Language Model AI tools have you used before? Please select all that apply.” Participants were also asked to provide demographic information, including their major (or intended major), gender, and race/ethnicity. Participants were assured that all demographic information would remain anonymous and be used solely for research purposes.

3.3.2. Control vs. Experimental Group

Participants were divided into two groups: the control group, which used traditional learning methods and had access to human teaching assistants (TAs) for additional support outside class hours, and the experimental group, which used CodeTutor as their primary educational tool beyond class hours, alongside access to standard learning materials and human TAs. Using LLM-based tools other than CodeTutor in this course was prohibited.

To divide participants into a control group and an experimental group, we initially sorted the entire sample based on their previous engagement with LLM-powered tools, resulting in two groups: those who had used any LLM-powered tools before (Used Before) and those who had not (Never Used). Within the Used Before category, we split the participants into two subsets, Used Before Subset A and Used Before Subset B, based on the overall pre-test score distribution to ensure that both subsets were representative of the wider group. The same process was applied to the Never Used group, generating two additional subsets: Never Used Subset A and Never Used Subset B. The experimental group was then formed by combining Used Before Subset A with Never Used Subset A, while the control group consisted of the combination of Used Before Subset B and Never Used Subset B. This method ensured that the experimental and control groups were balanced in terms of prior experience with chatbots and pre-test performance (see Figure 2).

[Figure 2: Procedure for dividing participants into the experimental and control groups based on prior LLM-tool experience and pre-test scores.]
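The assignment procedure described above can be sketched as a stratified split. The field names, pairing strategy, and random seed below are our own illustration, not the authors’ code:

```python
import random

def assign_groups(participants, seed=0):
    """Split participants into experimental and control groups, balanced on
    prior LLM-tool experience and pre-test score distribution (sketch only)."""
    rng = random.Random(seed)
    experimental, control = [], []
    # Stratify by prior experience with LLM-powered tools.
    for used_before in (True, False):
        stratum = [p for p in participants if p["prior_llm_use"] == used_before]
        stratum.sort(key=lambda p: p["pretest"])
        # Walk down the score-ordered list in pairs and randomly send one member
        # of each pair to each group, so both subsets mirror the stratum's
        # pre-test distribution.
        for i in range(0, len(stratum), 2):
            pair = stratum[i:i + 2]
            rng.shuffle(pair)
            experimental.append(pair[0])
            if len(pair) > 1:
                control.append(pair[1])
    return experimental, control
```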

Following their group assignments, students in the experimental group were sent detailed instructions via email on how to access and use CodeTutor. In the field study, participants were not mandated to adhere to a specific frequency of engagement with CodeTutor; instead, they were encouraged to utilize the tool at their own pace. This approach allowed for a naturalistic observation of how students integrate LLM-powered educational resources into their learning processes, without imposing additional constraints that could influence their study habits or the study’s outcomes.

3.3.3. Student Evaluation

At the end of the semester, students’ final grades were used as a primary measure to assess their learning outcomes and the impact of CodeTutor interventions. While acknowledging that final grades are influenced by various factors, they offer a standardized measure of overall academic success, enabling an assessment of CodeTutor’s role in improving student learning outcomes.

Final grades were determined by a weighted average of several components for each student: labs (practical mini-projects), assignments (individual coding tasks, such as array summation), mid-terms, and a final exam (comprising questions similar to those in the pre-test). Note that a student’s final grade can exceed 100 if bonus points are awarded throughout the semester. Access to CodeTutor was restricted during mid-terms and final exams, which divides the assessment components into two groups: CodeTutor-Allowed (labs and assignments) and CodeTutor-Not-Allowed (mid-terms and final exams). This categorization facilitates an analysis of CodeTutor’s impact on student performance by examining potential dependencies on the tool and the improvement of learning outcomes in its absence.
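As a minimal sketch of this grading scheme, the snippet below computes a weighted final grade and the two accessibility-category means. The component weights are placeholders because the paper does not report them; only the CodeTutor-Allowed/Not-Allowed grouping follows the text.

```python
# Hypothetical weights; the paper does not report the actual weighting.
WEIGHTS = {"labs": 0.25, "assignments": 0.25, "midterms": 0.25, "final_exam": 0.25}
CODETUTOR_ALLOWED = {"labs", "assignments"}          # tool permitted
CODETUTOR_NOT_ALLOWED = {"midterms", "final_exam"}   # tool restricted

def final_grade(scores: dict, bonus: float = 0.0) -> float:
    """Weighted average of the components; bonus points may push it past 100."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) + bonus

def component_means(scores: dict) -> tuple[float, float]:
    """Mean score in each CodeTutor accessibility category, as compared in RQ1."""
    allowed = sum(scores[c] for c in CODETUTOR_ALLOWED) / len(CODETUTOR_ALLOWED)
    not_allowed = sum(scores[c] for c in CODETUTOR_NOT_ALLOWED) / len(CODETUTOR_NOT_ALLOWED)
    return allowed, not_allowed

# Example with made-up scores:
scores = {"labs": 95, "assignments": 98, "midterms": 88, "final_exam": 90}
print(final_grade(scores, bonus=2.0), component_means(scores))
```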

3.4. Data Analysis

3.4.1. Quantitative Data Analysis

We examined students’ scores, interaction behaviors, and attitudes toward using CodeTutor through multiple statistical analyses.

First, we calculated descriptive statistics for all variables, including frequencies with percentages for categorical variables and means and standard deviations for continuous variables. To examine the variation in students’ scores before and after the intervention (i.e., the use of CodeTutor), we conducted paired t-tests for both the experimental and control groups. Multiple regression analyses with family-wise p-value adjustment were used to examine the effects of CodeTutor on score improvement, taking into account students’ past experiences using LLM-powered tools and demographic variables, such as major, gender, and race. We then investigated the impact of CodeTutor accessibility on academic performance using ANOVA. Moreover, we conducted a chi-squared test to explore the relationship between the quality of students’ prompts and CodeTutor’s performance. To understand students’ attitudes towards CodeTutor, we calculated Spearman’s correlation matrix for continuous variables, given that our data are non-normal and exhibit unequal variance. Furthermore, to examine differences between questions, we used the Kruskal-Wallis rank sum test (using the R package stats (R Core Team, 2022)) and then performed post-hoc comparisons using Dunn’s test (using the R package FSA (Ogle et al., 2023)) in cases where significant differences were found. To investigate the effect of time on students’ attitudes towards CodeTutor, we fitted a linear mixed effects (LME) model (using the R package lme4 (Bates et al., 2015)). We considered results statistically significant at p < 0.05 in most cases, except in the multiple regression analyses, where we used p < 0.1 and reported effect sizes to indicate the strength of the relationships between variables.
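For readers who want to reproduce the general shape of this pipeline, the sketch below shows Python equivalents of a few of the tests named above; the study itself used R (stats, FSA, lme4), and the score and rating arrays here are hypothetical.

```python
# Illustrative Python equivalents of some analyses; not the authors' R code.
import numpy as np
from scipy import stats

pre = np.array([62.0, 71.0, 55.0, 80.0])      # hypothetical pre-test scores
final = np.array([75.0, 83.0, 60.0, 88.0])    # hypothetical final scores

# Paired t-test: did scores change within a group from pre-test to final?
t_stat, p_value = stats.ttest_rel(pre, final)

# Spearman correlation between two attitude items (ordinal, non-normal data).
comprehension = [5, 4, 4, 3]
syntax_mastery = [5, 5, 3, 3]
rho, rho_p = stats.spearmanr(comprehension, syntax_mastery)

# Kruskal-Wallis rank-sum test across attitude questions; the paper follows
# this with post-hoc pairwise comparisons when the omnibus test is significant.
h_stat, kw_p = stats.kruskal(comprehension, syntax_mastery, [2, 3, 2, 1])

print(t_stat, p_value, rho, rho_p, h_stat, kw_p)
```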

3.4.2. Qualitative Data Analysis

We also analyzed the conversation history between users and CodeTutor. Specifically, we used the General Inductive Approach (Thomas, 2006) to guide our thematic analysis of the conversational data. The first author conducted a close reading of the data to gain a preliminary understanding and then labeled text segments to formulate categories, which served as the basis for constructing low-level codes capturing specific elements of the user-CodeTutor interactions. Similar low-level codes were then clustered together into high-level themes. During the analysis, the research team engaged in ongoing discussions to refine and clarify emerging themes.

4. Results

In this section, we examine the impact of CodeTutor on student academic performance (Section 4.1, answering RQ1), analyze students’ attitudes towards learning with CodeTutor (Section 4.2, answering RQ2), and characterize their engagement patterns in entry-level programming courses (Section 4.3, answering RQ3).

4.1. RQ1: Learning Outcomes with CodeTutor

4.1.1. Comparative Analysis of Score Improvements

Overall, students in the experimental group exhibited a greater average improvement in scores, as illustrated by comparing their pre-test and final scores to those in the control group. Specifically, the average increase for the experimental group was 12.50, whereas the control group showed an average decrease of 3.17 when comparing final scores to pre-test scores.

We conducted paired t-tests for both the experimental and control groups to determine if the observed improvements were statistically significant, starting from the premise that there were no differences in pre-test scores between the two groups. Our null hypothesis assumed that the true mean difference between pre-test and final scores was zero. For the control group, the null hypothesis could not be rejected, suggesting that the differences between pre-test and final scores were not statistically significant (t = -0.879, p = 0.394). Conversely, participants in the experimental group demonstrated a statistically significant improvement from the pre-test to the final scores (t = -2.847, p = 0.009).

Furthermore, when examining the improvement in the CodeTutor-Not-Allowed components, the experimental group exhibited an average increase of 7.33, whereas the control group showed no significant change. A paired t-test comparing the pre-test and final exam scores (during which the use of CodeTutor was not permitted) showed that students in the experimental group demonstrated a statistically significant improvement (t = -2.405, p = 0.026). This result suggests that students who used CodeTutor exhibit more substantial improvement even when CodeTutor is unavailable.

4.1.2. Effect of CodeTutor Accessibility on Academic Performance

[Figure 3: Mean scores for the CodeTutor-Allowed and CodeTutor-Not-Allowed assessment components in the experimental group.]

By grouping the assessment components into the CodeTutor-Allowed and CodeTutor-Not-Allowed categories, we examined the relationship between CodeTutor’s accessibility and student academic performance. Applying ANOVA to the data from the experimental group, Figure 3 shows that the mean score for the CodeTutor-Allowed category is 102.29, in contrast to the CodeTutor-Not-Allowed components, which have a mean score of 93.40. The statistical analysis shows a significant difference between the two categories (t = 2.31, p = 0.03), suggesting that the allowance of CodeTutor correlates with higher student scores.

4.1.3. Correlation Between Student Demographics and Final Scores in the Experimental Group

Table 2. Multiple regression results for final scores in the experimental group.

                                             Estimate   Std. Error   t value   Pr(>|t|)
Const                                         93.683      3.877      24.166    0.000 ***
Prior Experience with LLM tools (Reference: Used before)
  Never used                                  18.877      5.054       3.735    0.032 *
Major (Reference: Computer Science)
  Data Science                                14.532      5.662       2.567    0.073
  Mathematics                                 17.692      5.852       3.023    0.057
  Biology                                     16.257      5.662       2.871    0.057
  Economics                                    1.362      4.799       0.284    0.784
  Others                                     -13.004      6.022      -2.160    0.115
Gender (Reference: Female)
  Male                                         5.917      3.845       1.539    0.223
Race (Reference: White)
  Asian                                       -7.831      3.933      -1.991    0.128
  African American or Black                    8.099      7.107       1.140    0.322
  Others                                       6.102      5.416       1.127    0.322

Subsequently, we evaluated demographic factors to determine whether specific student groups, particularly those with prior technology experience, experienced greater benefits from CodeTutor. Table 2 shows the results of multiple regression models examining how students’ final scores in the experimental group are associated with their history with LLM tools, major, gender, and race. Students who had never used any LLM-powered tools showed a significantly larger increase in final score (β = 18.877, p = 0.032) than students who had used such tools before.

Moreover, differences in final scores among majors within the experimental group were statistically significant, indicating that major plays a substantial role in final scores in the experimental group. Students majoring in data science (β = 14.532, p = 0.073), mathematics (β = 17.692, p = 0.057), and biology (β = 16.257, p = 0.057) exhibited a positive association with final scores (significant at the relaxed p < 0.1 threshold) compared to those majoring in computer science, suggesting that these majors achieved higher final scores. In terms of gender, no significant effects were observed, indicating no difference between genders in final scores. Additionally, no significant differences were noted across races in final scores.

Summary of results of RQ1: Collectively, our findings suggest that students in the experimental group achieved significant score improvements with CodeTutor. In particular, those who were new to LLM-powered tools achieved even greater improvements, while students majoring in data science, mathematics, and biology surpassed their computer science counterparts. Moreover, students exhibited higher scores when permitted to use CodeTutor.

4.2. RQ2: Students’ Attitudes towards CodeTutor

4.2.1. Descriptive Analysis

In terms of students’ attitudes towards CodeTutor (see Figure 4 for the specific questions), we found that a small portion of students (8%) strongly disagreed or disagreed that CodeTutor accurately understood what they intended to ask, while most (67%) agreed or strongly agreed. In addition, 35% strongly disagreed or disagreed that CodeTutor helped them think critically, while 19% agreed or strongly agreed. Furthermore, 13% of students disagreed that CodeTutor improved their understanding of programming syntax, with a larger proportion agreeing (33%) or strongly agreeing (25%). Nearly half of the students (42%) agreed or strongly agreed that CodeTutor helped them build their own understanding, while 17% strongly disagreed or disagreed. Finally, regarding the potential of CodeTutor to substitute for a human teaching assistant (note: response values to the TA Replacement question were reversed in our analysis, so a higher score indicates a stronger preference for our tool over human teaching assistants; this reversal is applied consistently across all subsequent analyses), 20% of the students strongly disagreed or disagreed with this notion, while 42% agreed or strongly agreed. Figure 4 shows the distribution of students’ responses across these five questions.

[Figure 4: Distribution of students’ responses to the five attitude questions about CodeTutor.]

4.2.2. Exploring Relationships in Student Attitudes Toward CodeTutor

[Figure 5: Spearman correlation matrix of students’ attitudes towards CodeTutor.]
[Figure 6: Post-hoc pairwise comparisons of students’ attitudes across the five questions.]
Table 3. Linear mixed effects model estimates (β, with standard errors) for students’ attitudes over time.

        Comprehension       Critical Thinking   Syntax Mastery      Independent Learning   TA Replacement
Const   4.700 (0.297)***    2.690 (0.247)***    3.760 (0.262)***    3.044 (0.218)***       3.964 (0.330)***
Time    -0.114 (0.039)**    0.040 (0.037)       -0.018 (0.041)      0.054 (0.036)          -0.099 (0.051)

Figure 5 reveals key relationships among students’ attitudes towards CodeTutor. The moderate positive correlation between Comprehension and Syntax Mastery suggests that higher ratings on one are associated with higher ratings on the other. Critical Thinking is slightly positively correlated with Comprehension and Independent Learning but slightly negatively correlated with TA Replacement. Furthermore, Syntax Mastery correlates strongly with Independent Learning, indicating a close relationship between mastering programming syntax and self-directed learning outcomes. In addition, TA Replacement has minimal to no significant correlations with the other variables, suggesting its effects vary independently of these educational aspects.

To further explore the differences in students’ attitudes across questions, we present the results of multiple comparisons across the five questions. Specifically, our results show that respondents’ attitudes differ significantly across questions (χ² = 32.99, p < 0.05). Our post-hoc tests (see Figure 6) further reveal that students agreed significantly less that CodeTutor assists in fostering critical thinking than that it understands their queries, helps them learn syntax, or can serve as a replacement for a teaching assistant. Moreover, our findings suggest that respondents agreed significantly more with CodeTutor’s effectiveness in comprehension than with its ability to improve their understanding of programming syntax.

We then fitted a linear mixed effects (LME) model to explore the influence of time on students’ attitudes toward CodeTutor:

$\mathit{QuestionIndicator}_{it} = \beta_0 + b_{0i} + (\beta_1 + b_{1i})\,t + \epsilon_{it}$

where β₀ and β₁ are unknown fixed-effect parameters; b_{0i} and b_{1i} are the unknown student-specific random intercept and slope, respectively, which are assumed to follow a bivariate normal distribution with mean zero and covariance matrix D; QuestionIndicator_{it} is student i’s response at time t; and ε_{it} is the residual error for student i at time t, with a normal distribution N(0, σ²), assumed to be independent of the random effects. From Table 3, we can see that students’ attitudes toward CodeTutor show a significant decrease in Comprehension (β = -0.114, p < 0.01), indicating that, over time, students increasingly disagree that CodeTutor understands them accurately. Moreover, there is a weakly significant decrease in TA Replacement (β = -0.099, p < 0.1) over time, showing a slight tendency to consider human TA help more as the semester progresses. Students show no significant change over time in Critical Thinking, Syntax Mastery, or Independent Learning.
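The model above was fitted with R’s lme4 (roughly lmer(response ~ time + (time | student))); an equivalent random-intercept-and-slope fit can be sketched in Python with statsmodels, as below. The simulated weekly ratings are purely illustrative.

```python
# Sketch of an equivalent random-intercept-and-slope model in statsmodels;
# the paper's model was fit with R's lme4, and this data is simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for student in range(1, 11):             # 10 hypothetical students
    intercept = 4.5 + rng.normal(0, 0.3)
    slope = -0.1 + rng.normal(0, 0.05)
    for week in range(8):                # 8 weekly attitude ratings
        rating = intercept + slope * week + rng.normal(0, 0.2)
        rows.append({"student": student, "time": week, "response": rating})
df = pd.DataFrame(rows)

# Fixed effect of time plus a per-student random intercept and slope.
model = smf.mixedlm("response ~ time", df, groups=df["student"], re_formula="~time")
print(model.fit().summary())
```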

Summary of results of RQ2: In summary, students recognize CodeTutor’s ability to understand their queries and assist with programming syntax yet question its capacity to promote critical thinking skills. Additionally, students’ confidence in CodeTutor’s comprehension abilities decreases over time, with a growing preference for support from human teaching assistants.

4.3. RQ3: Students’ Engagement with CodeTutor

In total, we documented 82 conversation sessions with CodeTutor (in our analysis, a conversation session is a continuous exchange of messages between users and CodeTutor within a specific period, characterized by a coherent topic or purpose), encompassing a total of 2,567 messages. In these sessions, 415 unique topics were discussed, averaging 5.06 topics per session and 6.19 messages per topic.

4.3.1. Message Classification & Interaction Patterns

In total, we collected 2,567 conversational messages exchanged between users and CodeTutor. Of these, 1,288 messages originated from the users, and CodeTutor responded with 1,279 messages.

Table 4 presents the categorization of messages between users and CodeTutor. Each category has a description and an example to illustrate the message type. Categories of messages sent by both users and CodeTutor include Programming Task inquiries, addressing specific Python programming challenges; Grammar and Syntax questions, focusing on Python’s basic grammar or syntax without necessitating runnable programs; General Questions, which are not directly related to Python; and Greetings, initiating or concluding an interaction.

On the users’ side, additional categories highlight their engagement with CodeTutor: Modification Requests for alterations to previous answers; Help Ineffective, indicating issues or errors in CodeTutor’s provided solutions; Further Information, elaborating on prior queries; and Debug Requests for assistance in resolving bugs or errors in code snippets.

CodeTutor’s responses are further classified into Corrections, which address and amend errors in previous responses, and Explanations, which provide further details on provided solutions or clarify why certain requests cannot be fulfilled.

Table 4. Categories of messages exchanged between users and CodeTutor (sender in parentheses).

  • Programming Task (users & CodeTutor): Any questions or answers related to Python programming. Example: “Write a function that prints the nth (argument) prime number.” (86.52%)
  • Grammar & Syntax (users & CodeTutor): Messages related to basic Python grammar or syntax problems, where a runnable program is most likely unnecessary. Example: “What does {} do in Python?” (14.26%)
  • General Question (users & CodeTutor): Messages not directly related to Python. Example: “What is ASCII?” (4.29%)
  • Greetings (users & CodeTutor): Greeting messages. Example: “Hello! How can I assist you today?” (0.62%)
  • Help Ineffective (users): The user states that the previous answer generated by CodeTutor is wrong or provides error information. Example: “This code still fails.” (12.86%)
  • Debug Request (users): The user asks CodeTutor to fix bugs or explain what was wrong in code snippets provided in previous messages. Example: “Debug this code. [Code Snippet]” (8.22%)
  • Modification Request (users): The user asks CodeTutor to change something in its previous answer. Example: “Remove comments.” (4.48%)
  • Further Information (users): The user provides more context on their previous input. Example: “All the input strings will be the same length.” (3.97%)
  • Explanation (CodeTutor): CodeTutor explains something in previous messages or why it cannot complete the current task. Example: “I’m sorry, but I need more information to provide the answers for questions 4 and 6.” (28.94%)
  • Correction (CodeTutor): CodeTutor corrects content in its previous answer. Example: “Apologies for the syntax error. Here is the corrected version: [Code Snippet]” (13.95%)

4.3.2. Analysis of Prompt Quality & Correlation with Response Effectiveness

To further examine user interaction patterns with CodeTutor and their implications for its educational value, we analyzed the relationship between prompt quality and response accuracy. This analysis stems from the premise that detailed and precise prompts are likely to improve the AI’s understanding of user requirements, thereby potentially raising the standard of its responses.

To do so, we evaluated a corpus of 1,190 prompts, after removing all greeting messages, to assess their quality. We defined “good quality” prompts as those providing sufficient detail for CodeTutor to generate an accurate response; “poor quality” prompts were those that did not meet this criterion. Our analysis showed that 37% of prompts were of good quality and the remaining 63% were of poor quality. We categorized the deficiencies in poor-quality prompts into four types: incomplete information (n = 189, 25%), lacking specific details necessary for CodeTutor to understand the context; lack of clear goals (n = 172, 23%), where the desired outcome was not explicitly stated; over-reliance on CodeTutor (n = 362, 48%), where assignment questions were directly copied and pasted into CodeTutor; and poor structural organization (n = 25, 3%), with unclear or confusing request structures. Prompts were further labeled as “working” if they elicited an appropriate response from CodeTutor, and “not working” if they failed to do so.

Using a chi-square test, we investigated whether prompt quality and the effectiveness of CodeTutor’s responses were independent. Our results showed a significant association (χ² = 144.84, p < 0.001). In other words, clearer and more detailed prompts are associated with responses that are more likely to be effective.
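A chi-square test of independence on such a 2×2 table (prompt quality × response effectiveness) can be sketched as follows; the cell counts are invented for illustration and are not the study’s data.

```python
# Sketch of the prompt-quality vs. response-effectiveness test with made-up counts.
from scipy.stats import chi2_contingency

#                 working  not working
table = [[380,  60],    # good-quality prompts  (hypothetical counts)
         [350, 400]]    # poor-quality prompts  (hypothetical counts)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```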

Summary of results of RQ3: We characterized the messages exchanged between users and CodeTutor. We categorized these interactions into inquiries (e.g., programming tasks, syntax questions) and feedback, alongside CodeTutor’s responses (corrections and explanations), illustrating a dynamic exchange aimed at facilitating learning. We also found that the clarity and completeness of prompts are significantly correlated with the quality of responses from CodeTutor.

5. Discussion

Our semester-long field study provided insights into how students in introductory computer science courses used CodeTutor and its effects on educational outcomes. In short, our results show that 1) students who used CodeTutor showed significant improvements in scores; 2) while CodeTutor was valued for its assistance with comprehension and syntax, students expressed concerns about its capacity to enhance critical thinking skills; 3) skepticism regarding CodeTutor as an alternative to human teaching assistants grew over time; 4) CodeTutor was primarily used for various coding tasks, including syntax comprehension, debugging, and clarifying fundamental concepts; and 5) the effectiveness of CodeTutor’s responses was notably higher when prompts were clearer and more detailed. Building on these findings, we discuss the implications for future enhancements and research directions in the rest of this section.

5.1. Towards Enhancing Generative AI Literacy

Our research indicates a positive correlation between the use of Generative AI tools and improved student learning outcomes. However, 63% of student-generated prompts were deemed unsatisfactory, indicating a lack of essential skills to fully exploit Generative AI tools. This finding also suggests the need to promote Generative AI literacy among students. Here, we define Generative AI literacy as the ability to effectively interact with AI tools and understand how to formulate queries and interpret responses. Our findings suggest that while students can leverage CodeTutor for practical coding assistance and syntax understanding, there is a gap in using these tools to enhance critical thinking skills. We suggest educational programs integrate Generative AI literacy as a core component of their curriculum, teaching students how to use these tools for immediate problem-solving and engaging with them to promote deeper analytical and critical thinking. This could include workshops on effective query formulation, sessions on interpreting AI responses, and exercises designed to challenge students to critically evaluate the information and solutions offered by AI tools.

We also propose approaches to integrate HCI tools and principles into LLM-enabled platforms. One approach is prompt construction templates: structured forms that guide users in formulating more effective and precise questions. Templates could include placeholders for essential details and context, providing the information the AI needs to generate accurate responses (an example template is sketched below). Furthermore, integrating Critical Thinking Prompts might be particularly effective in stimulating in-depth analytical thinking. For example, the interface could pose follow-up questions encouraging users to critically assess the adequacy of AI answers. Questions such as “Does this response fully address your query?” or “What additional information might you need?” may prompt users to engage in a more thorough evaluation of the information provided, fostering a habit of critical reflection and assessment. Another possible approach is facilitating collaborative query building, which leverages the power of collective intelligence: by designing interfaces that support real-time collaboration, individuals can work together to construct and refine queries. We can also use LLMs themselves to evaluate and refine user questions on the fly, as they perform well at prompting (Zhou et al., 2022).
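As one concrete illustration of such a prompt-construction template (the field names below are our own, not a CodeTutor feature), a structured query form might look like this:

```python
# Hypothetical prompt-construction template with placeholders for the
# details an LLM needs to give an accurate, course-appropriate answer.
TEMPLATE = """\
Goal: {goal}
What I have tried so far: {attempt}
Relevant code or error message:
{context}
Constraints (e.g., only use concepts from week {week}): {constraints}
"""

query = TEMPLATE.format(
    goal="Write a function that returns the product of two numbers a and b",
    attempt="I wrote `return a + b` but the tests fail",
    context="def product(a, b):\n    return a + b",
    week=3,
    constraints="no external libraries",
)
print(query)
```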

5.2. Turning to the Temporal Dynamics of LLM-Powered Tutoring Tools

The temporal dynamics of using CodeTutor in computer science education present a nuanced perspective on its integration and effectiveness over time. Our analysis reveals a complex relationship between the duration of CodeTutor use and students’ attitudes towards it. Specifically, our results show that although students initially find CodeTutor a reliable tool for understanding their queries, their confidence in its accuracy diminishes with prolonged use. Additionally, our model uncovers a weakly significant decrease in students’ preference for CodeTutor as a TA replacement over time. This trend implies a growing inclination among students to seek human TA support as they progress in their courses, possibly due to the nuanced understanding and personalized feedback that human TAs can offer, which might not be fully replicated by LLMs. However, our study found no significant temporal change in students’ attitudes toward CodeTutor’s impact on critical thinking, syntax mastery, and independent learning. This stability suggests that while students may question CodeTutor’s comprehension abilities and its adequacy as a TA replacement over time, they still recognize its utility in facilitating certain aspects of the learning process, such as mastering syntax and promoting independent study habits.

Collectively, our findings highlight the importance of investigating the temporal dynamics of student attitudes towards and their use of LLM-powered tools for learning and shed light on the need for a balanced approach to integrating LLMs into CS education. While these tools offer great support in specific areas, their limitations become more apparent with extended use. In other words, it is important to complement LLMs with human instruction to address learning objectives, such as critical thinking and problem-solving, which are crucial for computer science education. Furthermore, we argue that educators and developers should work collaboratively to enhance the capabilities of LLM-powered tutoring systems, ensuring they remain effective and relevant over time.

5.3. Alignments of LLMs for Education

Our observations regarding students’ utilization of CodeTutor provide insights into their learning approaches and completion of assignments. The exams that prohibit using CodeTutor reflect students’ understanding of programming, as they must rely solely on their internal knowledge. Conversely, assignments and lab tasks that permit using CodeTutor result in higher scores, indicating that students may prioritize completion over deep comprehension (Gustafson, 2022). While students employ CodeTutor to fulfill homework requirements, they may not perceive it as a tool for a comprehensive understanding of course materials.

Our results show that nearly half of the low-quality prompts were classified as over-reliance, that is, original assignment questions copied and pasted directly into CodeTutor. This suggests that students primarily used CodeTutor as a quick-fix solution, neglecting the opportunity to engage with the underlying logic of a question and determine appropriate solutions. As the complexity of assignments increased, students’ perceptions of CodeTutor’s ability to understand their queries turned more negative. At the same time, students acknowledge its proficiency in syntax mastery, which reveals a gap between their expectations and the tool’s capabilities. Complex questions require students to integrate and apply the knowledge acquired in class (Trautwein and Köller, 2003), challenging the notion that CodeTutor can easily break questions down into manageable components. Additionally, CodeTutor’s limitations, such as being trained on a fixed dataset and struggling with custom or complex queries, suggest that it is important to simplify questions and structure prompts effectively for optimal results.

Furthermore, we argue that students’ previous experiences with chatbots, if unrelated to structured learning (for example, issuing a simple one-line request such as “help me write a summary”), may not adequately prepare them for using CodeTutor effectively in a programming context, as evidenced by our finding that 63% of student prompts in our corpus were of poor quality. Students with limited experience interacting with chatbots might also be hesitant to fully trust tools like CodeTutor, potentially affecting their use of and reliance on its outputs. This lack of familiarity could lead them to prefer traditional learning approaches, fostering deeper analytical thinking and minimizing dependency on automated assistance.

Design Implications. Our findings shed light on the future implementation and enhancement of CodeTutor in programming courses. The inherent limitations of CodeTutor, which relies on a model trained on general data, may necessitate the creation of custom datasets tailored to specific class contexts. Instructors’ reflections on the quality of students’ assignments make it evident that, while CodeTutor produces impressive results because it was trained on code written by professional programmers with efficiency in mind, entry-level classes should prioritize human-readable code over complex solutions. One potential direction is to leverage GPT models through the Assistants API (OpenAI, 2024). This API enables the development of AI assistants with features such as the Code Interpreter (OpenAI, 2024a), which can execute Python code in a sandboxed environment, and Knowledge Retrieval (OpenAI, 2024b), which allows users to upload documents to enhance the assistant’s knowledge base. These features align more closely with the requirements of a virtual TA in entry-level programming courses: the Code Interpreter can improve the quality of responses containing code blocks, while Knowledge Retrieval empowers instructors to provide course-specific information. Meanwhile, providing systematic instructions to students can improve their understanding of how to use the tool effectively, and additional instructional features can improve its accessibility. Additionally, it is crucial to emphasize the boundaries of using LLM-powered tools, clarifying what is permissible and the consequences of inappropriate usage.
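As a hedged sketch of this direction, an assistant with the Code Interpreter and retrieval tools could be created roughly as follows. The Assistants API was in beta at the time of writing, so method names, tool identifiers, and the model choice below may differ across SDK versions and are assumptions rather than the authors’ implementation.

```python
# Hedged sketch: creating a course-specific virtual TA with the (beta)
# OpenAI Assistants API. Names and tool types may vary by SDK version.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Course-specific virtual TA",
    instructions=(
        "You are a TA for an entry-level Python course. Prefer simple, "
        "human-readable code and explain each step."
    ),
    model="gpt-4-turbo-preview",
    tools=[
        {"type": "code_interpreter"},  # runs Python in a sandbox to check answers
        {"type": "retrieval"},         # grounds answers in uploaded course materials
    ],
)
print(assistant.id)
```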

6. Limitations and Future Work

Our study, while providing valuable insights into the use of LLM-powered tools in educational settings, has several limitations that suggest avenues for further research. First, the current study was conducted on a relatively small scale, limiting the generalizability of our findings; our future work will therefore conduct larger-scale studies involving more diverse student populations and settings. Second, regarding applicability to different levels of coding courses, our work has focused on a beginning-level CS course, and our findings may not directly translate to intermediate or advanced programming courses. Furthermore, we relied on GPT-3.5 in this study, which may not always provide accurate or contextually appropriate responses, potentially affecting the quality of the tutoring provided. Lastly, controlling the experimental environment in a semester-long study, particularly for the control group, was challenging, indicating the need for more tightly controlled experimental designs in future studies to better understand the factors affecting student learning.

7. Conclusion

In this work, we conducted a semester-long, between-subjects study with 50 students to examine how students use an LLM-powered virtual teaching assistant (CodeTutor) in introductory-level programming learning. The experimental group using CodeTutor showed significant improvements in final scores over the control group, with first-time users of LLM-powered tools experiencing the most substantial gains. While positive feedback was received on CodeTutor’s ability to understand queries and aid in syntax learning, concerns were raised about its effectiveness in cultivating critical thinking skills. Over time, we observed a shift towards preferring human teaching assistant support over CodeTutor, despite its utility in completing programming tasks, understanding syntax, and debugging. Our study also shows the importance of prompt quality in leveraging CodeTutor’s effectiveness, indicating that detailed and clear prompts yield more accurate responses. Our findings point to the critical need for embedding Generative AI literacy into educational curricula and for promoting critical thinking abilities among students. Looking ahead, our research suggests that integrating LLM-powered tools into computer science education requires more tools, resources, and regulations to help students develop Generative AI literacy, as well as customized teaching strategies to bridge the gap between tool capabilities and educational goals. By adjusting expectations and guiding students on effective tool use, educators may harness the full potential of Generative AI to complement traditional teaching methods.

Acknowledgements.

This project is funded by the Studio for Teaching & Learning Innovation Learn, Discover, Innovate Grant, the Faculty Research Grant from William & Mary, and the Microsoft Accelerate Foundation Models Research Award. We thank our participants in this study and our anonymous reviewers for their feedback.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023). https://doi.org/10.48550/arXiv.2303.08774
  • Ahmed et al. (2022) Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. SYNSHINE: improved fixing of syntax errors. IEEE Transactions on Software Engineering 49, 4 (2022), 2169–2181. https://doi.org/10.1109/TSE.2022.3212635
  • Anderson et al. (1985) John R. Anderson, C. Franklin Boyle, and Brian J. Reiser. 1985. Intelligent tutoring systems. Science 228, 4698 (1985), 456–462. https://doi.org/10.1126/science.228.4698.456
  • Bates et al. (2015) Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48. https://doi.org/10.18637/jss.v067.i01
  • Brusilovsky et al. (1998) Peter Brusilovsky et al. 1998. Adaptive educational systems on the world-wide-web: A review of available technologies. In Proceedings of Workshop “WWW-Based Tutoring” at 4th International Conference on Intelligent Tutoring Systems (ITS’98), San Antonio, TX.
  • Brusilovsky et al. (1996) Peter Brusilovsky, Elmar Schwarz, and Gerhard Weber. 1996. ELM-ART: An intelligent tutoring system on World Wide Web. In Intelligent Tutoring Systems: Third International Conference, ITS’96, Montréal, Canada, June 12–14, 1996, Proceedings 3. Springer, 261–269. https://doi.org/10.1007/3-540-61327-7_123
  • Butz et al. (2006) Cory J. Butz, Shan Hua, and R. Brien Maguire. 2006. A web-based Bayesian intelligent tutoring system for computer programming. Web Intelligence and Agent Systems: An International Journal 4, 1 (2006), 77–97.
  • Clusmann et al. (2023) Jan Clusmann, Fiona R. Kolbinger, Hannah Sophie Muti, Zunamys I. Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P. Veldhuizen, et al. 2023. The future landscape of large language models in medicine. Communications Medicine 3, 1 (2023), 141. https://doi.org/10.1038/s43856-023-00370-1
  • Corbett et al. (1997) Albert T. Corbett, Kenneth R. Koedinger, and John R. Anderson. 1997. Intelligent tutoring systems. In Handbook of Human-Computer Interaction. Elsevier, 849–874. https://doi.org/10.1016/B978-044481862-1.50103-5
  • Demszky and Liu (2023) Dorottya Demszky and Jing Liu. 2023. M-Powering Teachers: Natural Language Processing Powered Feedback Improves 1:1 Instruction and Student Outcomes. (2023). https://doi.org/10.1145/3573051.3593379
  • Denny et al. (2022) Paul Denny, Sami Sarsa, Arto Hellas, and Juho Leinonen. 2022. Robosourcing Educational Resources – Leveraging Large Language Models for Learnersourcing. arXiv preprint arXiv:2211.04715 (2022). https://doi.org/10.1145/3501385.3543957
  • Dobslaw and Bergh (2023) Felix Dobslaw and Peter Bergh. 2023. Experiences with Remote Examination Formats in Light of GPT-4. arXiv preprint arXiv:2305.02198 (2023). https://doi.org/10.48550/arXiv.2305.02198
  • ElSaadawi et al. (2008) Gilan M. ElSaadawi, Eugene Tseytlin, Elizabeth Legowski, Drazen Jukic, Melissa Castine, Jeffrey Fine, Robert Gormley, and Rebecca S. Crowley. 2008. A natural language intelligent tutoring system for training pathologists: Implementation and evaluation. Advances in Health Sciences Education 13 (2008), 709–722. https://doi.org/10.1007/s10459-007-9081-3
  • Elsom-Cook (1984) Mark Elsom-Cook. 1984. Design considerations of an intelligent tutoring system for programming languages. Ph.D. Dissertation. University of Warwick.
  • GitHub, Inc. (2024) GitHub, Inc. 2024. GitHub Copilot. https://github.com/features/copilot. Accessed: 2024-02-11.
  • Graesser et al. (2018) Arthur C. Graesser, Xiangen Hu, and Robert Sottilare. 2018. Intelligent tutoring systems. In International Handbook of the Learning Sciences. Routledge, 246–255.
  • Gustafson (2022) Morgan Gustafson. 2022. The Effect of Homework Completion on Students’ Academic Performance. Dissertations, Theses, and Projects. https://red.mnstate.edu/thesis/662662.
  • Hicke et al. (2023) Yann Hicke, Anmol Agarwal, Qianou Ma, and Paul Denny. 2023. ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. arXiv preprint arXiv:2311.02775 (2023). https://doi.org/10.48550/arXiv.2311.02775
  • Hooshyar et al. (2015) Danial Hooshyar, Rodina Binti Ahmad, Moslem Yousefi, Farrah Dina Yusop, and S.-J. Horng. 2015. A flowchart-based intelligent tutoring system for improving problem-solving skills of novice programmers. Journal of Computer Assisted Learning 31, 4 (2015), 345–361. https://doi.org/10.1111/jcal.12099
  • Jalil et al. (2023) Sajed Jalil, Suzzana Rafi, Thomas D. LaToza, Kevin Moran, and Wing Lam. 2023. ChatGPT and software testing education: Promises & perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 4130–4137. https://doi.org/10.1109/ICSTW58534.2023.00078
  • Kasneci et al. (2023) Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274. https://doi.org/10.1016/j.lindif.2023.102274
  • Kulik and Fletcher (2016) James A. Kulik and J. D. Fletcher. 2016. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research 86, 1 (2016), 42–78. https://doi.org/10.3102/0034654315581420
  • Kumar et al. (2023) Harsh Kumar, Ilya Musabirov, Mohi Reza, Jiakai Shi, Anastasia Kuzminykh, Joseph Jay Williams, and Michael Liut. 2023. Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception. arXiv preprint arXiv:2310.13712 (2023). https://doi.org/10.48550/arXiv.2310.13712
  • Leinonen et al. (2023a) Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023a. Comparing code explanations created by students and large language models. arXiv preprint arXiv:2304.03938 (2023). https://doi.org/10.48550/arXiv.2304.03938
  • Leinonen et al. (2023b) Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A. Becker. 2023b. Using large language models to enhance programming error messages. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 563–569. https://doi.org/10.1145/3545945.3569770
  • Liffiton et al. ([n. d.]) Mark Liffiton, Brad E. Sheese, Jaromir Savelka, and Paul Denny. [n. d.]. CodeHelp: Using large language models with guardrails for scalable support in programming classes. ([n. d.]), 1–11. https://doi.org/10.1145/3631802.3631830
  • Mehta et al. (2023) Atharva Mehta, Nipun Gupta, Dhruv Kumar, Pankaj Jalote, et al. 2023. Can ChatGPT Play the Role of a Teaching Assistant in an Introductory Programming Course? arXiv preprint arXiv:2312.07343 (2023). https://doi.org/10.48550/arXiv.2312.07343
  • Meyer et al. (2023) Jesse G. Meyer, Ryan J. Urbanowicz, Patrick C. N. Martin, Karen O’Connor, Ruowang Li, Pei-Chen Peng, Tiffani J. Bright, Nicholas Tatonetti, Kyoung Jae Won, Graciela Gonzalez-Hernandez, et al. 2023. ChatGPT and large language models in academia: opportunities and challenges. BioData Mining 16, 1 (2023), 20. https://doi.org/10.1186/s13040-023-00339-9
  • Nwana (1990) Hyacinth S. Nwana. 1990. Intelligent tutoring systems: an overview. Artificial Intelligence Review 4, 4 (1990), 251–277. https://doi.org/10.1007/BF00168958
  • Ogle et al. (2023) Derek H. Ogle, Jason C. Doll, A. Powell Wheeler, and Alexis Dinno. 2023. FSA: Simple Fisheries Stock Assessment Methods. https://CRAN.R-project.org/package=FSA. R package version 0.9.4.
  • OpenAI (2024) OpenAI. 2024. Assistants Overview - OpenAI API. https://platform.openai.com/docs/assistants/overview. Accessed: 2024-02-11.
  • OpenAI (2024) OpenAI. 2024. ChatGPT. https://openai.com/chatgpt. Accessed: 2024-02-11.
  • OpenAI (2024a) OpenAI. 2024a. Code Interpreter. https://platform.openai.com/docs/assistants/tools/code-interpreter. Accessed: 2024-02-11.
  • OpenAI (2024b) OpenAI. 2024b. Knowledge Retrieval. https://platform.openai.com/docs/assistants/tools/knowledge-retrieval. Accessed: 2024-02-11.
  • Pankiewicz and Baker (2023) Maciej Pankiewicz and Ryan S. Baker. 2023. Large Language Models (GPT) for automating feedback on programming assignments. arXiv preprint arXiv:2307.00150 (2023). https://doi.org/10.48550/arXiv.2307.00150
  • Perkins et al. (2023) Mike Perkins, Jasper Roe, Darius Postma, James McGaughran, and Don Hickerson. 2023. Detection of GPT-4 generated text in higher education: Combining academic judgement and software to identify generative AI tool misuse. Journal of Academic Ethics (2023), 1–25. https://doi.org/10.1007/s10805-023-09492-6
  • Phung et al. (2023a) Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023a. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. arXiv preprint arXiv:2302.04662 (2023). https://doi.org/10.48550/arXiv.2302.04662
  • Phung et al. (2023b) Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023b. Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors. International Journal of Management 21, 2 (2023), 100790. https://doi.org/10.48550/arXiv.2306.17156
  • Poldrack et al. (2023) Russell A. Poldrack, Thomas Lu, and Gašper Beguš. 2023. AI-assisted coding: Experiments with GPT-4. arXiv preprint arXiv:2304.13187 (2023). https://doi.org/10.48550/arXiv.2304.13187
  • Prather et al. (2023) James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, et al. 2023. The robots are here: Navigating the generative AI revolution in computing education. arXiv preprint arXiv:2310.00658 (2023). https://doi.org/10.1145/3623762.3633499
  • R Core Team (2022) R Core Team. 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
  • Ritter et al. (2007) Steven Ritter, John R. Anderson, Kenneth R. Koedinger, and Albert Corbett. 2007. Cognitive Tutor: Applied research in mathematics education. Psychonomic Bulletin & Review 14 (2007), 249–255. https://doi.org/10.3758/BF03194060
  • Sarsa et al. (2022) Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1. 27–43.
  • Savelka et al. (2023) Jaromir Savelka, Arav Agarwal, Christopher Bogart, and Majd Sakr. 2023. Large language models (GPT) struggle to answer multiple-choice questions about code. arXiv preprint arXiv:2303.08033 (2023). https://doi.org/10.48550/arXiv.2303.08033
  • Sheese et al. (2023) Brad Sheese, Mark Liffiton, Jaromir Savelka, and Paul Denny. 2023. Patterns of Student Help-Seeking When Using a Large Language Model-Powered Programming Assistant. arXiv preprint arXiv:2310.16984 (2023). https://doi.org/10.1145/3636243.3636249
  • Sleeman and Brown (1982) Derek Sleeman and John Seely Brown. 1982. Intelligent tutoring systems. London: Academic Press.
  • Sottilare et al. (2012) Robert A. Sottilare, Keith W. Brawner, Benjamin S. Goldberg, and Heather K. Holden. 2012. The generalized intelligent framework for tutoring (GIFT). Orlando, FL: US Army Research Laboratory – Human Research & Engineering Directorate (ARL-HRED) (2012).
  • Sun et al. (2024) Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024). https://doi.org/10.48550/arXiv.2401.05561
  • Thirunavukarasu et al. (2023) Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940. https://doi.org/10.1038/s41591-023-02448-8
  • Thomas (2006) David R. Thomas. 2006. A general inductive approach for analyzing qualitative evaluation data. American Journal of Evaluation 27, 2 (2006), 237–246. https://doi.org/10.1177/1098214005283748
  • Trautwein and Köller (2003) Ulrich Trautwein and Olaf Köller. 2003. The relationship between homework and achievement—still much of a mystery. Educational Psychology Review 15 (2003), 115–145. https://doi.org/10.1023/A:1023460414243
  • VanLehn (2011) Kurt VanLehn. 2011. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist 46, 4 (2011), 197–221. https://doi.org/10.1080/00461520.2011.611369
  • Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022). https://doi.org/10.48550/arXiv.2206.07682
  • Wu et al. (2023b) Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F. Wong, and Lidia S. Chao. 2023b. A survey on LLM-generated text detection: Necessity, methods, and future directions. arXiv preprint arXiv:2310.14724 (2023). https://doi.org/10.48550/arXiv.2310.14724
  • Wu et al. (2023a) Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023a. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023). https://doi.org/10.48550/arXiv.2303.17564
  • Zamfirescu-Pereira et al. (2023) J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21. https://doi.org/10.1145/3544548.3581388
  • Zhang et al. (2024) Yixuan Zhang, Yimeng Wang, Nutchanon Yongsatianchot, Joseph D. Gaggiano, Nurul M. Suhaimi, Anne Okrah, Miso Kim, Jacqueline Griffin, and Andrea G. Parker. 2024. Profiling the Dynamics of Trust & Distrust in Social Media: A Survey Study. (2024). https://doi.org/10.1145/3613904.3642927
  • Zhou et al. (2023) Jiawei Zhou, Yixuan Zhang, Qianni Luo, Andrea G. Parker, and Munmun De Choudhury. 2023. Synthetic lies: Understanding AI-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–20. https://doi.org/10.1145/3544548.3581318
  • Zhou et al. (2024) Kyrie Zhixuan Zhou, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Ted Underwood, Ece Gumusel, Mengyi Wei, Abhinav Choudhry, and Jinjun Xiong. 2024. “The teachers are confused as well”: A Multiple-Stakeholder Ethics Discussion on Large Language Models in Computing Education. arXiv preprint arXiv:2401.12453 (2024). https://doi.org/10.48550/arXiv.2401.12453
  • Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022). https://doi.org/10.48550/arXiv.2211.01910