Evaluating the Effectiveness of LLMs in Introductory Computer Science Education: A Semester-Long Field Study (2024)

Wenhan Lyu, Yimeng Wang, Tingting (Rachel) Chung, Yifan Sun, and Yixuan Zhang
William & Mary, Williamsburg, VA, USA
wlyu@wm.edu, ywang139@wm.edu, rachel.chung@mason.wm.edu, ysun25@wm.edu, yzhang104@wm.edu


Abstract.

The integration of AI assistants, especially through the development of Large Language Models (LLMs), into computer science education has sparked significant debate, highlighting both their potential to augment student learning and the risks associated with their misuse. An emerging body of work has looked into using LLMs in education, primarily focusing on evaluating the performance of existing models or conducting short-term human subject studies. However, very little work has examined the impact of LLM-powered assistants on students in entry-level programming courses, particularly in real-world contexts and over extended periods. To address this research gap, we conducted a semester-long, between-subjects study with 50 students using CodeTutor, an LLM-powered assistant developed by our research team. Our results show that students who used CodeTutor (the “CodeTutor group,” serving as the experimental group) achieved statistically significant improvements in their final scores compared to peers who did not use the tool (the control group). Within the CodeTutor group, those without prior experience with LLM-powered tools demonstrated significantly greater performance gains than their counterparts. Students expressed positive feedback regarding CodeTutor’s capability to comprehend their queries and assist in learning programming language syntax, but voiced concerns about its limited role in developing critical thinking skills. Over the course of the semester, students’ agreement with CodeTutor’s suggestions decreased, with a growing preference for support from traditional human teaching assistants. Our findings also show that students turned to CodeTutor for different tasks, including programming task completion, syntax comprehension, and debugging, particularly seeking help with programming assignments. Our analysis further reveals that the quality of user prompts was significantly correlated with the effectiveness of CodeTutor’s responses. Building upon these results, we discuss the need to integrate Generative AI literacy into curricula to foster critical thinking skills and examine the temporal dynamics of user engagement with LLM-powered tools. We further discuss the discrepancy between the anticipated functions of these tools and students’ actual capabilities, which sheds light on the need for tailored strategies to improve educational outcomes.

Field study, Large Language Models, Tutoring

journalyear: 2024; copyright: rightsretained; conference: Proceedings of the Eleventh ACM Conference on Learning @ Scale, July 18–20, 2024, Atlanta, GA, USA; booktitle: Proceedings of the Eleventh ACM Conference on Learning @ Scale (L@S ’24), July 18–20, 2024, Atlanta, GA, USA; doi: 10.1145/3657604.3662036; isbn: 979-8-4007-0633-2/24/07; ccs: Human-centered computing, Human computer interaction (HCI)

1. Introduction

Recent advancements in Generative AI and Large Language Models (LLMs), exemplified by GitHub Copilot (GitHub, Inc., 2024) and ChatGPT (OpenAI, 2024), have demonstrated their capacity to tackle complex problems with human-like proficiency. These innovations raise significant concerns within the educational domain, particularly because students might misuse these tools, thereby compromising the quality of education and breaching academic integrity norms (Perkins et al., 2023). Entry-level computer science education is directly affected by the progress in LLMs (Zhou et al., 2024): because LLMs can handle programming tasks, they can complete many assignments typically given in introductory courses, making them highly appealing to students looking for easy solutions.

Despite these challenges, LLM-powered tools offer great opportunities to enrich computer science education (Kumar et al., 2023). When used ethically and appropriately, they can serve as powerful educational resources. For instance, LLMs can provide students with instant feedback on their coding assignments or generate diverse code examples that help demonstrate programming concepts (Pankiewicz and Baker, 2023). Moreover, as Generative AI becomes commonplace in production environments, familiarizing students with these technologies is increasingly a crucial aspect of computer science education.

The unique challenges posed by LLMs stem from the difficulty of detecting the use of AI tools (Wu et al., 2023b; Zhou et al., 2023). Traditional approaches, such as plagiarism detection software, fall short in determining the originality of student submissions (Meyer et al., 2023). Given the challenges of identifying LLM usage and the potential advantages of these technologies, we consider integrating LLMs into computer science education inevitable. Yet, even though students have already started using such tools, the impact of LLMs on computer science education remains unknown. A growing body of research has begun to explore the application of LLMs within educational settings, primarily focusing on assessing the capabilities of current models with existing datasets or previous assignments from students (Hicke et al., 2023; Mehta et al., 2023). However, there is still a research gap in understanding how students interact with LLM-powered tools in introductory programming classes, particularly regarding their engagement in genuine learning settings over extended periods. Furthermore, while previous studies have shown individual differences in the effectiveness of intelligent tutoring systems (Kulik and Fletcher, 2016), research into how these differences apply to LLM-powered tools is lacking. Investigating these variations is important for tailoring educational strategies to diverse student needs. In short, understanding students’ nuanced attitudes toward and interactions with LLM-powered tools in CS education over extended periods is crucial for identifying the evolving challenges and opportunities LLMs introduce.

To address the research gap, we asked the following research questions (RQs) in this work:
RQ1. Does the integration of LLM-powered tools in introductory programming courses enhance or impair students’ learning outcomes, compared to traditional teaching methods? How are individual differences associated with students’ learning outcomes using LLM-powered tools?
RQ2. What are students’ attitudes towards LLM-powered tools, how do they change over time, and which factors might influence these attitudes?
RQ3. How do students engage with LLM-powered tools, and how do they respond to their programming needs?

We believe that addressing these research questions is critical for enabling educators and researchers to make informed decisions about incorporating LLMs into their courses and for guiding students on the optimal and responsible use of LLM-powered tools. To answer them, we conducted a longitudinal, between-subjects field study with 50 students over the fall semester, from September to December 2023, using a web-based tool we developed called CodeTutor.

The contributions of this work are: 1) We conducted a semester-long longitudinal field study to assess the effectiveness of an LLM-powered tool (CodeTutor) on students’ learning outcomes in an introductory programming course. By comparing the performance of students who used CodeTutor against those who did not, our study contributes new empirical evidence regarding the role of LLM-powered tools in the programming learning experience; 2) We characterized patterns of student engagement with CodeTutor and analyzed the ways in which it can meet students’ learning needs. Through the analysis of conversational interactions and feedback loops between students and the tool, we contributed new knowledge regarding how CodeTutor facilitates or impedes learning; and 3) We offered insights and outlined design implications for future research.

2. Related Work

2.1. Intelligent Tutoring Systems

Using computerized tools to assist education is not a new idea. The concept of using computers to support learning emerged as early as the 1950s (Nwana, 1990), and, once artificial intelligence was brought into the picture, it evolved into Intelligent Tutoring Systems (ITS) (Sleeman and Brown, 1982). ITS leverage artificial intelligence to provide personalized learning experiences in computer science education, adapting instruction and feedback to individual student needs (Anderson et al., 1985; Elsom-Cook, 1984). These systems have enhanced student engagement, comprehension, and problem-solving skills by offering tailored support and immediate feedback, similar to one-on-one tutoring (VanLehn, 2011; Demszky and Liu, 2023). Research has demonstrated that ITS can significantly improve understanding of complex concepts in programming courses compared to traditional teaching methods, leading to higher student satisfaction due to the personalized learning environment (Corbett et al., 1997; Ritter et al., 2007). The Internet further enabled ITS to offer more interactivity and adaptivity (Brusilovsky et al., 1996, 1998; Butz et al., 2006), paving the way for later advances powered by natural language processing techniques (ElSaadawi et al., 2008; Hooshyar et al., 2015).

However, prior work has shown that as the granularity of tutoring decreases, its effectiveness increases (VanLehn, 2011). Significant limitations of ITS include the complexity and cost of building them, their inability to answer questions and handle tasks outside their programmed domains, and the difficulty of developing systems that can be used productively by individuals without expertise (Graesser et al., 2018). Even though the Generalized Intelligent Framework for Tutoring (GIFT) (Sottilare et al., 2012) was proposed and has evolved to support developing ITS at scale, these limitations mostly remain unresolved.

2.2. Large Language Models in CS Education

The release of ChatGPT and other Generative AI applications brought LLMs into public view and attracted enormous attention (Achiam et al., 2023; Sun et al., 2024). LLMs offer researchers and users the flexibility to employ a single tool across various tasks (Wei et al., 2022), such as medical research (Thirunavukarasu et al., 2023; Clusmann et al., 2023), finance (Wu et al., 2023a), and education (Kasneci et al., 2023). The adoption of LLM-powered tools in educational settings is facilitated by their broad accessibility and cost-free nature (Zamfirescu-Pereira et al., 2023). Recent studies have looked into the potential of AI assistants to enhance student learning by helping with problem-solving (Ahmed et al., 2022; Leinonen et al., 2023b; Phung et al., 2023a) and generating computer science content (Sarsa et al., 2022; Denny et al., 2022). Current research on the use of LLMs in education has primarily examined their performance and capabilities (Prather et al., 2023) compared to humans, such as generating code for programming tasks (Leinonen et al., 2023a; Poldrack et al., 2023), answering general inquiries (Savelka et al., 2023; Phung et al., 2023b), and addressing textbook questions (Jalil et al., 2023) and exam questions (Dobslaw and Bergh, 2023).

Despite the growing interest in examining the capabilities of LLMs in education, very few empirical studies have examined the emerging concerns regarding their impact. There is therefore an urgent need for research into the long-term effects of LLMs in CS education and for strategies to counteract potential negative consequences. One notable exception is the work by Liffiton et al. (Liffiton et al., [n. d.]), who developed a tool called CodeHelp to assist students with their debugging needs in an undergraduate course over 12 weeks. Their follow-up study (Sheese et al., 2023) categorized the message history from their tool and found a positive relationship between tool usage and course performance. However, their study specifically focused on debugging issues and did not compare the outcomes with those achieved through traditional TA support.

Furthermore, prior research has demonstrated that individual differences, such as gender, race, and prior experiences with technologies, significantly influence the effectiveness of intelligent tutoring systems (Kulik and Fletcher, 2016). However, work that examines how individual differences affect interactions with and perceptions of LLM-powered tools in educational settings is sparse, even though understanding the role of demographic and individual variability is crucial (Zhang et al., 2024). This is particularly important for developing inclusive and effective educational tools that suit the diverse needs of students.

Our work seeks to address these research gaps by conducting a field study that evaluates the use of LLM-powered tools over an extended period of time. In particular, our study not only aims to evaluate the practicality of LLMs in programming education contexts, but also intends to contribute a more nuanced understanding of their long-term implications for learning and teaching methodologies.

3. Method

In this section, we describe the design of CodeTutor (Section 3.1), an overview of our participants (Section 3.2), our study procedure and data collection (Section 3.3), and our quantitative and qualitative data analysis (Section 3.4). The source code of CodeTutor, the pre-test questions, and the data analysis code are available at osf.io/e3zgh.

3.1. Design of CodeTutor

[Figure 1: The main interface of CodeTutor.]

We developed CodeTutor, a browser-based web application, using TypeScript and front-end frameworks (e.g., SolidJS and Astro, and libraries such as Zag) for a responsive and interactive user interface. CodeTutor integrates the OpenAI API to access the GPT-3.5 model offered by OpenAI. The main interface is shown in Figure 1.
Login. Students log in to CodeTutor using their email addresses, with a randomly generated unique identifier (UID) that tracks their activities anonymously.
User Interface. The CodeTutor interface features a navigation sidebar and a central chat area. The sidebar enables easy navigation, with a button for starting new conversations and a chronological listing of existing ones for quick access.
User Feedback Structure. Feedback is central to CodeTutor for understanding user engagement and students’ attitudes towards the tool. CodeTutor provides two feedback mechanisms: 1) conversation-level and 2) message-level feedback.
Data Storage. CodeTutor stores data locally in the user’s browser with IndexedDB and uploads only essential information to our secure server for research purposes, where each conversation is identified by a unique ID for anonymous tracking. To protect privacy, CodeTutor cannot read stored data back from our server.
API Usage. OpenAI offered only limited configuration options for its API at the time we started our experiment, so we carefully crafted the system role text in our implementation to instruct the model to answer questions as a teaching assistant in an entry-level Python class. This keeps the answers returned by the OpenAI API consistent even when the length of a conversation exceeds the model’s token limit.
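To make this setup concrete, the sketch below shows how a system-role message can pin a chat model to a teaching-assistant persona. CodeTutor itself is a TypeScript application and its exact prompt text is not reproduced here; the Python client, model name, prompt wording, and helper function are illustrative assumptions only.

```python
# Minimal sketch (not CodeTutor's actual implementation) of pinning a
# teaching-assistant persona via a system-role message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_ROLE = (  # hypothetical wording; the paper does not publish its prompt
    "You are a teaching assistant for an entry-level Python programming course. "
    "Answer questions clearly, explain reasoning step by step, and prefer simple, "
    "human-readable code over clever one-liners."
)

def ask_codetutor(history: list[dict], user_message: str) -> str:
    """Send the running conversation plus a new user message to the model.

    Re-sending the system message with every request keeps the assistant's
    behavior consistent even when older turns are truncated to fit the
    model's token limit.
    """
    messages = [{"role": "system", "content": SYSTEM_ROLE}] + history + [
        {"role": "user", "content": user_message}
    ]
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0.2,
    )
    return response.choices[0].message.content
```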

3.2. Participants

Upon approval from our institution’s Institutional Review Board (IRB), we conducted a field evaluation study with 50 participants. The study took place in the Computer Science Department of a four-year university in the United States. Our criteria for participation were that participants be 18 years or older, able to speak and write in English, and enrolled as entry-level undergraduate computer science students at our institution. Table 1 presents an overview of our participants’ demographic information.

Table 1. Overview of participant demographics.

Characteristic                    Option                       Number of participants
Gender                            Woman                        22
                                  Man                          25
                                  Non-binary                   1
                                  Prefer not to say            2
Major                             Computer Science             18
                                  Data Science                 9
                                  Biology                      5
                                  Mathematics                  4
                                  Economics                    3
                                  Others                       10
                                  Not reported                 1
Year of Study                     Freshman                     37
                                  Sophomore                    5
                                  Junior                       6
                                  Senior                       1
                                  Not reported                 1
Race                              African American or Black   1
                                  Asian                        17
                                  Multiracial                  3
                                  White                        26
                                  Not reported                 3
Ethnicity                         Latino/Hispanic              3
Prior Experience with LLM tools   Only ChatGPT                 28
                                  ChatGPT and other tools      11
                                  Never used                   11

3.3. Study Procedure & Data Collection

Our field study lasted from September 27 (after the course add-drop period) to December 11, 2023 (the final exam due date). Below, we describe each component of our study.

3.3.1. Pre-test

Participants were first asked to provide their consent to participate after being informed about the study’s objectives, procedures, and their rights as participants, including the right to withdraw at any time without penalty. Following the consent process, a pre-test assessment was administered to evaluate students’ existing knowledge of Python programming, providing a baseline for subsequent analysis.

This pre-test included three sections with Python questions, with a total of 22 questions that varied in difficulty for an evaluation of participant skills. The first section featured eight questions (Questions 1-8, for example, “What is the output of the following code: print(3+4)?” ), the second section included seven questions of medium difficulty (Questions 9-15, for example, “If I wanted a function to return the product of two numbers a and b, what should the return statement look like?”), and the third section presented seven challenging questions (Questions 16-22, for example, “What will be the output of the following code? [Multiple lines of code]”). The total score of the three sections was 100 points. Pre-test submissions were graded by our researchers with Computer Science backgrounds, using predetermined scoring criteria.

This pre-test also asked about participants’ prior experience with LLMs, specifically asking, “Which of the following Large Language Model AI tools have you used before? Please select all that apply.” Participants were also asked to provide demographic information, including their major (or intended major), gender, and race/ethnicity. Participants were assured that all demographic information would remain anonymous and be used solely for research purposes.

3.3.2. Control vs. Experimental Group

Participants were divided into two groups: the control group, which used traditional learning methods and had access to human teaching assistants (TAs) for additional support outside class hours, and the experimental group, which used CodeTutor as their primary educational tool beyond class hours, alongside access to standard learning materials and human TAs. Using LLM-based tools other than CodeTutor in this course was prohibited.

To divide participants into a control group and an experimental group, we initially sorted the entire sample based on their previous engagement with LLM-powered tools, resulting in two groups: those who had used any LLM-powered tools before (Used Before) and those who had not (Never Used). Within the Used Before category, we split the participants into two subsets, Used Before Subset A and Used Before Subset B, based on the overall pre-test score distribution to ensure that both subsets were representative of the wider group. The same process was applied to the Never Used group, generating two additional subsets: Never Used Subset A and Never Used Subset B. The experimental group was then formed by combining Used Before Subset A with Never Used Subset A, while the control group consisted of the combination of Used Before Subset B and Never Used Subset B. This method ensured that the experimental and control groups were balanced in terms of prior experience with chatbots and pre-test performance (see Figure 2).

[Figure 2: Procedure for dividing participants into the experimental and control groups based on prior LLM-tool experience and pre-test scores.]
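The assignment procedure described above can be sketched as a stratified split. The field names, pairing strategy, and random seed below are our own illustration, not the authors’ code:

```python
import random

def assign_groups(participants, seed=0):
    """Split participants into experimental and control groups, balanced on
    prior LLM-tool experience and pre-test score distribution (sketch only)."""
    rng = random.Random(seed)
    experimental, control = [], []
    # Stratify by prior experience with LLM-powered tools.
    for used_before in (True, False):
        stratum = [p for p in participants if p["prior_llm_use"] == used_before]
        stratum.sort(key=lambda p: p["pretest"])
        # Walk down the score-ordered list in pairs and randomly send one member
        # of each pair to each group, so both subsets mirror the stratum's
        # pre-test distribution.
        for i in range(0, len(stratum), 2):
            pair = stratum[i:i + 2]
            rng.shuffle(pair)
            experimental.append(pair[0])
            if len(pair) > 1:
                control.append(pair[1])
    return experimental, control
```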

Following their group assignments, students in the experimental group were sent detailed instructions via email on how to access and use CodeTutor. In the field study, participants were not mandated to adhere to a specific frequency of engagement with CodeTutor; instead, they were encouraged to utilize the tool at their own pace. This approach allowed for a naturalistic observation of how students integrate LLM-powered educational resources into their learning processes, without imposing additional constraints that could influence their study habits or the study’s outcomes.

3.3.3. Student Evaluation

At the end of the semester, students’ final grades were used as a primary measure to assess their learning outcomes and the impact of CodeTutor interventions. While acknowledging that final grades are influenced by various factors, they offer a standardized measure of overall academic success, enabling an assessment of CodeTutor’s role in improving student learning outcomes.

Final grades were determined by a weighted average of several components for each student: labs (practical mini-projects), assignments (individual coding tasks, such as array summation), mid-terms, and a final exam (comprising questions similar to those in the pre-test). Note that a student’s final grade can exceed 100 if bonus points are awarded throughout the semester. Access to CodeTutor was restricted during mid-terms and final exams, which divides the assessment components into two groups: CodeTutor-Allowed (labs and assignments) and CodeTutor-Not-Allowed (mid-terms and final exams). This categorization facilitates an analysis of CodeTutor’s impact on student performance by examining potential dependencies on the tool and the improvement of learning outcomes in its absence.
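As a minimal sketch of this grading scheme, the snippet below computes a weighted final grade and the two accessibility-category means. The component weights are placeholders because the paper does not report them; only the CodeTutor-Allowed/Not-Allowed grouping follows the text.

```python
# Hypothetical weights; the paper does not report the actual weighting.
WEIGHTS = {"labs": 0.25, "assignments": 0.25, "midterms": 0.25, "final_exam": 0.25}
CODETUTOR_ALLOWED = {"labs", "assignments"}          # tool permitted
CODETUTOR_NOT_ALLOWED = {"midterms", "final_exam"}   # tool restricted

def final_grade(scores: dict, bonus: float = 0.0) -> float:
    """Weighted average of the components; bonus points may push it past 100."""
    return sum(WEIGHTS[c] * scores[c] for c in WEIGHTS) + bonus

def component_means(scores: dict) -> tuple[float, float]:
    """Mean score in each CodeTutor accessibility category, as compared in RQ1."""
    allowed = sum(scores[c] for c in CODETUTOR_ALLOWED) / len(CODETUTOR_ALLOWED)
    not_allowed = sum(scores[c] for c in CODETUTOR_NOT_ALLOWED) / len(CODETUTOR_NOT_ALLOWED)
    return allowed, not_allowed

# Example with made-up scores:
scores = {"labs": 95, "assignments": 98, "midterms": 88, "final_exam": 90}
print(final_grade(scores, bonus=2.0), component_means(scores))
```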

3.4. Data Analysis

3.4.1. Quantitative Data Analysis

We examined students’ scores, interaction behaviors, and attitudes toward using CodeTutor through multiple statistical analyses.

First, we calculated descriptive statistics for all variables, including frequencies with percentages for categorical variables and means and standard deviations for continuous variables. To examine the variation in students’ scores before and after the intervention (i.e., the use of CodeTutor), we conducted paired t-tests for both the experimental and control groups. Multiple regression analyses with family-wise p-value adjustment were used to examine the effects of CodeTutor on score improvement, taking into account students’ past experiences using LLM-powered tools and demographic variables, such as major, gender, and race. We then investigated the impact of CodeTutor accessibility on academic performance using ANOVA. Moreover, we conducted a chi-squared test to explore the relationship between the quality of students’ prompts and CodeTutor’s performance. To understand students’ attitudes towards CodeTutor, we calculated Spearman’s correlation matrix for continuous variables, given that our data are non-normal and exhibit unequal variance. Furthermore, to examine differences between questions, we used the Kruskal-Wallis rank sum test (using the R package stats (R Core Team, 2022)) and then performed post-hoc comparisons using Dunn’s test (using the R package FSA (Ogle et al., 2023)) in cases where significant differences were found. To investigate the effect of time on students’ attitudes towards CodeTutor, we fitted a linear mixed effects (LME) model (using the R package lme4 (Bates et al., 2015)). We considered results statistically significant at p < 0.05 in most cases, except in the multiple regression analyses, where we used p < 0.1 and reported effect sizes to indicate the strength of the relationships between variables.
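For readers who want to reproduce the general shape of this pipeline, the sketch below shows Python equivalents of a few of the tests named above; the study itself used R (stats, FSA, lme4), and the score and rating arrays here are hypothetical.

```python
# Illustrative Python equivalents of some analyses; not the authors' R code.
import numpy as np
from scipy import stats

pre = np.array([62.0, 71.0, 55.0, 80.0])      # hypothetical pre-test scores
final = np.array([75.0, 83.0, 60.0, 88.0])    # hypothetical final scores

# Paired t-test: did scores change within a group from pre-test to final?
t_stat, p_value = stats.ttest_rel(pre, final)

# Spearman correlation between two attitude items (ordinal, non-normal data).
comprehension = [5, 4, 4, 3]
syntax_mastery = [5, 5, 3, 3]
rho, rho_p = stats.spearmanr(comprehension, syntax_mastery)

# Kruskal-Wallis rank-sum test across attitude questions; the paper follows
# this with post-hoc pairwise comparisons when the omnibus test is significant.
h_stat, kw_p = stats.kruskal(comprehension, syntax_mastery, [2, 3, 2, 1])

print(t_stat, p_value, rho, rho_p, h_stat, kw_p)
```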

3.4.2. Qualitative Data Analysis

We also analyzed the conversation history between users and CodeTutor. Specifically, we used the General Inductive Approach (Thomas, 2006) to guide our thematic analysis of the conversational data. The first author conducted a close reading of the data to gain a preliminary understanding and then labeled text segments to formulate categories, which served as the basis for constructing low-level codes capturing specific elements of the user-CodeTutor interactions. Similar low-level codes were then clustered together into high-level themes. During the analysis, the research team engaged in ongoing discussions to refine and clarify emerging themes.

4. Results

In this section, we examine the impact of CodeTutor on student academic performance (Section 4.1, answering RQ1), analyze students’ attitudes towards learning with CodeTutor (Section 4.2, answering RQ2), and characterize their engagement patterns in entry-level programming courses (Section 4.3, answering RQ3).

4.1. RQ1: Learning Outcomes with CodeTutor

4.1.1. Comparative Analysis of Score Improvements

Overall, students in the experimental group exhibited a greater average improvement in scores, as illustrated by comparing their pre-test and final scores to those in the control group. Specifically, the average increase for the experimental group was 12.50, whereas the control group showed an average decrease of 3.17 when comparing final scores to pre-test scores.

We conducted paired t-tests for both the experimental and control groups to determine if the observed improvements were statistically significant, starting from the premise that there were no differences in pre-test scores between the two groups. Our null hypothesis assumed that the true mean difference between pre-test and final scores was zero. For the control group, the null hypothesis could not be rejected, suggesting that the differences between pre-test and final scores were not statistically significant (t = -0.879, p = 0.394). Conversely, participants in the experimental group demonstrated a statistically significant improvement from the pre-test to the final scores (t = -2.847, p = 0.009).

Furthermore, when examining the improvement in the CodeTutor-Not-Allowed components, the experimental group exhibited an average increase of 7.33, whereas the control group showed no significant change. A paired t-test comparing the pre-test and final exam scores (during which the use of CodeTutor was not permitted) showed that students in the experimental group demonstrated a statistically significant improvement (t = -2.405, p = 0.026). This result suggests that students who used CodeTutor exhibit more substantial improvement even when CodeTutor is unavailable.

4.1.2. Effect of CodeTutor Accessibility on Academic Performance

[Figure 3: Mean scores for the CodeTutor-Allowed and CodeTutor-Not-Allowed assessment components in the experimental group.]

By grouping the assessment components into the CodeTutor-Allowed and CodeTutor-Not-Allowed categories, we examined the relationship between CodeTutor’s accessibility and student academic performance. Applying ANOVA to the data from the experimental group, Figure 3 shows that the mean score for the CodeTutor-Allowed category is 102.29, in contrast to the CodeTutor-Not-Allowed components, which have a mean score of 93.40. The statistical analysis shows a significant difference between the two categories (t = 2.31, p = 0.03), suggesting that the allowance of CodeTutor correlates with higher student scores.

4.1.3. Correlation Between Student Demographics and Final Scores in the Experimental Group

Table 2. Multiple regression results for final scores in the experimental group.

                                             Estimate   Std. Error   t value   Pr(>|t|)
Const                                         93.683      3.877      24.166    0.000 ***
Prior Experience with LLM tools (Reference: Used before)
  Never used                                  18.877      5.054       3.735    0.032 *
Major (Reference: Computer Science)
  Data Science                                14.532      5.662       2.567    0.073
  Mathematics                                 17.692      5.852       3.023    0.057
  Biology                                     16.257      5.662       2.871    0.057
  Economics                                    1.362      4.799       0.284    0.784
  Others                                     -13.004      6.022      -2.160    0.115
Gender (Reference: Female)
  Male                                         5.917      3.845       1.539    0.223
Race (Reference: White)
  Asian                                       -7.831      3.933      -1.991    0.128
  African American or Black                    8.099      7.107       1.140    0.322
  Others                                       6.102      5.416       1.127    0.322

Subsequently, we evaluated demographic factors to determine whether specific student groups, particularly those with prior technology experience, experienced greater benefits from CodeTutor. Table 2 shows the results of multiple regression models examining how students’ final scores in the experimental group are associated with their history with LLM tools, major, gender, and race. Students who had never used any LLM-powered tools showed a significantly larger increase in final score (β = 18.877, p = 0.032) than students who had used such tools before.

Moreover, differences in final scores among majors within the experimental group were statistically significant, indicating that major plays a substantial role in final scores in the experimental group. Students majoring in data science (β = 14.532, p = 0.073), mathematics (β = 17.692, p = 0.057), and biology (β = 16.257, p = 0.057) exhibited a positive association with final scores (significant at the relaxed p < 0.1 threshold) compared to those majoring in computer science, suggesting that these majors achieved higher final scores. In terms of gender, no significant effects were observed, indicating no difference between genders in final scores. Additionally, no significant differences were noted across races in final scores.

Summary of results of RQ1: Collectively, our findings suggest that students in the experimental group achieved significant score improvements with CodeTutor. In particular, those who were new to LLM-powered tools achieved even greater improvements, while students majoring in data science, mathematics, and biology surpassed their computer science counterparts. Moreover, students exhibited higher scores when permitted to use CodeTutor.

4.2. RQ2: Students’ Attitudes towards CodeTutor

4.2.1. Descriptive Analysis

In terms of students’ attitudes towards CodeTutor (see Figure 4 for the specific questions), we found that a small portion of students (8%) strongly disagreed or disagreed that CodeTutor accurately understood what they intended to ask, while most (67%) agreed or strongly agreed. In addition, 35% strongly disagreed or disagreed that CodeTutor helped them think critically, while 19% agreed or strongly agreed. Furthermore, 13% of students disagreed that CodeTutor improved their understanding of programming syntax, with a larger proportion agreeing (33%) or strongly agreeing (25%). Nearly half of the students (42%) agreed or strongly agreed that CodeTutor helped them build their own understanding, while 17% strongly disagreed or disagreed. Finally, regarding the potential of CodeTutor to substitute for a human teaching assistant (note: response values to the TA Replacement question were reversed in our analysis, so a higher score indicates a stronger preference for our tool over human teaching assistants; this reversal is applied consistently across all subsequent analyses), 20% of the students strongly disagreed or disagreed with this notion, while 42% agreed or strongly agreed. Figure 4 shows the distribution of students’ responses across these five questions.

[Figure 4: Distribution of students’ responses to the five attitude questions about CodeTutor.]

4.2.2. Exploring Relationships in Student Attitudes Toward CodeTutor

[Figure 5: Spearman correlation matrix of students’ attitudes towards CodeTutor.]
[Figure 6: Post-hoc pairwise comparisons of students’ attitudes across the five questions.]
Table 3. Linear mixed effects model estimates (β, with standard errors) for students’ attitudes over time.

        Comprehension       Critical Thinking   Syntax Mastery      Independent Learning   TA Replacement
Const   4.700 (0.297)***    2.690 (0.247)***    3.760 (0.262)***    3.044 (0.218)***       3.964 (0.330)***
Time    -0.114 (0.039)**    0.040 (0.037)       -0.018 (0.041)      0.054 (0.036)          -0.099 (0.051)

Figure 5 reveals key relationships among students’ attitudes towards CodeTutor. The moderate positive correlation between Comprehension and Syntax Mastery suggests that higher ratings on one are associated with higher ratings on the other. Critical Thinking is slightly positively correlated with Comprehension and Independent Learning but slightly negatively correlated with TA Replacement. Furthermore, Syntax Mastery correlates strongly with Independent Learning, indicating a close relationship between mastering programming syntax and self-directed learning outcomes. In addition, TA Replacement has minimal to no significant correlations with the other variables, suggesting its effects vary independently of these educational aspects.

To further explore the differences in students’ attitudes across questions, we present the results of multiple comparisons across the five questions. Specifically, our results show that respondents’ attitudes differ significantly across questions (χ² = 32.99, p < 0.05). Our post-hoc tests (see Figure 6) further reveal that students agreed significantly less that CodeTutor assists in fostering critical thinking than that it understands their queries, helps them learn syntax, or can serve as a replacement for a teaching assistant. Moreover, our findings suggest that respondents agreed significantly more with CodeTutor’s effectiveness in comprehension than with its ability to improve their understanding of programming syntax.

We then fitted a linear mixed effects (LME) model to explore the influence of time on students’ attitudes toward CodeTutor:

$\mathit{QuestionIndicator}_{it} = \beta_0 + b_{0i} + (\beta_1 + b_{1i})\,t + \epsilon_{it}$

where β₀ and β₁ are unknown fixed-effect parameters; b_{0i} and b_{1i} are the unknown student-specific random intercept and slope, respectively, which are assumed to follow a bivariate normal distribution with mean zero and covariance matrix D; QuestionIndicator_{it} is student i’s response at time t; and ε_{it} is the residual error for student i at time t, with a normal distribution N(0, σ²), assumed to be independent of the random effects. From Table 3, we can see that students’ attitudes toward CodeTutor show a significant decrease in Comprehension (β = -0.114, p < 0.01), indicating that, over time, students increasingly disagree that CodeTutor understands them accurately. Moreover, there is a weakly significant decrease in TA Replacement (β = -0.099, p < 0.1) over time, showing a slight tendency to consider human TA help more as the semester progresses. Students show no significant change over time in Critical Thinking, Syntax Mastery, or Independent Learning.
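The model above was fitted with R’s lme4 (roughly lmer(response ~ time + (time | student))); an equivalent random-intercept-and-slope fit can be sketched in Python with statsmodels, as below. The simulated weekly ratings are purely illustrative.

```python
# Sketch of an equivalent random-intercept-and-slope model in statsmodels;
# the paper's model was fit with R's lme4, and this data is simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
rows = []
for student in range(1, 11):             # 10 hypothetical students
    intercept = 4.5 + rng.normal(0, 0.3)
    slope = -0.1 + rng.normal(0, 0.05)
    for week in range(8):                # 8 weekly attitude ratings
        rating = intercept + slope * week + rng.normal(0, 0.2)
        rows.append({"student": student, "time": week, "response": rating})
df = pd.DataFrame(rows)

# Fixed effect of time plus a per-student random intercept and slope.
model = smf.mixedlm("response ~ time", df, groups=df["student"], re_formula="~time")
print(model.fit().summary())
```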

Summary of results of RQ2: In summary, students recognize CodeTutor’s ability to understand their queries and assist with programming syntax yet question its capacity to promote critical thinking skills. Additionally, students’ confidence in CodeTutor’s comprehension abilities decreases over time, with a growing preference for support from human teaching assistants.

4.3. RQ3: Students’ Engagement with CodeTutor

In total, we documented 82 conversation sessions with CodeTutor (in our analysis, a conversation session is a continuous exchange of messages between users and CodeTutor within a specific period, characterized by a coherent topic or purpose), encompassing a total of 2,567 messages. In these sessions, 415 unique topics were discussed, averaging 5.06 topics per session and 6.19 messages per topic.

4.3.1. Message Classification & Interaction Patterns

In total, we collected 2,567 conversational messages exchanged between users and CodeTutor. Of these, 1,288 messages originated from the users, and CodeTutor responded with 1,279 messages.

Table 4 presents the categorization of messages between users and CodeTutor. Each category has a description and an example to illustrate the message type. Categories of messages sent by both users and CodeTutor include Programming Task inquiries, addressing specific Python programming challenges; Grammar and Syntax questions, focusing on Python’s basic grammar or syntax without necessitating runnable programs; General Questions, which are not directly related to Python; and Greetings, initiating or concluding an interaction.

On the users’ side, additional categories highlight their engagement with CodeTutor: Modification Requests for alterations to previous answers; Help Ineffective, indicating issues or errors in CodeTutor’s provided solutions; Further Information, elaborating on prior queries; and Debug Requests for assistance in resolving bugs or errors in code snippets.

CodeTutor’s responses are further classified into Corrections, which address and amend errors in previous responses, and Explanations, which provide further details on provided solutions or clarify why certain requests cannot be fulfilled.

Table 4. Categories of messages exchanged between users and CodeTutor (sender in parentheses).

  • Programming Task (users & CodeTutor): Any questions or answers related to Python programming. Example: “Write a function that prints the nth (argument) prime number.” (86.52%)
  • Grammar & Syntax (users & CodeTutor): Messages related to basic Python grammar or syntax problems, where a runnable program is most likely unnecessary. Example: “What does {} do in Python?” (14.26%)
  • General Question (users & CodeTutor): Messages not directly related to Python. Example: “What is ASCII?” (4.29%)
  • Greetings (users & CodeTutor): Greeting messages. Example: “Hello! How can I assist you today?” (0.62%)
  • Help Ineffective (users): The user states that the previous answer generated by CodeTutor is wrong or provides error information. Example: “This code still fails.” (12.86%)
  • Debug Request (users): The user asks CodeTutor to fix bugs or explain what was wrong in code snippets provided in previous messages. Example: “Debug this code. [Code Snippet]” (8.22%)
  • Modification Request (users): The user asks CodeTutor to change something in its previous answer. Example: “Remove comments.” (4.48%)
  • Further Information (users): The user provides more context on their previous input. Example: “All the input strings will be the same length.” (3.97%)
  • Explanation (CodeTutor): CodeTutor explains something in previous messages or why it cannot complete the current task. Example: “I’m sorry, but I need more information to provide the answers for questions 4 and 6.” (28.94%)
  • Correction (CodeTutor): CodeTutor corrects content in its previous answer. Example: “Apologies for the syntax error. Here is the corrected version: [Code Snippet]” (13.95%)

4.3.2. Analysis of Prompt Quality & Correlation with Response Effectiveness

To further examine user interaction patterns with CodeTutor and their implications for its educational value, we analyzed the relationship between prompt quality and response accuracy. This analysis stems from the premise that detailed and precise prompts are likely to improve the AI’s understanding of user requirements, thereby potentially raising the standard of its responses.

To do so, we evaluated a corpus of 1,190 prompts, after removing all greeting messages, to assess their quality. We defined “good quality” prompts as those providing sufficient detail for CodeTutor to generate an accurate response; “poor quality” prompts were those that did not meet this criterion. Our analysis showed that 37% of prompts were of good quality and the remaining 63% were of poor quality. We categorized the deficiencies in poor-quality prompts into four types: incomplete information (n = 189, 25%), lacking specific details necessary for CodeTutor to understand the context; lack of clear goals (n = 172, 23%), where the desired outcome was not explicitly stated; over-reliance on CodeTutor (n = 362, 48%), where assignment questions were directly copied and pasted into CodeTutor; and poor structural organization (n = 25, 3%), with unclear or confusing request structures. Prompts were further labeled as “working” if they elicited an appropriate response from CodeTutor, and “not working” if they failed to do so.

Using a chi-square test, we investigated whether prompt quality and the effectiveness of CodeTutor’s responses were independent. Our results showed a significant association (χ² = 144.84, p < 0.001). In other words, clearer and more detailed prompts are associated with responses that are more likely to be effective.
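A chi-square test of independence on such a 2×2 table (prompt quality × response effectiveness) can be sketched as follows; the cell counts are invented for illustration and are not the study’s data.

```python
# Sketch of the prompt-quality vs. response-effectiveness test with made-up counts.
from scipy.stats import chi2_contingency

#                 working  not working
table = [[380,  60],    # good-quality prompts  (hypothetical counts)
         [350, 400]]    # poor-quality prompts  (hypothetical counts)

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```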

Summary of results of RQ3: We characterized the messages exchanged between users and CodeTutor. We categorized these interactions into inquiries (e.g., programming tasks, syntax questions) and feedback, alongside CodeTutor’s responses (corrections and explanations), illustrating a dynamic exchange aimed at facilitating learning. We also found that the clarity and completeness of prompts are significantly correlated with the quality of responses from CodeTutor.

5. Discussion

Our semester-long field study provided insights into how students in introductory computer science courses used CodeTutor and its effects on educational outcomes. In short, our results show that 1) students who used CodeTutor showed significant improvements in scores; 2) while CodeTutor was valued for its assistance with comprehension and syntax, students expressed concerns about its capacity to enhance critical thinking skills; 3) skepticism regarding CodeTutor as an alternative to human teaching assistants grew over time; 4) CodeTutor was primarily used for various coding tasks, including syntax comprehension, debugging, and clarifying fundamental concepts; and 5) the effectiveness of CodeTutor’s responses was notably higher when prompts were clearer and more detailed. Building on these findings, we discuss the implications for future enhancements and research directions in the rest of this section.

5.1. Towards Enhancing Generative AI Literacy

Our research indicates a positive correlation between the use of Generative AI tools and improved student learning outcomes. However, 63% of student-generated prompts were deemed unsatisfactory, indicating a lack of essential skills to fully exploit Generative AI tools. This finding also suggests the need to promote Generative AI literacy among students. Here, we define Generative AI literacy as the ability to effectively interact with AI tools and understand how to formulate queries and interpret responses. Our findings suggest that while students can leverage CodeTutor for practical coding assistance and syntax understanding, there is a gap in using these tools to enhance critical thinking skills. We suggest educational programs integrate Generative AI literacy as a core component of their curriculum, teaching students how to use these tools for immediate problem-solving and engaging with them to promote deeper analytical and critical thinking. This could include workshops on effective query formulation, sessions on interpreting AI responses, and exercises designed to challenge students to critically evaluate the information and solutions offered by AI tools.

We also propose approaches to integrate HCI tools and principles into LLM-enabled platforms. One approach is prompt construction templates: structured forms that guide users in formulating more effective and precise questions. Templates could include placeholders for essential details and context, providing the information the AI needs to generate accurate responses (an example template is sketched below). Furthermore, integrating Critical Thinking Prompts might be particularly effective in stimulating in-depth analytical thinking. For example, the interface could pose follow-up questions encouraging users to critically assess the adequacy of AI answers. Questions such as “Does this response fully address your query?” or “What additional information might you need?” may prompt users to engage in a more thorough evaluation of the information provided, fostering a habit of critical reflection and assessment. Another possible approach is facilitating collaborative query building, which leverages the power of collective intelligence: by designing interfaces that support real-time collaboration, individuals can work together to construct and refine queries. We can also use LLMs themselves to evaluate and refine user questions on the fly, as they perform well at prompting (Zhou et al., 2022).
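As one concrete illustration of such a prompt-construction template (the field names below are our own, not a CodeTutor feature), a structured query form might look like this:

```python
# Hypothetical prompt-construction template with placeholders for the
# details an LLM needs to give an accurate, course-appropriate answer.
TEMPLATE = """\
Goal: {goal}
What I have tried so far: {attempt}
Relevant code or error message:
{context}
Constraints (e.g., only use concepts from week {week}): {constraints}
"""

query = TEMPLATE.format(
    goal="Write a function that returns the product of two numbers a and b",
    attempt="I wrote `return a + b` but the tests fail",
    context="def product(a, b):\n    return a + b",
    week=3,
    constraints="no external libraries",
)
print(query)
```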

5.2. Turning to the Temporal Dynamics of LLM-Powered Tutoring Tools

The temporal dynamics of using CodeTutor in computer science education present a nuanced perspective on its integration and effectiveness over time. Our analysis reveals a complex relationship between the duration of CodeTutor use and students’ attitudes towards it. Specifically, our results show that although students initially find CodeTutor a reliable tool for understanding their queries, their confidence in its accuracy diminishes with prolonged use. Additionally, our model uncovers a weakly significant decrease in students’ preference for CodeTutor as a TA replacement over time. This trend implies a growing inclination among students to seek human TA support as they progress in their courses, possibly due to the nuanced understanding and personalized feedback that human TAs can offer, which might not be fully replicated by LLMs. However, our study found no significant temporal change in students’ attitudes toward CodeTutor’s impact on critical thinking, syntax mastery, and independent learning. This stability suggests that while students may question CodeTutor’s comprehension abilities and its adequacy as a TA replacement over time, they still recognize its utility in facilitating certain aspects of the learning process, such as mastering syntax and promoting independent study habits.

Collectively, our findings highlight the importance of investigating the temporal dynamics of student attitudes towards and their use of LLM-powered tools for learning and shed light on the need for a balanced approach to integrating LLMs into CS education. While these tools offer great support in specific areas, their limitations become more apparent with extended use. In other words, it is important to complement LLMs with human instruction to address learning objectives, such as critical thinking and problem-solving, which are crucial for computer science education. Furthermore, we argue that educators and developers should work collaboratively to enhance the capabilities of LLM-powered tutoring systems, ensuring they remain effective and relevant over time.

5.3. Alignments of LLMs for Education

Our observations regarding students’ utilization of CodeTutor provide insights into their learning approaches and completion of assignments. The exams that prohibit using CodeTutor reflect students’ understanding of programming, as they must rely solely on their internal knowledge. Conversely, assignments and lab tasks that permit using CodeTutor result in higher scores, indicating that students may prioritize completion over deep comprehension (Gustafson, 2022). While students employ CodeTutor to fulfill homework requirements, they may not perceive it as a tool for a comprehensive understanding of course materials.

Our results show that nearly half of the low-quality prompts were classified as over-reliance, that is, original assignment questions copied and pasted directly into CodeTutor. This suggests that students primarily used CodeTutor as a quick-fix solution, neglecting the opportunity to engage with the underlying logic of a question and determine appropriate solutions. As the complexity of assignments increased, students’ perceptions of CodeTutor’s ability to understand their queries turned more negative. At the same time, students acknowledge its proficiency in syntax mastery, which reveals a gap between their expectations and the tool’s capabilities. Complex questions require students to integrate and apply the knowledge acquired in class (Trautwein and Köller, 2003), challenging the notion that CodeTutor can easily break questions down into manageable components. Additionally, CodeTutor’s limitations, such as being trained on a fixed dataset and struggling with custom or complex queries, suggest that it is important to simplify questions and structure prompts effectively for optimal results.

Furthermore, we argue that students’ previous experiences with chatbots, if unrelated to structured learning (for example, issuing a simple one-line request such as “help me write a summary”), may not adequately prepare them for using CodeTutor effectively in a programming context, as evidenced by our finding that 63% of student prompts in our corpus were of poor quality. Students with limited experience interacting with chatbots might also be hesitant to fully trust tools like CodeTutor, potentially affecting their use of and reliance on its outputs. This lack of familiarity could lead them to prefer traditional learning approaches, fostering deeper analytical thinking and minimizing dependency on automated assistance.

Design Implications. Our findings shed light on the future implementation and enhancement of CodeTutor in programming courses. The inherent limitations of CodeTutor, which relies on a model trained on general data, may necessitate the creation of custom datasets tailored to specific class contexts. Instructors’ reflections on the quality of students’ assignments make it evident that, while CodeTutor produces impressive results because it was trained on code written by professional programmers with efficiency in mind, entry-level classes should prioritize human-readable code over complex solutions. One potential direction is to leverage GPT models through the Assistants API (OpenAI, 2024). This API enables the development of AI assistants with features such as the Code Interpreter (OpenAI, 2024a), which can execute Python code in a sandboxed environment, and Knowledge Retrieval (OpenAI, 2024b), which allows users to upload documents to enhance the assistant’s knowledge base. These features align more closely with the requirements of a virtual TA in entry-level programming courses: the Code Interpreter can improve the quality of responses containing code blocks, while Knowledge Retrieval empowers instructors to provide course-specific information. Meanwhile, providing systematic instructions to students can improve their understanding of how to use the tool effectively, and additional instructional features can improve its accessibility. Additionally, it is crucial to emphasize the boundaries of using LLM-powered tools, clarifying what is permissible and the consequences of inappropriate usage.
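As a hedged sketch of this direction, an assistant with the Code Interpreter and retrieval tools could be created roughly as follows. The Assistants API was in beta at the time of writing, so method names, tool identifiers, and the model choice below may differ across SDK versions and are assumptions rather than the authors’ implementation.

```python
# Hedged sketch: creating a course-specific virtual TA with the (beta)
# OpenAI Assistants API. Names and tool types may vary by SDK version.
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Course-specific virtual TA",
    instructions=(
        "You are a TA for an entry-level Python course. Prefer simple, "
        "human-readable code and explain each step."
    ),
    model="gpt-4-turbo-preview",
    tools=[
        {"type": "code_interpreter"},  # runs Python in a sandbox to check answers
        {"type": "retrieval"},         # grounds answers in uploaded course materials
    ],
)
print(assistant.id)
```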

6. Limitations and Future Work

Our study, while providing valuable insights into the use of LLM-powered tools in educational settings, has several limitations that suggest avenues for further research. First, the current study was conducted on a relatively small scale, limiting the generalizability of our findings; our future work will therefore conduct larger-scale studies involving more diverse student populations and settings. Second, regarding applicability to different levels of coding courses, our work has focused on a beginning-level CS course, and our findings may not directly translate to intermediate or advanced programming courses. Furthermore, we relied on GPT-3.5 in this study, which may not always provide accurate or contextually appropriate responses, potentially affecting the quality of the tutoring provided. Lastly, controlling the experimental environment in a semester-long study, particularly for the control group, was challenging, indicating the need for more tightly controlled experimental designs in future studies to better understand the factors affecting student learning.

7. Conclusion

In this work, we conducted a semester-long, between-subjects study with 50 students to examine how students use an LLM-powered virtual teaching assistant (CodeTutor) in introductory-level programming learning. The experimental group using CodeTutor showed significant improvements in final scores over the control group, with first-time users of LLM-powered tools experiencing the most substantial gains. While positive feedback was received on CodeTutor’s ability to understand queries and aid in syntax learning, concerns were raised about its effectiveness in cultivating critical thinking skills. Over time, we observed a shift towards preferring human teaching assistant support over CodeTutor, despite its utility in completing programming tasks, understanding syntax, and debugging. Our study also shows the importance of prompt quality in leveraging CodeTutor’s effectiveness, indicating that detailed and clear prompts yield more accurate responses. Our findings point to the critical need for embedding Generative AI literacy into educational curricula and for promoting critical thinking abilities among students. Looking ahead, our research suggests that integrating LLM-powered tools into computer science education requires more tools, resources, and regulations to help students develop Generative AI literacy, as well as customized teaching strategies to bridge the gap between tool capabilities and educational goals. By adjusting expectations and guiding students on effective tool use, educators may harness the full potential of Generative AI to complement traditional teaching methods.

Acknowledgements.

This project is funded by the Studio for Teaching & Learning Innovation Learn, Discover, Innovate Grant, the Faculty Research Grant from William & Mary, and the Microsoft Accelerate Foundation Models Research Award. We thank our participants in this study and our anonymous reviewers for their feedback.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023). https://doi.org/10.48550/arXiv.2303.08774
  • Ahmed et al. (2022) Toufique Ahmed, Noah Rose Ledesma, and Premkumar Devanbu. 2022. SYNSHINE: improved fixing of syntax errors. IEEE Transactions on Software Engineering 49, 4 (2022), 2169–2181. https://doi.org/10.1109/TSE.2022.3212635
  • Anderson et al. (1985) John R. Anderson, C. Franklin Boyle, and Brian J. Reiser. 1985. Intelligent tutoring systems. Science 228, 4698 (1985), 456–462. https://doi.org/10.1126/science.228.4698.456
  • Bates et al. (2015) Douglas Bates, Martin Mächler, Ben Bolker, and Steve Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (2015), 1–48. https://doi.org/10.18637/jss.v067.i01
  • Brusilovsky et al. (1998) Peter Brusilovsky et al. 1998. Adaptive educational systems on the world-wide-web: A review of available technologies. In Proceedings of Workshop “WWW-Based Tutoring” at 4th International Conference on Intelligent Tutoring Systems (ITS’98), San Antonio, TX.
  • Brusilovsky et al. (1996) Peter Brusilovsky, Elmar Schwarz, and Gerhard Weber. 1996. ELM-ART: An intelligent tutoring system on World Wide Web. In Intelligent Tutoring Systems: Third International Conference, ITS’96, Montréal, Canada, June 12–14, 1996, Proceedings 3. Springer, 261–269. https://doi.org/10.1007/3-540-61327-7_123
  • Butz et al. (2006) Cory J. Butz, Shan Hua, and R. Brien Maguire. 2006. A web-based Bayesian intelligent tutoring system for computer programming. Web Intelligence and Agent Systems: An International Journal 4, 1 (2006), 77–97.
  • Clusmann et al. (2023) Jan Clusmann, Fiona R. Kolbinger, Hannah Sophie Muti, Zunamys I. Carrero, Jan-Niklas Eckardt, Narmin Ghaffari Laleh, Chiara Maria Lavinia Löffler, Sophie-Caroline Schwarzkopf, Michaela Unger, Gregory P. Veldhuizen, et al. 2023. The future landscape of large language models in medicine. Communications Medicine 3, 1 (2023), 141. https://doi.org/10.1038/s43856-023-00370-1
  • Corbett et al. (1997) Albert T. Corbett, Kenneth R. Koedinger, and John R. Anderson. 1997. Intelligent tutoring systems. In Handbook of Human-Computer Interaction. Elsevier, 849–874. https://doi.org/10.1016/B978-044481862-1.50103-5
  • Demszky and Liu (2023) Dorottya Demszky and Jing Liu. 2023. M-Powering Teachers: Natural Language Processing Powered Feedback Improves 1:1 Instruction and Student Outcomes. (2023). https://doi.org/10.1145/3573051.3593379
  • Denny et al. (2022) Paul Denny, Sami Sarsa, Arto Hellas, and Juho Leinonen. 2022. Robosourcing Educational Resources – Leveraging Large Language Models for Learnersourcing. arXiv preprint arXiv:2211.04715 (2022). https://doi.org/10.1145/3501385.3543957
  • Dobslaw and Bergh (2023) Felix Dobslaw and Peter Bergh. 2023. Experiences with Remote Examination Formats in Light of GPT-4. arXiv preprint arXiv:2305.02198 (2023). https://doi.org/10.48550/arXiv.2305.02198
  • ElSaadawi et al. (2008) Gilan M. ElSaadawi, Eugene Tseytlin, Elizabeth Legowski, Drazen Jukic, Melissa Castine, Jeffrey Fine, Robert Gormley, and Rebecca S. Crowley. 2008. A natural language intelligent tutoring system for training pathologists: Implementation and evaluation. Advances in Health Sciences Education 13 (2008), 709–722. https://doi.org/10.1007/s10459-007-9081-3
  • Elsom-Cook (1984) Mark Elsom-Cook. 1984. Design considerations of an intelligent tutoring system for programming languages. Ph.D. Dissertation. University of Warwick.
  • GitHub, Inc. (2024) GitHub, Inc. 2024. GitHub Copilot. https://github.com/features/copilot. Accessed: 2024-02-11.
  • Graesser et al. (2018) Arthur C. Graesser, Xiangen Hu, and Robert Sottilare. 2018. Intelligent tutoring systems. In International Handbook of the Learning Sciences. Routledge, 246–255.
  • Gustafson (2022) Morgan Gustafson. 2022. The Effect of Homework Completion on Students’ Academic Performance. Dissertations, Theses, and Projects. https://red.mnstate.edu/thesis/662662.
  • Hicke et al. (2023) Yann Hicke, Anmol Agarwal, Qianou Ma, and Paul Denny. 2023. ChaTA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs. arXiv preprint arXiv:2311.02775 (2023). https://doi.org/10.48550/arXiv.2311.02775
  • Hooshyar et al. (2015) Danial Hooshyar, Rodina Binti Ahmad, Moslem Yousefi, Farrah Dina Yusop, and S.-J. Horng. 2015. A flowchart-based intelligent tutoring system for improving problem-solving skills of novice programmers. Journal of Computer Assisted Learning 31, 4 (2015), 345–361. https://doi.org/10.1111/jcal.12099
  • Jalil et al. (2023) Sajed Jalil, Suzzana Rafi, Thomas D. LaToza, Kevin Moran, and Wing Lam. 2023. ChatGPT and software testing education: Promises & perils. In 2023 IEEE International Conference on Software Testing, Verification and Validation Workshops (ICSTW). IEEE, 4130–4137. https://doi.org/10.1109/ICSTW58534.2023.00078
  • Kasneci et al. (2023) Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. 2023. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103 (2023), 102274. https://doi.org/10.1016/j.lindif.2023.102274
  • Kulik and Fletcher (2016) James A. Kulik and J. D. Fletcher. 2016. Effectiveness of intelligent tutoring systems: a meta-analytic review. Review of Educational Research 86, 1 (2016), 42–78. https://doi.org/10.3102/0034654315581420
  • Kumar et al. (2023) Harsh Kumar, Ilya Musabirov, Mohi Reza, Jiakai Shi, Anastasia Kuzminykh, Joseph Jay Williams, and Michael Liut. 2023. Impact of Guidance and Interaction Strategies for LLM Use on Learner Performance and Perception. arXiv preprint arXiv:2310.13712 (2023). https://doi.org/10.48550/arXiv.2310.13712
  • Leinonen et al. (2023a) Juho Leinonen, Paul Denny, Stephen MacNeil, Sami Sarsa, Seth Bernstein, Joanne Kim, Andrew Tran, and Arto Hellas. 2023a. Comparing code explanations created by students and large language models. arXiv preprint arXiv:2304.03938 (2023). https://doi.org/10.48550/arXiv.2304.03938
  • Leinonen et al. (2023b) Juho Leinonen, Arto Hellas, Sami Sarsa, Brent Reeves, Paul Denny, James Prather, and Brett A. Becker. 2023b. Using large language models to enhance programming error messages. In Proceedings of the 54th ACM Technical Symposium on Computer Science Education V. 1. 563–569. https://doi.org/10.1145/3545945.3569770
  • Liffiton et al. ([n. d.]) Mark Liffiton, Brad E. Sheese, Jaromir Savelka, and Paul Denny. [n. d.]. CodeHelp: Using large language models with guardrails for scalable support in programming classes. ([n. d.]), 1–11. https://doi.org/10.1145/3631802.3631830
  • Mehta et al. (2023) Atharva Mehta, Nipun Gupta, Dhruv Kumar, Pankaj Jalote, et al. 2023. Can ChatGPT Play the Role of a Teaching Assistant in an Introductory Programming Course? arXiv preprint arXiv:2312.07343 (2023). https://doi.org/10.48550/arXiv.2312.07343
  • Meyer et al. (2023) Jesse G. Meyer, Ryan J. Urbanowicz, Patrick C. N. Martin, Karen O’Connor, Ruowang Li, Pei-Chen Peng, Tiffani J. Bright, Nicholas Tatonetti, Kyoung Jae Won, Graciela Gonzalez-Hernandez, et al. 2023. ChatGPT and large language models in academia: opportunities and challenges. BioData Mining 16, 1 (2023), 20. https://doi.org/10.1186/s13040-023-00339-9
  • Nwana (1990) Hyacinth S. Nwana. 1990. Intelligent tutoring systems: an overview. Artificial Intelligence Review 4, 4 (1990), 251–277. https://doi.org/10.1007/BF00168958
  • Ogle et al. (2023) Derek H. Ogle, Jason C. Doll, A. Powell Wheeler, and Alexis Dinno. 2023. FSA: Simple Fisheries Stock Assessment Methods. https://CRAN.R-project.org/package=FSA. R package version 0.9.4.
  • OpenAI (2024) OpenAI. 2024. Assistants Overview - OpenAI API. https://platform.openai.com/docs/assistants/overview. Accessed: 2024-02-11.
  • OpenAI (2024) OpenAI. 2024. ChatGPT. https://openai.com/chatgpt. Accessed: 2024-02-11.
  • OpenAI (2024a) OpenAI. 2024a. Code Interpreter. https://platform.openai.com/docs/assistants/tools/code-interpreter. Accessed: 2024-02-11.
  • OpenAI (2024b) OpenAI. 2024b. Knowledge Retrieval. https://platform.openai.com/docs/assistants/tools/knowledge-retrieval. Accessed: 2024-02-11.
  • Pankiewicz and Baker (2023) Maciej Pankiewicz and Ryan S. Baker. 2023. Large Language Models (GPT) for automating feedback on programming assignments. arXiv preprint arXiv:2307.00150 (2023). https://doi.org/10.48550/arXiv.2307.00150
  • Perkins et al. (2023) Mike Perkins, Jasper Roe, Darius Postma, James McGaughran, and Don Hickerson. 2023. Detection of GPT-4 generated text in higher education: Combining academic judgement and software to identify generative AI tool misuse. Journal of Academic Ethics (2023), 1–25. https://doi.org/10.1007/s10805-023-09492-6
  • Phung et al. (2023a) Tung Phung, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023a. Generating High-Precision Feedback for Programming Syntax Errors using Large Language Models. arXiv preprint arXiv:2302.04662 (2023). https://doi.org/10.48550/arXiv.2302.04662
  • Phung et al. (2023b) Tung Phung, Victor-Alexandru Pădurean, José Cambronero, Sumit Gulwani, Tobias Kohn, Rupak Majumdar, Adish Singla, and Gustavo Soares. 2023b. Generative AI for Programming Education: Benchmarking ChatGPT, GPT-4, and Human Tutors. International Journal of Management 21, 2 (2023), 100790. https://doi.org/10.48550/arXiv.2306.17156
  • Poldrack et al. (2023) Russell A. Poldrack, Thomas Lu, and Gašper Beguš. 2023. AI-assisted coding: Experiments with GPT-4. arXiv preprint arXiv:2304.13187 (2023). https://doi.org/10.48550/arXiv.2304.13187
  • Prather et al. (2023) James Prather, Paul Denny, Juho Leinonen, Brett A. Becker, Ibrahim Albluwi, Michelle Craig, Hieke Keuning, Natalie Kiesler, Tobias Kohn, Andrew Luxton-Reilly, et al. 2023. The robots are here: Navigating the generative AI revolution in computing education. arXiv preprint arXiv:2310.00658 (2023). https://doi.org/10.1145/3623762.3633499
  • R Core Team (2022) R Core Team. 2022. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
  • Ritter et al. (2007) Steven Ritter, John R. Anderson, Kenneth R. Koedinger, and Albert Corbett. 2007. Cognitive Tutor: Applied research in mathematics education. Psychonomic Bulletin & Review 14 (2007), 249–255. https://doi.org/10.3758/BF03194060
  • Sarsa et al. (2022) Sami Sarsa, Paul Denny, Arto Hellas, and Juho Leinonen. 2022. Automatic generation of programming exercises and code explanations using large language models. In Proceedings of the 2022 ACM Conference on International Computing Education Research - Volume 1. 27–43.
  • Savelka et al. (2023) Jaromir Savelka, Arav Agarwal, Christopher Bogart, and Majd Sakr. 2023. Large language models (GPT) struggle to answer multiple-choice questions about code. arXiv preprint arXiv:2303.08033 (2023). https://doi.org/10.48550/arXiv.2303.08033
  • Sheese et al. (2023) Brad Sheese, Mark Liffiton, Jaromir Savelka, and Paul Denny. 2023. Patterns of Student Help-Seeking When Using a Large Language Model-Powered Programming Assistant. arXiv preprint arXiv:2310.16984 (2023). https://doi.org/10.1145/3636243.3636249
  • Sleeman and Brown (1982) Derek Sleeman and John Seely Brown. 1982. Intelligent tutoring systems. London: Academic Press.
  • Sottilare et al. (2012) Robert A. Sottilare, Keith W. Brawner, Benjamin S. Goldberg, and Heather K. Holden. 2012. The generalized intelligent framework for tutoring (GIFT). Orlando, FL: US Army Research Laboratory – Human Research & Engineering Directorate (ARL-HRED) (2012).
  • Sun et al. (2024) Lichao Sun, Yue Huang, Haoran Wang, Siyuan Wu, Qihui Zhang, Chujie Gao, Yixin Huang, Wenhan Lyu, Yixuan Zhang, Xiner Li, et al. 2024. TrustLLM: Trustworthiness in large language models. arXiv preprint arXiv:2401.05561 (2024). https://doi.org/10.48550/arXiv.2401.05561
  • Thirunavukarasu et al. (2023) Arun James Thirunavukarasu, Darren Shu Jeng Ting, Kabilan Elangovan, Laura Gutierrez, Ting Fang Tan, and Daniel Shu Wei Ting. 2023. Large language models in medicine. Nature Medicine 29, 8 (2023), 1930–1940. https://doi.org/10.1038/s41591-023-02448-8
  • Thomas (2006) David R. Thomas. 2006. A general inductive approach for analyzing qualitative evaluation data. American Journal of Evaluation 27, 2 (2006), 237–246. https://doi.org/10.1177/1098214005283748
  • Trautwein and Köller (2003) Ulrich Trautwein and Olaf Köller. 2003. The relationship between homework and achievement—still much of a mystery. Educational Psychology Review 15 (2003), 115–145. https://doi.org/10.1023/A:1023460414243
  • VanLehn (2011) Kurt VanLehn. 2011. The relative effectiveness of human tutoring, intelligent tutoring systems, and other tutoring systems. Educational Psychologist 46, 4 (2011), 197–221. https://doi.org/10.1080/00461520.2011.611369
  • Wei et al. (2022) Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. 2022. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 (2022). https://doi.org/10.48550/arXiv.2206.07682
  • Wu et al. (2023b) Junchao Wu, Shu Yang, Runzhe Zhan, Yulin Yuan, Derek F. Wong, and Lidia S. Chao. 2023b. A survey on LLM-generated text detection: Necessity, methods, and future directions. arXiv preprint arXiv:2310.14724 (2023). https://doi.org/10.48550/arXiv.2310.14724
  • Wu et al. (2023a) Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. 2023a. BloombergGPT: A large language model for finance. arXiv preprint arXiv:2303.17564 (2023). https://doi.org/10.48550/arXiv.2303.17564
  • Zamfirescu-Pereira et al. (2023) J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny can’t prompt: how non-AI experts try (and fail) to design LLM prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–21. https://doi.org/10.1145/3544548.3581388
  • Zhang et al. (2024) Yixuan Zhang, Yimeng Wang, Nutchanon Yongsatianchot, Joseph D. Gaggiano, Nurul M. Suhaimi, Anne Okrah, Miso Kim, Jacqueline Griffin, and Andrea G. Parker. 2024. Profiling the Dynamics of Trust & Distrust in Social Media: A Survey Study. (2024). https://doi.org/10.1145/3613904.3642927
  • Zhou et al. (2023) Jiawei Zhou, Yixuan Zhang, Qianni Luo, Andrea G. Parker, and Munmun De Choudhury. 2023. Synthetic lies: Understanding AI-generated misinformation and evaluating algorithmic and human solutions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems. 1–20. https://doi.org/10.1145/3544548.3581318
  • Zhou et al. (2024) Kyrie Zhixuan Zhou, Zachary Kilhoffer, Madelyn Rose Sanfilippo, Ted Underwood, Ece Gumusel, Mengyi Wei, Abhinav Choudhry, and Jinjun Xiong. 2024. “The teachers are confused as well”: A Multiple-Stakeholder Ethics Discussion on Large Language Models in Computing Education. arXiv preprint arXiv:2401.12453 (2024). https://doi.org/10.48550/arXiv.2401.12453
  • Zhou et al. (2022) Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. 2022. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910 (2022). https://doi.org/10.48550/arXiv.2211.01910