Assignment 2

Evaluation framework for using GenAI to grade IB Chemistry IAs and provide feedback for students and moderators.

Introduction

The International Baccalaureate (IB) encourages students and teachers to think critically about using AI ethically and effectively in teaching and learning. The IB has made clear its position of embracing AI adoption, on the grounds that AI will inevitably become mainstream (IB, 2025). However, the IB’s AI policies are student-centric, with guidelines aimed at ensuring the academic integrity of student assessments, as the following statements from the IB Chemistry Guide (2025) demonstrate:

Students will need to be taught to understand the bias inherent in the content that an AI tool produces and to critically review it.

Students should also be taught that AI-generated work included in a piece of assessment must be credited in the body of the text and appropriately referenced in the bibliography.

Remember, artificial intelligence should be used as part of research, not used to produce the work for the student.

…students must learn to be critical readers of information produced by artificial intelligence.

The IB’s AI guidelines for educators, by contrast, are vague and implicitly assume that adults already know how to use AI ethically and effectively, when adults may well have had less time with emergent technologies than their students have:

Many teachers find asking students to mark examples of work a great learning experience and using AI addresses many of the ethical and practical problems in this classroom activity. The teacher would need to explain why it is ethical for them to use the AI tools in this way but not for the student to use it to write their work, but this is part of a much more important point. Teachers should role model best practice when using AI, as students learn by observing the respected adults in their lives far more than by listening to them (Glanville, 2023).

While the activity suggested by Glanville has the potential to be meaningful, the IB provides no clear guidance or training, and such activities require valuable lesson time in an already rigorous curriculum. Simultaneously, teachers need more time to verify the academic integrity of student work, since the proliferation of AI has made ghost-writing more accessible than ever; the hefty fees once charged by human ghost-writers no longer pose much of an obstacle. The result is a triple whammy for teachers, who are also expected to deliver higher-quality lessons by “leveraging these AI tools”.

And not only do we think that these tools are beneficial for students, they also hold the potential to lighten the workload of our dedicated educators and schools. By leveraging these AI tools, educators can free up valuable time to spend more quality moments with their students (IB, 2025).

Considerations When Using Generative AI (GenAI)

Is it necessary to use AI?

  • What are the benefits and do they outweigh the costs?
  • AI is not a solution for every task
  • The inevitability should not be presupposed; careful consideration should be used every time AI is proposed as a solution

What are benefits of using AI?

  • Saving time by potentially deprioritizing knowledge and understanding for speed and efficiency (Coleman, 2021)
  • Potential to remove teacher bias towards students they know or like by reducing (one type of) subjectivity
  • Polished grammar over in-depth analysis and nuance

What are the costs of using AI?

  • e.g. energy consumption, environmental pollution, labour extraction
  • Research has shown that reliance on AI can lead to cognitive decline in the long term (Oakley et al., 2025)
  • Are there stereotypes and biases in the AI algorithm and are you perpetuating them by using it?
  • Are you contributing to the systemic oppression of those who are powerless to fight against the corporations who control AI?
  • What is the information you are providing AI being used for?

Are your goals achievable using capabilities of current GenAI?

  • AI models are trained on historical data and might not be able to predict different or new outcomes.
  • Some AI tools cannot extract text from images or graphs.
  • e.g. language models (LMs) predict the next probable word(s) rather than generating new sentences; they “lack understanding despite anthropomorphic portrayals” (Suchman, 2023)
  • What type of data was used to train the AI? What biases and errors were present?

Do the AI’s goals align with yours?

  • Who programmed the AI and what are their intentions?
  • Are there political, economic, or moral conflicts of interest between your goals and the programmers of the AI?
  • What will the AI do with those inputs that conflict with its own objectives? How will the output be modified or censored in such cases?
  • How does the AI algorithm work?

Using GenAI for Writing Chemistry Internal Assessment (IA) Commentary

This evaluation framework aims to provide teachers with guidance on using GenAI to write commentary for Chemistry Internal Assessments (IAs) that are selected for moderation by IB.

Before using GenAI, it is a teacher’s responsibility to ensure that assessments are authentic and meet IB’s academic integrity policies. To do so, a teacher must thoroughly read student submissions to verify that the work submitted was produced by the student themselves without plagiarism.

Many popular GenAI tools are Large Language Models (LLMs) that predict words or phrases to produce coherent sentences but do not truly create new ideas or provide analysis. Therefore, commentary created by GenAI available as of June 2025, given nothing beyond the prompt itself, will be generic and lack detail specific to student work. The commentary typically generated reads as if it were written by a teacher with less than five years of IB Chemistry teaching experience. Experienced teachers, and teachers trained in IA marking, are advised not to use GenAI to write commentary from scratch. Preferably, a teacher will have read through an IA and graded it while writing down brief notes that can be entered into an AI prompt for the commentary to expand upon.

Due to the predictive nature of LLMs, they are best used in contexts and formats for which plenty of training data exist. Their predictive ability will likely be lower for academic writing than for more common formats that can be scraped from the internet, and GenAI will be even less proficient at generating commentary for new research, since no previous training data exist on the topic. It is even possible that AI could mark topics unknown to it more harshly due to poor understanding. LLMs use historical data to predict probable words rather than generating nuanced feedback and analysis.

The quality of the commentary generated by AI will heavily depend on the topic of the IA, how frequently the investigation has been done in the past, the quality (if any) of comments provided by the teacher, and the specificity of the prompt input into the AI.

The strongest use case for GenAI in writing IA comments is to save some time on formatting and grammar by asking the AI to generate coherent sentences that incorporate brief comments already written by a teacher. The quality of the teacher’s comments requires expertise in Chemistry, so this use is suitable for subject specialists with previous IA experience. In this case, AI can elaborate on a teacher’s justifications for each criterion grade and make them more convincing to an IB moderator, reducing the likelihood that the IA gets sent back to the teacher for re-marking.

A teacher can also use AI as a tutor for writing good commentary on a scientific research paper. This not only ensures that a teacher can improve the quality of his/her commentary but also invests in improving writing skills for the long run, while also demonstrating an effective and ethical use of AI that can be modeled for students.

A less ideal use case would involve less teacher input, or lower-quality input, as with a specialist chemistry teacher new to the IB or to IA grading, or a non-chemistry science teacher who is tasked with a chemistry class or with grading a set of IAs, as sometimes happens in smaller or newer schools.

It is not recommended to use GenAI to write commentary without additional commentary input from the teacher. Even though the generated commentary can likely provide passable, relevant points to justify most mid-range scores, scores at the extreme ends require detailed explanations for moderators, which AI does not provide. It is unwise to risk IAs getting sent back for re-marking just to save a bit of time, as that would likely tarnish the reputation of a school and put future cohorts under closer scrutiny. This use case should be reserved as a last resort in extenuating circumstances.

Suggested Procedure for Generating Commentary

  • Use the IB Chemistry IA Marking Rubric to grade an IA submission by highlighting the most appropriate scores for each strand under each criterion.
  • Write one or two brief but specific comments justifying each score.
  • Upload completed rubric with comments and original IA submission to a GenAI platform.
  • Input the following prompt:

“Act as an IB Chemistry IA moderator. Using the official rubric and the student’s submission (provided below):

  1. For each criterion (Research Design, Data Analysis, Conclusion, Evaluation):
    • Use the score provided for each strand on a scale of 0-6, with 1-3 sentences of justification per strand. Justifications must:
      • Quote specific examples from the submission
      • Reference page numbers
      • Align with IB markband descriptors
    • Calculate the criterion mark by averaging the strand scores (rounded to nearest integer).
  2. Format the output as follows:
    • Criterion Name (Total Mark/6)
      • Strand 1: [Score/6] + [1-3 sentences with evidence]
      • Strand 2: [Score/6] + [1-3 sentences with evidence]
      • Strand 3 (if applicable): [Score/6] + [1-3 sentences with evidence]
      • Criterion Mark Calculation: (Sum of strands ÷ number of strands) → [Final Mark]
  3. Additional Requirements:
    • Highlight why scores don’t reach higher bands (e.g., missing “explained” methodology for Research Design 5-6).
    • Include a summary table of all strand scores and final marks.
    • Total the marks across criteria to confirm the final IA score (/24).
  4. Student Submission Excerpts:
    [Insert key excerpts from the IA, especially:

    • Research question
    • Methodology details
    • Data tables/calculations
    • Conclusion statements
    • Evaluation weaknesses/improvements]
  5. Focus Areas:
    • Precision (e.g., units, uncertainty handling)
    • Theoretical depth (e.g., Van der Waals vs. ideal gas)
    • Critical analysis (e.g., literature comparison)
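The mark arithmetic the prompt asks the AI to perform (average each criterion’s strand scores, round to the nearest integer, then total out of 24) is simple enough to verify with a short script. A minimal Python sketch follows; the strand scores are made up for illustration, and it uses half-up rounding since Python’s built-in round() rounds halves to even, which is probably not what a marker intends:

```python
from math import floor

def criterion_mark(strand_scores):
    """Average a criterion's strand scores and round to the
    nearest integer, half-up (so 3.5 -> 4)."""
    mean = sum(strand_scores) / len(strand_scores)
    return floor(mean + 0.5)

def total_ia_score(criteria):
    """Sum the criterion marks to give the final IA score (/24)."""
    return sum(criterion_mark(scores) for scores in criteria.values())

# Hypothetical strand scores; the actual number of strands per
# criterion depends on the official rubric.
criteria = {
    "Research Design": [5, 4, 5],  # mean 4.67 -> 5
    "Data Analysis":   [4, 4, 3],  # mean 3.67 -> 4
    "Conclusion":      [3, 4],     # mean 3.5  -> 4
    "Evaluation":      [4, 5, 4],  # mean 4.33 -> 4
}

print(total_ia_score(criteria))  # prints 17
```

Re-running this check on the AI’s output makes it easy to spot when the tool has silently altered a score.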

Reflection

The original use case I had in mind was for GenAI not only to provide commentary for IAs but also to grade them. However, even after providing moderator commentary from marked IAs as calibration, the scores from both DeepSeek and GPT-4.1 were wildly inaccurate (off by 15 marks out of 24 at times). Even after the AI eloquently explained the reasoning behind the discrepancy, there was still no improvement, which clearly demonstrated that these LLMs cannot reflect on and improve from their mistakes.

In my second attempt, I chose a more attainable goal: writing commentary from a graded rubric. I chose this task because if a teacher has already read and graded an IA (which can be done in about ten minutes to an accuracy of +/- 2 marks out of 24), then the commentary only serves to show the moderators that the teacher has done their job and the grade is not fabricated. Students never see these comments, so they are not for providing feedback. I think this is a suitable job for AI since it is only an exercise in saving IB moderators time in their marking.

It took many modifications of my original prompt to get the exact output I was looking for. Oftentimes, the AI tool would completely ignore or misunderstand certain instructions as was the case when I asked for justification for each strand of the criteria. The confusion likely came from the fact that “strand” was not used anywhere on the rubric. There were also cases where the tools inexplicably changed the scores of the graded IA. When I finally got a result I wanted, I made sure to ask the AI tools for the exact input I should enter next time if I wanted to reproduce those results.

I believe the commentary generated by AI from just the raw scores, without additional teacher commentary, is convincing enough for moderators to accept the grades as they are. I would consider this an ethical use of AI since, by the time of writing commentary, a teacher has already provided students with genuine feedback on their drafts and then graded their submissions. If the grades are not accepted by the IB, no amount of commentary will help to avoid a re-grade; the commentary goes into a black hole, never to return. The use of AI in this case aligns with the IB’s statement on leveraging AI tools to save time and provide quality education to students.

References

Coleman, B. (2021). Technology of The Surround. Catalyst: Feminism, Theory, Technoscience, 7(2), 1–21.

Glanville, M. (2023). Artificial intelligence in IB assessment and education: a crisis or an opportunity? IB Community Blog. Available at: https://blogs.ibo.org/2023/02/27/artificial-intelligence-ai-in-ib-assessment-and-education-a-crisis-or-an-opportunity/.

Oakley, B., Johnston, M., Chen, K.-Z., Jung, E., & Sejnowski, T. (2025). “The Memory Paradox: Why Our Brains Need Knowledge in an Age of AI.” In The Future of Artificial Intelligence: Economics, Society, Risks and Global Policy (Springer Nature, forthcoming).

Suchman, L. (2023). The uncontroversial ‘thingness’ of AI. Big Data & Society, 10(2). https://doi.org/10.1177/20539517231206794

The International Baccalaureate (2025). IB Chemistry Guide. https://anatolia.edu.gr/images/highschool/IBDP/Chemistry%20Guide%202025.pdf