Item review is a critical step in ensuring high-quality items. No matter how good the SME item author (or AI) is, exam items often have issues like an unclear stem, a correct answer that sticks out, or implausible distractors. Such quality issues can undermine the usefulness of the exam. The Certiverse platform already uses AI and machine learning to support item writers throughout the development process, providing guidance when necessary to promote high-quality items. However, quality issues are not always immediately obvious, even to the trained eye. This is why a thorough (and typically labor-intensive) review process is necessary.
Wouldn’t it be great if AI could automatically review newly written exam items and provide fast, actionable feedback? Could AI find the same issues that human reviewers find, and guide item writers to make targeted revisions? The first step to using AI for item review and revision is to see whether large language models (LLMs) can pick up on the same item-quality issues as human reviewers. Therefore, we tested how accurately two LLMs identified the same key issues as a human review on an exam containing 43 four-option multiple-choice items.
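To make the setup concrete, here is a minimal sketch of how an LLM could be asked to review a single multiple-choice item. This is an illustration only, not the Certiverse implementation: the sample item, the prompt wording, and the list of issue categories are assumptions, and the code simply uses the standard OpenAI Python SDK.

```python
from openai import OpenAI  # assumes the openai SDK is installed and OPENAI_API_KEY is set

client = OpenAI()

# Hypothetical item for illustration; the actual exam items are not shown here.
item = """
Stem: Which layer of the OSI model is responsible for routing packets between networks?
A. Physical
B. Data link
C. Network  (key)
D. Transport
"""

REVIEW_PROMPT = (
    "You are an exam item reviewer. Review the multiple-choice item below and flag any of "
    "these quality issues: conspicuous key, implausible distractors, redundant distractors, "
    "double key, unclear stem, unclear or implausible key, superficial question. "
    "For each issue you flag, briefly explain why and suggest a targeted revision.\n\n"
)

# The same call can be repeated with model="gpt-4o" to compare the two LLMs.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": REVIEW_PROMPT + item}],
)

print(response.choices[0].message.content)
```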
We found promising initial results for leveraging AI to review items. As shown in the table below, human review determined that the most common issues in our exam items were (1) a correct answer that stood out (i.e., conspicuous key) and (2) unrealistic or obvious distractors (i.e., distractor plausibility). Both LLMs identified nearly all of these issues, and GPT-4.1 missed only a single case: it identified 100% of the conspicuous keys and 96% of the implausible distractors, outperforming GPT-4o. The LLMs also identified nearly all the less common issues (e.g., stem clarity issues).
| Quality Issue | Frequency of Issue from Human Review | GPT-4.1 Accuracy | GPT-4o Accuracy |
|---|---|---|---|
| Conspicuous Key | 20 | 100.00% | 80.00% |
| Distractor Clarity | 2 | 100.00% | 100.00% |
| Distractor Plausibility | 27 | 96.00% | 96.00% |
| Distractor Redundancy | 2 | 100.00% | 100.00% |
| Double Key | 2 | 50.00% | 0.00% |
| Key Clarity | 1 | 100.00% | 100.00% |
| Key Plausibility | 2 | 100.00% | 100.00% |
| Stem Clarity | 10 | 100.00% | 100.00% |
| Superficial Question | 3 | 100.00% | 100.00% |
In short, GPT-4.1 outperformed GPT-4o in identifying the most common item-quality issues: implausible distractors, conspicuous keys, and unclear stems. GPT-4.1 found at least 96% of the issues in each of these categories.
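For readers curious how an agreement rate like those in the table might be computed, here is a minimal sketch. It assumes "accuracy" means the share of human-flagged issues in each category that the LLM also flagged; the item IDs and counts below are made up for illustration and are not the study data.

```python
# Hypothetical data: for each issue category, the item IDs flagged by human review
# and by the LLM. The real review data is not shown here.
human_flags = {
    "Conspicuous Key": {1, 4, 9},
    "Distractor Plausibility": {2, 4, 7, 11},
}
llm_flags = {
    "Conspicuous Key": {1, 4, 9},
    "Distractor Plausibility": {2, 7, 11},
}

def agreement_by_category(human, llm):
    """Share of human-flagged issues in each category that the LLM also flagged."""
    rates = {}
    for category, flagged in human.items():
        found = flagged & llm.get(category, set())
        rates[category] = len(found) / len(flagged)
    return rates

for category, rate in agreement_by_category(human_flags, llm_flags).items():
    print(f"{category}: {rate:.0%}")
```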
These initial results are a promising first step. The next steps will be to evaluate the usefulness of the AI-generated feedback and suggested revisions. Additionally, because the current results reflect only a single exam, we need to confirm that AI item review accurately identifies and addresses item-quality issues across various item types and content domains. Overall, these findings suggest that we can leverage state-of-the-art LLMs to reduce the time and effort needed to review exam items, allowing faster development of high-quality exams.