Data Dive: AI’s Accuracy in Exam Item Review

AI’s accuracy in exam item review shows promise, potentially speeding up high-quality exam development.

Item review is a critical step in ensuring high-quality items. No matter how good the SME item author (or AI) is, exam items often have issues like an unclear stem, a correct answer that sticks out, or implausible distractors. Such quality issues can undermine the usefulness of the exam. The Certiverse platform already uses AI and machine learning to guide item writers through the development process, offering feedback when necessary to promote high-quality items. However, quality issues are not always immediately obvious, even to the trained eye. This is why a thorough (and typically labor-intensive) review process is necessary.

Wouldn’t it be great if AI could automatically review newly written exam items and provide fast, actionable feedback? Could AI find the same issues that human reviewers find, and guide item writers to make targeted revisions? The first step to using AI for item review and revision is to see if large language models (LLMs) are capable of picking up on the same item-quality issues as human review. Therefore, we tested how accurately two LLMs identified the same key issues as a human review within an exam containing 43 four-option multiple-choice items.
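The post does not spell out how each LLM's output was scored against the human review, so the following is only a minimal sketch of one plausible approach: for every issue type, count how many human-flagged cases the LLM also flagged on the same item. The function name `detection_rates`, the item IDs, and the toy flags are all hypothetical and are not data from the study.

```python
from collections import defaultdict


def detection_rates(human_flags, llm_flags):
    """Share of human-flagged issues that the LLM also flagged, per issue type.

    Both arguments map an item ID to the set of issue labels raised for it.
    Issues raised only by the LLM are ignored, so this measures how much of
    the human review the LLM recovers, not how many extra flags it raises.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for item_id, issues in human_flags.items():
        for issue in issues:
            total[issue] += 1
            if issue in llm_flags.get(item_id, set()):
                found[issue] += 1
    return {issue: found[issue] / total[issue] for issue in total}


# Invented toy flags for three items (illustration only).
human = {
    "item-01": {"Conspicuous Key", "Stem Clarity"},
    "item-02": {"Distractor Plausibility"},
    "item-03": {"Conspicuous Key"},
}
llm = {
    "item-01": {"Conspicuous Key", "Stem Clarity"},
    "item-02": {"Distractor Plausibility", "Double Key"},
    "item-03": set(),
}

rates = detection_rates(human, llm)
for issue, rate in sorted(rates.items()):
    print(f"{issue}: {rate:.0%}")
```

Scored this way, an LLM is not penalized for raising issues the human reviewers did not, so a high percentage says the model catches what humans catch but says nothing about false alarms.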

We found promising initial results for leveraging AI to review items. As shown in the table below, human review determined that the most common issues in our exam items were (1) unrealistic or obvious distractors (i.e., distractor plausibility) and (2) a correct answer that stood out (i.e., conspicuous key). GPT-4.1 identified 100% of the conspicuous keys and 96% of the implausible distractors, missing only one case across these two categories. GPT-4o matched it on implausible distractors (96%) but identified only 80% of the conspicuous keys. Both LLMs also identified all of the less common issues (e.g., stem clarity issues), with one exception: of the two double-key cases, GPT-4.1 identified one and GPT-4o identified neither.

| Quality Issue | Frequency of Issue from Human Review | GPT-4.1 Accuracy | GPT-4o Accuracy |
|---|---|---|---|
| Conspicuous Key | 20 | 100.00% | 80.00% |
| Distractor Clarity | 2 | 100.00% | 100.00% |
| Distractor Plausibility | 27 | 96.00% | 96.00% |
| Distractor Redundancy | 2 | 100.00% | 100.00% |
| Double Key | 2 | 50.00% | 0.00% |
| Key Clarity | 1 | 100.00% | 100.00% |
| Key Plausibility | 2 | 100.00% | 100.00% |
| Stem Clarity | 10 | 100.00% | 100.00% |
| Superficial Question | 3 | 100.00% | 100.00% |

In short, GPT-4.1 matched or outperformed GPT-4o in identifying the most common item-quality issues, which included implausible distractors, conspicuous keys, and unclear stems. Specifically, GPT-4.1 found at least 96% of the cases in each of these common categories.
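To make the percentages in the table above more concrete, the sketch below converts them back into implied counts. It assumes that each accuracy figure is the share of human-flagged cases the model also flagged, and that the reported percentages were rounded; both are assumptions, since the post does not state them. The helper `implied_hits` is hypothetical.

```python
# (human-flagged count, GPT-4.1 %, GPT-4o %), copied from the table above.
reported = {
    "Conspicuous Key": (20, 100.0, 80.0),
    "Distractor Clarity": (2, 100.0, 100.0),
    "Distractor Plausibility": (27, 96.0, 96.0),
    "Distractor Redundancy": (2, 100.0, 100.0),
    "Double Key": (2, 50.0, 0.0),
    "Key Clarity": (1, 100.0, 100.0),
    "Key Plausibility": (2, 100.0, 100.0),
    "Stem Clarity": (10, 100.0, 100.0),
    "Superficial Question": (3, 100.0, 100.0),
}


def implied_hits(count, pct):
    """Nearest whole number of cases consistent with a reported percentage."""
    return round(count * pct / 100)


totals = {"human": 0, "GPT-4.1": 0, "GPT-4o": 0}
for issue, (count, pct_41, pct_4o) in reported.items():
    hits_41 = implied_hits(count, pct_41)
    hits_4o = implied_hits(count, pct_4o)
    totals["human"] += count
    totals["GPT-4.1"] += hits_41
    totals["GPT-4o"] += hits_4o
    print(f"{issue:<24} GPT-4.1 {hits_41}/{count}   GPT-4o {hits_4o}/{count}")

print(totals)
```

Under these assumptions, the table implies that GPT-4.1 recovered 67 and GPT-4o recovered 62 of the 69 issues flagged by human review. These totals are derived from the rounded percentages, not reported in the post itself.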

With these promising initial results in hand, the next step is to evaluate the usefulness of the AI-generated feedback and suggested revisions. Additionally, because the current results reflect only a single exam, we need to confirm that the AI item review accurately identifies and addresses item-quality issues across various item types and content domains. Overall, these findings suggest that we can leverage state-of-the-art LLMs to reduce the time and effort needed to review exam items, allowing faster development of high-quality exams.