Pens and GPT Models Down Please: AI Undetectable in University Exam Submissions

New university study shows 94% of AI-generated exam answers went undetected, often outscoring actual student submissions.

In a not-at-all-surprising turn of events, artificial intelligence (AI)-generated submissions went almost entirely undetected in a recent test of the University of Reading’s examination system, with these AI-written answers often outperforming those of actual students. This revelation, detailed in a study published in PLOS ONE by Peter Scarfe and his team, underscores significant vulnerabilities in current educational assessments.

As AI tools like ChatGPT become increasingly sophisticated and accessible, there is growing concern over their potential misuse in academic settings. That anxiety has been exacerbated by the shift from traditional supervised exams to unsupervised, take-home formats, a trend accelerated by the COVID-19 pandemic. Despite the existence of tools intended to identify AI-generated content, their efficacy remains largely unproven.

To probe these issues, researchers at the University of Reading orchestrated a covert operation within the School of Psychology and Clinical Language Sciences. They crafted exam answers using the advanced AI chatbot GPT-4 and submitted these under the guise of 33 fictitious students. The exam graders, unaware of the deception, evaluated the submissions as they would any others.

The results were startling: 94% of AI-generated submissions slipped through undetected, and they not only matched but frequently surpassed the quality of answers from genuine students. In over 83% of comparisons, the AI submissions achieved higher grades than a randomly selected control group of real student responses.

Figure: Median grades attained by real (orange) and AI (blue) submissions across each individual module and all modules combined. Grade boundaries for the 2:2, 2:1, and 1st classifications are shown as dotted lines. Credit: Scarfe et al., 2024, PLOS ONE.

This finding raises the alarming prospect that students could exploit AI not just to sidestep learning but to obtain superior results, potentially undermining academic fairness and skewing outcomes. There is also a chance that some real students successfully passed off AI-generated work as their own during the study itself, further complicating the detection challenge.

From an integrity perspective, these results are deeply troubling. The researchers suggest that reinstating supervised, in-person examinations could mitigate some of the risks posed by AI. However, as AI continues to permeate various aspects of professional and academic life, there is also an argument to be made for adapting to this “new normal.”

“A rigorous blind test of a real-life university examinations system shows that exam submissions generated by artificial intelligence were virtually undetectable and robustly gained higher grades than real students,” the study authors explain. They advocate for a recalibrated approach that recognizes the dual threats and opportunities presented by AI. The University of Reading, in response, is crafting new policies and offering guidance to staff and students on navigating the complex landscape shaped by these intelligent tools. The broader academic world, the study suggests, may have to follow suit.

For more information, you can read the full paper “A real-world test of artificial intelligence infiltration of a university examinations system: A ‘Turing Test’ case study” in PLOS ONE.

Staff Writer

Our in-house science writing team has prepared this content specifically for Lab Horizons
