
I spent 5 months asking the same question to different AI models every day. Google Gemini is the most consistent of all.

Exploring the Consistency of AI Responses: A 5-Month Experiment

In an era where artificial intelligence (AI) is becoming increasingly prevalent in our daily lives, the question arises: how consistent are the answers we receive from different AI models? Intrigued by this query, I dedicated five months to an experiment that would shed light on the variations (or lack thereof) among AI responses.

Every day for 153 consecutive days, I posed the same question to five distinct AI models: ChatGPT, Claude, Google Gemini, Perplexity, and DeepSeek. The query was straightforward yet timeless: “Which movies are most recommended as ‘all-time classics’ by AI?”
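
For anyone who wants to replicate the setup, here is a minimal sketch of what the daily logging could look like. The ask_model helper is a hypothetical placeholder; in practice each model needs its own API client (or the answers can be collected by hand), and the CSV layout is just one illustrative option.

```python
import csv
from datetime import date

PROMPT = "Which movies are most recommended as 'all-time classics' by AI?"
MODELS = ["ChatGPT", "Claude", "Google Gemini", "Perplexity", "DeepSeek"]


def ask_model(model_name: str, prompt: str) -> str:
    # Hypothetical placeholder: wire up each vendor's API client here,
    # or paste in answers collected manually.
    raise NotImplementedError(f"no client configured for {model_name}")


def log_daily_answers(path: str = "answers.csv") -> None:
    # Append today's answer from every model to a running CSV log.
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for model in MODELS:
            writer.writerow([date.today().isoformat(), model, ask_model(model, PROMPT)])
```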

The results were enlightening.

Consistency in Responses

While all models exhibited some degree of variation, as is typical of large language models (LLMs), Google Gemini emerged as the champion of consistency. Throughout the experiment, it featured the same three films at the top of its recommendations: Citizen Kane, The Godfather, and Casablanca. Their order occasionally shifted, but all three remained firmly at the head of the list.

In contrast, other models displayed varying degrees of reliability. DeepSeek came in as a close second in terms of consistency, while Claude landed in the middle. ChatGPT, on the other hand, frequently rearranged the list, leading to a lack of predictability in its responses. Surprisingly, Perplexity, known for its citation features, delivered remarkably inconsistent results, sometimes misconstruing my inquiry to focus on AI-themed films rather than classic cinema.

Visualizing AI Performance

To better understand the trends, I tracked Google Gemini’s positioning over time. A chart (not displayed here) provided a compelling visual of the “Relative Position of First Mention,” allowing me to assess how specific films ranked in each AI’s response. The measure is the position of a film’s first mention divided by the overall length of the answer. Even for a well-documented topic like classic cinema, it was fascinating to observe that no two AI responses were identical.
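
In code, the metric is straightforward: the offset of a film’s first mention divided by the total length of the answer. The snippet below is a rough sketch rather than the exact script I used, and counting positions in characters (rather than words) is an implementation detail it assumes.

```python
def relative_first_mention(answer: str, title: str) -> float | None:
    # Offset of the first mention of `title`, normalized by answer length.
    # Lower values mean the film appears earlier; None means no mention.
    idx = answer.lower().find(title.lower())
    if idx == -1:
        return None
    return idx / len(answer)


# Example with a made-up answer:
answer = "The Godfather and Casablanca are often cited as all-time classics."
print(relative_first_mention(answer, "Casablanca"))  # ≈ 0.27
```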

The most surprising takeaway was Perplexity’s performance. Despite its ability to cite sources, the model occasionally misinterpreted the question, showcasing the unpredictability that can arise even in a well-defined context.

Join the Conversation

This study invites discussions about the consistency of AI responses. Have you conducted similar experiments? What topics did you explore, and what outcomes did you encounter? Share your experiences in the comments.
