Study: To what degree are GPT-5 and Qwen3 overconfident or underconfident in their answers?

Analyzing Confidence Levels in Next-Generation Language Models: A Study of GPT-5 and Qwen3

As artificial intelligence continues to evolve rapidly, understanding the behavior of advanced language models becomes increasingly important, particularly how well the confidence they express matches the accuracy of their answers. Recent investigations have shed light on this mismatch and the pitfalls it can produce, especially the phenomenon often referred to as “confidently hallucinating.”

This article explores the contrasting behaviors of two prominent language models: Qwen3 and GPT-5. By examining their responses under specific prompting strategies, we aim to understand the extent to which these models are overconfident, underconfident, or potentially misleading in their outputs.

The Context: Confidence and Hallucinations in Language Models

Large language models (LLMs) are trained on vast datasets and are designed to generate human-like text. However, a recurring challenge is their tendency to produce confident-sounding answers that are factually incorrect—a phenomenon known as “hallucination.” Understanding how these models gauge and express their confidence is critical for deploying AI responsibly across applications such as customer service, education, and decision support systems.

The Comparative Study: Qwen3 Versus GPT-5

Qwen3: Built for Cautiousness

In the course of testing, researchers prompted Qwen3 with questions designed to gauge its confidence levels. Interestingly, when instructed to assess or project confidence, Qwen3 demonstrated a tendency toward underconfidence. In practice, it appeared restrained, often refusing to assert certainty even when the facts supported a confident answer. This behavior suggests that Qwen3 may have inherent or imposed restrictions that lead it to “know” an answer yet stop short of expressing full confidence, potentially a safeguard against hallucination that keeps it from making unwarranted claims.
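
The article does not publish the exact prompts used, so the sketch below is only a minimal illustration of how such a confidence-elicitation setup might look: a factual question is wrapped with an instruction to state a confidence score, and the reply is parsed for a verbalized percentage. The prompt wording and the “Confidence: NN%” format are assumptions for illustration, not the study’s actual protocol.

```python
import re

# Hypothetical elicitation instruction; the study's real wording is unknown.
CONFIDENCE_SUFFIX = (
    "\n\nAfter your answer, state how confident you are on a separate line "
    "in the form 'Confidence: NN%'."
)

def build_prompt(question: str) -> str:
    """Attach a confidence-elicitation instruction to a factual question."""
    return question + CONFIDENCE_SUFFIX

def parse_confidence(model_output: str) -> float | None:
    """Extract a verbalized confidence score in [0.0, 1.0] from a reply.

    Returns None when the model declines to state a confidence at all,
    which the study's description suggests happened often with Qwen3.
    """
    match = re.search(r"Confidence:\s*(\d{1,3})\s*%", model_output)
    if match is None:
        return None
    return min(int(match.group(1)), 100) / 100.0

# Example: a cautious reply in the style attributed to Qwen3.
reply = "Paris is the capital of France.\nConfidence: 60%"
print(parse_confidence(reply))  # 0.6, despite an easy, verifiable fact
```

A refusal to state any confidence, as described for Qwen3, would surface here as a None result rather than a low score, so both failure modes stay distinguishable.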

GPT-5: Bold and Unrestrained

In contrast, GPT-5 exhibited markedly different behavior. When tested with similar confidence prompts, GPT-5 readily produced answers with high confidence, even when those answers were factually incorrect. Its responses often lacked self-evaluation or uncertainty markers, reflecting a model that is less constrained and more willing to assert answers confidently regardless of their accuracy. This can be advantageous for rapid information generation, but it also raises concerns about misinformation and overconfidence.
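
One simple way to quantify the contrast described above is the gap between a model’s average stated confidence and its actual accuracy on the same questions. The numbers in the sketch below are invented purely to show the metric’s direction; they are not results from the study.

```python
# Measure over/underconfidence as (mean stated confidence) - (accuracy).
# Positive gap = overconfident (the GPT-5 pattern described above);
# negative gap = underconfident (the Qwen3 pattern).

def confidence_gap(results: list[tuple[float, bool]]) -> float:
    """results: (stated_confidence in [0, 1], answer_was_correct) pairs."""
    mean_confidence = sum(conf for conf, _ in results) / len(results)
    accuracy = sum(correct for _, correct in results) / len(results)
    return mean_confidence - accuracy

# Invented toy data, not measurements from the study.
gpt5_like  = [(0.95, True), (0.95, False), (0.90, False), (0.95, True)]
qwen3_like = [(0.55, True), (0.60, True), (0.50, True), (0.60, False)]

print(f"GPT-5-like gap:  {confidence_gap(gpt5_like):+.2f}")   # +0.44 -> overconfident
print(f"Qwen3-like gap: {confidence_gap(qwen3_like):+.2f}")   # -0.19 -> underconfident
```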

Implications and Future Directions

This preliminary analysis highlights significant differences in how emerging language models handle confidence. Qwen3’s cautious approach may help mitigate confident hallucinations, though possibly at the cost of withholding answers it could justifiably assert, while GPT-5’s assertiveness favors fast, fluent output at the risk of confidently delivered misinformation. How models calibrate and communicate uncertainty is likely to remain a central question as these systems are deployed more widely.
