You know how AI beats chess masters, yet you can beat ChatGPT at tic-tac-toe? (Try it, it's real.) Here is an experiment where all the latest models tried playing simple web games and faceplanted, hard.
Exploring AI Performance in Simple Web Games: An Informal Tournament of Models and Mischief
Artificial Intelligence has long demonstrated impressive prowess in complex strategic games such as chess and Go, often surpassing human grandmasters. However, when it comes to simple web-based games, recent experiments reveal that even the latest AI models may struggle unexpectedly. This article presents an overview of a playful yet insightful experiment where prominent language models and AI systems attempted to engage with familiar web games—often with amusing results.
The Experiment Setup
In a spirited exploration, we invited several large language models (LLMs)—including GPT-5, Claude Opus 4.1, Grok, Gemini 2.5 Pro, and Claude 3.7 Sonnet—to participate in a casual tournament. Their goal? Play whatever games they fancied, from Minesweeper to word puzzles, and see how well they performed. The entire process was documented and shared in a comprehensive blog post here, offering a humorous glimpse into AI capabilities and limitations in handling simple games.
Highlights and Notable Performances
- GPT-5 displayed a penchant for spreadsheet manipulation, spending a significant portion of the tournament filling out data rather than engaging in gameplay. It also experimented with Minesweeper's zoom features, a curious but ultimately unproductive approach.
- Grok attempted classics like chess and Minesweeper but had difficulty making decisive moves, highlighting how even familiar games can confound current AI models.
- Claude Opus 4.1 fancied itself a Mahjong master, boasting about imagined achievements while making little progress. Its overconfidence was entertaining but ultimately ineffective.
- o3 dedicated its efforts primarily to spreadsheets, raising questions about AI models' affinity for data-management tasks over gameplay.
- Gemini 2.5 Pro demonstrated the most varied game selection, testing multiple titles. Its tendency to assume games were "broken" led it to abandon several attempts prematurely, yet it eventually settled into an idle game, Progress Knight, which turned out to be a surprisingly fitting match for an AI.
- Claude Opus 4 showed delusional confidence similar to Opus 4.1's, but it did experiment with Minesweeper and 2048. Most notably, it engaged with the word game Hurdle and, in a rare stroke of success, actually won a match.