I built a community crowdsourced LLM benchmark leaderboard (Claude Sonnet/Opus, Gemini, Grok, GPT-5, o3)

Creating a Community-Driven Benchmark for Large Language Models: Introducing CodeLens.AI

As large language models (LLMs) evolve rapidly, the need for meaningful, real-world assessments has become more pressing. Traditional benchmarks like HumanEval and SWE-Bench, while valuable, often fall short of capturing the nuances of developer workflows and practical coding challenges. To address this gap, I built a community-powered platform that evaluates LLMs on authentic coding tasks.

Introducing CodeLens.AI

CodeLens.AI is a tool that lets users compare the performance of six leading LLMs (GPT-5, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro, and o3) on real-world programming tasks. By facilitating direct, task-specific evaluations, the platform aims to give a clearer picture of how these models perform in scenarios that mirror everyday developer responsibilities.

How the Platform Works

The evaluation process is streamlined for simplicity and transparency:

  1. Task Submission: Users upload a snippet of code and provide a description of the task at hand—such as refactoring, security auditing, architectural suggestions, or code review.
  2. Parallel Execution: Once submitted, all six models process the task simultaneously, completing in approximately 2 to 5 minutes.
  3. Comparative Results: Results are displayed side-by-side, accompanied by AI-generated scores from an impartial judging system.
  4. Community Engagement: Users can vote on which model produced the most effective solution, fostering a community-driven assessment environment.
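
To make this flow concrete, here is a minimal TypeScript sketch of what a submit-and-poll client for such a workflow might look like. The endpoint paths, field names, and response shapes below are illustrative assumptions, not CodeLens.AI's actual API.

```typescript
// Hypothetical sketch of the submit-and-poll flow described above.
// Endpoint paths, field names, and response shapes are assumptions
// for illustration; they are not the real CodeLens.AI API.

interface EvaluationRequest {
  code: string;        // the code snippet to evaluate
  task: string;        // e.g. "refactor", "security audit", "code review"
  description: string; // free-text explanation of what the models should do
}

interface ModelResult {
  model: string;       // e.g. "gpt-5", "claude-opus-4.1", "o3"
  output: string;      // the model's proposed solution
  judgeScore: number;  // score assigned by the AI judging system
}

// Submit a task, then poll until all six models have finished (~2-5 minutes).
async function runEvaluation(req: EvaluationRequest): Promise<ModelResult[]> {
  const submit = await fetch("https://codelens.ai/api/evaluations", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  const { id } = await submit.json();

  // Poll every 15 seconds until the evaluation is marked complete.
  while (true) {
    const res = await fetch(`https://codelens.ai/api/evaluations/${id}`);
    const status = await res.json();
    if (status.state === "complete") {
      return status.results as ModelResult[];
    }
    await new Promise((resolve) => setTimeout(resolve, 15_000));
  }
}
```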

Motivation Behind the Platform

The impetus for creating CodeLens.AI stemmed from a desire to see beyond conventional benchmarks. While datasets like HumanEval assess models on generic prompts, they often don’t reflect the complexities of day-to-day development tasks. As a developer working with legacy TypeScript code, React components, and security considerations, I wanted a tool that could evaluate models on the kinds of challenges I face regularly. This platform aims to fill that gap, empowering developers to identify the most suitable AI assistance for their specific needs.

Current Status and Future Outlook

  • The platform is now live at https://codelens.ai.
  • We have conducted around 20 evaluations so far—a modest beginning, but promising.
  • A free tier allows users to perform up to three evaluations per day.
