I built a community crowdsourced LLM benchmark leaderboard (Claude Sonnet/Opus, Gemini, Grok, GPT-5, o3)

Creating a Community-Driven Benchmark for Large Language Models: Introducing CodeLens.AI

As large language models (LLMs) evolve rapidly, the need for meaningful, real-world assessments has become more pressing. Traditional benchmarks like HumanEval and SWE-Bench, while valuable, often fall short of capturing the nuances of developer workflows and practical coding challenges. To address this gap, I built a community-powered platform that evaluates LLMs on authentic coding tasks.

Introducing CodeLens.AI

CodeLens.AI is a tool that lets users compare the performance of six leading LLMs (GPT-5, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, Gemini 2.5 Pro, and o3) on real-world programming tasks. By facilitating direct, task-specific evaluations, the platform aims to give a clearer picture of how these models perform in scenarios that mirror everyday developer responsibilities.

How the Platform Works

The evaluation process is streamlined for simplicity and transparency:

  1. Task Submission: Users upload a snippet of code and provide a description of the task at hand—such as refactoring, security auditing, architectural suggestions, or code review.
  2. Parallel Execution: Once submitted, all six models process the task simultaneously, completing in approximately 2 to 5 minutes.
  3. Comparative Results: Results are displayed side-by-side, accompanied by AI-generated scores from an impartial judging system.
  4. Community Engagement: Users can vote on which model produced the most effective solution, fostering a community-driven assessment environment.
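
To make this flow concrete, here is a minimal TypeScript sketch of what a submit-and-poll client for such a workflow might look like. The endpoint paths, field names, and response shapes below are illustrative assumptions, not CodeLens.AI's actual API.

```typescript
// Hypothetical sketch of the submit-and-poll flow described above.
// Endpoint paths, field names, and response shapes are assumptions
// for illustration; they are not the real CodeLens.AI API.

interface EvaluationRequest {
  code: string;        // the code snippet to evaluate
  task: string;        // e.g. "refactor", "security audit", "code review"
  description: string; // free-text explanation of what the models should do
}

interface ModelResult {
  model: string;       // e.g. "gpt-5", "claude-opus-4.1", "o3"
  output: string;      // the model's proposed solution
  judgeScore: number;  // score assigned by the AI judging system
}

// Submit a task, then poll until all six models have finished (~2-5 minutes).
async function runEvaluation(req: EvaluationRequest): Promise<ModelResult[]> {
  const submit = await fetch("https://codelens.ai/api/evaluations", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(req),
  });
  const { id } = await submit.json();

  // Poll every 15 seconds until the evaluation is marked complete.
  while (true) {
    const res = await fetch(`https://codelens.ai/api/evaluations/${id}`);
    const status = await res.json();
    if (status.state === "complete") {
      return status.results as ModelResult[];
    }
    await new Promise((resolve) => setTimeout(resolve, 15_000));
  }
}
```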

Motivation Behind the Platform

The impetus for creating CodeLens.AI stemmed from a desire to see beyond conventional benchmarks. While datasets like HumanEval assess models on generic prompts, they often don’t reflect the complexities of day-to-day development tasks. As a developer working with legacy TypeScript code, React components, and security considerations, I wanted a tool that could evaluate models on the kinds of challenges I face regularly. This platform aims to fill that gap, empowering developers to identify the most suitable AI assistance for their specific needs.

Current Status and Future Outlook

  • The platform is now live at https://codelens.ai.
  • We have conducted around 20 evaluations so far—a modest beginning, but promising.
  • A free tier allows users to perform up to three evaluations per day.
