Sonnet 5 review: I ran 64 generations to find out if it’s worth it

Source: Lenny’s Newsletter by Claire Vo.

TL;DR Summary of How I AI Bench: Benchmarking Frontier AI Models Live with Claude Code

The How I AI Bench is a custom, repeatable evaluation tool built live using Claude Code to benchmark five leading AI models, including Sonnet 5. It combines human vibe scoring (70%) with LLM judge scoring (30%) for more reliable results. The benchmark covers tasks like PRD quality, prototype generation, and agent personality, revealing surprising insights about model performance. The framework is designed for ongoing, comparable AI testing rather than one-off checks.

Optimixed’s Overview: A Novel Framework for Live, Repeatable AI Model Evaluation

Introduction to the How I AI Bench

The How I AI Bench was developed to meet the need for consistent, comparable AI model testing. Built live with Claude Code, the evaluation harness tests multiple frontier AI models under controlled, repeatable conditions, moving beyond the one-off vibe checks common in AI benchmarking.

Methodology and Scoring System

Models Tested: Sonnet 5, Sonnet 4.6, Opus 4.8, GPT-5.5, and Gemini 3 Pro.
Evaluation Tasks: Product Requirements Document (PRD) quality, prototype generation, agentic task completion, and voice/personality assessments.
Scoring Approach: A hybrid system that weights human subjective scores at 70% and large language model (LLM) judge scores at 30%, balancing human intuition against a more consistent automated rating.
Implementation: A local HTML scoring page enables real-time gut-feel ratings with JSON export for data tracking and analysis.
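The 70/30 blend described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual harness: the record fields (model, task, human, judge), the 1-10 rating scale, the function names, and the sample ratings are all assumptions made for the example. Only the 70% and 30% weights come from the summary.

```python
import json
from collections import defaultdict

# Weights taken from the summary: 70% human rating, 30% LLM judge rating.
HUMAN_WEIGHT = 0.7
JUDGE_WEIGHT = 0.3


def hybrid_score(human: float, judge: float) -> float:
    """Blend two ratings that share a scale (assumed here to be 1-10)."""
    return HUMAN_WEIGHT * human + JUDGE_WEIGHT * judge


def summarize(records: list[dict]) -> dict[str, dict[str, float]]:
    """Average the hybrid score for each model and task."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["task"])].append(hybrid_score(r["human"], r["judge"]))
    summary: dict[str, dict[str, float]] = {}
    for (model, task), scores in buckets.items():
        summary.setdefault(model, {})[task] = round(sum(scores) / len(scores), 2)
    return summary


# Made-up ratings for illustration only; these are not benchmark results.
records = [
    {"model": "model-a", "task": "prd", "human": 8, "judge": 6},
    {"model": "model-a", "task": "prd", "human": 6, "judge": 8},
    {"model": "model-b", "task": "prd", "human": 5, "judge": 9},
]

summary = summarize(records)
# Serialize the per-model, per-task averages, similar in spirit to a JSON export.
print(json.dumps(summary, indent=2))
```

Keeping the weights as named constants makes it straightforward to re-run the same exported ratings under a different split if the weighting is later refined.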

Key Findings and Recommendations

The benchmark results challenged initial expectations, highlighting nuanced strengths across models:

PRDs: Certain models excelled in delivering high-quality, structured product documents.
Prototypes: Different AI engines showed varied capabilities in generating complex prototype concepts.
Agent Interaction: Personality and conversational ability varied, informing daily agent chat recommendations.

Future Enhancements and Use Cases

The creator plans continuous improvements such as refining scoring weights and expanding model coverage. This framework is useful for developers, product managers, and AI researchers seeking a reliable, reproducible way to assess AI capabilities over time.
