ProgramBench: New Benchmark Reveals Language Models Struggle to Build Complete Software Projects From Scratch

[Submitted on 5 May 2026]

Summary

This paper introduces ProgramBench, a benchmark that evaluates whether language-model-based software engineering agents can build complete software projects from scratch. Existing benchmarks focus on narrow tasks such as fixing a single bug or developing a single feature. ProgramBench instead requires agents to architect and implement an entire codebase whose behavior matches a reference executable, given only the program and its documentation. The benchmark includes 200 tasks, ranging from simple CLI tools to complex software such as FFmpeg, SQLite, and the PHP interpreter. An evaluation of 9 language models found that none fully resolved any task; the best model passed 95% of tests on only 3% of tasks. The study also found that models tend to produce monolithic, single-file implementations that differ significantly from human-written code.
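
To make the two headline numbers concrete, here is a minimal sketch of how per-task test results could be aggregated into a "fully resolved" rate and a "95% of tests" rate. It is not the paper's scoring code: the `TaskResult` record, the `fraction_at_or_above` helper, the reading of "95%" as "at least 95%", and all numbers are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical record: behavioral tests passed by one submission on one task."""
    task_id: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        # Guard against a task with no tests.
        return self.passed / self.total if self.total else 0.0


def fraction_at_or_above(results: list[TaskResult], threshold: float) -> float:
    """Share of tasks whose pass rate is at least `threshold` (assumed interpretation)."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.pass_rate >= threshold)
    return hits / len(results)


# Made-up numbers for illustration only, not results from the paper.
results = [
    TaskResult("cli-tool", passed=96, total=100),
    TaskResult("interpreter", passed=40, total=100),
    TaskResult("database", passed=10, total=100),
    TaskResult("codec", passed=99, total=100),
]

fully_resolved = fraction_at_or_above(results, 1.0)   # every test passes
near_resolved = fraction_at_or_above(results, 0.95)   # at least 95% of tests pass
print(f"fully resolved: {fully_resolved:.0%}, >=95% of tests: {near_resolved:.0%}")
```

Under this reading, a model can pass most tests on a handful of tasks and still resolve none of them, which is the pattern the summary reports.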

Key quotes

Turning ideas into full software projects from scratch has become a popular use case for language models.

Existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature.

We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks.

Models favor monolithic, single-file implementations that diverge sharply from human-written code.

End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure.
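
The last quote describes tests that check behavior rather than code structure. A minimal sketch of that idea follows, assuming a test simply feeds the same input to a reference program and a candidate and compares exit code and standard output. This is not the ProgramBench harness, and it does not show the agent-driven fuzzing that generates the tests: the helpers `run` and `behaves_the_same`, the hand-written inputs, and the tiny stand-in programs are all hypothetical.

```python
import subprocess
import sys


def run(cmd: list[str], stdin_text: str) -> tuple[int, str]:
    """Run a command with the given stdin and return (exit code, stdout)."""
    proc = subprocess.run(
        cmd, input=stdin_text, capture_output=True, text=True, timeout=30
    )
    return proc.returncode, proc.stdout


def behaves_the_same(reference_cmd: list[str], candidate_cmd: list[str], stdin_text: str) -> bool:
    """Compare only observable behavior; how each program is structured is irrelevant."""
    return run(reference_cmd, stdin_text) == run(candidate_cmd, stdin_text)


# Stand-in programs (assumptions): each upper-cases stdin, but they are written differently.
reference = [sys.executable, "-c", "import sys; print(sys.stdin.read().upper(), end='')"]
candidate = [sys.executable, "-c", "import sys\nfor line in sys.stdin: print(line.upper(), end='')"]
# A candidate that fails to upper-case its input.
broken = [sys.executable, "-c", "import sys; print(sys.stdin.read(), end='')"]

# Hand-written inputs standing in for generated test cases.
inputs = ["hello\n", "Mixed Case\nsecond line\n", ""]

candidate_ok = all(behaves_the_same(reference, candidate, text) for text in inputs)
broken_ok = all(behaves_the_same(reference, broken, text) for text in inputs)
print(f"candidate matches: {candidate_ok}, broken matches: {broken_ok}")
```

Because only inputs and outputs are compared, a single-file implementation and a carefully modularized one would score identically, which is why structural differences have to be analyzed separately.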

Snippet from the RSS feed

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to m…