Browser Automation Benchmark: LLM Performance Comparison on 100 Hard Web Tasks

MagMueller

4mo ago · 4 min read · en · Insight

85/100

Golden Brown

Front-window bakery material. Catches the eye, delivers the goods.

Score: 85 · Type: analysis · Sentiment: positive

Summary

The article presents BU Bench V1, a new open-source benchmark for evaluating LLMs on browser automation tasks. It comprises 100 hand-selected challenging tasks drawn from five established sources. Browser Use Cloud scores 78%, 16 points ahead of the best open-source model. The benchmark aims to provide standardized, repeatable evaluation so that different models and versions can be compared on web automation performance.
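
As a rough illustration of the arithmetic behind these figures, the sketch below computes a success rate over a task set and the gap in percentage points between two systems. The function names are hypothetical. Treating 78% as exactly 78 of 100 tasks passed, and deriving 62% for the best open-source model (78 minus 16), are assumptions made for the example, not values or code taken from the benchmark.

```python
def success_rate(passed: int, total: int) -> float:
    """Return the percentage of tasks passed (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * passed / total


def point_gap(rate_a: float, rate_b: float) -> float:
    """Return the difference between two rates in percentage points."""
    return rate_a - rate_b


# Assumption: 78% on a 100-task benchmark corresponds to 78 passed tasks.
cloud_rate = success_rate(78, 100)

# Assumption: "16 points ahead" implies the best open-source model is at 62%.
best_open_source_rate = cloud_rate - 16

gap = point_gap(cloud_rate, best_open_source_rate)
```

Note that a gap in percentage points (78 minus 62) is different from a relative improvement; the summary reports the former.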

Key quotes

To truly understand our agent performance, we built a suite of internal tools for evaluating our agent in a standardized and repeatable way so we can compare versions and models and continuously improve.

This is our first open source benchmark. BU Bench V1: 100 hand-selected tasks that are hard but possible, drawn from five established sources.

Browser Use Cloud scores 78%, 16 points ahead of the best open-source model.

We take evaluations seriously. As of now, we have over 600,000 tasks run in testing.

Source | Tasks | Description
Custom | 20 | Page interaction challenges (iframes, drag-and-drop, complex forms)

Snippet from the RSS feed

We benchmark every major LLM on 100 hard browser tasks. Browser Use Cloud scores 78%, 16 points ahead of the best open-source model.
