Butter-Bench Evaluation: Testing LLM-Controlled Robots for Practical Household Tasks
By lukaspetersson
Summary
Researchers at Andon Labs created Butter-Bench, an evaluation that tests whether current large language models (LLMs) can act as orchestrators for robots carrying out practical tasks. They gave state-of-the-art LLMs control of a robot in an office environment and tested them on delivery tasks such as 'pass the butter.' The best model completed only 40% of tasks, compared with 95% for humans. The experiment showed both the potential and the current limitations of LLMs as robotic orchestrators: it made for an entertaining demonstration, but the technology remains far from practical deployment.
Key quotes
Butter-Bench tests whether current LLMs are good enough to act as orchestrators in fully functional robotic systems.
State of the art models struggle, with the best model scoring 40% at Butter-Bench, compared to 95% for humans.
While it was a very fun experience, we can't say it saved us much time.
Observing them roam around trying to find a purpose in this world taught us a lot about what the future might be, how far away this future is, and what can go wrong.
The core objective is simple: be helpful when someone asks the robot to 'pass the butter' - or more generally, do delivery tasks in a household setting.
