BraveGuard: A Self-Evolving Defense Framework for Safer Computer-Use AI Agents
By
[Submitted on 31 May 2026 (v1), last revised 2 Jun 2026 (this version, v2)]
Summary
This paper introduces BraveGuard, a self-evolving defense framework for training guard models to detect safety risks in computer-use agents—AI systems that interact with files, terminals, browsers, and tools over multi-step execution traces. Unlike static safety approaches, BraveGuard mines emerging threats from recent research, instantiates them as executable tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. The framework supports adaptive defense loops that evolve with new threats. Results show significant improvement in safety detection accuracy on the AgentHazard benchmark, rising from 38.79% to 82.38% under averaged guard-model settings, demonstrating that guard supervision grounded in open-world threat discovery outperforms fixed taxonomies and synthetic prompt-level data.
Source
Key quotes
· 5 pulledWe introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories.
BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting.
These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data.
BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.
This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign.
You might also wanna read
SIR-Bench: A Benchmark for Evaluating Autonomous Security Incident Response Agents
Researchers introduce SIR-Bench, a comprehensive benchmark for evaluating autonomous security incident response agents. The benchmark consis
New Benchmark Reveals High Rates of Outcome-Driven Constraint Violations in Autonomous AI Agents
Researchers introduce a new benchmark for evaluating autonomous AI agents' safety, specifically focusing on outcome-driven constraint violat
Survey of Self-Evolving AI Agents: Bridging Foundation Models and Lifelong Adaptability
The article surveys the emerging field of self-evolving AI agents, which aim to bridge the static capabilities of foundation models with the
AgentArmor: Open-Source 8-Layer Security Framework for Agentic AI Applications
AgentArmor is an open-source security framework designed specifically for agentic AI applications, providing 8-layer defense-in-depth securi
JavelinGuard: Low-Cost Transformer Architectures for LLM Security
KERNHELM: Plan-Bound Authorization Architecture for Governing Privileged Effects in Untrusted AI Agents
The article presents KERNHELM, a plan-bound authorization architecture designed to govern privileged effects in untrusted computational agen
Comments
Sign in to join the conversation.
No comments yet. Be the first.
