BraveGuard: A Self-Evolving Defense Framework for Safer Computer-Use AI Agents

[Submitted on 31 May 2026 (v1), last revised 2 Jun 2026 (this version, v2)]

19d ago· 2 min readenInsight

Summary

This paper introduces BraveGuard, a self-evolving defense framework for training guard models to detect safety risks in computer-use agents—AI systems that interact with files, terminals, browsers, and tools over multi-step execution traces. Unlike static safety approaches, BraveGuard mines emerging threats from recent research, instantiates them as executable tasks, collects agent rollouts, and derives trajectory-level supervision for guard model training. The framework supports adaptive defense loops that evolve with new threats. Results show significant improvement in safety detection accuracy on the AgentHazard benchmark, rising from 38.79% to 82.38% under averaged guard-model settings, demonstrating that guard supervision grounded in open-world threat discovery outperforms fixed taxonomies and synthetic prompt-level data.

Source

bskyBraveGuard: A Self-Evolving Defense Framework for Safer Computer-Use AI Agentsarxiv.org

Key quotes

· 5 pulled

We introduce BraveGuard, a self-evolving defense framework for training guard models from open-world threat signals and realistic agent trajectories.

BraveGuard consistently improves safety detection across computer-use trajectories. On AgentHazard, it substantially improves detection accuracy over off-the-shelf guard models, with accuracy increasing from 38.79% to 82.38% under the averaged guard-model setting.

These results show that guard supervision grounded in open-world threat discovery and realistic agent execution can improve safety monitoring beyond fixed taxonomies and synthetic prompt-level data.

BraveGuard offers a scalable path toward adaptive defenses for computer-use agents facing evolving real-world risks.

This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because harm often emerges only through multi-step execution traces whose individual actions appear locally benign.

Snippet from the RSS feed

Computer-use agents extend language models from text generation to sustained interaction with files, terminals, browsers, and external tools. This shift creates safety risks that are difficult to detect from isolated prompts or final responses, because ha

You might also wanna read

SIR-Bench: A Benchmark for Evaluating Autonomous Security Incident Response Agents

Researchers introduce SIR-Bench, a comprehensive benchmark for evaluating autonomous security incident response agents. The benchmark consis

arxiv.org·2mo ago

New Benchmark Reveals High Rates of Outcome-Driven Constraint Violations in Autonomous AI Agents

Researchers introduce a new benchmark for evaluating autonomous AI agents' safety, specifically focusing on outcome-driven constraint violat

arxiv.org·4mo ago

Survey of Self-Evolving AI Agents: Bridging Foundation Models and Lifelong Adaptability

The article surveys the emerging field of self-evolving AI agents, which aim to bridge the static capabilities of foundation models with the

arxiv.org·10mo ago

AgentArmor: Open-Source 8-Layer Security Framework for Agentic AI Applications

AgentArmor is an open-source security framework designed specifically for agentic AI applications, providing 8-layer defense-in-depth securi

github.com·3mo ago

JavelinGuard: Low-Cost Transformer Architectures for LLM Security

arxiv.org·1y ago

KERNHELM: Plan-Bound Authorization Architecture for Governing Privileged Effects in Untrusted AI Agents

The article presents KERNHELM, a plan-bound authorization architecture designed to govern privileged effects in untrusted computational agen

github.com·4mo ago

Comments

No comments yet. Be the first.