New ITBench-AA Benchmark Reveals AI Models Struggle with Enterprise SRE Tasks

AI models perform poorly on the new ITBench-AA benchmark, highlighting gaps for AI in enterprise IT.

By GenAI News | 1mo ago | 2 min read | en | News

