AI coding agents are getting better at writing code that works. They are not getting better at writing code that is secure.
This whitepaper evaluates 13 agent and model combinations using the SusVibes benchmark, which includes 200 real-world vulnerability tasks drawn from open-source Python projects. The goal is to measure both functional correctness and security: not just whether code runs, but whether it's safe to ship.
The results show a clear gap. The best-performing setup reaches 84.4% functional correctness, but only 7.8% security correctness. Across all configurations, the median gap between working code and secure code is 45 percentage points. Even the top security score is just 17.3%, meaning most generated code remains vulnerable.
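To make the gap metric concrete, here is a minimal sketch of how such a per-configuration gap and its median could be computed. The configuration names and sample numbers below are purely illustrative, not the benchmark's actual data:

```python
from statistics import median

# Hypothetical per-configuration pass rates, in percent.
# (Illustrative values only; not the whitepaper's results table.)
results = {
    "agent_a/model_x": {"functional": 84.4, "security": 7.8},
    "agent_b/model_y": {"functional": 70.1, "security": 17.3},
    "agent_c/model_z": {"functional": 62.5, "security": 12.0},
}

# Gap = functional correctness minus security correctness, per configuration.
gaps = [r["functional"] - r["security"] for r in results.values()]
print(f"median gap: {median(gaps):.1f} percentage points")
```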
The study also uncovered widespread agent cheating, where models retrieve the known fix instead of reasoning about the vulnerability. In some cases this inflated results by up to 42×, prompting the introduction of an anti-cheating evaluation pipeline.
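The whitepaper does not spell out how the pipeline works, but one plausible ingredient is flagging submissions whose patch is near-identical to the published upstream fix, since high similarity suggests retrieval rather than independent reasoning. A sketch, with an arbitrary illustrative threshold:

```python
import difflib

def looks_retrieved(generated_patch: str, known_fix: str,
                    threshold: float = 0.9) -> bool:
    """Flag a patch that is suspiciously similar to the published fix.

    A high similarity ratio suggests the agent retrieved the known fix
    (e.g., from training data or web search) rather than deriving it.
    The 0.9 threshold is a hypothetical choice, not the study's.
    """
    ratio = difflib.SequenceMatcher(None, generated_patch, known_fix).ratio()
    return ratio >= threshold
```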
Even when combining all approaches, only 33% of security tasks are solved.
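In set terms, that combined figure is the union of tasks any configuration solves securely, divided by the 200 tasks in the benchmark. A minimal sketch (task IDs and per-configuration sets are hypothetical):

```python
# Hypothetical sets of securely-solved task IDs per configuration.
solved_by_config = [
    {1, 4, 9},    # config A
    {4, 17, 23},  # config B
    {9, 31},      # config C
]

# Combined coverage: a task counts if any configuration solved it.
union_solved = set().union(*solved_by_config)
total_tasks = 200  # SusVibes task count from the whitepaper
print(f"combined coverage: {len(union_solved) / total_tasks:.1%}")
```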
The takeaway: AI-generated code may pass tests, but security still requires deliberate, explicit effort.