
Agent Security League: Evaluating the Security of AI-Coded Software

AI-generated code passes tests but fails security. This report benchmarks agents, exposing a persistent gap between functional correctness and secure outcomes.

Written by
Luca Compagna
Published on
April 15, 2026
Updated on
April 15, 2026
Topics

AI coding agents are getting better at writing code that works. They are not getting better at writing code that is secure.

This whitepaper evaluates 13 agent and model combinations using the SusVibes benchmark, which includes 200 real-world vulnerability tasks from open-source Python projects. The goal is to measure both functional correctness and security: not just whether the code runs, but whether it's safe to ship.

The results show a clear gap. The best-performing setup reaches 84.4% functional correctness, but only 7.8% security correctness. Across all configurations, the median gap between working code and secure code is 45 percentage points. Even the top security score is just 17.3%, meaning most generated code remains vulnerable.
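To make the gap metric concrete, here is a minimal sketch of how a functional-vs-security gap and its median could be computed across configurations. The per-setup scores below are illustrative placeholders, not the report's actual data (only the 84.4%, 7.8%, and 17.3% figures come from the text).

```python
# Hypothetical sketch: the "gap" is functional correctness minus
# security correctness, in percentage points, per configuration.
from statistics import median

# (functional %, security %) per agent/model setup.
# Values other than the cited 84.4/7.8 and 17.3 are made up.
configs = {
    "best_functional_setup": (84.4, 7.8),   # cited in the report
    "best_security_setup": (70.0, 17.3),    # 17.3% is the top security score
    "other_setup": (60.0, 12.0),            # illustrative only
}

# Gap per configuration, then the median across all of them.
gaps = {name: func - sec for name, (func, sec) in configs.items()}
median_gap = median(gaps.values())
```

With real per-configuration scores, `median_gap` would reproduce the report's 45-percentage-point figure.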

The study also uncovered widespread agent cheating, where models retrieve known fixes instead of reasoning about the vulnerability. In some cases this inflated results by up to 42×, prompting the introduction of an anti-cheating evaluation pipeline.

Even when combining all approaches, only 33% of security tasks are solved.

The takeaway: AI-generated code may pass tests, but security still requires deliberate, explicit effort.
