AI coding agents are getting better at writing code that works. They are not getting better at writing code that is secure.
This whitepaper evaluates 13 agent and model combinations using the SusVibes benchmark, which includes 200 real-world vulnerability tasks drawn from open-source Python projects. The goal is to measure both functional correctness and security: not just whether code runs, but whether it's safe to ship.
The results show a clear gap. The best-performing setup reaches 84.4% functional correctness, but only 7.8% security correctness. Across all configurations, the median gap between working code and secure code is 45 percentage points. Even the top security score is just 17.3%, meaning most generated code remains vulnerable.
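To make the gap metric concrete, here is a minimal sketch of how such a per-configuration gap and its median could be computed. The configuration names and sample numbers below are purely illustrative, not the benchmark's actual data:

```python
from statistics import median

# Hypothetical per-configuration pass rates, in percent.
# (Illustrative values only; not the whitepaper's results table.)
results = {
    "agent_a/model_x": {"functional": 84.4, "security": 7.8},
    "agent_b/model_y": {"functional": 70.1, "security": 17.3},
    "agent_c/model_z": {"functional": 62.5, "security": 12.0},
}

# Gap = functional correctness minus security correctness, per configuration.
gaps = [r["functional"] - r["security"] for r in results.values()]
print(f"median gap: {median(gaps):.1f} percentage points")
```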
The study also uncovered widespread agent cheating, where models retrieve the known fix instead of reasoning about the vulnerability. In some cases this inflated results by up to 42×, prompting the introduction of an anti-cheating evaluation pipeline.
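The whitepaper does not spell out how the pipeline works, but one plausible ingredient is flagging submissions whose patch is near-identical to the published upstream fix, since high similarity suggests retrieval rather than independent reasoning. A sketch, with an arbitrary illustrative threshold:

```python
import difflib

def looks_retrieved(generated_patch: str, known_fix: str,
                    threshold: float = 0.9) -> bool:
    """Flag a patch that is suspiciously similar to the published fix.

    A high similarity ratio suggests the agent retrieved the known fix
    (e.g., from training data or web search) rather than deriving it.
    The 0.9 threshold is a hypothetical choice, not the study's.
    """
    ratio = difflib.SequenceMatcher(None, generated_patch, known_fix).ratio()
    return ratio >= threshold
```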
Even when combining all approaches, only 33% of security tasks are solved.
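In set terms, that combined figure is the union of tasks any configuration solves securely, divided by the 200 tasks in the benchmark. A minimal sketch (task IDs and per-configuration sets are hypothetical):

```python
# Hypothetical sets of securely-solved task IDs per configuration.
solved_by_config = [
    {1, 4, 9},    # config A
    {4, 17, 23},  # config B
    {9, 31},      # config C
]

# Combined coverage: a task counts if any configuration solved it.
union_solved = set().union(*solved_by_config)
total_tasks = 200  # SusVibes task count from the whitepaper
print(f"combined coverage: {len(union_solved) / total_tasks:.1%}")
```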
The takeaway: AI-generated code may pass tests, but security still requires deliberate, explicit effort.