Behind the LLM Cybersecurity Hype

Introduction

The same vendors publishing breakthrough numbers on LLM cyber capability are publishing the disclaimers in the same posts. Read both halves and the picture changes.

The numbers that get the headlines

AISI’s evaluation of Anthropic’s Claude Mythos Preview¹ reports 73% success on expert-level CTFs and three full completions of a 32-step autonomous attack chain that AISI estimates at roughly 20 hours of human effort. A May arXiv preprint on multi-agent harness synthesis² claims ten zero-days in Chrome, including two critical sandbox escapes. Vincenzo Iozzo’s Firefox 147 benchmark³ saw Mythos produce 181 working exploits where Opus 4.6 produced two. AISI’s follow-up⁴ measures the doubling time of agent task length in cyber tasks at 4.7 months, down from eight months three quarters earlier.

Those numbers travel. They show up in press releases, board decks, and policy papers.
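To see why the doubling figure in particular travels so well, it helps to convert it into a yearly multiplier. The sketch below does only that arithmetic. It assumes the doubling time stays constant over the whole year, which the AISI post does not claim, and the function names are illustrative rather than taken from any source.

```python
import math


def annual_growth_factor(doubling_months: float) -> float:
    """Multiplier on task length after 12 months.

    Assumption: growth is a steady exponential with a fixed doubling time.
    """
    return 2 ** (12 / doubling_months)


def months_to_scale(factor: float, doubling_months: float) -> float:
    """Months needed for task length to grow by `factor` under the same assumption."""
    return doubling_months * math.log2(factor)


if __name__ == "__main__":
    # The two doubling times quoted in the text.
    for label, months in [("earlier estimate", 8.0), ("current estimate", 4.7)]:
        factor = annual_growth_factor(months)
        print(f"{label}: doubling every {months} months is about x{factor:.1f} per year")
```

Under that assumption, an eight-month doubling time is a bit under a threefold increase per year, while 4.7 months is closer to sixfold. The headline moves a lot on a fairly small change in the fitted rate, which is one reason to read the measurement conditions closely.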

The numbers that don’t

The same AISI evaluation notes — in the same post — that the test ranges have no active defenders, no defensive tooling, and no penalties for triggering alerts. Anthropic’s much-cited “thousands of critical vulnerabilities” claim was extrapolated from 198 manually reviewed cases⁵, one of which (an FFmpeg “critical”) Anthropic itself later walked back. Mythos surfaced real Linux kernel bugs and then could not exploit any of them because of the kernel’s defense-in-depth. Mozilla credited Mythos with 271 Firefox 150 vulnerabilities⁶, but the interesting question — what kind of bugs, how serious, do they shift the balance — went unanswered.

Iozzo’s framing is the cleanest: the models pattern-match against known bug shapes; they don’t reason about novel ones. “The only thing keeping us honest in this experiment was the ability to spot-check — and that only worked because I already knew the answer.”

The clearest signal

The most useful data point of the last month came from XBOW⁷ — an autonomous-pentest vendor whose entire business depends on this technology working. They disclosed CVE-2026-45185, an unauthenticated Exim RCE, and structured the disclosure as a head-to-head: their autonomous track against a senior exploit developer using LLMs as an assistant. The autonomous system succeeded against simplified CTF-style builds. Against the production Exim binary, with ASLR and PIE enabled, it never even leaked a stack address. Federico Kirschbaum, head of XBOW’s Security Lab, summarized it:

“Honestly, I don’t think LLMs alone are quite ready to write exploits against real-world software yet.”

When the vendor whose business depends on autonomous exploitation publicly concedes the autonomous part isn’t ready, the headline benchmark numbers deserve a second read.

What’s actually shipping

Strip the hype and the picture is consistent. LLMs are now industrial-scale bug finders and triage assistants, useful enough that curl, the Linux kernel, and Firefox are all dealing with their output at volume, sometimes productively⁸, sometimes as a moderation problem⁹. They are not yet the autonomous attackers that the “AlphaGo moment” framing suggests. And Drew Breunig’s “proof of work”¹⁰ economics (roughly $12,500 per attempted exploitation run, $125,000 for a ten-attempt campaign) describe a landscape where well-funded defenders may scale this technology before opportunistic attackers do.
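The cost figures are simple multiplication, but writing them out shows how quickly a budget turns into a small number of attempts. This is a minimal sketch: the $12,500 per-attempt figure is the one quoted above, while the budgets in the usage loop are hypothetical and chosen only for illustration.

```python
COST_PER_ATTEMPT_USD = 12_500  # per-attempt figure quoted in the text


def campaign_cost(attempts: int, cost_per_attempt: int = COST_PER_ATTEMPT_USD) -> int:
    """Total spend for a campaign of `attempts` exploitation runs."""
    return attempts * cost_per_attempt


def attempts_within_budget(budget_usd: int, cost_per_attempt: int = COST_PER_ATTEMPT_USD) -> int:
    """Whole number of attempts a given budget can pay for."""
    return budget_usd // cost_per_attempt


if __name__ == "__main__":
    print(f"ten-attempt campaign: ${campaign_cost(10):,}")
    # Hypothetical budgets, not figures from the source.
    for budget in (10_000, 100_000, 1_000_000):
        print(f"${budget:,} budget buys {attempts_within_budget(budget)} attempts")
```

At these prices a five-figure budget buys a handful of runs at most, which is the sense in which the economics favor organizations that can fund sustained campaigns.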

The hype is interesting. The disclaimers are useful. Pay more attention to the second.

¹ AISI, “Our evaluation of Claude Mythos Preview’s cyber capabilities.” https://www.aisi.gov.uk/blog/our-evaluation-of-claude-mythos-previews-cyber-capabilities
² arXiv, “Synthesizing Multi-Agent Harnesses for Vulnerability Discovery” (2604.20801). https://arxiv.org/abs/2604.20801
³ Vincenzo Iozzo, “The AlphaGo moment for vulnerability research?” https://vincenzoiozzo.com/blog/alphago-moment-vuln-research
⁴ AISI, “How fast is autonomous AI cyber capability advancing?” https://www.aisi.gov.uk/blog/how-fast-is-autonomous-ai-cyber-capability-advancing
⁵ BioCatch, “Claude Mythos: Hype or reality?” https://www.biocatch.com/blog/claude-mythos-hype-or-reality
⁶ xark.es, “A quick look at Mythos run on Firefox: too much hype?” https://xark.es/b/mythos-firefox-150
⁷ XBOW, “Dead.Letter: CVE-2026-45185 — XBOW found an RCE in Exim.” https://xbow.com/blog/dead-letter-cve-2026-45185-xbow-found-rce-exim
⁸ ghostbyt3, “N-Day Research with AI: Using Ollama and n8n.” https://ghostbyt3.github.io/blog/nday-research-ai
⁹ Hacker News discussion, “Linux kernel removing modules because of LLM bug-report load.” https://news.ycombinator.com/item?id=47862230
¹⁰ Drew Breunig, “Cybersecurity Looks Like Proof of Work Now.” https://www.dbreunig.com/2026/04/14/cybersecurity-is-proof-of-work-now.html
