The Math Problem AI Just Changed for Security Testing
- David O'Neil
- Cybersecurity
- 22 Mar, 2026
RSA 2026 Pre-Conference Series
Here’s the problem every security team lives with but rarely says out loud.
Your environment changes every time a developer merges code, every time someone adds a cloud resource, every time a vendor pushes an update. In 2025, organizations with fewer than 50 security tools reported a 93% breach rate in the prior two years. Not a typo. 93%. And the Verizon DBIR found that over two-thirds of those breaches involved vulnerabilities that had been sitting unpatched for more than 90 days — despite organizations having run assessments recently.
The annual pentest was never the problem. The arithmetic was.
Attackers have infinite attempts and machine speed. Defenders historically had one or two point-in-time snapshots of their posture per year, which covered maybe 5-10% of the actual attack surface at any given moment. The math has always been asymmetric. What’s changed in the last eighteen months is that AI has started reshaping the variables.
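The asymmetry is easy to put numbers on. A back-of-the-envelope model (illustrative figures only, not from any dataset): a change merged at a random time waits, on average, half the testing interval before the next assessment even looks at it.

```python
# Back-of-the-envelope model of validation lag (illustrative, not empirical).
# A change landing at a random moment waits, on average, half the test
# interval before the next point-in-time assessment covers it.

def mean_validation_lag_days(tests_per_year: int) -> float:
    """Average days between a change landing and the next test covering it."""
    interval = 365 / tests_per_year
    return interval / 2

for label, freq in [("annual", 1), ("semi-annual", 2), ("monthly", 12), ("weekly", 52)]:
    print(f"{label:12s}: ~{mean_validation_lag_days(freq):5.1f} days of mean exposure")
```

Under annual testing the mean lag is roughly 182 days; monthly testing cuts it to about 15. That gap, not any single tool feature, is the arithmetic this article is about.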
I’m writing this the week before RSA 2026, where the dominant AI conversation will be about governance, agent risk, and securing AI infrastructure. That’s the right conversation for the audience — CISOs in buy mode, enterprise leaders, policy people. But there’s a gap in the RSA agenda: almost nobody is talking about what AI is actually doing to offensive security and penetration testing. That conversation is happening at DEF CON and Black Hat. It should also be happening in the boardroom.
Here’s what the data says.
Three Variables, Not One
The AI pen testing pitch usually collapses into one claim: it’s faster. Faster is real, but it’s the least interesting thing. The more accurate frame is that AI-augmented penetration testing is changing three variables simultaneously — volume, speed, and accuracy. Each one matters. Their interaction is where it gets interesting.
Volume: From Point-in-Time to Continuous
Traditional penetration testing is constrained by human hours. A qualified pentester costs $150-250/hr plus engagement overhead. Annual or semi-annual testing is the industry norm not because it’s the right frequency but because it’s what most budgets can sustain.
AI removes the human-hours constraint. NodeZero (Horizon3.ai) has now executed over 170,000 autonomous penetration tests. Eighty-two percent of their customers run tests at least monthly. That is not a speed claim — it’s a frequency claim. Organizations that previously ran one comprehensive engagement per year are now continuously validating their posture at the same or lower cost.
The scale numbers are also worth noting: their largest single autonomous engagement covered more than 100,000 IP addresses for a municipal transit authority. That’s not something a human team is doing in a two-week sprint. Pentera’s survey of 500 global CISOs found that 55% of enterprises now use software-based penetration testing as their primary method. The shift isn’t coming — it’s largely happened.
For a CISO, the volume shift changes the conversation with the board. You’re no longer presenting a point-in-time compliance artifact. You’re presenting a trend line: here’s our attack surface profile last month, here’s this month, here’s what changed and why.
Speed: From Weeks to Hours
Speed is the claim vendors lead with, and it's where the most hype lives, so I'll be precise about what's actually validated.
XBOW, an autonomous penetration testing AI, reached #1 on HackerOne’s global leaderboard in August 2025 — outperforming thousands of human hackers. In a controlled comparison, it matched a senior pentester’s 40-hour engagement in 28 minutes with an identical 85% vulnerability score. That is extraordinary, and I want to be honest that it reflects near-optimal conditions. Complex enterprise environments with custom applications and multi-system interactions will not hit that benchmark consistently.
What’s more broadly applicable: NodeZero identified a critical attack path exposing customer data in a financial services environment within 16 hours of deployment. ManticoreAI consistently delivers findings in 48 hours versus the 6-8 week industry standard for traditional consultancies. Cobalt, which positions itself as human-led and AI-powered, reports 2.6x faster time-to-report compared to traditional pentesting processes.
The relevant concept here is “Patch Tuesday to Pentest Wednesday.” Before AI, same-day validation of newly patched systems was practically impossible. Patch something, wait weeks for a human engagement, discover whether the patch held. Now that window collapses to hours. The 90-day vulnerability exposure window that Verizon identified? That’s the target metric AI pen testing is designed to eliminate.
Accuracy: The Part That Actually Requires Expertise
This is where I’ll push back on the vendors who sell fully autonomous solutions without nuance.
AI without expert oversight reliably produces false positives. This is not a criticism — it’s a structural characteristic of how these systems work. The ARTEMIS academic study (arXiv:2512.09882) found that AI agents exhibit higher false-positive rates than human professionals even when they outperform on volume. HackerOne’s research characterized AI pen testing as producing “wide but shallow” coverage with a higher rate of hallucinated findings.
The common failure modes are specific and worth knowing:
- Hallucinated CVEs: LLMs generate plausible-sounding but non-existent vulnerability references. Pattern matching that produces confident output on fabricated findings.
- OWASP tunnel vision: Tools trained on standard vulnerability patterns miss novel findings, configuration-specific issues, and business logic flaws that require understanding how the application is supposed to work.
- Scope drift: Autonomous agents may follow attack paths beyond authorized scope without explicit guardrails — a significant legal and operational risk.
- False confidence: The most dangerous failure mode. Organizations using autonomous tools without human review may believe they’ve validated their posture when the actual coverage is shallow.
Here’s what expert oversight changes: Cobalt’s “Human-Led, AI-Powered” model explicitly states that agentic AI is not mature enough to find complex, nuanced vulnerabilities and exploits that experienced pentesters can. Their AI is trained on 10+ years of real pentesting data, and their human experts validate, contextualize, and prioritize AI findings before they hit the report. XBOW’s structural response to the false-positive problem is to only report findings when exploitability is confirmed through controlled validation — not when the tool suspects a vulnerability, but when it has demonstrated impact.
The ARTEMIS study put a number on the economics: $18/hour for the AI agent versus $60/hour for a professional pentester. The AI is cheaper, but unmanaged AI produces more noise per dollar. The optimum is an expert at $60/hr validating and directing AI running $18/hr coverage — effectively multiplying the expert’s output capacity three to five times without degrading the accuracy of what gets reported.
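The capacity math is worth working through. Using the ARTEMIS hourly rates quoted above — and treating the coverage hours and the expert-validation ratio as my own illustrative assumptions, not numbers from the study — the hybrid model looks like this:

```python
# Cost sketch using the ARTEMIS hourly rates quoted above ($18 AI, $60 expert).
# The coverage hours and validation ratio are illustrative assumptions,
# not figures from the study.

AI_RATE = 18      # $/hour, AI agent running coverage
EXPERT_RATE = 60  # $/hour, human pentester

def hybrid_cost(coverage_hours: float, validation_ratio: float) -> float:
    """AI runs all coverage; an expert spends a fraction of those hours
    validating and directing the AI's findings."""
    return coverage_hours * AI_RATE + coverage_hours * validation_ratio * EXPERT_RATE

coverage = 160                        # assumed: a month of continuous AI coverage
human_only = coverage * EXPERT_RATE   # an expert doing all 160 hours personally
hybrid = hybrid_cost(coverage, 0.25)  # expert validates roughly 1 hour in 4

print(f"human-only: ${human_only:,.0f}")  # $9,600
print(f"hybrid:     ${hybrid:,.0f}")      # $5,280
```

Under these assumptions the expert's direct hours drop from 160 to 40 for the same coverage surface — a 4x capacity multiplier, squarely in the three-to-five-times range — at a little over half the cost. Vary the ratio and the multiplier moves, but the shape of the result doesn't.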
That is not a hedge. That is the actual finding.
What This Means for CISOs Buying at RSA This Week
RSA 2026’s floor will have no shortage of AI pen testing vendors. Here’s how to evaluate them honestly.
Ask about false-positive reduction mechanisms, not just speed claims. Any autonomous platform should be able to explain how it distinguishes confirmed findings from suspected ones. “Only reports exploitability-confirmed findings” (XBOW’s model) is a substantively different claim than “AI-powered vulnerability identification.” The mechanism matters.
Understand where the human is in the loop — and where it isn’t. The bifurcation in the market is real. Fully autonomous platforms (NodeZero, Pentera, XBOW) are appropriate for continuous validation — testing frequency and coverage surface that human-only engagements can’t match. Human-augmented PTaaS (Cobalt) is appropriate when the engagement requires depth: custom applications, business logic validation, compliance artifacts that require human sign-off. Most organizations should be using both models for different purposes.
Recognize that compliance is not solved. SOC 2, HIPAA, PCI DSS, and FedRAMP explicitly require human-validated penetration testing. AI-only reports do not satisfy those requirements today. If a vendor implies otherwise, that’s a gap in their pitch, not a feature.
Match the tool to the attack surface, not the budget. Organizations using fewer than 50 security tools are the most breached — in part because of coverage gaps, not just tool quality. The frequency argument for AI pen testing is strongest where the attack surface is large and dynamic: cloud environments, DevOps pipelines, organizations deploying continuously. If you’re running static on-premises infrastructure, the ROI math looks different.
The cyber insurance angle is coming. Fifty-nine percent of CISOs have already implemented at least one solution recommended by their cyber insurance provider (Pentera survey). Insurers are beginning to value continuous validation programs over periodic compliance artifacts. The CISO who can show a monthly attack surface trend line is in a stronger position at renewal than the one presenting an annual point-in-time report.
What AI Pen Testing Doesn’t Solve
Since I’ve argued for the value of AI-augmented testing, here’s what it doesn’t address.
Business logic vulnerabilities — the kind that require understanding how your application is supposed to work — remain human territory. If a user can manipulate a workflow in a way that exploits a process flaw rather than a technical one, current AI tools will miss it. The academic literature is consistent on this.
Social engineering, physical security, and insider threat are entirely outside the scope of technical pen testing, AI or otherwise. The conversation about AI-powered phishing (QR phishing incidents rose 28% to 1.2 million in 2025) is real, but that’s a threat your defensive controls address, not your pen testing program.
And the strategic advisory relationship with a trusted security partner doesn’t compress into a tool. Knowing that you have attack path X doesn’t tell you whether remediating it is a 20-hour project or a six-month architecture change. That contextual judgment — the “so what” of a finding — still requires a practitioner who understands your environment.
The Honest Summary
AI-augmented penetration testing is real, measurable, and changing the math of defensive security. NodeZero’s 170,000 autonomous pentests. XBOW matching a 40-hour assessment in 28 minutes. Cobalt delivering 255,000 testing hours in 2025. The data is not vendor marketing — it’s operational track record.
The accuracy qualifier is not a limitation to be footnoted. It’s the finding. AI without expert oversight amplifies noise. AI with expert oversight multiplies expert capacity. The practitioners who figure out how to manage AI-augmented workflows — validating findings, directing coverage, applying contextual judgment — are the ones who will deliver the most value per dollar in the next five years.
The arithmetic of security testing just changed. The question for this RSA is whether your organization’s validation program has caught up.
David O’Neil is a CISO and security leader. He writes at CISOExpert on practical security strategy, AI in security, and what actually works when the board is watching.
If this framing challenged your assumptions about AI pen testing, share it with your team. If I got something wrong, tell me — I’d rather be corrected than confident.
RSA 2026 content series | March 2026