TL;DR
AI is making capable security researchers more effective while enabling less skilled practitioners to produce polished but inaccurate output at scale, and the reconciling variable is the harness built around the model and the expertise of the person driving it. The bug bounty collapse is the clearest evidence: programs across the industry have paused, tightened, or shut down entirely under a flood of AI-generated submissions that look legitimate and aren't. As the cost of producing plausible output falls to zero, the cost of validating it rises, and the only thing that pays it down is human judgment.
Depending on what your feed's algorithm serves you, you might be led to believe that AI has either transformed vulnerability research or drowned it in noise. Reality, as is usually the case, lies somewhere in between. AI models are making the best security researchers better and enabling the least capable to produce convincing garbage at scale.
Raising the Ceiling
When a capable researcher pairs real expertise with an LLM and a well-built workflow, the results are impressive. I won’t belabor this point. Just scroll through your LinkedIn feed and the examples will find you.
Yet one is worth naming because the person behind it carries both halves of the argument at once. Niels Provos committed the OpenBSD TCP SACK implementation, bug and all, back in 1998. Anthropic's red team made that same 27-year-old flaw the marquee finding of their Mythos Preview report, framing deep vulnerability discovery as a frontier-model capability gated behind restricted access. Provos took that framing apart: using his open-source IronCurtain orchestration framework, he replicated the OpenBSD finding and discovered new zero-days in widely deployed software — not with Mythos, but with commercially available models like Opus and Sonnet, and even an open-weight model. His conclusion:
"Vulnerability discovery is an orchestration problem, not a frontier-model problem."
But read his write-up closely, and you find the part that the current "the harness is everything" commentary tends to skip. His successes required a human in the loop at decisive moments, e.g., steering the model manually when an early run stalled, guiding validation when constrained environments masked a trigger, and doing hands-on analysis to confirm exploitability for accurate severity scoring. The harness enabled use of the engine, but an expert still had to drive it.
Our own experience is consistent with this. Over years of evolving how we use these tools, our team has discovered multiple zero-day vulnerabilities in commercial products in concert with LLMs. In those cases, we achieved the greatest impact when pairing our expertise with LLMs to reverse engineer software and rapidly produce working exploits. Check out our blog for those vulnerabilities that we’ve been allowed to publicly talk about.
Lowering the Bar
Now the other end of the spectrum. The same generative capability, handed to someone with no real understanding, produces output that looks like expert work and is wrong. Cisco's Talos team found that models like ChatGPT, Claude, and Gemini "generated polished-looking results that often contained significant inaccuracies, unusual conclusions, and inconsistent writing styles."
Our team’s experience echoes Cisco’s. In a client deliverable, a false positive is the fastest way to burn the trust the entire engagement runs on. It sends their team chasing a phantom, and the moment one finding doesn't hold up, every finding is in doubt.
Reporting that is polished-looking and significantly inaccurate is a dangerous combination. The output looks reasonable enough to pass a quick glance but is wrong enough to waste time or hurt you if you trust it.
The bug bounty world is the clearest casualty:
- Bugcrowd watched its triage queue grow "more than 334%" over a three-week stretch, a flood they characterized by "thin evidence, templated write-ups, and a high likelihood that the issue was not verified before submission."
- HackerOne paused its long-running Internet Bug Bounty in March 2026 after a 76% year-over-year jump in submissions. That program is the one that quietly paid curl's bounties for years, something curl's maintainer said the project "could not have done without." The underwriter of the healthy era became a casualty of the new one.
- curl itself, a program that paid out over $100,000 across 87 confirmed vulnerabilities, saw fewer than 5% of its 2025 submissions turn out to be legitimate and shut its bounty down at the end of January 2026.
- Nextcloud ended rewards citing the same "massive increase of low-quality reports."
These aren't isolated complaints. And here is the part most people miss: the more fluent, confident, and plausible the slop becomes, the harder it is to validate. A report that's obviously garbage gets closed in a minute. A report that's polished, well-structured, and wrong has to be closely reviewed.
Bugcrowd named the underlying mechanic precisely: AI "changed the economics of any human-validated system. Convincing content got cheap to produce, and checking whether it's actually correct did not."
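To make that asymmetry concrete, here is a back-of-the-envelope sketch. The 30-minute review time, the 50% "healthy" rate, and the function name are illustrative assumptions of ours, not figures from Bugcrowd or curl; only the 5% rate loosely echoes curl's "fewer than 5%" number above.

```python
def review_minutes_per_valid_report(minutes_per_review: float, valid_rate: float) -> float:
    """Expected reviewer minutes spent for each report that turns out to be real."""
    if not 0 < valid_rate <= 1:
        raise ValueError("valid_rate must be in (0, 1]")
    return minutes_per_review / valid_rate


# Hypothetical inputs: a polished report takes 30 minutes to review either way.
healthy = review_minutes_per_valid_report(30, 0.50)  # half of submissions are real
flooded = review_minutes_per_valid_report(30, 0.05)  # one in twenty is real

print(f"healthy queue: ~{healthy:.0f} min per valid report")
print(f"flooded queue: ~{flooded:.0f} min per valid report")
```

Under these assumed numbers, the price of each real finding goes up tenfold without a single report getting harder to read. And that is the generous case: polished slop tends to take longer to rule out than obvious garbage, so the per-review time climbs too.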
Why Both Are True at Once
Same models. Opposite outcomes. The reconciling variable is the harness and the person driving it.
This isn't unique to security, which is worth sitting with for a moment. It's the same reason Anthropic and OpenAI have both stood up forward-deployed engineering teams to embed with their customers. The reason isn't that the models are weak; it's that deployment is the hard part. MIT's research on enterprise AI found that the overwhelming majority of pilots showed no measurable business impact, and that the failures traced to implementation, not model quality. The model is necessary but not sufficient. The human layer (the people who know the domain, build the workflow, and validate the output) is where the value is actually realized.
With a nod to Cisco’s racing metaphor, we'd push it one notch further. The model is an engine. A remarkable one. But an engine isn't a car. To get anywhere you need a transmission, a frame, tires, and a steering wheel. That's the harness: the validation layers, the orchestration, the context injection, the triage logic. And even a fully built car doesn't win races on its own. It needs a driver who knows when to brake, which line to take, and when the cautious-looking exit was the one that mattered. The new arrivals are just now discovering they need to build the car. What they haven't hit yet, and consider this the heads-up, is that the car still needs a driver who can actually race it.
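For a feel of what one small piece of that car looks like, here is a minimal sketch of a triage gate. It is not Cisco's harness or Provos' IronCurtain; the `Finding` type, the `triage` function, and the stand-in reproducer are all hypothetical names we made up for illustration. The idea it shows is simple: a model-generated finding with no proof of concept, or one that fails to reproduce, never reaches a report, and whatever survives goes to a human, not to the client.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Finding:
    title: str
    proof_of_concept: str  # input or command the model claims triggers the bug


def triage(finding: Finding, reproduce: Callable[[str], bool]) -> str:
    """Route a model-generated finding; nothing is auto-accepted."""
    if not finding.proof_of_concept.strip():
        return "rejected: no proof of concept"
    if not reproduce(finding.proof_of_concept):
        return "rejected: did not reproduce"
    # Reproduction is necessary, not sufficient: an expert still confirms
    # exploitability and severity before anything is reported.
    return "queued for expert review"


def fake_reproduce(poc: str) -> bool:
    """Stand-in for illustration; a real harness would run the PoC in a sandbox."""
    return "crash" in poc


findings = [
    Finding("Heap overflow in parser", "send crash_input.bin"),
    Finding("Possible SQL injection", ""),
    Finding("Use-after-free in cache", "send benign.bin"),
]
results = [triage(f, fake_reproduce) for f in findings]
for finding, result in zip(findings, results):
    print(f"{finding.title}: {result}")
```

Note where the gate ends: the best outcome it can produce is "queued for expert review." That's the driver's seat, and the code deliberately leaves it empty.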
The Value Is Appreciating
The reflexive fear about these tools is that they commoditize expertise: once everyone has the engine, the people who used to be valuable for finding bugs no longer are. The evidence points hard in the other direction.
Provos' zero-days required an expert at the wheel. Cisco's sub-3% false-positive rate came from embedding human domain knowledge into the harness and validating every finding through human review. As the cost of producing plausible output falls to zero, the cost of telling true from false rises, and the only thing that pays it down is judgment. The gap between the researcher who knows when to trust the model and the one who forwards whatever it produced is not shrinking. It's widening.
That's why we think the expertise of the operator is becoming more important, not less, and why the value of that expertise is appreciating. Mythos, and everything like it, doesn't deploy itself.