Vulnerability Discovery with LLM-Powered Patch Diffing

TL;DR: LLMs demonstrate significant potential to streamline vulnerability research by automating patch diff analysis. Our structured experiments across high-impact CVEs show that targeted LLM applications can reduce analysis time while maintaining accuracy, with Claude Sonnet 3.7 emerging as the optimal balance of performance and cost efficiency.

Vulnerability researchers spend countless hours performing patch diffing, the meticulous process of comparing software versions to identify security fixes and understand attack vectors. This becomes particularly challenging when security advisories provide incomplete information, forcing analysts to manually decompile binaries and sift through thousands of code changes to locate the actual vulnerability.

Traditional patch diffing workflows involve decompiling affected and patched binaries, generating differential reports, and manually analyzing each function change. For complex applications, this process can consume days or weeks of expert time, creating bottlenecks in vulnerability research pipelines.

This blog details our recent research into how LLMs can assist in scaling patch diffing workflows, saving valuable time in a crucial race against attackers. For full details on methodology and results, download our LLM-Assisted Vulnerability Research Guide.

Experimental Framework

Our research employed a structured methodology targeting four high-impact vulnerabilities with CVSS scores of 9.4 or higher within the following vulnerability classes: information disclosure, format string injection, authorization bypass, and stack buffer overflow.

For each test case, we used traditional automation to decompile the affected and patched binaries (using Binary Ninja), generate a differential report (using BinDiff), and extract the decompiled code (medium level intermediate language, or MLIL) for the changed functions only.
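
As a rough illustration of that pipeline, the sketch below pulls the addresses of changed functions from a BinDiff results database and dumps their MLIL with Binary Ninja's headless Python API. The file names are placeholders, and the assumption that the BinDiff SQLite results expose a function table with address1, address2, and similarity columns should be verified against your BinDiff version; this is not the exact tooling used in our research.

```python
# Sketch only: paths are placeholders and the BinDiff schema assumption
# (a `function` table with address1/address2/similarity) should be verified.
# Requires a Binary Ninja license that permits headless scripting.
import sqlite3
import binaryninja

DIFF_DB = "old_vs_new.BinDiff"      # BinDiff results database (SQLite); placeholder name
PATCHED_BIN = "target_patched.bin"  # placeholder path to the patched binary

def changed_function_addrs(diff_db: str) -> list[int]:
    """Addresses (in the patched binary) of matched functions that changed."""
    with sqlite3.connect(diff_db) as con:
        rows = con.execute(
            "SELECT address2 FROM function WHERE similarity < 1.0"
        ).fetchall()
    return [addr for (addr,) in rows]

def dump_mlil(bv, addr: int):
    """Render one function's medium level IL as text for the LLM prompt."""
    func = bv.get_function_at(addr)
    if func is None:
        return None
    try:
        return "\n".join(str(instr) for instr in func.mlil.instructions)
    except Exception:
        return None  # MLIL not available for this function

bv = binaryninja.load(PATCHED_BIN)  # analysis runs on load by default
for addr in changed_function_addrs(DIFF_DB):
    code = dump_mlil(bv, addr)
    if code:
        print(f"=== function at {addr:#x} ===\n{code}\n")
```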

We gave the LLMs two prompts:

  • Prompt 1: Included the decompiled functions and asked the LLM to suggest a name for each function, summarize the function’s purpose, and summarize the changes made to the function.
  • Prompt 2: Included the text of the vendor’s security advisory along with the output from the first prompt and asked the LLM to rank the functions in order of their relevance to the advisory. This ranking was not a one-shot prompt, but followed the iterative methodology used by our open-source raink tool. (A minimal sketch of both prompts follows this list.)
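
For illustration, here is a minimal sketch of how these two prompts might be issued through the Anthropic Python SDK. The prompt wording, placeholder fields, and model alias are our own assumptions rather than the exact prompts used in this research; in practice, Prompt 2 is not sent once over the full function list but applied iteratively in batches, as described next.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative templates only; the wording used in our research differs.
PROMPT_1 = """For each decompiled function below, suggest a descriptive name,
summarize the function's purpose, and summarize the changes made to it.

{functions}"""

PROMPT_2 = """Given this vendor security advisory:

{advisory}

Rank the following function summaries from most to least relevant to the
vulnerability described in the advisory.

{summaries}"""

def ask(prompt: str, model: str = "claude-3-7-sonnet-latest") -> str:
    # Model alias is an assumption; confirm current IDs in the Anthropic docs.
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text
```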

We then reviewed the results from each test to determine the placement of the known vulnerable function. In cases where multiple functions were associated with the vulnerability, we used whichever one had the best placement.
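
raink itself is an open-source Go tool; the Python sketch below only illustrates the general idea of iterative, batched listwise ranking (shuffle the candidates, rank small batches over several passes, and aggregate the per-batch placements) along with how placement can be scored. The rank_batch callable is a hypothetical wrapper around the Prompt 2 LLM call; this is not raink's exact algorithm.

```python
import random
from collections import defaultdict
from typing import Callable

def rank_functions(
    names: list[str],
    rank_batch: Callable[[list[str]], list[str]],  # hypothetical wrapper around the Prompt 2 call
    batch_size: int = 10,
    passes: int = 5,
) -> list[str]:
    """Raink-style sketch: shuffle, rank small batches, aggregate placements."""
    scores = defaultdict(float)
    candidates = list(names)
    for _ in range(passes):
        random.shuffle(candidates)
        for i in range(0, len(candidates), batch_size):
            batch = candidates[i : i + batch_size]
            ordered = rank_batch(batch)  # LLM returns the batch, most relevant first
            for pos, name in enumerate(ordered):
                scores[name] += (len(ordered) - pos) / len(ordered)
    return sorted(names, key=lambda n: scores[n], reverse=True)

def best_placement(ranked: list[str], known_vulnerable: set[str]) -> int:
    """1-based rank of the best-placed known vulnerable function."""
    return min(ranked.index(name) + 1 for name in known_vulnerable if name in ranked)
```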

To improve statistical reliability, we ran multiple iterations of each test case with three LLM models: Claude Haiku 3.5, Claude Sonnet 3.7, and Claude Sonnet 4.
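
For completeness, a trivial tally like the one below turns the per-run placements into the Top 5 and Top 25 hit rates reported in the next section; it is a measurement sketch, not part of the published tooling.

```python
def hit_rates(placements: list[int]) -> dict[str, str]:
    """Tally per-run placements into the Top 5 / Top 25 rates reported below."""
    runs = len(placements)
    top5 = sum(p <= 5 for p in placements)
    top25 = sum(p <= 25 for p in placements)
    return {
        "Top 5": f"{top5}/{runs} ({top5 / runs:.0%})",
        "Top 25": f"{top25}/{runs} ({top25 / runs:.0%})",
    }
```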

Performance Results

The majority of model and test case combinations (8 of 12, or 66%) placed a known vulnerable function within the Top 25 ranked results, suggesting that the overall approach holds great promise.

| Test case | Changed functions | Known vulnerable | Model | Top 5 | Top 25 | Avg. time | Avg. cost |
|---|---|---|---|---|---|---|---|
| Info disclosure | 27 | 2 | Claude Haiku 3.5 | 10/10 (100%) | 10/10 (100%) | 23 mins | $0.72 |
| Info disclosure | 27 | 2 | Claude Sonnet 3.7 | 10/10 (100%) | 10/10 (100%) | 8 mins | $2.71 |
| Info disclosure | 27 | 2 | Claude Sonnet 4 | 10/10 (100%) | 10/10 (100%) | 39 mins | $2.70 |
| Format string | 134 | 1 | Claude Haiku 3.5 | 0/10 (0%) | 0/10 (0%) | 5 mins | $0.43 |
| Format string | 134 | 1 | Claude Sonnet 3.7 | 10/10 (100%) | 10/10 (100%) | 6 mins | $1.71 |
| Format string | 134 | 1 | Claude Sonnet 4 | 10/10 (100%) | 10/10 (100%) | 7 mins | $1.66 |
| Auth bypass | 1424 | 2 | Claude Haiku 3.5 | 7/9 (78%) | 8/9 (89%) | 39 mins | $9.26 |
| Auth bypass | 1424 | 2 | Claude Sonnet 3.7 | 10/10 (100%) | 10/10 (100%) | 42 mins | $35.52 |
| Auth bypass | 1424 | 2 | Claude Sonnet 4 | 0/10 (0%) | 0/10 (0%) | 412 mins | $34.71 |
| Stack overflow | 708 | 1 | Claude Haiku 3.5 | 0/9 (0%) | 7/9 (78%) | 22 mins | $5.77 |
| Stack overflow | 708 | 1 | Claude Sonnet 3.7 | 0/10 (0%) | 0/10 (0%) | 28 mins | $22.30 |
| Stack overflow | 708 | 1 | Claude Sonnet 4 | 0/10 (0%) | 0/10 (0%) | 88 mins | $21.79 |

Figure 1: Summary of results by test case and LLM model

Vulnerability Analysis

To better understand where this approach excels and where it falters, let’s take a closer look at each test case individually.

  • The information disclosure vulnerability was perhaps the simplest test for the LLM workflow, as the patch diff only contained 27 changed functions in total. 
    • All three LLM models correctly identified the vulnerable functions and ranked at least one of them within the Top 5 results in every single test run. 
    • This is the gold standard we hope to attain with every patch analysis.
  • The format string vulnerability had 134 changed functions in the patch diff, which presented slightly more of a challenge for analysis. 
    • While Claude Haiku 3.5 failed to identify the known vulnerable function in all cases, Claude Sonnet 3.7 and 4 successfully ranked the vulnerable function within the Top 5 results 100% of the time. 
    • The time and cost were exceedingly low, so we consider this a success, with the caveat that using a more capable model is essential.
  • The authorization bypass vulnerability was perhaps the best real-world test case, as the patch diff had 1.4k+ changed functions (too many for a human analyst to review quickly) and did not have a clearly scoped bug class (an analyst couldn’t just look for memory reads and writes, for example). 
    • Claude Sonnet 3.7 performed quite well considering these challenges, but came with a high cost, averaging $35 per test.
  • The stack buffer overflow vulnerability presented two unique challenges: it had an unusually terse advisory that included little relevant information for identifying the bug, and the majority of the 708 changes included in the patch diff indicated the insertion of stack canaries. 
    • All three LLM models struggled, just as a human analyst would have. Initially, every test run failed to identify the vulnerable function, so we adjusted the prompt to ignore any changes related to stack canaries (one way to do this is sketched after this list). Even then, only Claude Haiku 3.5 ranked the vulnerable function within the Top 25 results (in seven of nine runs); the other models still failed to identify it.
    • This vulnerability was an outlier among our test cases and highlighted the importance of continuing research to refine our methodology.
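
The exact prompt adjustment isn't reproduced here; the sketch below shows one plausible way to phrase it, along with an optional pre-filter heuristic of our own (not part of the original workflow) that drops functions whose only added lines reference stack canary symbols.

```python
# Illustrative wording only; the adjusted prompt used in our testing differs.
CANARY_GUARD = (
    "Ignore changes that merely insert stack-protection (stack canary) checks, "
    "such as new references to __stack_chk_fail or __stack_chk_guard, and rank "
    "functions by substantive logic changes relevant to the advisory."
)

def with_canary_guard(ranking_prompt: str) -> str:
    """Append the stack-canary instruction to the ranking prompt."""
    return ranking_prompt + "\n\n" + CANARY_GUARD

# Optional pre-filter (our own heuristic, not part of the original workflow):
CANARY_SYMBOLS = ("__stack_chk_fail", "__stack_chk_guard")

def is_canary_only_change(old_mlil: str, new_mlil: str) -> bool:
    """True if every added line in the patched function references a canary symbol."""
    added = set(new_mlil.splitlines()) - set(old_mlil.splitlines())
    return bool(added) and all(
        any(sym in line for sym in CANARY_SYMBOLS) for line in added
    )
```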

Key Takeaways

Overall, benchmarking these four real-world vulnerabilities with our LLM-powered workflow showed that most patch differential analyses stand to benefit from a first pass by artificial intelligence. This type of analysis is typically performed under extremely time-sensitive conditions, so if an LLM can guide analysts toward vulnerable code faster than usual, without incurring excessive costs, then it’s worth doing.

Perhaps unsurprisingly, our research indicated that more powerful models tend to perform better but can increase analysis time. Balancing cost and speed remains an important consideration. We also found that LLMs failed to produce helpful results in some cases, especially when minimal information about a vulnerability was available. In the end, though, our methodology proved largely successful, and further refinement of the workflow is warranted.

The primary takeaway from this experiment was that LLMs have great potential for improving vulnerability research, especially when used in a targeted (rather than holistic) approach. We at Bishop Fox have already begun incorporating these LLM-assisted techniques into our standard research methodology and continue experimenting with both targeted and holistic approaches.


About the author, Jon Williams

Senior Security Engineer

As a researcher for the Bishop Fox Capability Development team, Jon spends his time hunting for vulnerabilities and writing exploits for software on our customers' attack surface. Jon has written and presented research on various topics including enterprise wireless network attacks, bypassing network access controls, and reverse-engineering edge security device firmware.
