
TL;DR: LLMs demonstrate significant potential to streamline vulnerability research by automating patch diff analysis. Our structured experiments across high-impact CVEs show that targeted LLM applications can reduce analysis time while maintaining accuracy, with Claude Sonnet 3.7 emerging as the best balance of performance and cost efficiency.
Vulnerability researchers spend countless hours performing patch diffing, the meticulous process of comparing software versions to identify security fixes and understand attack vectors. This becomes particularly challenging when security advisories provide incomplete information, forcing analysts to manually decompile binaries and sift through thousands of code changes to locate the actual vulnerability.
Traditional patch diffing workflows involve decompiling affected and patched binaries, generating differential reports, and manually analyzing each function change. For complex applications, this process can consume days or weeks of expert time, creating bottlenecks in vulnerability research pipelines.
This blog details our recent research into how LLMs can assist in scaling patch diffing workflows, saving valuable time in a crucial race against attackers. For full details on methodology and results, download our LLM-Assisted Vulnerability Research Guide.
Experimental Framework
Our research employed a structured methodology targeting four high-impact vulnerabilities with CVSS scores of 9.4 or higher within the following vulnerability classes: information disclosure, format string injection, authorization bypass, and stack buffer overflow.
For each test case, we used traditional automation to decompile the affected and patched binaries (using Binary Ninja), generate a differential report (using BinDiff), and extract the decompiled medium-level intermediate language (MLIL) code for the changed functions only.
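As a rough illustration of this extraction step, the Python sketch below uses the Binary Ninja API to dump MLIL for a set of changed functions. It is not our exact pipeline: it assumes the changed-function addresses have already been pulled from the BinDiff report, and API details may vary slightly between Binary Ninja versions.

```python
# Minimal sketch (not our exact pipeline): dump Binary Ninja MLIL for only the
# functions that BinDiff flagged as changed. The changed_addrs list is assumed
# to have been extracted from the BinDiff results separately.
import binaryninja

def dump_changed_mlil(binary_path, changed_addrs):
    """Return {address: MLIL text} for the changed functions only."""
    bv = binaryninja.load(binary_path)    # headless analysis of the binary
    bv.update_analysis_and_wait()
    out = {}
    for addr in changed_addrs:
        func = bv.get_function_at(addr)
        if func is None:
            continue
        try:
            mlil = func.mlil              # medium-level IL for this function
        except Exception:
            continue                      # IL unavailable; skip it
        out[addr] = "\n".join(str(instr) for instr in mlil.instructions)
    return out

# Example (hypothetical path and addresses):
# changed = dump_changed_mlil("app_patched.exe", [0x401000, 0x40235c])
```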
We gave the LLMs two prompts.
- Prompt 1: Included the decompiled functions and asked the LLM to suggest a name for each function, summarize the function’s purpose, and summarize the changes made to the function.
- Prompt 2: Included the text of the vendor’s security advisory along with the output from the first prompt and asked the LLM to rank the functions in order of their relevance to the advisory. This ranking was not a one-shot prompt but followed the iterative methodology used by our open-source raink tool. (A rough sketch of how these prompts might be assembled follows this list.)
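To make this concrete, here is a minimal sketch of how the two prompts might be assembled. The wording, function names, and data structures below are illustrative assumptions rather than our exact prompts, and in practice the ranking prompt is issued iteratively over shuffled batches of candidates (as raink does) rather than once over the full list.

```python
# Illustrative prompt construction only; the exact wording used in our tests differs.
def build_prompt_1(current_name, mlil_text):
    return (
        "You are analyzing a patch diff. For the decompiled function below "
        f"(currently named {current_name}):\n"
        "1. Suggest a descriptive name.\n"
        "2. Summarize the function's purpose.\n"
        "3. Summarize the changes made to this function.\n\n"
        "Decompiled code:\n" + mlil_text
    )

def build_prompt_2(advisory_text, function_summaries):
    # function_summaries: outputs from prompt 1, one string per changed function.
    numbered = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(function_summaries))
    return (
        "Given the following vendor security advisory, rank these functions "
        "from most to least relevant to the described vulnerability.\n\n"
        f"Advisory:\n{advisory_text}\n\nFunctions:\n{numbered}"
    )
```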
We then reviewed the results from each test to determine the placement of the known vulnerable function. In cases where multiple functions were associated with the vulnerability, we used whichever one had the best placement.
To improve statistical reliability, we ran each test case multiple times across three LLM models (Claude Haiku 3.5, Claude Sonnet 3.7, and Claude Sonnet 4).
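The scoring step can be summarized with a small helper like the one below (hypothetical names): for each run it records the best rank achieved by any known vulnerable function, then tallies Top 5 and Top 25 hits across runs.

```python
# Sketch of the evaluation logic: best placement per run, then Top 5 / Top 25 counts.
def best_placement(ranked_funcs, known_vulnerable):
    """ranked_funcs: identifiers in ranked order; returns the best 1-based rank or None."""
    ranks = [i + 1 for i, f in enumerate(ranked_funcs) if f in known_vulnerable]
    return min(ranks) if ranks else None

def summarize_runs(runs, known_vulnerable):
    placements = [best_placement(r, known_vulnerable) for r in runs]
    top5 = sum(1 for p in placements if p is not None and p <= 5)
    top25 = sum(1 for p in placements if p is not None and p <= 25)
    return {"runs": len(runs), "top5": top5, "top25": top25}
```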
Performance Results
Two thirds (66%) of our model/test-case combinations produced a known vulnerable function within the Top 25 ranked results, suggesting that the overall approach holds great promise.
| Test case | Model | Top 5 | Top 25 | Avg. time | Avg. cost |
| --- | --- | --- | --- | --- | --- |
| Info disclosure (27 changed functions, 2 known vulnerable) | Claude Haiku 3.5 | 10/10 (100%) | 10/10 (100%) | 23 mins | $0.72 |
| | Claude Sonnet 3.7 | 10/10 (100%) | 10/10 (100%) | 8 mins | $2.71 |
| | Claude Sonnet 4 | 10/10 (100%) | 10/10 (100%) | 39 mins | $2.70 |
| Format string (134 changed functions, 1 known vulnerable) | Claude Haiku 3.5 | 0/10 (0%) | 0/10 (0%) | 5 mins | $0.43 |
| | Claude Sonnet 3.7 | 10/10 (100%) | 10/10 (100%) | 6 mins | $1.71 |
| | Claude Sonnet 4 | 10/10 (100%) | 10/10 (100%) | 7 mins | $1.66 |
| Auth bypass (1,424 changed functions, 2 known vulnerable) | Claude Haiku 3.5 | 7/9 (78%) | 8/9 (89%) | 39 mins | $9.26 |
| | Claude Sonnet 3.7 | 10/10 (100%) | 10/10 (100%) | 42 mins | $35.52 |
| | Claude Sonnet 4 | 0/10 (0%) | 0/10 (0%) | 412 mins | $34.71 |
| Stack overflow (708 changed functions, 1 known vulnerable) | Claude Haiku 3.5 | 0/9 (0%) | 7/9 (78%) | 22 mins | $5.77 |
| | Claude Sonnet 3.7 | 0/10 (0%) | 0/10 (0%) | 28 mins | $22.30 |
| | Claude Sonnet 4 | 0/10 (0%) | 0/10 (0%) | 88 mins | $21.79 |
Figure 1: Summary of results by test case and LLM model
Vulnerability Analysis
To better understand where this approach excels and where it falters, let’s take a closer look at each test case individually.
- The information disclosure vulnerability was perhaps the simplest test for the LLM workflow, as the patch diff only contained 27 changed functions in total.
- As expected, all three LLM models correctly identified the vulnerable functions and ranked at least one of them within the Top 5 results in every single test run.
- This is the gold standard we hope to attain with every patch analysis.
- The format string vulnerability had 134 changed functions in the patch diff, which presented slightly more of a challenge for analysis.
- While Claude Haiku 3.5 failed to identify the known vulnerable function in every run, Claude Sonnet 3.7 and Claude Sonnet 4 ranked it within the Top 5 results 100% of the time.
- Time and cost were exceedingly low across the board, so we consider this a success, with the caveat that a more capable model is essential.
- The authorization bypass vulnerability was perhaps the best real-world test case, as the patch diff had 1.4k+ changed functions (too many for a human analyst to review quickly) and did not have a clearly scoped bug class (an analyst couldn’t just look for memory reads and writes, for example).
- Claude Sonnet 3.7 performed quite well considering these challenges, but came with a high cost, averaging $35 per test.
- The stack buffer overflow vulnerability presented two unique challenges: it had an unusually terse advisory that included little relevant information for identifying the bug, and the majority of the 708 changes included in the patch diff indicated the insertion of stack canaries.
- All three LLM models struggled, just as a human analyst would have. Initially, every test run failed to identify the vulnerable function, so we adjusted the prompt to ignore any changes related to stack canaries (see the sketch after this list). Even then, only Claude Haiku 3.5 ranked the vulnerable function within the Top 25 results (in seven of nine runs); the other models still failed to identify it.
- This vulnerability was an outlier among our test cases and highlighted the importance of continuing research to refine our methodology.
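For illustration, that prompt adjustment could look something like the sketch below; the exact wording we used is not reproduced here, and the pre-filter heuristic is a deliberately naive assumption rather than part of our methodology.

```python
# Hypothetical example of de-emphasizing stack-canary noise: an extra ranking
# instruction plus a crude pre-filter over the prompt 1 change summaries.
CANARY_NOTE = (
    "Ignore changes that only add stack-smashing protection "
    "(stack canaries / __stack_chk_fail checks); they are compiler hardening, "
    "not the vulnerability fix."
)

def looks_canary_only(change_summary: str) -> bool:
    # Naive heuristic for demonstration; real changes may mix canaries with fixes.
    s = change_summary.lower()
    return "__stack_chk_fail" in s or "stack canary" in s

def filter_candidates(summaries: dict) -> dict:
    """summaries: {function_name: change summary produced by prompt 1}."""
    return {name: s for name, s in summaries.items() if not looks_canary_only(s)}
```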
Key Takeaways
Overall, benchmarking these four real-world vulnerabilities with our LLM-powered workflow showed that most patch differential analyses stand to benefit from a first pass by artificial intelligence. This type of analysis is typically performed under extremely time-sensitive conditions, so if an LLM can guide analysts toward vulnerable code faster than usual, without incurring excessive costs, then it’s worth doing.
Perhaps unsurprisingly, our research indicated that more powerful models tend to perform better but can increase analysis time. Balancing cost and speed remains an important consideration. We also found that the LLMs can fail to produce helpful results in some cases, especially when minimal information about a vulnerability is available. In the end, though, our methodology proved largely successful, and further refinement of the workflow is warranted.
The primary takeaway from this experiment was that LLMs have great potential for improving vulnerability research, especially when used in a targeted (rather than holistic) approach. We at Bishop Fox have already begun incorporating these LLM-assisted techniques into our standard research methodology and continue experimenting with both targeted and holistic approaches.