vuln-intel / field notes

Can a CVE corpus make an LLM a better bug hunter?

The seductive idea behind every "AI security brain" is this: a frontier model already knows a lot about vulnerabilities, but a curated corpus of historical bugs (their mechanisms, fix patterns, and cross-product variants) should let it reason better, recognize new bugs, and out-hunt a base model.

We had the corpus to test it: a large body of CVEs with embeddings, mechanism-sibling retrieval, fix-commit links, and a cross-product alias graph. So we tested it, eight different ways, over a couple of weeks, with held-out evaluations and (where possible) contamination control.

The verdict, up front: the corpus produces genuinely expert-grade output, but it delivers only a modest edge over a strong frontier model, and zero uplift on a cheap model. As a reasoning multiplier with a moat, the thesis did not hold. As a grounding, prioritization, and verification layer, it's real and valuable, which is a different, more honest claim.

Here are the eight experiments and what each one showed.

The thesis, stated precisely

There's a weak version of "learn from past bugs" and a strong version, and conflating them is where most hype lives.

The strong version is real, it's how good hunters think, and there's published precedent (Vul-RAG, VulInstruct, vEcho). The question was never whether the idea is sound. It's whether our corpus gives a model an edge it doesn't already have, and especially whether it lifts a cheap model up to a frontier model's level, which is the difference between a product and a parlor trick.

Experiments 1–4: mechanism prediction

We built Vul-RAG-style "pattern cards" (mechanism knowledge distilled from older CVEs as root cause → trigger → fix pattern → detection signature) and tested whether they help a model recognize new instances of the same mechanism.

  1. Flagship mechanism backtest. Temporal holdout on three mechanisms (SSRF, deserialization, prototype-pollution): cards distilled from older CVEs recognized held-out newer CVEs of the same class with strong recall and zero false positives against boilerplate-sharing distractors. Encouraging, but the task (recognize a known mechanism) was friendly.

  2. Raw-code localization. Asked the model to localize the vulnerable code, with and without cards. The cards did not beat a strong frontier model.

  3. Cross-stack class prediction. Predict the bug class likely to appear next in a product, across stacks. Again: real output, no categorical edge over the base model.

  4. Invariant reconstruction. The deepest version: reconstruct the developer's broken assumption from a product's bug history (plus cross-product anchors) and predict where it leaks next. Leave-one-out on three products. Result: a coverage wash overall, but with a sharp, honest signal. On nocodb, the invariant approach genuinely won: it reconstructed "authorization is checked at the route layer, not the data/ACL layer" and "token deletion doesn't revoke issued sessions," then correctly predicted held-out bugs that were novel manifestations of exactly those assumptions (OAuth-scope-at-the-ACL-layer, stale-auth-after-token-deletion). On lollms, it lost: it over-committed to the dominant past pattern (path traversal) and missed a regime shift to access-control bugs.
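
The temporal holdout in experiment 1 can be sketched roughly as below. This is a minimal illustration, not the harness we ran: the `Cve` record, the `temporal_split` and `score` helpers, the cutoff date, and the placeholder IDs are all assumptions made up for the example.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Cve:
    cve_id: str       # placeholder identifiers, not real CVEs
    mechanism: str    # e.g. "ssrf", "deserialization"
    published: date


def temporal_split(cves, cutoff):
    """Older CVEs feed card distillation; newer ones are held out."""
    train = [c for c in cves if c.published < cutoff]
    held_out = [c for c in cves if c.published >= cutoff]
    return train, held_out


def score(predictions, held_out, distractor_ids, mechanism):
    """Recall on held-out CVEs of `mechanism`, plus distractors wrongly flagged."""
    positives = {c.cve_id for c in held_out if c.mechanism == mechanism}
    flagged = {cid for cid, label in predictions.items() if label == mechanism}
    recall = len(flagged & positives) / len(positives) if positives else 0.0
    false_positives = len(flagged & distractor_ids)
    return recall, false_positives


# Toy data, invented for illustration only.
cves = [
    Cve("CVE-A", "ssrf", date(2021, 3, 1)),
    Cve("CVE-B", "ssrf", date(2024, 6, 1)),
    Cve("CVE-C", "ssrf", date(2024, 9, 1)),
]
train, held_out = temporal_split(cves, date(2024, 1, 1))

# Hypothetical model verdicts keyed by item ID.
predictions = {"CVE-B": "ssrf", "CVE-C": "none", "DISTRACTOR-1": "none"}
recall, false_positives = score(predictions, held_out, {"DISTRACTOR-1"}, "ssrf")
print(recall, false_positives)
```

Splitting by publication date rather than at random is the point: a random split would let cards distilled from a later CVE "predict" an earlier one.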

The pattern across 1–4 was consistent: real, expert-grade reasoning; a modest, not categorical, edge over a strong model; clearest exactly on the deep cross-class cases, and prone to over-fitting when it over-commits to "the" assumption.
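
For readers who want the leave-one-out protocol from experiment 4 in concrete form, here is a minimal sketch. It assumes leave-one-out means hiding one bug at a time from a product's history, which is one reading of the setup, and it swaps the model for a deliberately naive baseline (`dominant_class`, hypothetical) that always predicts the most frequent past bug class. The history is invented toy data.

```python
from collections import Counter

# Toy history, invented for illustration; not data from the experiment.
history = [
    {"id": "bug-1", "bug_class": "path-traversal"},
    {"id": "bug-2", "bug_class": "path-traversal"},
    {"id": "bug-3", "bug_class": "path-traversal"},
    {"id": "bug-4", "bug_class": "access-control"},
]


def leave_one_out(bugs):
    """Yield (context, held_out) pairs so that each bug is hidden exactly once."""
    for i, held_out in enumerate(bugs):
        yield bugs[:i] + bugs[i + 1:], held_out


def dominant_class(context):
    """Naive baseline: predict the most frequent bug class in the visible history."""
    return Counter(bug["bug_class"] for bug in context).most_common(1)[0][0]


hits, misses = [], []
for context, held_out in leave_one_out(history):
    bucket = hits if dominant_class(context) == held_out["bug_class"] else misses
    bucket.append(held_out["id"])

print("hits:", hits)
print("misses:", misses)
```

The baseline gets every repeat of the dominant class and misses the one bug that breaks the pattern, which is the over-commitment failure in miniature.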

Experiment 5: cheap-model uplift, the decider

This is the one that mattered. A modest edge on a frontier model is a feature. The product question is: does the framing turn a cheap model into an expert? If Haiku + the corpus's invariants matches Opus alone, that's a business. If not, it's a sharpening that still needs an expensive base model.

Setup: three fresh products and three arms: (1) a cheap model (Haiku) plus invariants and cross-product anchors, (2) the cheap model alone, and (3) a frontier model (Opus) alone. All arms ran on a blind, post-cutoff held-out set and were scored on both recall and precision.
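
Scoring on both axes matters because an arm can hold recall steady while padding its answer with speculative guesses. A minimal sketch of the scoring, assuming predictions and ground truth are sets of bug-class labels; the function names, the arm labels `arm-1` and `arm-2`, and the data are hypothetical and show the mechanics only, not our results.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall over bug-class labels."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall


def score_arms(predictions_by_arm, held_out):
    """Score every arm against the same held-out set."""
    return {
        arm: precision_recall(preds, held_out)
        for arm, preds in predictions_by_arm.items()
    }


# Toy data, invented for illustration only.
held_out = {"authz-bypass", "path-traversal"}
predictions_by_arm = {
    "arm-1": ["authz-bypass", "xss"],
    "arm-2": ["authz-bypass", "xss", "ssrf", "race-condition"],
}

scores = score_arms(predictions_by_arm, held_out)
for arm, (precision, recall) in scores.items():
    print(f"{arm}: precision={precision:.2f} recall={recall:.2f}")
```

Both toy arms find the same true bug, so recall ties; only precision shows that the second arm got there by guessing more widely.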

Result: both halves of the hypothesis failed. The framing didn't make the cheap model match the frontier model (the gap stayed wide), and it didn't even reliably help the cheap model: with the corpus framing it scored slightly worse than the cheap model alone. The mechanism: the "predict deeper / cross-class" framing pushes a model toward ambitious, exotic predictions, and a cheap model can't calibrate the gamble, so it trades away the boring-but-correct answer. Model tier dominated the framing by a wide margin.

Experiment 6: change history as a leading indicator

A new idea, enabled by ingesting NVD's full change history (every time a CVE's CVSS, CWE, or references are added or revised): maybe analysts quietly escalate a CVE (bump its score, add an exploit reference) before it gets KEV-listed, giving an early-warning signal.

We tested it on a sample of recently-exploited CVEs: does the escalation event lead the KEV listing, or lag it?

Falsified. The corpus's edit history is a reactive record of exploitation, not a predictor of it. We'd flagged this exact risk when proposing the idea; the data confirmed the pessimistic branch. Cost to find out: about ten minutes and zero model calls, which is the value of a cheap proof-first probe.
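
The lead-or-lag question reduces to date arithmetic, which is why it needed no model calls. A minimal sketch, assuming each CVE has already been reduced to a pair of dates (first escalation event, KEV listing); the helper names and the dates are invented for illustration and are not the sample we analyzed.

```python
from datetime import date
from statistics import median


def lead_days(escalated_on, kev_listed_on):
    """Positive: escalation preceded the KEV listing (a lead). Negative: it lagged."""
    return (kev_listed_on - escalated_on).days


def summarize(pairs):
    """Summarize lead times over (escalated_on, kev_listed_on) pairs."""
    leads = [lead_days(escalated, listed) for escalated, listed in pairs]
    return {
        "n": len(leads),
        "share_leading": sum(d > 0 for d in leads) / len(leads),
        "median_lead_days": median(leads),
    }


# Toy dates, invented for illustration only.
pairs = [
    (date(2024, 5, 10), date(2024, 5, 1)),   # escalated 9 days after listing
    (date(2024, 7, 3), date(2024, 7, 1)),    # escalated 2 days after listing
    (date(2024, 8, 1), date(2024, 8, 15)),   # escalated 14 days before listing
]

summary = summarize(pairs)
print(summary)
```

A signal worth building on would show most pairs leading and a clearly positive median; a reactive record shows the opposite.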

Experiment 7: vulnerable-vs-patched discrimination

The published Vul-RAG result that's hardest to ignore: base models reportedly score close to random at distinguishing vulnerable code from its patched version, and distilled cards reportedly lift that substantially. We hadn't tested this hard discrimination task, only prediction. So we did, on a small set of contamination-controlled pairs (post-model-cutoff CVEs, real fix commits).

Result, and the lesson is in the misread: the cheap model scored high from the code alone, and the corpus context added essentially nothing. But that high score doesn't contradict the near-random published baseline, because our pairwise-with-the-diff-visible setup is the easy version of the task: the patched side literally shows the added guard, so the model just picks the one missing it. There was no headroom for the corpus to help. The run is therefore inconclusive about Vul-RAG's claim (which uses the hard single-function framing) rather than a clean refutation. A fair test of that claim remains unrun, with a low prior after seven negatives.
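
The difference between the two framings is easiest to see in how the evaluation items are built from the same fix commit. A minimal sketch, with hypothetical helper names and toy snippets; it is not the harness used by us or by the Vul-RAG authors.

```python
import random


def pairwise_item(vulnerable, patched, rng):
    """Easy framing: both versions are shown and the model picks the vulnerable one."""
    options = [("vulnerable", vulnerable), ("patched", patched)]
    rng.shuffle(options)  # randomize position so order carries no signal
    return {
        "snippets": [code for _, code in options],
        "answer": [label for label, _ in options].index("vulnerable"),
    }


def single_function_items(vulnerable, patched):
    """Hard framing: each version is judged alone, with no counterpart to compare."""
    return [
        {"snippet": vulnerable, "is_vulnerable": True},
        {"snippet": patched, "is_vulnerable": False},
    ]


# Toy snippets, invented for illustration only.
vulnerable = "open(base + name)"
patched = "open(safe_join(base, name))"

pair = pairwise_item(vulnerable, patched, random.Random(0))
singles = single_function_items(vulnerable, patched)
print(pair["snippets"][pair["answer"]])
print(len(singles))
```

In the pairwise item the added guard is visible by contrast. In the single-function items the model has to decide whether a guard should have been there at all, which is where any corpus context would have room to help.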

Experiment 8: the honest meta-result

Counting the data analysis and the eight model experiments together, every lever pointed the same way. So we did the most useful experiment of all: stopped, and recorded the verdict so no one re-runs experiment #9.

What it means (for everyone building "AI security brains")

This is not "LLMs are bad at security." Frontier models are good; that's exactly the problem for the corpus-as-moat thesis. The honest takeaways:

FAQ

Does retrieval-augmented generation (RAG) improve LLM vulnerability detection? In our experiments, a CVE corpus gave a modest edge to a strong frontier model and no edge to a cheap model. It reliably improves grounding (cited, accurate facts) but did not measurably improve reasoning over a capable base model. Published positive results (e.g. Vul-RAG) exist but are often evaluated in-distribution; blind, post-cutoff tests are far less generous.

Can an LLM predict a software product's next vulnerability? Partially. It can reconstruct a developer's recurring broken assumption and sometimes predict a novel manifestation of it (we saw this clearly on one product). But it's a modest edge over a strong base model, it over-fits when it over-commits to one pattern, and a cheap model with the same corpus does not match a frontier model alone.

Can an LLM tell vulnerable code from patched code? With both versions visible (a diff), yes, easily: that's just reading the fix. The genuinely hard task is judging a single function in isolation, where base models reportedly score near-random; we did not get a clean test of whether a corpus helps there.

Is a vulnerability "AI brain" a real product? On this evidence: it's a real grounding, prioritization, and verification tool, and a real feature. As a defensible reasoning moat over frontier models, the evidence says be skeptical.


Background: the data behind the prioritization signals is in "What actually predicts exploitation". The practical output, a priority ranking and a recon dig-order, is in "Vulnerability prioritization that works".


For authorized, defensive security research only.