Code Security

Secure by Reward: Teaching AI to Write Better Code

Coding agents have changed how software gets built. They haven’t changed what makes software secure.

Frontier models are remarkably capable code generators. They complete functions, suggest implementations, and produce working output at a pace no human developer can match. But capable and secure are different properties. The benchmarks that shaped these models (HumanEval, SWE-bench, and their successors) measure functional correctness: whether the code runs and produces the right output. They don’t measure whether the code is safe to ship.

The consequences of that optimization choice are starting to surface. Recently published research by Stanford’s SecureForge project [1] found that frontier models produce statically verifiable security vulnerabilities approximately 23% of the time, even when explicitly prompted to write secure, production-ready code. The more troubling finding sits underneath that number: 12.7% of the time, the output was both vulnerable and passed unit tests. The code looked correct. It behaved correctly. It just had a security flaw that would survive every functional review and ship quietly into production.

This isn’t a model quality problem in the way most people frame it. The models are doing what they were trained to do. The training objective never included security as a measurable outcome. Fixing that requires either changing what the model is optimized for, or changing what it’s told at inference time. Those are the two paths we’ll explore further.

Two Paths to Closing the Gap

The options for improving secure code generation fall into two categories: prompt optimization and fine-tuning.

Prompt optimization doesn’t touch model weights. It changes what the model is told at inference time: the system prompt, the framing, the context it receives before a coding task begins. It’s faster to implement, doesn’t require access to model internals, and is reversible if it produces unintended side effects. The ceiling is real, though: you can only steer a model toward behaviors it’s already capable of. Prompt optimization works with what the model already knows how to do; it doesn’t create new capability.

Fine-tuning goes deeper. It updates the model’s weights using new training data, changing the underlying behavior rather than just the framing. The results are more durable and more significant, but the investment is proportionally larger. It requires weight access, MLOps infrastructure, careful training data design, and ongoing maintenance. For large organizations with the right conditions, the ceiling is substantially higher.

Real-World Note: Reward signals are one mechanism for giving a model feedback on the security of its output, and the one this piece focuses on. Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Retrieval-Augmented Generation (RAG) are all viable alternatives, each with different tradeoffs around data requirements, implementation complexity, and how deep the behavior change runs.

Static Analysis as a Reward Signal

Static analysis tools are a natural fit for reward signal design. They’re deterministic. They map findings to an established taxonomy, specifically the MITRE Common Weakness Enumeration (CWE) framework. They require no human annotation to produce a pass or fail signal. And they scale.

The core idea is straightforward: run the analyzer against the model’s output, and use the result, finding or no finding, vulnerable or clean, as the feedback signal. A model that produces fewer findings over time is moving in the right direction. A model that produces more is not. The signal is objective, repeatable, and grounded in decades of security research rather than subjective human judgment about what good code looks like.
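As a minimal sketch of that loop, assuming the Semgrep CLI is installed and on the PATH: the function names `run_semgrep` and `binary_reward` are illustrative, and the binary pass/fail reward is only the simplest reading of "finding or no finding", not a recommended production design.

```python
import json
import subprocess


def run_semgrep(path: str, config: str = "auto") -> dict:
    """Run the Semgrep CLI against `path` and return its parsed JSON report."""
    proc = subprocess.run(
        ["semgrep", "scan", "--config", config, "--json", "--quiet", path],
        capture_output=True,
        text=True,
        check=False,  # a non-zero exit is treated as data, not as a crash
    )
    # Raises json.JSONDecodeError if Semgrep produced no JSON at all.
    return json.loads(proc.stdout)


def binary_reward(report: dict) -> float:
    """Return 1.0 for a scan with no findings and 0.0 otherwise."""
    return 0.0 if report.get("results") else 1.0
```

One caveat with this sketch: a scan that fails to parse the file also reports no results, so a real pipeline would likely inspect the report's `errors` list before treating an empty result as clean.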

Recent research has demonstrated that this approach works in practice. Using static analysis as the primary feedback mechanism, optimized system prompts have produced meaningful reductions in vulnerability rates across frontier models, without any changes to model weights, and with improvements that transferred to real-world coding tasks outside the original test distribution.

Static analysis as a reward signal works. The question is what it misses.

The Limits of a Single Signal

To create our first signal, we can use Semgrep’s community edition. Semgrep is a widely deployed static analysis tool, and its ruleset is well maintained and mapped to CWEs. It’s also bounded. It catches what its rules are written to catch: known patterns, known vulnerability classes, known code structures that correlate with known weaknesses. That’s a meaningful coverage area. It’s not the whole surface.

Pattern-matching static analysis doesn’t catch semantic logic flaws that are structurally novel. It doesn’t evaluate runtime behavior. It doesn’t surface business logic abuse. These aren’t edge cases; they represent significant categories of real-world exploitable vulnerabilities.

More critically, using a single tool as the reward signal invites reward hacking. A model being optimized against Semgrep’s ruleset learns to produce outputs that satisfy Semgrep. That’s not the same as learning to produce secure code. A sufficiently capable model can restructure vulnerable logic into a form that doesn’t match the pattern the rule was written to detect. The vulnerability remains. The scanner approves it. The reward signal is satisfied.
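A toy illustration of the mechanism follows. The regex `TOY_RULE` is a hypothetical stand-in for a pattern rule, not a real Semgrep rule (actual rules, especially taint-tracking ones, are considerably more robust). Both snippets are equally injectable; only one matches the pattern.

```python
import re
import sqlite3

# Hypothetical toy rule: flag an f-string passed directly to execute().
TOY_RULE = re.compile(r"\.execute\(\s*f[\"']")

FLAGGED = '''
def find_user(conn, name):
    return conn.execute(f"SELECT name FROM users WHERE name = '{name}'").fetchall()
'''

# Same vulnerability, restructured so the query is built on a separate line.
EVADES = '''
def find_user(conn, name):
    query = "SELECT name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()
'''


def toy_scan(source: str) -> bool:
    """Return True if the toy rule matches the source text."""
    return bool(TOY_RULE.search(source))


def is_injectable(source: str) -> bool:
    """Run the snippet and check whether a classic payload returns every row."""
    namespace = {}
    exec(source, namespace)  # acceptable here only because the snippets are fixed
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?)", [("alice",), ("bob",)])
    rows = namespace["find_user"](conn, "nobody' OR '1'='1")
    return len(rows) == 2
```

Here `toy_scan(FLAGGED)` is true and `toy_scan(EVADES)` is false, while `is_injectable` is true for both. A reward built only on the scan would prefer the second snippet without it being any safer.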

This isn’t a design flaw; it’s the natural ceiling of any single-signal reward system. Passing the scanner and being secure are related but distinct properties. Conflating them in a training or optimization objective produces models that are better at the former without reliably improving at the latter.

Multi-Tool Reward Learning

The response to single-signal limitations is an ensemble: multiple tools, each covering different parts of the vulnerability surface, combined into a reward signal that’s harder to game and more representative of actual security posture.

Semgrep covers the static code analysis layer: pattern-based CWE detection that is fast, configurable, and strong on code-level weaknesses. Opengrep is the community-governed fork of Semgrep, maintaining independent rule evolution and governance. In practice, running both surfaces ruleset divergence and provides a consistency check across findings. Adding a third scan engine to the mix, like Snyk, creates another independent evaluation layer based on a different ruleset and governance model.

The combination of these tools produces a richer set of signals. A prompt that generates Semgrep-clean but Snyk-flagged code is a different training signal than a prompt that fails both. A prompt that passes all three tools is a stronger positive example than one that passes only one. The label taxonomy that emerges (clean, static-vulnerable, dependency-vulnerable, or both) gives a fine-tuning dataset meaningful structure that a single-tool label set can’t provide.
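The four-way taxonomy can be sketched as a small labeling function. The names `Label` and `label_sample` are illustrative, and the sketch assumes findings have already been split into static-analysis and dependency counts upstream.

```python
from enum import Enum


class Label(Enum):
    CLEAN = "clean"
    STATIC_VULNERABLE = "static-vulnerable"
    DEPENDENCY_VULNERABLE = "dependency-vulnerable"
    BOTH = "both"


def label_sample(static_findings: int, dependency_findings: int) -> Label:
    """Map per-category finding counts onto the four-way label taxonomy."""
    if static_findings and dependency_findings:
        return Label.BOTH
    if static_findings:
        return Label.STATIC_VULNERABLE
    if dependency_findings:
        return Label.DEPENDENCY_VULNERABLE
    return Label.CLEAN
```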

Ensemble reward design requires care. If the reward penalizes any finding from any tool, the optimization pressure pushes toward code that avoids the entire flagged surface, producing outputs that are conservative to the point of being unhelpfully minimal. Signal weighting, severity thresholds, and consensus requirements all need deliberate design. The goal is a reward that reflects genuine security improvement, not one that the model can satisfy by generating less code.
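One possible shape for such a reward is sketched below. Everything numeric here is an assumption for illustration: the severity weights, the half-weight discount for findings reported by fewer tools than the consensus requirement, and the use of a functional-test gate so that empty or broken output cannot score well.

```python
from dataclasses import dataclass

SEVERITY_ORDER = ["low", "medium", "high"]
SEVERITY_WEIGHT = {"low": 0.1, "medium": 0.4, "high": 1.0}  # hypothetical weights


@dataclass(frozen=True)
class Finding:
    tool: str
    cwe: str
    severity: str  # one of SEVERITY_ORDER


def ensemble_reward(findings, tests_passed, min_severity="medium", consensus=2):
    """Combine multi-tool findings into a reward in the range [0, 1]."""
    if not tests_passed:
        return 0.0  # generating less (or non-working) code earns nothing
    floor = SEVERITY_ORDER.index(min_severity)
    kept = [f for f in findings if SEVERITY_ORDER.index(f.severity) >= floor]

    by_cwe = {}
    for f in kept:
        by_cwe.setdefault(f.cwe, []).append(f)

    penalty = 0.0
    for group in by_cwe.values():
        tools = {f.tool for f in group}
        worst = max(SEVERITY_WEIGHT[f.severity] for f in group)
        # Findings confirmed by enough tools count fully; others count half.
        penalty += worst * (1.0 if len(tools) >= consensus else 0.5)
    return max(0.0, 1.0 - penalty)
```

With these placeholder weights, a single medium finding from one tool yields 0.8, the same finding confirmed by two tools yields 0.6, and low-severity findings below the threshold leave the reward at 1.0.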

Real-World Note: Combining output across multiple static analysis tools is not a straightforward data pipeline task. Different tools use different severity schemas, different finding formats, and different rule naming conventions. Before any of this is usable as a reward signal or training corpus, it needs a normalization layer that reconciles those differences into a consistent representation. That work is significant and often underestimated. It isn’t covered in depth here, but it shouldn’t be treated as an implementation detail; in practice, it’s frequently the longest part of the project.
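To give a sense of what the normalization layer involves, here is a deliberately small sketch that maps two input shapes, Semgrep's JSON report and a SARIF report (a format many scanners can export), onto one record type. The `NormalizedFinding` schema and both severity mappings are assumptions; real tools and versions use additional severity values and fields.

```python
from dataclasses import dataclass

# Assumed mappings; real tools and versions differ.
SEMGREP_SEVERITY = {"INFO": "low", "WARNING": "medium", "ERROR": "high"}
SARIF_LEVEL = {"note": "low", "warning": "medium", "error": "high"}


@dataclass(frozen=True)
class NormalizedFinding:
    tool: str
    rule_id: str
    severity: str
    path: str
    line: int


def from_semgrep(report: dict) -> list[NormalizedFinding]:
    """Convert a Semgrep JSON report into normalized findings."""
    return [
        NormalizedFinding(
            tool="semgrep",
            rule_id=r["check_id"],
            severity=SEMGREP_SEVERITY.get(r["extra"]["severity"], "medium"),
            path=r["path"],
            line=r["start"]["line"],
        )
        for r in report.get("results", [])
    ]


def from_sarif(report: dict) -> list[NormalizedFinding]:
    """Convert a SARIF report into normalized findings."""
    out = []
    for run in report.get("runs", []):
        tool = run["tool"]["driver"]["name"].lower()
        for r in run.get("results", []):
            loc = r["locations"][0]["physicalLocation"]
            out.append(NormalizedFinding(
                tool=tool,
                rule_id=r["ruleId"],
                severity=SARIF_LEVEL.get(r.get("level", "warning"), "medium"),
                path=loc["artifactLocation"]["uri"],
                line=loc["region"]["startLine"],
            ))
    return out
```

Even this sketch leaves the hard parts open: deduplicating the same weakness reported under different rule names, and reconciling severity scales that do not line up one to one.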

From Reward Signal to Fine-Tuning Datasets

This is where the long-term bet begins, and where the greatest benefits can be realized. For large organizations with mature AppSec programs and significant codebase scale, the value compounds over time. The goal isn’t a one-time model improvement; it’s a progressively more accurate model of what secure code looks like in your environment specifically. That framing matters because it sets the right expectations for the effort involved and the timeline for seeing returns.

Co-ownership between security and engineering is a structural prerequisite, not a project management preference. Security brings signal quality: the domain expertise to evaluate findings, the judgment to distinguish true positives from noise, and the understanding of which vulnerability classes matter most in context. Engineering brings infrastructure: the MLOps pipeline, the model serving environment, and the integration work that puts a fine-tuned model in front of developers. Neither function can build this alone.

For the CTO or CIO driving AI adoption across the organization, this is the investment that makes that adoption defensible. Developer productivity gains built on a model that generates vulnerable code at scale aren’t gains; they’re deferred liability.

The tooling available to support this has matured significantly. Platforms like Mistral Forge and similar fine-tuning services have substantially reduced the MLOps burden. Infrastructure that previously required dedicated ML engineering capacity is increasingly accessible. That doesn’t eliminate the co-ownership requirement, but it shifts the conversation from “how do we build the training infrastructure” to “how do we build a dataset worth training on.”

The training data itself is built in stages.

Collect

Instrument existing SAST and SCA pipelines to capture findings with full code context, not just line numbers and severity scores. The surrounding code, the file, the service, the framework, and the remediation outcome all matter. Raw findings stripped of context make poor training data. The finding tells you what went wrong. The context tells you why, and what the correct version looks like.
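A sketch of what "full code context" might mean as a record. The `FindingRecord` fields are an assumed schema derived from the list above, and `context_window` is a hypothetical helper that keeps a few lines on either side of the flagged line.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FindingRecord:
    rule_id: str
    path: str
    service: str
    framework: str
    line: int
    context: str                       # surrounding code, not just the line
    remediated: Optional[str] = None   # filled in once a fix is merged


def context_window(source: str, line: int, radius: int = 3) -> str:
    """Return the 1-indexed `line` plus up to `radius` lines on each side."""
    lines = source.splitlines()
    start = max(0, line - 1 - radius)
    end = min(len(lines), line + radius)
    return "\n".join(lines[start:end])
```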

Store

A vector database serves as the persistence layer. Findings, code embeddings, tool attribution, remediation status, and true/false positive labels accumulate over time into a corpus that reflects real organizational patterns. The vector representation enables retrieval by semantic similarity, surfacing related findings and remediation examples that inform both training and, eventually, inference-time context.
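The retrieval idea can be shown with an in-memory stand-in. This is not a vector database, and `toy_embed` is a bag-of-tokens placeholder rather than a learned code embedding; a real deployment would substitute both. The sketch only illustrates storing findings with metadata and retrieving by cosine similarity.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Placeholder embedding: token counts (a real system would use a model)."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class FindingStore:
    """In-memory stand-in for the vector database described above."""

    def __init__(self):
        self._rows = []

    def add(self, snippet: str, metadata: dict) -> None:
        self._rows.append((toy_embed(snippet), metadata))

    def search(self, query: str, k: int = 3):
        """Return the k most similar entries as (score, metadata) pairs."""
        q = toy_embed(query)
        scored = [(cosine(q, vec), meta) for vec, meta in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:k]
```

Usage is a matter of calling `add` with a snippet plus its labels (CWE, tool, remediation status, true/false positive), then `search` with new code to pull back the closest prior findings.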

Threshold

Define the quality bar before triggering any fine-tuning run. Volume matters, but it’s not the primary variable. Diversity across CWE classes, confirmed true positive ratios, and label consistency are more important than raw finding count. A million noisy findings with inconsistent labels are worse training data than fifty thousand clean, well-attributed ones. The threshold gate is where corpus discipline gets enforced, and where the human review that no automated pipeline fully replaces earns its place.
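A threshold gate along these lines might look like the following sketch. The `LabeledFinding` fields and the four threshold parameters are assumptions, and the actual values are deliberately left to the caller, since the right bar depends on the organization. Label consistency is approximated here as "the same snippet and CWE never carries conflicting true/false positive labels".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledFinding:
    cwe: str
    snippet_hash: str
    is_true_positive: bool
    human_confirmed: bool


def passes_quality_gate(rows, min_rows, min_cwe_classes,
                        min_confirmed_ratio, max_conflict_ratio) -> bool:
    """Decide whether the corpus is ready for a fine-tuning run."""
    if not rows:
        return False
    cwe_classes = {r.cwe for r in rows}
    confirmed_ratio = sum(r.human_confirmed for r in rows) / len(rows)

    # Label consistency: identical snippet + CWE should not disagree on the label.
    labels = {}
    for r in rows:
        labels.setdefault((r.snippet_hash, r.cwe), set()).add(r.is_true_positive)
    conflicts = sum(1 for seen in labels.values() if len(seen) > 1)
    conflict_ratio = conflicts / len(labels)

    return (
        len(rows) >= min_rows
        and len(cwe_classes) >= min_cwe_classes
        and confirmed_ratio >= min_confirmed_ratio
        and conflict_ratio <= max_conflict_ratio
    )
```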

Fine-Tune

Use the labeled training data to run a targeted fine-tuning pass on a code-generation Small Language Model (SLM). The model learns from real organizational context and real remediation history, not synthetic CWE scenarios constructed from MITRE examples. The resulting model has encountered your frameworks, your patterns, and your actual security decisions.
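How labeled findings become training examples depends on the fine-tuning service, each of which defines its own input schema. As a hedged illustration only, the sketch below emits generic prompt/completion pairs as JSON Lines, keeping just confirmed true positives that have a remediation. The dictionary keys and the prompt wording are hypothetical.

```python
import json


def to_training_record(finding: dict) -> dict:
    """Turn one confirmed, remediated finding into a prompt/completion pair."""
    prompt = (
        f"Rewrite this {finding['framework']} code to remove the "
        f"{finding['cwe']} weakness:\n{finding['vulnerable_code']}"
    )
    return {"prompt": prompt, "completion": finding["remediated_code"]}


def to_jsonl(findings) -> str:
    """Serialize eligible findings as JSON Lines, one record per line."""
    lines = []
    for f in findings:
        # Skip false positives and findings that were never remediated.
        if not (f.get("is_true_positive") and f.get("remediated_code")):
            continue
        lines.append(json.dumps(to_training_record(f)))
    return "\n".join(lines)
```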

Swap

Replace the generic frontier model in the IDE or terminal with the fine-tuned SLM for code generation tasks. The frontier model doesn’t get retired; it moves to a different role. Reasoning, planning, and code review tasks, where broad capability matters more than domain-specific calibration, remain appropriate frontier model territory. The SLM handles generation, where security behavior and false positive calibration to your codebase matter most. What closes that gap is domain-specific training data combined with false positive calibration to your environment, not raw capability.
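The division of labor amounts to a routing rule. In this sketch the task categories come from the paragraph above, while the model identifiers are placeholders, not real model names.

```python
from enum import Enum


class Task(Enum):
    GENERATE = "generate"
    REVIEW = "review"
    PLAN = "plan"
    REASON = "reason"


# Placeholder identifiers for whatever models are actually deployed.
FINE_TUNED_SLM = "fine-tuned-slm"
FRONTIER_MODEL = "frontier-model"


def pick_model(task: Task) -> str:
    """Send code generation to the fine-tuned SLM and everything else to the frontier model."""
    return FINE_TUNED_SLM if task is Task.GENERATE else FRONTIER_MODEL
```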

At enterprise scale (tens of thousands of files, findings approaching millions), this is where the economics shift. The same dataset that represents a false positive management burden at current tooling maturity becomes the training data that systematically reduces it.

What This Looks Like in Practice

The full system connects in sequence: the AppSec pipeline generates findings across SAST and SCA tools; a normalization layer reconciles the output into a consistent schema; labeled findings with code context flow into a vector database; a quality threshold gate governs when the dataset is ready for a fine-tuning run; the fine-tuned model gets deployed into the developer environment.

The feedback loop is what makes this compound.

As the fine-tuned model generates new code, that code enters the AppSec pipeline like any other. New findings get labeled (confirmed true positives, suppressed false positives, remediated outputs) and flow back into the dataset. Each cycle improves the accuracy of the model’s calibration to the environment it’s operating in.

Continuous re-evaluation is part of the operational model, not an optional enhancement. Codebases evolve. New frameworks get adopted. New vulnerability classes emerge. The fine-tuned model’s advantage against any static snapshot of the dataset erodes over time. Recurring fine-tuning runs, triggered by data growth thresholds rather than calendar schedules, keep the model current.
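A growth-based trigger can be as simple as the sketch below. The 20% default is an arbitrary placeholder, and in practice the check would likely run only after the dataset also passes its quality gate.

```python
def should_retrain(rows_at_last_run: int, rows_now: int,
                   growth_threshold: float = 0.2) -> bool:
    """Trigger a fine-tuning run once the corpus has grown by the given fraction."""
    if rows_at_last_run <= 0:
        return rows_now > 0  # first run: any data at all is new data
    growth = (rows_now - rows_at_last_run) / rows_at_last_run
    return growth >= growth_threshold
```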

Security Consideration: The fine-tuning dataset is, and should be treated as, sensitive data. It maps your organization’s vulnerability patterns, codebase structure, and remediation history in considerable detail. The vector database, the fine-tuning pipeline, and the resulting model artifacts warrant the same access controls and supply chain integrity practices you’d apply to any production AI system. The security of the training pipeline is part of the security posture of what comes out of it.

Acknowledging Limitations

Multi-tool static analysis is a meaningfully better reward signal than single-tool static analysis. However, it is not a complete one.

Runtime behavior falls outside its reach. Prompt injection through code context isn’t something a static analyzer catches. Semantic vulnerabilities also remain in the gap. And the adversarial ceiling is real: a reward signal that’s known can be optimized against. A model sufficiently capable of understanding the reward signal can learn to satisfy it without fully internalizing the security property it’s meant to represent.

The transfer gap also deserves honest acknowledgment. Improvement on static analysis benchmarks doesn’t guarantee improvement on real-world exploitability. The two are correlated, not equivalent. Behavioral testing (red teaming, adversarial evaluation, runtime monitoring) remains necessary alongside whatever the static analysis layer catches.

The evaluation recursion is the hardest limit to address. Determining whether your reward signal is genuinely measuring security improvement requires a ground truth that doesn’t fully exist. Static analysis approximates it. Human expert review approximates it differently. Neither is definitive; each is a high-probability estimate. The dataset quality gate, where human judgment is explicitly applied to finding labels before any training run, is currently the most important control in the system, and the one most likely to be underinvested in.

None of these limits argue against the approach. They argue for building it with clear eyes about what it measures, what it misses, and where the human judgment that no automated pipeline replaces needs to stay in the loop.

[1] https://arxiv.org/pdf/2605.08382