Internalizing System Prompts Into the Model: How Microsoft's OPCD Framework Is Changing LLM Deployment
The Hidden Cost of System Prompts
If you run an LLM-based service, you’re well acquainted with system prompts. These lengthy instructions — covering safety policies, response tone, domain knowledge, and formatting rules — are sent alongside every single request to the model. The problem is that this isn’t free.
Enterprise-level system prompts commonly run to thousands of tokens. When token counts grow, two costs increase simultaneously. First, inference latency. In the Transformer architecture, attention computation scales quadratically with sequence length, so longer prompts noticeably increase time-to-first-token. Second, monetary cost. Since most API pricing is proportional to input tokens, repeatedly sending the same system prompt millions of times a day adds up to a substantial bill.
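To make the quadratic term concrete, here is a back-of-envelope sketch. The token counts are illustrative, not figures from any benchmark:

```python
# Rough intuition for the quadratic term in attention: during prefill,
# each layer builds a seq_len x seq_len attention-score matrix, so that
# part of the work grows with the SQUARE of the input length.

query_only = 500             # tokens: the user query alone (assumed)
with_prompt = 500 + 2_000    # tokens: query plus a 2,000-token system prompt

score_matrix_ratio = (with_prompt / query_only) ** 2
# 5x more input tokens -> 25x more attention-score entries per layer
print(score_matrix_ratio)
```

Note that other prefill components (the feed-forward layers) scale linearly, so end-to-end latency grows less than 25×; the point is that the prompt's contribution is disproportionate, not merely additive.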
As explored in a previous article on LLM serving architecture, prefill-stage computation is directly proportional to input token count. System prompts are, in effect, a fixed overhead added to every request's prefill cost.
So what if you could “bake” these repeated instructions directly into the model weights? OPCD (On-Policy Context Distillation), a framework published by Microsoft Research in February 2026, is precisely the answer to that question[^1].
The Basic Idea of Context Distillation
To understand OPCD, you first need to grasp the concept of context distillation. The principle is based on a teacher-student paradigm:
- Teacher model: Receives the full context including the long system prompt and generates high-quality responses accordingly.
- Student model: Receives only the user query without the system prompt. It learns to reproduce the teacher’s behavior by observing the teacher’s responses.
Once training is complete, the student model can generate responses similar to the teacher’s even without the system prompt. The information that was in the prompt has been internalized into the model parameters.
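The setup can be sketched in a few lines. The two "models" below are hard-coded stand-in distributions over a toy 3-token vocabulary — real context distillation compares full LLM next-token distributions — but the shape of the objective is the same:

```python
import math

def teacher_probs(system_prompt: str, query: str) -> list[float]:
    # Stand-in for the teacher: conditions on the long system prompt AND the query.
    return [0.70, 0.25, 0.05]   # e.g. a safety prompt makes "refuse" most likely

def student_probs(query: str) -> list[float]:
    # Stand-in for the student: sees only the query. Before training it diverges.
    return [0.20, 0.50, 0.30]

def forward_kl(p: list[float], q: list[float]) -> float:
    """KL(p || q): the classic distillation loss, driven toward 0 by training."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

p = teacher_probs("You are a cautious assistant. Refuse unsafe requests.", "some user query")
q = student_probs("some user query")
loss = forward_kl(p, q)   # positive until the student matches the teacher
```

When this loss reaches zero, the student reproduces the teacher's prompt-conditioned behavior without ever seeing the prompt.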
This idea itself wasn’t new. However, traditional context distillation had two fundamental limitations.
Two Flaws of Existing Approaches
Off-Policy Learning and Exposure Bias
Traditional context distillation operated in an off-policy manner. Training data consisted of fixed datasets collected before training began. The student model learned only from “correct” sequences generated by the teacher, which caused exposure bias.
During training, correct tokens were always provided, but in deployment the model must predict the next token based on its own generated tokens. A single wrong token can cause the entire subsequent sequence to cascade into failure. Co-author Tianzhu Ye compared this to “showing someone driving videos and then putting them behind the wheel”[^2].
The Problem with Forward KL Divergence
The second issue lay in the training objective function. Existing methods minimized forward KL divergence, which induces mode-covering behavior — the student tries to “cover” the teacher’s entire distribution.
Since the student model is smaller or operates without context, it lacks the capacity to perfectly replicate the teacher’s complex distribution. Trying to encompass all possibilities anyway causes the predicted distribution to spread too broadly, leading to hallucinations and generalization failures.
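A per-token view makes the pressure visible. Each token contributes `p_teacher * log(p_teacher / p_student)` to KL(teacher ‖ student); if the teacher assigns real mass to a token the student has nearly ruled out, that single term dominates the loss and forces the student to spread probability over it. The distributions below are illustrative:

```python
import math

p_teacher = {"a": 0.55, "b": 0.40, "c": 0.05}
q_student = {"a": 0.949, "b": 0.001, "c": 0.05}  # student has suppressed "b"

def forward_term(tok: str) -> float:
    """One token's contribution to forward KL(teacher || student)."""
    p, q = p_teacher[tok], q_student[tok]
    return p * math.log(p / q)

penalty_b = forward_term("b")  # ~2.4 nats from this single suppressed token
print(penalty_b)
```

The optimizer responds by smearing the student's probability mass across everything the teacher might say — exactly the mode-covering behavior that leads to a too-broad, hallucination-prone distribution.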
OPCD’s Core Design: On-Policy + Reverse KL
OPCD simultaneously solves both problems.
On-Policy Learning
In OPCD, the student model learns from its own generated responses rather than pre-prepared datasets. The specific workflow:
- The student model receives a query without the system prompt and generates a response.
- The teacher model, with the full system prompt context, evaluates the token distribution at each generation step of the student.
- The student’s parameters are updated based on the difference between the student’s and teacher’s token distributions.
The key is that the student directly experiences and corrects its own mistakes. Rather than only seeing correct answers as in off-policy methods, it learns from teacher feedback in situations where it can actually go wrong.
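The loop above can be sketched over a toy 3-token vocabulary. Everything here is a stand-in — the teacher is a fixed made-up distribution playing the role of "teacher with the full system prompt in context," and the student's logits are the only trainable parameters — but the update structure mirrors the described workflow:

```python
import math
import random

def softmax(z: list[float]) -> list[float]:
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def reverse_kl(q: list[float], p: list[float]) -> float:
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

TEACHER = [0.7, 0.2, 0.1]         # toy stand-in for the prompt-conditioned teacher
student_logits = [0.0, 0.0, 0.0]  # student starts uniform
lr = 0.5
random.seed(0)

for _ in range(300):
    q = softmax(student_logits)
    # 1. On-policy rollout: the student samples its OWN next token. (In the
    #    real framework this token extends the prefix the teacher evaluates;
    #    our toy teacher ignores the prefix, so the sample is illustrative.)
    _tok = random.choices(range(3), weights=q)[0]
    # 2. The teacher scores the same step with full context.
    p = TEACHER
    # 3. Descend the reverse KL(q || p). For q = softmax(z) the gradient is
    #    dKL/dz_k = q_k * (log(q_k / p_k) - KL).
    kl = reverse_kl(q, p)
    grad = [qk * (math.log(qk / pk) - kl) for qk, pk in zip(q, p)]
    student_logits = [z - lr * g for z, g in zip(student_logits, grad)]

final = softmax(student_logits)   # converges toward TEACHER
```

The student's distribution ends up matching the teacher's without the "system prompt" ever appearing in its input — the information now lives in `student_logits`, the toy analogue of model weights.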
Reverse KL Divergence
OPCD minimizes reverse KL divergence instead of forward KL. Reverse KL induces mode-seeking behavior: it focuses on regions where the student distribution assigns high probability, while suppressing tokens that the teacher rated highly but the student rated poorly.
As Ye explained: “Minimizing reverse KL encourages mode-seeking behavior that focuses on the student’s high-probability regions. Tokens the student deems unlikely are suppressed even if the teacher assigns high probability to them”[^2].
The combined effect was clear: the student model focused on the most accurate responses within its capability range, and the problem of hallucinations from over-ambitiously mimicking the teacher’s full distribution was greatly reduced.
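The suppression effect is easy to verify per token. Each token contributes `q_student * log(q_student / p_teacher)` to KL(student ‖ teacher), so a token the student has already ruled out contributes almost nothing — even when the teacher likes it. The distributions are illustrative:

```python
import math

p_teacher = {"a": 0.55, "b": 0.40, "c": 0.05}
q_student = {"a": 0.949, "b": 0.001, "c": 0.05}  # student has suppressed "b"

def reverse_term(tok: str) -> float:
    """One token's contribution to reverse KL(student || teacher)."""
    q, p = q_student[tok], p_teacher[tok]
    return q * math.log(q / p)

pressure_b = reverse_term("b")  # ~ -0.006: essentially no push to revive "b"
print(pressure_b)
```

Because the term is weighted by the student's own probability, the loss never pressures the student to chase teacher modes it cannot reach — which is the mechanical source of the mode-seeking behavior.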
Benchmark Results: The Numbers
The OPCD paper reported results across two experimental scenarios[^1].
Empirical Knowledge Distillation
The first experiment verified whether a model could internalize problem-solving strategies accumulated while solving math problems.
| Model | Task | Baseline Accuracy | After OPCD |
|---|---|---|---|
| Llama-3.1-8B | Math reasoning | 75.0% | 80.9% |
| Qwen2.5-1.5B | Frozen Lake game | 6.3% | 38.3% |
The roughly 6× performance improvement (6.3% → 38.3%) in the 1.5B small model was particularly noteworthy.
System Prompt Distillation
The second experiment — the core topic of this article — tested scenarios where safety policies and medical domain prompts were baked into the model.
| Model | Task | Without Prompt | OPCD Internalized |
|---|---|---|---|
| Qwen2.5-3B | Safety/toxicity classification | 30.7% | 83.1% |
| Qwen2.5-3B | Medical QA | 59.4% | 76.3% |
Safety classification accuracy jumping from 30.7% to 83.1% in a 3B model meant that performance without a system prompt approached the level achieved with one.
General Performance Preservation
On the perennial fine-tuning problem of catastrophic forgetting, OPCD also showed favorable results. A model that had internalized safety rules maintained approximately 4 percentage points higher performance on an unrelated medical QA task compared to off-policy approaches. It secured both specialization and general performance simultaneously.
Comparison with Existing Prompt Compression Techniques
OPCD wasn’t the first attempt to solve the cost problem of system prompts. Comparing it with representative existing approaches clarifies OPCD’s position.
LLMLingua: Token-Level Compression
LLMLingua, released by Microsoft in 2023, used small language models (GPT-2, LLaMA-7B, etc.) to remove low-importance tokens from prompts[^3]. It achieved up to 20× compression while minimizing performance degradation. The follow-up, LongLLMLingua (ACL 2024), further improved compression for long-context scenarios.
However, LLMLingua-family approaches require performing compression at inference time for every request. Compression itself demands computation, the original prompt must still exist somewhere, and the compressed prompt’s token count is never zero.
Soft Prompts and Prompt Tuning
Prompt tuning prepends learnable continuous vectors (soft prompts) to the input to steer model behavior. Optimizing in continuous space rather than with discrete tokens achieves similar effects with far fewer parameters. However, soft prompts still need to be added to the input at every inference, and they suffer from poor interpretability.
OPCD’s Differentiator
OPCD takes a fundamentally different approach. Rather than “reducing” the prompt, it eliminates it entirely. System prompt information is directly internalized into model weights, so no additional input is needed at inference time at all. If model compression techniques like quantization and pruning reduce model size, OPCD reduces input size — or more precisely, eliminates it.
| Technique | Approach | Prompt at Inference | Additional Compute |
|---|---|---|---|
| LLMLingua | Token removal | Required (reduced) | Compression model run |
| Prompt tuning | Soft prompt | Required (vectors) | None |
| OPCD | Weight internalization | Not required | None |
Practical Deployment Conditions and Constraints
OPCD’s adoption barrier was relatively low. According to the paper, teams already running RLVR (Reinforcement Learning with Verifiable Rewards) pipelines could apply it without major architectural changes. The implementation was built on the open-source RLVR codebase verl, and Microsoft stated plans to release the code after internal review[^2].
Hardware requirements were approximately 8 A100 GPUs — realistic compared to large-scale pretraining. Data requirements were also light: empirical knowledge distillation needed only about 30 seed examples, and system prompt distillation required just the existing optimized prompt and standard task datasets.
However, OPCD isn’t a silver bullet. Ye noted: “When the needed information is very dynamic, or relates to frequently updated large external databases, RAG is a more appropriate solution”[^2]. OPCD is fundamentally optimized for internalizing static, repetitive instructions — system prompts, safety policies, and domain rules that don’t change.
Position in the LLM Serving Pipeline
Assuming OPCD is deployed alongside a serving engine like vLLM, the benefits manifest through two pathways:
Prefill stage reduction: Eliminating the system prompt shortens the input sequence length by that much. For a service using 2,000 tokens of system prompt, each request saves 2,000 tokens’ worth of prefill computation. KV cache memory usage decreases proportionally.
Throughput increase: With shorter inputs, more requests can be batched in the same GPU memory. In continuous batching environments, this directly translates to higher throughput.
For large-scale services where thousands of tokens of system prompt are repeatedly sent across millions of daily queries, these savings can add up to a meaningful reduction in infrastructure costs.
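A quick back-of-envelope calculation shows the scale. The request volume and per-token price below are illustrative assumptions, not figures from the paper:

```python
# Savings from deleting a 2,000-token system prompt from every request.
prompt_tokens = 2_000
requests_per_day = 1_000_000              # assumed traffic
usd_per_million_input_tokens = 2.50       # hypothetical API price

tokens_saved_per_day = prompt_tokens * requests_per_day   # 2 billion tokens/day
usd_saved_per_day = tokens_saved_per_day / 1e6 * usd_per_million_input_tokens
print(usd_saved_per_day)   # $5,000/day, ~$1.8M/year at these assumptions
```

And this counts only the token bill — the latency and throughput gains from shorter prefills come on top.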
A Stepping Stone Toward Self-Improving Models
OPCD’s long-term significance extends beyond simple cost savings. As the empirical knowledge distillation experiment showed, if a model can extract rules from its own successful experiences and internalize them into its parameters, this hints at the possibility of a self-improvement loop.
If a deployed model organizes success patterns accumulated during operation and internalizes them via OPCD into the next version — and this cycle repeats — the model could progressively optimize for a specific domain. Of course, realizing this vision requires solving challenges like automated experience extraction, quality verification, and safe update pipelines.
Summary
OPCD is a framework that realized the simple idea of “baking system prompts into the model instead of sending them every time” through the technical innovations of on-policy learning and reverse KL divergence. Solving the exposure bias and hallucination problems of existing context distillation while pushing safety classification accuracy from 30.7% to 83.1% in a 3B model was impressive.
If prompt compression was the “reducing” approach, OPCD is the “eliminating” approach. For enterprises running LLM-based services at scale, being able to remove thousands of tokens of fixed cost per request is a practical game changer. Once Microsoft releases the code, it will be worth watching how quickly this technology permeates production pipelines.
Footnotes
[^1]: Tianzhu Ye, Li Dong, Xun Wu et al., “On-Policy Context Distillation for Language Models”, arXiv:2602.12275, 2026.
[^2]: Carl Franzen, “Microsoft’s new AI training method eliminates bloated system prompts without sacrificing model performance”, VentureBeat, February 28, 2026.
[^3]: Huiqiang Jiang et al., “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models”, EMNLP 2023.