Internalizing System Prompts Into the Model: How Microsoft's OPCD Framework Is Changing LLM Deployment

Category: AI Applications
Tags: OPCD, system prompt, LLM optimization, Microsoft, prompt engineering, fine-tuning

The Hidden Cost of System Prompts

If you run an LLM-based service, you’re well acquainted with system prompts. These lengthy instructions — covering safety policies, response tone, domain knowledge, and formatting rules — are sent alongside every single request to the model. The problem is that this isn’t free.

Enterprise-level system prompts commonly run to thousands of tokens. When token counts grow, two costs increase simultaneously. First, inference latency. In the Transformer architecture, attention computation scales quadratically with sequence length, so longer prompts noticeably increase time-to-first-token. Second, monetary cost. Since most API pricing is proportional to input tokens, repeatedly sending the same system prompt millions of times a day adds up to a substantial bill.

As explored in a previous article on LLM serving architecture, prefill-stage computation is directly proportional to input token count. The system prompt is, in effect, a fixed overhead on that prefill cost.

So what if you could “bake” these repeated instructions directly into the model weights? OPCD (On-Policy Context Distillation), a framework published by Microsoft Research in February 2026, is precisely the answer to that question [1].

The Basic Idea of Context Distillation

To understand OPCD, you first need to grasp the concept of context distillation. The principle is based on a teacher-student paradigm:

  1. Teacher model: Receives the full context including the long system prompt and generates high-quality responses accordingly.
  2. Student model: Receives only the user query without the system prompt. It learns to reproduce the teacher’s behavior by observing the teacher’s responses.

Once training is complete, the student model can generate responses similar to the teacher’s even without the system prompt. The information that was in the prompt has been internalized into the model parameters.
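The traditional recipe boils down to a simple data-collection step. The sketch below is illustrative only: `teacher_generate` is a hypothetical placeholder for a real model call, and the system prompt is made up. What matters is the asymmetry in the data flow — the teacher sees the full policy, the student never does.

```python
# Sketch of classic (off-policy) context distillation data collection.
# teacher_generate is a hypothetical placeholder for a real LLM call.
SYSTEM_PROMPT = "You are a careful medical assistant. Follow safety policy X. ..."

def teacher_generate(system_prompt: str, query: str) -> str:
    # Placeholder: a real implementation would call the teacher model here,
    # conditioned on the full system prompt plus the user query.
    return f"[teacher answer under the full policy] {query}"

def build_distillation_pairs(queries: list[str]) -> list[tuple[str, str]]:
    """The teacher sees system prompt + query; the student is later trained
    to map the bare query (no system prompt) to the teacher's answer."""
    return [(q, teacher_generate(SYSTEM_PROMPT, q)) for q in queries]

pairs = build_distillation_pairs(["What is aspirin used for?"])
print(pairs[0][0])  # the student's training input: the bare query only
```

Note that the student only ever observes finished teacher sequences — exactly the off-policy setup whose limitations the next section describes.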

This idea itself wasn’t new. However, traditional context distillation had two fundamental limitations.

Two Flaws of Existing Approaches

Off-Policy Learning and Exposure Bias

Traditional context distillation operated in an off-policy manner. Training data consisted of fixed datasets collected before training began. The student model learned only from “correct” sequences generated by the teacher, which caused exposure bias.

During training, correct tokens were always provided, but in deployment the model must predict the next token based on its own generated tokens. A single wrong token can cause the entire subsequent sequence to cascade into failure. Co-author Tianzhu Ye compared this to “showing someone driving videos and then putting them behind the wheel” [2].

The Problem with Forward KL Divergence

The second issue lay in the training objective function. Existing methods minimized forward KL divergence, which induces mode-covering behavior — the student tries to “cover” the teacher’s entire distribution.

Since the student model is smaller or operates without context, it lacks the capacity to perfectly replicate the teacher’s complex distribution. Trying to encompass all possibilities anyway causes the predicted distribution to spread too broadly, leading to hallucinations and generalization failures.

OPCD’s Core Design: On-Policy + Reverse KL

OPCD simultaneously solves both problems.

On-Policy Learning

In OPCD, the student model learns from its own generated responses rather than pre-prepared datasets. The specific workflow:

  1. The student model receives a query without the system prompt and generates a response.
  2. The teacher model, with the full system prompt context, evaluates the token distribution at each generation step of the student.
  3. The student’s parameters are updated based on the difference between the student’s and teacher’s token distributions.

The key is that the student directly experiences and corrects its own mistakes. Rather than only seeing correct answers as in off-policy methods, it learns from teacher feedback in situations where it can actually go wrong.
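The three steps above can be sketched at a single next-token position. Everything here is a toy under stated assumptions: the “teacher” is a fixed categorical distribution standing in for the teacher's token distribution under the full system prompt, the student's context is assumed to be its own generated prefix (the on-policy part), and the update is plain gradient descent on reverse KL rather than the paper's full training recipe.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(x - m) for x in z]
    s = sum(e)
    return [x / s for x in e]

def reverse_kl(student, teacher):
    """KL(student || teacher), the objective direction OPCD minimizes."""
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

# Toy teacher next-token distribution, computed WITH the system prompt.
teacher = [0.70, 0.15, 0.10, 0.05]

# Student logits at the same position, reached via the student's OWN
# generated prefix (on-policy); the student starts out near-uniform.
logits = [0.0, 0.0, 0.0, 0.0]

lr = 0.5
for _ in range(300):
    s = softmax(logits)
    loss = reverse_kl(s, teacher)
    # Analytic gradient of KL(student || teacher) w.r.t. the logits:
    # dL/dz_k = s_k * (log(s_k / t_k) - L)
    grad = [si * (math.log(si / ti) - loss) for si, ti in zip(s, teacher)]
    logits = [z - lr * g for z, g in zip(logits, grad)]

student = softmax(logits)
```

After a few hundred steps the student's distribution at this position closely tracks the teacher's, without the student ever having seen the system prompt itself — that conditioning has moved into the parameters (here, the logits).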

Reverse KL Divergence

OPCD minimizes reverse KL divergence instead of forward KL. Reverse KL induces mode-seeking behavior: it focuses on regions where the student distribution assigns high probability, while suppressing tokens that the teacher rated highly but the student rated poorly.

As Ye explained: “Minimizing reverse KL encourages mode-seeking behavior that focuses on the student’s high-probability regions. Tokens the student deems unlikely are suppressed even if the teacher assigns high probability to them” [2].

The combined effect was clear: the student model focused on the most accurate responses within its capability range, and the problem of hallucinations from over-ambitiously mimicking the teacher’s full distribution was greatly reduced.
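The contrast between the two objectives shows up even on a toy bimodal teacher. The numbers below are purely illustrative; the point is only which student each objective prefers.

```python
import math

def kl(p, q):
    """KL(p || q) over a shared support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# A bimodal teacher with a low-probability gap between its two modes.
teacher   = [0.495, 0.010, 0.495]
# A "covering" student spreading mass over both modes AND the gap.
covering  = [0.340, 0.320, 0.340]
# A "committed" student locking onto a single teacher mode.
committed = [0.970, 0.020, 0.010]

# Forward KL (teacher || student) penalizes missing any teacher mode,
# so it prefers the covering student (mode-covering).
print(kl(teacher, covering) < kl(teacher, committed))   # True

# Reverse KL (student || teacher) penalizes putting mass where the
# teacher has little, so it prefers the committed student (mode-seeking).
print(kl(committed, teacher) < kl(covering, teacher))   # True
```

The covering student's large reverse-KL penalty comes entirely from the mass it wastes in the teacher's low-probability gap — the toy analogue of the hallucinations described above.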

Benchmark Results: The Numbers

The OPCD paper reported results across two experimental scenarios [1].

Empirical Knowledge Distillation

The first experiment verified whether a model could internalize problem-solving strategies accumulated while solving math problems.

| Model | Task | Baseline Accuracy | After OPCD |
| --- | --- | --- | --- |
| Llama-3.1-8B | Math reasoning | 75.0% | 80.9% |
| Qwen2.5-1.5B | Frozen Lake game | 6.3% | 38.3% |

The roughly 6× performance improvement (6.3% → 38.3%) in the 1.5B small model was particularly noteworthy.

System Prompt Distillation

The second experiment — the core topic of this article — tested scenarios where safety policies and medical domain prompts were baked into the model.

| Model | Task | Without Prompt | OPCD Internalized |
| --- | --- | --- | --- |
| Qwen2.5-3B | Safety/toxicity classification | 30.7% | 83.1% |
| Qwen2.5-3B | Medical QA | 59.4% | 76.3% |

Safety classification accuracy jumping from 30.7% to 83.1% in a 3B model meant that performance without a system prompt approached the level achieved with one.

General Performance Preservation

On the perennial fine-tuning problem of catastrophic forgetting, OPCD also showed favorable results. A model that had internalized safety rules maintained approximately 4 percentage points higher performance on an unrelated medical QA task compared to off-policy approaches. It secured both specialization and general performance simultaneously.

Comparison with Existing Prompt Compression Techniques

OPCD wasn’t the first attempt to solve the cost problem of system prompts. Comparing it with representative existing approaches clarifies OPCD’s position.

LLMLingua: Token-Level Compression

LLMLingua, released by Microsoft in 2023, used small language models (GPT-2, LLaMA-7B, etc.) to remove low-importance tokens from prompts [3]. It achieved up to 20× compression while minimizing performance degradation. The follow-up, LongLLMLingua (ACL 2024), further improved compression for long-context scenarios.

However, LLMLingua-family approaches require performing compression at inference time for every request. Compression itself demands computation, the original prompt must still exist somewhere, and the compressed prompt’s token count is never zero.

Soft Prompts and Prompt Tuning

Prompt tuning prepends learnable continuous vectors (soft prompts) to the input to steer model behavior. Optimizing in continuous space rather than with discrete tokens achieves similar effects with far fewer parameters. However, soft prompts still need to be added to the input at every inference, and they suffer from poor interpretability.

OPCD’s Differentiator

OPCD takes a fundamentally different approach. Rather than reducing the prompt, it eliminates it entirely: system prompt information is internalized directly into the model weights, so no additional input is needed at inference time. Where model compression techniques like quantization and pruning shrink the model, OPCD shrinks the input; more precisely, it removes it.

| Technique | Approach | Prompt at Inference | Additional Compute |
| --- | --- | --- | --- |
| LLMLingua | Token removal | Required (reduced) | Compression model run |
| Prompt tuning | Soft prompt | Required (vectors) | None |
| OPCD | Weight internalization | Not required | None |

Practical Deployment Conditions and Constraints

OPCD’s adoption barrier is relatively low. According to the paper, teams already running RLVR (Reinforcement Learning with Verifiable Rewards) pipelines can apply it without major architectural changes. The implementation was built on verl, an open-source RLVR codebase, and Microsoft stated plans to release the code after internal review [2].

Hardware requirements are modest — approximately 8 A100 GPUs, realistic compared to large-scale pretraining. Data requirements are also light: empirical knowledge distillation needed only about 30 seed examples, and system prompt distillation requires just the existing optimized prompt and standard task datasets.

However, OPCD isn’t a silver bullet. Ye noted: “When the needed information is very dynamic, or relates to frequently updated large external databases, RAG is a more appropriate solution” [2]. OPCD is fundamentally optimized for internalizing static, repetitive instructions — system prompts, safety policies, and domain rules that don’t change.

Position in the LLM Serving Pipeline

Assuming OPCD is deployed alongside a serving engine like vLLM, the benefits manifest through two pathways:

Prefill stage reduction: Eliminating the system prompt shortens the input sequence length by that much. For a service using 2,000 tokens of system prompt, each request saves 2,000 tokens’ worth of prefill computation. KV cache memory usage decreases proportionally.

Throughput increase: With shorter inputs, more requests can be batched in the same GPU memory. In continuous batching environments, this directly translates to higher throughput.

For large-scale services where thousands of tokens of system prompt are repeatedly sent across millions of daily queries, these savings can add up to a meaningful reduction in infrastructure costs.
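The back-of-envelope arithmetic is straightforward. The per-token price below is a hypothetical placeholder, not any provider's actual rate; substitute your own numbers.

```python
# Illustrative prefill savings from removing a 2,000-token system prompt.
system_prompt_tokens = 2_000
daily_requests = 1_000_000

# Hypothetical input-token price; replace with your provider's actual rate.
usd_per_million_input_tokens = 1.00

tokens_saved_per_day = system_prompt_tokens * daily_requests
usd_saved_per_day = tokens_saved_per_day / 1_000_000 * usd_per_million_input_tokens

print(tokens_saved_per_day)  # 2000000000 (2B input tokens/day)
print(usd_saved_per_day)     # 2000.0
```

Even at this made-up rate, the savings compound daily, and the latency and KV-cache benefits from the shorter prefill come on top of the raw token bill.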

A Stepping Stone Toward Self-Improving Models

OPCD’s long-term significance extends beyond simple cost savings. As the empirical knowledge distillation experiment showed, if a model can extract rules from its own successful experiences and internalize them into its parameters, this hints at the possibility of a self-improvement loop.

If a deployed model organizes success patterns accumulated during operation and internalizes them via OPCD into the next version — and this cycle repeats — the model could progressively optimize for a specific domain. Of course, realizing this vision requires solving challenges like automated experience extraction, quality verification, and safe update pipelines.

Summary

OPCD is a framework that realized the simple idea of “baking system prompts into the model instead of sending them every time” through the technical innovations of on-policy learning and reverse KL divergence. Solving the exposure bias and hallucination problems of existing context distillation while pushing safety classification accuracy from 30.7% to 83.1% in a 3B model was impressive.

If prompt compression was the “reducing” approach, OPCD is the “eliminating” approach. For enterprises running LLM-based services at scale, being able to remove thousands of tokens of fixed cost per request is a practical game changer. Once Microsoft releases the code, it will be worth watching how quickly this technology permeates production pipelines.

Footnotes

  1. Tianzhu Ye, Li Dong, Xun Wu et al., “On-Policy Context Distillation for Language Models”, arXiv:2602.12275, 2026.

  2. Carl Franzen, “Microsoft’s new AI training method eliminates bloated system prompts without sacrificing model performance”, VentureBeat, February 28, 2026.

  3. Huiqiang Jiang et al., “LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models”, EMNLP 2023.
