Researchers at Microsoft have proposed a new training framework called On-Policy Context Distillation (OPCD) that eliminates the need for long system prompts in large language models (LLMs) without sacrificing model performance. This innovation has the potential to significantly reduce inference latency and per-query costs in enterprise applications.
In building LLM applications, enterprises often create very long system prompts to adjust a model's behavior for their use case. These prompts encode company knowledge, preferences, and application-specific instructions, and they can push inference latency past acceptable thresholds and drive up costs. According to Tianzhu Ye, co-author of the paper and researcher at Microsoft Research Asia, "Enterprises often use long system prompts to enforce safety constraints or to provide domain-specific expertise. However, lengthy prompts significantly increase computational overhead and latency at inference time."

The core idea behind context distillation is to train a model to internalize information that is repeatedly inserted into its context, following a teacher-student paradigm. OPCD trains on the model's own responses, which avoids the distribution mismatch that can arise when a model is fine-tuned only on externally generated examples.
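The teacher-student setup can be illustrated with a minimal numerical sketch. This is not the paper's implementation: here a fixed categorical distribution stands in for the "teacher" (the model seeing the long system prompt), the prompt-free "student" starts uniform, and everything beyond the teacher-student framing (the toy vocabulary, the score-function gradient estimator) is an illustrative assumption. The on-policy ingredient is that the student is trained on tokens drawn from its *own* current distribution, then pulled toward the teacher on those samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence between two categorical distributions
    return float(p @ (np.log(p) - np.log(q)))

# Stand-ins for the two views of the model: the "teacher" is the model
# conditioned on the long system prompt (here just a fixed random
# distribution); the "student" is the same model queried without it.
teacher_logits = rng.normal(size=VOCAB)
student_logits = np.zeros(VOCAB)  # student starts out uniform

p_teacher = softmax(teacher_logits)
initial_kl = kl(softmax(student_logits), p_teacher)

lr, n_samples = 0.5, 128
for step in range(2000):
    p_student = softmax(student_logits)
    # On-policy: sample tokens from the *student's* current distribution,
    # then nudge the student toward the teacher on those very samples
    # (score-function estimate of the reverse-KL gradient).
    samples = rng.choice(VOCAB, size=n_samples, p=p_student)
    log_ratio = np.log(p_student) - np.log(p_teacher)
    baseline = float(p_student @ log_ratio)  # equals current KL; reduces variance
    w = np.bincount(samples, minlength=VOCAB) / n_samples  # empirical sample frequencies
    adv = log_ratio - baseline
    grad = w * adv - p_student * (w @ adv)  # averaged score-function gradient
    student_logits -= lr * grad

final_kl = kl(softmax(student_logits), p_teacher)
```

After training, the student's distribution closely matches the teacher's even though the "prompt" (the teacher's extra conditioning) is never shown to it — the toy analogue of internalizing a system prompt into the weights.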
The researchers tested OPCD in two areas: experiential knowledge distillation and system prompt distillation. In the experiential knowledge experiments, they examined whether an LLM could learn from its own past successes and permanently adopt those lessons. The models improved substantially without needing the learned experience pasted into their prompts: on complex math problems, an 8-billion-parameter model improved from a 75.0% baseline to 80.9%. The second set of experiments targeted long system prompts, and the results showed that OPCD successfully internalized complex rules. When a 3-billion-parameter Llama model was tested on safety and toxicity classification, the base model scored 30.7%; after using OPCD to internalize the safety prompt, its accuracy jumped to 83.1%.
OPCD has the potential to pave the way for genuinely self-improving models that continuously adapt to bespoke enterprise environments. Once deployed, a model can extract lessons from real-world interactions and use OPCD to progressively internalize them without manual supervision or data annotation from model trainers. According to Ye, "This represents a fundamental paradigm shift in model improvement: the core improvements to the model would move from training time to test time." The researchers plan to release their implementation as open source following internal reviews, and the technique can be integrated into existing workflows with little friction, requiring about eight Nvidia A100 GPUs.