Researchers at Microsoft have proposed a new training framework called On-Policy Context Distillation (OPCD) that eliminates the need for long system prompts in large language models (LLMs) without sacrificing model performance. This innovation has the potential to significantly reduce inference latency and per-query costs in enterprise applications.
In building LLM applications, enterprises often create very long system prompts to adjust a model's behavior for their use case. These prompts encode company knowledge, preferences, and application-specific instructions, and they can push inference latency past acceptable thresholds and drive up costs. According to Tianzhu Ye, co-author of the paper and researcher at Microsoft Research Asia, "Enterprises often use long system prompts to enforce safety constraints or to provide domain-specific expertise. However, lengthy prompts significantly increase computational overhead and latency at inference time."

The core idea behind context distillation is to train a model to internalize information that is repeatedly inserted into its context, following a teacher-student paradigm. OPCD trains on the model's own responses, which avoids the distribution mismatch that can arise when a model is fine-tuned only on externally generated examples.
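The teacher-student setup can be illustrated with a minimal numerical sketch. This is not the paper's implementation: here a fixed categorical distribution stands in for the "teacher" (the model seeing the long system prompt), the prompt-free "student" starts uniform, and everything beyond the teacher-student framing (the toy vocabulary, the score-function gradient estimator) is an illustrative assumption. The on-policy ingredient is that the student is trained on tokens drawn from its *own* current distribution, then pulled toward the teacher on those samples.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kl(p, q):
    # KL divergence between two categorical distributions
    return float(p @ (np.log(p) - np.log(q)))

# Stand-ins for the two views of the model: the "teacher" is the model
# conditioned on the long system prompt (here just a fixed random
# distribution); the "student" is the same model queried without it.
teacher_logits = rng.normal(size=VOCAB)
student_logits = np.zeros(VOCAB)  # student starts out uniform

p_teacher = softmax(teacher_logits)
initial_kl = kl(softmax(student_logits), p_teacher)

lr, n_samples = 0.5, 128
for step in range(2000):
    p_student = softmax(student_logits)
    # On-policy: sample tokens from the *student's* current distribution,
    # then nudge the student toward the teacher on those very samples
    # (score-function estimate of the reverse-KL gradient).
    samples = rng.choice(VOCAB, size=n_samples, p=p_student)
    log_ratio = np.log(p_student) - np.log(p_teacher)
    baseline = float(p_student @ log_ratio)  # equals current KL; reduces variance
    w = np.bincount(samples, minlength=VOCAB) / n_samples  # empirical sample frequencies
    adv = log_ratio - baseline
    grad = w * adv - p_student * (w @ adv)  # averaged score-function gradient
    student_logits -= lr * grad

final_kl = kl(softmax(student_logits), p_teacher)
```

After training, the student's distribution closely matches the teacher's even though the "prompt" (the teacher's extra conditioning) is never shown to it — the toy analogue of internalizing a system prompt into the weights.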
The researchers tested OPCD in two areas: experiential knowledge distillation and system prompt distillation. In the experiential knowledge experiments, they examined whether an LLM could learn from its own past successes and permanently adopt those lessons. The models improved substantially without needing the learned experience pasted into their prompts: on complex math problems, an 8-billion-parameter model improved from a 75.0% baseline to 80.9%. The second set of experiments targeted long system prompts, and the results showed that OPCD successfully internalized complex rules. When a 3-billion-parameter Llama model was tested on safety and toxicity classification, the base model scored 30.7%; after using OPCD to internalize the safety prompt, its accuracy jumped to 83.1%.
OPCD has the potential to pave the way for genuinely self-improving models that continuously adapt to bespoke enterprise environments. Once deployed, a model can extract lessons from real-world interactions and use OPCD to progressively internalize them without manual supervision or data annotation from model trainers. According to Ye, "This represents a fundamental paradigm shift in model improvement: the core improvements to the model would move from training time to test time." The researchers plan to release their implementation as open source following internal reviews, and the technique can be integrated into existing workflows with little friction, requiring about eight Nvidia A100 GPUs.