ByteDance Open Source Project

DanceOPD: On-Policy Generative Field Distillation

Unify text-to-image synthesis, local editing, and global editing within a single student flow-matching model without capability interference.

Unifying Generative Fields with DanceOPD

Modern generative image modeling has split into multiple specialized subfields. Models are typically designed to solve a single primary problem well, such as creating new images from text descriptions, editing localized parts of existing images, or applying overall stylistic changes to a picture. While this specialization yields high quality within each task, deploying separate models for every task creates massive operational overhead and increases compute costs.

The obvious solution is to train a single model to handle all of these tasks. However, researchers have long faced a fundamental obstacle known as capability interference. When a single neural network is trained on multiple diverse tasks, the training directions for these tasks often conflict. This conflict causes the model's performance to degrade across all tasks, leaving it unable to produce high-quality text-to-image outputs or precise image edits.

To address this limitation, the DanceOPD framework introduces On-Policy Generative Field Distillation. It treats each specialized teacher model as a velocity field over a shared latent flow space and distills these fields into a single, efficient student model. Through hard-routed training paths and on-policy supervision, DanceOPD achieves state-of-the-art results across image generation and editing tasks without capability degradation.

The Mechanics of Capability Interference

To understand why multi-task image models struggle, it is necessary to examine how neural networks receive optimization signals during training. In a standard multi-task setup, a network is presented with mixed training data: some batches focus on text-to-image generation, others on local image editing, and others on style transformations.

During backpropagation, the model computes the gradient of the loss for the current batch, and the optimizer steps against that gradient to reduce the error. For a text-to-image task, the update encourages the model to generate layout and structural details from random noise. In contrast, for a local editing task, the update pushes the model to keep most of the image layout identical and change only a small, specific region.

When these tasks are combined, the resulting gradients conflict. The update step for one task partially undoes the updates for another. Over many training cycles, this conflict settles into a compromise in parameter space where the model is mediocre at every task. It fails to generate sharp details from text prompts and struggles to maintain image consistency during edits, resulting in blurry details and poor instruction compliance.
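The cancellation described above can be illustrated with a toy calculation. This is only a sketch: the three 2-D gradient vectors are hypothetical values chosen for illustration, not measurements from DanceOPD, and the helper names are made up for this example.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(components) / count for components in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Hypothetical per-task gradients in a 2-D parameter space (illustrative values).
t2i_grad = [1.0, 0.0]    # pushes toward generating new layout
edit_grad = [-1.0, 0.0]  # pushes toward preserving existing layout
style_grad = [0.0, 1.0]  # an unrelated direction

# A cosine similarity of -1 means the two gradients are directly opposed.
conflict = cosine_similarity(t2i_grad, edit_grad)

# Averaging cancels the opposed components and shrinks what is left.
averaged = mean_vector([t2i_grad, edit_grad, style_grad])
averaged_norm = math.hypot(*averaged)

print(conflict, averaged, averaged_norm)
```

In this toy case the text-to-image and editing components cancel completely, and the averaged update is one third the length of any single-task gradient.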

This issue is compounded by the fact that different tasks require different levels of noise conditioning. Text-to-image generation typically operates across the entire noise spectrum, from high-variance starting points to low-variance outputs. Image editing, particularly local editing, relies on preserving low-frequency structures, which requires operating within restricted noise zones. When a single network attempts to map these varying paths, the optimization landscape becomes highly fragmented and unstable.

Hard-Routed Sample-Wise Field Matching

DanceOPD introduces Hard-Routed Sample-Wise Field Matching to resolve these gradient conflicts. Instead of averaging gradients across mixed tasks or attempting to force the student model to learn a blended trajectory, the framework routes each training sample to exactly one specialized teacher velocity field.

Mathematically, the training process treats the teachers as velocity fields over a shared latent flow. When a training sample is processed, the system assigns it to a specific task category, such as text-to-image synthesis. The student model is then optimized to match the velocity field of the corresponding teacher for that sample.

This hard routing ensures that the student parameters are updated in a clear, consistent direction for each sample. The model is never forced to reconcile conflicting task demands on a single update step. By isolating the trajectories, the student network learns to partition its internal representation space, developing specialized subnetworks or pathways that handle different tasks while sharing core features. This approach eliminates target-field ambiguity and preserves model capacity.
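A minimal sketch of the routing idea follows, assuming a 1-D latent and a student that receives the task label as conditioning. The three teacher fields, the `TEACHERS` table, and the function names are hypothetical stand-ins for illustration; the actual DanceOPD objective and architecture may differ.

```python
from typing import Callable, Dict, List, Tuple

Field = Callable[[float, float], float]   # velocity = field(x, t)
Sample = Tuple[str, float, float]         # (task label, latent x, time t)

# Hypothetical 1-D teacher velocity fields, one per capability.
TEACHERS: Dict[str, Field] = {
    "t2i": lambda x, t: 1.0 - x,
    "local_edit": lambda x, t: -0.1 * x,
    "global_edit": lambda x, t: 0.5 * (t - x),
}

def routed_target(task: str, x: float, t: float) -> float:
    """Hard routing: the target comes from exactly one teacher field."""
    return TEACHERS[task](x, t)

def blended_target(x: float, t: float) -> float:
    """Contrast case: every teacher is averaged into one ambiguous target."""
    return sum(field(x, t) for field in TEACHERS.values()) / len(TEACHERS)

def routed_loss(student: Callable[[str, float, float], float],
                batch: List[Sample]) -> float:
    """Mean squared error between the student and each sample's own teacher."""
    errors = [(student(task, x, t) - routed_target(task, x, t)) ** 2
              for task, x, t in batch]
    return sum(errors) / len(errors)

batch = [("t2i", 0.5, 0.5), ("local_edit", 0.5, 0.5), ("global_edit", 0.5, 0.5)]

# A student that reproduces each teacher under its own task label has zero loss,
# while a student that outputs the blended field cannot match any single teacher.
perfect_student = lambda task, x, t: TEACHERS[task](x, t)
blended_student = lambda task, x, t: blended_target(x, t)

print(routed_loss(perfect_student, batch), routed_loss(blended_student, batch))
```

The point of the contrast is that the routed objective has a solution that satisfies every sample, whereas the blended target matches no teacher exactly.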

Figure: Hard-Routed Sample-Wise Optimization Path. Comparison of training signals between standard multi-task training and DanceOPD routing.

Standard Multi-Task Distillation: conflicting teacher gradients (T2I →, Edit ←, Style ↑) are averaged, leading to path cancellation and blurred outputs. Averaged signal: weak and ambiguous.

DanceOPD Distillation: samples are isolated and routed, letting the model learn clear, independent trajectories. Sample A is routed to the T2I field only (→), Sample B to the edit field only (←), and Sample C to the style field only (↑). Routed signal: clean, target-specific paths.

On-Policy Field Querying: Eliminating Covariate Shift

Standard distillation frameworks often rely on off-policy training, where the student model is trained on data trajectories pre-generated by the teacher. While this is simple to implement, it suffers from a significant limitation during deployment.

During inference, the student model generates its own trajectory step by step. Because the student is not a perfect replica of the teacher, it makes small errors at each step. In an off-policy setup, the student has only been trained on the teacher's clean paths. It does not know how to correct these errors, causing them to accumulate over successive steps. This covariate shift leads to distorted images and artifacts.

DanceOPD solves this by using On-Policy Field Querying. During training, the student model generates its own rollout states. The framework then queries the teacher velocity fields at these student-generated states rather than the teacher's states. This forces the student model to learn how to move toward the target from the actual states it encounters during generation.

By querying on-policy, the student receives feedback on its own errors. The teacher fields guide the student back to the correct path, training the model to be self-correcting. This results in greater stability during generation, producing sharper details and better adherence to input prompts even when using a small number of inference steps.
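The difference between the two supervision schemes can be sketched with a toy 1-D Euler rollout. The `teacher` and `student` fields below are hypothetical, and the fixed-step Euler integrator is an assumption for illustration; a real training loop would use neural networks and would typically generate the student rollout without tracking gradients.

```python
def euler_rollout(field, x0, num_steps):
    """Integrate dx/dt = field(x, t) from t=0 and record each visited (x, t)."""
    dt = 1.0 / num_steps
    x = x0
    states = []
    for step in range(num_steps):
        t = step * dt
        states.append((x, t))
        x = x + dt * field(x, t)
    return states

def teacher(x, t):
    """Hypothetical teacher field that pulls the latent toward 2.0."""
    return 2.0 - x

def student(x, t):
    """Hypothetical imperfect student: slightly wrong gain and bias."""
    return 0.8 * (2.0 - x) + 0.1

def supervise(states, teacher_field):
    """Attach the teacher's velocity at each state as the regression target."""
    return [(x, t, teacher_field(x, t)) for x, t in states]

# Off-policy: targets are computed at states the teacher visits.
off_policy = supervise(euler_rollout(teacher, 0.0, 4), teacher)

# On-policy: targets are computed at states the student actually visits.
on_policy = supervise(euler_rollout(student, 0.0, 4), teacher)

for (x_off, t, _), (x_on, _, target) in zip(off_policy, on_policy):
    print(f"t={t:.2f} teacher state={x_off:.3f} student state={x_on:.3f} target={target:.3f}")
```

After the first step the two rollouts diverge, so only the on-policy pairs describe what the teacher would do from the states the student reaches at inference time.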

Semantic-Side Single Query Optimization

While on-policy distillation provides superior results, it introduces high computational overhead. In a naive implementation, querying the teacher model at every step of the student's rollout requires running the teacher's full forward pass multiple times per sample. This slows down training and requires massive memory bandwidth.

To make on-policy training practical, DanceOPD introduces the Semantic-Side Single Query method. This technique operates on the observation that the early stages of the latent flow matching path determine the high-level semantic layout of the image, while the later stages focus on low-level detail refinement.

Instead of querying the teacher at every single step, DanceOPD performs a single query on the semantic portion of the latent space. This single query provides the essential directional guidance needed to align the student's trajectory. By avoiding redundant queries during the detail-refinement phase, the framework significantly reduces computational overhead and minimizes path correlation. This optimization makes it possible to train DanceOPD models in a fraction of the time required by standard on-policy distillation frameworks.
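The saving in teacher calls can be sketched as follows. This is an assumption-laden illustration: the source does not specify how the semantic portion is delimited, so the sketch treats the first `num_semantic_steps` steps of the rollout as the semantic side and picks one of them uniformly at random. The toy fields and all function names are hypothetical.

```python
import random

class CountingTeacher:
    """Hypothetical teacher field that counts how often it is queried."""
    def __init__(self):
        self.calls = 0

    def __call__(self, x, t):
        self.calls += 1
        return 2.0 - x

def student(x, t):
    """Hypothetical imperfect student field."""
    return 0.8 * (2.0 - x) + 0.1

def naive_on_policy(teacher, x0, num_steps):
    """Query the teacher at every state of the student rollout."""
    dt = 1.0 / num_steps
    x = x0
    targets = []
    for step in range(num_steps):
        t = step * dt
        targets.append((x, t, teacher(x, t)))
        x = x + dt * student(x, t)
    return targets

def single_semantic_query(teacher, x0, num_steps, num_semantic_steps, rng):
    """Roll the student to one early (semantic-side) step and query once."""
    dt = 1.0 / num_steps
    query_step = rng.randrange(num_semantic_steps)  # assumed sampling rule
    x = x0
    for step in range(query_step):
        x = x + dt * student(x, step * dt)
    t = query_step * dt
    return (x, t, teacher(x, t))

naive_teacher = CountingTeacher()
naive_targets = naive_on_policy(naive_teacher, 0.0, num_steps=20)

single_teacher = CountingTeacher()
single_target = single_semantic_query(
    single_teacher, 0.0, num_steps=20, num_semantic_steps=6, rng=random.Random(0)
)

print(naive_teacher.calls, single_teacher.calls)
```

With a 20-step rollout the naive loop issues 20 teacher queries per sample, while the single-query variant issues one and never evaluates the teacher on the later steps.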

Figure: On-Policy Trajectory Querying. How DanceOPD optimizes the query mechanism across generation steps. Naive on-policy: the teacher is queried at every step, for 20 queries per sample (slow, high VRAM). DanceOPD single-query: one semantic query, with the remaining steps skipped (fast, low VRAM).

Benchmark Performance: Outperforming Baselines

To verify the effectiveness of the framework, DanceOPD was evaluated on GEditBench, a benchmark designed to assess multi-task image models. The evaluation compared DanceOPD to several state-of-the-art distillation baselines, measuring both image quality and instruction alignment.

The results demonstrate that DanceOPD achieves significant improvements over competing methods:

Method / Model	T2I Image Quality (FID, lower is better)	Local Editing Accuracy	Global Style Alignment	Multi-Task Composition Improvement
Baseline Student (Off-Policy)	18.42	74.2%	76.5%	Baseline
Multi-Task Distillation (Naive)	19.85	72.1%	74.0%	-3.2% (Interference)
On-Policy Distillation (Baseline)	17.90	78.5%	80.2%	+5.4%
DanceOPD (Ours)	16.82	89.4%	91.8%	+16.1% (Local+Global)

As shown in the table, naive multi-task distillation suffers from capability interference and scores worse than the baseline student model: FID rises from 18.42 to 19.85, and local editing accuracy falls from 74.2% to 72.1%. DanceOPD, by contrast, improves on every column. It delivers a 16.1% improvement in local-plus-global editing composition, and the evaluation also reports an 8.1% improvement in the combination of text-to-image generation and editing.

Crucially, this is achieved while preserving the quality of the baseline text-to-image model. The anchor generation scores remain within 0.1% of the baseline, indicating that DanceOPD unifies editing capabilities without degrading the model's core generation performance.

Integrating Classifier-Free Guidance

An additional benefit of the DanceOPD framework is its ability to absorb operator-defined fields during the distillation process. In standard image generation models, Classifier-Free Guidance (CFG) is used to control the trade-off between image quality and prompt alignment. However, CFG requires running two forward passes per step during inference, doubling the computational cost.

DanceOPD can treat the CFG modification as an additional velocity field during training. The student model learns to match the CFG-guided field directly, internalizing the guidance behavior. As a result, the student model can generate highly aligned images with a single forward pass per step, which removes the second CFG pass, roughly halves per-step inference compute, and simplifies deployment.
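As a rough sketch of what matching the CFG-guided field means, the snippet below builds the guided target with the standard classifier-free guidance combination, unconditional prediction plus a scaled difference between conditional and unconditional predictions, and regresses a student output onto it. The function names, the guidance scale, and the example vectors are illustrative assumptions; the exact target used by DanceOPD is not given in this text.

```python
def cfg_field(v_cond, v_uncond, guidance_scale):
    """Standard CFG combination: v_uncond + scale * (v_cond - v_uncond)."""
    return [u + guidance_scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def cfg_distillation_loss(student_velocity, v_cond, v_uncond, guidance_scale):
    """Mean squared error between the student output and the guided target."""
    target = cfg_field(v_cond, v_uncond, guidance_scale)
    errors = [(s - g) ** 2 for s, g in zip(student_velocity, target)]
    return sum(errors) / len(errors)

# Hypothetical teacher outputs for one latent (illustrative values).
v_cond = [1.0, 2.0]     # prompt-conditioned velocity
v_uncond = [0.5, 1.0]   # unconditional velocity

# Training time: the teacher still needs two passes to build the target.
guided = cfg_field(v_cond, v_uncond, guidance_scale=3.0)

# Inference time: a student that has learned `guided` needs only one pass.
loss_when_matched = cfg_distillation_loss(guided, v_cond, v_uncond, 3.0)

print(guided, loss_when_matched)
```

A guidance scale of 1 reduces the target to the conditional field, and a scale of 0 reduces it to the unconditional field, which is a quick sanity check on the formula.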

Summary of Key Benefits

In summary, DanceOPD offers several clear advantages for image generation and editing workflows:

No Capability Interference: Unifies text-to-image synthesis and image editing in a single model without performance degradation.
On-Policy Accuracy: Minimizes covariate shift, producing sharper and more coherent images.
High Computational Efficiency: Reduces teacher query overhead during training using the Semantic-Side Single Query method.
CFG Integration: Absorbs classifier-free guidance into the student parameters, cutting inference costs.
Clean Architecture: Replaces multiple specialized networks with a single, compact flow-matching model.

Frequently Asked Questions

DanceOPD is an open-source generative distillation framework designed to merge multiple specialized generative teachers into a single student flow-matching model. In typical image generation pipelines, training a single model to perform text-to-image synthesis, local region editing, and global style changes simultaneously leads to capability interference. This interference happens because the different tasks produce conflicting update directions. DanceOPD resolves this challenge by treating each specialized teacher as a velocity field over a shared latent space and employing hard-routed sample-wise field matching, ensuring the student model learns distinct behaviors without compromising performance.

Capability interference refers to the performance degradation that occurs when a single neural network is trained on multiple, conflicting objectives. For example, a text-to-image model needs the freedom to generate layouts from scratch, whereas a local image editor must preserve the layout and modify only a specific subset of pixels. When these objectives are trained together with averaged gradients, the model receives conflicting signals, leading to blurry outputs, poor instruction adherence, and loss of detail. DanceOPD isolates these objectives during training, allowing the student to learn specialized paths without interference.

Hard-Routed Sample-Wise Field Matching ensures that each training sample is routed to exactly one specialized velocity field during training. Calculating and averaging the gradients from multiple teachers for every sample causes target-field ambiguity and gradient cancellation, so the student model is instead optimized against one specific capability field at a time. This hard routing allows the student network parameters to accommodate distinct trajectories for different tasks, effectively eliminating the interference that typically limits multi-capability systems.

Off-policy distillation trains the student model using pre-computed trajectories from the teacher model. However, during independent inference, the student inevitably deviates from these teacher paths, causing error accumulation known as covariate shift. On-policy distillation, which is the core of DanceOPD, addresses this issue by supervising the student on its own generated rollout states. By querying the teacher fields at the states the student actually visits, the model learns to correct its own path deviations, resulting in sharper and more stable outputs.

Querying the teacher model at every step of an on-policy student rollout is computationally expensive and introduces trajectory correlation, where redundant information is repeated across steps. DanceOPD introduces the Semantic-Side Single Query method to optimize this process. It performs only a single query on the semantic portion of the latent flow space during a rollout. This approach provides the student with clear directional guidance while reducing the overall training time and memory requirements, making the distillation process highly efficient.

The framework is primarily evaluated on GEditBench, a rigorous benchmark designed to test image editing and text-to-image generation capabilities. On this benchmark, DanceOPD demonstrates a 16.1% improvement in local-plus-global editing composition compared to prior distillation methods. It also achieves an 8.1% improvement in the combination of text-to-image synthesis and editing. Importantly, the model preserves high-quality text-to-image generation performance, remaining within 0.1% of the baseline scores.

DanceOPD can also absorb Classifier-Free Guidance (CFG). Because of its flexible velocity field formulation, the framework can incorporate operator-defined fields directly into its training objective, and CFG, which is typically applied as a separate computation at each inference step, is one such field. By distilling the CFG field directly into the student model, the framework allows the model to generate highly aligned images without the extra guidance forward pass and without manual CFG parameter tuning at run time.

DanceOPD is built on PyTorch and the Hugging Face Diffusers library. It can be integrated into standard flow-matching and diffusion pipelines. Because the student model is distilled into a compact, single-stage flow-matching model, it requires significantly fewer inference steps (typically 10 to 20 steps) compared to multi-stage teacher models, making it suitable for deployment on consumer-grade graphics hardware with standard memory configurations.