Methodology & Architecture

The core innovation of DanceOPD is the unification of three distinct generative capabilities (text-to-image synthesis, local editing, and global styling) in a single student flow-matching network. Each task is formulated as a velocity field over a shared latent space, and the student model is trained with on-policy trajectory matching.

Mathematical Framework

Let $z_t$ denote the latent state at time $t \in [0, 1]$ in a continuous flow-matching formulation. The objective is to define a velocity field that transports a source distribution (Gaussian noise) to a target distribution (clean image data). A specialized teacher model is trained for each task to estimate the corresponding velocity field:

Velocity Field Formulation

$$v_m(z_t, t; x) \approx \frac{dz_t}{dt}$$

Here, $v_m$ is the velocity field for a specific task $m$ (such as image editing or generation), conditioned on the input context $x$. In a standard multi-task system, training a student model to predict the average of these fields causes path conflicts: when two teachers prescribe different directions at the same state, their average follows neither trajectory. DanceOPD avoids this by hard-routing each training sample to a single task, so that updates for different tasks are never mixed within a sample.
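The path conflict can be illustrated with a toy example. The sketch below is not DanceOPD code: the two "teachers" are hypothetical straight-line velocity fields in two dimensions, `TARGET_A` and `TARGET_B` are made-up endpoints, and integration uses a plain Euler scheme. It only shows what happens when a single field is built by averaging two fields that disagree.

```python
import numpy as np

# Hypothetical 2-D endpoints standing in for two tasks' clean outputs.
TARGET_A = np.array([1.0, 0.0])
TARGET_B = np.array([-1.0, 0.0])

def make_teacher(target):
    """Toy straight-line velocity field that reaches `target` at t = 1."""
    def velocity(z, t):
        return (target - z) / (1.0 - t)
    return velocity

def euler_integrate(velocity, z0, num_steps=100):
    """Integrate dz/dt = velocity(z, t) from t = 0 to t = 1 with fixed Euler steps."""
    z = np.array(z0, dtype=float)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        z = z + dt * velocity(z, i * dt)
    return z

v_a = make_teacher(TARGET_A)
v_b = make_teacher(TARGET_B)

def v_avg(z, t):
    # A single field obtained by averaging the two task fields.
    return 0.5 * (v_a(z, t) + v_b(z, t))

z0 = np.array([0.3, 0.5])            # arbitrary starting state
end_a = euler_integrate(v_a, z0)     # follows task A's path
end_b = euler_integrate(v_b, z0)     # follows task B's path
end_avg = euler_integrate(v_avg, z0) # follows neither path
```

In this toy setup each individual field ends at its own target, while the averaged field ends at the midpoint between the two targets, which is the output of neither task.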

Hard-Routed Sample-Wise Optimization

During training, each input sample is assigned to exactly one target task. If a sample is designated for local image editing, the student model is trained exclusively to match the velocity field of the local edit teacher.

Because each sample exposes the student to only one task's trajectory, conflicting task targets are never combined for the same sample, which prevents the gradient cancellation that mixing them would cause. Hard routing thus lets the student's weights represent distinct paths for different generation behaviors, maintaining high fidelity across all output types.
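A minimal sketch of a hard-routed loss, under stated assumptions: the function name `hard_routed_loss`, the integer `task_ids` array, the squared-error objective, and the toy teachers are all hypothetical and chosen for illustration. The text only specifies that each sample is matched to exactly one teacher.

```python
import numpy as np

def hard_routed_loss(student, teachers, z_t, t, task_ids):
    """Mean squared velocity error; each sample is compared only with its own teacher."""
    v_student = student(z_t, t, task_ids)        # shape (batch, dim)
    v_target = np.zeros_like(v_student)
    for m, teacher in enumerate(teachers):
        mask = task_ids == m                     # samples routed to task m
        if mask.any():
            v_target[mask] = teacher(z_t[mask], t[mask])
    return float(np.mean(np.sum((v_student - v_target) ** 2, axis=1)))

# Toy teachers (assumption): each one points from z toward its own target.
TARGETS = np.array([[1.0, 0.0], [-1.0, 0.0]])
teachers = [lambda z, t, c=c: c - z for c in TARGETS]

def routed_student(z, t, task_ids):
    # Follows the teacher selected by each sample's task id.
    return TARGETS[task_ids] - z

def averaged_student(z, t, task_ids):
    # Ignores the task id and predicts the average of both teachers.
    return TARGETS.mean(axis=0) - z

z_batch = np.zeros((4, 2))
t_batch = np.full(4, 0.5)
ids = np.array([0, 1, 0, 1])

loss_routed = hard_routed_loss(routed_student, teachers, z_batch, t_batch, ids)
loss_averaged = hard_routed_loss(averaged_student, teachers, z_batch, t_batch, ids)
```

Here the routed student matches every per-sample target, whereas the averaged student incurs a nonzero loss on every sample because it matches neither teacher.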

On-Policy Trajectory Querying

Traditional distillation processes are off-policy: they train the student using static paths generated by the teacher. When the student runs independently, however, small errors accumulate, leading to path drift.

DanceOPD addresses this drift by querying the teacher fields on the student's own generated rollout states. This on-policy supervision ensures that the student is trained on the states it actually visits, allowing the model to correct path deviations and produce sharper details in fewer inference steps.
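The idea can be sketched as follows, assuming an Euler rollout and a squared-error objective. The names `student_rollout` and `on_policy_loss` and the toy velocity fields are hypothetical. The sketch only computes loss values; how gradients are handled through the rollout is not specified in the text and is left out.

```python
import numpy as np

def student_rollout(student, z0, num_steps):
    """Roll the student forward with Euler steps; return the visited states and times."""
    dt = 1.0 / num_steps
    z = np.array(z0, dtype=float)
    states, times = [], []
    for i in range(num_steps):
        t = i * dt
        states.append(z.copy())
        times.append(t)
        z = z + dt * student(z, t)
    return np.stack(states), np.array(times)

def on_policy_loss(student, teacher, z0, num_steps=4):
    """Compare student and teacher velocities at states the student itself visited."""
    states, times = student_rollout(student, z0, num_steps)
    errors = [np.sum((student(z, t) - teacher(z, t)) ** 2)
              for z, t in zip(states, times)]
    return float(np.mean(errors))

# Toy fields (assumption): the teacher points toward TARGET, the student is slightly biased.
TARGET = np.array([1.0, 0.0])
teacher = lambda z, t: TARGET - z
biased_student = lambda z, t: TARGET - z + 0.1

z_start = np.array([0.3, 0.5])
loss_biased = on_policy_loss(biased_student, teacher, z_start)
loss_matched = on_policy_loss(teacher, teacher, z_start)
```

An off-policy variant would evaluate the same error at states taken from `student_rollout(teacher, z_start, 4)` instead, which are not the states the biased student reaches on its own.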

Semantic-Side Single Query

To minimize the computation required for on-policy training, DanceOPD performs only a single teacher query per generation run, placed in the semantic portion of the trajectory. The early steps of the flow-matching path determine the layout and composition of the image, while the later steps focus on detail refinement.

By limiting teacher queries to the semantic phase, we obtain the necessary trajectory guidance while avoiding redundant teacher evaluations during detail generation. This single-query optimization reduces training time and memory requirements, making the distillation framework practical for large-scale setups.
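A rough sketch of single-query supervision, with several assumptions labeled: $t = 0$ is taken to be the noise end of the trajectory, `semantic_fraction` is a hypothetical parameter marking how much of the early trajectory counts as the semantic phase, the query step is drawn uniformly at random from that phase, and the rollout stops once the query is made. None of these choices is stated in the text.

```python
import numpy as np

def single_semantic_query_loss(student, teacher, z0, num_steps=8,
                               semantic_fraction=0.5, rng=None):
    """Roll the student to one early step and query the teacher exactly once there."""
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / num_steps
    n_semantic = max(1, int(num_steps * semantic_fraction))  # early steps only
    k = int(rng.integers(0, n_semantic))     # index of the single query step
    z = np.array(z0, dtype=float)
    for i in range(k):                       # student-only rollout up to step k
        z = z + dt * student(z, i * dt)
    t = k * dt
    diff = student(z, t) - teacher(z, t)     # the only teacher evaluation
    return float(np.sum(diff ** 2)), k

# Toy fields (assumption) with a counter to confirm the teacher is called once.
TARGET = np.array([1.0, 0.0])
calls = {"n": 0}

def counting_teacher(z, t):
    calls["n"] += 1
    return TARGET - z

student = lambda z, t: TARGET - z + 0.1

loss, query_step = single_semantic_query_loss(
    student, counting_teacher, np.array([0.3, 0.5]),
    rng=np.random.default_rng(0),
)
```

Compared with querying the teacher at every rollout step, this variant evaluates the teacher once per run, so teacher cost does not grow with `num_steps`.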