Evaluation & Benchmarks

DanceOPD is evaluated on GEditBench, a benchmark designed to test multi-capability image generation and editing models. The results show substantial gains on composition tasks, with text-to-image quality that stays close to that of dedicated single-task models.

GEditBench Results

When a single model is trained on several image generation tasks at once, the objectives tend to interfere with one another, so that gains in one capability degrade another. This capability interference is measured by comparing student models trained on combined objectives against dedicated single-task models. DanceOPD reduces the interference and reports the following gains:

Local and Global Composition: A 16.1% improvement over the best competing multi-task distillation baselines.
Text-to-Image and Editing: An 8.1% improvement compared to naive on-policy distillation setups.
Fidelity Retention: The text-to-image anchor generation quality remains within 0.1% of dedicated single-task models.

Comparative Analysis

Training Methodology	T2I FID (Lower is Better)	Local Editing (CLIP Score)	Global Editing (LPIPS)	Overall Assessment
Single-Task Teacher	16.10	82.4%	0.125	High (Inference-Heavy)
Naive Student (Average Gradients)	19.85	72.1%	0.180	Mediocre (Capability Interference)
Off-Policy Distillation	18.42	74.2%	0.155	Standard (Covariate Shift)
DanceOPD (On-Policy)	16.82	89.4%	0.118	Excellent (Distilled)
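
The relative gaps between rows can be read off the table with a few lines of arithmetic. The sketch below is illustrative only: the numbers are copied from the table, while the dictionary keys and the function names `relative_change` and `compare` are hypothetical. It reports signed changes and makes no claim about which direction is better for each metric beyond what the table headers state.

```python
# Values copied from the comparison table above.
# Keys ("fid", "clip", "lpips") are illustrative names, not part of any API.
RESULTS = {
    "Single-Task Teacher": {"fid": 16.10, "clip": 82.4, "lpips": 0.125},
    "Naive Student": {"fid": 19.85, "clip": 72.1, "lpips": 0.180},
    "Off-Policy Distillation": {"fid": 18.42, "clip": 74.2, "lpips": 0.155},
    "DanceOPD": {"fid": 16.82, "clip": 89.4, "lpips": 0.118},
}


def relative_change(new: float, old: float) -> float:
    """Signed change of `new` relative to `old`, in percent."""
    return 100.0 * (new - old) / old


def compare(method: str, baseline: str) -> dict:
    """Per-metric relative change of `method` against `baseline`, rounded to 0.1."""
    return {
        metric: round(relative_change(RESULTS[method][metric], RESULTS[baseline][metric]), 1)
        for metric in RESULTS[method]
    }


if __name__ == "__main__":
    for baseline in ("Single-Task Teacher", "Naive Student", "Off-Policy Distillation"):
        # Negative values mean the DanceOPD number is lower than the baseline's.
        print(baseline, compare("DanceOPD", baseline))
```

These table-derived figures are separate from the headline GEditBench percentages listed earlier, whose exact computation this section does not specify.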

Analysis of Image Fidelity

The primary challenge in combining image editing with text-to-image synthesis is that training on editing tends to restrict the model's capacity to generate diverse structures from scratch. The naive student shows this in its FID (Fréchet Inception Distance), which rises to 19.85 against 16.10 for the single-task teacher, indicating reduced image diversity and quality.
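
For readers unfamiliar with the metric, FID fits a Gaussian to the Inception features of real images and another to those of generated images, then measures the Fréchet distance between the two. The sketch below shows only that final distance step, under a simplifying assumption that both covariances are diagonal; the full metric uses complete covariance matrices and a matrix square root. The function name `frechet_distance_diag` is hypothetical, and feature extraction is omitted entirely.

```python
import math
from typing import Sequence


def frechet_distance_diag(
    mu1: Sequence[float],
    var1: Sequence[float],
    mu2: Sequence[float],
    var2: Sequence[float],
) -> float:
    """Fréchet distance between N(mu1, diag(var1)) and N(mu2, diag(var2)).

    General form: ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)).
    With diagonal covariances the trace term reduces to
    sum_i (sqrt(var1_i) - sqrt(var2_i))^2.
    """
    if not (len(mu1) == len(var1) == len(mu2) == len(var2)):
        raise ValueError("all inputs must have the same length")
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


if __name__ == "__main__":
    # Identical distributions are at distance zero.
    print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
    # Shifting the mean or changing the spread both increase the distance.
    print(frechet_distance_diag([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0]))
```

The distance grows when the generated features drift away from the real ones (the mean term) or when their spread differs (the covariance term), which is why a higher FID is read as a loss of quality or diversity.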

By isolating optimization paths through sample routing, DanceOPD preserves diversity. Its FID of 16.82 approaches the 16.10 of the specialized single-task teacher while requiring only a fraction of the computational footprint, which makes the model practical for applications that demand both high-fidelity synthesis and precise layout adjustments.
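
This section does not describe how sample routing is implemented, so the following is only a toy illustration of the general idea, not DanceOPD's actual procedure. It assumes each sample carries a task tag and is supervised only by the teacher for that task, and contrasts this with a naive setup in which every teacher supervises every sample. The teachers are hypothetical scalar functions standing in for generative models, and all names (`TEACHERS`, `routed_targets`, `averaged_targets`, `mse`) are invented for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins for task-specific teachers.
# Real teachers would be image models, not scalar functions.
TEACHERS: Dict[str, Callable[[float], float]] = {
    "t2i": lambda x: 2.0 * x,
    "edit": lambda x: -1.0 * x,
}

Batch = List[Tuple[str, float]]  # (task tag, input) pairs


def routed_targets(batch: Batch) -> List[float]:
    """Each sample is supervised only by the teacher matching its task tag."""
    return [TEACHERS[task](x) for task, x in batch]


def averaged_targets(batch: Batch) -> List[float]:
    """Naive alternative: every teacher supervises every sample; targets are averaged."""
    return [sum(t(x) for t in TEACHERS.values()) / len(TEACHERS) for _, x in batch]


def mse(preds: List[float], targets: List[float]) -> float:
    """Mean squared error between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    batch = [("t2i", 1.0), ("edit", 2.0)]
    # A student that already matches the correct teacher on each task.
    student_preds = [2.0, -2.0]
    print("routed loss:  ", mse(student_preds, routed_targets(batch)))
    print("averaged loss:", mse(student_preds, averaged_targets(batch)))
```

In this toy setting, a student that matches each task's teacher incurs no loss under routing but is still penalized under averaging, because the averaged target mixes in supervision from the wrong task. That mismatch is one simple way to picture the capability interference discussed above.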