Evaluation & Benchmarks
DanceOPD is evaluated on GEditBench, a benchmark for multi-capability image generation and editing models. The results show substantial gains on composition tasks while text-to-image generation quality stays close to that of single-task models.
GEditBench Results
In multi-task image generation training, standard models suffer from capability interference, where optimizing for one capability degrades another. This degradation is measured by comparing student models trained on combined objectives against a single-task teacher. DanceOPD addresses this interference, with the following reported gains:
- Local and Global Composition: A 16.1% improvement over the best competing multi-task distillation baselines.
- Text-to-Image and Editing: An 8.1% improvement compared to naive on-policy distillation setups.
- Fidelity Retention: The text-to-image anchor generation quality remains within 0.1% of dedicated single-task models.
Comparative Analysis
| Training Methodology | T2I FID (Lower is Better) | Local Editing (CLIP Score, Higher is Better) | Global Editing (LPIPS, Lower is Better) | Overall Assessment |
|---|---|---|---|---|
| Single-Task Teacher | 16.10 | 82.4% | 0.125 | High (Inference-Heavy) |
| Naive Student (Average Gradients) | 19.85 | 72.1% | 0.180 | Mediocre (Capability Interference) |
| Off-Policy Distillation | 18.42 | 74.2% | 0.155 | Standard (Covariate Shift) |
| DanceOPD (On-Policy) | 16.82 | 89.4% | 0.118 | Excellent (Distilled) |
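The gap between each method and the single-task teacher can be read off the table with a short script. The sketch below is illustrative only: the dictionary copies the table values verbatim, and the names `RESULTS`, `relative_change`, and `gaps_to_teacher` are hypothetical helpers, not part of DanceOPD or GEditBench.

```python
# Values copied from the comparison table; nothing here is newly measured.
# CLIP scores are kept in the percent units the table uses.
RESULTS = {
    "Single-Task Teacher": {"fid": 16.10, "clip": 82.4, "lpips": 0.125},
    "Naive Student (Average Gradients)": {"fid": 19.85, "clip": 72.1, "lpips": 0.180},
    "Off-Policy Distillation": {"fid": 18.42, "clip": 74.2, "lpips": 0.155},
    "DanceOPD (On-Policy)": {"fid": 16.82, "clip": 89.4, "lpips": 0.118},
}


def relative_change(value, reference):
    """Signed percentage change of `value` relative to `reference`."""
    return 100.0 * (value - reference) / reference


def gaps_to_teacher(results, teacher="Single-Task Teacher"):
    """Per-metric percentage gap of every non-teacher row to the teacher row."""
    ref = results[teacher]
    return {
        name: {m: round(relative_change(row[m], ref[m]), 2) for m in ref}
        for name, row in results.items()
        if name != teacher
    }


if __name__ == "__main__":
    for name, gaps in gaps_to_teacher(RESULTS).items():
        print(name, gaps)
```

When reading the output, keep the metric direction in mind: a positive gap is worse for FID and LPIPS (lower is better) and better for CLIP score (higher is better).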
Analysis of Image Fidelity
The primary challenge in combining image editing with text-to-image synthesis is that editing objectives tend to restrict the model's capacity to generate diverse structures from scratch. Naive distillation shows this in its FID (Fréchet Inception Distance): the naive student scores 19.85 against 16.10 for the single-task teacher, and a higher FID indicates reduced image diversity and quality.
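For reference, FID is conventionally defined as the Fréchet distance between two Gaussians fitted to Inception features of real and generated images. The symbols below are the standard ones and are not taken from this document: $\mu_r, \Sigma_r$ are the feature mean and covariance for real images, and $\mu_g, \Sigma_g$ are those for generated images.

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The first term penalizes a shift in the average feature and the second a mismatch in feature spread, which is why a model that loses diversity tends to score a higher FID.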
By isolating optimization paths through sample routing, DanceOPD maintains high diversity. Its FID of 16.82 approaches the 16.10 of the specialized single-task teacher while requiring only a fraction of the computational footprint. This makes the model practical for applications that demand both high-fidelity synthesis and precise layout adjustments.
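One plausible reading of sample routing is that each training sample is compared only against the teacher for its own task, so that per-task losses are kept separate rather than averaged together. The toy sketch below illustrates that idea with scalar stand-ins for models. It is an assumption made for illustration, not DanceOPD's documented procedure: the function names, the `task` and `x` fields, and the squared-error loss are all hypothetical.

```python
from collections import defaultdict


def route_by_task(batch):
    """Group samples by task tag (hypothetical field name: "task")."""
    groups = defaultdict(list)
    for sample in batch:
        groups[sample["task"]].append(sample)
    return dict(groups)


def routed_losses(batch, teachers, student):
    """Return one mean squared-error loss per task.

    Each sample is scored only against the teacher registered for its task,
    so the loss for one task never mixes in samples from another.
    """
    losses = {}
    for task, samples in route_by_task(batch).items():
        teacher = teachers[task]
        errors = [(student(s["x"]) - teacher(s["x"])) ** 2 for s in samples]
        losses[task] = sum(errors) / len(errors)
    return losses


if __name__ == "__main__":
    # Scalar functions stand in for real generative models in this toy example.
    teachers = {"t2i": lambda x: 2 * x, "edit": lambda x: x + 1}
    student = lambda x: x
    batch = [
        {"task": "t2i", "x": 1.0},
        {"task": "edit", "x": 2.0},
        {"task": "t2i", "x": 3.0},
    ]
    print(routed_losses(batch, teachers, student))
```

Keeping a separate loss per task means each one can be weighted or back-propagated independently, in contrast to the averaged-gradient baseline listed in the comparison table.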