DINO Perceptual Loss

A modern alternative to LPIPS for training image autoencoders

Philippe Hansen-Estruch · Vivek Ramanujan

TL;DR: We replace LPIPS (VGG-based perceptual loss) with DINOv2/v3 features for training image autoencoders. DINO loss achieves up to 17× better rFID and 10× better rFDD than pixel-only training, and performs comparably to LPIPS despite having no human-aligned supervision. Unlike StyleGAN-T (DINO as discriminator) or RAE (frozen DINO encoder), we use DINO purely as a loss function: simpler, no adversarial training, no architectural constraints.

Autoencoders and Latent Diffusion

Modern image generation pipelines like Stable Diffusion and FLUX operate in a compressed latent space rather than directly on pixels. An autoencoder first compresses images into a lower-dimensional representation, and a diffusion model then learns to generate in this latent space. This approach, introduced by Rombach et al., makes high-resolution generation computationally tractable.

The autoencoder's job is to compress images while preserving enough information for high-quality reconstruction. The compression ratio is typically expressed as f×c, where f is the spatial downsampling factor and c is the number of latent channels. For example, SD-VAE is an f8×4 autoencoder: a 256×256×3 image becomes a 32×32×4 latent.

This creates a fundamental tension between reconstruction quality (rFID) and generation quality (gFID).

The optimal compression balances this trade-off: aggressive enough that diffusion models can learn efficiently, but not so aggressive that reconstruction quality bottlenecks generation. This is why perceptual losses matter—they help preserve the right information under compression.

The Perception-Distortion Trade-off

The standard autoencoder training recipe combines several terms:

L = L_pixel + α · L_perceptual + β · L_GAN

Each term creates a different trade-off, as explored in ViTok (with ViTok-v2 coming soon).

The core trade-off: pixel losses achieve the best PSNR/SSIM but the worst perceptual quality (FID/FDD), while adding perceptual or adversarial losses improves perceptual metrics at the cost of distortion metrics. The goal is a loss combination that sits closer to the Pareto frontier.

Why Replace LPIPS with DINO?

LPIPS is built on VGG-16, a 2014 classification network trained on 1.2M labeled ImageNet images. DINOv2 and DINOv3 are Meta's self-supervised vision transformers: they are trained on far larger datasets without any human labels, and their features carry richer semantic structure than VGG's, making them a natural candidate for a perceptual loss.


Experimental Setup

Evaluation Metrics

We report two perceptual metrics, rFID (reconstruction FID) and rFDD (reconstruction Fréchet DINO Distance), alongside two distortion metrics, PSNR and SSIM.

Computing FDD

FDD works exactly like FID, but uses DINO CLS tokens instead of Inception features:

  1. Extract DINO CLS token features from real images → (N, 768)
  2. Extract DINO CLS token features from reconstructed/generated images → (M, 768)
  3. Compute Fréchet distance between the two Gaussian distributions

import numpy as np
import torch
from scipy import linalg
from dino_perceptual import DINOModel

def compute_fdd(real_features, fake_features):
    # Fit a Gaussian to each feature set and compute the Fréchet distance between them.
    mu1, sigma1 = real_features.mean(0), np.cov(real_features, rowvar=False)
    mu2, sigma2 = fake_features.mean(0), np.cov(fake_features, rowvar=False)
    diff = mu1 - mu2
    # Matrix square root of the covariance product; tiny imaginary parts are numerical noise.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)

# Usage: extract DINO CLS-token features for both image sets, then compare distributions.
extractor = DINOModel(model_size="B").cuda().eval()
with torch.no_grad():
    real_feats, _ = extractor(real_images)
    fake_feats, _ = extractor(generated_images)
fdd = compute_fdd(real_feats.cpu().numpy(), fake_feats.cpu().numpy())

The figure below shows how to interpret FDD scores:

FDD score interpretation: same distribution ~0, good reconstruction 1-5, poor reconstruction >5

Interpretation: FDD <1 indicates excellent reconstruction quality (nearly identical distributions), 1-5 is good, and >5 suggests significant perceptual differences. Unlike pixel metrics, FDD captures semantic similarity—two images can have low FDD despite pixel-level differences if they share the same high-level structure.

Dataset

ImageNet 256×256 (center crop) validation set with 50K images. All baseline results (SD-VAE, SDXL-VAE, Qwen VAE, FLUX.1, FLUX.2) are reproduced by us on the same benchmark for a fair comparison.

Model

We train a Vision Transformer autoencoder following the ViTok architecture: a shallow ViT encoder compresses images into latent tokens, and a deeper ViT decoder reconstructs the output. We use f16 spatial compression (256 tokens for 256×256 images) with 64 latent channels.
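
To make the bookkeeping concrete, here is a back-of-envelope sketch (illustrative only, not part of the training code) of how the token count and the pixels-per-latent-channel ratio used in the comparison table follow from f and c:

# Back-of-envelope bookkeeping for the f16 x 64-channel model (illustrative only).
H = W = 256                    # input resolution
f, c = 16, 64                  # spatial downsampling factor, latent channels

tokens = (H // f) * (W // f)   # 16 * 16 = 256 latent tokens
ratio = 3 * f * f / c          # 3 * 256 / 64 = 12 pixel values per latent channel

print(tokens, ratio)           # 256 12.0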

DINO Loss Details

For the perceptual loss, we use a frozen DINOv3-B (ViT-Base) model. We extract features from intermediate transformer layers, L2-normalize per token, and compute MSE between input and reconstruction features. We did not ablate other DINO model sizes (S, L, G, H)—larger models may yield further improvements.
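
The repository contains the exact implementation; the sketch below just illustrates the idea and makes a few assumptions: it loads DINOv2-B from torch.hub (whereas the model described above is DINOv3-B) and takes patch-token features from the last four blocks via get_intermediate_layers; the layer selection, input resizing, and normalization may differ from the released code.

import torch
import torch.nn.functional as F

# Sketch: frozen DINOv2-B from torch.hub as the feature extractor (assumption;
# the model described in the text is DINOv3-B).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in backbone.parameters():
    p.requires_grad_(False)

def dino_perceptual_loss(x, x_hat, n_layers=4):
    # x, x_hat: (B, 3, H, W), ImageNet-normalized, H and W divisible by the patch size (14).
    feats_real = backbone.get_intermediate_layers(x, n=n_layers)
    feats_rec = backbone.get_intermediate_layers(x_hat, n=n_layers)
    loss = 0.0
    for fr, fc in zip(feats_real, feats_rec):
        # L2-normalize each token, then MSE between input and reconstruction features.
        loss = loss + F.mse_loss(F.normalize(fc, dim=-1), F.normalize(fr, dim=-1))
    return loss / n_layers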

The full loss for the final model combines: Charbonnier + SSIM (γ=0.1) + DINO (α=250). No adversarial training is used.
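
As a rough sketch of how these terms could be combined (the Charbonnier epsilon and the SSIM implementation are assumptions; any SSIM-based loss that returns 1 - SSIM works), reusing dino_perceptual_loss from the sketch above:

import torch

def charbonnier(x, y, eps=1e-3):
    # Smooth L1-style pixel loss; eps is an assumed value, not taken from the text.
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()

def total_loss(x, x_hat, ssim_loss_fn, gamma=0.1, alpha=250.0):
    # ssim_loss_fn should return a loss (e.g. 1 - SSIM); gamma and alpha follow the text.
    return (charbonnier(x_hat, x)
            + gamma * ssim_loss_fn(x_hat, x)
            + alpha * dino_perceptual_loss(x, x_hat))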

DINO Loss Behavior

A key property of any perceptual loss is that it should increase monotonically with perceptual degradation. We verified this by measuring DINO loss under three common distortions:

DINO loss vs blur, noise, and JPEG compression

The plots show DINO loss (DINOv2-B) for a sample image degraded with increasing levels of blur (σ=0→8), noise (σ=0→100), and JPEG compression (quality 100→5). In all cases, the loss increases monotonically—confirming that DINO features capture perceptual degradation in a well-behaved manner suitable for gradient-based optimization.
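
A quick way to reproduce this kind of sanity check (a sketch, not the exact sweep behind the figure) is to apply one distortion at increasing strength, e.g. Gaussian blur, and confirm the loss grows with it:

from torchvision.transforms.functional import gaussian_blur

def dino_loss_vs_blur(image, sigmas=(0.5, 1.0, 2.0, 4.0, 8.0)):
    # image: (1, 3, H, W) tensor, preprocessed the way the DINO backbone expects.
    losses = []
    for s in sigmas:
        k = int(4 * s) | 1  # odd kernel size roughly matched to sigma
        blurred = gaussian_blur(image, kernel_size=k, sigma=s)
        losses.append(dino_perceptual_loss(image, blurred).item())
    return losses  # expected to increase monotonically with sigma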

Loss Scaling Guide

The DINO loss magnitude is typically 10⁻⁴ to 10⁻² for natural images. Scale it to balance with your pixel loss:

| Pixel Loss Type | Recommended DINO Weight (α) |
|---|---|
| L1 / L2 | 250–500 |
| Charbonnier | 250–1000 |
| + SSIM loss | 250 (SSIM provides structure) |

Rule of thumb: Start with α=250, increase to α=1000 for better perceptual quality at the cost of ~1 dB PSNR.


Results

Loss Ablation

Effect of adding different perceptual losses to the base Charbonnier pixel loss:

| Loss Configuration | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| Pixel-only (Charb) | 5.13 | 10.96 | 34.81 | 0.929 |
| + SSIM (γ=0.1) | 4.87 (↓5%) | 10.42 (↓5%) | 34.72 | 0.931 |
| + LPIPS (α=0.1) | 0.72 (↓86%) | 2.93 (↓73%) | 34.19 (↓0.6 dB) | 0.923 |
| + DINO (α=250) | 0.51 (↓90%) | 1.45 (↓87%) | 33.93 (↓0.9 dB) | 0.919 |
| + DINO (α=1000) | 0.30 (↓94%) | 1.12 (↓90%) | 33.64 (↓1.2 dB) | 0.914 |
| + LPIPS + DINO (α=0.1, 250) | 0.38 (↓93%) | 1.35 (↓88%) | 33.89 (↓0.9 dB) | 0.918 |

Key finding: DINO achieves 17× better rFID (0.30 vs 5.13) and 10× better rFDD (1.12 vs 10.96) compared to pixel-only training, at a cost of ~1 dB PSNR. Interestingly, DINO performs comparably to LPIPS as a perceptual loss despite having no human-aligned supervision—DINO's self-supervised features appear to capture similar perceptual structure. Combining LPIPS with DINO does not improve over DINO alone, suggesting the two losses capture overlapping information.

Based on these ablations, we use Charbonnier + SSIM + DINO as our final loss configuration and train models with 16, 32, and 64 latent channels.

Perception-Distortion Visualization

The scatter plots below visualize the trade-off between perceptual metrics (rFID, rFDD) and distortion metrics (PSNR, SSIM). The green line shows a potential Pareto frontier given the limited data points.

Scatter plots showing rFID vs SSIM, rFID vs PSNR, rFDD vs PSNR, and rFDD vs SSIM

With more models and configurations, the true frontier would likely shift. These plots illustrate the general trade-off pattern rather than definitive optimal points.

Comparison with State-of-the-Art

Using Charbonnier + SSIM + DINO loss, we train ViT autoencoders at three compression levels. Reconstruction quality on ImageNet 256×256:

| Model | Compression | Ratio | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| SD-VAE | f8×4ch | 48 | 0.73 | 6.14 | 25.70 | 0.702 |
| SDXL-VAE | f8×4ch | 48 | 0.68 | — | 26.04 | 0.834 |
| Qwen VAE | f8×16ch | 12 | 1.32 | 7.36 | 30.27 | 0.860 |
| FLUX.1 | f8×16ch | 12 | **0.15** | **2.29** | 31.10 | 0.887 |
| FLUX.2 | f8×16ch | 12 | *0.27* | — | *31.46* | *0.904* |
| ViTok-v2 16×16 | f16×16ch | 48 | 1.52 | 3.66 | 28.46 | 0.793 |
| ViTok-v2 16×32 | f16×32ch | 24 | 1.26 | 2.94 | 31.23 | 0.867 |
| ViTok-v2 16×64 | f16×64ch | 12 | 0.74 | *2.49* | **34.16** | **0.924** |

Bold = best, italic = second best. Ratio = pixels per latent channel (3×f²/c). ViTok-v2 uses f16 compression (256 tokens) vs f8 methods (1024 tokens), enabling 4× faster diffusion training.

No GAN required: ViTok-v2 achieves state-of-the-art reconstruction quality using only pixel-space (Charbonnier + SSIM) and DINO losses. No adversarial training is needed, which makes for a simpler pipeline and more stable training.

Related Work

The idea of using DINO features for image synthesis has been explored in several concurrent and prior works: StyleGAN-T and Adversarial Diffusion Distillation use DINO as a frozen discriminator backbone for adversarial training, Yao et al. align the autoencoder's latent space with DINO features, and RAE (Zheng et al.) replaces the learned encoder with a frozen DINO encoder.

Our approach differs in that we use DINO purely as a perceptual loss, comparing features between input and reconstruction, rather than as a discriminator, latent alignment target, or encoder replacement. This is simpler (no adversarial training, no frozen-encoder constraints) while achieving comparable improvements in perceptual quality.

Conclusion

DINO perceptual loss is a simple drop-in replacement for LPIPS that leverages modern self-supervised features. By using DINOv2/v3 instead of VGG, we achieve roughly 2× better rFID and rFDD than LPIPS while eliminating the need for adversarial training.

Code: github.com/Na-VAE/dino_perceptual

Citation

If you find this code helpful, please cite:

@software{dino_perceptual,
  title={DINO Perceptual Loss},
  author={Hansen-Estruch, Philippe and Chen, Jiahui and Ramanujan, Vivek and Zohar, Orr and Ping, Yan and Sinha, Animesh and Georgopoulos, Markos and Schoenfeld, Edgar and Hou, Ji and Juefei-Xu, Felix and Vishwanath, Sriram and Thabet, Ali},
  year={2025},
  url={https://github.com/Na-VAE/dino_perceptual}
}

@article{vitok_v2,
  title={ViTok-v2: Scaling Visual Tokenizers},
  author={Hansen-Estruch, Philippe and others},
  year={2025},
  note={Coming soon! Please check back for the official citation.}
}

References

  1. Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. arXiv
  2. Zhang et al. "The Unreasonable Effectiveness of Deep Features as a Perceptual Metric." CVPR 2018. arXiv
  3. Oquab et al. "DINOv2: Learning Robust Visual Features without Supervision." TMLR 2024. arXiv
  4. Hansen-Estruch et al. "Learnings from Scaling Visual Tokenizers for Reconstruction and Generation." ICML 2025. arXiv
  5. Hansen-Estruch et al. "ViTok-v2: Scaling Visual Tokenizers." 2025. (Coming soon!)
  6. Sauer et al. "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis." ICML 2023. arXiv
  7. Sauer et al. "Adversarial Diffusion Distillation." ECCV 2024. arXiv
  8. Yao et al. "Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models." CVPR 2025. arXiv
  9. Zheng et al. "Diffusion Transformers with Representation Autoencoders." 2024. arXiv