A modern alternative to LPIPS for training image autoencoders
Modern image generation pipelines like Stable Diffusion and FLUX operate in a compressed latent space rather than directly on pixels. An autoencoder first compresses images into a lower-dimensional representation, and a diffusion model then learns to generate in this latent space. This approach, introduced by Rombach et al., makes high-resolution generation computationally tractable.
The autoencoder's job is to compress images while preserving enough information for high-quality reconstruction. The compression ratio is typically expressed as f×c, where f is the spatial downsampling factor and c is the number of latent channels.
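For a concrete sense of the numbers, here is a quick back-of-the-envelope calculation; it matches the per-channel ratio 3·f²/c used in the results table later in this post:

```python
# Worked example: an f8 x 4ch autoencoder (the SD-VAE configuration) maps a
# 256x256x3 image to a 32x32x4 latent.
H = W = 256
f, c = 8, 4
pixels = H * W * 3                # 196,608 input values
latent = (H // f) * (W // f) * c  # 32 * 32 * 4 = 4,096 latent values
print(pixels / latent)            # 48.0, i.e. 3 * f**2 / c pixels per latent channel
```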
Choosing this ratio creates a fundamental tension between reconstruction quality (rFID) and generation quality (gFID): heavier compression gives the diffusion model a smaller, easier-to-model latent space but discards detail the decoder can never recover, while lighter compression preserves detail but makes the latent space harder to generate in.
The optimal compression balances this trade-off: aggressive enough that diffusion models can learn efficiently, but not so aggressive that reconstruction quality bottlenecks generation. This is why perceptual losses matter—they help preserve the right information under compression.
The standard autoencoder training objective combines several terms:
L = L_pixel + α · L_perceptual + β · L_GAN
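As a rough sketch of how the three terms are combined in training code (the names below, such as `perceptual_net` and `discriminator`, are illustrative placeholders rather than this project's API):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_hat, perceptual_net, discriminator=None,
                     alpha=0.1, beta=0.5):
    """Illustrative three-term objective: pixel + perceptual + (optional) GAN."""
    # Pixel reconstruction term (L1 here; L2 or Charbonnier are common variants).
    l_pixel = F.l1_loss(x_hat, x)

    # Perceptual term: feature distance measured by a frozen network.
    with torch.no_grad():
        feats_real = perceptual_net(x)
    feats_fake = perceptual_net(x_hat)
    l_perceptual = F.mse_loss(feats_fake, feats_real)

    # Optional adversarial term (simple non-saturating generator loss).
    l_gan = -discriminator(x_hat).mean() if discriminator is not None else 0.0

    return l_pixel + alpha * l_perceptual + beta * l_gan
```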
Each term creates a different trade-off, as explored in ViTok (with ViTok-v2 coming soon).
LPIPS uses VGG-16, a 2014 classification network trained on 1.2M ImageNet images. DINOv2 and DINOv3 are Meta's self-supervised vision models, and they offer several advantages as a feature backbone: they are trained without labels on far larger datasets, use modern ViT architectures, and produce features that capture semantic structure rather than classification-specific cues.
FDD (Fréchet DINO Distance) works exactly like FID, but uses DINO CLS tokens instead of Inception features:
```python
import numpy as np
from scipy import linalg

from dino_perceptual import DINOModel


def compute_fdd(real_features, fake_features):
    """Fréchet distance between two sets of DINO CLS features (same formula as FID)."""
    # Gaussian statistics of each feature set: mean and covariance.
    mu1, sigma1 = real_features.mean(0), np.cov(real_features, rowvar=False)
    mu2, sigma2 = fake_features.mean(0), np.cov(fake_features, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can introduce
    # a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)


# Usage
extractor = DINOModel(model_size="B").cuda().eval()
real_feats, _ = extractor(real_images)
fake_feats, _ = extractor(generated_images)
fdd = compute_fdd(real_feats.cpu().numpy(), fake_feats.cpu().numpy())
```
The figure below shows how to interpret FDD scores:
Interpretation: FDD < 1 indicates excellent reconstruction quality (nearly identical feature distributions), 1-5 is good, and > 5 suggests significant perceptual differences. Unlike pixel metrics, FDD captures semantic similarity: two sets of images can have low FDD despite pixel-level differences if they share the same high-level structure.
All metrics are computed on the ImageNet 256×256 (center crop) validation set with 50K images. Baseline results (SD-VAE, Qwen VAE, FLUX.1) are reproduced by us on the same benchmark for a fair comparison.
We train a Vision Transformer autoencoder following the ViTok architecture: a shallow ViT encoder compresses images into latent tokens, and a deeper ViT decoder reconstructs the output. We use f16 spatial compression (256 tokens for 256×256 images) with 64 latent channels.
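Concretely, the shapes at f16 with 64 latent channels look like this (a sanity-check sketch, not the actual ViTok code):

```python
import torch

# A 256x256 RGB image patchified with a 16x16 patch size yields a 16x16 token grid.
B, H, W, f, c = 1, 256, 256, 16, 64
num_tokens = (H // f) * (W // f)        # 16 * 16 = 256 latent tokens
latent = torch.randn(B, num_tokens, c)  # encoder output: (1, 256, 64)
print(latent.shape, (H * W * 3) / latent.numel())  # 12 pixels per latent value
```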
For the perceptual loss, we use a frozen DINOv3-B (ViT-Base) model. We extract features from intermediate transformer layers, L2-normalize per token, and compute MSE between input and reconstruction features. We did not ablate other DINO model sizes (S, L, G, H)—larger models may yield further improvements.
The full loss for the final model combines: Charbonnier + SSIM (γ=0.1) + DINO (α=250). No adversarial training is used.
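A minimal sketch of this feature-space loss is shown below. The real implementation lives in the linked dino_perceptual repo and uses DINOv3-B; here we load DINOv2 ViT-B/14 from `torch.hub` for convenience, and the choice of intermediate layers is illustrative, not the configuration behind the reported results.

```python
import torch
import torch.nn.functional as F

# Frozen DINO backbone (DINOv2 ViT-B/14 via torch.hub; swap in DINOv3-B if available).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in dino.parameters():
    p.requires_grad_(False)

def dino_perceptual_loss(x, x_hat, layers=(5, 8, 11)):
    """MSE between L2-normalized per-token features of input and reconstruction.

    x and x_hat are ImageNet-normalized RGB batches; `layers` is an illustrative
    choice of intermediate transformer blocks.
    """
    # DINOv2 requires spatial sizes divisible by its 14-pixel patch size.
    x = F.interpolate(x, size=(224, 224), mode="bilinear", antialias=True)
    x_hat = F.interpolate(x_hat, size=(224, 224), mode="bilinear", antialias=True)

    feats_real = dino.get_intermediate_layers(x, n=layers)      # tuple of (B, N, D)
    feats_fake = dino.get_intermediate_layers(x_hat, n=layers)

    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        fr = F.normalize(fr, dim=-1)  # L2-normalize each token
        ff = F.normalize(ff, dim=-1)
        loss = loss + F.mse_loss(ff, fr)
    return loss / len(feats_real)
```

The final objective then adds this term with α=250 on top of the Charbonnier and SSIM losses described above.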
A key property of any perceptual loss is that it should increase monotonically with perceptual degradation. We verified this by measuring DINO loss under three common distortions:
The plots show DINO loss (DINOv2-B) for a sample image degraded with increasing levels of blur (σ=0→8), noise (σ=0→100), and JPEG compression (quality 100→5). In all cases, the loss increases monotonically—confirming that DINO features capture perceptual degradation in a well-behaved manner suitable for gradient-based optimization.
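A sweep like the blur test can be scripted in a few lines, reusing the `dino_perceptual_loss` sketch above (noise and JPEG sweeps follow the same pattern); this is a suggested sanity check, not the exact script behind the plots:

```python
import torchvision.transforms.functional as TF

def blur_sweep(x, sigmas=(0.5, 1.0, 2.0, 4.0, 8.0)):
    """DINO loss under increasingly strong Gaussian blur; values should be increasing."""
    losses = []
    for sigma in sigmas:
        k = 2 * int(round(3 * sigma)) + 1  # odd kernel covering roughly 3 sigma
        x_blur = TF.gaussian_blur(x, kernel_size=k, sigma=sigma)
        losses.append(dino_perceptual_loss(x, x_blur).item())
    return losses
```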
The DINO loss magnitude is typically 10⁻⁴ to 10⁻² for natural images. Scale it to balance with your pixel loss:
| Pixel Loss Type | Recommended DINO Weight (α) |
|---|---|
| L1 / L2 | 250 - 500 |
| Charbonnier | 250 - 1000 |
| + SSIM loss | 250 (SSIM provides structure) |
Rule of thumb: Start with α=250, increase to α=1000 for better perceptual quality at the cost of ~1 dB PSNR.
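One practical way to sanity-check the weight is to compare raw magnitudes on a training batch before committing to a value. The helper below reuses `dino_perceptual_loss` from earlier and is a heuristic sketch, not the procedure used for the ablations:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inspect_loss_scales(x, x_hat):
    """Print pixel vs. DINO loss magnitudes to guide the choice of alpha."""
    l_pix = F.l1_loss(x_hat, x).item()
    l_dino = dino_perceptual_loss(x, x_hat).item()  # typically 1e-4 to 1e-2
    print(f"pixel={l_pix:.2e}  dino={l_dino:.2e}  pixel/dino={l_pix / l_dino:.1f}")
```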
Effect of adding different perceptual losses to the base Charbonnier pixel loss:
| Loss Configuration | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| Pixel-only (Charb) | 5.13 | 10.96 | 34.81 | 0.929 |
| + SSIM (γ=0.1) | 4.87 ↓5% | 10.42 ↓5% | 34.72 | 0.931 |
| + LPIPS (α=0.1) | 0.72 ↓86% | 2.93 ↓73% | 34.19 ↓0.6dB | 0.923 |
| + DINO (α=250) | 0.51 ↓90% | 1.45 ↓87% | 33.93 ↓0.9dB | 0.919 |
| + DINO (α=1000) | 0.30 ↓94% | 1.12 ↓90% | 33.64 ↓1.2dB | 0.914 |
| + LPIPS + DINO (α=0.1, 250) | 0.38 ↓93% | 1.35 ↓88% | 33.89 ↓0.9dB | 0.918 |
Based on these ablations, we use Charbonnier + SSIM + DINO as our final loss configuration and train models with 16, 32, and 64 latent channels.
The scatter plots below visualize the trade-off between perceptual metrics (rFID, rFDD) and distortion metrics (PSNR, SSIM). The green line shows a potential Pareto frontier given the limited data points.
With more models and configurations, the true frontier would likely shift. These plots illustrate the general trade-off pattern rather than definitive optimal points.
Using Charbonnier + SSIM + DINO loss, we train ViT autoencoders at three compression levels. Reconstruction quality on ImageNet 256×256:
| Model | Compression | Ratio | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| SD-VAE | f8×4ch | 48 | 0.73 | 6.14 | 25.70 | 0.702 |
| SDXL-VAE | f8×4ch | 48 | 0.68 | — | 26.04 | 0.834 |
| Qwen VAE | f8×16ch | 12 | 1.32 | 7.36 | 30.27 | 0.860 |
| FLUX.1 | f8×16ch | 12 | **0.15** | **2.29** | 31.10 | 0.887 |
| FLUX.2 | f8×16ch | 12 | <u>0.27</u> | — | <u>31.46</u> | <u>0.904</u> |
| ViTok-v2 16×16 | f16×16ch | 48 | 1.52 | 3.66 | 28.46 | 0.793 |
| ViTok-v2 16×32 | f16×32ch | 24 | 1.26 | 2.94 | 31.23 | 0.867 |
| ViTok-v2 16×64 | f16×64ch | 12 | 0.74 | <u>2.49</u> | **34.16** | **0.924** |
Bold = best, underlined = second best. Ratio = pixels per latent channel (3×f²/c). ViTok-v2 uses f16 compression (256 tokens) vs f8 methods (1024 tokens), enabling 4× faster diffusion training.
The idea of using DINO features for image synthesis has been explored in several concurrent and prior works.
Our approach differs in that we use DINO purely as a perceptual loss—comparing features between input and reconstruction—rather than as a discriminator, latent alignment target, or encoder replacement. This is simpler (no adversarial training, no frozen encoder constraints) while achieving comparable perceptual quality improvements.
DINO perceptual loss is a simple drop-in replacement for LPIPS that leverages modern self-supervised features. By using DINOv2/v3 instead of VGG, we roughly halve rFID and rFDD compared to an LPIPS baseline while eliminating the need for adversarial training.
Code: github.com/Na-VAE/dino_perceptual
If you find this code helpful, please cite:
```bibtex
@software{dino_perceptual,
  title  = {DINO Perceptual Loss},
  author = {Hansen-Estruch, Philippe and Chen, Jiahui and Ramanujan, Vivek and Zohar, Orr and Ping, Yan and Sinha, Animesh and Georgopoulos, Markos and Schoenfeld, Edgar and Hou, Ji and Juefei-Xu, Felix and Vishwanath, Sriram and Thabet, Ali},
  year   = {2025},
  url    = {https://github.com/Na-VAE/dino_perceptual}
}

@article{vitok_v2,
  title  = {ViTok-v2: Scaling Visual Tokenizers},
  author = {Hansen-Estruch, Philippe and others},
  year   = {2025},
  note   = {Coming soon! Please check back for the official citation.}
}
```