A modern alternative to LPIPS for training image autoencoders
Modern image generation pipelines like Stable Diffusion and FLUX operate in a compressed latent space rather than directly on pixels. An autoencoder first compresses images into a lower-dimensional representation, and a diffusion model then learns to generate in this latent space. This approach, introduced by Rombach et al., makes high-resolution generation computationally tractable.
The autoencoder's job is to compress images while preserving enough information for high-quality reconstruction. The compression ratio is typically expressed as f×c, where f is the spatial downsampling factor and c is the number of latent channels.
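For a concrete sense of the numbers, here is a quick back-of-the-envelope calculation; it matches the per-channel ratio 3·f²/c used in the results table later in this post:

```python
# Worked example: an f8 x 4ch autoencoder (the SD-VAE configuration) maps a
# 256x256x3 image to a 32x32x4 latent.
H = W = 256
f, c = 8, 4
pixels = H * W * 3                # 196,608 input values
latent = (H // f) * (W // f) * c  # 32 * 32 * 4 = 4,096 latent values
print(pixels / latent)            # 48.0, i.e. 3 * f**2 / c pixels per latent channel
```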
Choosing this ratio creates a fundamental tension between reconstruction quality (rFID) and generation quality (gFID): heavier compression gives the diffusion model a smaller, easier-to-model latent space but discards detail the decoder can never recover, while lighter compression preserves detail but makes the latent space harder to generate in.
The optimal compression balances this trade-off: aggressive enough that diffusion models can learn efficiently, but not so aggressive that reconstruction quality bottlenecks generation. This is why perceptual losses matter—they help preserve the right information under compression.
The standard autoencoder training objective combines several terms:
L = L_pixel + α · L_perceptual + β · L_GAN
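As a rough sketch of how the three terms are combined in training code (the names below, such as `perceptual_net` and `discriminator`, are illustrative placeholders rather than this project's API):

```python
import torch
import torch.nn.functional as F

def autoencoder_loss(x, x_hat, perceptual_net, discriminator=None,
                     alpha=0.1, beta=0.5):
    """Illustrative three-term objective: pixel + perceptual + (optional) GAN."""
    # Pixel reconstruction term (L1 here; L2 or Charbonnier are common variants).
    l_pixel = F.l1_loss(x_hat, x)

    # Perceptual term: feature distance measured by a frozen network.
    with torch.no_grad():
        feats_real = perceptual_net(x)
    feats_fake = perceptual_net(x_hat)
    l_perceptual = F.mse_loss(feats_fake, feats_real)

    # Optional adversarial term (simple non-saturating generator loss).
    l_gan = -discriminator(x_hat).mean() if discriminator is not None else 0.0

    return l_pixel + alpha * l_perceptual + beta * l_gan
```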
Each term creates a different trade-off, as explored in ViTok (with ViTok-v2 coming soon).
LPIPS uses VGG-16, a 2014 classification network trained on 1.2M ImageNet images. DINOv2 and DINOv3 are Meta's self-supervised vision models, and they offer several advantages as a feature backbone: they are trained without labels on far larger datasets, use modern ViT architectures, and produce features that capture semantic structure rather than classification-specific cues.
FDD (Fréchet DINO Distance) works exactly like FID, but uses DINO CLS tokens instead of Inception features:
```python
import numpy as np
from scipy import linalg

from dino_perceptual import DINOModel


def compute_fdd(real_features, fake_features):
    """Fréchet distance between two sets of DINO CLS features (same formula as FID)."""
    # Gaussian statistics of each feature set: mean and covariance.
    mu1, sigma1 = real_features.mean(0), np.cov(real_features, rowvar=False)
    mu2, sigma2 = fake_features.mean(0), np.cov(fake_features, rowvar=False)

    diff = mu1 - mu2
    # Matrix square root of the covariance product; numerical error can introduce
    # a tiny imaginary component, which we discard.
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return diff @ diff + np.trace(sigma1 + sigma2 - 2 * covmean)


# Usage
extractor = DINOModel(model_size="B").cuda().eval()
real_feats, _ = extractor(real_images)
fake_feats, _ = extractor(generated_images)
fdd = compute_fdd(real_feats.cpu().numpy(), fake_feats.cpu().numpy())
```
The figure below shows how to interpret FDD scores:
Interpretation: FDD < 1 indicates excellent reconstruction quality (nearly identical feature distributions), 1-5 is good, and > 5 suggests significant perceptual differences. Unlike pixel metrics, FDD captures semantic similarity: two sets of images can have low FDD despite pixel-level differences if they share the same high-level structure.
All metrics are computed on the ImageNet 256×256 (center crop) validation set with 50K images. Baseline results (SD-VAE, Qwen VAE, FLUX.1) are reproduced by us on the same benchmark for a fair comparison.
We train a Vision Transformer autoencoder following the ViTok architecture: a shallow ViT encoder compresses images into latent tokens, and a deeper ViT decoder reconstructs the output. We use f16 spatial compression (256 tokens for 256×256 images) with 64 latent channels.
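Concretely, the shapes at f16 with 64 latent channels look like this (a sanity-check sketch, not the actual ViTok code):

```python
import torch

# A 256x256 RGB image patchified with a 16x16 patch size yields a 16x16 token grid.
B, H, W, f, c = 1, 256, 256, 16, 64
num_tokens = (H // f) * (W // f)        # 16 * 16 = 256 latent tokens
latent = torch.randn(B, num_tokens, c)  # encoder output: (1, 256, 64)
print(latent.shape, (H * W * 3) / latent.numel())  # 12 pixels per latent value
```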
For the perceptual loss, we use a frozen DINOv3-B (ViT-Base) model. We extract features from intermediate transformer layers, L2-normalize per token, and compute MSE between input and reconstruction features. We did not ablate other DINO model sizes (S, L, G, H)—larger models may yield further improvements.
The full loss for the final model combines: Charbonnier + SSIM (γ=0.1) + DINO (α=250). No adversarial training is used.
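A minimal sketch of this feature-space loss is shown below. The real implementation lives in the linked dino_perceptual repo and uses DINOv3-B; here we load DINOv2 ViT-B/14 from `torch.hub` for convenience, and the choice of intermediate layers is illustrative, not the configuration behind the reported results.

```python
import torch
import torch.nn.functional as F

# Frozen DINO backbone (DINOv2 ViT-B/14 via torch.hub; swap in DINOv3-B if available).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()
for p in dino.parameters():
    p.requires_grad_(False)

def dino_perceptual_loss(x, x_hat, layers=(5, 8, 11)):
    """MSE between L2-normalized per-token features of input and reconstruction.

    x and x_hat are ImageNet-normalized RGB batches; `layers` is an illustrative
    choice of intermediate transformer blocks.
    """
    # DINOv2 requires spatial sizes divisible by its 14-pixel patch size.
    x = F.interpolate(x, size=(224, 224), mode="bilinear", antialias=True)
    x_hat = F.interpolate(x_hat, size=(224, 224), mode="bilinear", antialias=True)

    feats_real = dino.get_intermediate_layers(x, n=layers)      # tuple of (B, N, D)
    feats_fake = dino.get_intermediate_layers(x_hat, n=layers)

    loss = 0.0
    for fr, ff in zip(feats_real, feats_fake):
        fr = F.normalize(fr, dim=-1)  # L2-normalize each token
        ff = F.normalize(ff, dim=-1)
        loss = loss + F.mse_loss(ff, fr)
    return loss / len(feats_real)
```

The final objective then adds this term with α=250 on top of the Charbonnier and SSIM losses described above.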
A key property of any perceptual loss is that it should increase monotonically with perceptual degradation. We verified this by measuring DINO loss under three common distortions:
The plots show DINO loss (DINOv2-B) for a sample image degraded with increasing levels of blur (σ=0→8), noise (σ=0→100), and JPEG compression (quality 100→5). In all cases, the loss increases monotonically—confirming that DINO features capture perceptual degradation in a well-behaved manner suitable for gradient-based optimization.
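A sweep like the blur test can be scripted in a few lines, reusing the `dino_perceptual_loss` sketch above (noise and JPEG sweeps follow the same pattern); this is a suggested sanity check, not the exact script behind the plots:

```python
import torchvision.transforms.functional as TF

def blur_sweep(x, sigmas=(0.5, 1.0, 2.0, 4.0, 8.0)):
    """DINO loss under increasingly strong Gaussian blur; values should be increasing."""
    losses = []
    for sigma in sigmas:
        k = 2 * int(round(3 * sigma)) + 1  # odd kernel covering roughly 3 sigma
        x_blur = TF.gaussian_blur(x, kernel_size=k, sigma=sigma)
        losses.append(dino_perceptual_loss(x, x_blur).item())
    return losses
```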
The DINO loss magnitude is typically 10⁻⁴ to 10⁻² for natural images. Scale it to balance with your pixel loss:
| Pixel Loss Type | Recommended DINO Weight (α) |
|---|---|
| L1 / L2 | 250 - 500 |
| Charbonnier | 250 - 1000 |
| + SSIM loss | 250 (SSIM provides structure) |
Rule of thumb: Start with α=250, increase to α=1000 for better perceptual quality at the cost of ~1 dB PSNR.
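One practical way to sanity-check the weight is to compare raw magnitudes on a training batch before committing to a value. The helper below reuses `dino_perceptual_loss` from earlier and is a heuristic sketch, not the procedure used for the ablations:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def inspect_loss_scales(x, x_hat):
    """Print pixel vs. DINO loss magnitudes to guide the choice of alpha."""
    l_pix = F.l1_loss(x_hat, x).item()
    l_dino = dino_perceptual_loss(x, x_hat).item()  # typically 1e-4 to 1e-2
    print(f"pixel={l_pix:.2e}  dino={l_dino:.2e}  pixel/dino={l_pix / l_dino:.1f}")
```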
Effect of adding different perceptual losses to the base Charbonnier pixel loss:
| Loss Configuration | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|
| Pixel-only (Charb) | 5.13 | 10.96 | 34.81 | 0.929 |
| + SSIM (γ=0.1) | 4.87 ↓5% | 10.42 ↓5% | 34.72 | 0.931 |
| + LPIPS (α=0.1) | 0.72 ↓86% | 2.93 ↓73% | 34.19 ↓0.6dB | 0.923 |
| + DINO (α=250) | 0.51 ↓90% | 1.45 ↓87% | 33.93 ↓0.9dB | 0.919 |
| + DINO (α=1000) | 0.30 ↓94% | 1.12 ↓90% | 33.64 ↓1.2dB | 0.914 |
| + LPIPS + DINO (α=0.1, 250) | 0.38 ↓93% | 1.35 ↓88% | 33.89 ↓0.9dB | 0.918 |
Based on these ablations, we use Charbonnier + SSIM + DINO as our final loss configuration and train models with 16, 32, and 64 latent channels.
The scatter plots below visualize the trade-off between perceptual metrics (rFID, rFDD) and distortion metrics (PSNR, SSIM). The green line shows a potential Pareto frontier given the limited data points.
With more models and configurations, the true frontier would likely shift. These plots illustrate the general trade-off pattern rather than definitive optimal points.
Using Charbonnier + SSIM + DINO loss, we train ViT autoencoders at three compression levels. Reconstruction quality on ImageNet 256×256:
| Model | Compression | Ratio | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ |
|---|---|---|---|---|---|---|
| SD-VAE | f8×4ch | 48 | 0.73 | 6.14 | 25.70 | 0.702 |
| SDXL-VAE | f8×4ch | 48 | 0.68 | — | 26.04 | 0.834 |
| Qwen VAE | f8×16ch | 12 | 1.32 | 7.36 | 30.27 | 0.860 |
| FLUX.1 | f8×16ch | 12 | **0.15** | **2.29** | 31.10 | 0.887 |
| FLUX.2 | f8×16ch | 12 | <u>0.27</u> | — | <u>31.46</u> | <u>0.904</u> |
| ViTok-v2 16×16 | f16×16ch | 48 | 1.52 | 3.66 | 28.46 | 0.793 |
| ViTok-v2 16×32 | f16×32ch | 24 | 1.26 | 2.94 | 31.23 | 0.867 |
| ViTok-v2 16×64 | f16×64ch | 12 | 0.74 | <u>2.49</u> | **34.16** | **0.924** |
Bold = best, underlined = second best. Ratio = pixels per latent channel (3×f²/c). ViTok-v2 uses f16 compression (256 tokens) vs f8 methods (1024 tokens), enabling 4× faster diffusion training.
The idea of using DINO features for image synthesis has been explored in several concurrent and prior works.
Our approach differs in that we use DINO purely as a perceptual loss—comparing features between input and reconstruction—rather than as a discriminator, latent alignment target, or encoder replacement. This is simpler (no adversarial training, no frozen encoder constraints) while achieving comparable perceptual quality improvements.
DINO perceptual loss is a simple drop-in replacement for LPIPS that leverages modern self-supervised features. By using DINOv2/v3 instead of VGG, we roughly halve rFID and rFDD compared to an LPIPS baseline while eliminating the need for adversarial training.
Code: github.com/Na-VAE/dino_perceptual
If you find this code helpful, please cite:
```bibtex
@software{dino_perceptual,
  title  = {DINO Perceptual Loss},
  author = {Hansen-Estruch, Philippe and Chen, Jiahui and Ramanujan, Vivek and Zohar, Orr and Ping, Yan and Sinha, Animesh and Georgopoulos, Markos and Schoenfeld, Edgar and Hou, Ji and Juefei-Xu, Felix and Vishwanath, Sriram and Thabet, Ali},
  year   = {2025},
  url    = {https://github.com/Na-VAE/dino_perceptual}
}

@article{vitok_v2,
  title  = {ViTok-v2: Scaling Visual Tokenizers},
  author = {Hansen-Estruch, Philippe and others},
  year   = {2025},
  note   = {Coming soon! Please check back for the official citation.}
}
```