Latent Diffusion and Flow Models
Diffusion models[1] and flow-based models[15] have emerged as the dominant paradigm for image generation, powering systems like Stable Diffusion[2], DALL·E 3[3], Flux[4], Imagen[19], and Sora[20]. Operating directly in pixel space is computationally prohibitive at high resolutions, so central to these pipelines is the autoencoder, which maps the high-dimensional pixel space into a compact, "diffusable" latent space. Each autoencoder is characterized by its compression ratio—the ratio of input dimensions to latent dimensions, where the latent size is typically expressed as a number of tokens times channels per token. For example, 8× spatial downsampling with 16 channels turns a 256px image into 1024 tokens × 16 channels, a 12:1 compression ratio.
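The arithmetic can be sketched in a couple of lines (`latent_shape` is an illustrative helper, not part of any released code):

```python
def latent_shape(h: int, w: int, f: int, c: int) -> tuple[int, float]:
    """Token count and compression ratio for f-times spatial
    downsampling with c latent channels per token."""
    tokens = (h // f) * (w // f)
    ratio = (h * w * 3) / (tokens * c)  # input dims / latent dims
    return tokens, ratio

# 8x downsampling of a 256px image with 16 channels:
# 32 * 32 = 1024 tokens, (256*256*3)/(1024*16) = 12:1 compression
print(latent_shape(256, 256, f=8, c=16))
```

Plugging in the f16x64 configuration used later (16× downsampling, 64 channels) gives 256 tokens at the same 12:1 ratio.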
Training autoencoders is difficult for two main reasons:
(1) A simple L2 pixel loss produces blurry reconstructions because it cannot capture perceptual similarity—two images can have low pixel error yet look very different to humans. This has led to a proliferation of perceptual losses: VGG-based LPIPS[13], adversarial GAN losses[21], and more recently DINO-based losses[11]. For a deeper discussion of perceptual loss design, see our analysis.
(2) A fundamental tension exists between reconstruction quality (rFID) and generation quality (gFID).
Lower compression ratios (more latent dimensions) enable better reconstruction but create more complex latent spaces
that are harder for diffusion models to learn. Higher compression simplifies the generative task but limits reconstruction fidelity.
Finding the right balance is key to practical latent diffusion systems.
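Schematically, the perceptual losses above all compare images in the feature space of a frozen network rather than in pixel space. In the sketch below, a tiny random conv stack stands in for the pretrained backbone (VGG for LPIPS, DINO for DINO-based losses); it is illustrative only:

```python
import torch
import torch.nn.functional as F

# Schematic feature-space perceptual loss. The random conv stack below is
# purely a stand-in for a frozen pretrained backbone (VGG for LPIPS[13],
# DINO for DINO-based losses[11]).
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 64, 3, stride=2, padding=1),
).eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # perceptual backbones stay frozen

def perceptual_loss(x, y):
    # Distance in feature space, not pixel space: two images can be close
    # pixel-wise yet look very different, and vice versa.
    fx = F.normalize(backbone(x), dim=1)  # unit-normalize channels (as LPIPS does)
    fy = F.normalize(backbone(y), dim=1)
    return (fx - fy).pow(2).mean()
```

With a real pretrained backbone, this distance tracks human similarity judgments far better than raw L2 on pixels.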
CNN vs ViT Autoencoders
CNN-based VAEs have been the production standard since the original Stable Diffusion. SD-VAE[2] and its successor SDXL-VAE use convolutional encoders and decoders with 8× spatial downsampling. Flux VAE[4] extends this to 16 channels for richer latent representations. Their key advantage is translation equivariance—CNNs generalize robustly across resolutions and aspect ratios, even when trained only at 256px. However, CNN architectures are limited in compression: most use 8× spatial reduction, yielding 1024 tokens for a 256px image.
Hybrid CNN-ViT VAEs push beyond 8× spatial reduction. DC-AE[6] achieves up to 128× compression using residual autoencoding with EfficientViT blocks in a primarily convolutional design. Cosmos Tokenizer[28] combines 3D convolutions with spatio-temporal attention for unified image/video tokenization. OmniTokenizer[27] uses transformer-based spatial-temporal decoupling.
ViT-based VAEs have emerged more recently. TiTok[5] compresses images to just 32 tokens using a 1D latent representation. GigaTok[7] scales pure ViT tokenizers to 3 billion parameters. AToken[29] introduces a unified tokenizer achieving both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets using 4D RoPE and adversarial-free training. Concurrent work RAE[12] explores aligning pretrained representation encoders (SigLIP2, DINOv2, MAE) with learned decoders, showing frozen encoders can serve as strong diffusion latent spaces—though without exploring native resolution support in detail.
The challenge with ViT-VAEs: they typically require GAN losses for perceptual quality, introducing training instability. More critically, vanilla ViTs struggle with resolution generalization—we discuss this limitation and our solution in the following sections.
ViTok-v1 Findings
Our prior work, ViTok-v1[8], introduced a simple continuous ViT-VAE where each patch token maps directly to a latent vector. Key findings:
- Asymmetric scaling: Decoder capacity drives reconstruction quality; encoders can remain relatively lightweight without performance loss
- Reconstruction–generation trade-off: More latent channels improve reconstruction monotonically, but generation quality (gFID) is parabolic—too many channels create distributions that diffusion models struggle to learn
- Token efficiency: ViT-VAEs can achieve comparable reconstruction quality to CNN-VAEs with far fewer tokens (e.g., 256 vs 1024), enabling faster diffusion training
Limitations: ViTok-v1 exhibited the resolution generalization problem common to ViT architectures. Unlike CNNs, which generalize naturally due to translation equivariance, vanilla ViTs fail catastrophically at higher resolutions—models trained at 256px produce severe grid artifacts when evaluated at 512px or with non-square aspect ratios. The learned positional embeddings simply do not extrapolate to unseen positions, causing the model to hallucinate patch boundaries and produce blocky artifacts. Additionally, ViTok-v1's largest model (302M parameters) left open questions about billion-scale behavior, and, like other ViT-VAEs, it relied on GAN losses for competitive perceptual metrics, introducing training instability.
ViTok-v2: Our Approach
We build upon ViTok-v1 with two key improvements:
- NaFlex Resolution Flexibility: enabling resolution generalization via NaFlex. We integrate the NaFlex data pipeline[9,10], which resizes images preserving aspect ratio, then pads to patch boundaries rather than distorting via crop or stretch. Combined with 2D RoPE positional embeddings[14], models trained at 256px generalize to 512px, 1024px, and beyond without fine-tuning.
- Decoder Scaling: stable training with modern perceptual losses. We scale decoders to 4.5B parameters and replace unstable GAN objectives with DINOv3[11] perceptual loss. This achieves competitive reconstruction quality with stable, single-stage training.
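A common construction of 2D RoPE, sketched here under the assumption that the usual axis-split recipe is used, applies 1D rotary embeddings to half of each token's channels with the row coordinate and to the other half with the column coordinate. Because the rotation angle is a continuous function of the coordinate, it is defined at positions never seen during training:

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate channel pairs of x by angles pos * freq.
    x: (tokens, d) with d even; pos: (tokens,) float coordinates."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[:, None] * freqs[None, :]          # (tokens, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """Axis-split 2D RoPE: half the channels encode the row coordinate,
    half the column coordinate (d must be divisible by 4)."""
    h = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :h], rows), rope_1d(x[..., h:], cols)], dim=-1)
```

Queries and keys rotated this way make attention logits depend only on relative offsets (Δrow, Δcol), so positions unseen at 256px training remain well-defined at 512px and beyond.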
Model Architecture
Our models use an asymmetric encoder-decoder design with relatively lightweight encoders and deep decoders. Model names refer to the decoder size:
| Model Name | Encoder | Decoder | Total |
|---|---|---|---|
| 350M | 51M | 303M | ~354M |
| 5B | 463M | 4.5B | ~5B |
Compression Configurations
| Config | Spatial Factor | Channels | Compression Ratio | Tokens (256px) | Tokens (512px) |
|---|---|---|---|---|---|
| Baselines | |||||
| SD VAE (f8x4) | 8x | 4 | 48:1 | 1024 | 4096 |
| Flux VAE (f8x16) | 8x | 16 | 12:1 | 1024 | 4096 |
| ViTok-v2 (Ours) | |||||
| f16x64 (our best model) | 16x | 64 | 12:1 | 256 | 1024 |
| f16x32 | 16x | 32 | 24:1 | 256 | 1024 |
| f16x16 | 16x | 16 | 48:1 | 256 | 1024 |
| f32x64 | 32x | 64 | 48:1 | 64 | 256 |
| f32x128 | 32x | 128 | 24:1 | 64 | 256 |
Compression Ratio = (H × W × 3) / ((H/f) × (W/f) × C) = 3f²/C. Fewer tokens = cheaper DiT inference and faster training (usually at the cost of diffusability).
NaFlex: Resolution and Aspect-Ratio Flexibility
A key limitation of ViT-based autoencoders is their poor generalization to resolutions and aspect ratios outside their training distribution. We integrate the NaFlex data pipeline[9,10]:
- Preserve native aspect ratio: Images are resized so the longest side fits within a budget (e.g., 256px), keeping the original aspect ratio intact.
- Gray-pad to patch boundary: Pad with gray (neutral value) to make dimensions divisible by patch size, rather than distorting via resize.
- Patchify with spatial coordinates: Each patch gets explicit (row, col) coordinates stored in the batch metadata.
- 2D RoPE positional encoding[14]: Rotary embeddings encode patch positions as a continuous function of coordinates—enabling generalization to unseen positions.
- Attention masking: Padded regions are masked out during attention so the model ignores padding patches.
❗ Key benefits: Models trained at 256px generalize to 512px, 1024px, and beyond without fine-tuning, and small high-resolution finetunes further improve quality. The example image above is 1024×1536 px, demonstrating that even the finetuned model generalizes well to unseen resolutions and aspect ratios.
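The first three steps can be sketched as follows (`naflex_preprocess`, its defaults, and the bilinear resize are assumptions for illustration, not the released pipeline):

```python
import math
import torch
import torch.nn.functional as F

def naflex_preprocess(img, budget=256, patch=16, pad_value=0.5):
    """Sketch of NaFlex-style preprocessing: aspect-preserving resize, gray
    padding to a patch boundary, and per-patch (row, col) coordinates.
    Assumes img is a (C, H, W) float tensor in [0, 1]."""
    c, h, w = img.shape
    scale = budget / max(h, w)                       # fit longest side in budget
    nh, nw = round(h * scale), round(w * scale)
    img = F.interpolate(img[None], size=(nh, nw), mode="bilinear",
                        align_corners=False)[0]
    ph = math.ceil(nh / patch) * patch               # pad up to patch multiples
    pw = math.ceil(nw / patch) * patch
    img = F.pad(img, (0, pw - nw, 0, ph - nh), value=pad_value)  # neutral gray
    rows, cols = torch.meshgrid(torch.arange(ph // patch),
                                torch.arange(pw // patch), indexing="ij")
    coords = torch.stack([rows.flatten(), cols.flatten()], dim=-1)  # (tokens, 2)
    return img, coords
```

A 300×500 input with a 256px budget becomes a 160×256 image with 10×16 = 160 patch coordinates; fully gray padded patches would then be masked out of attention, per the last bullet.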
Decoder Scaling
To investigate how decoder capacity affects reconstruction at different compression levels, we trained a suite of ViTok models using only L1 pixel loss—no perceptual losses, no adversarial training—so that the effect of scale is studied in isolation.
Recent work like GigaTok[7] scales ViT tokenizers to 3 billion parameters. We push further, training decoders from 88M (B-scale) to 4.5B parameters (T-scale) across compression ratios from 12:1 to 48:1.
❗ Key findings: Scaling the decoder consistently improves all metrics, but the benefits are most pronounced at aggressive compression. At 12:1 compression, the gap between L (302M decoder) and T (4.5B decoder) is modest. At 48:1, the gap widens dramatically: rFID improves from 12.2 to 2.3, and rFDD improves from 5.2 to 2.0. This makes intuitive sense: larger patches require more capacity to decode.
However, even our largest T-scale model achieves rFID of only ~5 and rFDD of ~5 with L1 loss alone—far from competitive with GAN-trained baselines. This motivates our use of DINOv3 perceptual loss, which dramatically improves perceptual metrics without adversarial instability (see our detailed analysis).
COCO Reconstruction Evaluation
We evaluate reconstruction on the MS-COCO validation set against reproduced baselines (FLUX.2 VAE, Qwen VAE[18], SD VAE, SDXL VAE, and DC-AE[6]).
Metrics: Enc/Dec = encoder/decoder parameters; rFID/rFDD = reconstruction FID/FDD (lower is better); PSNR/SSIM = pixel-level metrics (higher is better); ms/img = latency per image (lower is better), A100-80GB, batch 500, compiled.
| Model | Enc | Dec | Compression | Tokens | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ | ms/img ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 12:1 Compression | |||||||||
| FLUX.2 VAE | 34M | 49M | f8x16 | 1024 | 1.12 | 1.46 | 31.48 | 0.900 | 2.14 |
| Qwen VAE | 106M | 393M | f8x16 | 1024 | 1.71 | 3.79 | 29.12 | 0.849 | 75.96 |
| 5B-f16x64 | 463M | 4.5B | f16x64 | 256 | 2.98 | 4.28 | 34.05 | 0.930 | 3.59 |
| 350M-f16x64 | 51M | 303M | f16x64 | 256 | 3.73 | 5.62 | 32.83 | 0.918 | 0.54 |
| 5B-f32x256 | 463M | 4.5B | f32x256 | 64 | 6.10 | 7.50 | 31.47 | 0.899 | 0.91 |
| 24:1 Compression | |||||||||
| SDXL VAE | 34M | 49M | f8x8 | 1024 | 4.16 | 9.59 | 25.78 | 0.740 | 4.22 |
| 5B-f16x32 | 463M | 4.5B | f16x32 | 256 | 4.72 | 5.37 | 31.08 | 0.878 | 3.59 |
| 350M-f16x32 | 51M | 303M | f16x32 | 256 | 6.60 | 8.35 | 30.41 | 0.866 | 0.54 |
| 48:1 Compression | |||||||||
| SD VAE | 34M | 49M | f8x4 | 1024 | 4.38 | 12.10 | 25.42 | 0.715 | 4.24 |
| 5B-f16x16 | 463M | 4.5B | f16x16 | 256 | 6.66 | 7.15 | 28.26 | 0.807 | 3.62 |
| 5B-f32x128 | 463M | 4.5B | f32x128 | 64 | 8.93 | 10.08 | 29.01 | 0.838 | 0.89 |
| 350M-f16x16 | 51M | 303M | f16x16 | 256 | 10.06 | 11.79 | 27.92 | 0.795 | 0.54 |
| 5B-f32x64 | 463M | 4.5B | f32x64 | 64 | 11.07 | 15.96 | 26.51 | 0.754 | 0.90 |
| 96:1 Compression | |||||||||
| DC-AE-f32 | 19M | 72M | f32x32 | 64 | 5.11 | 16.63 | 23.13 | 0.625 | 5.72 |
| DC-AE-f64 | 43M | 310M | f64x128 | 16 | 5.95 | 16.04 | 23.31 | 0.632 | 4.55 |
ViTok f16 models use 4× fewer tokens than f8 baselines (256 vs 1024 at 256p), enabling faster DiT training. Latency measured with torch.compile on H100, adm_center crop, 5000 samples (batch 500 @ 256p, batch 125 @ 512p).
DIV8K High-Resolution Evaluation
We evaluate on DIV8K at high resolutions with native aspect ratios. At 1024p and 2048p, our models run at comparable speed to baselines while achieving state-of-the-art reconstruction quality. At 4096p and above, baseline VAEs either run out of memory or are extremely slow (~20-170 seconds/image), while ViTok models run 15-30× faster without OOM issues.
| Model | Enc | Dec | Compression | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ | ms/img ↓ |
|---|---|---|---|---|---|---|---|---|
| 12:1 Compression | ||||||||
| FLUX.2 VAE | 34M | 49M | f8x16 | 0.90 | 0.44 | 31.44 | 0.908 | 107.6 |
| Qwen VAE | 106M | 393M | f8x16 | 1.50 | 1.28 | 28.84 | 0.845 | 237.3 |
| 5B-f16x64 | 463M | 4.5B | f16x64 | 0.35 | 0.89 | 33.99 | 0.932 | 207.4 |
| 350M-f16x64 | 51M | 303M | f16x64 | 0.44 | 1.30 | 32.78 | 0.918 | 11.98 |
| 5B-f32x256 | 463M | 4.5B | f32x256 | 0.52 | 1.67 | 31.42 | 0.899 | 15.54 |
| 24:1 Compression | ||||||||
| SDXL VAE | 34M | 49M | f8x8 | 4.79 | 3.53 | 25.96 | 0.731 | 128.2 |
| 5B-f16x32 | 463M | 4.5B | f16x32 | 1.21 | 2.16 | 30.98 | 0.874 | 70.81 |
| 350M-f16x32 | 51M | 303M | f16x32 | 1.63 | 3.62 | 30.32 | 0.861 | 12.01 |
| 5B-f32x128 | 463M | 4.5B | f32x128 | 1.68 | 3.13 | 29.05 | 0.834 | 15.23 |
| 48:1 Compression | ||||||||
| SD VAE | 34M | 49M | f8x4 | 5.59 | 4.39 | 25.58 | 0.707 | 110.1 |
| 5B-f16x16 | 463M | 4.5B | f16x16 | 3.05 | 3.17 | 28.45 | 0.802 | 71.41 |
| 5B-f32x64 | 463M | 4.5B | f32x64 | 4.69 | 6.14 | 26.96 | 0.754 | 15.21 |
| 350M-f16x16 | 51M | 303M | f16x16 | 4.83 | 6.96 | 28.11 | 0.788 | 11.88 |
| 96:1 Compression | ||||||||
| DC-AE-f64 | 43M | 310M | f64x128 | 6.52 | 4.33 | 24.09 | 0.648 | 422.1 |
| DC-AE-f32 | 19M | 72M | f32x32 | 7.08 | 4.88 | 23.71 | 0.636 | 334.5 |
DIV8K evaluation with native aspect ratio crops. Latency measured on H100 (batch size 2 for 4096p, 1 for 8192p). At 4096p+, baseline VAEs are too slow or OOM, while ViTok models process images 15-30× faster without memory issues.
Visual Comparison Tool
Compare reconstructions across models (e.g., ViTok 5B-f16x64 vs. Flux VAE) with our interactive viewer, which supports magnification, error heatmaps, and side-by-side comparison.
DiT Training Efficiency
To study the reconstruction–generation trade-off, we trained Diffusion Transformers (DiT)[17] at two scales (450M DiT-L and 1.2B DiT-G) using flow matching[15] on ImageNet-22K for 300 epochs (batch size 4096, learning rate 1e-3, cosine schedule). All models use ViTok f16 compression (256 tokens per 256px image) with different channel configurations (16ch, 32ch, 64ch). We compare the performance of each DiT size with both VAE decoder sizes (5B Td4-T vs 302M Ld4-L) to isolate the effect of reconstruction quality on generation.
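One flow-matching training step on precomputed latents can be sketched as follows (the linear interpolation path and velocity target follow standard practice; `model` and the shapes are placeholders, not our exact training code):

```python
import torch

def flow_matching_loss(model, x0):
    """Schematic flow-matching training step on VAE latents.
    x0: clean latents of shape (batch, tokens, channels); `model` is a
    placeholder for the DiT, called as model(x_t, t)."""
    b = x0.shape[0]
    t = torch.rand(b, 1, 1)               # one uniform timestep per sample
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise         # linear path from data to noise
    target = noise - x0                   # velocity of that path
    pred = model(xt, t.flatten())         # DiT predicts the velocity field
    return (pred - target).pow(2).mean()
```

Sampling then integrates the learned velocity field from pure noise back to data, e.g. with a simple Euler scheme over t.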
Reconstruction vs Generation Metrics
Key Takeaways: Channel count and DiT size interact—smaller DiT models (450M) work better with lower channel counts (16ch, 32ch), while larger DiT models (1.2B) can leverage higher channel counts (64ch) given sufficient training. There is a clear FID vs FDD trade-off: 64ch achieves the best gFID at convergence, but 16ch maintains better gFDD throughout training, suggesting lower-dimensional latents produce more diverse generations. Decoder capacity matters significantly for generation—the 5B VAE decoder outperforms the 302M decoder by 1–2 gFID points at convergence, with the gap widening as training progresses. Our 256-token f16 configuration matches the quality of 1024-token f8 baselines at 4× lower training compute.
512p f32 compression: We also explored 32× spatial downsampling for higher-resolution generation. This reduces token count to just 256 tokens for 512p images (vs 1024 tokens for f16 at 512p), enabling efficient high-resolution generation while maintaining quality. The 512p f32 results (green diamonds) show 128ch achieves better rFID (0.32 vs 1.1) but 64ch achieves better gFDD (9.4 vs 15.6)—the classic reconstruction-generation trade-off.
Limitations
- Texture quality: Flux models still produce better quality textures, likely due to their use of generative/adversarial losses during training
- High compression challenges: Extreme compression configurations (e.g., f32x256) failed to train stable DiT models in our experiments
- Decoder inference cost: 4.5B decoder is slower than smaller VAEs (acceptable since it only runs once per generated image)
- Video compression: Not yet supported—extending to video auto-encoding is future work
- CFG saturation artifacts: ViTok latents can produce oversaturated samples when using standard classifier-free guidance (CFG). This manifests as overly contrasty, saturated colors (see example below). The issue can be mitigated using CFG interval[31]—applying guidance only between 10-90% of the sampling process—combined with rescaled CFG (scaling CFG output by the ratio of conditional to unconditional standard deviations). Radial CFG may also provide better results and is worth exploring.
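A minimal sketch of this mitigation, assuming a sampler with normalized time t in [0, 1] (the parameter names, interval endpoints, and blend factor are illustrative, not our exact settings):

```python
import torch

def guided_velocity(cond, uncond, t, scale=5.0, lo=0.1, hi=0.9, rescale=0.7):
    """Sketch of interval + rescaled CFG. cond/uncond are the conditional and
    unconditional model outputs at normalized time t in [0, 1]."""
    if not (lo <= t <= hi):
        return cond                        # no guidance outside the interval[31]
    guided = uncond + scale * (cond - uncond)
    # Rescaled CFG: pull the guided output's std back toward the conditional
    # std, counteracting the oversaturation described above.
    std_ratio = cond.std() / (guided.std() + 1e-8)
    return rescale * (guided * std_ratio) + (1 - rescale) * guided
```

Setting `rescale=0.0` recovers plain interval CFG, and widening `(lo, hi)` to `(0, 1)` recovers standard CFG.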
References
1. Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239
2. Rombach, Blattmann, Lorenz, Esser, Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
3. Betker et al. Improving Image Generation with Better Captions. OpenAI 2023. Paper
4. Black Forest Labs. FLUX.1. 2024. Website
5. Yu et al. An Image is Worth 32 Tokens for Reconstruction and Generation. NeurIPS 2024. arXiv:2406.07550
6. Chen et al. Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. 2024. arXiv:2410.10733
7. Xiong et al. GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters. 2025. arXiv:2504.02803
8. Hansen-Estruch et al. Learnings from Scaling Visual Tokenizers for Reconstruction and Generation. 2025. arXiv:2501.09755
9. Dehghani et al. Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. NeurIPS 2023. arXiv:2307.06304
10. Tschannen et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding. 2025. arXiv:2502.14786
11. Simeoni et al. DINOv3. 2025. arXiv:2508.10104
12. Yao, Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. 2025. arXiv:2501.01423
13. Zhang et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR 2018. arXiv:1801.03924
14. Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. arXiv:2104.09864
15. Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
16. Beltagy, Peters, Cohan. Longformer: The Long-Document Transformer. 2020. arXiv:2004.05150
17. Peebles, Xie. Scalable Diffusion Models with Transformers. ICCV 2023. arXiv:2212.09748
18. Wu et al. Qwen-Image Technical Report. 2025. arXiv:2508.02324
19. Saharia et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS 2022. arXiv:2205.11487
20. OpenAI. Sora: Creating video from text. 2024. Website
21. Esser et al. Taming Transformers for High-Resolution Image Synthesis. CVPR 2021. arXiv:2012.09841
22. van den Oord et al. Neural Discrete Representation Learning. NeurIPS 2017. arXiv:1711.00937
23. Yu et al. Vector-quantized Image Modeling with Improved VQGAN. ICLR 2022. arXiv:2110.04627
24. Chang et al. MaskGIT: Masked Generative Image Transformer. CVPR 2022. arXiv:2202.04200
25. Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. ICLR 2024. arXiv:2310.05737
26. Luo et al. Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation. 2024. arXiv:2409.04410
27. Wang et al. OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation. 2024. arXiv:2406.09399
28. Agarwal et al. Cosmos Tokenizer: A suite of image and video neural tokenizers. NVIDIA 2024. GitHub
29. Lu, Song et al. AToken: A Unified Tokenizer for Vision. 2025. arXiv:2509.14476
30. Kolesnikov et al. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes. NeurIPS 2022. arXiv:2205.10337
31. Kynkäänniemi et al. Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models. NeurIPS 2024. arXiv:2404.07724
Authors & Citation
Authors: Philippe Hansen-Estruch, Jiahui Chen, Vivek Ramanujan, Orr Zohar, Yan Ping, Animesh Sinha, Markos Georgopoulos, Edgar Schoenfeld, Ji Hou, Felix Juefei-Xu, Sriram Vishwanath, Ali Thabet
If you find this work useful, please cite:
@article{hansenestruch2025vitokv2,
title={ViTok-v2: Scaling Vision Transformer Tokenizers to 4.5 Billion Parameters},
author={Hansen-Estruch, Philippe and Chen, Jiahui and Ramanujan, Vivek and Zohar, Orr and Ping, Yan and Sinha, Animesh and Georgopoulos, Markos and Schoenfeld, Edgar and Hou, Ji and Juefei-Xu, Felix and Vishwanath, Sriram and Thabet, Ali},
journal={arXiv preprint},
year={2025}
}