ViTok-v2

Fully Native Resolution Auto-Encoder Scaled to 4.5 Billion Parameters

We improve upon ViTok-v1 by integrating the NaFlex data pipeline and scaling the decoder, achieving performance competitive with current leading autoencoders across reconstruction and generation metrics.

Latent Diffusion and Flow Models

Diffusion models[1] and flow-based models[15] have emerged as the dominant paradigm for image generation, powering systems like Stable Diffusion[2], DALL·E 3[3], Flux[4], Imagen[19], and Sora[20]. Operating directly in pixel space is computationally prohibitive for high-resolution images, so central to these pipelines is the autoencoder, which converts the high-dimensional pixel space into a more compact, "diffusable" latent space. Each autoencoder can be characterized by its compression ratio—the ratio of input dimensions to latent dimensions—where the latent size is typically expressed as the number of latent tokens times the channels per token. For example, an 8× spatial downsampling with 16 channels yields 1024 tokens × 16 channels for a 256px image, giving a 12:1 compression ratio.

Training autoencoders is difficult for two main reasons:
(1) Simple L2 pixel loss produces blurry reconstructions because it cannot capture perceptual similarity—two images can have low pixel error yet look very different to humans. This has led to a proliferation of perceptual losses: VGG-based LPIPS[13], adversarial GAN losses[21], and more recently DINO-based losses[11]. For a deeper discussion on perceptual loss design, see our analysis.
(2) A fundamental tension exists between reconstruction quality (rFID) and generation quality (gFID). Lower compression ratios (more latent dimensions) enable better reconstruction but create more complex latent spaces that are harder for diffusion models to learn. Higher compression simplifies the generative task but limits reconstruction fidelity. Finding the right balance is key to practical latent diffusion systems.

CNN vs ViT Autoencoders

CNN-based VAEs have been the production standard since the original Stable Diffusion. SD-VAE[2] and its successor SDXL-VAE use convolutional encoders and decoders with 8× spatial downsampling. Flux VAE[4] extends this to 16 channels for richer latent representations. Their key advantage is translation invariance—CNNs generalize robustly across resolutions and aspect ratios, even when trained only at 256px. However, CNN architectures are limited in compression: most use 8× spatial reduction, yielding 1024 tokens for a 256px image.

Hybrid CNN-ViT VAEs push beyond 8× spatial reduction. DC-AE[6] achieves up to 128× compression using residual autoencoding with EfficientViT blocks in a primarily convolutional design. Cosmos Tokenizer[28] combines 3D convolutions with spatio-temporal attention for unified image/video tokenization. OmniTokenizer[27] uses transformer-based spatial-temporal decoupling.

ViT-based VAEs have emerged more recently. TiTok[5] compresses images to just 32 tokens using a 1D latent representation. GigaTok[7] scales pure ViT tokenizers to 3 billion parameters. AToken[29] introduces a unified tokenizer achieving both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets using 4D RoPE and adversarial-free training. Concurrent work RAE[12] explores aligning pretrained representation encoders (SigLIP2, DINOv2, MAE) with learned decoders, showing frozen encoders can serve as strong diffusion latent spaces—though without exploring native resolution support in detail.

The challenge with ViT-VAEs: they typically require GAN losses for perceptual quality, introducing training instability. More critically, vanilla ViTs struggle with resolution generalization—we discuss this limitation and our solution in the following sections.

ViTok-v1 Findings

Our prior work, ViTok-v1[8], introduced a simple continuous ViT-VAE where each patch token maps directly to a latent vector. Key findings:

  • Asymmetric scaling: Decoder capacity drives reconstruction quality; encoders can remain relatively lightweight without performance loss
  • Reconstruction–generation trade-off: More latent channels improve reconstruction monotonically, but generation quality (gFID) is parabolic—too many channels create distributions that diffusion models struggle to learn
  • Token efficiency: ViT-VAEs can achieve comparable reconstruction quality to CNN-VAEs with far fewer tokens (e.g., 256 vs 1024), enabling faster diffusion training

Limitations: ViTok-v1 exhibited the resolution generalization problem common to ViT architectures. Unlike CNNs which generalize naturally due to translation equivariance, vanilla ViTs fail catastrophically at higher resolutions—models trained at 256px produce severe grid artifacts when evaluated at 512px or with non-square aspect ratios. The learned positional embeddings simply do not extrapolate to unseen positions, causing the model to hallucinate patch boundaries and produce blocky artifacts. Additionally, ViTok-v1's largest model (302M parameters) left open questions about billion-scale behavior, and like other ViT-VAEs, it relied on GAN losses for competitive perceptual metrics, introducing training instability.

ViTok-v2: Our Approach

We build upon ViTok-v1 with two key improvements:

  1. NaFlex Resolution Flexibility: We integrate the NaFlex data pipeline[9,10], which resizes images while preserving aspect ratio and pads to patch boundaries rather than distorting via crop or stretch. Combined with 2D RoPE positional embeddings[14], models trained at 256px generalize to 512px, 1024px, and beyond without fine-tuning.
  2. Decoder Scaling with Stable Perceptual Losses: We scale decoders to 4.5B parameters and replace unstable GAN objectives with a DINOv3[11] perceptual loss, achieving competitive reconstruction quality with stable, single-stage training.

Model Architecture

Our models use an asymmetric encoder-decoder design with relatively lightweight encoders and deep decoders. Model names refer to the approximate total parameter count:

Model Name | Encoder | Decoder | Total
350M | 51M | 303M | ~354M
5B | 463M | 4.5B | ~5B
ViTok-v2 method overview. Left: Input images are padded (gray) to the nearest patch multiple, preserving native resolution and aspect ratio. Middle: 2D RoPE embeddings encode patch positions. The encoder projects to c latent channels, normalized to N(0,1) via LayerNorm. The decoder (4.5B params) can use sliding window attention for memory efficiency. Right: Random crops are used to compute DINOv3 perceptual loss and SSIM, while Charbonnier loss is computed over the full image. Our approach is entirely GAN-free for stable training.
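To make the objective concrete, below is a minimal sketch of the GAN-free loss described in the caption. It assumes a hypothetical `dino_features` callable that returns patch features from a frozen DINOv3 backbone; the SSIM term is omitted, and the crop size and loss weights are illustrative rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def charbonnier(x, y, eps=1e-3):
    # Smooth L1-style pixel loss, computed over the full (padded) image.
    return torch.sqrt((x - y) ** 2 + eps ** 2).mean()

def random_crop_pair(x, y, size=224):
    # Sample the same random crop from reconstruction and target (assumes H, W >= size).
    _, _, h, w = x.shape
    top = torch.randint(0, h - size + 1, (1,)).item()
    left = torch.randint(0, w - size + 1, (1,)).item()
    return x[..., top:top + size, left:left + size], y[..., top:top + size, left:left + size]

def perceptual_loss(recon, target, dino_features):
    # dino_features: hypothetical callable returning features of a frozen DINOv3 model.
    crop_r, crop_t = random_crop_pair(recon, target)
    with torch.no_grad():
        feat_t = dino_features(crop_t)
    feat_r = dino_features(crop_r)
    return F.mse_loss(feat_r, feat_t)

def total_loss(recon, target, dino_features, w_pix=1.0, w_perc=0.5):
    # Illustrative weighting; no adversarial term anywhere in the objective.
    return w_pix * charbonnier(recon, target) + w_perc * perceptual_loss(recon, target, dino_features)
```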

Compression Configurations

Config | Spatial Factor | Channels | Compression Ratio | Tokens (256px) | Tokens (512px)
Baselines
SD VAE (f8x4) | 8× | 4 | 48:1 | 1024 | 4096
Flux VAE (f8x16) | 8× | 16 | 12:1 | 1024 | 4096
ViTok-v2 (Ours)
f16x64 (our best model) | 16× | 64 | 12:1 | 256 | 1024
f16x32 | 16× | 32 | 24:1 | 256 | 1024
f16x16 | 16× | 16 | 48:1 | 256 | 1024
f32x64 | 32× | 64 | 48:1 | 64 | 256
f32x128 | 32× | 128 | 24:1 | 64 | 256

Compression Ratio = (H × W × 3) / (H/f × W/f × C). Fewer tokens = cheaper DiT inference and faster training (usually at the cost of diffusability).
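The formula can be checked directly. The small helper below (illustrative, not from the released code) reproduces the token counts and ratios in the table above:

```python
def latent_shape(h, w, f, c):
    """Token count and compression ratio for spatial factor f and c latent channels."""
    tokens = (h // f) * (w // f)
    compression = (h * w * 3) / (tokens * c)
    return tokens, compression

# ViTok f16x64 at 256px: 256 tokens, 12:1 compression
print(latent_shape(256, 256, 16, 64))   # (256, 12.0)
# Flux-style f8x16 at 256px: 1024 tokens, 12:1 compression
print(latent_shape(256, 256, 8, 16))    # (1024, 12.0)
```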

NaFlex: Resolution and Aspect-Ratio Flexibility

A key limitation of ViT-based autoencoders is their poor generalization to resolutions and aspect ratios outside their training distribution. We integrate the NaFlex data pipeline[9,10]:

  1. Preserve native aspect ratio: Images are resized so the longest side fits within a budget (e.g., 256px), keeping the original aspect ratio intact.
  2. Gray-pad to patch boundary: Pad with gray (neutral value) to make dimensions divisible by patch size, rather than distorting via resize.
  3. Patchify with spatial coordinates: Each patch gets explicit (row, col) coordinates stored in the batch metadata.
  4. 2D RoPE positional encoding[14]: Rotary embeddings encode patch positions as a continuous function of coordinates—enabling generalization to unseen positions.
  5. Attention masking: Padded regions are masked out during attention so the model ignores padding patches.
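The pipeline above can be sketched as follows. This is an illustrative reimplementation under our own assumptions (gray value 0.5, bilinear resize, a fixed token budget of 256), not the released code:

```python
import torch
import torch.nn.functional as F

def naflex_preprocess(img, max_side=256, patch=16, pad_value=0.5):
    """Steps 1-3: aspect-preserving resize, gray padding, patchify with coords.
    img: (3, H, W) tensor in [0, 1]; returns patches and (row, col) coords."""
    _, h, w = img.shape
    # 1. Resize so the longest side fits the budget, keeping the aspect ratio.
    scale = max_side / max(h, w)
    nh, nw = max(patch, round(h * scale)), max(patch, round(w * scale))
    img = F.interpolate(img[None], size=(nh, nw), mode="bilinear", align_corners=False)[0]
    # 2. Gray-pad up to the next patch multiple instead of cropping or stretching.
    ph, pw = -nh % patch, -nw % patch
    img = F.pad(img, (0, pw, 0, ph), value=pad_value)
    # 3. Patchify and keep explicit (row, col) coordinates for each patch.
    gh, gw = img.shape[1] // patch, img.shape[2] // patch
    patches = img.unfold(1, patch, patch).unfold(2, patch, patch)   # (3, gh, gw, p, p)
    patches = patches.permute(1, 2, 0, 3, 4).reshape(gh * gw, -1)   # (gh*gw, 3*p*p)
    rows, cols = torch.meshgrid(torch.arange(gh), torch.arange(gw), indexing="ij")
    coords = torch.stack([rows.flatten(), cols.flatten()], dim=-1)  # (gh*gw, 2)
    return patches, coords

def pack_batch(items, token_budget=256):
    """Step 5: pad variable-length patch sequences to a fixed token budget and
    build the attention mask that hides padding tokens."""
    dim = items[0][0].shape[-1]
    tokens = torch.zeros(len(items), token_budget, dim)
    coords = torch.zeros(len(items), token_budget, 2, dtype=torch.long)
    mask = torch.zeros(len(items), token_budget, dtype=torch.bool)
    for i, (p, c) in enumerate(items):
        n = p.shape[0]
        tokens[i, :n], coords[i, :n], mask[i, :n] = p, c, True
    return tokens, coords, mask
```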
Patch boundary artifacts under different training regimes. We compare three training configurations: fixed 256p square crops, 256-token NaFlex (variable aspect ratio), and 1024-token NaFlex (higher resolution with variable aspect ratio). (a) Ground truth image at native aspect ratio. (b) Fixed 256p Square training produces visible grid artifacts at patch boundaries when evaluated on non-square aspect ratios. (c) 256 Token NaFlex eliminates these artifacts by exposing the model to diverse aspect ratios during training. (d) 1024 Token NaFlex further improves quality with higher-resolution training. Insets show zoomed regions highlighting the patch boundary effects.

Key benefits: Models trained at 256px generalize to 512px, 1024px, and beyond without fine-tuning. Small high-resolution finetunes further improve quality. The example image above is 1024×1536, demonstrating that even the finetuned model generalizes well to unseen resolutions and aspect ratios.
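Step 4's 2D RoPE can be written as an axial variant of standard RoPE: half of each channel group is rotated by the row coordinate and the other half by the column coordinate. Because position enters only through these continuous rotations, the same formula applies to coordinates never seen during training, which is what enables extrapolation to larger grids. The sketch below uses a common frequency schedule; ViTok-v2's exact schedule is an assumption here.

```python
import torch

def rope_2d(q, coords, base=10000.0):
    """Axial 2D RoPE sketch (applied identically to queries and keys).
    q: (..., tokens, dim) with dim divisible by 4; coords: (tokens, 2) long."""
    dim = q.shape[-1]
    half = dim // 2
    freqs = base ** (-torch.arange(0, half, 2, dtype=torch.float32) / half)  # (half/2,)

    def rotate(x, pos):
        angles = pos[:, None].float() * freqs[None, :]           # (tokens, half/2)
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out

    # First half of channels is rotated by the row index, second half by the column index.
    q_row, q_col = q[..., :half], q[..., half:]
    return torch.cat([rotate(q_row, coords[:, 0]), rotate(q_col, coords[:, 1])], dim=-1)
```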

Decoder Scaling

To investigate how decoder capacity affects reconstruction at different compression levels, we trained a suite of ViTok models using only L1 pixel loss—no perceptual losses, no adversarial training. This setup isolates pure scaling behavior.

Recent work like GigaTok[7] scales ViT tokenizers to 3 billion parameters. We push further, training decoders from 88M (B-scale) to 4.5B parameters (T-scale) across compression ratios from 12:1 to 48:1.

Decoder scaling ablation (L1 loss only). Each solid line represents a different decoder scale: gray = Bd4-B (88M), purple = Ld4-L (302M), blue = Gd4-G (1.1B), red = Td4-T (4.5B). Dotted lines show the effect of doubling encoder depth. (a, b) PSNR and SSIM improve consistently with scale. (c, d) rFID and rFDD also improve, but remain high even at T-scale—L1 loss alone is insufficient for perceptual quality.

Key findings: Scaling the decoder consistently improves all metrics, but the benefits are most pronounced at aggressive compression. At 12:1 compression, the gap between L (302M decoder) and T (4.5B decoder) is modest. At 48:1, the gap widens dramatically: rFID improves from 12.2 to 2.3, and rFDD improves from 5.2 to 2.0. This makes intuitive sense: larger patches require more capacity to decode.

However, even our largest T-scale model achieves rFID of only ~5 and rFDD of ~5 with L1 loss alone—far from competitive with GAN-trained baselines. This motivates our use of DINOv3 perceptual loss, which dramatically improves perceptual metrics without adversarial instability (see our detailed analysis).

COCO Reconstruction Evaluation

We evaluate reconstruction on the MS-COCO validation set against reproduced baselines (FLUX.2 VAE, Qwen VAE[18], SD VAE).

Metrics: Enc/Dec = encoder/decoder parameters; rFID/rFDD = reconstruction FID/FDD (lower is better); PSNR/SSIM = pixel-level metrics (higher is better); ms/img = latency per image (lower is better), A100-80GB, batch 500, compiled.
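For reference, PSNR follows its standard definition over images scaled to [0, 1] (this is a generic formula, not code from the evaluation harness):

```python
import torch

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images in [0, max_val]; higher is better."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```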

Model | Enc | Dec | Compression | Tokens | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ | ms/img ↓
12:1 Compression
FLUX.2 VAE | 34M | 49M | f8x16 | 1024 | 1.12 | 1.46 | 31.48 | 0.900 | 2.14
Qwen VAE | 106M | 393M | f8x16 | 1024 | 1.71 | 3.79 | 29.12 | 0.849 | 75.96
5B-f16x64 | 463M | 4.5B | f16x64 | 256 | 2.98 | 4.28 | 34.05 | 0.930 | 3.59
350M-f16x64 | 51M | 303M | f16x64 | 256 | 3.73 | 5.62 | 32.83 | 0.918 | 0.54
5B-f32x256 | 463M | 4.5B | f32x256 | 64 | 6.10 | 7.50 | 31.47 | 0.899 | 0.91
24:1 Compression
SDXL VAE | 34M | 49M | f8x4 | 1024 | 4.16 | 9.59 | 25.78 | 0.740 | 4.22
5B-f16x32 | 463M | 4.5B | f16x32 | 256 | 4.72 | 5.37 | 31.08 | 0.878 | 3.59
350M-f16x32 | 51M | 303M | f16x32 | 256 | 6.60 | 8.35 | 30.41 | 0.866 | 0.54
48:1 Compression
SD VAE | 34M | 49M | f8x4 | 1024 | 4.38 | 12.10 | 25.42 | 0.715 | 4.24
5B-f16x16 | 463M | 4.5B | f16x16 | 256 | 6.66 | 7.15 | 28.26 | 0.807 | 3.62
5B-f32x128 | 463M | 4.5B | f32x128 | 64 | 8.93 | 10.08 | 29.01 | 0.838 | 0.89
350M-f16x16 | 51M | 303M | f16x16 | 256 | 10.06 | 11.79 | 27.92 | 0.795 | 0.54
5B-f32x64 | 463M | 4.5B | f32x64 | 64 | 11.07 | 15.96 | 26.51 | 0.754 | 0.90
96:1 Compression
DC-AE-f32 | 19M | 72M | f32x32 | 64 | 5.11 | 16.63 | 23.13 | 0.625 | 5.72
DC-AE-f64 | 43M | 310M | f64x128 | 16 | 5.95 | 16.04 | 23.31 | 0.632 | 4.55

ViTok f16 models use 4× fewer tokens than f8 baselines (256 vs 1024 at 256p), enabling faster DiT training. Latency measured with torch.compile on H100, adm_center crop, 5000 samples (batch 500 @ 256p, batch 125 @ 512p).

DIV8K High-Resolution Evaluation

We evaluate on DIV8K at high resolutions with native aspect ratios. At 1024p and 2048p, our models run at comparable speed to baselines while achieving state-of-the-art reconstruction quality. At 4096p and above, baseline VAEs either run out of memory or are extremely slow (~20-170 seconds/image), while ViTok models run 15-30× faster without OOM issues.

Model | Enc | Dec | Compression | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ | ms/img ↓
12:1 Compression
FLUX.2 VAE | 34M | 49M | f8x16 | 0.90 | 0.44 | 31.44 | 0.908 | 107.6
Qwen VAE | 106M | 393M | f8x16 | 1.50 | 1.28 | 28.84 | 0.845 | 237.3
5B-f16x64 | 463M | 4.5B | f16x64 | 0.35 | 0.89 | 33.99 | 0.932 | 207.4
350M-f16x64 | 51M | 303M | f16x64 | 0.44 | 1.30 | 32.78 | 0.918 | 11.98
5B-f32x256 | 463M | 4.5B | f32x256 | 0.52 | 1.67 | 31.42 | 0.899 | 15.54
24:1 Compression
SDXL VAE | 34M | 49M | f8x8 | 4.79 | 3.53 | 25.96 | 0.731 | 128.2
5B-f16x32 | 463M | 4.5B | f16x32 | 1.21 | 2.16 | 30.98 | 0.874 | 70.81
350M-f16x32 | 51M | 303M | f16x32 | 1.63 | 3.62 | 30.32 | 0.861 | 12.01
5B-f32x128 | 463M | 4.5B | f32x128 | 1.68 | 3.13 | 29.05 | 0.834 | 15.23
48:1 Compression
SD VAE | 34M | 49M | f8x4 | 5.59 | 4.39 | 25.58 | 0.707 | 110.1
5B-f16x16 | 463M | 4.5B | f16x16 | 3.05 | 3.17 | 28.45 | 0.802 | 71.41
5B-f32x64 | 463M | 4.5B | f32x64 | 4.69 | 6.14 | 26.96 | 0.754 | 15.21
350M-f16x16 | 51M | 303M | f16x16 | 4.83 | 6.96 | 28.11 | 0.788 | 11.88
96:1 Compression
DC-AE-f64 | 43M | 310M | f64x128 | 6.52 | 4.33 | 24.09 | 0.648 | 422.1
DC-AE-f32 | 19M | 72M | f32x32 | 7.08 | 4.88 | 23.71 | 0.636 | 334.5

DIV8K evaluation with native aspect ratio crops. Latency measured on H100 (batch size 2 for 4096p, 1 for 8192p). At 4096p+, baseline VAEs are too slow or OOM, while ViTok models process images 15-30× faster without memory issues.
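One reason the ViTok decoder stays tractable at these resolutions is the sliding-window attention mentioned in the method overview. The sketch below illustrates the idea by restricting each patch token to a local neighborhood in (row, col) space; the window size is an assumption, and a practical implementation would use a block-sparse or windowed attention kernel that exploits this structure rather than materializing a dense token-by-token mask.

```python
import torch

def local_attention_mask(coords, window=8):
    """Boolean mask (tokens, tokens): token i may attend to token j only if their
    patch coordinates differ by at most `window` rows and columns. With a
    sparsity-aware attention kernel, cost then grows with the window rather
    than with the full image."""
    d_row = (coords[:, None, 0] - coords[None, :, 0]).abs()
    d_col = (coords[:, None, 1] - coords[None, :, 1]).abs()
    return (d_row <= window) & (d_col <= window)
```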

Visual Comparison Tool

Compare reconstructions across models with our interactive viewer. Supports magnification, error heatmaps, and side-by-side comparison.


DiT Training Efficiency

To study the reconstruction–generation trade-off, we trained Diffusion Transformers (DiT)[17] at two scales (450M DiT-L and 1.2B DiT-G) using flow matching[15] on ImageNet-22K for 300 epochs (batch size 4096, learning rate 1e-3, cosine schedule). All models use ViTok f16 compression (256 tokens per 256px image) with different channel configurations (16ch, 32ch, 64ch). We compare the performance of each DiT size with both VAE decoder sizes (5B Td4-T vs 302M Ld4-L) to isolate the effect of reconstruction quality on generation.
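For context, a single flow-matching training step on the autoencoder latents might look like the sketch below. The linear interpolant and uniform time sampling are common choices rather than the paper's exact recipe, and the `dit(x_t, t, labels)` signature is a hypothetical stand-in for the actual model interface.

```python
import torch

def flow_matching_step(dit, latents, labels):
    """One flow-matching training step on autoencoder latents (sketch).
    Uses the common linear interpolant x_t = (1 - t) * x + t * eps with
    velocity target eps - x; schedule and loss weighting are assumptions."""
    eps = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)
    t_ = t.view(-1, *([1] * (latents.dim() - 1)))     # broadcast t over latent dims
    x_t = (1.0 - t_) * latents + t_ * eps
    v_target = eps - latents
    v_pred = dit(x_t, t, labels)                      # hypothetical DiT call signature
    return torch.mean((v_pred - v_target) ** 2)
```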

Reconstruction vs Generation Metrics

Reconstruction vs generation quality. (a) rFID vs gFID. (b) rFDD vs gFDD. Blue: 5B VAE (256p f16). Red: 302M VAE (256p f16). Green: 5B VAE (512p f32). Circles: 1.2B DiT. Squares: 450M DiT. Diamonds: 512p f32 experiments. Labels indicate channel count.

Key Takeaways: Channel count and DiT size interact—smaller DiT models (450M) work better with lower channel counts (16ch, 32ch), while larger DiT models (1.2B) can leverage higher channel counts (64ch) given sufficient training. There is a clear FID vs FDD trade-off: 64ch achieves the best gFID at convergence, but 16ch maintains better gFDD throughout training, suggesting lower-dimensional latents produce more diverse generations. Decoder capacity matters significantly for generation—the 5B VAE decoder outperforms the 302M decoder by 1–2 gFID points at convergence, with the gap widening as training progresses. Our 256-token f16 configuration matches the quality of 1024-token f8 baselines at 4× lower training compute.

512p f32 compression: We also explored 32× spatial downsampling for higher-resolution generation. This reduces token count to just 256 tokens for 512p images (vs 1024 tokens for f16 at 512p), enabling efficient high-resolution generation while maintaining quality. The 512p f32 results (green diamonds) show 128ch achieves better rFID (0.32 vs 1.1) but 64ch achieves better gFDD (9.4 vs 15.6)—the classic reconstruction-generation trade-off.

Limitations

  • Texture quality: Flux models still produce better quality textures, likely due to their use of generative/adversarial losses during training
  • High compression challenges: Extreme compression configurations (e.g., f32x256) failed to train stable DiT models in our experiments
  • Decoder inference cost: 4.5B decoder is slower than smaller VAEs (acceptable since it only runs once per generated image)
  • Video compression: Not yet supported—extending to video auto-encoding is future work
  • CFG saturation artifacts: ViTok latents can produce oversaturated samples when using standard classifier-free guidance (CFG). This manifests as overly contrasty, saturated colors (see example below). The issue can be mitigated using CFG interval[31]—applying guidance only between 10-90% of the sampling process—combined with rescaled CFG (scaling CFG output by the ratio of conditional to unconditional standard deviations). Radial CFG may also provide better results and is worth exploring.
CFG saturation artifacts. Standard CFG with ViTok latents can produce oversaturated, overly contrasty images. CFG interval[31] (applying guidance only at 10-90% of sampling) and rescaled CFG help restore natural color distributions.
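A minimal sketch of the mitigation described above: guidance is applied only inside a limited time interval[31], and the guided output is rescaled toward the conditional prediction's statistics (one common formulation of rescaled CFG). The guidance scale, interval bounds, and rescale factor below are illustrative values, not the settings used in our experiments.

```python
import torch

def guided_velocity(v_cond, v_uncond, t, scale=4.0, lo=0.1, hi=0.9, rescale=0.7):
    """CFG with a limited interval and rescaling (sketch).
    t: scalar sampling progress in [0, 1]; v_cond / v_uncond: model outputs."""
    if not (lo < t < hi):
        # Outside the interval, skip guidance and use the conditional prediction.
        return v_cond
    v_cfg = v_uncond + scale * (v_cond - v_uncond)
    # Rescale the guided output toward the conditional prediction's standard
    # deviation to counteract oversaturation, then blend with the raw CFG output.
    dims = list(range(1, v_cond.dim()))
    std_cond = v_cond.std(dim=dims, keepdim=True)
    std_cfg = v_cfg.std(dim=dims, keepdim=True)
    v_rescaled = v_cfg * (std_cond / (std_cfg + 1e-8))
    return rescale * v_rescaled + (1.0 - rescale) * v_cfg
```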

References

  1. Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239
  2. Rombach, Blattmann, Lorenz, Esser, Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
  3. Betker et al. Improving Image Generation with Better Captions. OpenAI 2023. Paper
  4. Black Forest Labs. FLUX.1. 2024. Website
  5. Yu et al. An Image is Worth 32 Tokens for Reconstruction and Generation. NeurIPS 2024. arXiv:2406.07550
  6. Chen et al. Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. 2024. arXiv:2410.10733
  7. Xiong et al. GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters. 2025. arXiv:2504.02803
  8. Hansen-Estruch et al. Learnings from Scaling Visual Tokenizers for Reconstruction and Generation. 2025. arXiv:2501.09755
  9. Dehghani et al. Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. NeurIPS 2023. arXiv:2307.06304
  10. Tschannen et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding. 2025. arXiv:2502.14786
  11. Simeoni et al. DINOv3. 2025. arXiv:2508.10104
  12. Yao, Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. 2025. arXiv:2501.01423
  13. Zhang et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR 2018. arXiv:1801.03924
  14. Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. arXiv:2104.09864
  15. Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
  16. Beltagy, Peters, Cohan. Longformer: The Long-Document Transformer. 2020. arXiv:2004.05150
  17. Peebles, Xie. Scalable Diffusion Models with Transformers. ICCV 2023. arXiv:2212.09748
  18. Wu et al. Qwen-Image Technical Report. 2025. arXiv:2508.02324
  19. Saharia et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS 2022. arXiv:2205.11487
  20. OpenAI. Sora: Creating video from text. 2024. Website
  21. Esser et al. Taming Transformers for High-Resolution Image Synthesis. CVPR 2021. arXiv:2012.09841
  22. van den Oord et al. Neural Discrete Representation Learning. NeurIPS 2017. arXiv:1711.00937
  23. Yu et al. Vector-quantized Image Modeling with Improved VQGAN. ICLR 2022. arXiv:2110.04627
  24. Chang et al. MaskGIT: Masked Generative Image Transformer. CVPR 2022. arXiv:2202.04200
  25. Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. ICLR 2024. arXiv:2310.05737
  26. Luo et al. Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation. 2024. arXiv:2409.04410
  27. Wang et al. OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation. 2024. arXiv:2406.09399
  28. Agarwal et al. Cosmos Tokenizer: A suite of image and video neural tokenizers. NVIDIA 2024. GitHub
  29. Lu, Song et al. AToken: A Unified Tokenizer for Vision. 2025. arXiv:2509.14476
  30. Kolesnikov et al. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes. NeurIPS 2022. arXiv:2205.10337
  31. Kynkäänniemi et al. Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models. NeurIPS 2024. arXiv:2404.07724

Authors & Citation

Authors: Philippe Hansen-Estruch, Jiahui Chen, Vivek Ramanujan, Orr Zohar, Yan Ping, Animesh Sinha, Markos Georgopoulos, Edgar Schoenfeld, Ji Hou, Felix Juefei-Xu, Sriram Vishwanath, Ali Thabet

If you find this work useful, please cite:

@article{hansenestruch2025vitokv2,
  title={ViTok-v2: Scaling Vision Transformer Tokenizers to 4.5 Billion Parameters},
  author={Hansen-Estruch, Philippe and Chen, Jiahui and Ramanujan, Vivek and Zohar, Orr and Ping, Yan and Sinha, Animesh and Georgopoulos, Markos and Schoenfeld, Edgar and Hou, Ji and Juefei-Xu, Felix and Vishwanath, Sriram and Thabet, Ali},
  journal={arXiv preprint},
  year={2025}
}