Latent Diffusion and Flow Models
Diffusion models[1] and flow-based models[15] have emerged as the dominant paradigm for image generation, powering systems like Stable Diffusion[2], DALL·E 3[3], Flux[4], Imagen[19], and Sora[20]. Operating directly in pixel space is computationally prohibitive at high resolutions, so central to these pipelines is the autoencoder, which maps the high-dimensional pixel space into a compact, "diffusable" latent space. Each autoencoder is characterized by its compression ratio—the ratio of input dimensions to latent dimensions, where the latent size is typically expressed as a number of tokens times channels per token. For example, 8× spatial downsampling with 16 channels turns a 256px image into 1024 tokens × 16 channels, a 12:1 compression ratio.
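The arithmetic can be sketched in a couple of lines (`latent_shape` is an illustrative helper, not part of any released code):

```python
def latent_shape(h: int, w: int, f: int, c: int) -> tuple[int, float]:
    """Token count and compression ratio for f-times spatial
    downsampling with c latent channels per token."""
    tokens = (h // f) * (w // f)
    ratio = (h * w * 3) / (tokens * c)  # input dims / latent dims
    return tokens, ratio

# 8x downsampling of a 256px image with 16 channels:
# 32 * 32 = 1024 tokens, (256*256*3)/(1024*16) = 12:1 compression
print(latent_shape(256, 256, f=8, c=16))
```

Plugging in the f16x64 configuration used later (16× downsampling, 64 channels) gives 256 tokens at the same 12:1 ratio.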
Training autoencoders is difficult for two main reasons:
(1) A simple L2 pixel loss produces blurry reconstructions because it cannot capture perceptual similarity—two images can have low pixel error yet look very different to humans. This has led to a proliferation of perceptual losses: VGG-based LPIPS[13], adversarial GAN losses[21], and more recently DINO-based losses[11]. For a deeper discussion of perceptual loss design, see our analysis.
(2) A fundamental tension exists between reconstruction quality (rFID) and generation quality (gFID).
Lower compression ratios (more latent dimensions) enable better reconstruction but create more complex latent spaces
that are harder for diffusion models to learn. Higher compression simplifies the generative task but limits reconstruction fidelity.
Finding the right balance is key to practical latent diffusion systems.
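Schematically, the perceptual losses above all compare images in the feature space of a frozen network rather than in pixel space. In the sketch below, a tiny random conv stack stands in for the pretrained backbone (VGG for LPIPS, DINO for DINO-based losses); it is illustrative only:

```python
import torch
import torch.nn.functional as F

# Schematic feature-space perceptual loss. The random conv stack below is
# purely a stand-in for a frozen pretrained backbone (VGG for LPIPS[13],
# DINO for DINO-based losses[11]).
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 32, 3, stride=2, padding=1), torch.nn.ReLU(),
    torch.nn.Conv2d(32, 64, 3, stride=2, padding=1),
).eval()
for p in backbone.parameters():
    p.requires_grad_(False)  # perceptual backbones stay frozen

def perceptual_loss(x, y):
    # Distance in feature space, not pixel space: two images can be close
    # pixel-wise yet look very different, and vice versa.
    fx = F.normalize(backbone(x), dim=1)  # unit-normalize channels (as LPIPS does)
    fy = F.normalize(backbone(y), dim=1)
    return (fx - fy).pow(2).mean()
```

With a real pretrained backbone, this distance tracks human similarity judgments far better than raw L2 on pixels.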
CNN vs ViT Autoencoders
CNN-based VAEs have been the production standard since the original Stable Diffusion. SD-VAE[2] and its successor SDXL-VAE use convolutional encoders and decoders with 8× spatial downsampling. Flux VAE[4] extends this to 16 channels for richer latent representations. Their key advantage is translation equivariance—CNNs generalize robustly across resolutions and aspect ratios, even when trained only at 256px. However, CNN architectures are limited in compression: most use 8× spatial reduction, yielding 1024 tokens for a 256px image.
Hybrid CNN-ViT VAEs push beyond 8× spatial reduction. DC-AE[6] achieves up to 128× compression using residual autoencoding with EfficientViT blocks in a primarily convolutional design. Cosmos Tokenizer[28] combines 3D convolutions with spatio-temporal attention for unified image/video tokenization. OmniTokenizer[27] uses transformer-based spatial-temporal decoupling.
ViT-based VAEs have emerged more recently. TiTok[5] compresses images to just 32 tokens using a 1D latent representation. GigaTok[7] scales pure ViT tokenizers to 3 billion parameters. AToken[29] introduces a unified tokenizer achieving both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets using 4D RoPE and adversarial-free training. Concurrent work RAE[12] explores aligning pretrained representation encoders (SigLIP2, DINOv2, MAE) with learned decoders, showing frozen encoders can serve as strong diffusion latent spaces—though without exploring native resolution support in detail.
The challenge with ViT-VAEs: they typically require GAN losses for perceptual quality, introducing training instability. More critically, vanilla ViTs struggle with resolution generalization—we discuss this limitation and our solution in the following sections.
ViTok-v1 Findings
Our prior work, ViTok-v1[8], introduced a simple continuous ViT-VAE where each patch token maps directly to a latent vector. Key findings:
- Asymmetric scaling: Decoder capacity drives reconstruction quality; encoders can remain relatively lightweight without performance loss
- Reconstruction–generation trade-off: More latent channels improve reconstruction monotonically, but generation quality (gFID) is parabolic—too many channels create distributions that diffusion models struggle to learn
- Token efficiency: ViT-VAEs can achieve comparable reconstruction quality to CNN-VAEs with far fewer tokens (e.g., 256 vs 1024), enabling faster diffusion training
Limitations: ViTok-v1 exhibited the resolution generalization problem common to ViT architectures. Unlike CNNs, which generalize naturally due to translation equivariance, vanilla ViTs fail catastrophically at higher resolutions—models trained at 256px produce severe grid artifacts when evaluated at 512px or with non-square aspect ratios. The learned positional embeddings simply do not extrapolate to unseen positions, causing the model to hallucinate patch boundaries and produce blocky artifacts. Additionally, ViTok-v1's largest model (302M parameters) left open questions about billion-scale behavior, and, like other ViT-VAEs, it relied on GAN losses for competitive perceptual metrics, introducing training instability.
ViTok-v2: Our Approach
We build upon ViTok-v1 with two key improvements:
- NaFlex Resolution Flexibility: enabling resolution generalization via NaFlex. We integrate the NaFlex data pipeline[9,10], which resizes images preserving aspect ratio, then pads to patch boundaries rather than distorting via crop or stretch. Combined with 2D RoPE positional embeddings[14], models trained at 256px generalize to 512px, 1024px, and beyond without fine-tuning.
- Decoder Scaling: stable training with modern perceptual losses. We scale decoders to 4.5B parameters and replace unstable GAN objectives with DINOv3[11] perceptual loss. This achieves competitive reconstruction quality with stable, single-stage training.
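A common construction of 2D RoPE, sketched here under the assumption that the usual axis-split recipe is used, applies 1D rotary embeddings to half of each token's channels with the row coordinate and to the other half with the column coordinate. Because the rotation angle is a continuous function of the coordinate, it is defined at positions never seen during training:

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D RoPE: rotate channel pairs of x by angles pos * freq.
    x: (tokens, d) with d even; pos: (tokens,) float coordinates."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    ang = pos[:, None] * freqs[None, :]          # (tokens, d/2)
    cos, sin = ang.cos(), ang.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x, rows, cols):
    """Axis-split 2D RoPE: half the channels encode the row coordinate,
    half the column coordinate (d must be divisible by 4)."""
    h = x.shape[-1] // 2
    return torch.cat([rope_1d(x[..., :h], rows), rope_1d(x[..., h:], cols)], dim=-1)
```

Queries and keys rotated this way make attention logits depend only on relative offsets (Δrow, Δcol), so positions unseen at 256px training remain well-defined at 512px and beyond.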
Model Architecture
Our models use an asymmetric encoder-decoder design with relatively lightweight encoders and deep decoders. Model names refer to the decoder size:
| Model Name | Encoder | Decoder | Total |
|---|---|---|---|
| 350M | 51M | 303M | ~354M |
| 5B | 463M | 4.5B | ~5B |
Compression Configurations
| Config | Spatial Factor | Channels | Compression Ratio | Tokens (256px) | Tokens (512px) |
|---|---|---|---|---|---|
| Baselines | |||||
| SD VAE (f8x4) | 8x | 4 | 48:1 | 1024 | 4096 |
| Flux VAE (f8x16) | 8x | 16 | 12:1 | 1024 | 4096 |
| ViTok-v2 (Ours) | |||||
| f16x64 (our best model) | 16x | 64 | 12:1 | 256 | 1024 |
| f16x32 | 16x | 32 | 24:1 | 256 | 1024 |
| f16x16 | 16x | 16 | 48:1 | 256 | 1024 |
| f32x64 | 32x | 64 | 48:1 | 64 | 256 |
| f32x128 | 32x | 128 | 24:1 | 64 | 256 |
Compression Ratio = (H × W × 3) / ((H/f) × (W/f) × C) = 3f²/C. Fewer tokens = cheaper DiT inference and faster training (usually at the cost of diffusability).
NaFlex: Resolution and Aspect-Ratio Flexibility
A key limitation of ViT-based autoencoders is their poor generalization to resolutions and aspect ratios outside their training distribution. We integrate the NaFlex data pipeline[9,10]:
- Preserve native aspect ratio: Images are resized so the longest side fits within a budget (e.g., 256px), keeping the original aspect ratio intact.
- Gray-pad to patch boundary: Pad with gray (neutral value) to make dimensions divisible by patch size, rather than distorting via resize.
- Patchify with spatial coordinates: Each patch gets explicit (row, col) coordinates stored in the batch metadata.
- 2D RoPE positional encoding[14]: Rotary embeddings encode patch positions as a continuous function of coordinates—enabling generalization to unseen positions.
- Attention masking: Padded regions are masked out during attention so the model ignores padding patches.
❗ Key benefits: Models trained at 256px generalize to 512px, 1024px, and beyond without fine-tuning, and small high-resolution finetunes further improve quality. The example image above is 1024×1536 px, demonstrating that even the finetuned model generalizes well to unseen resolutions and aspect ratios.
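The first three steps can be sketched as follows (`naflex_preprocess`, its defaults, and the bilinear resize are assumptions for illustration, not the released pipeline):

```python
import math
import torch
import torch.nn.functional as F

def naflex_preprocess(img, budget=256, patch=16, pad_value=0.5):
    """Sketch of NaFlex-style preprocessing: aspect-preserving resize, gray
    padding to a patch boundary, and per-patch (row, col) coordinates.
    Assumes img is a (C, H, W) float tensor in [0, 1]."""
    c, h, w = img.shape
    scale = budget / max(h, w)                       # fit longest side in budget
    nh, nw = round(h * scale), round(w * scale)
    img = F.interpolate(img[None], size=(nh, nw), mode="bilinear",
                        align_corners=False)[0]
    ph = math.ceil(nh / patch) * patch               # pad up to patch multiples
    pw = math.ceil(nw / patch) * patch
    img = F.pad(img, (0, pw - nw, 0, ph - nh), value=pad_value)  # neutral gray
    rows, cols = torch.meshgrid(torch.arange(ph // patch),
                                torch.arange(pw // patch), indexing="ij")
    coords = torch.stack([rows.flatten(), cols.flatten()], dim=-1)  # (tokens, 2)
    return img, coords
```

A 300×500 input with a 256px budget becomes a 160×256 image with 10×16 = 160 patch coordinates; fully gray padded patches would then be masked out of attention, per the last bullet.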
Decoder Scaling
To investigate how decoder capacity affects reconstruction at different compression levels, we trained a suite of ViTok models using only L1 pixel loss—no perceptual losses, no adversarial training—so that the effect of scale is studied in isolation.
Recent work like GigaTok[7] scales ViT tokenizers to 3 billion parameters. We push further, training decoders from 88M (B-scale) to 4.5B parameters (T-scale) across compression ratios from 12:1 to 48:1.
❗ Key findings: Scaling the decoder consistently improves all metrics, but the benefits are most pronounced at aggressive compression. At 12:1 compression, the gap between L (302M decoder) and T (4.5B decoder) is modest. At 48:1, the gap widens dramatically: rFID improves from 12.2 to 2.3, and rFDD improves from 5.2 to 2.0. This makes intuitive sense: larger patches require more capacity to decode.
However, even our largest T-scale model achieves rFID of only ~5 and rFDD of ~5 with L1 loss alone—far from competitive with GAN-trained baselines. This motivates our use of DINOv3 perceptual loss, which dramatically improves perceptual metrics without adversarial instability (see our detailed analysis).
COCO Reconstruction Evaluation
We evaluate reconstruction on the MS-COCO validation set against reproduced baselines (FLUX.2 VAE, Qwen VAE[18], SD VAE, SDXL VAE, and DC-AE[6]).
Metrics: Enc/Dec = encoder/decoder parameters; rFID/rFDD = reconstruction FID/FDD (lower is better); PSNR/SSIM = pixel-level metrics (higher is better); ms/img = latency per image (lower is better), A100-80GB, batch 500, compiled.
| Model | Enc | Dec | Compression | Tokens | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ | ms/img ↓ |
|---|---|---|---|---|---|---|---|---|---|
| 12:1 Compression | |||||||||
| FLUX.2 VAE | 34M | 49M | f8x16 | 1024 | 1.12 | 1.46 | 31.48 | 0.900 | 2.14 |
| Qwen VAE | 106M | 393M | f8x16 | 1024 | 1.71 | 3.79 | 29.12 | 0.849 | 75.96 |
| 5B-f16x64 | 463M | 4.5B | f16x64 | 256 | 2.98 | 4.28 | 34.05 | 0.930 | 3.59 |
| 350M-f16x64 | 51M | 303M | f16x64 | 256 | 3.73 | 5.62 | 32.83 | 0.918 | 0.54 |
| 5B-f32x256 | 463M | 4.5B | f32x256 | 64 | 6.10 | 7.50 | 31.47 | 0.899 | 0.91 |
| 24:1 Compression | |||||||||
| SDXL VAE | 34M | 49M | f8x8 | 1024 | 4.16 | 9.59 | 25.78 | 0.740 | 4.22 |
| 5B-f16x32 | 463M | 4.5B | f16x32 | 256 | 4.72 | 5.37 | 31.08 | 0.878 | 3.59 |
| 350M-f16x32 | 51M | 303M | f16x32 | 256 | 6.60 | 8.35 | 30.41 | 0.866 | 0.54 |
| 48:1 Compression | |||||||||
| SD VAE | 34M | 49M | f8x4 | 1024 | 4.38 | 12.10 | 25.42 | 0.715 | 4.24 |
| 5B-f16x16 | 463M | 4.5B | f16x16 | 256 | 6.66 | 7.15 | 28.26 | 0.807 | 3.62 |
| 5B-f32x128 | 463M | 4.5B | f32x128 | 64 | 8.93 | 10.08 | 29.01 | 0.838 | 0.89 |
| 350M-f16x16 | 51M | 303M | f16x16 | 256 | 10.06 | 11.79 | 27.92 | 0.795 | 0.54 |
| 5B-f32x64 | 463M | 4.5B | f32x64 | 64 | 11.07 | 15.96 | 26.51 | 0.754 | 0.90 |
| 96:1 Compression | |||||||||
| DC-AE-f32 | 19M | 72M | f32x32 | 64 | 5.11 | 16.63 | 23.13 | 0.625 | 5.72 |
| DC-AE-f64 | 43M | 310M | f64x128 | 16 | 5.95 | 16.04 | 23.31 | 0.632 | 4.55 |
ViTok f16 models use 4× fewer tokens than f8 baselines (256 vs 1024 at 256p), enabling faster DiT training. Latency measured with torch.compile on H100, adm_center crop, 5000 samples (batch 500 @ 256p, batch 125 @ 512p).
DIV8K High-Resolution Evaluation
We evaluate on DIV8K at high resolutions with native aspect ratios. At 1024p and 2048p, our models run at comparable speed to baselines while achieving state-of-the-art reconstruction quality. At 4096p and above, baseline VAEs either run out of memory or are extremely slow (~20-170 seconds/image), while ViTok models run 15-30× faster without OOM issues.
| Model | Enc | Dec | Compression | rFID ↓ | rFDD ↓ | PSNR ↑ | SSIM ↑ | ms/img ↓ |
|---|---|---|---|---|---|---|---|---|
| 12:1 Compression | ||||||||
| FLUX.2 VAE | 34M | 49M | f8x16 | 0.90 | 0.44 | 31.44 | 0.908 | 107.6 |
| Qwen VAE | 106M | 393M | f8x16 | 1.50 | 1.28 | 28.84 | 0.845 | 237.3 |
| 5B-f16x64 | 463M | 4.5B | f16x64 | 0.35 | 0.89 | 33.99 | 0.932 | 207.4 |
| 350M-f16x64 | 51M | 303M | f16x64 | 0.44 | 1.30 | 32.78 | 0.918 | 11.98 |
| 5B-f32x256 | 463M | 4.5B | f32x256 | 0.52 | 1.67 | 31.42 | 0.899 | 15.54 |
| 24:1 Compression | ||||||||
| SDXL VAE | 34M | 49M | f8x8 | 4.79 | 3.53 | 25.96 | 0.731 | 128.2 |
| 5B-f16x32 | 463M | 4.5B | f16x32 | 1.21 | 2.16 | 30.98 | 0.874 | 70.81 |
| 350M-f16x32 | 51M | 303M | f16x32 | 1.63 | 3.62 | 30.32 | 0.861 | 12.01 |
| 5B-f32x128 | 463M | 4.5B | f32x128 | 1.68 | 3.13 | 29.05 | 0.834 | 15.23 |
| 48:1 Compression | ||||||||
| SD VAE | 34M | 49M | f8x4 | 5.59 | 4.39 | 25.58 | 0.707 | 110.1 |
| 5B-f16x16 | 463M | 4.5B | f16x16 | 3.05 | 3.17 | 28.45 | 0.802 | 71.41 |
| 5B-f32x64 | 463M | 4.5B | f32x64 | 4.69 | 6.14 | 26.96 | 0.754 | 15.21 |
| 350M-f16x16 | 51M | 303M | f16x16 | 4.83 | 6.96 | 28.11 | 0.788 | 11.88 |
| 96:1 Compression | ||||||||
| DC-AE-f64 | 43M | 310M | f64x128 | 6.52 | 4.33 | 24.09 | 0.648 | 422.1 |
| DC-AE-f32 | 19M | 72M | f32x32 | 7.08 | 4.88 | 23.71 | 0.636 | 334.5 |
DIV8K evaluation with native aspect ratio crops. Latency measured on H100 (batch size 2 for 4096p, 1 for 8192p). At 4096p+, baseline VAEs are too slow or OOM, while ViTok models process images 15-30× faster without memory issues.
Visual Comparison Tool
Compare reconstructions across models (e.g., ViTok 5B-f16x64 vs. Flux VAE) with our interactive viewer, which supports magnification, error heatmaps, and side-by-side comparison.
DiT Training Efficiency
To study the reconstruction–generation trade-off, we trained Diffusion Transformers (DiT)[17] at two scales (450M DiT-L and 1.2B DiT-G) using flow matching[15] on ImageNet-22K for 300 epochs (batch size 4096, learning rate 1e-3, cosine schedule). All models use ViTok f16 compression (256 tokens per 256px image) with different channel configurations (16ch, 32ch, 64ch). We compare the performance of each DiT size with both VAE decoder sizes (5B Td4-T vs 302M Ld4-L) to isolate the effect of reconstruction quality on generation.
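One flow-matching training step on precomputed latents can be sketched as follows (the linear interpolation path and velocity target follow standard practice; `model` and the shapes are placeholders, not our exact training code):

```python
import torch

def flow_matching_loss(model, x0):
    """Schematic flow-matching training step on VAE latents.
    x0: clean latents of shape (batch, tokens, channels); `model` is a
    placeholder for the DiT, called as model(x_t, t)."""
    b = x0.shape[0]
    t = torch.rand(b, 1, 1)               # one uniform timestep per sample
    noise = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * noise         # linear path from data to noise
    target = noise - x0                   # velocity of that path
    pred = model(xt, t.flatten())         # DiT predicts the velocity field
    return (pred - target).pow(2).mean()
```

Sampling then integrates the learned velocity field from pure noise back to data, e.g. with a simple Euler scheme over t.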
Reconstruction vs Generation Metrics
Key Takeaways: Channel count and DiT size interact—smaller DiT models (450M) work better with lower channel counts (16ch, 32ch), while larger DiT models (1.2B) can leverage higher channel counts (64ch) given sufficient training. There is a clear FID vs FDD trade-off: 64ch achieves the best gFID at convergence, but 16ch maintains better gFDD throughout training, suggesting lower-dimensional latents produce more diverse generations. Decoder capacity matters significantly for generation—the 5B VAE decoder outperforms the 302M decoder by 1–2 gFID points at convergence, with the gap widening as training progresses. Our 256-token f16 configuration matches the quality of 1024-token f8 baselines at 4× lower training compute.
512p f32 compression: We also explored 32× spatial downsampling for higher-resolution generation. This reduces token count to just 256 tokens for 512p images (vs 1024 tokens for f16 at 512p), enabling efficient high-resolution generation while maintaining quality. The 512p f32 results (green diamonds) show 128ch achieves better rFID (0.32 vs 1.1) but 64ch achieves better gFDD (9.4 vs 15.6)—the classic reconstruction-generation trade-off.
Limitations
- Texture quality: Flux models still produce better quality textures, likely due to their use of generative/adversarial losses during training
- High compression challenges: Extreme compression configurations (e.g., f32x256) failed to train stable DiT models in our experiments
- Decoder inference cost: 4.5B decoder is slower than smaller VAEs (acceptable since it only runs once per generated image)
- Video compression: Not yet supported—extending to video auto-encoding is future work
- CFG saturation artifacts: ViTok latents can produce oversaturated samples when using standard classifier-free guidance (CFG). This manifests as overly contrasty, saturated colors (see example below). The issue can be mitigated using CFG interval[31]—applying guidance only between 10-90% of the sampling process—combined with rescaled CFG (scaling CFG output by the ratio of conditional to unconditional standard deviations). Radial CFG may also provide better results and is worth exploring.
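A minimal sketch of this mitigation, assuming a sampler with normalized time t in [0, 1] (the parameter names, interval endpoints, and blend factor are illustrative, not our exact settings):

```python
import torch

def guided_velocity(cond, uncond, t, scale=5.0, lo=0.1, hi=0.9, rescale=0.7):
    """Sketch of interval + rescaled CFG. cond/uncond are the conditional and
    unconditional model outputs at normalized time t in [0, 1]."""
    if not (lo <= t <= hi):
        return cond                        # no guidance outside the interval[31]
    guided = uncond + scale * (cond - uncond)
    # Rescaled CFG: pull the guided output's std back toward the conditional
    # std, counteracting the oversaturation described above.
    std_ratio = cond.std() / (guided.std() + 1e-8)
    return rescale * (guided * std_ratio) + (1 - rescale) * guided
```

Setting `rescale=0.0` recovers plain interval CFG, and widening `(lo, hi)` to `(0, 1)` recovers standard CFG.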
References
1. Ho, Jain, Abbeel. Denoising Diffusion Probabilistic Models. NeurIPS 2020. arXiv:2006.11239
2. Rombach, Blattmann, Lorenz, Esser, Ommer. High-Resolution Image Synthesis with Latent Diffusion Models. CVPR 2022. arXiv:2112.10752
3. Betker et al. Improving Image Generation with Better Captions. OpenAI 2023. Paper
4. Black Forest Labs. FLUX.1. 2024. Website
5. Yu et al. An Image is Worth 32 Tokens for Reconstruction and Generation. NeurIPS 2024. arXiv:2406.07550
6. Chen et al. Deep Compression Autoencoder for Efficient High-Resolution Diffusion Models. 2024. arXiv:2410.10733
7. Xiong et al. GigaTok: Scaling Visual Tokenizers to 3 Billion Parameters. 2025. arXiv:2504.02803
8. Hansen-Estruch et al. Learnings from Scaling Visual Tokenizers for Reconstruction and Generation. 2025. arXiv:2501.09755
9. Dehghani et al. Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution. NeurIPS 2023. arXiv:2307.06304
10. Tschannen et al. SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding. 2025. arXiv:2502.14786
11. Simeoni et al. DINOv3. 2025. arXiv:2508.10104
12. Yao, Wang. Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models. 2025. arXiv:2501.01423
13. Zhang et al. The Unreasonable Effectiveness of Deep Features as a Perceptual Metric. CVPR 2018. arXiv:1801.03924
14. Su et al. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021. arXiv:2104.09864
15. Lipman et al. Flow Matching for Generative Modeling. ICLR 2023. arXiv:2210.02747
16. Beltagy, Peters, Cohan. Longformer: The Long-Document Transformer. 2020. arXiv:2004.05150
17. Peebles, Xie. Scalable Diffusion Models with Transformers. ICCV 2023. arXiv:2212.09748
18. Wu et al. Qwen-Image Technical Report. 2025. arXiv:2508.02324
19. Saharia et al. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. NeurIPS 2022. arXiv:2205.11487
20. OpenAI. Sora: Creating video from text. 2024. Website
21. Esser et al. Taming Transformers for High-Resolution Image Synthesis. CVPR 2021. arXiv:2012.09841
22. van den Oord et al. Neural Discrete Representation Learning. NeurIPS 2017. arXiv:1711.00937
23. Yu et al. Vector-quantized Image Modeling with Improved VQGAN. ICLR 2022. arXiv:2110.04627
24. Chang et al. MaskGIT: Masked Generative Image Transformer. CVPR 2022. arXiv:2202.04200
25. Yu et al. Language Model Beats Diffusion – Tokenizer is Key to Visual Generation. ICLR 2024. arXiv:2310.05737
26. Luo et al. Open-MAGVIT2: An Open-Source Project Toward Democratizing Auto-regressive Visual Generation. 2024. arXiv:2409.04410
27. Wang et al. OmniTokenizer: A Joint Image-Video Tokenizer for Visual Generation. 2024. arXiv:2406.09399
28. Agarwal et al. Cosmos Tokenizer: A suite of image and video neural tokenizers. NVIDIA 2024. GitHub
29. Lu, Song et al. AToken: A Unified Tokenizer for Vision. 2025. arXiv:2509.14476
30. Kolesnikov et al. UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes. NeurIPS 2022. arXiv:2205.10337
31. Kynkäänniemi et al. Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models. NeurIPS 2024. arXiv:2404.07724
Authors & Citation
Authors: Philippe Hansen-Estruch, Jiahui Chen, Vivek Ramanujan, Orr Zohar, Yan Ping, Animesh Sinha, Markos Georgopoulos, Edgar Schoenfeld, Ji Hou, Felix Juefei-Xu, Sriram Vishwanath, Ali Thabet
If you find this work useful, please cite:
@article{hansenestruch2025vitokv2,
title={ViTok-v2: Scaling Vision Transformer Tokenizers to 4.5 Billion Parameters},
author={Hansen-Estruch, Philippe and Chen, Jiahui and Ramanujan, Vivek and Zohar, Orr and Ping, Yan and Sinha, Animesh and Georgopoulos, Markos and Schoenfeld, Edgar and Hou, Ji and Juefei-Xu, Felix and Vishwanath, Sriram and Thabet, Ali},
journal={arXiv preprint},
year={2025}
}