PiD: Pixel Diffusion Decoder
Yifan Lu,
Qi Wu,
Jay Zhangjie Wu,
Zian Wang,
Huan Ling,
Sanja Fidler,
Xuanchi Ren
PiD reformulates the latent-to-pixel decoder as a conditional pixel-space diffusion model, unifying decoding and upsampling into a single generative module. It denoises directly in high-resolution pixel space and produces a super-resolved image in one pass. This repository hosts the released decoder checkpoints, plus the encoder/decoder ("VAE") weights they depend on.
All PiD_* checkpoints in this repo are 4-step distilled. The non-PiD_*
entries (ae.safetensors, flux2_ae.safetensors, sdxl_vae.safetensors, QwenImage_VAE_2d.pth, sd3_vae/, rae/,
scale_rae/) are the corresponding encoder/decoder VAE weights that PiD
plugs into; they are not PiD checkpoints themselves.
License/Terms of Use
This model is released under the NSCLv1 License. The work and any derivative works may only be used for non-commercial (research or evaluation) purposes.
Deployment Geography:
Global
PiD checkpoints
Two variants are released for each diffusers-style backbone:
- 2k: trained at 2048 px and used as a 4× decoder (512 px LDM → 2048 px), or as an 8× decoder for the Scale-RAE backbone (256 px → 2048 px).
- 2kto4k: trained with multi-resolution data bucketing (2048–4096 px) and an SD3-style dynamic shift; designed for decoding a 1024 px LDM output to 4K (4096 px).
Both checkpoint variants support multiple aspect ratios.
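The resolution arithmetic behind the two variants can be sketched in a few lines. This is only an illustration of the numbers quoted above; the function name `decoded_size` is hypothetical and is not part of the PiD codebase.

```python
from typing import Tuple


def decoded_size(ldm_size: Tuple[int, int], sr_factor: int) -> Tuple[int, int]:
    """Return the (width, height) PiD would decode to for a given LDM
    resolution and super-resolution factor (hypothetical helper)."""
    width, height = ldm_size
    if width <= 0 or height <= 0 or sr_factor <= 0:
        raise ValueError("sizes and sr_factor must be positive")
    # Both sides are scaled by the same factor, so the aspect ratio is kept.
    return width * sr_factor, height * sr_factor


print(decoded_size((512, 512), 4))    # 2k variant, 4x decoder
print(decoded_size((256, 256), 8))    # 2k variant, 8x decoder (Scale-RAE)
print(decoded_size((1024, 1024), 4))  # 2kto4k variant
```

The first two calls both land on 2048 px, and the third on 4096 px, matching the variant descriptions.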
| Path | Latent space | SR factor | Variant |
|---|---|---|---|
| checkpoints/PiD_res2k_sr4x_official_flux_distill_4step | Flux1-dev | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_flux2_distill_4step | Flux2-dev | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_sd3_distill_4step | SD3 medium | 4× | 2k |
| checkpoints/PiD_res2k_sr4x_official_dinov2_distill_4step | DINOv2-B | 4× | 2k |
| checkpoints/PiD_res2k_sr8x_official_siglip_distill_4step | SigLIP-2 | 8× | 2k |
| checkpoints/PiD_res2kto4k_sr4x_official_flux_distill_4step | Flux1-dev | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606 | Flux2-dev | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_sd3_distill_4step | SD3 medium | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_sdxl_distill_4step | SDXL | 4× | 2kto4k |
| checkpoints/PiD_res2kto4k_sr4x_official_qwenimage_distill_4step | Qwen-Image | 4× | 2kto4k |
Each directory contains a single file, model_ema_bf16.pth, which holds the EMA
weights cast to bfloat16, the format the inference scripts load by default.
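The checkpoint directory names listed above appear to follow a common pattern (variant, SR factor, latent space, distillation steps, optional suffix). As a rough sketch, they can be parsed as below. The regular expression is inferred from the listed names and is an assumption, not an official naming spec; `PidCheckpoint` and `parse_checkpoint_dir` are hypothetical names. Only the weights file name, model_ema_bf16.pth, comes from this README.

```python
import re
from pathlib import Path
from typing import NamedTuple, Optional

WEIGHTS_FILE = "model_ema_bf16.pth"  # file name stated in this README

# Pattern inferred from the directory names in the table (an assumption).
_NAME_RE = re.compile(
    r"^PiD_res(?P<variant>2kto4k|2k)_sr(?P<sr>\d+)x_official_"
    r"(?P<latent>[a-z0-9]+)_distill_(?P<steps>\d+)step(?:_(?P<suffix>\w+))?$"
)


class PidCheckpoint(NamedTuple):
    variant: str           # "2k" or "2kto4k"
    sr_factor: int         # 4 or 8 in the released checkpoints
    latent: str            # e.g. "flux", "sd3", "siglip"
    steps: int             # distillation steps
    suffix: Optional[str]  # e.g. "2606" for the fixed Flux2-dev checkpoint
    weights_path: Path     # <directory>/model_ema_bf16.pth


def parse_checkpoint_dir(path: str) -> PidCheckpoint:
    """Split a PiD checkpoint directory name into its parts."""
    directory = Path(path)
    match = _NAME_RE.match(directory.name)
    if match is None:
        # VAE entries such as ae.safetensors or sd3_vae/ end up here.
        raise ValueError(f"not a PiD checkpoint directory: {path!r}")
    return PidCheckpoint(
        variant=match["variant"],
        sr_factor=int(match["sr"]),
        latent=match["latent"],
        steps=int(match["steps"]),
        suffix=match["suffix"],
        weights_path=directory / WEIGHTS_FILE,
    )


info = parse_checkpoint_dir(
    "checkpoints/PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606"
)
print(info.variant, info.sr_factor, info.latent, info.suffix)
```

A parser like this also makes it easy to tell PiD checkpoints apart from the VAE entries in the same folder, since the latter raise `ValueError`.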
⚠️ Flux2-dev 2kto4k: use the new _2606 checkpoint. The previous PiD_res2kto4k_sr4x_official_flux2_distill_4step (without the _2606 suffix) suffered from a color-drifting issue. The new PiD_res2kto4k_sr4x_official_flux2_distill_4step_2606 fixes it; please use it and do not use the old one. See the comparison for details.
Latent space → compatible LDMs
A PiD decoder is tied to a latent space, not to a single generative model. Any
LDM that produces latents in that space can reuse the same checkpoint. The
--backbone aliases below pick the right LDM pipeline; they all decode through
the latent space's checkpoint above.
| Latent space | VAE / vision encoder weights | Compatible --backbone | Corresponding LDM links |
|---|---|---|---|
| Flux1-dev | checkpoints/ae.safetensors | flux, zimage, zimage-turbo | FLUX.1-dev, Z-Image, Z-Image-Turbo |
| Flux2-dev | checkpoints/flux2_ae.safetensors | flux2, flux2-klein-4b, flux2-klein-9b | FLUX.2-dev, FLUX.2-klein-4B, FLUX.2-klein-9B |
| SD3 medium | checkpoints/sd3_vae/ | sd3 | SD3-medium |
| SDXL | checkpoints/sdxl_vae.safetensors | sdxl | SDXL-base-1.0 |
| Qwen-Image | checkpoints/QwenImage_VAE_2d.pth | qwenimage, qwenimage-2512 | Qwen-Image, Qwen-Image-2512 |
| DINOv2-B | checkpoints/rae/ | dinov2 | RAE (class-conditional; DINOv2-B) |
| SigLIP-2 | checkpoints/scale_rae/ | siglip | Scale-RAE (text-conditional; nyu-visionx/Scale-RAE-Qwen1.5B_DiT2.4B) |
For example, Z-Image and Z-Image-Turbo share Flux1-dev's VAE, so they reuse the
flux checkpoints (both 2k and 2kto4k); no separate zimage checkpoint is
shipped. Likewise, qwenimage-2512 reuses the qwenimage decoder (same VAE,
different transformer).
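The alias-to-latent-space relationship described in this section can be written down as two small lookup tables. This is a sketch transcribed from the tables in this README, not the contents of the actual checkpoint registry in the PiD codebase; the names `BACKBONE_TO_LATENT`, `LATENT_TO_VARIANTS`, and `variants_for` are hypothetical.

```python
from typing import Dict, Tuple

# --backbone alias -> latent space, transcribed from the table above.
BACKBONE_TO_LATENT: Dict[str, str] = {
    "flux": "Flux1-dev",
    "zimage": "Flux1-dev",
    "zimage-turbo": "Flux1-dev",
    "flux2": "Flux2-dev",
    "flux2-klein-4b": "Flux2-dev",
    "flux2-klein-9b": "Flux2-dev",
    "sd3": "SD3 medium",
    "sdxl": "SDXL",
    "qwenimage": "Qwen-Image",
    "qwenimage-2512": "Qwen-Image",
    "dinov2": "DINOv2-B",
    "siglip": "SigLIP-2",
}

# Latent space -> released PiD variants, transcribed from the checkpoint table.
LATENT_TO_VARIANTS: Dict[str, Tuple[str, ...]] = {
    "Flux1-dev": ("2k", "2kto4k"),
    "Flux2-dev": ("2k", "2kto4k"),
    "SD3 medium": ("2k", "2kto4k"),
    "SDXL": ("2kto4k",),
    "Qwen-Image": ("2kto4k",),
    "DINOv2-B": ("2k",),
    "SigLIP-2": ("2k",),
}


def variants_for(backbone: str) -> Tuple[str, ...]:
    """Return the PiD variants available for a --backbone alias."""
    try:
        latent = BACKBONE_TO_LATENT[backbone]
    except KeyError:
        raise ValueError(f"unknown backbone alias: {backbone!r}") from None
    return LATENT_TO_VARIANTS[latent]


# Aliases that share a latent space resolve to the same set of checkpoints.
print(variants_for("zimage-turbo") == variants_for("flux"))
```

Keying the variants by latent space rather than by alias mirrors the point made above: adding a new LDM that shares an existing VAE only needs one new alias entry.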
Usage
The decoder checkpoints are loaded by the inference scripts in the PiD
codebase. The exact (backbone, ckpt_type) → path mapping lives in
pid/_src/inference/checkpoint_registry.py, which is the single source of
truth. Clone the repo, point it at this snapshot, and the demos pick the
right file automatically:
# Pull just the checkpoints/ tree into the repo root (skips this README and
# the teaser figure so they don't clobber the files in the source repo).
hf download nvidia/PiD --local-dir . --include "checkpoints/*"
# Then run any of the demos, e.g.:
PYTHONPATH=. python -m pid._src.inference.from_ldm --backbone flux \
--prompt "A photorealistic half-body portrait of a brown tabby cat with bold stripes sitting attentively on a rustic wooden kitchen table, soft morning light streaming sideways through a large window, fine fur detail and stripe patterns sharply visible, intense amber-green eyes in razor-sharp focus, warm farmhouse kitchen softly out of focus, cinematic shallow depth of field, ultra-detailed fur texture, photorealistic" \
--ldm_inference_steps 28 --save_xt_steps 24 \
--output_dir ./results/official_demo/flux \
--pid_inference_steps 4
Pick the 2kto4k variant via --pid_ckpt_type 2kto4k when decoding at 4K.
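For scripted runs, the demo invocation can be assembled programmatically. The sketch below only reuses flags that appear in this README; the helper name `build_demo_command` is hypothetical, the step counts are copied from the flux example and may not suit other backbones, and whether 4K decoding needs additional resolution flags is not covered here.

```python
import shlex
from typing import List, Optional


def build_demo_command(
    backbone: str,
    prompt: str,
    output_dir: str,
    pid_ckpt_type: Optional[str] = None,
) -> List[str]:
    """Assemble the from_ldm demo command as an argument list (hypothetical helper)."""
    cmd = [
        "python", "-m", "pid._src.inference.from_ldm",
        "--backbone", backbone,
        "--prompt", prompt,
        # Step counts copied from the flux example in this README.
        "--ldm_inference_steps", "28",
        "--save_xt_steps", "24",
        "--output_dir", output_dir,
        "--pid_inference_steps", "4",
    ]
    if pid_ckpt_type is not None:
        # Only passed when requested, e.g. "2kto4k" for 4K decoding.
        cmd += ["--pid_ckpt_type", pid_ckpt_type]
    return cmd


cmd = build_demo_command(
    "flux", "a brown tabby cat", "./results/official_demo/flux", "2kto4k"
)
# shlex.join quotes the prompt so the line can be pasted into a shell.
print("PYTHONPATH=. " + shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, env=...)` with `PYTHONPATH` set would avoid shell quoting altogether.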
Citation
@article{lu2026pid,
title={PiD: Fast and High-Resolution Latent Decoding with Pixel Diffusion},
author={Lu, Yifan and Wu, Qi and Wu, Jay Zhangjie and Wang, Zian and Ling, Huan and Fidler, Sanja and Ren, Xuanchi},
journal={arXiv preprint arXiv:2605.23902},
year={2026}
}