# data-archetype/capacitor_decoder
Capacitor decoder: a faster, lighter FLUX.2-compatible latent decoder built on the SemDisDiffAE architecture.
## Decode Speed

Measured on an RTX 5090:
| Resolution | Speedup vs FLUX.2 | Peak VRAM Reduction | capacitor_decoder (ms/image) | FLUX.2 VAE (ms/image) | capacitor_decoder peak VRAM | FLUX.2 peak VRAM |
|---|---|---|---|---|---|---|
| 512x512 | 3.41x | 61.8% | 7.34 | 25.03 | 351.2 MiB | 920.5 MiB |
| 1024x1024 | 10.80x | 81.4% | 11.60 | 125.35 | 520.2 MiB | 2795.2 MiB |
| 2048x2048 | 10.95x | 88.4% | 55.81 | 611.34 | 1197.8 MiB | 10291.8 MiB |
These measurements are decode-only: both decoders were run on an NVIDIA GeForce RTX 5090 in bfloat16, timing sequential batch-1 decodes over the same cached latent set.
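The timing protocol above (warm up, then time sequential batch-1 decodes over a fixed latent set) can be sketched framework-agnostically. `decode_fn` and `items` are placeholders for the decoder call and the cached latents, not part of this repo's API; on CUDA you would additionally synchronize the device around the timed region.

```python
import time

def time_per_item_ms(decode_fn, items, warmup=3):
    """Sequential batch-1 timing: warm up, then average wall-clock ms per item.

    decode_fn and items are hypothetical stand-ins for the decoder call and
    the cached latents. On a GPU, call torch.cuda.synchronize() before each
    timestamp so queued kernels are included in the measurement.
    """
    for item in items[:warmup]:      # warmup iterations are not timed
        decode_fn(item)
    start = time.perf_counter()
    for item in items:               # one item per call = batch size 1
        decode_fn(item)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(items)
```

Running the same harness for both decoders over the same latents yields directly comparable ms/image figures like those in the table.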
## 2k PSNR Benchmark
| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---|---|---|---|---|---|---|
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
| capacitor_decoder | 36.34 | 4.50 | 36.29 | 23.28 | 29.06 | 43.66 | 47.43 |

| Delta vs FLUX.2 | Mean (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---|---|---|---|---|---|---|
| capacitor_decoder - FLUX.2 | 0.055 | 0.531 | 0.062 | -1.968 | -0.811 | 0.886 | 2.807 |
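As a sanity check on the delta table: the mean of per-image deltas equals the difference of the two mean PSNRs (0.055 dB is consistent with 36.34 - 36.28 up to rounding), while the delta's std is much smaller than either model's std because per-image scores are highly correlated. A minimal illustration with made-up per-image numbers:

```python
# Hypothetical per-image PSNRs for two decoders over the same images.
a = [36.0, 40.0, 30.0]
b = [35.9, 40.1, 29.8]

mean = lambda xs: sum(xs) / len(xs)
deltas = [x - y for x, y in zip(a, b)]

# mean(delta) == mean(a) - mean(b), always.
assert abs(mean(deltas) - (mean(a) - mean(b))) < 1e-9

# The deltas span ~0.3 dB even though the scores themselves span ~10 dB,
# because per-image scores move together across the two decoders.
assert max(deltas) - min(deltas) < max(a) - min(a)
```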
Evaluated on 2000 validation images, roughly two-thirds photographs and one-third book covers. Each image is encoded once with the FLUX.2 VAE, and the resulting latents are reused for both decoders.
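PSNR here is presumably the standard definition over the image value range; a minimal reference implementation, assuming pixels in [-1, 1] so the peak signal range is 2.0 (use `peak=1.0` for [0, 1] images):

```python
import math

def psnr_db(ref, test, peak=2.0):
    """PSNR in dB between two equal-length pixel sequences.

    peak=2.0 assumes pixels in [-1, 1]; this helper is illustrative,
    not the benchmark's actual evaluation code.
    """
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    if mse == 0.0:
        return float("inf")   # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

For example, a uniform error of 0.1 per pixel gives 10 * log10(4 / 0.01) ≈ 26.0 dB.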
## Usage
```python
import torch
from diffusers.models import AutoencoderKLFlux2

from capacitor_decoder import CapacitorDecoder, CapacitorDecoderInferenceConfig


def flux2_patchify_and_whiten(
    latents: torch.Tensor,
    vae: AutoencoderKLFlux2,
) -> torch.Tensor:
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    z = z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
    mean = vae.bn.running_mean.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    var = vae.bn.running_var.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    std = torch.sqrt(var + float(vae.config.batch_norm_eps))
    return (z.to(torch.float32) - mean) / std


device = "cuda"
flux2 = AutoencoderKLFlux2.from_pretrained(
    "BiliSakura/VAEs",
    subfolder="FLUX2-VAE",
    torch_dtype=torch.bfloat16,
).to(device)
decoder = CapacitorDecoder.from_pretrained(
    "data-archetype/capacitor_decoder",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], with H and W divisible by 16

with torch.inference_mode():
    posterior = flux2.encode(image.to(device=device, dtype=torch.bfloat16))
    latent_mean = posterior.latent_dist.mean
    # Default path: whiten in float32, then cast back to model dtype before decode.
    latents = flux2_patchify_and_whiten(latent_mean, flux2).to(dtype=torch.bfloat16)
    recon = decoder.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
    )
```
Whitening and dewhitening are optional, but they must stay consistent. The default above matches the usual FLUX.2 pipeline behavior. If your upstream path already produces raw patchified decoder-space latents instead, skip whitening upstream and call `decode(..., latents_are_flux2_whitened=False)`.
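The consistency requirement is simply that the decoder's dewhitening exactly inverts the whitening applied upstream. A scalar per-channel sketch, with hypothetical running stats standing in for `vae.bn`'s buffers:

```python
import math

# Stand-ins for one channel of vae.bn.running_mean / running_var and
# vae.config.batch_norm_eps; values are illustrative only.
running_mean, running_var, eps = 0.25, 4.0, 1e-6

def whiten(z):
    return (z - running_mean) / math.sqrt(running_var + eps)

def dewhiten(z_w):
    # Must use the *same* stats and eps as whiten, or reconstructions drift.
    return z_w * math.sqrt(running_var + eps) + running_mean

z = 1.5
assert abs(dewhiten(whiten(z)) - z) < 1e-9  # round trip is the identity
```

Whitening the latents yourself and then passing `latents_are_flux2_whitened=False` (or vice versa) breaks this round trip and degrades reconstructions.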
## Details
- Default input contract: FLUX.2 patchified latents with FLUX.2 BN whitening still applied.
- Default decoder behavior: unwhiten with the saved FLUX.2 BN running stats, then decode.
- Optional raw-latent mode: disable whitening upstream and call `decode(..., latents_are_flux2_whitened=False)`.
- Reused decoder architecture: SemDisDiffAE
- Technical report
- SemDisDiffAE technical report
## Citation
```bibtex
@misc{capacitor_decoder,
  title  = {Capacitor Decoder: A Faster, Lighter FLUX.2-Compatible Latent Decoder},
  author = {data-archetype},
  email  = {data-archetype@proton.me},
  year   = {2026},
  month  = apr,
  url    = {https://huggingface.co/data-archetype/capacitor_decoder},
}
```