data-archetype/capacitor_decoder

Capacitor decoder: a faster, lighter FLUX.2-compatible latent decoder built on the SemDisDiffAE architecture.

Decode Speed

RTX 5090

| Resolution | Speedup vs FLUX.2 | Peak VRAM reduction | capacitor_decoder (ms/image) | FLUX.2 VAE (ms/image) | capacitor_decoder peak VRAM | FLUX.2 peak VRAM |
|---|---|---|---|---|---|---|
| 512x512 | 3.41x | 61.8% | 7.34 | 25.03 | 351.2 MiB | 920.5 MiB |
| 1024x1024 | 10.80x | 81.4% | 11.60 | 125.35 | 520.2 MiB | 2795.2 MiB |
| 2048x2048 | 10.95x | 88.4% | 55.81 | 611.34 | 1197.8 MiB | 10291.8 MiB |

These measurements are decode-only, were run on an NVIDIA GeForce RTX 5090 in bfloat16, and time sequential batch-1 decode over the same cached latent set for both decoders.
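The protocol above (sequential batch-1 decode over the same cached latent set) can be sketched as a small timing harness. `decode_fn` and the latent list here are placeholders, not the actual benchmark script; on GPU, `decode_fn` should call `torch.cuda.synchronize()` before returning so the timing covers kernel execution, not just launch.

```python
import time


def time_decode_ms(decode_fn, latents, warmup=2):
    """Mean wall-clock time in ms/image for sequential batch-1 decode."""
    # Warm-up passes so compilation and allocator effects don't skew the mean.
    for z in latents[:warmup]:
        decode_fn(z)
    start = time.perf_counter()
    for z in latents:
        decode_fn(z)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(latents)
```

Running the same harness with the same latent list for both decoders keeps the comparison apples-to-apples.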

2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---|---|---|---|---|---|---|
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
| capacitor_decoder | 36.34 | 4.50 | 36.29 | 23.28 | 29.06 | 43.66 | 47.43 |
| Delta (capacitor_decoder - FLUX.2) | 0.055 | 0.531 | 0.062 | -1.968 | -0.811 | 0.886 | 2.807 |

Evaluated on 2000 validation images: roughly 2/3 photographs and 1/3 book covers. Each image is encoded once with the FLUX.2 VAE, and the resulting latent is reused for both decoders.
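For reference, the per-image metric is standard PSNR over images scaled to [-1, 1] (so the data range is 2.0). This is a sketch of the formula, not the exact evaluation script:

```python
import numpy as np


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB; data_range=2.0 matches images in [-1, 1]."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff**2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range**2 / mse)
```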


Usage

```python
import torch
from diffusers.models import AutoencoderKLFlux2

from capacitor_decoder import CapacitorDecoder, CapacitorDecoderInferenceConfig


def flux2_patchify_and_whiten(
    latents: torch.Tensor,
    vae: AutoencoderKLFlux2,
) -> torch.Tensor:
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    z = z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
    mean = vae.bn.running_mean.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    var = vae.bn.running_var.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    std = torch.sqrt(var + float(vae.config.batch_norm_eps))
    return (z.to(torch.float32) - mean) / std


device = "cuda"
flux2 = AutoencoderKLFlux2.from_pretrained(
    "BiliSakura/VAEs",
    subfolder="FLUX2-VAE",
    torch_dtype=torch.bfloat16,
).to(device)
decoder = CapacitorDecoder.from_pretrained(
    "data-archetype/capacitor_decoder",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], with H and W divisible by 16

with torch.inference_mode():
    posterior = flux2.encode(image.to(device=device, dtype=torch.bfloat16))
    latent_mean = posterior.latent_dist.mean

    # Default path: whiten in float32, then cast back to model dtype before decode.
    latents = flux2_patchify_and_whiten(latent_mean, flux2).to(dtype=torch.bfloat16)
    recon = decoder.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
    )
```

Whitening and dewhitening are optional, but they must stay consistent. The default above matches the usual FLUX.2 pipeline behavior. If your upstream path already gives you raw patchified decoder-space latents instead, skip whitening upstream and call decode(..., latents_are_flux2_whitened=False).
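For the raw-latent path, a patchify-only helper is the same reshape as `flux2_patchify_and_whiten` above with the BN step dropped. A minimal sketch:

```python
import torch


def flux2_patchify(latents: torch.Tensor) -> torch.Tensor:
    """Patchify only, no BN whitening: [B, C, H, W] -> [B, 4C, H/2, W/2]."""
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    return z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
```

You would then pass the result to `decode(..., latents_are_flux2_whitened=False)` so the decoder skips its internal dewhitening.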

Details

  • Default input contract: FLUX.2 patchified latents with FLUX.2 BN whitening still applied.
  • Default decoder behavior: unwhiten with saved FLUX.2 BN running stats, then decode.
  • Optional raw-latent mode: disable whitening upstream and call decode(..., latents_are_flux2_whitened=False).
  • Reused decoder architecture: SemDisDiffAE
  • Technical report
  • SemDisDiffAE technical report
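The unwhitening in the second bullet is just the inverse of the BN transform in the Usage snippet. A sketch of the math the decoder applies internally (the running stats and `eps` here are placeholders; the decoder uses its saved FLUX.2 BN stats and `batch_norm_eps`):

```python
import torch


def unwhiten(z_whitened, running_mean, running_var, eps=1e-5):
    """Invert BN whitening: z * sqrt(var + eps) + mean."""
    mean = running_mean.view(1, -1, 1, 1).to(torch.float32)
    std = torch.sqrt(running_var.view(1, -1, 1, 1).to(torch.float32) + eps)
    return z_whitened.to(torch.float32) * std + mean
```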

Citation

```bibtex
@misc{capacitor_decoder,
  title   = {Capacitor Decoder: A Faster, Lighter FLUX.2-Compatible Latent Decoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/capacitor_decoder},
}
```