data-archetype/capacitor_decoder

Capacitor decoder: a faster, lighter FLUX.2-compatible latent decoder built on the SemDisDiffAE architecture.

Decode Speed

RTX 5090

| Resolution | Speedup vs FLUX.2 | Peak VRAM reduction | capacitor_decoder (ms/image) | FLUX.2 VAE (ms/image) | capacitor_decoder peak VRAM | FLUX.2 peak VRAM |
|---|---|---|---|---|---|---|
| 512x512 | 3.41x | 61.8% | 7.34 | 25.03 | 351.2 MiB | 920.5 MiB |
| 1024x1024 | 10.80x | 81.4% | 11.60 | 125.35 | 520.2 MiB | 2795.2 MiB |
| 2048x2048 | 10.95x | 88.4% | 55.81 | 611.34 | 1197.8 MiB | 10291.8 MiB |

These measurements are decode-only, were run on an NVIDIA GeForce RTX 5090 in bfloat16, and time sequential batch-1 decode over the same cached latent set for both decoders.
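The protocol above (sequential batch-1 decode over the same cached latent set) can be sketched as a small timing harness. `decode_fn` and the latent list here are placeholders, not the actual benchmark script; on GPU, `decode_fn` should call `torch.cuda.synchronize()` before returning so the timing covers kernel execution, not just launch.

```python
import time


def time_decode_ms(decode_fn, latents, warmup=2):
    """Mean wall-clock time in ms/image for sequential batch-1 decode."""
    # Warm-up passes so compilation and allocator effects don't skew the mean.
    for z in latents[:warmup]:
        decode_fn(z)
    start = time.perf_counter()
    for z in latents:
        decode_fn(z)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / len(latents)
```

Running the same harness with the same latent list for both decoders keeps the comparison apples-to-apples.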

2k PSNR Benchmark

| Model | Mean PSNR (dB) | Std (dB) | Median (dB) | Min (dB) | P5 (dB) | P95 (dB) | Max (dB) |
|---|---|---|---|---|---|---|---|
| FLUX.2 VAE | 36.28 | 4.53 | 36.07 | 22.73 | 28.89 | 43.63 | 47.38 |
| capacitor_decoder | 36.34 | 4.50 | 36.29 | 23.28 | 29.06 | 43.66 | 47.43 |
| Delta (capacitor_decoder - FLUX.2) | 0.055 | 0.531 | 0.062 | -1.968 | -0.811 | 0.886 | 2.807 |

Evaluated on 2000 validation images: roughly 2/3 photographs and 1/3 book covers. Each image is encoded once with the FLUX.2 VAE, and the resulting latent is reused for both decoders.
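For reference, the per-image metric is standard PSNR over images scaled to [-1, 1] (so the data range is 2.0). This is a sketch of the formula, not the exact evaluation script:

```python
import numpy as np


def psnr_db(reference, reconstruction, data_range=2.0):
    """PSNR in dB; data_range=2.0 matches images in [-1, 1]."""
    diff = reference.astype(np.float64) - reconstruction.astype(np.float64)
    mse = np.mean(diff**2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range**2 / mse)
```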


Usage

```python
import torch
from diffusers.models import AutoencoderKLFlux2

from capacitor_decoder import CapacitorDecoder, CapacitorDecoderInferenceConfig


def flux2_patchify_and_whiten(
    latents: torch.Tensor,
    vae: AutoencoderKLFlux2,
) -> torch.Tensor:
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    z = z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
    mean = vae.bn.running_mean.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    var = vae.bn.running_var.view(1, -1, 1, 1).to(device=z.device, dtype=torch.float32)
    std = torch.sqrt(var + float(vae.config.batch_norm_eps))
    return (z.to(torch.float32) - mean) / std


device = "cuda"
flux2 = AutoencoderKLFlux2.from_pretrained(
    "BiliSakura/VAEs",
    subfolder="FLUX2-VAE",
    torch_dtype=torch.bfloat16,
).to(device)
decoder = CapacitorDecoder.from_pretrained(
    "data-archetype/capacitor_decoder",
    device=device,
    dtype=torch.bfloat16,
)

image = ...  # [1, 3, H, W] in [-1, 1], with H and W divisible by 16

with torch.inference_mode():
    posterior = flux2.encode(image.to(device=device, dtype=torch.bfloat16))
    latent_mean = posterior.latent_dist.mean

    # Default path: whiten in float32, then cast back to model dtype before decode.
    latents = flux2_patchify_and_whiten(latent_mean, flux2).to(dtype=torch.bfloat16)
    recon = decoder.decode(
        latents,
        height=int(image.shape[-2]),
        width=int(image.shape[-1]),
        inference_config=CapacitorDecoderInferenceConfig(num_steps=1),
    )
```

Whitening and dewhitening are optional, but they must stay consistent. The default above matches the usual FLUX.2 pipeline behavior. If your upstream path already gives you raw patchified decoder-space latents instead, skip whitening upstream and call decode(..., latents_are_flux2_whitened=False).
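For the raw-latent path, a patchify-only helper is the same reshape as `flux2_patchify_and_whiten` above with the BN step dropped. A minimal sketch:

```python
import torch


def flux2_patchify(latents: torch.Tensor) -> torch.Tensor:
    """Patchify only, no BN whitening: [B, C, H, W] -> [B, 4C, H/2, W/2]."""
    b, c, h, w = latents.shape
    if h % 2 != 0 or w % 2 != 0:
        raise ValueError(f"Expected even FLUX.2 latent grid, got H={h}, W={w}")
    z = latents.reshape(b, c, h // 2, 2, w // 2, 2)
    return z.permute(0, 1, 3, 5, 2, 4).reshape(b, c * 4, h // 2, w // 2)
```

You would then pass the result to `decode(..., latents_are_flux2_whitened=False)` so the decoder skips its internal dewhitening.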

Details

  • Default input contract: FLUX.2 patchified latents with FLUX.2 BN whitening still applied.
  • Default decoder behavior: unwhiten with saved FLUX.2 BN running stats, then decode.
  • Optional raw-latent mode: disable whitening upstream and call decode(..., latents_are_flux2_whitened=False).
  • Reused decoder architecture: SemDisDiffAE
  • Technical report
  • SemDisDiffAE technical report
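The unwhitening in the second bullet is just the inverse of the BN transform in the Usage snippet. A sketch of the math the decoder applies internally (the running stats and `eps` here are placeholders; the decoder uses its saved FLUX.2 BN stats and `batch_norm_eps`):

```python
import torch


def unwhiten(z_whitened, running_mean, running_var, eps=1e-5):
    """Invert BN whitening: z * sqrt(var + eps) + mean."""
    mean = running_mean.view(1, -1, 1, 1).to(torch.float32)
    std = torch.sqrt(running_var.view(1, -1, 1, 1).to(torch.float32) + eps)
    return z_whitened.to(torch.float32) * std + mean
```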

Citation

```bibtex
@misc{capacitor_decoder,
  title   = {Capacitor Decoder: A Faster, Lighter FLUX.2-Compatible Latent Decoder},
  author  = {data-archetype},
  email   = {data-archetype@proton.me},
  year    = {2026},
  month   = apr,
  url     = {https://huggingface.co/data-archetype/capacitor_decoder},
}
```