Diffusers

Ideogram4Transformer2DModel

A transformer for image-like data from Ideogram 4.

class diffusers.Ideogram4Transformer2DModel

( in_channels: int = 128, num_layers: int = 34, attention_head_dim: int = 256, num_attention_heads: int = 18, intermediate_size: int = 12288, adaln_dim: int = 512, llm_features_dim: int = 53248, rope_theta: int = 5000000, mrope_section: tuple = (24, 20, 20), norm_eps: float = 1e-05 )

Parameters

in_channels (int, defaults to 128) — Latent channel count after patchification (ae_channels * patch_size ** 2).
num_layers (int, defaults to 34) — Number of transformer blocks.
attention_head_dim (int, defaults to 256) — Dimension of each attention head; the total hidden size is attention_head_dim * num_attention_heads.
num_attention_heads (int, defaults to 18) — Number of attention heads.
intermediate_size (int, defaults to 12288) — Feed-forward hidden size used by the SwiGLU MLP inside each block.
adaln_dim (int, defaults to 512) — Dimensionality of the conditioning vector consumed by the AdaLN modulations.
llm_features_dim (int, defaults to 53248) — Dimensionality of the per-token text features fed into the model (typically a concatenation of hidden states from several layers of the text encoder).
rope_theta (int, defaults to 5_000_000) — Base used by the multi-axis rotary position embedding.
mrope_section (tuple[int, int, int], defaults to (24, 20, 20)) — Number of frequencies allocated to each of the (t, h, w) axes of MRoPE.
norm_eps (float, defaults to 1e-5) — Epsilon used by the RMSNorm modules inside the transformer blocks.
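
The sizes implied by the defaults above can be checked with a few lines of arithmetic. This is only a sketch: the values are copied from the signature, and the `ae_channels = 32`, `patch_size = 2` split is a hypothetical factorization of `in_channels = 128`, not something this page states.

```python
# Default values copied from the constructor signature above.
attention_head_dim = 256
num_attention_heads = 18
mrope_section = (24, 20, 20)

# Total hidden size, as described for attention_head_dim.
hidden_size = attention_head_dim * num_attention_heads

# Total number of rotary frequencies spread over the (t, h, w) axes.
total_rope_frequencies = sum(mrope_section)


def patchified_channels(ae_channels, patch_size):
    """in_channels = ae_channels * patch_size ** 2, per the in_channels description."""
    return ae_channels * patch_size ** 2


# Hypothetical split: one of several factorizations that yield 128.
in_channels = patchified_channels(ae_channels=32, patch_size=2)

print(hidden_size)             # 4608
print(total_rope_frequencies)  # 64
print(in_channels)             # 128
```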

The flow-matching transformer backbone used by the Ideogram 4 pipeline.

The transformer operates on a single packed sequence containing both text-conditioning tokens (produced by a multimodal text encoder) and the patchified image latents. A per-token indicator distinguishes the two roles, and a block-diagonal attention mask derived from segment_ids ensures that, within a packed batch, tokens attend only to other tokens of the same sample.
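
The block-diagonal mask can be pictured with a small pure-Python sketch. It illustrates the idea only and is not the model's actual implementation; `block_diagonal_mask` is a hypothetical helper name.

```python
def block_diagonal_mask(segment_ids):
    """Return mask[i][j] == True when tokens i and j belong to the same sample."""
    return [[a == b for b in segment_ids] for a in segment_ids]


# Two samples packed into one row: three tokens of sample 0, two of sample 1.
segment_ids = [0, 0, 0, 1, 1]
mask = block_diagonal_mask(segment_ids)

# Print the mask with "x" for allowed attention and "." for blocked attention.
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xxx..
# xxx..
# xxx..
# ...xx
# ...xx
```

With torch tensors of shape (batch_size, sequence_length), the same comparison is commonly written with broadcasting, for example `segment_ids[:, :, None] == segment_ids[:, None, :]`.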

forward

( hidden_states: Tensor, timestep: Tensor, encoder_hidden_states: Tensor, position_ids: Tensor, segment_ids: Tensor, indicator: Tensor, attention_kwargs: dict | None = None, return_dict: bool = True )

Parameters

hidden_states (torch.Tensor of shape (batch_size, sequence_length, in_channels)) — Packed sequence of patchified noisy image tokens. Non-image positions are masked out internally.
timestep (torch.Tensor of shape (batch_size,) or (batch_size, sequence_length)) — Flow-matching time in [0, 1] (0 is pure noise, 1 is clean data).
encoder_hidden_states (torch.Tensor of shape (batch_size, sequence_length, llm_features_dim)) — Per-token text conditioning features. Non-text positions are masked out internally.
position_ids (torch.Tensor of shape (batch_size, sequence_length, 3)) — (t, h, w) coordinates consumed by the multi-axis RoPE.
segment_ids (torch.Tensor of shape (batch_size, sequence_length)) — Per-token sample id within a packed batch. Positions sharing a segment_id attend to each other.
indicator (torch.Tensor of shape (batch_size, sequence_length)) — Per-token role: LLM_TOKEN_INDICATOR (text) or OUTPUT_IMAGE_INDICATOR (image).
attention_kwargs (dict, optional) — A kwargs dictionary passed along to the attention processor. A "scale" entry scales the LoRA weights (when the PEFT backend is active).
return_dict (bool, optional, defaults to True) — Whether to return a Transformer2DModelOutput instead of a plain tuple.
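
The per-token metadata described above (position_ids, segment_ids, indicator) might be assembled along the following lines. This is a sketch under loudly stated assumptions: the numeric indicator values are placeholders for the library's real constants, the position layout (text along the t axis, image patches on an (h, w) grid) is a guess, and `pack_sample` is a hypothetical helper. Plain lists stand in for the torch tensors the model expects.

```python
# Placeholder role values: the real LLM_TOKEN_INDICATOR and
# OUTPUT_IMAGE_INDICATOR constants are defined by the library and may differ.
LLM_TOKEN_INDICATOR = 0
OUTPUT_IMAGE_INDICATOR = 1


def pack_sample(num_text_tokens, grid_height, grid_width, segment_id=0):
    """Build per-token metadata for one sample: text tokens, then image patches."""
    position_ids, segment_ids, indicator = [], [], []

    # Assumed layout: text tokens advance along the t axis only.
    for t in range(num_text_tokens):
        position_ids.append((t, 0, 0))
        segment_ids.append(segment_id)
        indicator.append(LLM_TOKEN_INDICATOR)

    # Assumed layout: image patches share one t value and vary in (h, w).
    for h in range(grid_height):
        for w in range(grid_width):
            position_ids.append((num_text_tokens, h, w))
            segment_ids.append(segment_id)
            indicator.append(OUTPUT_IMAGE_INDICATOR)

    return position_ids, segment_ids, indicator


position_ids, segment_ids, indicator = pack_sample(
    num_text_tokens=3, grid_height=2, grid_width=2
)
print(len(position_ids))  # 7
print(position_ids[-1])   # (3, 1, 1)
print(indicator)          # [0, 0, 0, 1, 1, 1, 1]
```

Each list has one entry per token, matching the (batch_size, sequence_length, ...) shapes listed above once a batch dimension is added.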

Predict the flow-matching velocity for the image-token positions of the packed sequence.
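
To show how a predicted velocity is typically consumed, here is a toy Euler integration from t = 0 (noise) to t = 1 (clean data), following the time convention given for timestep. It assumes a straight-line path between noise and data, so the true velocity is constant. The stand-in velocity replaces the model call, and the scheduler actually used by the Ideogram 4 pipeline may differ.

```python
def euler_step(sample, velocity, dt):
    """Advance the sample by dt along the predicted velocity."""
    return [x + dt * v for x, v in zip(sample, velocity)]


noise = [0.5, -1.0, 2.0]  # state at t = 0
data = [1.0, 0.0, -1.0]   # state at t = 1

# Stand-in for the transformer: on a straight-line path between noise and
# data, the velocity is constant and equal to (data - noise).
velocity = [d - n for d, n in zip(data, noise)]

num_steps = 4
sample = list(noise)
for _ in range(num_steps):
    sample = euler_step(sample, velocity, dt=1.0 / num_steps)

# After integrating over the full interval, the sample matches the data.
print(sample)
```

In a real sampling loop the velocity would be re-predicted by the transformer at every step from the current sample and timestep, rather than held fixed.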
