# AnyFlowFARTransformer3DModel

The causal (FAR) 3D Transformer used by [`AnyFlowFARPipeline`](../pipelines/anyflow#anyflowfarpipeline) —
the FAR variant of [AnyFlow](https://huggingface.co/papers/2605.13724) (Yuchao Gu, Guian Fang et al., NUS
ShowLab × NVIDIA). It extends the v0.35.1 Wan2.1 backbone with three additions:

1. **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
   generation as introduced in [FAR (Gu et al., 2025)](https://arxiv.org/abs/2503.19325).
2. **Compressed-frame patch embedding** (`far_patch_embedding`) for context (already-generated) frames,
   warm-started from the full-resolution `patch_embedding` at construction time via trilinear interpolation.
3. **Dual-timestep flow-map embedding** (same as
   [`AnyFlowTransformer3DModel`](anyflow_transformer3d)): every forward call conditions on both the source
   timestep `t` and the target timestep `r`.
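The block-causal pattern from item 1 can be illustrated without any tensor library. The sketch below is an assumption about the general shape of a frame-level causal mask, not the model's actual mask code: `build_token_chunk_ids` and `make_block_causal_mask_mod` are hypothetical helpers, and the predicate only mirrors the `(b, h, q_idx, kv_idx)` signature that `flex_attention` mask functions take. Tokens attend freely within their own chunk and causally to earlier chunks.

```python
from typing import Callable, List


def build_token_chunk_ids(tokens_per_chunk: List[int]) -> List[int]:
    """Map every token position to the index of the chunk it belongs to."""
    chunk_ids: List[int] = []
    for chunk_index, count in enumerate(tokens_per_chunk):
        chunk_ids.extend([chunk_index] * count)
    return chunk_ids


def make_block_causal_mask_mod(chunk_ids: List[int]) -> Callable[[int, int, int, int], bool]:
    """Return a predicate that is True where a query token may attend to a key token."""

    def mask_mod(b: int, h: int, q_idx: int, kv_idx: int) -> bool:
        # Allowed when the key's chunk is the query's chunk or an earlier one.
        # The batch (b) and head (h) indices are unused in this toy version.
        return chunk_ids[kv_idx] <= chunk_ids[q_idx]

    return mask_mod


# Hypothetical toy layout: three chunks holding 2, 3 and 1 tokens.
chunk_ids = build_token_chunk_ids([2, 3, 1])
mask_mod = make_block_causal_mask_mod(chunk_ids)

# Dense boolean mask: rows are queries, columns are keys.
dense_mask = [
    [mask_mod(0, 0, q, kv) for kv in range(len(chunk_ids))]
    for q in range(len(chunk_ids))
]
for row in dense_mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The printed pattern is block lower-triangular: the two tokens of the first chunk see only each other, while the single token of the last chunk sees everything. With real tensors, a predicate of this shape would typically be written with tensor comparisons and handed to `torch.nn.attention.flex_attention.create_block_mask`; how the model treats compressed or clean context tokens is not covered here.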

The chunk schedule (`chunk_partition`) is **not** baked into the model config. It is a per-call argument to
`forward`, so the same checkpoint handles different `num_frames` configurations without retraining.

```python
from diffusers import AnyFlowFARTransformer3DModel

# Causal AnyFlow checkpoint (FAR):
transformer = AnyFlowFARTransformer3DModel.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", subfolder="transformer"
)
```

## AnyFlowFARTransformer3DModel[[diffusers.AnyFlowFARTransformer3DModel]]

#### diffusers.AnyFlowFARTransformer3DModel[[diffusers.AnyFlowFARTransformer3DModel]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_anyflow_far.py#L777)

Causal (FAR) 3D Transformer for AnyFlow flow-map sampling with frame-level autoregressive generation.

Extends the v0.35.1 Wan2.1 backbone with:

- **FAR causal block-mask** via `torch.nn.attention.flex_attention`, supporting frame-level autoregressive
  generation (FAR; [Gu et al., 2025](https://arxiv.org/abs/2503.19325)).
- **Compressed-frame patch embedding** `far_patch_embedding` for context (already-generated) frames, initialized
  from `patch_embedding` via trilinear interpolation so that a freshly constructed model starts from a reasonable
  initialization before any LoRA fine-tuning.
- **Dual-timestep flow-map embedding** for any-step sampling (same as `AnyFlowTransformer3DModel`).

Use `AnyFlowTransformer3DModel` instead for plain bidirectional text-to-video (T2V) generation: that variant omits
the FAR causal masking and `far_patch_embedding` and is roughly 5–10% smaller.

`chunk_partition` is **not** a model config field — it is a per-call argument passed to `forward`.
Different inference setups (varying `num_frames` or full-vs-compressed schedules) therefore do not require
separate checkpoints.
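To make the per-call nature of `chunk_partition` concrete, here is a minimal sketch of building and validating a partition for two clip lengths. Everything in it is illustrative: `latent_frame_count`, `uniform_chunk_partition`, and `chunk_frame_ranges` are hypothetical helpers, the 4x temporal compression is an assumption about the Wan2.1 VAE, and the equal-size schedule is not necessarily the one the pipeline uses. The only rule taken from this page is that the chunk sizes must sum to the number of latent frames.

```python
from typing import List, Tuple


def latent_frame_count(num_frames: int, temporal_stride: int = 4) -> int:
    """Latent frames for `num_frames` video frames, assuming a causal VAE with 4x temporal compression."""
    return (num_frames - 1) // temporal_stride + 1


def uniform_chunk_partition(num_latent_frames: int, chunk_size: int) -> List[int]:
    """Illustrative schedule: equal chunks, with any remainder placed in the first chunk."""
    if num_latent_frames <= 0 or chunk_size <= 0:
        raise ValueError("num_latent_frames and chunk_size must be positive")
    full_chunks, remainder = divmod(num_latent_frames, chunk_size)
    return ([remainder] if remainder else []) + [chunk_size] * full_chunks


def chunk_frame_ranges(chunk_partition: List[int], num_latent_frames: int) -> List[Tuple[int, int]]:
    """Check the partition covers all latent frames and return half-open (start, end) ranges."""
    if sum(chunk_partition) != num_latent_frames:
        raise ValueError(
            f"chunk_partition sums to {sum(chunk_partition)}, expected {num_latent_frames}"
        )
    ranges: List[Tuple[int, int]] = []
    start = 0
    for size in chunk_partition:
        ranges.append((start, start + size))
        start += size
    return ranges


# The same (hypothetical) schedule rule applied to two clip lengths.
for num_frames in (81, 49):
    latents = latent_frame_count(num_frames)
    partition = uniform_chunk_partition(latents, chunk_size=3)
    print(num_frames, latents, partition, chunk_frame_ranges(partition, latents))
```

Under these assumptions an 81-frame clip maps to 21 latent frames and seven chunks of 3, while a 49-frame clip maps to 13 latent frames and the partition `[1, 3, 3, 3, 3]`. Either list could be passed to `forward` with the same weights.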

#### forward[[diffusers.AnyFlowFARTransformer3DModel.forward]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_anyflow_far.py#L913)

`forward(hidden_states: Tensor, timestep: Tensor, r_timestep: Tensor, encoder_hidden_states: Tensor, chunk_partition: typing.List[int], encoder_hidden_states_image: typing.Optional[torch.Tensor] = None, clean_hidden_states: typing.Optional[torch.Tensor] = None, clean_timestep: typing.Optional[torch.Tensor] = None, kv_cache: typing.Optional[typing.List[typing.Dict[str, torch.Tensor]]] = None, kv_cache_flag: typing.Optional[typing.Dict[str, typing.Any]] = None, attention_kwargs: typing.Optional[typing.Dict[str, typing.Any]] = None, return_dict: bool = True)`

- **hidden_states** (*torch.Tensor*) --
  Latent input of shape `(B, F, C, H, W)`.
- **timestep** (*torch.Tensor*) --
  Source (noisier) flow-map timestep *t*.
- **r_timestep** (*torch.Tensor*) --
  Target (cleaner) flow-map timestep *r*.
- **encoder_hidden_states** (*torch.Tensor*) --
  UMT5 text embeddings.
- **chunk_partition** (*List[int]*) --
  Per-chunk frame counts; total must match the number of latent frames in `hidden_states`.
- **encoder_hidden_states_image** (*torch.Tensor*, *optional*) --
  I2V image embedding; concatenated before text tokens when provided.
- **clean_hidden_states** (*torch.Tensor*, *optional*) --
  Clean (noise-free) conditioning frames used by the training rollout.
- **clean_timestep** (*torch.Tensor*, *optional*) --
  Timesteps for the clean conditioning frames in the training rollout.
- **kv_cache** (*List[Dict[str, torch.Tensor]]*, *optional*) --
  Per-block KV cache for autoregressive inference. *None* selects the training path.
- **kv_cache_flag** (*Dict[str, Any]*, *optional*) --
  KV-cache metadata (e.g. `is_cache_step` flag and token counts).
- **attention_kwargs** (*dict*, *optional*) --
  Forwarded to the attention processors.
- **return_dict** (*bool*, *optional*, defaults to *True*) --
  If *False*, returns positional tuples instead of an output dataclass.

FAR causal forward pass. Dispatches to one of three internal paths:

- `kv_cache is None` → causal training rollout (returns `Transformer2DModelOutput`).
- `kv_cache is not None` and `kv_cache_flag["is_cache_step"]` → cache-prefill (returns
  `AnyFlowFARTransformerOutput` with `sample=None`).
- Otherwise → autoregressive inference step (returns `AnyFlowFARTransformerOutput`).
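The three-way dispatch above can be restated as a small decision function. This is a sketch of the documented rules only, not the model's code: `select_forward_path` and the returned path names are hypothetical, and treating a missing `kv_cache_flag` or a missing `is_cache_step` key as `False` is an assumption.

```python
from typing import Any, Dict, List, Optional


def select_forward_path(
    kv_cache: Optional[List[Dict[str, Any]]],
    kv_cache_flag: Optional[Dict[str, Any]],
) -> str:
    """Name the path that the documented dispatch rules would select."""
    if kv_cache is None:
        # No cache at all: causal training rollout.
        return "training_rollout"
    if kv_cache_flag is not None and kv_cache_flag.get("is_cache_step", False):
        # Cache present and flagged as a cache step: prefill only, no usable sample.
        return "cache_prefill"
    # Cache present, not a cache step: regular autoregressive inference step.
    return "autoregressive_step"


print(select_forward_path(None, None))                      # training_rollout
print(select_forward_path([{}], {"is_cache_step": True}))   # cache_prefill
print(select_forward_path([{}], {"is_cache_step": False}))  # autoregressive_step
```

Note that the first branch tests `kv_cache is None`, so under this reading an empty list would still count as "cache provided" and select an inference path.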

**Constructor parameters:**

patch_size (*Tuple[int]*, defaults to *(1, 2, 2)*) : 3D patch dimensions for full-resolution chunks.

compressed_patch_size (*Tuple[int]*, defaults to *(1, 4, 4)*) : Larger patch dimensions for the FAR-compressed (context) chunks.

full_chunk_limit (*int*, defaults to *3*) : Maximum number of full-resolution chunks before earlier chunks are demoted to compressed FAR context. The released checkpoints use `3`.

num_attention_heads (*int*, defaults to *40*) : Number of attention heads.

attention_head_dim (*int*, defaults to *128*) : The number of channels in each head.

in_channels (*int*, defaults to *16*) : The number of channels in the input latent.

out_channels (*int*, defaults to *16*) : The number of channels in the output latent.

text_dim (*int*, defaults to *4096*) : Input dimension for text embeddings (UMT5).

freq_dim (*int*, defaults to *256*) : Dimension for sinusoidal time embeddings.

ffn_dim (*int*, defaults to *13824*) : Intermediate dimension in feed-forward network.

num_layers (*int*, defaults to *40*) : Number of transformer blocks.

cross_attn_norm (*bool*, defaults to *True*) : Enable cross-attention normalization.

eps (*float*, defaults to *1e-6*) : Epsilon for normalization layers.

image_dim (*Optional[int]*, *optional*, defaults to *None*) : Image embedding dimension for I2V conditioning.

rope_max_seq_len (*int*, defaults to *1024*) : Maximum sequence length used to precompute rotary position frequencies.

gate_value (*float*, defaults to *0.25*) : Mixing gate between source-timestep and delta-timestep embeddings.

deltatime_type (*str*, defaults to *'r'*) : Either `"r"` (delta is the target timestep) or `"t-r"` (delta is the absolute interval).
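A rough token-count calculation shows why `patch_size`, `compressed_patch_size`, and `full_chunk_limit` matter for sequence length. The sketch rests on stated assumptions: the helpers `tokens_per_frame` and `chunk_token_counts` are hypothetical, the most recent `full_chunk_limit` chunks are assumed to stay at full resolution while older chunks use the compressed patch size, and the 60×104 latent grid is just an example size.

```python
from typing import List, Tuple


def tokens_per_frame(height: int, width: int, patch_size: Tuple[int, int, int]) -> int:
    """Tokens produced by one latent frame for a (t, h, w) patch size with t == 1."""
    _, patch_h, patch_w = patch_size
    return (height // patch_h) * (width // patch_w)


def chunk_token_counts(
    chunk_partition: List[int],
    latent_height: int,
    latent_width: int,
    patch_size: Tuple[int, int, int] = (1, 2, 2),
    compressed_patch_size: Tuple[int, int, int] = (1, 4, 4),
    full_chunk_limit: int = 3,
) -> List[int]:
    """Token count per chunk, assuming only the last `full_chunk_limit` chunks stay full resolution."""
    full = tokens_per_frame(latent_height, latent_width, patch_size)
    compressed = tokens_per_frame(latent_height, latent_width, compressed_patch_size)
    first_full_chunk = max(0, len(chunk_partition) - full_chunk_limit)
    return [
        frames * (compressed if index < first_full_chunk else full)
        for index, frames in enumerate(chunk_partition)
    ]


# Example: five chunks of three latent frames on a 60x104 latent grid.
counts = chunk_token_counts([3, 3, 3, 3, 3], latent_height=60, latent_width=104)
print(counts)       # [1170, 1170, 4680, 4680, 4680]
print(sum(counts))  # 16380, versus 23400 if every chunk were full resolution
```

Doubling the spatial patch size from 2 to 4 cuts the tokens per context frame by a factor of four, which is where the saving in this example comes from.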

## AnyFlowFARTransformerOutput[[diffusers.models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput]]

#### diffusers.models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput[[diffusers.models.transformers.transformer_anyflow_far.AnyFlowFARTransformerOutput]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/models/transformers/transformer_anyflow_far.py#L56)

Output dataclass for `AnyFlowFARTransformer3DModel`'s causal forward paths.

**Parameters:**

sample (*torch.Tensor* or *None*) : Predicted denoising target for the autoregressive chunk. `None` for the cache-prefill path, which only writes the KV cache and produces no usable sample.

kv_cache (*list[dict[str, torch.Tensor]]*, *optional*) : Per-block KV cache state used by subsequent autoregressive steps.