# AnyFlow

[AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang and collaborators at [NUS ShowLab](https://sites.google.com/view/showlab) in collaboration with NVIDIA.

*Few-step video generation has been significantly advanced by consistency models. However, their performance often degrades in any-step video diffusion models due to the fixed-point formulation. To address this limitation, we present AnyFlow, the first any-step video diffusion distillation framework built on flow maps. Instead of learning only the mapping z_t → z_0, AnyFlow learns transitions z_t → z_r over arbitrary time intervals, enabling a single model to adapt to different inference budgets. We design an improved forward flow map training recipe that fine-tunes pretrained video diffusion models into flow map models, and introduce Flow Map Backward Simulation to enable on-policy distillation for flow map models. Extensive experiments across both bidirectional and causal architectures, at scales ranging from 1.3B to 14B, on text-to-video and image-to-video tasks demonstrate that AnyFlow outperforms consistency-based baselines while preserving high fidelity and flexible sampling under varying step budgets.*

The original training code is at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow). The project page is at [nvlabs.github.io/AnyFlow](https://nvlabs.github.io/AnyFlow).

The following AnyFlow checkpoints are supported:

| Checkpoint | Backbone | Description |
|------------|----------|-------------|
| [`nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers) | Wan2.1 1.3B | Bidirectional T2V, lightweight |
| [`nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers) | Wan2.1 14B | Bidirectional T2V, full quality |
| [`nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers) | FAR + Wan2.1 1.3B | Causal T2V / I2V / V2V |
| [`nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers`](https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers) | FAR + Wan2.1 14B | Causal T2V / I2V / V2V |

All four are grouped under the [`nvidia/anyflow`](https://huggingface.co/collections/nvidia/anyflow) Hugging Face collection.

> [!TIP]
> Choose `AnyFlowPipeline` for traditional bidirectional text-to-video generation. Choose `AnyFlowFARPipeline` for streaming I2V, video continuation (V2V), or any setup that benefits from frame-by-frame autoregressive sampling.

> [!TIP]
> AnyFlow supports any-step sampling: a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without retraining. Quality scales monotonically with steps in our benchmarks.

### Optimizing Memory and Inference Speed

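Group offloading keeps the transformer weights on the CPU and moves them to the GPU only while they are needed, trading some speed for lower peak memory:

```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.hooks import apply_group_offloading

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
)
# Stream transformer weights between CPU and GPU at the leaf-module level.
apply_group_offloading(pipe.transformer, onload_device=torch.device("cuda"), offload_type="leaf_level")
# The hook above only covers the transformer, so place the other modules on the GPU explicitly.
pipe.text_encoder.to("cuda")
pipe.vae.to("cuda")
# Encode/decode in slices and tiles to lower VAE memory usage.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```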

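Compiling the transformer with `torch.compile` can reduce per-step latency after an initial warm-up run:

```py
import torch
from diffusers import AnyFlowPipeline

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")
# The first call triggers compilation and is slow; later calls with the same shapes reuse the compiled graph.
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune-no-cudagraphs")
```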

### Generation with AnyFlow (Bidirectional T2V)

```py
import torch
from diffusers import AnyFlowPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowPipeline.from_pretrained(
    "nvidia/AnyFlow-Wan2.1-T2V-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A red panda eating bamboo in a forest, cinematic lighting"
video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
export_to_video(video, "out.mp4", fps=16)
```

### Generation with AnyFlow (FAR Causal)

The causal pipeline selects between T2V / I2V / V2V via the `video` (or `video_latents`) argument:
omit both for plain text-to-video, or pass `video=<tensor>` of shape `(B, T, C, H, W)` with values in `[0, 1]`
and `T = 4n + 1` to condition on existing frames. Use a single conditioning frame for I2V and a longer
clip for V2V continuation. If you already have pre-encoded latents in the model layout, pass them via
`video_latents=<tensor>` to skip VAE encoding. `video` and `video_latents` are mutually exclusive.

> [!IMPORTANT]
> `AnyFlowFARPipeline.default_chunk_partition = [1, 3, 3, 3, 3, 3, 3, 2]` (sum 21) is matched to the
> released checkpoints' canonical 81 raw frames (21 latent frames at the VAE temporal stride of 4). When
> you change `num_frames`, you must also pass a matching `chunk_partition` summing to
> `(num_frames - 1) // 4 + 1`, otherwise the pipeline raises an `AssertionError`.
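
The partition rule in the note above can be checked with a few lines of plain Python. This is only a sketch: `num_latent_frames` and `make_chunk_partition` are hypothetical helpers, not part of diffusers, and it assumes that a partition built from a first chunk followed by fixed-size chunks is acceptable to the checkpoint you use. The released checkpoints are matched to the default partition, so other partitions may behave differently.

```python
def num_latent_frames(num_frames: int, temporal_stride: int = 4) -> int:
    """Number of latent frames for `num_frames` raw frames (requires num_frames = stride * n + 1)."""
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames must be of the form {temporal_stride}n + 1, got {num_frames}")
    return (num_frames - 1) // temporal_stride + 1


def make_chunk_partition(num_frames: int, first_chunk: int = 1, chunk_size: int = 3) -> list:
    """Hypothetical helper: split the latent frames into one first chunk plus fixed-size chunks."""
    total = num_latent_frames(num_frames)
    if not 1 <= first_chunk <= total:
        raise ValueError(f"first_chunk must be in [1, {total}], got {first_chunk}")
    partition = [first_chunk]
    remaining = total - first_chunk
    while remaining > 0:
        size = min(chunk_size, remaining)  # the last chunk may be shorter
        partition.append(size)
        remaining -= size
    return partition


# 81 raw frames -> 21 latent frames, reproducing the default partition.
print(make_chunk_partition(81))                  # [1, 3, 3, 3, 3, 3, 3, 2]
# First chunk sized to cover a 9-frame (3 latent frame) context clip.
print(make_chunk_partition(81, first_chunk=3))   # [3, 3, 3, 3, 3, 3, 3]
```

Whatever partition you build, its sum has to equal `(num_frames - 1) // 4 + 1`, which is the condition the pipeline asserts.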

Text-to-video, with no conditioning input:

```py
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

video = pipe(
    prompt="A cat surfing a wave, sunset",
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```

Image-to-video, conditioning on a single first frame:

```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_image

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Wrap the conditioning image as a one-frame video tensor: (1, 1, 3, H, W) in [0, 1].
first_frame = load_image("path/to/first_frame.png").resize((832, 480))
arr = np.asarray(first_frame).astype("float32") / 255.0  # (480, 832, 3)
context_tensor = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")

video = pipe(
    prompt="a cat walks across a sunlit lawn",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```

Video-to-video continuation, conditioning on a short context clip:

```py
import numpy as np
import torch
from diffusers import AnyFlowFARPipeline
from diffusers.utils import export_to_video, load_video

pipe = AnyFlowFARPipeline.from_pretrained(
    "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
).to("cuda")

# Context clip — 9 raw frames map to 3 latent frames (9 = 4·2 + 1, 3 = 2 + 1).
context_frames = load_video("path/to/context.mp4")[:9]
arr = np.stack([np.asarray(f.resize((832, 480))) for f in context_frames]).astype("float32") / 255.0
# np.stack gives (T, H, W, C) = (9, 480, 832, 3) → permute to (T, C, H, W) then add batch.
context_tensor = torch.from_numpy(arr).permute(0, 3, 1, 2).unsqueeze(0).to("cuda")  # (1, 9, 3, 480, 832)

video = pipe(
    prompt="continue the story",
    video=context_tensor,
    num_inference_steps=4,
    num_frames=81,
    # Override chunk_partition so the first chunk covers exactly the 3 latent context frames.
    chunk_partition=[3, 3, 3, 3, 3, 3, 3],
).frames[0]
export_to_video(video, "out.mp4", fps=16)
```

## Notes

- Classifier-free guidance is fused into the released checkpoints, so inference does not run a second guided forward pass. Keep the default `guidance_scale=1.0` unless your own checkpoint requires otherwise.
- `FlowMapEulerDiscreteScheduler` is general-purpose. You can attach it to any flow-map-distilled checkpoint via `from_pretrained(..., scheduler=FlowMapEulerDiscreteScheduler.from_config(...))`.
- `AnyFlowPipeline` uses [`AnyFlowTransformer3DModel`](../models/anyflow_transformer3d) (bidirectional). `AnyFlowFARPipeline` uses [`AnyFlowFARTransformer3DModel`](../models/anyflow_far_transformer3d), which adds a compressed-frame patch embedding and the FAR causal block-mask.
- LoRA loading is supported via `WanLoraLoaderMixin`, the same mixin used by the upstream Wan pipelines.
- For training recipes (forward flow-map training and on-policy distillation), refer to the original AnyFlow training framework at [`NVlabs/AnyFlow`](https://github.com/NVlabs/AnyFlow); training is out of scope for diffusers.

## AnyFlowPipeline[[diffusers.AnyFlowPipeline]]

#### diffusers.AnyFlowPipeline[[diffusers.AnyFlowPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L80)

Bidirectional text-to-video generation pipeline for AnyFlow flow-map-distilled checkpoints, introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang et al.

AnyFlow learns arbitrary-interval transitions \\(z_t \to z_r\\) rather than the fixed \\(z_t \to z_0\\) mapping
of consistency models, so a single distilled checkpoint can be evaluated at 1, 2, 4, 8, 16... NFE without
retraining. This pipeline operates over the full video tensor in one bidirectional pass; for frame-level
autoregressive (causal) generation use `AnyFlowFARPipeline`.

Sampling is plain Euler in mean-velocity form (`z_r = z_t - (t - r) * u`) with no re-noising. The released NVIDIA
checkpoints fold classifier-free guidance into the model weights, so the default `guidance_scale=1.0` is the
recommended setting.
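
To make the update rule `z_r = z_t - (t - r) * u` concrete, here is a toy sketch in plain Python. It is not the scheduler's implementation: `flow_map_step`, `sample`, and `oracle` are hypothetical names, the time grid is assumed uniform, and the oracle stands in for the transformer by returning the exact mean velocity of a straight-line path between a known target `x0` (at `t = 0`) and the starting noise (at `t = 1`).

```python
def flow_map_step(z_t, t, r, u):
    """One Euler step in mean-velocity form: z_r = z_t - (t - r) * u."""
    return [z - (t - r) * v for z, v in zip(z_t, u)]


def sample(z_1, mean_velocity, num_steps):
    """Move from t = 1 (noise) to t = 0 (data) over `num_steps` uniform intervals."""
    times = [1.0 - i / num_steps for i in range(num_steps + 1)]
    z = list(z_1)
    for t, r in zip(times[:-1], times[1:]):
        z = flow_map_step(z, t, r, mean_velocity(z, t, r))
    return z


# Toy target and oracle (assumptions for illustration, not the real model).
x0 = [0.5, -1.0, 2.0]


def oracle(z_t, t, r):
    # On the straight path z_t = (1 - t) * x0 + t * noise, the mean velocity
    # over any interval [r, t] is (z_t - x0) / t.
    return [(z - x) / t for z, x in zip(z_t, x0)]


noise = [1.5, 0.3, -0.7]
for steps in (1, 2, 4, 8):
    print(steps, [round(v, 6) for v in sample(noise, oracle, steps)])
```

With an exact mean velocity, every step count lands on the same endpoint, which is the property that lets one flow-map checkpoint be sampled at different step budgets. A learned model only approximates this, so results can still vary with the number of steps.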

This model inherits from [*DiffusionPipeline*]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).

#### __call__[[diffusers.AnyFlowPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L379)

**Call arguments:**

- **prompt** (`str` or `List[str]`, *optional*) --
  The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds` instead.
- **video** (`torch.Tensor`, *optional*) --
  Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]`. When provided, the pipeline
  VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually exclusive
  with `video_latents`.
- **video_latents** (`torch.Tensor`, *optional*) --
  Pre-encoded VAE latents in the AnyFlow layout `(B, T_latent, C, H_latent, W_latent)`. Skips VAE
  encoding on the pipeline side. Mutually exclusive with `video`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
  The prompt or prompts to avoid during video generation. Ignored when not using guidance.
- **height** (`int`, defaults to `480`)
- **width** (`int`, defaults to `832`)
- **num_frames** (`int`, defaults to `81`)
- **num_inference_steps** (`int`, defaults to `50`)
- **sigmas** (`List[float]`, *optional*)
- **timesteps** (`List[float]`, *optional*)
- **guidance_scale** (`float`, defaults to `1.0`)
- **num_videos_per_prompt** (`int`, *optional*, defaults to `1`)
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*)
- **latents** (`torch.Tensor`, *optional*)
- **prompt_embeds** (`torch.Tensor`, *optional*)
- **negative_prompt_embeds** (`torch.Tensor`, *optional*)
- **output_type** (`str`, *optional*, defaults to `"np"`)
- **return_dict** (`bool`, defaults to `True`)
- **attention_kwargs** (`Dict[str, Any]`, *optional*)
- **callback_on_step_end** (`Callable`, `PipelineCallback`, or `MultiPipelineCallbacks`, *optional*)
- **callback_on_step_end_tensor_inputs** (`List[str]`, defaults to `["latents"]`)
- **max_sequence_length** (`int`, defaults to `512`)
- **use_mean_velocity** (`bool`, defaults to `True`)

The call function to the pipeline for generation.

Examples:
```python
>>> import torch
>>> from diffusers import AnyFlowPipeline
>>> from diffusers.utils import export_to_video

>>> pipe = AnyFlowPipeline.from_pretrained(
...     "nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers", torch_dtype=torch.bfloat16
... ).to("cuda")

>>> prompt = "A red panda eating bamboo in a forest, cinematic lighting"
>>> video = pipe(prompt, num_inference_steps=4, num_frames=33).frames[0]
>>> export_to_video(video, "anyflow_t2v.mp4", fps=16)
```

**Parameters:**

tokenizer ([*AutoTokenizer*]) : Tokenizer from [google/umt5-xxl](https://huggingface.co/google/umt5-xxl).

text_encoder ([*UMT5EncoderModel*]) : [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) text encoder.

transformer ([*AnyFlowTransformer3DModel*]) : Bidirectional flow-map 3D Transformer.

vae ([*AutoencoderKLWan*]) : VAE that encodes/decodes videos to and from latent representations.

scheduler ([*FlowMapEulerDiscreteScheduler*]) : Flow-map sampler. The pipeline drives `scheduler.step(..., timestep, sample, r_timestep)` per inference step.

**Returns:**

`AnyFlowPipelineOutput` or `tuple`

If `return_dict` is `True`, `AnyFlowPipelineOutput` is returned, otherwise a `tuple` whose first
element is the generated video.
#### encode_prompt[[diffusers.AnyFlowPipeline.encode_prompt]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L179)

Encodes the prompt into text encoder hidden states.

**Parameters:**

prompt (`str` or `list[str]`, *optional*) : prompt to be encoded

negative_prompt (`str` or `list[str]`, *optional*) : The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).

do_classifier_free_guidance (`bool`, *optional*, defaults to `True`) : Whether to use classifier free guidance or not.

num_videos_per_prompt (`int`, *optional*, defaults to 1) : Number of videos that should be generated per prompt.

prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.

negative_prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.

device (`torch.device`, *optional*) : torch device to place the resulting embeddings on

dtype (`torch.dtype`, *optional*) : torch dtype
#### encode_video[[diffusers.AnyFlowPipeline.encode_video]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow.py#L359)

Encode a pixel-space video into AnyFlow's latent layout.

Mirrors the single-helper convention of other diffusers pipelines (cf.
`WanImageToVideoPipeline.encode_image`): wraps preprocessing, VAE encoding, and latent normalization into one
call. Output layout is `(B, T_latent, C, H, W)`, which is what the AnyFlow transformer expects for
conditioning frames.
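
As a rough guide to the `(B, T_latent, C, H, W)` layout, the sketch below computes the latent shape expected for a given pixel-space video. It is an assumption-laden illustration rather than the pipeline's code: `anyflow_latent_shape` is a hypothetical helper, the temporal stride of 4 comes from this page, and the spatial stride of 8 and 16 latent channels are assumed from the Wan2.1 VAE and should be checked against `pipe.vae.config`.

```python
def anyflow_latent_shape(
    batch: int,
    num_frames: int,
    height: int,
    width: int,
    latent_channels: int = 16,  # assumed Wan2.1 VAE latent channels
    temporal_stride: int = 4,   # VAE temporal stride stated on this page
    spatial_stride: int = 8,    # assumed Wan2.1 VAE spatial stride
) -> tuple:
    """Hypothetical helper: expected (B, T_latent, C, H_latent, W_latent) for a pixel video."""
    if num_frames < 1 or (num_frames - 1) % temporal_stride != 0:
        raise ValueError(f"num_frames must be of the form {temporal_stride}n + 1, got {num_frames}")
    if height % spatial_stride or width % spatial_stride:
        raise ValueError(f"height and width must be multiples of {spatial_stride}")
    t_latent = (num_frames - 1) // temporal_stride + 1
    return (batch, t_latent, latent_channels, height // spatial_stride, width // spatial_stride)


print(anyflow_latent_shape(1, 81, 480, 832))  # (1, 21, 16, 60, 104)
print(anyflow_latent_shape(1, 1, 480, 832))   # (1, 1, 16, 60, 104)
```

Comparing this tuple with the shape of the tensor returned by `encode_video` is a quick sanity check before passing latents as `video_latents`.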

## AnyFlowFARPipeline[[diffusers.AnyFlowFARPipeline]]

#### diffusers.AnyFlowFARPipeline[[diffusers.AnyFlowFARPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L92)

Causal (FAR-based) text-to-video / image-to-video / video-to-video pipeline for AnyFlow checkpoints, introduced in
[AnyFlow](https://huggingface.co/papers/2605.13724) by Yuchao Gu, Guian Fang et al.

The pipeline drives a frame-level autoregressive sampling loop over chunks: each chunk is denoised with flow-map
steps while attending only to past chunks via block-sparse causal attention, and intermediate KV cache is reused
across chunks.

The task mode (T2V / I2V / V2V) is selected by which conditioning argument is passed to `__call__`:

- both `video=None` and `video_latents=None`: pure text-to-video.
- `video=<tensor of shape (B, T, C, H, W) in [0, 1] with T = 4n + 1>`: pre-VAE conditioning frames; the pipeline
  VAE-encodes them. Pass a single-frame video for I2V or a multi-frame clip for V2V.
- `video_latents=<latent tensor of shape (B, T_latent, C, H_latent, W_latent)>`: already-encoded latents in the
  FAR layout (skips the VAE encode step).
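
The selection rule in the list above can be written down as a small dispatch function. This is a sketch for illustration only: `resolve_task_mode` is a hypothetical helper that works on tensor shapes rather than tensors, and treating a single latent frame in `video_latents` as I2V is an assumption by analogy with the single-frame `video` case.

```python
from typing import Optional, Sequence


def resolve_task_mode(
    video_shape: Optional[Sequence[int]] = None,
    video_latents_shape: Optional[Sequence[int]] = None,
) -> str:
    """Hypothetical helper: return "t2v", "i2v", or "v2v" from the conditioning shapes."""
    if video_shape is not None and video_latents_shape is not None:
        raise ValueError("`video` and `video_latents` are mutually exclusive")
    if video_shape is None and video_latents_shape is None:
        return "t2v"  # no conditioning at all

    shape = video_shape if video_shape is not None else video_latents_shape
    if len(shape) != 5:
        raise ValueError(f"expected a 5D (B, T, C, H, W) shape, got {tuple(shape)}")
    num_cond_frames = shape[1]

    # Raw frames must satisfy T = 4n + 1 so they map cleanly onto latent frames.
    if video_shape is not None and (num_cond_frames - 1) % 4 != 0:
        raise ValueError(f"`video` needs T = 4n + 1 frames, got T = {num_cond_frames}")

    return "i2v" if num_cond_frames == 1 else "v2v"


print(resolve_task_mode())                                   # t2v
print(resolve_task_mode(video_shape=(1, 1, 3, 480, 832)))    # i2v
print(resolve_task_mode(video_shape=(1, 9, 3, 480, 832)))    # v2v
```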

The FAR backbone is the causal Wan2.1 variant introduced by FAR (Gu et al., 2025; arXiv:2503.19325). Inference is
plain Euler in mean-velocity form per chunk with no re-noising. Joint T2V / I2V / V2V is supported by a single
distilled model.

This model inherits from [*DiffusionPipeline*]. Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).

#### __call__[[diffusers.AnyFlowFARPipeline.__call__]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L444)

**Call arguments:**

- **prompt** (`str` or `List[str]`, *optional*) --
  The prompt or prompts to guide the video generation. If not defined, pass `prompt_embeds` instead.
- **video** (`torch.Tensor`, *optional*) --
  Pre-VAE conditioning frames of shape `(B, T, C, H, W)` in `[0, 1]` (`T = 4n + 1`). When provided, the
  pipeline VAE-encodes them and keeps the corresponding latent prefix fixed during sampling. Mutually
  exclusive with `video_latents`.
- **video_latents** (`torch.Tensor`, *optional*) --
  Pre-encoded VAE latents in the FAR layout `(B, T_latent, C, H_latent, W_latent)`. Skips VAE encoding on
  the pipeline side. Mutually exclusive with `video`.
- **negative_prompt** (`str` or `List[str]`, *optional*) --
  The prompt or prompts to avoid during video generation. Ignored when not using guidance.
- **height** (`int`, defaults to `480`)
- **width** (`int`, defaults to `832`)
- **num_frames** (`int`, defaults to `81`)
- **num_inference_steps** (`int`, defaults to `50`)
- **sigmas** (`List[float]`, *optional*)
- **timesteps** (`List[float]`, *optional*)
- **guidance_scale** (`float`, defaults to `1.0`)
- **num_videos_per_prompt** (`int`, *optional*, defaults to `1`)
- **generator** (`torch.Generator` or `List[torch.Generator]`, *optional*)
- **latents** (`torch.Tensor`, *optional*)
- **prompt_embeds** (`torch.Tensor`, *optional*)
- **negative_prompt_embeds** (`torch.Tensor`, *optional*)
- **output_type** (`str`, *optional*, defaults to `"np"`)
- **return_dict** (`bool`, defaults to `True`)
- **attention_kwargs** (`Dict[str, Any]`, *optional*)
- **callback_on_step_end** (`Callable`, `PipelineCallback`, or `MultiPipelineCallbacks`, *optional*)
- **callback_on_step_end_tensor_inputs** (`List[str]`, defaults to `["latents"]`)
- **max_sequence_length** (`int`, defaults to `512`)
- **use_mean_velocity** (`bool`, defaults to `True`)
- **use_kv_cache** (`bool`, defaults to `True`)
- **chunk_partition** (`List[int]`, *optional*)

The call function to the pipeline for generation.

Examples:
```python
>>> import numpy as np
>>> import torch
>>> from diffusers import AnyFlowFARPipeline
>>> from diffusers.utils import export_to_video, load_image

>>> pipe = AnyFlowFARPipeline.from_pretrained(
...     "nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers", torch_dtype=torch.bfloat16
... ).to("cuda")

>>> # Single-frame I2V: wrap the conditioning image as a (1, 1, 3, H, W) tensor in [0, 1].
>>> first_frame = load_image("path/to/first_frame.png").resize((832, 480))
>>> arr = np.asarray(first_frame).astype("float32") / 255.0
>>> context = torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0).unsqueeze(1).to("cuda")

>>> video = pipe(
...     prompt="a cat walks across a sunlit lawn",
...     video=context,
...     num_inference_steps=4,
...     num_frames=81,
... ).frames[0]
>>> export_to_video(video, "anyflow_far.mp4", fps=16)
```

**Parameters:**

tokenizer ([*AutoTokenizer*]) : Tokenizer from [google/umt5-xxl](https://huggingface.co/google/umt5-xxl).

text_encoder ([*UMT5EncoderModel*]) : [google/umt5-xxl](https://huggingface.co/google/umt5-xxl) text encoder.

transformer ([*AnyFlowFARTransformer3DModel*]) : FAR causal flow-map 3D Transformer.

vae ([*AutoencoderKLWan*]) : VAE that encodes/decodes videos to and from latent representations.

scheduler ([*FlowMapEulerDiscreteScheduler*]) : Flow-map sampler.

**Returns:**

`AnyFlowPipelineOutput` or `tuple`

If `return_dict` is `True`, an `AnyFlowPipelineOutput` is returned, otherwise a `tuple` whose first
element is the generated video.
#### encode_prompt[[diffusers.AnyFlowFARPipeline.encode_prompt]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L202)

Encodes the prompt into text encoder hidden states.

**Parameters:**

prompt (`str` or `list[str]`, *optional*) : prompt to be encoded

negative_prompt (`str` or `list[str]`, *optional*) : The prompt or prompts not to guide the video generation. If not defined, one has to pass `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is less than `1`).

do_classifier_free_guidance (`bool`, *optional*, defaults to `True`) : Whether to use classifier free guidance or not.

num_videos_per_prompt (`int`, *optional*, defaults to 1) : Number of videos that should be generated per prompt.

prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, text embeddings will be generated from `prompt` input argument.

negative_prompt_embeds (`torch.Tensor`, *optional*) : Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input argument.

device (`torch.device`, *optional*) : torch device to place the resulting embeddings on

dtype (`torch.dtype`, *optional*) : torch dtype
#### encode_video[[diffusers.AnyFlowFARPipeline.encode_video]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_anyflow_far.py#L385)

Encode a pixel-space video into AnyFlow's latent layout.

Mirrors the single-helper convention of other diffusers pipelines (cf.
`WanImageToVideoPipeline.encode_image`): wraps preprocessing, VAE encoding, and latent normalization into one
call. Output layout is `(B, T_latent, C, H, W)`, which is what the AnyFlow transformer expects for
conditioning frames.

## AnyFlowPipelineOutput[[diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput]]

#### diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput[[diffusers.pipelines.anyflow.pipeline_output.AnyFlowPipelineOutput]]

[Source](https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/anyflow/pipeline_output.py#L23)

Output class for AnyFlow pipelines.

**Parameters:**

frames (`torch.Tensor`, `np.ndarray`, or `list[list[PIL.Image.Image]]`) : List of video outputs. It can be a nested list of length `batch_size`, with each sub-list containing denoised PIL image sequences of length `num_frames`. It can also be a NumPy array or Torch tensor of shape `(batch_size, num_frames, channels, height, width)`.

