Diffusers documentation
OvisImageTransformer2DModel
The model can be loaded with the following code snippet.
import torch
from diffusers import OvisImageTransformer2DModel

transformer = OvisImageTransformer2DModel.from_pretrained("AIDC-AI/Ovis-Image-7B", subfolder="transformer", torch_dtype=torch.bfloat16)
OvisImageTransformer2DModel
class diffusers.OvisImageTransformer2DModel
( patch_size: int = 1, in_channels: int = 64, out_channels: int | None = 64, num_layers: int = 6, num_single_layers: int = 27, attention_head_dim: int = 128, num_attention_heads: int = 24, joint_attention_dim: int = 2048, axes_dims_rope: tuple = (16, 56, 56) )
Parameters
- patch_size (int, defaults to 1): Patch size used to turn the input data into small patches.
- in_channels (int, defaults to 64): The number of channels in the input.
- out_channels (int, optional, defaults to 64): The number of channels in the output. If not specified (None), it defaults to in_channels.
- num_layers (int, defaults to 6): The number of dual-stream DiT blocks to use.
- num_single_layers (int, defaults to 27): The number of single-stream DiT blocks to use.
- attention_head_dim (int, defaults to 128): The number of dimensions to use for each attention head.
- num_attention_heads (int, defaults to 24): The number of attention heads to use.
- joint_attention_dim (int, defaults to 2048): The number of dimensions to use for the joint attention (the embedding/channel dimension of encoder_hidden_states).
- axes_dims_rope (tuple[int], defaults to (16, 56, 56)): The dimensions to use for the rotary positional embeddings.
The Transformer model introduced in Ovis-Image.
Reference: https://github.com/AIDC-AI/Ovis-Image
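The default values in the constructor signature imply a few sizes that are easy to check by hand. The sketch below is plain Python and does not call diffusers. `derived_sizes` is a hypothetical helper, not part of the library. It assumes, as is common for this style of transformer but not stated on this page, that the hidden width equals num_attention_heads * attention_head_dim and that the rotary axis dimensions are meant to sum to attention_head_dim.

```python
# Default values copied from the constructor signature above.
DEFAULT_CONFIG = {
    "patch_size": 1,
    "in_channels": 64,
    "out_channels": 64,
    "num_layers": 6,
    "num_single_layers": 27,
    "attention_head_dim": 128,
    "num_attention_heads": 24,
    "joint_attention_dim": 2048,
    "axes_dims_rope": (16, 56, 56),
}


def derived_sizes(config):
    """Hypothetical helper (not a diffusers API): sizes implied by a config dict."""
    # Assumption: hidden width = number of heads * dimensions per head.
    inner_dim = config["num_attention_heads"] * config["attention_head_dim"]
    # Assumption: rotary embeddings cover the whole head, one slice per axis.
    rope_dim = sum(config["axes_dims_rope"])
    return {
        "inner_dim": inner_dim,
        "rope_dim": rope_dim,
        "rope_fills_head": rope_dim == config["attention_head_dim"],
        "total_blocks": config["num_layers"] + config["num_single_layers"],
    }


sizes = derived_sizes(DEFAULT_CONFIG)
print(sizes)
```

A check like this can be useful before overriding attention_head_dim or axes_dims_rope in a custom configuration, since the two values are likely to need to stay in agreement.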
forward
( hidden_states: Tensor, encoder_hidden_states: Tensor = None, timestep: LongTensor = None, img_ids: Tensor = None, txt_ids: Tensor = None, joint_attention_kwargs: dict[str, typing.Any] | None = None, return_dict: bool = True )
Parameters
- hidden_states (torch.Tensor of shape (batch_size, image_sequence_length, in_channels)): Input hidden_states.
- encoder_hidden_states (torch.Tensor of shape (batch_size, text_sequence_length, joint_attention_dim)): Conditional embeddings (embeddings computed from the input conditions such as prompts) to use.
- timestep (torch.LongTensor): Used to indicate the denoising step.
- img_ids (torch.Tensor): The position ids for image tokens.
- txt_ids (torch.Tensor): The position ids for text tokens.
- joint_attention_kwargs (dict, optional): A kwargs dictionary that, if specified, is passed along to the AttentionProcessor as defined under self.processor in diffusers.models.attention_processor.
- return_dict (bool, optional, defaults to True): Whether or not to return a Transformer2DModelOutput instead of a plain tuple.
The OvisImageTransformer2DModel forward method.
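The tensor shapes listed for forward can be collected into a small plain-Python sketch for sanity-checking inputs before a call. `forward_input_shapes` is a hypothetical helper, not a diffusers function. The shapes of hidden_states and encoder_hidden_states come from the parameter list above. The shapes used for timestep, img_ids, and txt_ids are assumptions (one timestep per batch item, and one position coordinate per entry of axes_dims_rope), since this page does not document them.

```python
def forward_input_shapes(batch_size, image_seq_len, text_seq_len,
                         in_channels=64, joint_attention_dim=2048,
                         num_rope_axes=3):
    """Hypothetical helper: expected input shapes for the forward method."""
    return {
        # Documented above: (batch_size, image_sequence_length, in_channels).
        "hidden_states": (batch_size, image_seq_len, in_channels),
        # Documented above: (batch_size, text_sequence_length, joint_attention_dim).
        "encoder_hidden_states": (batch_size, text_seq_len, joint_attention_dim),
        # Assumption: one timestep value per batch item.
        "timestep": (batch_size,),
        # Assumption: one position coordinate per rotary axis for each token.
        "img_ids": (image_seq_len, num_rope_axes),
        "txt_ids": (text_seq_len, num_rope_axes),
    }


# Example values chosen only for illustration.
shapes = forward_input_shapes(batch_size=1, image_seq_len=4096, text_seq_len=256)
for name, shape in shapes.items():
    print(name, shape)
```

Comparing these tuples against the .shape of real tensors is a cheap way to catch a transposed or unpacked input, but the assumed entries should be verified against the model source before relying on them.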