Papers
arxiv:2604.23586

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

Published on Apr 26 · Submitted by Hongzhan Lin on May 4

Abstract

AI-generated summary

Talker-T2AV presents an autoregressive diffusion framework for talking head synthesis that separates high-level cross-modal reasoning from low-level modality-specific refinement, improving lip-sync accuracy and cross-modal consistency.

Joint audio-video generation models have shown that unified generation yields stronger cross-modal coherence than cascaded approaches. However, existing models couple the modalities throughout denoising via pervasive attention, treating high-level semantics and low-level details in a fully entangled manner. This is suboptimal for talking head synthesis: while audio and facial motion are semantically correlated, their low-level realizations (acoustic signals and visual textures) follow distinct rendering processes. Enforcing joint modeling across all levels causes unnecessary entanglement and reduces efficiency. We propose Talker-T2AV, an autoregressive diffusion framework in which high-level cross-modal modeling occurs in a shared backbone, while low-level refinement uses modality-specific decoders. A shared autoregressive language model jointly reasons over audio and video in a unified patch-level token space. Two lightweight diffusion transformer heads decode the hidden states into frame-level audio and video latents. Experiments on talking portrait benchmarks show that Talker-T2AV outperforms dual-branch baselines in lip-sync accuracy, video quality, and audio quality, and achieves stronger cross-modal consistency than cascaded pipelines.
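The split the abstract describes, a shared autoregressive backbone for cross-modal reasoning plus two lightweight diffusion heads for per-modality refinement, can be pictured with a minimal PyTorch sketch. Everything below is an illustrative assumption: the names TalkerT2AVSketch and DiffusionHead, the layer counts, dimensions, and MLP-based heads are placeholders, not the paper's actual configuration.

# Minimal sketch, assuming: a causal Transformer encoder as the shared
# backbone over a unified audio+video patch-token sequence, and two small
# per-modality denoisers conditioned on its hidden states. All sizes are
# illustrative; the paper does not specify them here.
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Lightweight per-modality denoiser conditioned on backbone states."""
    def __init__(self, latent_dim: int, hidden_dim: int):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden_dim), nn.SiLU())
        self.net = nn.Sequential(
            nn.Linear(latent_dim + hidden_dim * 2, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, noisy_latent, t, cond):
        # Predict the noise for one frame latent, given the diffusion
        # timestep and the shared backbone's hidden state for that frame.
        temb = self.time_embed(t.unsqueeze(-1))
        return self.net(torch.cat([noisy_latent, temb, cond], dim=-1))

class TalkerT2AVSketch(nn.Module):
    def __init__(self, d_model=512, audio_dim=64, video_dim=256, n_layers=6):
        super().__init__()
        # Shared backbone: high-level cross-modal reasoning happens here.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Modality-specific refinement: low-level details stay separate.
        self.audio_head = DiffusionHead(audio_dim, d_model)
        self.video_head = DiffusionHead(video_dim, d_model)

    def forward(self, tokens, noisy_audio, noisy_video, t):
        # Causal mask keeps the backbone autoregressive over patch tokens.
        T = tokens.size(1)
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1
        )
        h = self.backbone(tokens, mask=mask)
        # The heads share `h` but never attend to each other.
        return self.audio_head(noisy_audio, t, h), self.video_head(noisy_video, t, h)

# Smoke test with random tensors: B=2, 16 patch tokens per sequence.
model = TalkerT2AVSketch()
tokens = torch.randn(2, 16, 512)
eps_a, eps_v = model(tokens, torch.randn(2, 16, 64),
                     torch.randn(2, 16, 256), torch.rand(2, 16))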

Community

Paper submitter

Talker-T2AV improves talking head synthesis by decoupling high-level audio-video reasoning from low-level modality-specific generation. Instead of coupling audio and video throughout denoising, it uses a shared autoregressive backbone for semantic cross-modal modeling and lightweight diffusion heads for audio/video refinement, achieving better lip-sync, visual quality, and audio quality than dual-branch and cascaded baselines.
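The decoupling the comment highlights is easiest to see in the sampling loop: the backbone advances autoregressively frame by frame, and each head runs its own short denoising chain conditioned only on the backbone's hidden state. The sketch below is a hedged illustration, not the paper's sampler: backbone_step, the head callables, the 10-step schedule, and the crude Euler update are all stand-in assumptions.

# A toy generation loop, assuming the modules sketched above. Audio and
# video latents are only coupled through `cond`; each head denoises alone.
import torch

@torch.no_grad()
def generate(backbone_step, audio_head, video_head,
             n_frames=16, audio_dim=64, video_dim=256, n_steps=10):
    audio_frames, video_frames = [], []
    h = None  # running backbone state (a KV cache in a real system)
    for _ in range(n_frames):
        # 1) High-level step: backbone emits one hidden state per frame.
        h, cond = backbone_step(h, audio_frames, video_frames)
        # 2) Low-level step: each head denoises its own frame latent.
        za = torch.randn(1, audio_dim)
        zv = torch.randn(1, video_dim)
        for s in reversed(range(n_steps)):
            t = torch.full((1,), s / n_steps)
            za = za - audio_head(za, t, cond) / n_steps  # crude Euler update
            zv = zv - video_head(zv, t, cond) / n_steps  # standing in for a real sampler
        audio_frames.append(za)
        video_frames.append(zv)
    return torch.cat(audio_frames), torch.cat(video_frames)

# Toy stand-ins so the loop runs end to end.
step = lambda h, a, v: (h, torch.zeros(1, 512))
head = lambda z, t, c: torch.zeros_like(z)
audio, video = generate(step, head, head)

Because the heads never attend to each other, a modality-specific decoder can be swapped or retrained without touching the shared backbone, which is the efficiency argument the abstract makes against fully entangled denoising.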


Get this paper in your agent:

hf papers read 2604.23586
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
