# AudioLDM

AudioLDM was proposed in [AudioLDM: Text-to-Audio Generation with Latent Diffusion Models](https://huggingface.co/papers/2301.12503) by Haohe Liu et al. Inspired by [Stable Diffusion](https://huggingface.co/docs/diffusers/api/pipelines/stable_diffusion/overview), AudioLDM
is a text-to-audio _latent diffusion model (LDM)_ that learns continuous audio representations from [CLAP](https://huggingface.co/docs/transformers/main/model_doc/clap)
latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional
sound effects, human speech and music.

The abstract from the paper is:

*Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn the continuous audio representations from contrastive language-audio pretraining (CLAP) latents. The pretrained CLAP models enable us to train LDMs with audio embedding while providing text embedding as a condition during sampling. By learning the latent representations of audio signals and their compositions without modeling the cross-modal relationship, AudioLDM is advantageous in both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance measured by both objective and subjective metrics (e.g., frechet distance). Moreover, AudioLDM is the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at [this https URL](https://audioldm.github.io/).*

The original codebase can be found at [haoheliu/AudioLDM](https://github.com/haoheliu/AudioLDM).

## Tips

When constructing a prompt, keep in mind:

* Descriptive prompt inputs work best; you can use adjectives to describe the sound (for example, "high quality" or "clear") and make the prompt context specific (for example, "water stream in a forest" instead of "stream").
* It's best to use general terms like "cat" or "dog" instead of specific names or abstract objects the model may not be familiar with.
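
The prompt guidelines above can be sketched as a small helper. `build_prompt` is a hypothetical function used only for illustration and is not part of diffusers; it joins a general subject, an optional context, and quality adjectives into one descriptive string.

```python
def build_prompt(subject, context="", qualifiers=()):
    """Join a general subject, an optional context, and adjectives into one prompt.

    Hypothetical helper for illustration; not part of diffusers.
    """
    # Make the prompt context specific: "water stream" + "in a forest".
    core = " ".join(part.strip() for part in (subject, context) if part.strip())
    # Append descriptive adjectives such as "high quality" or "clear".
    extras = [q.strip() for q in qualifiers if q.strip()]
    return ", ".join([core, *extras])


prompt = build_prompt("water stream", "in a forest", ["high quality", "clear"])
print(prompt)
```

The resulting string can be passed as the `prompt` argument of the pipeline call.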

During inference:

* The _quality_ of the predicted audio sample can be controlled by the `num_inference_steps` argument; more steps generally give higher-quality audio at the expense of slower inference.
* The _length_ of the predicted audio sample can be controlled by varying the `audio_length_in_s` argument.
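
One way to explore these two arguments is to sweep over a few combinations and compare the results by ear. The sketch below assumes `pipe` is an already loaded `AudioLDMPipeline`; `sweep_settings` and `run_sweep` are hypothetical helper names, and the values in the sweep are arbitrary.

```python
from itertools import product


def sweep_settings(step_options, length_options):
    """Return one kwargs dict per (steps, length) combination."""
    return [
        {"num_inference_steps": steps, "audio_length_in_s": length}
        for steps, length in product(step_options, length_options)
    ]


def run_sweep(pipe, prompt, settings):
    """Call the pipeline once per setting and keep the first waveform of each call."""
    results = []
    for kwargs in settings:
        audio = pipe(prompt, **kwargs).audios[0]
        results.append((kwargs, audio))
    return results


settings = sweep_settings(step_options=[10, 50], length_options=[2.5, 5.0])
for kwargs in settings:
    print(kwargs)
```

Calling `run_sweep(pipe, "water stream in a forest", settings)` would then return one `(kwargs, audio)` pair per combination. Runtime grows with both the step count and the number of combinations, so a small grid is usually enough.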

> [!TIP]
> Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

## AudioLDMPipeline[[diffusers.AudioLDMPipeline]]
#### diffusers.AudioLDMPipeline[[diffusers.AudioLDMPipeline]]

[Source](https://github.com/huggingface/diffusers/blob/v0.37.0/src/diffusers/pipelines/audioldm/pipeline_audioldm.py#L60)

Pipeline for text-to-audio generation using AudioLDM.

This model inherits from [DiffusionPipeline](/docs/diffusers/v0.37.0/en/api/pipelines/overview#diffusers.DiffusionPipeline). Check the superclass documentation for the generic methods
implemented for all pipelines (downloading, saving, running on a particular device, etc.).

`__call__(prompt: str | list[str] = None, audio_length_in_s: float | None = None, num_inference_steps: int = 10, guidance_scale: float = 2.5, negative_prompt: str | list[str] | None = None, num_waveforms_per_prompt: int | None = 1, eta: float = 0.0, generator: torch.Generator | list[torch.Generator] | None = None, latents: torch.Tensor | None = None, prompt_embeds: torch.Tensor | None = None, negative_prompt_embeds: torch.Tensor | None = None, return_dict: bool = True, callback: typing.Callable[[int, int, torch.Tensor], None] | None = None, callback_steps: int | None = 1, cross_attention_kwargs: dict[str, typing.Any] | None = None, output_type: str | None = 'np')`

[Source](https://github.com/huggingface/diffusers/blob/v0.37.0/src/diffusers/pipelines/audioldm/pipeline_audioldm.py#L360)

- **prompt** (`str` or `list[str]`, *optional*) --
  The prompt or prompts to guide audio generation. If not defined, you need to pass `prompt_embeds`.
- **audio_length_in_s** (`float`, *optional*, defaults to 5.12) --
  The length of the generated audio sample in seconds.
- **num_inference_steps** (`int`, *optional*, defaults to 10) --
  The number of denoising steps. More denoising steps usually lead to higher-quality audio at the
  expense of slower inference.
- **guidance_scale** (`float`, *optional*, defaults to 2.5) --
  A higher guidance scale value encourages the model to generate audio that is closely linked to the text
  `prompt` at the expense of lower sound quality. Guidance scale is enabled when `guidance_scale > 1`.
- **negative_prompt** (`str` or `list[str]`, *optional*) --
  The prompt or prompts to guide what to not include in audio generation. If not defined, you need to
  pass `negative_prompt_embeds` instead. Ignored when not using guidance (`guidance_scale < 1`).

The call function to the pipeline for generation.

Examples:
```py
>>> from diffusers import AudioLDMPipeline
>>> import torch
>>> import scipy

>>> repo_id = "cvssp/audioldm-s-full-v2"
>>> pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
>>> pipe = pipe.to("cuda")

>>> prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
>>> audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]

>>> # save the audio sample as a .wav file
>>> scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
```
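
Beyond the basic call above, the signature also accepts `negative_prompt`, `num_waveforms_per_prompt`, and `generator`. The following is a minimal sketch of a wrapper that passes them through; `generate_candidates` is a hypothetical helper, and `pipe` is assumed to be an already loaded `AudioLDMPipeline`. Seeding a `torch.Generator` is a common way to make sampling repeatable, although identical results across different hardware are not guaranteed.

```python
def generate_candidates(pipe, prompt, negative_prompt=None, num_waveforms=3,
                        generator=None, **call_kwargs):
    """Generate several waveforms for one prompt and return them as a list.

    Hypothetical wrapper; the keyword names passed to `pipe` come from the
    `AudioLDMPipeline.__call__` signature.
    """
    output = pipe(
        prompt,
        negative_prompt=negative_prompt,          # what the audio should not contain
        num_waveforms_per_prompt=num_waveforms,   # number of candidates per prompt
        generator=generator,                      # e.g. a seeded torch.Generator
        **call_kwargs,                            # e.g. num_inference_steps
    )
    return list(output.audios)
```

A call might look like `generate_candidates(pipe, "Hammer hitting a wooden surface", negative_prompt="low quality", generator=torch.Generator("cuda").manual_seed(0))`, after which the candidates can be compared and the best one kept.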

**Parameters:**

vae ([AutoencoderKL](/docs/diffusers/v0.37.0/en/api/models/autoencoderkl#diffusers.AutoencoderKL)) : Variational Auto-Encoder (VAE) model to encode and decode mel-spectrograms to and from latent representations.

text_encoder ([ClapTextModelWithProjection](https://huggingface.co/docs/transformers/v5.3.0/en/model_doc/clap#transformers.ClapTextModelWithProjection)) : Frozen text-encoder (`ClapTextModelWithProjection`), specifically the [laion/clap-htsat-unfused](https://huggingface.co/laion/clap-htsat-unfused) variant.

tokenizer (`PreTrainedTokenizer`) : A [RobertaTokenizer](https://huggingface.co/docs/transformers/v5.3.0/en/model_doc/mvp#transformers.RobertaTokenizer) to tokenize text.

unet ([UNet2DConditionModel](/docs/diffusers/v0.37.0/en/api/models/unet2d-cond#diffusers.UNet2DConditionModel)) : A `UNet2DConditionModel` to denoise the encoded audio latents.

scheduler ([SchedulerMixin](/docs/diffusers/v0.37.0/en/api/schedulers/overview#diffusers.SchedulerMixin)) : A scheduler to be used in combination with `unet` to denoise the encoded audio latents. Can be one of [DDIMScheduler](/docs/diffusers/v0.37.0/en/api/schedulers/ddim#diffusers.DDIMScheduler), [LMSDiscreteScheduler](/docs/diffusers/v0.37.0/en/api/schedulers/lms_discrete#diffusers.LMSDiscreteScheduler), or [PNDMScheduler](/docs/diffusers/v0.37.0/en/api/schedulers/pndm#diffusers.PNDMScheduler).

vocoder ([SpeechT5HifiGan](https://huggingface.co/docs/transformers/v5.3.0/en/model_doc/speecht5#transformers.SpeechT5HifiGan)) : Vocoder of class `SpeechT5HifiGan`.

**Returns:**

[AudioPipelineOutput](/docs/diffusers/v0.37.0/en/api/pipelines/audioldm#diffusers.AudioPipelineOutput) or `tuple`

If `return_dict` is `True`, [AudioPipelineOutput](/docs/diffusers/v0.37.0/en/api/pipelines/audioldm#diffusers.AudioPipelineOutput) is returned, otherwise a `tuple` is
returned where the first element is a list with the generated audio.
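
The two return styles can be handled uniformly, as in the sketch below. `first_waveform` is a hypothetical helper, and `pipe` is assumed to be an already loaded `AudioLDMPipeline`.

```python
def first_waveform(pipe, prompt, return_dict=True, **call_kwargs):
    """Return the first generated waveform for either return style."""
    result = pipe(prompt, return_dict=return_dict, **call_kwargs)
    if return_dict:
        # AudioPipelineOutput exposes the generated audio through `.audios`.
        audios = result.audios
    else:
        # Plain tuple: the first element holds the generated audio.
        audios = result[0]
    return audios[0]
```

Keeping `return_dict=True` is usually more readable; the tuple form is mainly useful where attribute access is inconvenient.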

## AudioPipelineOutput[[diffusers.AudioPipelineOutput]]
#### diffusers.AudioPipelineOutput[[diffusers.AudioPipelineOutput]]

[Source](https://github.com/huggingface/diffusers/blob/v0.37.0/src/diffusers/pipelines/pipeline_utils.py#L135)

Output class for audio pipelines.

**Parameters:**

audios (`np.ndarray`) : Denoised audio samples as a NumPy array of shape `(batch_size, num_channels, sample_rate)`.
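
A batch of generated waveforms can be written to disk one file per sample. The sketch below assumes each entry of `audios` is a one-dimensional mono waveform with float values in roughly `[-1.0, 1.0]` at 16 kHz, matching the `rate=16000` used in the pipeline example; these are assumptions, not guarantees of the output class. `save_wav` and `save_batch` are hypothetical helpers that use the standard-library `wave` module instead of SciPy to avoid an extra dependency.

```python
import sys
import wave
from array import array


def save_wav(path, samples, rate=16000):
    """Write mono float samples to a 16-bit PCM WAV file and return the duration in seconds."""
    # Clip to [-1.0, 1.0], then scale to signed 16-bit integers.
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in samples))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)   # mono (assumed)
        wav_file.setsampwidth(2)   # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm) / rate


def save_batch(audios, prefix="sample", rate=16000):
    """Save every waveform in `audios` as `<prefix>_<index>.wav` and return the paths."""
    paths = []
    for index, samples in enumerate(audios):
        path = f"{prefix}_{index}.wav"
        save_wav(path, samples, rate)
        paths.append(path)
    return paths
```

With a pipeline result named `output`, `save_batch(output.audios, prefix="techno")` would write `techno_0.wav`, `techno_1.wav`, and so on. For long clips, a vectorized conversion with NumPy would be faster than the per-sample loop used here.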

