# Woosh — Sony AI Sound-Effect Foundation Model (Mirror)

This repository is a community mirror of the open weights released by Sony Research for Woosh — a foundation model for sound-effect generation supporting text-to-audio (T2A) and video-to-audio (V2A) synthesis.

All files here are a one-to-one copy of Sony's v1.0.0 GitHub release, repackaged into a single browsable HF repo for convenience.

## License — CC-BY-NC 4.0 (Non-Commercial)

All open weights in this repository are released by Sony Research under the CC-BY-NC 4.0 license. Generated outputs inherit the non-commercial restriction. You may not use model outputs in commercial products, paid releases, or client work. The upstream project's source code is released separately under MIT / Apache-2.0.

For attribution: Sony Research — Woosh (arXiv / paper, GitHub: `SonyResearch/Woosh`).

## Model Suite

Woosh is a multi-model suite. All components are required together — the generative backbones depend on the shared AE, CLAP, and text conditioners at inference time.

### Shared infrastructure

| Folder | Role | File(s) |
|---|---|---|
| `checkpoints/Woosh-AE/` | Audio encoder / decoder producing high-quality latents | `weights.safetensors`, `config.yaml` |
| `checkpoints/Woosh-CLAP/` | Multimodal text–audio alignment model (audio + text encoders) | `weights_audio.safetensors`, `weights_text.safetensors`, `config.yaml` |
| `checkpoints/TextConditionerA/` | Text conditioner for the T2A path (pairs with Flow / DFlow) | `weights.safetensors`, `config.yaml` |
| `checkpoints/TextConditionerV/` | Text conditioner for the V2A path (pairs with VFlow / DVFlow) | `weights.safetensors`, `config.yaml` |

### Generative backbones

| Folder | Task | Notes |
|---|---|---|
| `checkpoints/Woosh-Flow/` | Text → Audio | Full-quality T2A latent diffusion |
| `checkpoints/Woosh-DFlow/` | Text → Audio | Distilled T2A — fewer steps, faster inference |
| `checkpoints/Woosh-VFlow-8s/` | Video → Audio | V2A latent diffusion — fixed 8-second output |
| `checkpoints/Woosh-DVFlow-8s/` | Video → Audio | Distilled V2A — fewer steps, fixed 8-second output |

Every weight file ships as `safetensors`. There are no `.pt` / `.ckpt` / `.bin` files in this mirror.

## Layout

```
checkpoints/
├── Woosh-AE/
│   ├── weights.safetensors
│   └── config.yaml
├── Woosh-CLAP/
│   ├── weights_audio.safetensors
│   ├── weights_text.safetensors
│   └── config.yaml
├── TextConditionerA/
│   ├── weights.safetensors
│   └── config.yaml
├── TextConditionerV/
│   ├── weights.safetensors
│   └── config.yaml
├── Woosh-Flow/
│   ├── weights.safetensors
│   └── config.yaml
├── Woosh-DFlow/
│   ├── weights.safetensors
│   └── config.yaml
├── Woosh-VFlow-8s/
│   ├── weights.safetensors
│   └── config.yaml
└── Woosh-DVFlow-8s/
    ├── weights.safetensors
    └── config.yaml
```

Directory names match Sony's release zip layout exactly so the upstream inference code finds its configs without modification.

## Usage

This mirror is intended to be consumed by Sony's upstream woosh package. Clone and install the upstream repo, then point it at a local copy of this mirror's checkpoints/ directory.

```bash
# Clone upstream
git clone https://github.com/SonyResearch/Woosh.git
cd Woosh

# Sony's suggested env setup (uses uv)
uv sync
uv pip install -e .

# Pull weights from this mirror
hf download AEmotionStudio/woosh-models --local-dir ./
```

## Acknowledgements

All credit for the Woosh models belongs to Sony Research. This mirror exists solely to make the CC-BY-NC open weights easier to fetch and integrate. Please cite the upstream project and respect the non-commercial license.
