XLSR-SLS

Arena metrics (EER % unless noted): 0.23 on ASVspoof2019_LA; 7.39 on ASVspoof2021_LA; 3.93 on ASVspoof2021_DF; 7.46 on InTheWild; 9.81 on CD-ADD; 24.19 on SONAR; 1.86 on LibriSeVoc; 12.81 on CFAD; 10.53 on CVoiceFake_small; 18.76 on ASVspoof5; 19.68 (1-SRR %) on LRLspoof; 14.31 on ADD22_eval_31; 27.07 on DeepVoice; 31.2 on ArAD; 12.1 on DECRO; 8.7 on J-SPAW_LA; 25.47 on ODSS; 1.59 on HABLA; 7.48 on DFADD; 5.24 on PyAra; 15.27 on XMAD; 19.37 on ADD2023_track12_test_r1; 0.61 on EmoFake_test; 62.12 (1-SRR %) on EmoSpoofTTS.

A wav2vec 2.0 (XLS-R 300M) + SLS audio-deepfake-detection model, from "Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier" (Zhang, Wen & Hu, ACM MM 2024). A self-supervised XLS-R front-end is paired with the SLS (Sensitive Layer Selection) classifier, which treats the 24 XLS-R transformer layers as a feature pyramid and learns to weight them. The model takes a raw speech waveform and returns a score where higher = more bona fide.

The exact wrapper used to produce the Arena scores is in xlsr_sls.py; the network definition is in _net.py.

Architecture

  1. wav2vec 2.0 XLS-R (300M) front-end: a self-supervised transformer (fairseq Wav2Vec2Model) producing 1024-d frame features from all 24 transformer layers.
  2. SLS (Sensitive Layer Selection) back-end: every layer's hidden state is average-pooled to a 1024-d descriptor and gated by a per-layer sigmoid attention (fc0 → sigmoid); the gates re-weight the full per-layer feature stack, which is summed across layers. The fused feature passes through BatchNorm + SELU + 3×3 max-pool, is flattened, and goes through a two-layer MLP (fc1: 22847→1024, fc3: 1024→2).
  3. The 2-class log-softmax output is read at index 1 = bona fide.
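The layer weighting in item 2 can be illustrated with a small NumPy sketch. This is not the code in _net.py: the function names sls_fuse and max_pool_3x3, the random stand-in weights, and the choice of a single fc0 (1024 → 1) shared across layers are assumptions made for illustration. BatchNorm and SELU are left out because they do not change shapes, and the frame count of 201 is an assumed value for one 64,600-sample window.

```python
import numpy as np

def sls_fuse(layer_feats, w0, b0):
    """Gate each layer by a sigmoid of its time-averaged descriptor, then sum.

    layer_feats: (num_layers, frames, dim) stack of hidden states.
    w0, b0: weights of a hypothetical shared fc0 mapping dim -> 1.
    """
    pooled = layer_feats.mean(axis=1)                          # (num_layers, dim)
    gates = 1.0 / (1.0 + np.exp(-(pooled @ w0 + b0)))          # (num_layers,)
    fused = (gates[:, None, None] * layer_feats).sum(axis=0)   # (frames, dim)
    return fused, gates

def max_pool_3x3(x):
    """Non-overlapping 3x3 max-pool over a (frames, dim) map; drops the remainder."""
    t, d = x.shape[0] // 3, x.shape[1] // 3
    return x[: t * 3, : d * 3].reshape(t, 3, d, 3).max(axis=(1, 3))

# Random stand-ins for the 24 XLS-R layer outputs (assumed 201 frames x 1024 dims).
rng = np.random.default_rng(0)
layer_feats = rng.standard_normal((24, 201, 1024)).astype(np.float32)
w0 = rng.standard_normal(1024).astype(np.float32) * 0.01

fused, gates = sls_fuse(layer_feats, w0, b0=0.0)
pooled_map = max_pool_3x3(fused)
print(gates.shape, fused.shape, pooled_map.shape, pooled_map.size)
```

Under these assumptions the pooled map has 67 × 341 = 22,847 elements, which agrees with the fc1 input size quoted above.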

How it was trained

  • Data: ASVspoof 2019 Logical Access (LA).
  • Input length: raw audio at 16 kHz cropped/padded to 64,600 samples (~4.04 s). The window length is fixed: fc1 expects a 22,847-d flattened input, so the 64,600-sample window is mandatory at inference.
  • Output: 2-class log-softmax; the bona-fide log-prob (index 1) is the score.
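The link between the 64,600-sample window and the 22,847-d fc1 input can be checked with a few lines of arithmetic. The sketch below assumes the standard wav2vec 2.0 convolutional feature extractor (seven layers with the kernel/stride pairs listed) and a non-overlapping 3×3 max-pool; the constant and function names are illustrative and do not come from _net.py.

```python
# (kernel, stride) of the standard wav2vec 2.0 feature extractor; assumed to apply to XLS-R 300M.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]

def num_frames(num_samples, conv_layers=CONV_LAYERS):
    """Number of frames produced by a stack of unpadded 1-D convolutions."""
    length = num_samples
    for kernel, stride in conv_layers:
        length = (length - kernel) // stride + 1
    return length

def flattened_size(num_samples, feat_dim=1024, pool=3):
    """Size of the (frames x feat_dim) map after a non-overlapping pool x pool max-pool."""
    frames = num_frames(num_samples)
    return (frames // pool) * (feat_dim // pool)

print(num_frames(64600))       # 201 frames
print(flattened_size(64600))   # 22847, the fc1 input size
print(flattened_size(48000))   # a different size, so fc1 would reject a 3 s window
```

Any other window length changes the frame count and therefore the flattened size, which is why inputs have to be windowed to exactly 64,600 samples.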

See the source repository for the full training and evaluation code.

Benchmark result (Speech Anti-Spoofing Arena)

Evaluated through the reproducible Speech Anti-Spoofing Arena. Scores were computed with a deterministic first-64,600-sample window (no random crop), so the numbers are exactly reproducible from the pinned score file. Arena standing: 🥇 gold tier, rank #1 of 10.

Dataset            Split  EER %  Trials   Skipped  W2V2-AASIST†  Notes
ASVspoof2019_LA    test    0.23   71,237  0         0.22         in-domain (training data)
ASVspoof2021_LA    test    7.39  181,566  0         8.11         cross-dataset generalization
ASVspoof2021_DF    test    3.93  611,829  0         8.32         cross-dataset generalization
InTheWild          test    7.46   31,779  0        11.22         out-of-domain (real-world deepfakes)
CD-ADD             test    9.81   20,786  0        38.57         out-of-domain (modern neural-TTS)
SONAR              test   24.19    3,948  0         -            out-of-domain (multi-generator deepfakes)
LibriSeVoc         test    1.86   18,487  0         -            out-of-domain (vocoder artifacts)
CFAD               test   12.81   62,999  0         -            out-of-domain (Chinese audio deepfakes)
CVoiceFake_small   test   10.53  138,136  0         -            out-of-domain (multilingual TTS/vocoder)
ASVspoof5          test   18.76  680,774  0         -            out-of-domain (ASVspoof5 eval)
ADD22_eval_31      test   14.31  112,861  0         -            out-of-domain (ADD 2022 Mandarin Track-3 fake-game)

† Same benchmark, the other XLS-R-based system (XLS-R 300M + AASIST). XLSR-SLS's multi-layer SLS fusion wins on every cross-dataset and out-of-domain set for which a W2V2-AASIST score is listed, most strikingly on ASVspoof2021_DF (3.93 vs 8.32) and CD-ADD (9.81 vs 38.57), and is on par in-domain (0.23 vs 0.22). The benchmark's ASVspoof2021 LA/DF use curated trial sets, so absolute EER differs from the paper's official-keys numbers (1.92 % DF, 7.46 % InTheWild; the latter is matched here exactly); the relative ordering is the meaningful comparison.
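For readers unfamiliar with the metric, the EER reported in the table is the operating point at which the rate of spoofs accepted equals the rate of bona fide trials rejected. The following is a minimal pure-Python sketch of that definition, not the Arena's scoring code: the function name compute_eer is hypothetical, and the simple threshold sweep is quadratic in the number of trials, so it is only suitable for small score lists.

```python
def compute_eer(bona_scores, spoof_scores):
    """Approximate the equal error rate by sweeping every observed score as a threshold.

    Higher score = more bona fide; a trial is accepted when score >= threshold.
    """
    thresholds = sorted(set(bona_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for thr in thresholds:
        frr = sum(s < thr for s in bona_scores) / len(bona_scores)     # bona fide rejected
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)  # spoofs accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Toy scores: one bona fide trial scores low and one spoof scores high.
bona = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(f"EER = {100 * compute_eer(bona, spoof):.2f} %")   # EER = 25.00 %
```

Because only the ordering of scores matters, the same function works directly on the log-probability scores the model returns.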

Usage

The checkpoint is a state_dict for the Model network defined in _net.py. Constructing the network requires the base XLS-R 300M checkpoint xlsr2_300m.pt (only used to build the wav2vec 2.0 architecture; every weight is then overwritten by MMpaper_model.pth):

wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/xlsr2_300m.pt

The input must be exactly 64,600 samples at 16 kHz mono: window the waveform with pad_fixed (first 64,600 samples, tile-repeat if shorter).

import numpy as np
from xlsr_sls import XLSRSLS   # _net.py + xlsr_sls.py are in this repo

m = XLSRSLS()
m.load()                                          # loads MMpaper_model.pth (+ xlsr2_300m.pt)
audio = np.random.randn(48000).astype(np.float32) # float32 mono 16 kHz
print(m.score_batch([audio], [16000])[0])         # higher = more bona fide
m.unload()

Internally the wrapper windows the input, runs the network, and returns output[:, 1] (class 1 = bona fide; source main.py: batch_score = batch_out[:, 1]). xlsr_sls.py is the exact speech_spoof_bench model that produced the Arena scores.txt.
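The windowing step can be sketched as follows. The name pad_fixed and its behaviour (keep the first 64,600 samples, tile-repeat shorter inputs) come from the description above, but this body is an assumption about how such a helper might look, not the implementation in xlsr_sls.py; the target_len parameter and the empty-input check are illustrative additions.

```python
import numpy as np

def pad_fixed(waveform, target_len=64600):
    """Return exactly target_len samples: truncate long inputs, tile-repeat short ones."""
    x = np.asarray(waveform, dtype=np.float32)
    n = x.shape[0]
    if n == 0:
        raise ValueError("cannot window an empty waveform")
    if n >= target_len:
        return x[:target_len]                  # deterministic: first target_len samples
    repeats = target_len // n + 1              # enough copies to cover the window
    return np.tile(x, repeats)[:target_len]

short = pad_fixed(np.arange(5), target_len=12)
print(short)                                   # [0. 1. 2. 3. 4. 0. 1. 2. 3. 4. 0. 1.]

three_seconds = np.random.randn(48000).astype(np.float32)
print(pad_fixed(three_seconds).shape)          # (64600,)
```

Taking the first samples rather than a random crop keeps the window deterministic, so the same file always yields the same score.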

Citation

@inproceedings{zhang2024audio,
  title={Audio Deepfake Detection with Self-Supervised XLS-R and SLS Classifier},
  author={Zhang, Qishan and Wen, Shuangbing and Hu, Tao},
  booktitle={Proceedings of the 32nd ACM International Conference on Multimedia},
  pages={6765--6773},
  year={2024},
  doi={10.1145/3664647.3681345}
}

License

MIT; see the source repository.

Maintainer

Maintained by Kirill Borodin (SpeechAntiSpoofingBenchmarks).
