---
license: cc-by-4.0
library_name: nemo
datasets:
- fisher_english
- NIST_SRE_2004-2010
- librispeech
- ami_meeting_corpus
- voxconverse_v0.3
- icsi
- aishell4
- dihard_challenge-3-dev
- NIST_SRE_2000-Disc8_split1
- Alimeeting-train
- DiPCo
thumbnail: null
tags:
- speaker-diarization
- speaker-recognition
- speech
- audio
- Transformer
- FastConformer
- Conformer
- NEST
- pytorch
- NeMo
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
model-index:
- name: diar_streaming_sortformer_4spk-v2
  results:
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: DIHARD III Eval (1-4 spk)
      type: dihard3-eval-1to4spks
      config: with_overlap_collar_0.0s
      input_buffer_length: 1.04s
      split: eval-1to4spks
    metrics:
    - name: Test DER
      type: der
      value: 13.24
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: DIHARD III Eval (5-9 spk)
      type: dihard3-eval-5to9spks
      config: with_overlap_collar_0.0s
      input_buffer_length: 1.04s
      split: eval-5to9spks
    metrics:
    - name: Test DER
      type: der
      value: 42.56
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: DIHARD III Eval (full)
      type: dihard3-eval
      config: with_overlap_collar_0.0s
      input_buffer_length: 1.04s
      split: eval
    metrics:
    - name: Test DER
      type: der
      value: 18.91
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (2 spk)
      type: CALLHOME-part2-2spk
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: part2-2spk
    metrics:
    - name: Test DER
      type: der
      value: 6.57
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (3 spk)
      type: CALLHOME-part2-3spk
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: part2-3spk
    metrics:
    - name: Test DER
      type: der
      value: 10.05
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (4 spk)
      type: CALLHOME-part2-4spk
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: part2-4spk
    metrics:
    - name: Test DER
      type: der
      value: 12.44
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (5 spk)
      type: CALLHOME-part2-5spk
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: part2-5spk
    metrics:
    - name: Test DER
      type: der
      value: 21.68
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (6 spk)
      type: CALLHOME-part2-6spk
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: part2-6spk
    metrics:
    - name: Test DER
      type: der
      value: 28.74
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: CALLHOME (NIST-SRE-2000 Disc8) part2 (full)
      type: CALLHOME-part2
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: part2
    metrics:
    - name: Test DER
      type: der
      value: 10.70
  - task:
      name: Speaker Diarization
      type: speaker-diarization-with-post-processing
    dataset:
      name: call_home_american_english_speech
      type: CHAES_2spk_109sessions
      config: with_overlap_collar_0.25s
      input_buffer_length: 1.04s
      split: ch109
    metrics:
    - name: Test DER
      type: der
      value: 4.88
metrics:
- der
pipeline_tag: audio-classification
---

# Streaming Sortformer Diarizer 4spk v2.1

<style>
img {
 display: inline;
}
</style>

[![Model architecture](https://img.shields.io/badge/Model_Arch-FastConformer--Transformer-lightgrey#model-badge)](#model-architecture)
| [![Model size](https://img.shields.io/badge/Params-117M-lightgrey#model-badge)](#model-architecture)
<!-- | [![Language](https://img.shields.io/badge/Language-multilingual-lightgrey#model-badge)](#datasets) -->

This model is a streaming version of the Sortformer diarizer. [Sortformer](https://arxiv.org/abs/2409.06656)[1] is a novel end-to-end neural model for speaker diarization, trained with objectives that are unconventional compared to existing end-to-end diarization models.

<div align="center">
  <img src="figures/sortformer_intro.png" width="750" />
</div>

[Streaming Sortformer](https://arxiv.org/abs/2507.18446)[2] employs an Arrival-Order Speaker Cache (AOSC) to store frame-level acoustic embeddings of previously observed speakers.
<div align="center">
  <img src="figures/aosc_3spk_example.gif" width="1400" />
</div>
<div align="center">
  <img src="figures/aosc_4spk_example.gif" width="1400" />
</div>

Sortformer resolves the permutation problem in diarization by following the arrival-time order of the speech segments from each speaker.

## Model Architecture

Streaming Sortformer employs the pre-encode layer of the Fast Conformer encoder to generate the speaker cache. At each step, the speaker cache is filtered so that only high-quality speaker-cache vectors are retained.

<div align="center">
  <img src="figures/streaming_steps.png" width="1400" />
</div>

Aside from the speaker-cache management part, streaming Sortformer follows the architecture of the offline version of Sortformer: an L-size (17-layer) [NeMo Encoder for Speech Tasks (NEST)](https://arxiv.org/abs/2408.13106)[3], which is based on the [Fast-Conformer](https://arxiv.org/abs/2305.05084)[4] encoder, followed by an 18-layer Transformer[5] encoder with a hidden size of 192, and two feedforward layers with 4 sigmoid outputs for each frame at the top layer. More information can be found in the [Streaming Sortformer paper](https://arxiv.org/abs/2507.18446)[2].

<div align="center">
  <img src="figures/sortformer-v1-model.png" width="450" />
</div>


## NVIDIA NeMo

To train, fine-tune, or perform diarization with Sortformer, you need to install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo)[6]. We recommend installing it after you have installed Cython and the latest PyTorch version.

```bash
apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython packaging
pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[asr]
```

## How to Use this Model

The model is available for use in the NeMo Framework[6], and can be used as a pre-trained checkpoint for inference or for fine-tuning on another dataset.

### Loading the Model

```python3
from nemo.collections.asr.models import SortformerEncLabelModel

# load model from the Hugging Face model card directly (you need a Hugging Face token)
diar_model = SortformerEncLabelModel.from_pretrained("nvidia/diar_streaming_sortformer_4spk-v2")

# if you have a downloaded model in "/path/to/diar_streaming_sortformer_4spk-v2.nemo", load the model from the file
diar_model = SortformerEncLabelModel.restore_from(restore_path="/path/to/diar_streaming_sortformer_4spk-v2.nemo", map_location='cuda', strict=False)

# switch to inference mode
diar_model.eval()
```

### Input Format
Input to Sortformer can be an individual audio file:
```python3
audio_input="/path/to/multispeaker_audio1.wav"
```
or a list of paths to audio files:
```python3
audio_input=["/path/to/multispeaker_audio1.wav", "/path/to/multispeaker_audio2.wav"]
```
or a JSONL manifest file:
```python3
audio_input="/path/to/multispeaker_manifest.json"
```
where each line is a dictionary containing the following fields:
```yaml
# Example of a line in `multispeaker_manifest.json`
{
    "audio_filepath": "/path/to/multispeaker_audio1.wav",  # path to the input audio file
    "offset": 0,  # offset (start) time of the input audio
    "duration": 600,  # duration of the audio; can be set to `null` if using the NeMo main branch
}
{
    "audio_filepath": "/path/to/multispeaker_audio2.wav",
    "offset": 900,
    "duration": 580,
}
```
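
To create such a manifest programmatically, here is a minimal sketch (the file paths, offsets, and durations are placeholders to be replaced with your own):
```python3
import json

# Placeholder entries; substitute your own paths, offsets, and durations.
entries = [
    {"audio_filepath": "/path/to/multispeaker_audio1.wav", "offset": 0, "duration": 600},
    {"audio_filepath": "/path/to/multispeaker_audio2.wav", "offset": 900, "duration": 580},
]

# Write one JSON object per line (JSONL format).
with open("/path/to/multispeaker_manifest.json", "w") as f:
    for entry in entries:
        f.write(json.dumps(entry) + "\n")
```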

### Setting up Streaming Configuration

Streaming configuration is defined by the following parameters, all measured in **80ms frames**:
* **CHUNK_SIZE**: The number of frames in a processing chunk.
* **RIGHT_CONTEXT**: The number of future frames attached after the chunk.
* **FIFO_SIZE**: The number of previous frames attached before the chunk, from the FIFO queue.
* **UPDATE_PERIOD**: The number of frames extracted from the FIFO queue to update the speaker cache.
* **SPEAKER_CACHE_SIZE**: The total number of frames in the speaker cache.

Here are recommended configurations for different scenarios:
| **Configuration** | **Latency** | **RTF** | **CHUNK_SIZE** | **RIGHT_CONTEXT** | **FIFO_SIZE** | **UPDATE_PERIOD** | **SPEAKER_CACHE_SIZE** |
| :---------------- | :---------- | :------ | :------------- | :---------------- | :------------ | :---------------- | :--------------------- |
| very high latency | 30.4s | 0.002 | 340 | 40 | 40 | 300 | 188 |
| high latency | 10.0s | 0.005 | 124 | 1 | 124 | 124 | 188 |
| low latency | 1.04s | 0.093 | 6 | 7 | 188 | 144 | 188 |
| ultra low latency | 0.32s | 0.180 | 3 | 1 | 188 | 144 | 188 |

For clarity on the metrics used in the table:
* **Latency**: Refers to **Input Buffer Latency**, calculated as **CHUNK_SIZE** + **RIGHT_CONTEXT** (in 80ms frames). This value does not include computational processing time.
* **Real-Time Factor (RTF)**: Characterizes processing speed, calculated as the time taken to process an audio file divided by its duration. RTF values are measured with a batch size of 1 on an NVIDIA RTX 6000 Ada Generation GPU.

To set the streaming configuration, use:
```python3
diar_model.sortformer_modules.chunk_len = CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = FIFO_SIZE
diar_model.sortformer_modules.spkcache_update_period = UPDATE_PERIOD
diar_model.sortformer_modules.spkcache_len = SPEAKER_CACHE_SIZE
diar_model.sortformer_modules._check_streaming_parameters()
```
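
For example, the low-latency row of the table above corresponds to:
```python3
# Low-latency configuration from the table above.
diar_model.sortformer_modules.chunk_len = 6                 # CHUNK_SIZE
diar_model.sortformer_modules.chunk_right_context = 7       # RIGHT_CONTEXT
diar_model.sortformer_modules.fifo_len = 188                # FIFO_SIZE
diar_model.sortformer_modules.spkcache_update_period = 144  # UPDATE_PERIOD
diar_model.sortformer_modules.spkcache_len = 188            # SPEAKER_CACHE_SIZE
diar_model.sortformer_modules._check_streaming_parameters()

# Input buffer latency = (CHUNK_SIZE + RIGHT_CONTEXT) * 0.08s = (6 + 7) * 0.08s = 1.04s
```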

### Getting Diarization Results
To perform speaker diarization and get a list of speaker-marked speech segments in the format 'begin_seconds, end_seconds, speaker_index', simply use:
```python3
predicted_segments = diar_model.diarize(audio=audio_input, batch_size=1)
```
To obtain tensors of speaker activity probabilities, use:
```python3
predicted_segments, predicted_probs = diar_model.diarize(audio=audio_input, batch_size=1, include_tensor_outputs=True)
```
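
A minimal sketch of consuming these outputs, assuming `diarize` returns one list of segment strings per input audio file:
```python3
# predicted_segments: one entry per input audio file; each entry is a list of
# segments in the 'begin_seconds, end_seconds, speaker_index' format above.
for file_idx, segments in enumerate(predicted_segments):
    print(f"Audio file #{file_idx}:")
    for segment in segments:
        print(segment)
```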

### Input

This model accepts single-channel (mono) audio sampled at 16,000 Hz.
- The actual input tensor is an Ns x 1 matrix for each audio clip, where Ns is the number of samples in the time-series signal.
- For instance, a 10-second audio clip sampled at 16,000 Hz (mono-channel WAV file) will form a 160,000 x 1 matrix.
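
As a quick sanity check of this input format, a sketch using the `soundfile` library (an assumption here, not a NeMo requirement):
```python3
import soundfile as sf

# Load a mono WAV file; `samples` has one value per sample (Ns samples in total).
samples, sample_rate = sf.read("/path/to/multispeaker_audio1.wav")
assert sample_rate == 16000, "Sortformer expects 16 kHz input"
print(samples.shape)  # e.g., (160000,) for a 10-second clip
```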

### Output

The output of the model is a T x S matrix, where:
- S is the maximum number of speakers (in this model, S = 4).
- T is the total number of frames, including zero-padding. Each frame corresponds to a segment of 0.08 seconds of audio.
Each element of the T x S matrix represents the speaker activity probability in the [0, 1] range. For example, a matrix element a(150, 2) = 0.95 indicates a 95% probability of activity for the second speaker during the time range [12.00, 12.08] seconds.
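
A minimal sketch of reading per-frame speaker activity from this matrix, assuming `predicted_probs` as returned above and a hypothetical 0.5 decision threshold:
```python3
FRAME_LEN = 0.08  # seconds of audio per output frame

probs = predicted_probs[0]  # T x S matrix for the first audio file
for t, frame_probs in enumerate(probs):
    start, end = t * FRAME_LEN, (t + 1) * FRAME_LEN
    active = [s for s, p in enumerate(frame_probs) if p > 0.5]
    if active:
        print(f"[{start:.2f}s, {end:.2f}s] active speakers: {active}")
```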

## Train and evaluate Sortformer diarizer using NeMo
### Training

Sortformer diarizer models are trained on 8 nodes of 8×NVIDIA Tesla V100 GPUs, using 90-second-long training samples and a batch size of 4.
The model can be trained using this [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/sortformer_diar_train.py) and [base config](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/conf/neural_diarizer/sortformer_diarizer_hybrid_loss_4spk-v1.yaml).

### Inference

Sortformer diarizer inference with post-processing algorithms can be performed using this inference [example script](https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_tasks/diarization/neural_diarizer/e2e_diarize_speech.py). Provide the post-processing YAML configs from the [`post_processing` folder](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing) to reproduce the optimized post-processing algorithm for each development dataset.

### Technical Limitations

- The model operates in a streaming (online) mode.
- It can detect a maximum of 4 speakers; performance degrades on recordings with 5 or more speakers.
- While the model is designed for long-form audio and can handle recordings that are several hours long, performance may degrade on very long recordings.
- The model was trained on publicly available speech datasets, primarily in English. As a result:
  * Performance may degrade on non-English speech.
  * Performance may also degrade on out-of-domain data, such as recordings in noisy conditions.

## Datasets

Sortformer was trained on a combination of 2445 hours of real conversations and 5150 hours of simulated audio mixtures generated by the [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)[7].
All the datasets listed below use the same labeling method via the [RTTM](https://web.archive.org/web/20100606092041if_/http://www.itl.nist.gov/iad/mig/tests/rt/2009/docs/rt09-meeting-eval-plan-v2.pdf) format. A subset of the RTTM files was processed specifically for speaker diarization model training.
Data collection methods vary across individual datasets: they include phone calls, interviews, web videos, and audiobook recordings. Please refer to the [Linguistic Data Consortium (LDC) website](https://www.ldc.upenn.edu/) or the individual dataset webpages for detailed data collection methods.

### Training Datasets (Real conversations)
- Fisher English (LDC)
- AMI Meeting Corpus
- VoxConverse-v0.3
- ICSI
- AISHELL-4
- Third DIHARD Challenge Development (LDC)
- 2000 NIST Speaker Recognition Evaluation, split1 (LDC)
- DiPCo
- AliMeeting

### Training Datasets (Used to simulate audio mixtures)
- 2004-2010 NIST Speaker Recognition Evaluation (LDC)
- Librispeech

## Performance

### Evaluation data specifications

| **Dataset** | **Number of speakers** | **Number of sessions** |
|----------------------------|------------------------|------------------------|
| **DIHARD III Eval <=4spk** | 1-4 | 219 |
| **DIHARD III Eval >=5spk** | 5-9 | 40 |
| **DIHARD III Eval full** | 1-9 | 259 |
| **CALLHOME-part2 2spk** | 2 | 148 |
| **CALLHOME-part2 3spk** | 3 | 74 |
| **CALLHOME-part2 4spk** | 4 | 20 |
| **CALLHOME-part2 5spk** | 5 | 5 |
| **CALLHOME-part2 6spk** | 6 | 3 |
| **CALLHOME-part2 full** | 2-6 | 250 |
| **CH109** | 2 | 109 |

### Diarization Error Rate (DER)

* All evaluations include overlapping speech.
* Collar tolerance is 0s for DIHARD III Eval, and 0.25s for CALLHOME-part2 and CH109.
* Post-Processing (PP) is optimized on two different held-out dataset splits:
  - [DIHARD III Dev Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_dihard3-dev.yaml) for DIHARD III Eval
  - [CALLHOME-part1 Optimized Post-Processing](https://github.com/NVIDIA/NeMo/tree/main/examples/speaker_tasks/diarization/conf/post_processing/diar_streaming_sortformer_4spk-v2_callhome-part1.yaml) for CALLHOME-part2 and CH109

### Dataset Evaluation Results

| Model | CHAES | AMI-Eval | AMI-Eval | DiPCo | Mixer6 |
|---------------------------------------|:------:|:--------:|:--------:|:-----:|:------:|
| **Configurations** | CH-109 | IHM | SDM | CH1 | CH4 |
| **diar_sortformer_streaming-v2.nemo** | 15.81% | 21.26% | 0.00% | 0.00% | 0.00% |

## References

[1] [Sortformer: Seamless Integration of Speaker Diarization and ASR by Bridging Timestamps and Tokens](https://arxiv.org/abs/2409.06656)

[2] [Streaming Sortformer: Speaker Cache-Based Online Speaker Diarization with Arrival-Time Ordering](https://arxiv.org/abs/2507.18446)

[3] [NEST: Self-supervised Fast Conformer as All-purpose Seasoning to Speech Processing Tasks](https://arxiv.org/abs/2408.13106)

[4] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)

[5] [Attention Is All You Need](https://arxiv.org/abs/1706.03762)

[6] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

[7] [NeMo speech data simulator](https://arxiv.org/abs/2310.12371)

## License

The license to use this model is [CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode). By downloading the public and release version of the model, you accept the terms and conditions of the CC-BY-4.0 license.