S4D: Speaker Diarization Toolkit in Python

1 French National Audiovisual Institute (INA), Paris, France
2 Computer Science Laboratory of Le Mans University (LIUM - EA 4023), Le Mans, France

SIDEKIT for Diarization (S4D)

In this paper, we present S4D, a new open-source Python toolkit dedicated to speaker diarization. S4D provides various state-of-the-art components and the possibility to easily develop end-to-end diarization prototype systems. S4D offers a large panel of clustering, segmentation, scoring and visualization algorithms. S4D has been designed to be easily understood, installed, modified and used, in order to allow fast transfers of diarization technologies to industry and to facilitate the development of new approaches. Examples, benchmarks on standard tasks and tutorials are provided in this paper. S4D is an extension of the open-source toolkit for speaker recognition: SIDEKIT.

  1. Introduction
    The diarization task is a necessary pre-processing step for speaker identification [1] or speech transcription [2] when there is more than one speaker in an audio/video recording. For each speaker in a recording, it consists of detecting the time areas where he or she speaks. Each time area, corresponding to a segment, is annotated with an abstract label representing the speaker. Thus, the diarization task determines who spoke when. This domain is still an active research area since there are many unsolved problems, such as the detection of overlapped speech [3] or the labeling of speech overlapping with music.
    For the diarization task, few toolkits are available. Most of them are dedicated to research. Quick transfers of new technologies to industry require tools which are close to industrial standards. To this end, a diarization toolkit should comply with some requirements:
    • be easy to understand, modify, install and use;
    • enable end-to-end diarization system development;
    • offer various state-of-the-art algorithms;
    • manage standard data formats to allow compatibility with other tools.
    To address the shortcomings of existing toolkits, we developed S4D, a new toolkit for diarization that fulfills the requirements above and facilitates the development of new approaches.
    In this paper, we first present the context in which S4D has been developed. We then give a detailed description of the contents of S4D before providing a guide to developing a broadcast news diarization system. Finally, we explain how to deploy S4D before offering a few perspectives.

  2. Context
    This section presents the context in which S4D has been developed, other existing tools and the link with SIDEKIT [4].
    2.1. Comparison and compatibilities with existing tools
    Few tools are freely available for speaker diarization. S4D has been designed to overcome limitations of those tools.
    LIUMSpkDiarization [5, 6] is a toolkit for diarization written in Java. It includes most state-of-the-art methods in the diarization field. This toolkit was developed by the Computer Science Laboratory of Le Mans University (LIUM) for the French ESTER2 evaluation campaign [7], where it obtained the best results for the task of diarization of broadcast news in 2008. This toolkit has two main drawbacks: it is no longer being updated, and it can only be run from the command line through a jar file.
    Pyannote.metrics [8] is a toolkit for reproducible evaluation, diagnostic and error analysis of diarization systems. It is a regularly updated project with a wide selection of metrics. S4D includes certain metrics from this toolkit to offer greater ease of use.
    Pyannote.audio [9] is a toolkit for diarization. It only proposes state-of-the-art methods, developed following the object-oriented paradigm, which makes it easy to extend. However, it requires considerable learning time.

2.2. SIDEKIT and S4D
SIDEKIT is an open-source package for speaker and language recognition developed by Anthony Larcher, Kong Aik Lee and Sylvain Meignier [4], which provides an end-to-end tool-chain including various state-of-the-art algorithms. SIDEKIT for Diarization (S4D) is an open-source extension of SIDEKIT dedicated to diarization. The aim of S4D is to provide an educational and efficient toolkit for diarization, encompassing the whole processing chain that goes from the audio data to the analysis of the system performance. Furthermore, both SIDEKIT and S4D are written entirely in Python and have been tested on several platforms under Python 3, on both Linux and MacOS.

  3. What is in S4D?

This section describes several uses currently offered by S4D.

3.1. Segmentation

The segmentation detects the instantaneous change points corresponding to segment boundaries. The proposed algorithm is based on the detection of local maxima. It detects the change points through a Gaussian Divergence (GD) [10], computed using Gaussians. The left and right Gaussians are estimated over a window sliding along the whole signal. A change point, i.e. a segment boundary, is present in the middle of the window when the Gaussian divergence score reaches a local maximum.
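
A minimal sketch of this sliding-window detection, assuming diagonal-covariance Gaussians and features stored as plain Python lists. The function names, the `half_win` window size and the `min_score` floor used to ignore tiny maxima are illustrative assumptions, not S4D's API.

```python
import math
import random

def mean_std(frames):
    """Per-dimension mean and standard deviation of a list of feature vectors."""
    n, dim = len(frames), len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n) + 1e-6
            for d in range(dim)]
    return means, stds

def gaussian_divergence(left, right):
    """GD between two diagonal Gaussians: sum_d (mu1 - mu2)^2 / (sigma1 * sigma2)."""
    m1, s1 = mean_std(left)
    m2, s2 = mean_std(right)
    return sum((a - b) ** 2 / (sa * sb) for a, b, sa, sb in zip(m1, m2, s1, s2))

def detect_change_points(frames, half_win=50, min_score=5.0):
    """Return frame indices where the GD score is a local maximum.

    The left Gaussian is estimated on frames[t - half_win:t] and the right
    one on frames[t:t + half_win]; t is the middle of the sliding window.
    min_score is an assumed floor that discards maxima caused by noise.
    """
    last = len(frames) - half_win
    scores = [0.0] * len(frames)
    for t in range(half_win, last + 1):
        scores[t] = gaussian_divergence(frames[t - half_win:t],
                                        frames[t:t + half_win])
    changes = []
    for t in range(half_win, last + 1):
        lo, hi = max(0, t - half_win), min(len(frames), t + half_win + 1)
        if scores[t] >= min_score and scores[t] == max(scores[lo:hi]):
            changes.append(t)
    return changes

# Synthetic example: two "speakers" with different means, 200 frames each.
rng = random.Random(0)
signal = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
          + [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(200)])
print(detect_change_points(signal))
```

On this synthetic signal, the detected boundary is expected to fall close to frame 200, where the statistics of the two halves of the window differ the most.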
After the GD segmentation, a second pass over the signal fuses consecutive segments of the same speaker, from the start to the end of the recording. The measure employed for the fusion is the ∆BIC [11], based on the Bayesian Information Criterion. Alternatively, it is possible to use the BIC Square Root distance for the value of the penalty factor in the ∆BIC, as defined in [12].
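
For reference, the ∆BIC between two segments (or clusters) i and j is commonly written as below in the literature. The notation is an assumption made here for illustration and is not taken from the S4D code: n_i and n_j are the numbers of frames, Σ_i, Σ_j and Σ are the full covariance matrices of i, j and their union, d is the feature dimension and λ is a tunable penalty weight.

```latex
\Delta\mathrm{BIC}_{i,j}
  = \frac{n_i + n_j}{2}\,\log\lvert\Sigma\rvert
  - \frac{n_i}{2}\,\log\lvert\Sigma_i\rvert
  - \frac{n_j}{2}\,\log\lvert\Sigma_j\rvert
  - \lambda P,
\qquad
P = \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log(n_i + n_j)
```

With this sign convention, a negative value means that a single Gaussian explains both segments better than two separate ones, so two consecutive segments are fused when ∆BIC_{i,j} < 0.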

3.2. Clustering

In order to group clusters, S4D offers a number of methods.

3.2.1. HAC BIC

The algorithm is based upon a Hierarchical Agglomerative Clustering (HAC). Each cluster is modeled by a Gaussian. The ∆BIC measure [11] is employed to select the candidate clusters to be grouped as well as to stop the merging process. The two closest clusters i and j are merged at each iteration until ∆BIC_{i,j} > 0.
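
The merging loop can be sketched as follows. This is an illustration under stated assumptions, not S4D's implementation: Gaussians are diagonal (so the penalty counts 2d parameters instead of the full-covariance count), clusters are plain lists of feature vectors, and the names `delta_bic`, `hac_bic` and `lam` are hypothetical.

```python
import math
import random

def log_det_diag(frames):
    """Log-determinant of the diagonal covariance estimated on the frames."""
    n, dim = len(frames), len(frames[0])
    total = 0.0
    for d in range(dim):
        mean = sum(f[d] for f in frames) / n
        var = sum((f[d] - mean) ** 2 for f in frames) / n
        total += math.log(var + 1e-6)
    return total

def delta_bic(a, b, lam=1.0):
    """Delta-BIC between clusters a and b; negative values favour merging."""
    n_a, n_b, dim = len(a), len(b), len(a[0])
    n = n_a + n_b
    # Diagonal Gaussian: dim means + dim variances (assumption of this sketch).
    penalty = 0.5 * (2 * dim) * math.log(n)
    return (0.5 * n * log_det_diag(a + b)
            - 0.5 * n_a * log_det_diag(a)
            - 0.5 * n_b * log_det_diag(b)
            - lam * penalty)

def hac_bic(clusters, lam=1.0):
    """Merge the two closest clusters until the best Delta-BIC is positive."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        score, i, j = min((delta_bic(clusters[i], clusters[j], lam), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters)))
        if score > 0:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Four synthetic segments: speakers A and B, two segments each.
rng = random.Random(0)

def make_segment(mean, n=100):
    return [[rng.gauss(mean, 1), rng.gauss(mean, 1)] for _ in range(n)]

segments = [make_segment(0), make_segment(5), make_segment(0), make_segment(5)]
result = hac_bic(segments)
print([len(c) for c in result])
```

The penalty weight `lam` plays the role of a tuning knob: larger values make merging easier and therefore yield fewer clusters.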

3.2.2. HAC CLR

The HAC CLR merges a set of clusters by means of a HAC algorithm. The CLR (Cross Likelihood Ratio) score [13] is used as the dissimilarity measure as well as the stop criterion. Computing this score requires a Universal Background Model (UBM), which can also be used to adapt the cluster models with the MAP algorithm [14]. At each iteration, the two clusters with the lowest CLR score are selected and merged. The merging process stops when the score exceeds a threshold set a priori.

3.2.3. ILP IV

The Integer Linear Programming I-Vector (ILP IV) clustering [15] extracts an i-vector for each cluster and computes the distances among all of them (PLDA [16], cosine [17] or Mahalanobis [18]). ILP clustering was inspired by the k-medoids algorithm, which chooses k observations as class centers. For the ILP IV, this number k is determined automatically. We look for k centers that cover all the i-vectors such that each i-vector is assigned to exactly one center and lies at a distance of less than δ from its center. This problem is solved using the GNU Linear Programming Kit (GLPK) package, which is intended for solving large-scale Linear Programming (LP) problems.
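
To make the optimization problem concrete, the sketch below solves a tiny instance by exhaustive search instead of calling GLPK. The objective used here (number of centers plus the sum of assignment distances divided by δ) is an assumption based on the usual formulation of this clustering, and the function name `ilp_iv_bruteforce` is hypothetical. Exhaustive search is exponential in the number of i-vectors and only serves as an illustration.

```python
from itertools import combinations

def ilp_iv_bruteforce(dist, delta):
    """Return {i-vector index: center index} minimising the assumed objective.

    Constraints: every i-vector is assigned to exactly one center, a center
    is assigned to itself, and every assignment distance is below delta.
    """
    n = len(dist)
    best_cost, best_assign = float("inf"), None
    for k in range(1, n + 1):
        for centers in combinations(range(n), k):
            assign, cost, feasible = {}, float(k), True
            for j in range(n):
                if j in centers:
                    assign[j] = j
                    continue
                # For a fixed set of centers, the nearest one is optimal.
                c = min(centers, key=lambda c: dist[c][j])
                if dist[c][j] >= delta:
                    feasible = False
                    break
                assign[j] = c
                cost += dist[c][j] / delta
            if feasible and cost < best_cost:
                best_cost, best_assign = cost, assign
    return best_assign

# Toy distance matrix: i-vectors 0, 1, 2 are close, as are 3 and 4.
dist = [
    [0.00, 0.10, 0.20, 0.90, 0.90],
    [0.10, 0.00, 0.15, 0.90, 0.90],
    [0.20, 0.15, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.90, 0.00, 0.10],
    [0.90, 0.90, 0.90, 0.10, 0.00],
]
print(ilp_iv_bruteforce(dist, delta=0.5))
```

Because every assignment costs less than 1 while every additional center costs exactly 1, the search uses as few centers as the δ constraint allows, which is how the number of clusters is determined automatically.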
To save execution time, a search for connected components (CC) can be performed first [19]. The distances below δ define a graph with clusters as nodes and distances as edges, from which the connected components are extracted. The ILP IV clustering is then applied to each connected component that does not have the form of a star graph. A star graph consists of one or several nodes that are connected only to the same central node.
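
The pre-processing described above can be sketched as follows, assuming a symmetric distance matrix stored as nested lists. The helper names (`build_matrix`, `connected_components`, `is_star`) are hypothetical, and treating components of one or two nodes as stars is an assumption of this sketch.

```python
def build_matrix(n, close_pairs, far=1.0):
    """Symmetric n x n distance matrix; unlisted pairs get the distance far."""
    dist = [[0.0 if i == j else far for j in range(n)] for i in range(n)]
    for (i, j), d in close_pairs.items():
        dist[i][j] = dist[j][i] = d
    return dist

def connected_components(dist, delta):
    """Components of the graph whose edges are the distances below delta."""
    n = len(dist)
    seen, components = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            comp.append(node)
            for other in range(n):
                if other not in seen and dist[node][other] < delta:
                    seen.add(other)
                    stack.append(other)
        components.append(sorted(comp))
    return components

def is_star(component, dist, delta):
    """True if one hub is linked to all other nodes and no other edge exists."""
    if len(component) <= 2:
        return True
    degree = {u: sum(1 for v in component if v != u and dist[u][v] < delta)
              for u in component}
    hubs = [u for u in component if degree[u] == len(component) - 1]
    leaves = [u for u in component if degree[u] == 1]
    return len(hubs) == 1 and len(leaves) == len(component) - 1

# Nodes 0, 1, 2 form a triangle; node 3 is the hub of a star with leaves 4, 5.
dist = build_matrix(6, {(0, 1): 0.1, (0, 2): 0.2, (1, 2): 0.3,
                        (3, 4): 0.2, (3, 5): 0.3})
for comp in connected_components(dist, delta=0.5):
    print(comp, "star" if is_star(comp, dist, 0.5) else "needs ILP")
```

In a star, the hub is the obvious center and every leaf is already within δ of it, so only the remaining components need to be passed to the solver.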

3.2.4. HAC IV

This clustering process is based upon a HAC algorithm. Each cluster is modeled by an i-vector, and the distances among all of them are computed with the PLDA, cosine or Mahalanobis score. This distance is the measure employed to select the clusters to be grouped as well as to stop the clustering process.
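
A minimal sketch of such a clustering with the cosine distance is given below. The average linkage between merged clusters, the threshold value and the function names are assumptions made for illustration; the text above does not specify how S4D represents a merged cluster.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def hac_iv(ivectors, threshold):
    """Agglomerative clustering of i-vectors; returns lists of indices.

    The two closest clusters are merged at each iteration, and the process
    stops as soon as the smallest distance exceeds the threshold.
    """
    clusters = [[i] for i in range(len(ivectors))]

    def linkage(a, b):
        # Average pairwise distance between members (assumed linkage).
        total = sum(cosine_distance(ivectors[i], ivectors[j])
                    for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        score, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                          for i in range(len(clusters))
                          for j in range(i + 1, len(clusters)))
        if score > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Toy two-dimensional "i-vectors" pointing in two distinct directions.
ivectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
print(hac_iv(ivectors, threshold=0.3))
```

Swapping `cosine_distance` for another distance function (for example a Mahalanobis distance) leaves the merging loop unchanged, since the loop only relies on "smaller means closer".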

  1. Discussion
    We have introduced S4D, a new open-source toolkit for the diarization task. It is a comprehensive toolkit offering an end-to-end tool-chain with various ready-to-use state-of-the-art algorithms. S4D makes it easy to develop systems for broadcast news but also for other tasks (meetings, telephone conversations). It is well suited to building offline diarization systems but is not yet adapted to online diarization or stream processing. The resulting diarization system is nonetheless time efficient, as it processes the total 40 hours of our test corpus in 70 minutes (see Table 2), which corresponds to less than 3% of the total audio duration. This toolkit is maintained for an indefinite period. It will implement new methods and metrics according to speaker diarization advances. In the near future, Artificial Neural Network (ANN) [32] and Binary Key (BK) [33] methods for segmentation and clustering will be implemented.


