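Python implementation of time-frequency statistical features for common sounds

Abstract

A conventional sound recognition pipeline has two key stages: feature construction and classification model selection. This post covers the traditional time-domain and frequency-domain statistical features of audio signals and their Python implementations.

Feature descriptions and Python code

Environment: Python 3.5/3.6, librosa 0.7
Most of the programs below were verified by re-implementing the formulas in code. If you find a problem, please point it out. Thanks!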

1. Spectral centroid and spectral spread

In digital signal processing, the spectral centroid (SC) and the spectral spread (SS) are measures for characterising the distribution of the frequency components of a signal. The spectral centroid is defined as the "center of mass" of the spectrum and is computed as follows:
$$SC = \frac{\sum^{L_F}_{i=1} i\frac{F_s}{L_F}\lvert X(i)\rvert}{\sum^{L_F}_{i=1}\lvert X(i)\rvert},$$
while the spectral spread is computed as the dispersion of the frequency components of the signal around the centroid:
$$SS = \sqrt{\frac{\sum^{L_F}_{i=1}\left[ i\frac{F_s}{L_F}-SC\right]^2 \lvert X(i)\rvert}{\sum^{L_F}_{i=1}\lvert X(i)\rvert}}$$
where $L_F$ and $\lvert X(i) \rvert$ are the length and the magnitude of the FFT of the input signal $x(n)$, respectively, and $F_s$ is the sampling frequency.

# The code below has been verified by the author
import librosa
import numpy as np

path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
frame = librosa.util.frame(y, frame_length=1024, hop_length=512)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
sc = librosa.feature.spectral_centroid(y=y, sr=sr, n_fft=1024, hop_length=512, center=False)
# Center frequency (Hz) of each FFT bin, i.e. i * Fs / L_F in the formulas
fre = librosa.fft_frequencies(sr=sr, n_fft=1024)
# spectral spread
ss = np.zeros((1, frame.shape[1]))
for i in range(frame.shape[1]):
    ss[:, i] = np.sqrt(np.sum((fre - sc[:, i])**2 * S[:, i]) / np.sum(S[:, i]))
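
As a cross-check that does not depend on librosa, the two formulas above can be evaluated directly with NumPy. This is a minimal sketch rather than the author's code: the helper name `spectral_centroid_spread` and the synthetic two-tone test frame are assumptions made for illustration. The frame is left unwindowed, whereas `librosa.stft` applies a Hann window by default, so values on real audio will differ slightly.

```python
import numpy as np

def spectral_centroid_spread(frame, sr):
    """Return (SC, SS) in Hz for one frame (hypothetical helper)."""
    mag = np.abs(np.fft.rfft(frame))                  # |X(i)| for non-negative frequencies
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)   # i * F_s / L_F
    total = np.sum(mag)
    sc = np.sum(freqs * mag) / total
    ss = np.sqrt(np.sum((freqs - sc) ** 2 * mag) / total)
    return sc, ss

sr, n = 44100, 1024
t = np.arange(n) / sr
bin_hz = sr / n
# Two equal-amplitude tones placed exactly on FFT bins 64 and 192 (no leakage)
frame = np.sin(2 * np.pi * 64 * bin_hz * t) + np.sin(2 * np.pi * 192 * bin_hz * t)
sc, ss = spectral_centroid_spread(frame, sr)
print(sc, ss)
```

For this test frame the centroid should fall midway between the two tones (bin 128), and the spread should equal the 64-bin distance from the centroid to either tone.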

2. Spectral rolloff

The spectral rolloff is a measure of the skewness of the spectrum and is defined as the frequency $f_{ro}$ below which $P\%$ of the spectral components of the signal lie. In our case, we consider $P = 90$ and determine the value $f_{ro}$ from the following relation:
$$\sum^{f_{ro}}_{i=1}\lvert X(i) \rvert = \frac{P}{100}\sum^{F_{max}}_{i=1}\lvert X(i) \rvert$$

# Feature extracted with librosa; this program has not been verified
import librosa
path = 'scream.wav'
y,sr = librosa.load(path,sr=44100)
rolloff = librosa.feature.spectral_rolloff(y=y,sr=sr,n_fft=1024,hop_length=512,center=False,roll_percent=0.9)
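
Since the librosa call above is marked as unverified, the rolloff relation can also be sketched from the formula itself. The snippet below is an illustrative assumption, not the author's method: the helper `spectral_rolloff` is a hypothetical name, the frame is unwindowed, and the rolloff is taken as the first bin at which the cumulative magnitude reaches the requested fraction of the total.

```python
import numpy as np

def spectral_rolloff(frame, sr, roll_percent=0.9):
    """Frequency (Hz) at which the cumulative magnitude reaches roll_percent of the total."""
    mag = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    cumulative = np.cumsum(mag)
    threshold = roll_percent * cumulative[-1]
    # First bin index whose cumulative magnitude is >= threshold
    idx = np.searchsorted(cumulative, threshold)
    return freqs[idx]

sr, n = 44100, 1024
t = np.arange(n) / sr
bin_hz = sr / n
# Two equal-amplitude tones on FFT bins 64 and 192: half of the magnitude sits at each
frame = np.sin(2 * np.pi * 64 * bin_hz * t) + np.sin(2 * np.pi * 192 * bin_hz * t)
f_ro_90 = spectral_rolloff(frame, sr, roll_percent=0.9)
f_ro_40 = spectral_rolloff(frame, sr, roll_percent=0.4)
print(f_ro_40, f_ro_90)
```

Because each tone carries half of the total magnitude, a 40% threshold is reached at the lower tone and a 90% threshold only at the upper one.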

3. Spectral flux

The spectral flux (SF) indicates how quickly the spectral information of a signal is changing. It is computed from the squared difference between the spectra of two consecutive audio frames, as reported in the following equation:
$$SF = \sum^{L_F}_{i=1}\left[ X_n(i) - X_{n-1}(i) \right]^2$$

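import librosa
import numpy as np
path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
# Magnitude spectrogram, one column per frame
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
# One flux value per pair of consecutive frames
sf = np.zeros((1, S.shape[1] - 1))
for i in range(S.shape[1] - 1):
    sf[:, i] = np.sum((S[:, i+1] - S[:, i])**2)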
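
The frame-to-frame loop can also be written with `np.diff`. The sketch below is an assumption-laden illustration: `magnitude_spectrogram` and `spectral_flux` are hypothetical helpers, the spectrogram is built with an unwindowed NumPy FFT instead of `librosa.stft`, and the test signal (silence followed by a steady tone) is synthetic.

```python
import numpy as np

def magnitude_spectrogram(y, n_fft=1024, hop_length=512):
    """Unwindowed magnitude spectrogram, shape (n_fft // 2 + 1, n_frames)."""
    n_frames = 1 + (len(y) - n_fft) // hop_length
    frames = np.stack(
        [y[k * hop_length : k * hop_length + n_fft] for k in range(n_frames)], axis=1
    )
    return np.abs(np.fft.rfft(frames, axis=0))

def spectral_flux(S):
    """Sum over bins of the squared difference between consecutive frames."""
    return np.sum(np.diff(S, axis=1) ** 2, axis=0)

sr, n_fft = 44100, 1024
t = np.arange(2048) / sr
tone = np.sin(2 * np.pi * 64 * (sr / n_fft) * t)   # tone on FFT bin 64
y = np.concatenate([np.zeros(2048), tone])         # silence, then the tone
S = magnitude_spectrogram(y, n_fft=n_fft, hop_length=512)
sf = spectral_flux(S)
print(sf)
```

The flux should be zero while the frames contain only silence, rise sharply in the frames that straddle the onset of the tone, and drop back to nearly zero once the tone is steady.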

4. Energy ratios in sub-bands

The energy ratios in sub-bands (ERSB) give a rough approximation of the energy distribution of the spectrum. We divided the spectrum of the signal into four sub-bands, listed below, and for each sub-band we computed the ratio between the energy contained in that sub-band and the overall energy of the audio frame:
$$ERSB_n = \frac{\sum^{k_{n2}}_{i=k_{n1}} \lvert X(i)\rvert ^2}{\sum^{F_{max}}_{i=1} \lvert X(i)\rvert ^2},$$
where $k_{n1}$ and $k_{n2}$ are the first and last FFT bins of the $n$-th sub-band.
[Image missing in the source: definition of the sub-band boundaries]

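import librosa
import numpy as np
path = 'scream.wav'
y, sr = librosa.load(path, sr=44100)
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512, center=False))
P = S**2  # energy per bin, |X(i)|^2 as in the ERSB formula
ersb = np.zeros((4, S.shape[1]))
sub_band = [14, 39, 102, 512]  # upper bin index (exclusive) of each sub-band
for i in range(S.shape[1]):
    total = np.sum(P[:, i])
    ersb[0, i] = np.sum(P[:sub_band[0], i]) / total
    ersb[1, i] = np.sum(P[sub_band[0]:sub_band[1], i]) / total
    ersb[2, i] = np.sum(P[sub_band[1]:sub_band[2], i]) / total
    ersb[3, i] = np.sum(P[sub_band[2]:sub_band[3], i]) / total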
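
To see what the bin edges `[14, 39, 102, 512]` used in the code above correspond to in Hz, and to check the ratio computation on a case with a known answer, here is a small sketch. The helper `energy_ratios` and the flat test spectrum are assumptions for illustration; the bin-to-frequency conversion assumes the same `sr=44100` and `n_fft=1024` as the rest of the post.

```python
import numpy as np

def energy_ratios(S, edges):
    """ERSB per frame; `edges` are the upper (exclusive) bin indices of the sub-bands."""
    power = S ** 2                       # |X(i)|^2
    total = power.sum(axis=0)            # overall energy of each frame
    starts = [0] + list(edges[:-1])
    return np.stack([power[a:b].sum(axis=0) / total for a, b in zip(starts, edges)])

sr, n_fft = 44100, 1024
edges = [14, 39, 102, 512]
# Each FFT bin is sr / n_fft Hz wide, so bin k sits at k * sr / n_fft Hz
edges_hz = [k * sr / n_fft for k in edges]
print(edges_hz)

# Flat spectrum: every bin has the same energy, so each ratio is a fraction of bins
S = np.ones((n_fft // 2 + 1, 3))
ersb = energy_ratios(S, edges)
print(ersb[:, 0])
```

With 513 bins per frame, the last edge of 512 leaves out the Nyquist bin, so the four ratios sum to slightly less than 1.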

5. Volume and energy

We calculate the volume feature (V) as the root mean square (RMS) of the amplitude values of the samples in an audio frame:
$$V = \sqrt {\frac{1}{L}\sum^{L}_{i=1}x(i)^2}$$
while the energy (E) is the sum of the squared amplitude values of the audio samples:
$$E = \sum^{L}_{i=1}x(i)^2$$

import librosa
import numpy as np
path = 'scream.wav'
y,sr = librosa.load(path,sr=44100)
frame = librosa.util.frame(y,frame_length=1024,hop_length=512)
# Volume and energy
V = librosa.feature.rms(y=y, frame_length=1024, hop_length=512, center=False)
Energy = np.sum(frame**2,axis=0)
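
The two definitions are linked by $V = \sqrt{E/L}$, which gives a quick consistency check. The following is a minimal NumPy-only sketch under stated assumptions: `frame_signal` is a hypothetical stand-in for `librosa.util.frame`, and the input is a synthetic sine of amplitude 0.5 whose period divides the frame length exactly.

```python
import numpy as np

def frame_signal(y, frame_length=1024, hop_length=512):
    """Slice y into overlapping frames, shape (frame_length, n_frames)."""
    n_frames = 1 + (len(y) - frame_length) // hop_length
    return np.stack(
        [y[k * hop_length : k * hop_length + frame_length] for k in range(n_frames)],
        axis=1,
    )

sr, L = 44100, 1024
t = np.arange(4096) / sr
amplitude = 0.5
y = amplitude * np.sin(2 * np.pi * 64 * (sr / L) * t)  # 64 full periods per frame

frames = frame_signal(y, frame_length=L, hop_length=512)
energy = np.sum(frames ** 2, axis=0)      # E: sum of squared samples per frame
volume = np.sqrt(energy / L)              # V = sqrt(E / L), the RMS per frame
print(energy, volume)
```

For a sine covering whole periods, the RMS is the amplitude divided by $\sqrt{2}$, so every frame here should report a volume of about 0.354.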

6. Zero crossing rate

The zero crossing rate (ZCR) is the rate of sign changes along a frame and is especially used to characterise percussive sounds and environmental noise. For a frame $x(i)$ of $L$ samples, the ZCR is computed as follows:
$$ZCR = \frac{1}{2L}\sum^L_{i=1}\lvert sgn(x(i+1)) - sgn(x(i))\rvert$$

import librosa
path = 'scream.wav'
y,sr = librosa.load(path,sr=44100)
zcr = librosa.feature.zero_crossing_rate(y=y, frame_length=1024, hop_length=512, center=False)
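
The ZCR formula can also be evaluated directly on a single frame. This sketch is an illustration under assumptions: `zero_crossing_rate` here is a hypothetical NumPy helper, not librosa's function, and it sums only the L - 1 sign differences available inside one frame, since $x(L+1)$ lies outside it. librosa's result may differ slightly because it treats zeros and frame edges in its own way.

```python
import numpy as np

def zero_crossing_rate(frame):
    """ZCR of one frame: sum of |sgn(x(i+1)) - sgn(x(i))| divided by 2L."""
    signs = np.sign(frame)
    return np.sum(np.abs(np.diff(signs))) / (2.0 * len(frame))

# Alternating signal: the sign flips between every pair of neighbouring samples
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
zcr_alt = zero_crossing_rate(alternating)

# Constant signal: no sign changes at all
zcr_const = zero_crossing_rate(np.ones(8))

# Sine with 64 periods per 1024-sample frame; the phase offset keeps samples off zero
n = np.arange(1024)
sine = np.sin(2 * np.pi * 64 * n / 1024 + 0.1)
zcr_sine = zero_crossing_rate(sine)
print(zcr_alt, zcr_const, zcr_sine)
```

Each sign flip contributes 2 to the sum, which is why the formula divides by 2L: the result is the number of crossings per sample.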

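Summary

I am recording these feature extraction methods here so that I do not forget them. More features will be added over time; if there are other features you would like to see, please mention them in the comments so we can learn together.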
