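- References: pyAudioAnalysis, openSMILE, and the textbook "语音信号处理实验教程" (Speech Signal Processing Experiment Tutorial, with MATLAB source code)
- Introduction to Audio Analysis: A MATLAB Approach
- Complete test file
- Note: the code below is not part of genFeatures.py; it can be found in pyAudioAnalysis.audioFeatureExtraction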
1. Zero crossing rate
The zero crossing rate (ZCR) is the number of times the signal crosses zero within a frame; it reflects the frequency content of the signal.
$$Z_n = \frac{1}{2}\sum_{m=0}^{N-1}\left|\operatorname{sgn}[x_n(m)]-\operatorname{sgn}[x_n(m-1)]\right|$$
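import numpy as np

def stZCR(frame):
    # Compute the zero crossing rate of a frame
    count = len(frame)
    # A sign change from -1 to +1 (or back) contributes 2, so halve the sum
    count_z = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    # Normalize by the number of sample intervals to obtain a rate
    return (np.float64(count_z) / np.float64(count - 1))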
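As a quick sanity check of the definition, a pure tone of frequency f sampled at rate fs should cross zero roughly 2f/fs times per sample. The sketch below is only illustrative: the helper name `zcr`, the sampling rate, and the tone frequency are assumptions chosen for the example, not part of pyAudioAnalysis.

```python
import numpy as np

def zcr(frame):
    # Same computation as stZCR: sign changes per sample interval.
    frame = np.asarray(frame, dtype=np.float64)
    crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    return crossings / (len(frame) - 1)

fs = 16000                 # assumed sampling rate in Hz
f0 = 440.0                 # assumed tone frequency in Hz
t = np.arange(fs) / fs     # one second of samples
# A small phase offset keeps samples from landing exactly on zero.
tone = np.sin(2 * np.pi * f0 * t + 0.1)

rate = zcr(tone)
expected = 2 * f0 / fs     # two crossings per period
print(rate, expected)
```

The two printed values should agree closely, which is why the ZCR is often read as a rough indicator of the dominant frequency of a frame.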
2. Energy
Short-time energy is the sum of the squared samples in each frame; it reflects the strength of the signal.
$$E_n = \sum_{m=0}^{N-1}x_n^2(m)$$
import numpy as np

def stEnergy(frame):
    # Sum of squared samples, normalized by the frame length
    return (np.sum(frame ** 2) / np.float64(len(frame)))
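Since the feature is defined per frame, it may help to see one way of framing a signal and computing the energy of every frame at once. This is a minimal sketch under stated assumptions: `frame_signal` and `short_time_energy` are hypothetical helpers, and the window and hop sizes are arbitrary example values.

```python
import numpy as np

def frame_signal(x, win, step):
    # Split a 1-D signal into overlapping frames (hypothetical helper).
    x = np.asarray(x, dtype=np.float64)
    n_frames = 1 + (len(x) - win) // step
    return np.stack([x[i * step: i * step + win] for i in range(n_frames)])

def short_time_energy(frames):
    # Mean of the squared samples in each frame, as in stEnergy.
    return np.mean(frames ** 2, axis=1)

fs = 8000                                   # assumed sampling rate in Hz
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 100 * t)       # 100 Hz tone with amplitude 0.5

frames = frame_signal(x, win=400, step=200)  # 50 ms window, 25 ms hop
energies = short_time_energy(frames)
print(frames.shape, energies[:3])
```

For a sinusoid of amplitude A, the normalized energy of a frame covering whole periods is A²/2, so every frame here should come out near 0.125.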
2.1 Amplitude shimmer in decibels
Shimmer (dB)
$$Shimmer(dB) = \frac{1}{N-1}\sum_{i=1}^{N-1}\left|20\log_{10}(A_{i+1}/A_i)\right|$$
import numpy as np

eps = 1e-8  # small constant to avoid division by zero and log of zero (assumed value)

def stShimmerDB(frame):
    '''
    Amplitude shimmer, expressed as the variability of the
    peak-to-peak amplitude in decibels [3]
    '''
    count = len(frame)
    sigma = 0
    for i in range(count - 1):
        # Ratio of consecutive absolute amplitudes, guarded by eps
        ratio = (np.abs(frame[i + 1]) + eps) / (np.abs(frame[i]) + eps)
        sigma += np.abs(20 * np.log10(ratio))
    return np.float64(sigma) / np.float64(count - 1)
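The formula is written in terms of the amplitudes A_i of consecutive periods. Assuming those amplitudes have already been extracted and are strictly positive, a vectorized version might look like the sketch below; the function name `shimmer_db` and the example values are illustrative assumptions.

```python
import numpy as np

def shimmer_db(amplitudes):
    # Mean absolute dB ratio between consecutive amplitudes.
    # Assumes every amplitude is non-zero.
    a = np.abs(np.asarray(amplitudes, dtype=np.float64))
    ratios = a[1:] / a[:-1]
    return np.mean(np.abs(20 * np.log10(ratios)))

steady = shimmer_db([1.0, 1.0, 1.0, 1.0])    # no amplitude variation
jumpy = shimmer_db([1.0, 10.0, 1.0])         # each step is a factor of 10
print(steady, jumpy)
```

A factor-of-ten change between neighbours corresponds to 20 dB, so the second sequence averages 20 dB while the constant sequence gives 0 dB.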
2.2 Amplitude shimmer as a percentage
$$Shimmer(relative) = \frac{\frac{1}{N-1}\sum_{i=1}^{N-1}|A_i-A_{i+1}|}{\frac{1}{N}\sum_{i=1}^{N}A_i}$$
import numpy as np

eps = 1e-8  # small constant to avoid division by zero (assumed value)

def stShimmerRelative(frame):
    '''
    Relative shimmer: the average absolute difference between the amplitudes
    of consecutive periods, divided by the average amplitude [3].
    Returned as a ratio; multiply by 100 for a percentage.
    '''
    count = len(frame)
    sigma_diff = 0
    sigma_sum = 0
    for i in range(count):
        if i < count - 1:
            sigma_diff += np.abs(np.abs(frame[i]) - np.abs(frame[i + 1]))
        sigma_sum += np.abs(frame[i])
    return np.float64(sigma_diff / (count - 1)) / np.float64(sigma_sum / count + eps)
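A small worked example can make the ratio concrete. The sketch below assumes the per-period amplitudes are already available; `shimmer_relative` is a hypothetical name and the input values are made up for illustration.

```python
import numpy as np

def shimmer_relative(amplitudes):
    # Mean absolute difference of consecutive amplitudes,
    # divided by the mean amplitude (a ratio, not a percentage).
    a = np.abs(np.asarray(amplitudes, dtype=np.float64))
    mean_diff = np.mean(np.abs(np.diff(a)))
    return mean_diff / np.mean(a)

amps = [1.0, 2.0, 1.0, 2.0]
ratio = shimmer_relative(amps)
percent = 100 * ratio
print(ratio, percent)
```

Here every consecutive difference is 1 and the mean amplitude is 1.5, so the ratio is 1 / 1.5, or about 66.7 when expressed as a percentage.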
3. Intensity / loudness
- intensity: mean of squared input values multiplied by a Hamming window
Intensity and loudness are corresponding concepts; see the openSMILE source.
$$intensity = \frac{\sum_{m=0}^{N-1}hamWin[m]\,x_n^2(m)}{\sum_{m=0}^{N-1}hamWin[m]}$$
$$loudness = \left(\frac{intensity}{I_0}\right)^{0.3}, \quad I_0 = 1\times10^{-12}$$
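###################
##
## from openSMILE
##
#####################
import numpy as np

def stIntensity(frame):
    '''
    Hamming-weighted mean of the squared samples (intensity) and the
    loudness derived from it. Unlike stEnergy, each squared sample is
    weighted by a Hamming window before averaging.
    '''
    fn = len(frame)
    hamWin = np.hamming(fn)
    winSum = np.sum(hamWin)
    if winSum <= 0.0:
        winSum = 1.0
    I0 = 0.000001  # reference intensity as used in this listing
    Im = 0
    for i in range(fn):
        # Accumulate the Hamming-weighted squared samples
        Im += hamWin[i] * frame[i] ** 2
    intensity = Im / winSum
    loudness = (intensity / I0) ** .3
    return intensity, loudness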
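To check the two formulas numerically, the sketch below evaluates them on a constant frame, where the Hamming weights cancel and the intensity reduces to the squared sample value. It is an illustration under assumptions: `intensity_loudness` is a hypothetical name, and the reference intensity is passed as a parameter defaulting to the 1e-12 given in the formula (the listing above hard-codes a different constant).

```python
import numpy as np

def intensity_loudness(frame, i0=1e-12):
    # Hamming-weighted mean of squared samples, then a power-law loudness.
    frame = np.asarray(frame, dtype=np.float64)
    win = np.hamming(len(frame))
    intensity = np.sum(win * frame ** 2) / np.sum(win)
    loudness = (intensity / i0) ** 0.3
    return intensity, loudness

frame = np.full(256, 0.1)          # constant frame with value 0.1
intensity, loudness = intensity_loudness(frame)
print(intensity, loudness)
```

With a constant value of 0.1 the intensity is 0.01, and (0.01 / 1e-12) ** 0.3 equals 1e3, so the loudness should be about 1000.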
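4. Fundamental frequency (pitch)
Methods for estimating the fundamental frequency include the cepstrum method, the short-time autocorrelation method, and the linear prediction method. This article uses the short-time autocorrelation method.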
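To give a feel for the short-time autocorrelation idea, here is a minimal sketch that picks the lag with the largest autocorrelation inside a plausible pitch range. It is not the implementation used in this article: the function name `pitch_autocorr`, the search range of 60 to 500 Hz, and the test tone are all assumptions, and no voicing decision or endpoint detection is included.

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=60.0, fmax=500.0):
    # Estimate the pitch of one frame from the peak of its autocorrelation.
    frame = np.asarray(frame, dtype=np.float64)
    frame = frame - np.mean(frame)
    # Full autocorrelation; keep only the non-negative lags.
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / fmax)                      # shortest period searched
    lag_max = min(int(fs / fmin), len(r) - 1)     # longest period searched
    lag = lag_min + int(np.argmax(r[lag_min:lag_max + 1]))
    return fs / lag

fs = 8000                                   # assumed sampling rate in Hz
t = np.arange(1000) / fs
frame = np.sin(2 * np.pi * 200 * t)         # synthetic 200 Hz tone
f0 = pitch_autocorr(frame, fs)
print(f0)
```

For this clean synthetic tone the estimate should land close to 200 Hz; real speech usually needs preprocessing such as the endpoint detection described next.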
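1) Pitch detection preprocessing: speech endpoint detection
Because the beginning and end of an utterance are not periodic, endpoint detection is applied before pitch detection to improve its accuracy. The spectral entropy method is used for endpoint detection.
Let the time-domain speech waveform be $x(i)$. After windowing and framing, the $n$-th frame is $x_n(m)$ and its FFT is $X_n(k)$, where $k$ denotes the $k$-th spectral line. The short-time energy $E_n$ of the frame in the frequency domain is
$$E_n = \sum_{k=0}^{N/2}X_n(k)X_n^*(k)$$
where $N$ is the FFT length and only the positive-frequency part is used.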
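The frequency-domain energy can be sketched with NumPy's real FFT, which returns exactly the lines k = 0..N/2. The helper name `spectral_energy` and the test frame are assumptions made for this example; the frame is taken as already windowed.

```python
import numpy as np

def spectral_energy(frame):
    # X_n(k) for k = 0..N/2 (positive frequencies only).
    X = np.fft.rfft(frame)
    # Per-line product X_n(k) * conj(X_n(k)), which is real and non-negative.
    power = (X * np.conj(X)).real
    return power, np.sum(power)

N = 64
n = np.arange(N)
frame = np.cos(2 * np.pi * 8 * n / N)     # exactly 8 cycles in the frame
power, E = spectral_energy(frame)
print(np.argmax(power), E)
```

For this frame all the energy sits in line k = 8, where the FFT magnitude is N/2 = 32, so the total comes out near 32² = 1024.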
The energy spectrum of spectral line $k$ is $Y_n(k) = X_n(k)X_n^*(k)$, so the normalized spectral probability density of each frequency component is defined as $p_n(k)=\frac{Y_n(k)}{E_n}=\frac{Y_n(k)}{\sum_{l=0}^{N/2}Y_n(l)}$