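I have three observations to share.
First, after a bit of experimenting, I've concluded that the onset detection algorithm appears to be designed to rescale its own operation automatically, so as to account for the local background noise level at any given instant. That is probably so it can detect onsets in pianissimo passages as reliably as in fortissimo ones. The unfortunate consequence is that the algorithm tends to trigger on the background noise from your cheap microphone: it honestly thinks it is just listening to pianissimo music.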
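To illustrate why automatic rescaling has this side effect, here is a minimal toy sketch. It is not librosa's actual algorithm: `toy_onsets` and its `delta` and `wait` parameters are hypothetical names of my own. Any detector that normalizes its input before peak picking throws away the absolute level, so pure noise gives the same detections whether it is loud or a thousand times quieter.

```python
import numpy as np

def toy_onsets(envelope, delta=0.07, wait=3):
    """Toy peak picker (illustrative only, not librosa's implementation)."""
    env = np.asarray(envelope, dtype=float)
    env = env / env.max()          # automatic rescaling: absolute level is lost
    thresh = env.mean() + delta    # threshold relative to the rescaled envelope
    peaks = []
    for i in range(1, len(env) - 1):
        is_local_max = env[i] > env[i - 1] and env[i] >= env[i + 1]
        if is_local_max and env[i] >= thresh:
            # enforce a minimum spacing of `wait` frames between detections
            if not peaks or i - peaks[-1] > wait:
                peaks.append(i)
    return peaks

rng = np.random.default_rng(0)
noise_only = np.abs(rng.normal(0.0, 1.0, 200))   # stand-in envelope of pure noise

loud = toy_onsets(noise_only)
quiet = toy_onsets(1e-3 * noise_only)            # same noise, 1000x quieter
print(loud == quiet, len(loud))
```

The quiet noise produces exactly as many "onsets" as the loud noise, which is why an absolute energy threshold is needed on top of the detector.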
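My second observation is that roughly the first ~2200 samples of your recording (about the first 0.1 seconds) are a bit odd, in that the noise really is nearly zero during that short initial interval. Try zooming way in on the waveform at the start and you'll see what I mean. Unfortunately, the guitar playing begins so soon after the noise begins (at roughly sample 3000) that the algorithm cannot resolve the two independently. Instead it merges them into a single onset event whose start time is about 0.1 seconds too early. I therefore cut out roughly the first 2240 samples in order to "normalize" the file. (I don't think this is cheating; it is an edge effect that would probably disappear if you had simply recorded a second or so of initial silence before plucking the first string, as one normally would.)
My third observation is that frequency-based filtering only works if the noise and the music actually occupy somewhat different frequency bands. That may be true in this case, but I don't think you have demonstrated it yet. So, instead of frequency-based filtering, I chose to try a different approach: thresholding. I used the final 3 seconds of your recording, where there is no guitar playing, to estimate the typical background noise level throughout the recording in units of RMS energy. I then used the median of that noise level to set a minimum energy threshold, calculated to lie safely above the median. Only onset events returned by the detector at times when the RMS energy is above the threshold are accepted as "valid".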
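If you do want to test the frequency-band assumption, one rough way is to compare how much of the energy falls inside a candidate band for a noise-only segment versus a segment with playing. The sketch below uses synthetic stand-ins (white noise and a 196 Hz sine) rather than your recording, and `band_energy_fraction` is a hypothetical helper of mine, not a librosa function.

```python
import numpy as np

def band_energy_fraction(x, sr, f_lo, f_hi):
    """Fraction of the total spectral energy of x lying in [f_lo, f_hi] Hz."""
    spec = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / sr)
    band = (freqs >= f_lo) & (freqs <= f_hi)
    return spec[band].sum() / spec.sum()

sr = 22050
t = np.arange(sr) / sr                     # one second of samples
rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): white noise for the noise-only tail,
# and a 196 Hz tone plus the same kind of noise for a plucked note.
noise = 0.01 * rng.normal(size=sr)
music = np.sin(2 * np.pi * 196.0 * t) + 0.01 * rng.normal(size=sr)

music_frac = band_energy_fraction(music, sr, 150.0, 250.0)
noise_frac = band_energy_fraction(noise, sr, 150.0, 250.0)
print(music_frac, noise_frac)
```

With real data you would pass in slices of `y` (for example the final 3 seconds for the noise). If the noise fraction in the music's band turns out to be comparable to the music fraction, a band-pass filter will not separate the two.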
An example script is shown below:
import librosa
import numpy as np
import matplotlib.pyplot as plt
# I played around with this but ultimately kept the default value
hoplen=512
y, sr = librosa.core.load("./Vocaroo_s07Dx8dWGAR0.mp3")
# Note that the first ~2240 samples (0.1 seconds) are anomalously low noise,
# so cut out this section from processing
start = 2240
y = y[start:]
idx = np.arange(len(y))
# Calculate the onset frames in the usual way
onset_frames = librosa.onset.onset_detect(y=y, sr=sr, hop_length=hoplen)
onstm = librosa.frames_to_time(onset_frames, sr=sr, hop_length=hoplen)
# Calculate RMS energy per frame. I shortened the frame length from the
# default value in order to avoid ending up with too much smoothing.
# (librosa.feature.rmse was renamed to librosa.feature.rms in librosa 0.7.)
rmse = librosa.feature.rms(y=y, frame_length=512, hop_length=hoplen)[0]
envtm = librosa.frames_to_time(np.arange(len(rmse)), sr=sr, hop_length=hoplen)
# Use final 3 seconds of recording in order to estimate median noise level
# and typical variation
noiseidx = envtm > envtm[-1] - 3.0
noisemedian = np.percentile(rmse[noiseidx], 50)
sigma = np.percentile(rmse[noiseidx], 84.1) - noisemedian
# Set the minimum RMS energy threshold that is needed in order to declare
# an "onset" event to be equal to 5 sigma above the median
threshold = noisemedian + 5*sigma
threshidx = rmse > threshold
# Choose the corrected onset times as only those which meet the RMS energy
# minimum threshold requirement (compare integer frame indices rather than
# testing float times for membership)
correctedonstm = onstm[np.isin(onset_frames, np.flatnonzero(threshidx))]
# Print both in units of actual time (seconds) and sample ID number
print(correctedonstm+start/sr)
print(correctedonstm*sr+start)
fg = plt.figure(figsize=[12, 8])
# Plot the waveform together with onset times superimposed in red
ax1 = fg.add_subplot(2,1,1)
ax1.plot(idx+start, y)
for ii in correctedonstm*sr+start:
    ax1.axvline(ii, color='r')
ax1.set_ylabel('Amplitude', fontsize=16)
# Plot the RMSE together with onset times superimposed in red
ax2 = fg.add_subplot(2,1,2, sharex=ax1)
ax2.plot(envtm*sr+start, rmse)
for ii in correctedonstm*sr+start:
    ax2.axvline(ii, color='r')
# Plot threshold value superimposed as a black dotted line
ax2.axhline(threshold, linestyle=':', color='k')
ax2.set_ylabel("RMSE", fontsize=16)
ax2.set_xlabel("Sample Number", fontsize=16)
plt.show()
The printed output is as follows:
In [1]: %run rosatest
[ 0.17124717 1.88952381 3.74712018 5.62793651]
[ 3776. 41664. 82624. 124096.]
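A note on the `84.1` in the script: for Gaussian-distributed values, about 84.1% of samples lie below one standard deviation above the median, so the 84.1st percentile minus the median is a robust stand-in for sigma. The sketch below checks that number with the standard library. It assumes Gaussian statistics, which is only an approximation for per-frame RMS values of real noise.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

# Fraction of a Gaussian lying below (median + 1 sigma), as a percentile
percentile = 100.0 * std_normal.cdf(1.0)

# One-sided chance that a pure-noise frame exceeds (median + 5 sigma),
# under the Gaussian assumption
p_false = 1.0 - std_normal.cdf(5.0)

print(round(percentile, 1), p_false)
```

Under that assumption, a 5 sigma threshold lets through well under one noise frame in a million, which is why I call it "safely above" the median.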
The plot produced by the script is shown below: