VSB_1st_place_solution 代码解析

不正经的码狗

于 2023-05-22 11:37:10 发布

阅读量239

点赞数

分类专栏： vsb 文章标签： python

本文链接：https://blog.csdn.net/weixin_43448473/article/details/130802268

版权

vsb 专栏收录该内容

1 篇文章 0 订阅

订阅专栏

Simple LightGBM model
Standard 5-fold CV
Uses just 9 features
Features all calculated using the peaks in the signals
Run times:
- Generating features: ~15 minutes
- Training: ~2 minutes
- Prediction: ~10 seconds
这是对使用LightGBM（梯度提升框架）的简单机器学习模型的描述。该模型仅使用 9 个特征创建，所有这些特征都是使用信号中的峰值计算的。该模型使用标准的 5 折交叉验证进行训练，特征生成运行时间为 ~15 分钟，训练运行时间为 ~2 分钟，预测运行时间为 ~10 秒。总体而言，此模型相对快速和高效，使其成为需要快速预测和处理时间的任务的不错选择。但是，请记住，模型的有效性将取决于所用特征的质量以及它们捕获数据中重要模式的程度。

Set RECALCULATE_PEAK_FEATURES to True to recalculate the per peak features

RECALCULATE_PEAK_FEATURES = True

The PREPROCESSED_DIR is used to load the previous commit’s per peak features if

RECALCULATE_PEAK_FEATURES is set to False. Note, if the kernel is forked this

variable needs to be updated to point to the forked kernel’s directory

PREPROCESSED_DIR = '../input/vsb-1st-place-solution'

这是 Python 脚本中的一个全局变量，其中包含存储先前计算的每个峰值特征的目录的路径。目录的名称表明它与 Kaggle 上托管的 VSB 电力线故障检测竞赛的解决方案相关。

在这样的比赛中，多个团队提交他们的解决方案，参与者通常会在比赛结束后发布他们的代码和数据。此变量可能正在获胜解决方案代码的公开可用版本中使用。通过指向已预先计算功能的目录，此变量可在运行代码时节省时间和计算工作量。这将很有帮助，特别是如果特征生成步骤昂贵或耗时。
**
换成电脑文件目录后怎么写**
如果要将PREPROCESSED_DIR变量设置为计算机上的目录路径，需要将其替换为代表该目录的字符串，在双引号内括起来。
例如：PREPROCESSED_DIR = ‘C:\Users\YourUserName\Documents\preprocessed_data’
这里，C:\Users\YourUserName\Documents\preprocessed_data应该替换为实际存储先前计算的峰值特征的目录路径。记得要用双反斜杠来转义路径中出现的反斜杠。
*
换成服务器的目录应该怎么写**
如果要将PREPROCESSED_DIR变量设置为服务器上的目录路径，需要将其替换为代表该目录的字符串，在双引号内括起来。

例如：PREPROCESSED_DIR = ‘/home/username/preprocessed_data’

这里，/home/username/preprocessed_data应该替换为实际存储先前计算的峰值特征的目录路径。请注意，Linux和Unix系统使用正斜杠(/)作为路径分隔符，而不是Windows系统中的反斜杠()。

功能：

        @numba.jit(nopython=True)
def flatiron(x, alpha=100., beta=1):
    """
    Flatten signal
    
    Creator: Michael Kazachok
    Source: https://www.kaggle.com/miklgr500/flatiron
    """
    new_x = np.zeros_like(x)
    zero = x[0]
    for i in range(1, len(x)):
        zero = zero*(alpha-beta)/alpha + beta*x[i]/alpha
        new_x[i] = x[i] - zero
    return new_x

       这是一个使用Numba库实现的函数，用于将信号平坦化处理。该函数将输入信号数组x作为参数，并使用alpha和beta参数来调整信号的平突（flattening）程度。

具体地说，该算法首先初始化一个新的零数组new_x，然后根据信号的递归公式对输入序列进行处理。该公式按照以下步骤进行：

将第一个数据点设置为初始化值;
对于接下来的每个数据点，将其减去前一个采样点和当前期望峰值之间的距离，并将此结果存储在输出序列中;
更新期望峰值。
这个函数创建者是Michael Kazachok，并且可以在kaggle.com中看到他的创作。

@numba.jit(nopython=True)

def drop_missing(intersect,sample):
“”"
Find intersection of sorted numpy arrays

Since intersect1d sort arrays each time, it's effectively inefficient.
Here you have to sweep intersection and each sample together to build
the new intersection, which can be done in linear time, maintaining order. 

Source: https://stackoverflow.com/questions/46572308/intersection-of-sorted-numpy-arrays
Creator: B. M.
"""
i=j=k=0
new_intersect=np.empty_like(intersect)
while i< intersect.size and j < sample.size:
    if intersect[i]==sample[j]: # the 99% case
        new_intersect[k]=intersect[i]
        k+=1
        i+=1
        j+=1
    elif intersect[i]<sample[j]:
        i+=1
    else : 
        j+=1
return new_intersect[:k]

这是一个使用Numba库实现的函数，用于找出两个排序numpy数组之间的交集。该函数将两个被进行比较的输入数组intersect和sample作为参数，并输出它们之间的交集。

该函数通过建立一个空的new_intersect数组并同时遍历两个输入数组来寻找它们的交集。在进行遍历时，可以按顺序处理数组中的元素，因此该函数的操作复杂度大约为线性级别。在具体算法实现方面，i和j变量分别记录两个输入数组当前处理到的位置，k变量用于迭代新new_intersect数组。

值得注意的是，该函数是由B. M.创建的，你可以在 https://stackoverflow.com/questions/46572308/intersection-of-sorted-numpy-arrays 找到该方法的源代码。

@numba.jit(nopython=True)
def _local_maxima_1d_window_single_pass(x, w):

midpoints = np.empty(x.shape[0] // 2, dtype=np.intp)
left_edges = np.empty(x.shape[0] // 2, dtype=np.intp)
right_edges = np.empty(x.shape[0] // 2, dtype=np.intp)
m = 0  # Pointer to the end of valid area in allocated arrays

i = 1  # Pointer to current sample, first one can't be maxima
i_max = x.shape[0] - 1  # Last sample can't be maxima
while i < i_max:
    # Test if previous sample is smaller
    if x[i - 1] < x[i]:
        i_ahead = i + 1  # Index to look ahead of current sample

        # Find next sample that is unequal to x[i]
        while i_ahead < i_max and x[i_ahead] == x[i]:
            i_ahead += 1
                
        i_right = i_ahead - 1
        
        f = False
        i_window_end = i_right + w
        while i_ahead < i_max and i_ahead < i_window_end:
            if x[i_ahead] > x[i]:
                f = True
                break
            i_ahead += 1
            
        # Maxima is found if next unequal sample is smaller than x[i]
        if x[i_ahead] < x[i]:
            left_edges[m] = i
            right_edges[m] = i_right
            midpoints[m] = (left_edges[m] + right_edges[m]) // 2
            m += 1
            
        # Skip samples that can't be maximum
        i = i_ahead - 1
    i += 1

# Keep only valid part of array memory.
midpoints = midpoints[:m]
left_edges = left_edges[:m]
right_edges = right_edges[:m]

return midpoints, left_edges, right_edges

@numba.jit(nopython=True)
def local_maxima_1d_window(x, w=1):
“”"
Find local maxima in a 1D array.
This function finds all local maxima in a 1D array and returns the indices
for their midpoints (rounded down for even plateau sizes).
It is a modified version of scipy.signal._peak_finding_utils._local_maxima_1d
to include the use of a window to define how many points on each side to use in
the test for a point being a local maxima.
Parameters
----------
x : ndarray
The array to search for local maxima.
w : np.int
How many points on each side to use for the comparison to be True
Returns
-------
midpoints : ndarray
Indices of midpoints of local maxima in x.
Notes
-----
- Compared to argrelmax this function is significantly faster and can
detect maxima that are more than one sample wide. However this comes at
the cost of being only applicable to 1D arrays.
“”"

fm, fl, fr = _local_maxima_1d_window_single_pass(x, w)
bm, bl, br = _local_maxima_1d_window_single_pass(x[::-1], w)
bm = np.abs(bm - x.shape[0] + 1)[::-1]
bl = np.abs(bl - x.shape[0] + 1)[::-1]
br = np.abs(br - x.shape[0] + 1)[::-1]

m = drop_missing(fm, bm)

return m

```这是一个使用Numba库实现的函数，用于在1D数组中查找本地峰值（local maxima）。该函数将输入数据序列x和窗口大小w作为参数，并返回所有本地最大值的中点索引。

该函数基于_scipy.signal._peak_finding_utils._local_maxima_1d修改而来。它利用窗口大小处理给定序列，并在每个窗口内寻找局部极大值。具体地说，对于每个采样点，该函数将检查其左右w个采样点（总共2*w+1个采样点），并确定在这些采样点范围内是否存在局部极大值。如果整个采样窗口内不存在更大的值，则当前采样点被标记为局部极大值。

该算法分为两个部分：先前向扫描找到楼顶，然后反向像后移动继续找到那些从后往前的峰值，接着将两个方向的峰值求交集，返回合并结果。

该算法也与原始 SciPy 库提供的 argrelmax 函数相比较，该local_maxima_1d_window函数比后者具有更高的查找速度和更广泛的峰值特性检测。由于其实现方式只适用于一维数组，因此可以使用更快的算法来找到所有突起。

@numba.jit(nopython=True)
def plateau_detection(grad, threshold, plateau_length=5):
“”“Detect the point when the gradient has reach a plateau”“”

count = 0
loc = 0
for i in range(grad.shape[0]):
    if grad[i] > threshold:
        count += 1
    
    if count == plateau_length:
        loc = i - plateau_length
        break
        
return loc

这是一个使用Numba库实现的函数，用于检测信号梯度是否超出给定阈值，并返回其对应位置。该函数将输入梯度序列`grad`，阈值`threshold`和平均长度`plateau_length`作为参数。

该函数采用均衡器的概念，即在某些点之后必须遵循信号梯度与阈值的差异规则。如果连续的`plateau_length`个梯度样本大于给定的阈值，则认为已经到达了一个平台状态（plateau）。如果存在平台，该函数返回最后一个变化之前的样本点的位置索引。这意味着如果梯度突然降低到阈值以下，则 `loc` 参数将始终为上一个采样点的索引。从概念上讲，它可以被认为是从之前不足以触发平台状态的点开始计数。

#@numba.jit(nopython=True)
def get_peaks(
x,
window=25,
visualise=False,
visualise_color=None,
):
“”"
Find the peaks in a signal trace.
Parameters
----------
x : ndarray
The array to search.
window : np.int
How many points on each side to use for the local maxima test
Returns
-------
peaks_x : ndarray
Indices of midpoints of peaks in x.
peaks_y : ndarray
Absolute heights of peaks in x.
x_hp : ndarray
An absolute flattened version of x.
“”"

x_hp = flatiron(x, 100, 1)

x_dn = np.abs(x_hp)

peaks = local_maxima_1d_window(x_dn, window)

heights = x_dn[peaks]

ii = np.argsort(heights)[::-1]

peaks = peaks[ii]
heights = heights[ii]

ky = heights
kx = np.arange(1, heights.shape[0]+1)

conv_length = 9

grad = np.diff(ky, 1)/np.diff(kx, 1)
grad = np.convolve(grad, np.ones(conv_length)/conv_length)#, mode='valid')
grad = grad[conv_length-1:-conv_length+1]

knee_x = plateau_detection(grad, -0.01, plateau_length=1000)
knee_x -= conv_length//2

if visualise:
    plt.plot(grad, color=visualise_color)
    plt.axvline(knee_x, ls="--", color=visualise_color)

peaks_x = peaks[:knee_x]
peaks_y = heights[:knee_x]

ii = np.argsort(peaks_x)
peaks_x = peaks_x[ii]
peaks_y = peaks_y[ii]
    
return peaks_x, peaks_y, x_hp

这是一个用于在信号轨迹中查找峰值的函数，它还包含一些可视化功能。该函数将输入数据序列`x`，窗口大小`window`以及用于可视化和调试的其他参数作为参数，并返回所有本地最大值的中点索引、相对高度和绝对高度。

该函数使用一些预处理您的数据，例如将平坦图形带宽压缩以便更好地处理。接着使用局部极大值算法在 1D 数据轨迹中检测峰值。然后，通过小转角划分法来查找阶梯点，以便确定有多少个峰值落在合适的区间。

在算法实现方面，该算法首先使基线平坦化。其次，它通过对该序列进行快速傅里叶变换 (FFT)，使用一个典型的 HC-ABS 算法将振幅谱上升到常见背景噪声水平之上，接着将其目标重新自我蹂躏成原始数据的峰值轮廓。然后，该函数计算当系统处于低能模式（即站在平台上）时，其总输出接近零，而在有效状态下（即越来越丰富的活动状态），则会产生反馈强度比闭环的值高一个数量级。

由于该函数中包含一些可视化/调试处理，与原始峰值算法相比，该函数的效率略低。

@numba.jit(nopython=True)
def clip(v, l, u):
“”“Numba helper function to clip a value”“”

if v < l:
    v = l
elif v > u:
    v = u
    
return v

peak_features_names = [
‘ratio_next’,
‘ratio_prev’,
‘small_dist_to_min’,
‘sawtooth_rmse’,
]

num_peak_features = len(peak_features_names)

@numba.jit(nopython=True)
def create_sawtooth_template(sawtooth_length, pre_length, post_length):
“”“Generate sawtooth template”“”

l = pre_length+post_length+1

st = np.zeros(l)
for i in range(sawtooth_length+1):
    
    j = pre_length+i
    if j < l:
        st[j] = 1 - ((2./sawtooth_length) * i)
    
return st

@numba.jit(nopython=True)
def calculate_peak_features(px, x_hp0, ws=5, wl=25):
“”"
Calculate features for peaks.
Parameters
----------
px : ndarray
Indices of peaks.
x_hp0 : ndarray
The array to search.
ws : np.int
How many points on each side to use for small window features
wl : np.int
How many points on each side to use for large window features
Returns
-------
features : ndarray
Features calculate for each peak in x_hp0.
“”"

features = np.ones((px.shape[0], num_peak_features), dtype=np.float64) * np.nan

for i in range(px.shape[0]):
    
    feature_number = 0
    
    x = px[i]
    x_next = x+1
    x_prev = x-1
    
    h0 = x_hp0[x]

    ws_s = clip(x-ws, 0, 800000-1)
    ws_e = clip(x+ws, 0, 800000-1)
    wl_s = clip(x-wl, 0, 800000-1)
    wl_e = clip(x+wl, 0, 800000-1)
    
    ws_pre = x - ws_s
    ws_post = ws_e - x
    
    wl_pre = x - wl_s
    wl_post = wl_e - x
    
    if x_next < 800000:
        h0_next = x_hp0[x_next]
        features[i, feature_number] = np.abs(h0_next)/np.abs(h0)
    feature_number += 1
        
    if x_prev >= 0:
        h0_prev = x_hp0[x_prev]
        features[i, feature_number] = np.abs(h0_prev)/np.abs(h0)
    feature_number += 1
        
    x_hp_ws0 = x_hp0[ws_s:ws_e+1]
    x_hp_wl0 = x_hp0[wl_s:wl_e+1]
    x_hp_wl0_norm = (x_hp_wl0/np.abs(h0))
    x_hp_ws0_norm = (x_hp_ws0/np.abs(h0))
    x_hp_abs_wl0 = np.abs(x_hp_wl0)
    wl_max_0 = np.max(x_hp_abs_wl0)
    
    ws_opp_peak_i = np.argmin(x_hp_ws0*np.sign(h0))
    
    features[i, feature_number] = ws_opp_peak_i - ws
    feature_number += 1
    
    x_hp_wl0_norm_sign = x_hp_wl0_norm * np.sign(h0)
    
    sawtooth_length = 3
    st = create_sawtooth_template(sawtooth_length, wl_pre, wl_post)
    assert np.argmax(st) == np.argmax(x_hp_wl0_norm_sign)
    assert st.shape[0] == x_hp_wl0_norm_sign.shape[0]
    features[i, feature_number] = np.mean(np.power(x_hp_wl0_norm_sign - st, 2))
    feature_number += 1
    
    if i == 0:
        assert feature_number == num_peak_features
    
return features

这个函数计算给定峰值位置的一些特征。该函数将位置序列 `px`，包含要查找特征的原始数据 `x_hp0` 以及要用于某些窗口聚合方法的参数作为输入，并返回一个数组，用于存储每个峰值的特征。

在特征提取方面，该函数首先努力通过使用平均数和标准差来调整所有峰值的不同高度。其次，它通过上下文和当前窗口大小过滤了每个峰值周围的邻居。然后，它尝试解决闭环和时间同步方案的微调问题，并处理由于由未知冲击和初始暂初始移动产生的随机变化等引入的额外标记或噪声。

该函数计算每个峰值本身的比率，称为 `ratio_next` 和 `ratio_prev`。还计算与峰值本身相关的其他特征，例如以下图表中的小距离到最小值的距离（`small_dist_to_min`），该距离位于 `rw` 窗口以内。许多算法和模型都依赖于这些特征，以识别或区分不同的类型的事件或发现“规律性”。

def process_signal(
data,
window=25,
):
“”"
Process a signal trace to find the peaks and calculate features for each peak.
Parameters
----------
data : ndarray
The array to search.
window : np.int
How many points on each side to use for the local maxima test
Returns
-------
px0 : ndarray
Indices for each peak in data.
height0 : ndarray
Absolute heaight for each peak in data.
f0 : ndarray
Features calculate for each peak in data.
“”"

px0, height0, x_hp0 = get_peaks(
    data.astype(np.float),
    window=window, 
)
        
f0 = calculate_peak_features(px0, x_hp0)

return px0, height0, f0

def process_measurement_peaks(data, signal_ids):
“”"
Process three signal traces in measurment to find the peaks
and calculate features for each peak.
Parameters
----------
data : ndarray
Signal traces.
signal_ids : ndarray
Signal IDs for each of the signal traces in measurment
Returns
-------
res : ndarray
Data for each peak in the three traces in data.
sigid_res : ndarray
Signal ID for each row in res.
“”"
res = []
sigid_res = []

assert data.shape[1] % 3 == 0
N = data.shape[1]//3

for i in range(N):
    
    sigids = signal_ids[i*3:(i+1)*3]
    x = data[:, i*3:(i+1)*3].astype(np.float)
    
    px0, height0, f0 = process_signal(
        x[:, 0]
    )
    
    px1, height1, f1 = process_signal(
        x[:, 1]
    )
    
    px2, height2, f2 = process_signal(
        x[:, 2]
    )
    
    if px0.shape[0] != 0:
        res.append(np.hstack([
            px0[:, np.newaxis], 
            height0[:, np.newaxis],
            f0,
        ]))
        
        sigid_res.append(np.ones(px0.shape[0], dtype=np.int) * sigids[0])
    
    if px1.shape[0] != 0:
        res.append(np.hstack([
            px1[:, np.newaxis], 
            height1[:, np.newaxis],
            f1,
        ]))

        sigid_res.append(np.ones(px1.shape[0], dtype=np.int) * sigids[1])
    
    if px2.shape[0] != 0:
        res.append(np.hstack([
            px2[:, np.newaxis], 
            height2[:, np.newaxis],
            f2,
        ]))

        sigid_res.append(np.ones(px2.shape[0], dtype=np.int) * sigids[2])
        
return res, sigid_res

def process_measurement(data_df, meta_df, fft_data):
“”"
Process three signal traces in measurment to find the peaks
and calculate features for each peak.
Parameters
----------
data_df : pandas.DataFrame
Signal traces.
meta_df : pandas.DataFrame
Meta data for measurement
fft_data : ndarray
50Hz fourier coefficient for three traces
Returns
-------
peaks : pandas.DataFrame
Data for each peak in the three traces in data.
“”"
peaks, sigids = process_measurement_peaks(
data_df.values, # [:, :1003],
meta_df[‘signal_id’].values, # [:1003]
)

peaks = np.concatenate(peaks)

peaks = pd.DataFrame(
    peaks,
    columns=['px', 'height'] + peak_features_names
)
peaks['signal_id'] = np.concatenate(sigids)

# Calculate the phase resolved location of each peak

phase_50hz = np.angle(fft_data, deg=False) # fft_data[:, 1]

phase_50hz = pd.DataFrame(
    phase_50hz,
    columns=['phase_50hz']
)
phase_50hz['signal_id'] = meta_df['signal_id'].values

peaks = pd.merge(peaks, phase_50hz, on='signal_id', how='left')

dt = (20e-3/(800000))
f1 = 50
w1 = 2*np.pi*f1
peaks['phase_aligned_x'] = (np.degrees(
    (w1*peaks['px'].values*dt) + peaks['phase_50hz'].values
) + 90) % 360

# Calculate the phase resolved quarter for each peak
peaks['Q'] = pd.cut(peaks['phase_aligned_x'], [0, 90, 180, 270, 360], labels=[0, 1, 2, 3])

return peaks

@numba.jit(nopython=True, parallel=True)
def calculate_50hz_fourier_coefficient(data):
“”“Calculate the 50Hz Fourier coefficient of a signal.
Assumes the signal is 800000 data points long and covering 20ms.
“””

n = 800000
assert data.shape[0] == n

omegas = np.exp(-2j * np.pi * np.arange(n) / n).reshape(n, 1)
m_ = omegas ** np.arange(1, 2)

m = m_.flatten()

res = np.zeros(data.shape[1], dtype=m.dtype)
for i in numba.prange(data.shape[1]):
    res[i] = m.dot(data[:, i].astype(m.dtype))
        
return res

这段代码实现了为测量中的信号跟踪中的三个信号之一找到峰值。该函数使用numpy和numba库，以加快数组操作的计算速度。它调用`get_peaks` （从原始数据提取峰值）和`calculate_peak_features`（为每个峰计算特征），并将结果存储在一个元祖中。

它被调用处理每个测量样本中的所有信号以查找离散特点的峰值，并根据每个峰值位置、高度等计算峰值特征。该函数返回数据帧 `peaks`，其中包含峰值的所有信息和特征，以及每个峰值的50 Hz傅里叶系数的相位分辨标识。最后，将对峰值按其在50Hz傅立叶方案中的相位与相关的象限进行分类。

Each signal trace is preprocessed to identify the peaks and calculate features.

The steps performed to do this are:

Flatten the trace using the flatiron function
Identify the local maxima peaks in each trace using the local_maxima_1d_window function
Filter the peaks identified in step 2 to separate the signal from the noise
Calculate features for each peak identified in step 3 using calculate_peak_features
Step 2. To identify the local maxima the function local_maxima_1d_window is used. This function takes a window length argument, which is the number of points on each side to use for the comparison. An example of the behaviour of this function can be seen below:

这段代码描述了对每个信号进行预处理以识别峰值和计算特征的步骤。

其步骤如下：

1. 使用 flatiron 函数将信号转换为一维
2. 使用 local_maxima_1d_window 函数查找每个信号中的局部最大值峰值
3. 过滤在第2步中识别出来的峰值，以区分信号和噪声
4. 对在第3步中识别的每个峰值使用 calculate_peak_features 计算特征

第2步。local_maxima_1d_window函数用于查找局部极大值。该函数需要一个窗口长度参数，该参数用于指定比较使用的左右两侧点数。以下是该函数的示例行为：`

a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 1, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 2, 0, 1, 0])

p1 = local_maxima_1d_window(a, w=1)
p3 = local_maxima_1d_window(a, w=3)
p4 = local_maxima_1d_window(a, w=4)

plt.plot(a, marker=‘o’)
plt.scatter(p1, a[p1]+0.2, color=‘red’, label=‘1’)
plt.scatter(p3, a[p3]+0.4, color=‘orange’, marker=‘x’, label=‘3’)
plt.scatter(p4, a[p4]+0.6, color=‘grey’, marker=‘^’, label=‘4’)
plt.legend()
plt.show()


```此示例展示了如何使用 `local_maxima_1d_window` 函数查找一维信号中的局部峰值。在这个示例中，我们使用以下数组：

a = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 1, 0, 0, 3, 0, 0, 0, 0, 0, 1, 0, 2, 0, 3, 0, 2, 0, 1, 0])

我们调用三次local_maxima_1d_window函数：w分别为1、3和4。然后在plot中绘制原始数据和找到峰值。在此图中

- 橙色x表示窗口为3的局部极大值
- 红色圆点表示窗口大小为1的局部最大值
- 灰色向上三角形表示窗口大小为4的局部极大值

![](null)

```c
Step 3. Once all the peaks in a trace have been identified, the peaks caused by the noise in the signal need to be removed. This is performed in the get_peaks function. When the peaks are ordered by height, knee detection is performed to identify the point when the height of the peaks stops changing due to the noise floor being reached. The steps are:

1. Order the peaks by their height
2. Calculate the gradient between each consecutive pair of peaks
3. Smooth the gradients using a convolution operation
4. Find the noise floor using the plateau_detection function

The location of the threshold identified for a number of signals can be seen below:

第3步。一旦确定了信号中的所有峰值，就需要消除信号中的噪声引起的峰值。这是在 get_peaks 函数中执行的。当按高度排序峰值时，使用膝盖检测来确定由于已达到噪声平台而峰值高度停止变化的点。步骤如下：

按峰值高度对峰值进行排序
计算每个相邻峰值之间的梯度
使用卷积操作平滑梯度
使用 plateau_detection 函数找到噪音阈值

在以下示例中可以看到为多个信号标识的噪声阈值位置：

fig, ax = plt.subplots(1, 1, figsize=(10, 6))

sigids = [2323, 10, 4200, 4225]
colours = ['blue', 'red', 'orange', 'grey']
for i, sigid in enumerate(sigids):
    d = pq.read_pandas(
        data_dir + '/train.parquet',
        columns=[str(sigid)]
    ).to_pandas().values[:, 0].astype(np.float)
    get_peaks(d, visualise=True, visualise_color=colours[i])

plt.xlim([0, 4000])
plt.axhline(-0.01, color='black', ls='--')
plt.yscale("symlog")
plt.xscale("symlog")

plt.xlabel('Sorted peak index')
plt.ylabel('Gradient')
plt.suptitle('Example of peak filtering')

plt.show()

此示例展示了根据峰值梯度阈值过滤多个信号中的峰值。在这个示例中，我们从训练集示例中选取四个样本，在其上运行 get_peaks 函数，并使用不同颜色可视化其峰值梯度曲线。

每个图形显示从左侧到右侧排序的峰值（x轴），并显示针对每个峰值在平滑梯度曲线上计算出的峰值梯度（y轴）。平滑过程使得局部波动被平滑到一个更平滑的曲线中，可以看到通过设置阈值可移除大部分峰值，在非特定的前提下这一方法使得仅留下实际的信号峰值变得更加容易。

黑色虚线表示噪声阈值位置，可以看到该阈值位置在 -0.01 的水平线以下，此阈值将保留主体信号，并滤除信号中的噪声。

sigids = [100, 230, 2323, 10, 4200, 4225]

for sigid in sigids:
    d = pq.read_pandas(
        data_dir + '/train.parquet',
        columns=[str(sigid)]
    ).to_pandas().values[:, 0].astype(np.float)

    fig, ax = plt.subplots(1, 1, figsize=(12, 4))
    plt.plot(d, alpha=0.75)

    px, py, _ = get_peaks(d)
    
    plt.scatter(px, d[px], color="red")
    plt.show()

此示例展示了如何使用 get_peaks 函数来获取信号中的峰值，并将其可视化。该函数返回三个数组：峰值的位置、高度和它们的排序索引，也就是将该峰值与其他峰值进行比较时排在第几位。

对于每个信号id，在读取数据并计算出峰值之后，我们将原始数据和峰值用红色散点图绘制在同一张图中。这使得我们可以清楚地看到每个峰值相对于原始数据的位置。这对于分析信号和确定需要保留或过滤的特征非常有帮助。

Step 4. Once the noisy peaks have been removed, features are calcuated for each of the remaining peaks. This is performed in calculate_peak_features. Of the features the most interesting is the sawtooth_rmse feature. This is the RMSE between the window of values including the peak and 25 data points either side of the peak and a sawtooth like template. An example of this can be seen below:

def plot_sawtooth_example(measid, px, w):
    
    display(HTML(meta_train_df[meta_train_df['id_measurement'] == measid].to_html()))
    
    sigids = [measid * 3 + i for i in range(3)]
    
    fig, ax = plt.subplots(1, 1, figsize=(12, 4))

    for i, sigid in enumerate(sigids):
        d = pq.read_pandas(
            data_dir + '/train.parquet',
            columns=[str(sigid)]
        ).to_pandas().values[:, 0].astype(np.float)

        d = flatiron(d)
        d = d[px-w:px+w+1]
        plt.plot(d, marker='o', label='{}) {}'.format(i, sigid))

        if i == 1:
            s = d[w]
            ft = create_sawtooth_template(3, w, w) * s
            plt.plot(ft, color='black', ls='--')

    plt.axvline(w, color='black', ls='--')
    plt.legend()
    plt.show()
measid = 705
px = 520128 # 93911 # 123144
w = 10

plot_sawtooth_example(measid, px, w)```
第4步。一旦噪声干扰的峰值被过滤，剩下的峰值将用于计算特征。这是在 `calculate_peak_features` 函数中完成的。最有趣的特征之一是 sawtooth_rmse 特征。这是峰值所在包含该峰值周围25个数据点的窗口和类似锯齿形的模板之间的均方根误差(RMSE)。下面是一个示例：

此代码绘制了信号中的峰值，并以两种不同颜色可视化了3个通道上的相应峰值。虚线表示使用它作为模板的锯齿波形。这允许我们比较原始信号与期望（或理想）信号之间的差异。

通过计算不同属性的峰值特征，可以在所有信号中提取更多信息，包括每个信号的振荡特性、最大振幅等。

```c
measid = 1091
px = 593144
w = 10

plot_sawtooth_example(measid, px, w)
```新增的示例展示了如何用相同的 `plot_sawtooth_example` 函数来绘制 输入的 measid 中特定位置上的一个峰值，并用模板进行比较。可以使用不同的 measid 和 px 值以查看其他峰值的示例。


***预处理***

```c
%%time

if RECALCULATE_PEAK_FEATURES:

    train_df = pq.read_pandas(
        data_dir + '/train.parquet'
    ).to_pandas()
    
    train_fft = calculate_50hz_fourier_coefficient(train_df.values)
    
    train_peaks = process_measurement(
        train_df, 
        meta_train_df, 
        train_fft
    )

    del train_df, train_fft
    gc.collect()

此代码将在数据集的通道上运行 calculate_50hz_fourier_coefficient 函数并处理所有峰值，以获取有关每个峰值的信息（例如位置、高度等）。如果没有预先计算或需要重新计算特征，则此代码段将花费较长时间。

if not RECALCULATE_PEAK_FEATURES:
    train_peaks = pd.read_pickle(PREPROCESSED_DIR + '/train_peaks.pkl')
    
train_peaks.to_pickle("train_peaks.pkl")
train_peaks = pd.merge(train_peaks, meta_train_df[['signal_id', 'id_measurement', 'target']], on='signal_id', how='left')
train_peaks.head()

此代码段通常被称为将特征合并到主元数据集中。在这里，将之前处理的峰值特征与元训练数据集进行合并。通过 pd.merge 函数，将 “signal_id”, “id_measurement” 和 “target” 字段添加到 train_peaks DataFrame 中。这个过程创建了一个更大的 DataFrame，其中包含了峰值、基本信息和目标值（如果适用）。

最后，to_pickle()函数可将新 DataFrame 保存下来以便于随后使用。

if RECALCULATE_PEAK_FEATURES:
    
    NUM_TEST_CHUNKS = 10
    
    test_chunk_size = int(np.ceil((meta_test_df.shape[0]/3.)/float(NUM_TEST_CHUNKS))*3.)
    
    test_peaks = []

    for j in range(NUM_TEST_CHUNKS):

        j_start = j*test_chunk_size
        j_end = (j+1)*test_chunk_size
        
        signal_ids = meta_test_df['signal_id'].values[j_start:j_end]

        test_df = pq.read_pandas(
            data_dir + '/test.parquet',
            columns=[str(c) for c in signal_ids]
        ).to_pandas()

        test_fft = calculate_50hz_fourier_coefficient(test_df.values)
        
        p = process_measurement(
            test_df, 
            meta_test_df.iloc[j_start:j_end], 
            test_fft
        )

        test_peaks.append(p)
        
        print(j)
        
        del test_df
        gc.collect()
        
    del test_fft
    gc.collect()
        
    test_peaks = pd.concat(test_peaks)
    
```此代码段会对测试集上每个信号的峰值进行处理，以获取有关每个峰值和其它特征的信息。该过程与之前在训练集上执行的过程类似，但由于测试集尺寸较大，因此按块进行处理以防止内存问题。`NUM_TEST_CHUNKS` 变量指定了要分成的块数并最终合并。`calculate_50hz_fourier_coefficient` 函数将用于计算测试集中峰值附近的傅里叶系数，而 `process_measurement` 函数将用于提取测试集峰值的其它特征。

在这个代码块运行完后，测试数据集中的峰值特征已经被提取出来并存为一个表格形式储存在 `test_peaks` 中。

```c
if not RECALCULATE_PEAK_FEATURES:
    test_peaks = pd.read_pickle(PREPROCESSED_DIR + '/test_peaks.pkl')
    
test_peaks.to_pickle("test_peaks.pkl")
test_peaks = pd.merge(test_peaks, meta_test_df[['signal_id', 'id_measurement']], on='signal_id', how='left')
test_peaks.head()

这个代码块将测试集峰值信息与元测试数据集进行合并，并将其存储为 pickle 文件。和我们在训练集中做的一样，通过 pd.merge 函数，将 “signal_id” 和 “id_measurement” 字段添加到新的 DataFrame 中来使之更有用。
特征

def save_importances(importances_, filename='importances.png', print_results=False, figsize=(8, 6)):
    mean_gain = importances_[['gain', 'feature']].groupby('feature').mean()
    if print_results:
        display(HTML(mean_gain.sort_values('gain').to_html()))
    importances_['mean_gain'] = importances_['feature'].map(mean_gain['gain'])
    plt.figure(figsize=figsize)
    sns.barplot(x='gain', y='feature', data=importances_.sort_values('mean_gain', ascending=False))
    plt.tight_layout()
    plt.savefig(filename)```
此代码段定义了一个函数，用于保存特征重要性图。该函数将重要性列表作为输入并使用 seaborn 的 `barplot()` 函数来展示出最好的结果。在这个实现中，可以选择是否打印每个特征的平均信息增益（gain），以及图表尺寸和输出的文件名。

```c
def process(peaks_df, meta_df, use_improved=True):

    results = pd.DataFrame(index=meta_df['id_measurement'].unique())
    results.index.rename('id_measurement', inplace=True)
    
    ################################################################################
    
    if not USE_SIMPLIFIED_VERSION:
        # Filter peaks using ratio_next and height features
        # Note: may not be all that important
        peaks_df = peaks_df[~(
            (peaks_df['ratio_next'] > 0.33333)
            & (peaks_df['height'] > 50)
        )]
    
    ################################################################################

    # Count peaks in phase resolved quarters 0 and 2
    p = peaks_df[peaks_df['Q'].isin([0, 2])].copy()
    res = p.groupby('id_measurement').agg(
    {
        'px': ['count'],
    })
    res.columns = ["peak_count_Q02"]
    results = pd.merge(results, res, on='id_measurement', how='left')
        
    ################################################################################
    
    # Count total peaks for each measurement id
    res = peaks_df.groupby('id_measurement').agg(
    {
        'px': ['count'],
    })
    res.columns = ["peak_count_total"]
    results = pd.merge(results, res, on='id_measurement', how='left')

    ################################################################################

    # Count peaks in phase resolved quarters 1 and 3
    p = peaks_df[peaks_df['Q'].isin([1, 3])].copy()
    res = p.groupby('id_measurement').agg(
    {
        'px': ['count'],
    })
    res.columns = ['peak_count_Q13']
    results = pd.merge(results, res, on='id_measurement', how='left')
    
    ################################################################################
    
    # Calculate additional features using phase resolved quarters 0 and 2
    
    feature_quarters = [0, 2]
    
    p = peaks_df[peaks_df['Q'].isin(feature_quarters)].copy()
    
    p['abs_small_dist_to_min'] = np.abs(p['small_dist_to_min'])
    
    res = p.groupby('id_measurement').agg(
    {
        
        'height': ['mean', 'std'],
        'ratio_prev': ['mean'],
        'ratio_next': ['mean'],
        'abs_small_dist_to_min': ['mean'],        
        'sawtooth_rmse': ['mean'],
    })
    res.columns = ["_".join(f) + '_Q02' for f in res.columns]     
    results = pd.merge(results, res, on='id_measurement', how='left')
        
    return results

这段代码定义了一个主要函数 process()。该函数带有三个参数，其中包括我们之前从测试集和训练集中导出的峰值特征文件并经过合并的 train_peaks 和 test_peaks 作为其两个参数。第三个参数 use_improved 是一个布尔型变量，用于指示我们是否使用新增加的五个高级特征。

在函数开头，所有需要计算的输出结果都被组织到了名为 results 的 DataFrame 中，并且该 DataFrame 的索引重命名为了测量id。处理过程是首先应用一个过滤器以删除高度大于50且下一个峰值比值超过1/3的记录。接下来分别对属于四个实实的不同象限的峰值数量进行计数，然后计算数据框中每个测量ID的总峰值计数。随后，基于分别处理的四个象限所得到的所有峰值，可以计算五个统计数据。每个值都是针对具有给定 feature_quarters 输入（0 和 2）的相应列所采用的。

最后，将包含所有结果的 results 数据框返回。