Overview of the case study:

Step 1: Explanation of the problem, including its source, the problem statement, the relevant terms, and how it can be viewed as a business problem.


Step 2: Conversion of the business problem into a machine learning problem, with details of the performance metrics, existing solutions and approaches, and improvements.


Step 3: Complete exploratory data analysis.


Step 4: Feature engineering involving various techniques.


Step 5: Trying out different machine learning models, selecting the best model, and predicting on the test dataset.


Step 6: Future work that can be done to improve the performance metric.


Step 7: References.


1. Business/Real-world Problem:

1.1 Source:

This was posted as a Kaggle challenge by the ENET Centre, which researches and develops renewable energy resources with the goal of reducing or eliminating harmful environmental impacts.


Source: https://www.kaggle.com/c/vsb-power-line-fault-detection/overview

Data: Enet Centre, VSB — T.U. of Ostrava

1.2 What is Partial Discharge?

Here we deal with medium-voltage overhead power lines, which are spread over hundreds of miles, making manual fault detection almost impossible.


On some occasions these lines get damaged, either by a tree branch or because of a flaw in the insulator. Such damage does not cause an immediate outage. Instead, it produces a phenomenon called partial discharge, which degrades the line gradually and eventually leads to a power outage.

Its textbook definition is an electrical discharge that does not completely bridge the electrodes of an insulation system.


1.3 Problem Statement

The main objective of this case study is to detect these partial discharge patterns in signals acquired from lines with a new meter. Effective classifiers using this data will make it possible to continuously monitor power lines for faults.


1.4 Real-world/Business objectives and constraints

1. Minimize the binary-class error.
2. Probability estimates are needed.
3. There is no strict time limit: partial discharge faults do damage over time rather than immediately, so the limit can be in hours.
4. Detecting partial discharge early can be helpful financially.

2. Machine Learning Problem

2.1. Data Overview

  • Source: https://www.kaggle.com/c/vsb-power-line-fault-detection/data


In total, 4 files are given: 2 correspond to the training data and the other 2 to the test data. Each pair consists of:

1. A file containing signal data
2. A file containing metadata


Each signal contains 800,000 points. In total, 8,712 signals are given for training and 20,337 for testing, stored in Parquet format.


The metadata consists of the phase of the signal and the target label:

0 - partial discharge is not present
1 - partial discharge is present


2.2. Mapping the real-world problem to an ML problem

2.2.1. Type of Machine Learning Problem

There are 2 classes (partial discharge present or absent) into which a given data point must be classified => binary classification problem


2.2.2. Performance Metric

Source: https://www.kaggle.com/c/vsb-power-line-fault-detection/overview/evaluation

Metrics:

* Matthews correlation coefficient (MCC)
* Confusion matrix
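To make the two metrics concrete, here is a minimal sketch in plain Python that counts the confusion-matrix cells and derives the MCC from them. The helper names `confusion_counts` and `mcc` and the toy labels are assumptions made for illustration; they are not taken from the competition code. Returning 0 when the denominator is zero is one common convention, also assumed here.

```python
import math

def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 = partial discharge present."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 0:
            tn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn

def mcc(y_true, y_pred):
    """Matthews correlation coefficient computed from the confusion-matrix cells."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Toy labels (made up): 4 healthy signals, 2 with partial discharge
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (1, 3, 1, 1)
print(mcc(y_true, y_pred))               # 0.25
```

Because the MCC uses all four cells of the confusion matrix, it is a reasonable choice when one class is much rarer than the other.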


2.2.3. Machine Learning Objectives and Constraints

Objective: Predict the probability of each data point belonging to each of the 2 classes.

* Class probabilities are needed.
* Errors are penalized through the Matthews correlation coefficient, which is computed on the class labels obtained by thresholding the probabilities.
* Some latency constraints.

2.3.1. Existing approaches

Most of the notebooks present in https://www.kaggle.com/c/vsb-power-line-fault-detection/notebooks used deep learning techniques.


In most of the approaches, each signal is divided into equal chunks of 1000 points, giving 800 chunks per signal. Statistical features are then extracted from each chunk, which results in a 3-dimensional array (signals x chunks x features). Because the data is sequential, most of the solutions used LSTMs. Some solutions used attention layers, and some notebooks used transformers.
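The chunking step can be sketched roughly as follows. This is an illustrative sketch only: the helper name `chunk_features`, the choice of mean, standard deviation, minimum and maximum as the per-chunk statistics, and the tiny synthetic signals are all assumptions, since the notebooks differ in which statistics they extract.

```python
import statistics

def chunk_features(signal, chunk_size=1000):
    """Split one signal into equal chunks and summarise each chunk.

    Returns one [mean, std, min, max] row per chunk (the statistics are an assumed choice).
    """
    if len(signal) % chunk_size != 0:
        raise ValueError("signal length must be a multiple of chunk_size")
    rows = []
    for start in range(0, len(signal), chunk_size):
        chunk = signal[start:start + chunk_size]
        rows.append([
            statistics.fmean(chunk),
            statistics.pstdev(chunk),
            min(chunk),
            max(chunk),
        ])
    return rows

# Two tiny synthetic "signals" of 8 samples each, chunked into groups of 4
signals = [
    [0, 1, 2, 3, 10, 10, 10, 10],
    [5, 5, 5, 5, -1, 0, 1, 0],
]
features = [chunk_features(s, chunk_size=4) for s in signals]

# Nested shape: signals x chunks x features
print(len(features), len(features[0]), len(features[0][0]))  # 2 2 4
```

With the real data, an 800,000-point signal and `chunk_size=1000` would give 800 rows per signal, and stacking the signals gives the 3-dimensional input that a sequence model expects.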


Some approaches used signal denoising techniques such as the discrete wavelet transform (DWT), and others relied on finding peaks in the signal data, after which models were built using deep learning techniques.
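As a rough illustration of wavelet denoising, the sketch below applies a single-level Haar DWT, soft-thresholds the detail coefficients, and reconstructs the signal. The Haar wavelet, the single decomposition level, the threshold value and all function names are assumptions chosen to keep the example short; the text above does not say which wavelet or threshold the notebooks used, and a real pipeline would more likely rely on a wavelet library.

```python
import math

def haar_dwt(signal):
    """One-level Haar DWT: returns (approximation, detail) coefficient lists."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt: rebuilds the signal from both coefficient lists."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) / s)
        out.append((a - d) / s)
    return out

def soft_threshold(coeffs, threshold):
    """Shrink each coefficient towards zero by `threshold`, zeroing the small ones."""
    return [math.copysign(max(abs(c) - threshold, 0.0), c) for c in coeffs]

def denoise(signal, threshold):
    """Denoise by thresholding only the detail (high-frequency) coefficients."""
    approx, detail = haar_dwt(signal)
    return haar_idwt(approx, soft_threshold(detail, threshold))

# Made-up samples: small jitter in the first four points, one large jump at the end
noisy = [4.0, 4.2, 4.0, 3.8, 10.0, 2.0]
clean = denoise(noisy, threshold=0.5)
print([round(v, 3) for v in clean])
```

In this toy run the small jitter is flattened (each of the first two pairs collapses to its mean), while the large jump survives with a slightly reduced amplitude, which is the behaviour expected from soft thresholding.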


2.3.2. Improvements

Since most of the solutions used deep learning techniques, I've used classical machine learning models such as boosting models. For feature engineering, I've used new techniques such as power spectral density and the Fourier transform, and found the top features using peak detection. I've also used peak detection in the spectra of the signal, which was already mentioned in a notebook. For modeling, I've tried four different machine learning models together with techniques such as RandomizedSearchCV and StratifiedKFold cross-validation.
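The idea of detecting peaks in a signal's spectrum can be sketched as below. This is not the feature pipeline used in this case study: it computes a naive, unscaled power spectrum with a direct DFT in plain Python and then picks strict local maxima. The function names `power_spectrum` and `local_peaks`, the `min_height` parameter and the synthetic two-tone signal are assumptions for illustration. On real 800,000-point signals an FFT-based routine from a numerical library would be needed instead of this O(n^2) loop.

```python
import cmath
import math

def power_spectrum(signal):
    """Naive DFT power spectrum |X[k]|^2 / n for frequency bins k = 0 .. n // 2."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum

def local_peaks(values, min_height=0.0):
    """Indices of strict local maxima whose value is at least min_height."""
    return [i for i in range(1, len(values) - 1)
            if values[i] > values[i - 1]
            and values[i] > values[i + 1]
            and values[i] >= min_height]

# Synthetic signal: a strong tone in bin 5 plus a weaker tone in bin 12
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
          for t in range(n)]

spectrum = power_spectrum(signal)
peaks = local_peaks(spectrum, min_height=1.0)
print(peaks)                                 # [5, 12]
print([round(spectrum[i], 3) for i in peaks])
```

The peak positions and heights (here bins 5 and 12) are the kind of compact, fixed-length values that can be fed to a boosting model as features.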


