Converting a Probability to Fail into a Time to Failure Metric

With predictive maintenance problems, there are two common metrics that represent the health of your asset.

The first is a probability to fail. That is, at a given moment in time, what is the probability that your machine will fail? Sometimes this is represented by a health score. Typically, the health score is one minus the probability to fail, times 100.
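
For instance, a 0.15 probability to fail corresponds to a health score of (1 - 0.15) * 100 = 85. A one-line sketch of that convention (an illustration, not code from the original article):

health_score = (1 - p_fail) * 100  # e.g., p_fail = 0.15 gives a health score of 85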

The second metric is the time until failure. That is, how many days, weeks, months, hours, minutes or seconds do you have until the asset in question stops working.

There are many different ways to calculate these metrics. Probably the most common way to gauge the probability to fail is a machine learning algorithm like logistic regression, random forest, or gradient boosted trees (note that there are many more).
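
As a concrete illustration, here is a minimal scikit-learn sketch of producing a probability to fail with logistic regression. The feature matrix X_train, label vector y_train, and scoring data X_new are hypothetical placeholders, not objects from the original notebook:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
clf.fit(X_train, y_train)                # y_train is 1 where the machine failed
p_fail = clf.predict_proba(X_new)[:, 1]  # column 1 holds the probability of failure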

Time to failure models typically rely on some type of survival model. A survival model is a family of techniques based on measuring and predicting expected lifetimes, given certain attributes of an individual or population. For example, will a drug treatment increase or decrease the life of a cancer patient? Or, how much longer will a machine operate if we service it every three months instead of every six months?
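
To make that concrete, here is a minimal Kaplan-Meier sketch using the lifelines package. The library choice and the durations/observed arrays are my own assumptions; the original article does not name a survival library:

from lifelines import KaplanMeierFitter

kmf = KaplanMeierFitter()
# durations: observed lifetimes; observed: 1 if the failure was seen, 0 if censored
kmf.fit(durations, event_observed=observed)
print(kmf.median_survival_time_)  # point estimate of the expected lifetime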

What if you have a probability to fail and want to convert it into a time to failure? Is this possible?

Of course it is possible. In fact, there are many ways to do it. In this article, I focus on my favorite technique. Not saying it is the best, just my favorite.

You probably won’t see this in a textbook, but my approach has served me well over the years. My technique is pretty easy and guarantees that your time to failure and probability to fail metrics are perfectly in sync.

I should also mention that, although this is an equipment failure problem, this technique has many other applications: converting a probability to churn into an expected customer lifetime, for example. Anytime you need to convert a probability of death into a prediction of lifespan, this should work.

1.0 Getting Set-Up

Install all of the relevant Python Libraries

!pip install --upgrade numpy 
!pip install imblearn --upgrade
!pip install plotly --upgrade
!pip install chart-studio --upgrade

Import required libraries

import chart_studio.plotly as py
import plotly.graph_objs as go
import plotly
import numpy as np
import pandas as pd
import statsmodels.api as sm

Import the data from GitHub

# Remove the data if you run this notebook more than once
!rm P_HAT_DATA.csv
# Import from GitHub
!wget https://raw.githubusercontent.com/shadgriffin/phat_to_ttf/master/P_HAT_DATA.csv

# Convert csv to pandas dataframe
pd_data = pd.read_csv("P_HAT_DATA.csv", sep=",", header=0)

2.0 Data Exploration

pd_data.head()

Our data set is pretty simple. Note that we have panel data. This means we have multiple entities (machines in this case) measured each day for a period of time.
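
One quick way to confirm the panel structure is to look at per-machine date coverage; a small exploratory sketch (not part of the original notebook):

# Each ID should have a run of daily observations
pd_data.groupby('ID')['DATE'].agg(['min', 'max', 'count']).head()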

Here is a description of the fields.

ID — ID field that represents a specific machine.

DATE — The date of the observation.

FAILURE_DATE — Day the machine in question failed.

P_FAIL — The probability a machine will fail on a given day; the output of a gradient boosted tree model.

We also need to create another field that indicates the number of days between the day of record and the failure date of the machine. This is the actual time to failure from the historical record.

# Convert dates to date format
pd_data['DATE'] = pd.to_datetime(pd_data['DATE'])
pd_data['FAILURE_DATE'] = pd.to_datetime(pd_data['FAILURE_DATE'])

# Days between the observation date and the failure date
pd_data['C'] = pd_data['FAILURE_DATE'] - pd_data['DATE']
pd_data['TIME_TO_FAILURE'] = pd_data['C'] / np.timedelta64(1, 'D')
pd_data = pd_data[['ID','DATE','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE']]
pd_data.head()

Examine the number of rows and columns. The data has 171,094 rows and 5 columns.

pd_data.shape

There are 421 machines in the data set.

# Count the number of distinct machine IDs
machine_counts = pd.DataFrame(pd_data.groupby(['ID']).agg(['count']))
machine_counts.shape

Check for duplicate ID/DATE records. There are none.

# Drop duplicate ID/DATE rows; the shape is unchanged, so there are none
df_dedup = pd_data.drop_duplicates(subset=['ID','DATE'])
df_dedup.shape

Look for null values in the fields. There are none.

pd_data.isnull().sum(axis = 0)

3.0 Data Transformations and Feature Engineering

Our data is currently at a daily level. We really don't need that level of detail for this exercise. In fact, if the data has any non-normal properties (it does, and this is almost always the case), aggregating a bit usually yields better results.

Next, we will convert the daily data to monthly data.

# Create fields that represent the month and year
pd_data['MONTH'] = pd.DatetimeIndex(pd_data['DATE']).month
pd_data['YEAR'] = pd.DatetimeIndex(pd_data['DATE']).year

pd_data = pd_data[['ID','MONTH','DATE','YEAR','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE']]
pd_data = pd_data.copy()

# Aggregate by machine, month and year
YY = pd_data
tips_summed = pd.DataFrame(YY.groupby(['ID','YEAR','MONTH'])[['P_FAIL', 'TIME_TO_FAILURE']].mean())
tips_summed = tips_summed.sort_values(by=['ID','YEAR','MONTH'], ascending=[True,True,True])
plotter = tips_summed

# Move the three index levels (ID, YEAR, MONTH) back into columns
plotter.reset_index(level=0, inplace=True)
plotter.reset_index(level=0, inplace=True)
plotter.reset_index(level=0, inplace=True)
pd_data = plotter
pd_data.shape

We now have 5,844 records after aggregating.

In case you haven’t guessed already, we are about to build a model that uses the probability to fail to predict the time to failure. Before we do that, however, let’s examine the relationship between the two variables graphically.

A scatter plot with 5,844 records will probably not be that meaningful, so first we will aggregate the data by using P_FAIL to create groupings. This should give us the essence of the relationship between P_FAIL and TIME_TO_FAILURE without a bunch of clutter.

# Create percentiles based on P_FAIL
pd_data['P'] = pd.qcut(pd_data['P_FAIL'], 100, labels=np.arange(100, 0, -1))
pd_data.head()

# Aggregate the variables by grouping
YY = pd_data
tips_summed = pd.DataFrame(YY.groupby(['P'])[['P_FAIL', 'TIME_TO_FAILURE']].mean())
tips_summed = tips_summed.sort_values(by=['P'], ascending=[True])
plotter = tips_summed
plotter.head(10)

Create a plot with Plotly

x1 = plotter['P_FAIL']
y1 = plotter['TIME_TO_FAILURE']

trace = go.Scatter(
    x=x1,
    y=y1,
    mode='markers',
    name='Time to Failure')

layout = go.Layout(
    title='Time to Failure and the Probability to Fail',
    xaxis=dict(
        title='Probability to Fail',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='Time to Failure',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)

data = [trace]
fig = go.Figure(data=data, layout=layout)
# plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='lines')

The relationship between P_FAIL and TIME_TO_FAILURE looks fairly linear. Visually, there do appear to be inflection points around .13, .30, .50 and .65.

Here are a few transformations that may be useful when we build our model.

pd_data['LN_P_FAIL'] = np.log((pd_data.P_FAIL*100+1))
pd_data['LN_TTF'] = np.log((pd_data.TIME_TO_FAILURE+1))
pd_data['SQ_P_FAIL'] = pd_data['P_FAIL']*pd_data['P_FAIL']
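
Incidentally, numpy's log1p/expm1 pair expresses the same "+1 then take the log" round trip and is slightly more numerically stable near zero; an equivalent sketch (an aside, not part of the original notebook):

ln_p = np.log1p(pd_data.P_FAIL * 100)   # same as np.log(pd_data.P_FAIL*100 + 1)
ttf_back = np.expm1(pd_data['LN_TTF'])  # inverts the log, recovering TIME_TO_FAILURE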

Also, a few dummy variables based on our chart above.

pd_data['DV_LT_p13'] = np.where(((pd_data.P_FAIL <= 0.13)), 1, 0)
pd_data['DV_GT_p65'] = np.where(((pd_data.P_FAIL >= 0.65)), 1, 0)
pd_data['DV_GT_p30_LT_p50'] = np.where((((pd_data.P_FAIL <= 0.5) & (pd_data.P_FAIL>.3))), 1, 0)
pd_data['DV_GT_p13_LT_p30'] = np.where((((pd_data.P_FAIL <= 0.3) & (pd_data.P_FAIL>.13))), 1, 0)

Create interaction, or slope, dummies.

pd_data['DV_p13_LN_PFAIL'] = pd_data['DV_LT_p13']*pd_data['LN_P_FAIL']
pd_data['DV_p65_LN_PFAIL'] = pd_data['DV_GT_p65']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p13_LT_p30_LN_PFAIL'] = pd_data['DV_GT_p13_LT_p30']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p30_LT_p50_LN_PFAIL'] = pd_data['DV_GT_p30_LT_p50']*pd_data['LN_P_FAIL']

4.0 Create the Testing and Training Groupings

Because we are dealing with a panel data set (cross-sectional time series), it is better not to just take a random sample of all records. Doing so would put records from one machine in both data sets. To avoid this, we’ll randomly select IDs and place all of the records for each machine in either the training or the testing data set.

# Get a unique list of all IDs
aa = pd_data
pd_id = aa.drop_duplicates(subset='ID')
pd_id = pd_id[['ID']]

Create a new variable with a random number between 0 and 1

np.random.seed(33)
pd_id['wookie'] = (np.random.randint(0, 10000, pd_id.shape[0])) / 10000
pd_id = pd_id[['ID', 'wookie']]

Give each machine a 50% chance of being in the testing data set and a 50% chance of being in the training data set.

pd_id['MODELING_GROUP'] = np.where(((pd_id.wookie <= 0.50)), 'TRAINING', 'TESTING')

This is how many machines fall in each group.

tips_summed = pd_id.groupby(['MODELING_GROUP'])['wookie'].count()
tips_summed

Append the group of each ID to each individual record.

pd_data = pd_data.sort_values(by=['ID'], ascending=[True])
pd_id = pd_id.sort_values(by=['ID'], ascending=[True])

pd_data = pd_data.merge(pd_id, on=['ID'], how='inner')

This is how many records are in each group.

tips_summed = pd_data.groupby(['MODELING_GROUP'])['wookie'].count()
tips_summed

Create a separate data frame for the training data. We will use this data set to build the model.

df_training=pd_data[pd_data['MODELING_GROUP'] == 'TRAINING']
df_training=df_training.drop(columns=['P'])

Create a separate data frame for the testing data set. We will use this to validate our modeling results.

df_test = pd_data[pd_data['MODELING_GROUP'] != 'TRAINING']
df_test = df_test.drop(columns=['P'])
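
As an aside, the same ID-level split can be written more compactly with scikit-learn's GroupShuffleSplit. This is a sketch assuming scikit-learn is available; it is an alternative to the wookie approach above, not the author's method:

from sklearn.model_selection import GroupShuffleSplit

# All rows that share a machine ID land on the same side of the split
gss = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=33)
train_idx, test_idx = next(gss.split(pd_data, groups=pd_data['ID']))
df_training_alt = pd_data.iloc[train_idx]
df_test_alt = pd_data.iloc[test_idx]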

5.0 Build an OLS Regression Model.

Select the variables that are theoretically relevant and statistically significant.

independentx = df_training[['LN_P_FAIL','DV_LT_p13','DV_GT_p65','DV_p13_LN_PFAIL','DV_GT_p13_LT_p30_LN_PFAIL',
                            'DV_GT_p13_LT_p30','DV_GT_p30_LT_p50','DV_GT_p30_LT_p50_LN_PFAIL']]
independent = sm.add_constant(independentx, prepend=False)
dependent = df_training['LN_TTF']

# Build the model
mod = sm.OLS(dependent, independent)
res = mod.fit()

# Examine the regression summary
print(res.summary())

Add a prediction of “Time to Failure” to the training data set.

df_ypred = pd.DataFrame(res.predict(independent))
df_ypred.columns = ['P_LN_TTF']

# Back-transform the logged prediction to days
df_ypred['P_TTF'] = (np.exp(df_ypred['P_LN_TTF']) - 1)

df_training = pd.concat([df_training, df_ypred], axis=1)
plottery = df_training

Compare the predictions to actuals graphically.

x1 = plottery['P_FAIL']
y1 = plottery['P_TTF']
x2 = plottery['P_FAIL']
y2 = plottery['TIME_TO_FAILURE']

trace = go.Scatter(
    x=x1,
    y=y1,
    mode='markers',
    name='Predicted TTF')

trace2 = go.Scatter(
    x=x2,
    y=y2,
    mode='markers',
    name='Actual TTF')

layout = go.Layout(
    title='Time to Failure and the Probability to Fail',
    xaxis=dict(
        title='Probability to Fail',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='Time to Failure',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)

data = [trace, trace2]
fig = go.Figure(data=data, layout=layout)
# plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')

6.0 Apply the Model to the Testing Data.

# Define the variables
independentx = df_test[['LN_P_FAIL','DV_LT_p13','DV_GT_p65','DV_p13_LN_PFAIL','DV_GT_p13_LT_p30_LN_PFAIL',
                        'DV_GT_p13_LT_p30','DV_GT_p30_LT_p50','DV_GT_p30_LT_p50_LN_PFAIL']]
independent = sm.add_constant(independentx, prepend=False)

# Score the test set
df_ypred = pd.DataFrame(res.predict(independent))
df_ypred.columns = ['P_LN_TTF']
df_ypred['P_TTF'] = (np.exp(df_ypred['P_LN_TTF']) - 1)

# Append y-hat
df_test = pd.concat([df_test, df_ypred], axis=1)
plotterx = df_test

After applying the model to the testing data set, examine the actual and predicted values graphically.

x1 = plotterx['P_FAIL']
y1 = plotterx['P_TTF']
x2 = plotterx['P_FAIL']
y2 = plotterx['TIME_TO_FAILURE']

trace = go.Scatter(
    x=x1,
    y=y1,
    mode='markers',
    name='Predicted TTF')

trace2 = go.Scatter(
    x=x2,
    y=y2,
    mode='markers',
    name='Actual TTF')

layout = go.Layout(
    title='Time to Failure and the Probability to Fail',
    xaxis=dict(
        title='Probability to Fail',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    yaxis=dict(
        title='Time to Failure',
        titlefont=dict(
            family='Courier New, monospace',
            size=18,
            color='#7f7f7f'
        )
    ),
    showlegend=True,
)

data = [trace, trace2]
fig = go.Figure(data=data, layout=layout)
# plot_url = py.plot(fig, filename='styling-names')
plotly.offline.iplot(fig, filename='shapes-lines')

The plot for the testing data looks similar to the one for the modeling data set, but this doesn’t give us much insight. Let’s look at the R-squared of the testing data and compare it to the training data. Remember that the Pearson correlation between a predicted variable and an actual variable gives you R; squaring R gives you R-squared.

df_corr = df_training[['P_LN_TTF','LN_TTF']]
df_corr.corr(method='pearson')

Squaring R gives us an R-squared of .224.
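
The same figure can be computed inline; a quick one-liner sketch (not from the original notebook):

r = df_training['P_LN_TTF'].corr(df_training['LN_TTF'], method='pearson')
r_squared = r ** 2  # approximately .224 on the training data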

Let’s do the same for the Testing Data Set.

df_corr=df_test[['P_LN_TTF','LN_TTF']]
df_corr.corr(method='pearson')

The testing data has an R-squared of .208, very similar to the training data.

7.0 The final step is to apply the model to the original daily data.

# Convert csv to pandas dataframe
pd_data = pd.read_csv("P_HAT_DATA.csv", sep=",", header=0)
pd_data = pd_data.copy()

# Convert dates to date format
pd_data['DATE'] = pd.to_datetime(pd_data['DATE'])
pd_data['FAILURE_DATE'] = pd.to_datetime(pd_data['FAILURE_DATE'])

# Recreate the actual time to failure
pd_data['C'] = pd_data['FAILURE_DATE'] - pd_data['DATE']
pd_data['TIME_TO_FAILURE'] = pd_data['C'] / np.timedelta64(1, 'D')
pd_data = pd_data[['ID','DATE','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE']]

# Recreate the transformations and dummy variables
pd_data['LN_P_FAIL'] = np.log((pd_data.P_FAIL*100 + 1))
pd_data['DV_LT_p13'] = np.where(((pd_data.P_FAIL <= 0.13)), 1, 0)
pd_data['DV_GT_p65'] = np.where(((pd_data.P_FAIL >= 0.65)), 1, 0)
pd_data['DV_GT_p30_LT_p50'] = np.where((((pd_data.P_FAIL <= 0.5) & (pd_data.P_FAIL > .3))), 1, 0)
pd_data['DV_GT_p13_LT_p30'] = np.where((((pd_data.P_FAIL <= 0.3) & (pd_data.P_FAIL > .13))), 1, 0)

# Recreate the slope dummies
pd_data['DV_p13_LN_PFAIL'] = pd_data['DV_LT_p13']*pd_data['LN_P_FAIL']
pd_data['DV_p65_LN_PFAIL'] = pd_data['DV_GT_p65']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p13_LT_p30_LN_PFAIL'] = pd_data['DV_GT_p13_LT_p30']*pd_data['LN_P_FAIL']
pd_data['DV_GT_p30_LT_p50_LN_PFAIL'] = pd_data['DV_GT_p30_LT_p50']*pd_data['LN_P_FAIL']

# Score the daily data with the fitted model
independentx = pd_data[['LN_P_FAIL','DV_LT_p13','DV_GT_p65','DV_p13_LN_PFAIL','DV_GT_p13_LT_p30_LN_PFAIL',
                        'DV_GT_p13_LT_p30','DV_GT_p30_LT_p50','DV_GT_p30_LT_p50_LN_PFAIL']]
independent = sm.add_constant(independentx, prepend=False)
df_ypred = pd.DataFrame(res.predict(independent))
df_ypred.columns = ['P_LN_TTF']
df_ypred['P_TTF'] = (np.exp(df_ypred['P_LN_TTF']) - 1)

df_final = pd.concat([pd_data, df_ypred], axis=1)
df_final = df_final[['ID','DATE','FAILURE_DATE','P_FAIL','TIME_TO_FAILURE','P_TTF']]
df_final.head(150)

Translated from: https://medium.com/swlh/converting-a-probability-to-fail-into-a-time-to-failure-metric-9a0ddf6122c4
