https://www.kaggle.com/gaborfodor/bosch-production-line-performance/69-failure-rate
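Note: this analysis workflow is worth studying. It looks at which feature values coincide with label = 1, and at which combinations of feature values coincide with Response = 1.
Note: data analysis follows recurring patterns, and pandas is a powerful tool for it.
Below, we walk through that pattern step by step: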
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import gc
sns.set_style('whitegrid')
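Hypothesis 1: suppose we want to know how the combinations of values taken by features A, B, and C relate to the label. This is essentially a groupby operation. Here A, B, and C are assumed to be discrete, and the label is Response.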
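As a minimal sketch of this idea on toy data (the frame `df`, its values, and the output column names `cnt` and `positive_rate` are made up for illustration and are not part of the Bosch dataset), grouping by the discrete features and aggregating the label gives the sample count and the share of positives for each combination:

```python
import pandas as pd

# Toy data: A, B, C are hypothetical discrete features, label is the binary target.
df = pd.DataFrame({
    "A": [0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1],
    "C": [1, 1, 1, 1, 0, 0],
    "label": [0, 1, 0, 0, 1, 1],
})

# For every observed (A, B, C) combination: number of rows and share of label == 1.
summary = df.groupby(["A", "B", "C"])["label"].agg(cnt="count", positive_rate="mean")
print(summary)
```

Because the label is 0/1, its mean within a group is the positive rate of that group, so no separate division is needed.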
STATIONS = ['S32', 'S33', 'S34']
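# Read a sample of rows to find which date columns are best populated
train_date_part = pd.read_csv('../input/train_date.csv', nrows=10000)
# Count non-null values per date column, most populated first
date_cols = train_date_part.drop('Id', axis=1).count().reset_index().sort_values(by=0, ascending=False)
# The station name is the second '_'-separated token of the column name
date_cols['station'] = date_cols['index'].apply(lambda s: s.split('_')[1])
date_cols = date_cols[date_cols['station'].isin(STATIONS)]
# Keep only the most populated date column for each station
date_cols = date_cols.drop_duplicates('station', keep='first')['index'].tolist()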
print(date_cols)
train_date = pd.read_csv('../input/train_date.csv', usecols=['Id'] + date_cols)
print(train_date.columns)
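# Rename to station names; this relies on the selected columns appearing in the file in the same order as STATIONS
train_date.columns = ['Id'] + STATIONS
for station in STATIONS:
    # 1 if the part has a timestamp at this station, 0 otherwise (NaN >= 0 is False)
    train_date[station] = 1 * (train_date[station] >= 0)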
response = pd.read_csv('../input/train_numeric.csv', usecols=['Id', 'Response'])
print(response.shape)
train = response.merge(train_date, how='left', on='Id')
# print(train.count())
train.head(3)
train['cnt'] = 1
failure_rate = train.groupby(STATIONS).sum()[['Response', 'cnt']]
failure_rate['failure_rate'] = failure_rate['Response'] / failure_rate['cnt']
failure_rate = failure_rate[failure_rate['cnt'] > 1000]  # keep only station combinations with more than 1000 parts
failure_rate.head(20)
Note: the recipe is to preprocess the data first, then use pandas groupby to compute how each combination of feature values relates to Response.
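The preprocessing step can be sketched end to end on synthetic data. Everything below is an assumption made for illustration: the tables `dates` and `labels`, their values, and the threshold `MIN_CNT` are invented, and `notna()` is swapped in for the notebook's `1 * (x >= 0)` comparison, which yields the same flag as long as all timestamps are non-negative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the date table: NaN means the part never visited the station.
dates = pd.DataFrame({
    "Id": [1, 2, 3, 4, 5, 6, 7],
    "S32": [10.5, np.nan, 3.2, np.nan, 8.0, np.nan, 1.0],
    "S33": [np.nan, 11.0, np.nan, 7.1, np.nan, 2.4, 2.0],
})
labels = pd.DataFrame({"Id": [1, 2, 3, 4, 5, 6, 7],
                       "Response": [1, 0, 0, 0, 1, 0, 0]})

stations = ["S32", "S33"]
# Turn timestamps into 0/1 presence flags per station.
flags = dates[stations].notna().astype(int)
flags.insert(0, "Id", dates["Id"])

# Attach the flags to the labels, then aggregate Response per flag combination.
merged = labels.merge(flags, how="left", on="Id")
rates = merged.groupby(stations)["Response"].agg(
    failures="sum", cnt="count", failure_rate="mean"
)

MIN_CNT = 2  # hypothetical threshold; the notebook keeps groups with cnt > 1000
rates = rates[rates["cnt"] > MIN_CNT]
print(rates)
```

Filtering on the group size matters because a failure rate computed from only a handful of parts is too noisy to interpret.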