Data Analysis
The value of EDA lies mainly in getting familiar with the dataset, understanding it, and validating it to confirm that it can be used for the machine learning or deep learning work that follows. Once we understand the dataset, the next step is to explore the relationships among the variables and between the variables and the target. This guides data science practitioners through the data processing and feature engineering steps, so that the structure and feature set of the data make the downstream prediction problem more reliable. Finally, complete the exploratory analysis, summarize the data with charts or text, and check in.
- Load the data science and visualization libraries:
  - Data science libraries: pandas, numpy, scipy;
  - Visualization libraries: matplotlib, seaborn;
- Load the data:
  - Load the training and test sets;
  - Take a quick look at the data (head() + shape);
- Data overview:
  - Use describe() to get familiar with the summary statistics
  - Use info() to get familiar with the data types
- Check for missing values and outliers:
  - Check the NaN situation in each column
  - Outlier detection
- Understand the distribution of the target:
  - Overall distribution
  - Check skewness and kurtosis
  - Check the frequency of each target value
Code
Download: 天池-零基础入门数据挖掘-心跳信号分类预测-EDA分析全过程-代码.rar
Load the data science and visualization libraries
In [1]:
#coding:utf-8
# Import the warnings package and use a filter to suppress warning messages.
import warnings
warnings.filterwarnings('ignore')
import missingno as msno
import pandas as pd
from pandas import DataFrame
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
executed in 1.31s, finished 22:45:29 2021-03-19
Load the training and test sets
Import the training set train.csv
In [2]:
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
Train_data = pd.read_csv('./datasets/train.csv')
executed in 2.05s, finished 22:45:32 2021-03-19
Import the test set testA.csv
In [3]:
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
Test_data = pd.read_csv('./datasets/testA.csv')
executed in 445ms, finished 22:45:32 2021-03-19
All features have been desensitized (to make them easier to read).

- id: unique identifier assigned to each heartbeat signal
- heartbeat_signals: the heartbeat signal sequence
- label: heartbeat signal class (0, 1, 2, 3)

data.head().append(data.tail()) — view the first and last rows
data.shape — view the shape (rows, columns) of the dataset
Check the first and last rows of train
Train_data.head().append(Train_data.tail())
Check the shape of the train dataset
In [4]:
Train_data.shape
executed in 6ms, finished 22:45:32 2021-03-19
Out[4]:
(100000, 3)
Check the first and last rows of testA
In [5]:
Test_data.head().append(Test_data.tail())
executed in 16ms, finished 22:45:32 2021-03-19
Out[5]:
| | id | heartbeat_signals |
|---|---|---|
| 0 | 100000 | 0.9915713654170097,1.0,0.6318163407681274,0.13... |
| 1 | 100001 | 0.6075533139615096,0.5417083883163654,0.340694... |
| 2 | 100002 | 0.9752726292239277,0.6710965234906665,0.686758... |
| 3 | 100003 | 0.9956348033996116,0.9170249621481004,0.521096... |
| 4 | 100004 | 1.0,0.8879490481178918,0.745564725322326,0.531... |
| 19995 | 119995 | 1.0,0.8330283177934747,0.6340472606311671,0.63... |
| 19996 | 119996 | 1.0,0.8259705825857048,0.4521053488322387,0.08... |
| 19997 | 119997 | 0.951744840752379,0.9162611283848351,0.6675251... |
| 19998 | 119998 | 0.9276692903808186,0.6771898159607004,0.242906... |
| 19999 | 119999 | 0.6653212231837624,0.527064114047737,0.5166625... |
Check the shape of the testA dataset
In [6]:
Test_data.shape
executed in 5ms, finished 22:45:32 2021-03-19
Out[6]:
(20000, 2)
Get into the habit of checking a dataset's head() and shape: it gives you confidence at every step, whereas skipping it can lead to a chain of errors later. If you are unsure about a pandas operation, run one step at a time and inspect the result — this is an effective way to understand each function as you use it.
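Note from the head() output above that heartbeat_signals is stored as one comma-separated string per row. A minimal sketch of expanding it into one numeric column per time step (the toy frame and `s_{i}` column names are assumptions for illustration, not the real data):

```python
import pandas as pd

# Toy frame mimicking the train.csv layout (values invented for illustration)
df = pd.DataFrame({
    "id": [0, 1],
    "heartbeat_signals": ["0.99,1.0,0.63", "0.60,0.54,0.34"],
})

# Split the comma-separated string into one float column per signal step
signals = df["heartbeat_signals"].str.split(",", expand=True).astype(float)
signals.columns = [f"s_{i}" for i in range(signals.shape[1])]
wide = pd.concat([df[["id"]], signals], axis=1)
print(wide.shape)  # (2, 4)
```

On the real 100000-row train set this produces one column per sample point, which is the usual first step before feature extraction on the signal.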
Data overview

describe() gives the summary statistics for each column: the count, mean, std, min, the 25%/50%/75% quantiles, and the max. This is mainly for quickly grasping the rough range of the data and judging abnormal values in each column — sometimes values like 999, 9999, or -1 turn out to be just another way of encoding NaN, which is worth watching for.

info() shows the type of each column, which helps reveal special-symbol anomalies beyond NaN.

data.describe() — get the summary statistics
data.info() — get the data types
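When sentinel codes like 999, 9999, or -1 do show up in describe(), a common fix is to map them back to NaN before any further analysis. A small sketch on a hypothetical column (this dataset itself has no such codes):

```python
import numpy as np
import pandas as pd

# Hypothetical column where 999/9999/-1 are sentinel codes meaning "missing"
s = pd.Series([3.2, 999.0, -1.0, 4.1, 9999.0])

# Replace every sentinel with a proper NaN so isnull()/describe() see them
cleaned = s.replace([999.0, 9999.0, -1.0], np.nan)
print(int(cleaned.isnull().sum()))  # 3
```

After this, the usual missing-value checks below (isnull().sum()) count these entries correctly.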
Get the summary statistics of the train data
In [7]:
Train_data.describe()
executed in 23ms, finished 22:45:32 2021-03-19
Out[7]:
| | id | label |
|---|---|---|
| count | 100000.000000 | 100000.000000 |
| mean | 49999.500000 | 0.856960 |
| std | 28867.657797 | 1.217084 |
| min | 0.000000 | 0.000000 |
| 25% | 24999.750000 | 0.000000 |
| 50% | 49999.500000 | 0.000000 |
| 75% | 74999.250000 | 2.000000 |
| max | 99999.000000 | 3.000000 |
Get the train data types
In [8]:
Train_data.info()
executed in 18ms, finished 22:45:32 2021-03-19
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 3 columns):
id 100000 non-null int64
heartbeat_signals 100000 non-null object
label 100000 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 2.3+ MB
Get the summary statistics of the testA data
In [9]:
Test_data.describe()
executed in 12ms, finished 22:45:32 2021-03-19
Out[9]:
| | id |
|---|---|
| count | 20000.000000 |
| mean | 109999.500000 |
| std | 5773.647028 |
| min | 100000.000000 |
| 25% | 104999.750000 |
| 50% | 109999.500000 |
| 75% | 114999.250000 |
| max | 119999.000000 |
Get the testA data types
In [10]:
Test_data.info()
executed in 9ms, finished 22:45:32 2021-03-19
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
id 20000 non-null int64
heartbeat_signals 20000 non-null object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
Check for missing values and outliers

data.isnull().sum() — check the NaN situation in each column

Check the NaN counts per column in train
In [11]:
Train_data.isnull().sum()
executed in 17ms, finished 22:45:32 2021-03-19
Out[11]:
id 0
heartbeat_signals 0
label 0
dtype: int64
Check the NaN counts per column in testA
In [12]:
Test_data.isnull().sum()
executed in 9ms, finished 22:45:32 2021-03-19
Out[12]:
id 0
heartbeat_signals 0
dtype: int64
Understand the distribution of the target
In [13]:
Train_data['label']
executed in 7ms, finished 22:45:32 2021-03-19
Out[13]:
0 0.0
1 0.0
2 2.0
3 0.0
4 2.0
...
99995 0.0
99996 2.0
99997 3.0
99998 2.0
99999 0.0
Name: label, Length: 100000, dtype: float64
In [14]:
Train_data['label'].value_counts()
executed in 10ms, finished 22:45:32 2021-03-19
Out[14]:
0.0 64327
3.0 17912
2.0 14199
1.0 3562
Name: label, dtype: int64
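The counts above show a clearly imbalanced target: class 0 dominates while class 1 is rare. A small sketch quantifying this (counts copied from the value_counts() output; the imbalance-ratio metric is our own framing, not from the competition):

```python
import pandas as pd

# Class counts copied from the value_counts() output above
counts = pd.Series({0.0: 64327, 3.0: 17912, 2.0: 14199, 1.0: 3562})

freq = counts / counts.sum()             # relative frequency per class
imbalance = counts.max() / counts.min()  # majority:minority ratio
print(round(float(imbalance), 1))  # 18.1
```

A ratio of roughly 18:1 between the most and least common classes is worth keeping in mind later, e.g. when choosing class weights or evaluation metrics.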
In [15]:
## 1) Overall distribution (e.g. the unbounded Johnson distribution)
import scipy.stats as st
y = Train_data['label']
plt.figure(1); plt.title('Default')
sns.distplot(y, rug=True, bins=20)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
executed in 3.48s, finished 22:45:36 2021-03-19
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ec2a699160>
In [16]:
# 2) Check skewness and kurtosis
sns.distplot(Train_data['label']);
print("Skewness: %f" % Train_data['label'].skew())
print("Kurtosis: %f" % Train_data['label'].kurt())
executed in 338ms, finished 22:45:36 2021-03-19
Skewness: 0.871005
Kurtosis: -1.009573
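For intuition on what these two numbers mean: pandas' .skew() and .kurt() are bias-corrected sample estimators, and .kurt() reports excess kurtosis (a normal distribution gives 0, so the -1.01 above means a flatter-than-normal shape). A sketch cross-checking pandas against scipy on toy data (not the competition labels):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy sample; any 1-D numeric data works
x = np.array([0.0, 0.0, 0.0, 1.0, 2.0, 2.0, 3.0])
s = pd.Series(x)

# scipy with bias=False matches pandas' adjusted estimators;
# kurtosis here is the excess kurtosis, like pandas .kurt()
sk_scipy = stats.skew(x, bias=False)
ku_scipy = stats.kurtosis(x, bias=False)
```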
In [17]:
Train_data.skew(), Train_data.kurt()
executed in 102ms, finished 22:45:36 2021-03-19
Out[17]:
(id 0.000000
label 0.871005
dtype: float64, id -1.200000
label -1.009573
dtype: float64)
In [18]:
sns.distplot(Train_data.kurt(), color='orange', axlabel='Kurtosis')
executed in 132ms, finished 22:45:36 2021-03-19
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1ec43389470>
In [19]:
## 3) Check the frequency of each target value
plt.hist(Train_data['label'], orientation = 'vertical',histtype = 'bar', color ='red')
plt.show()
executed in 79ms, finished 22:45:36 2021-03-19
Generate a data report with pandas_profiling
In [21]:
import pandas_profiling
pfr = pandas_profiling.ProfileReport(Train_data)
pfr.to_file("./example.html")
executed in 18.2s, finished 22:46:22 2021-03-19
Summary

Exploratory data analysis is the stage where we get a first look at the data and become familiar with it in preparation for feature engineering; quite often, features extracted during EDA can even be used directly as rules. That shows how important EDA is. The main work in this stage is to build an overall understanding of the data through simple statistics, analyze the relationships between the variables of each type, and visualize them with suitable plots for direct inspection. We hope this section helps beginners, and we welcome suggestions on its shortcomings.
Score-boosting tip 1

Push the model's predicted probabilities up or down: for example, when a class has a predicted probability above 0.9, set that class to 1 and the other classes to 0.
def max1min0(infilename="submit.csv", outfilename="submit_max1min0.csv"):
    data = pd.read_csv(infilename)
    print(data)
    for index, row in data.iterrows():
        row_max = max(list(row)[1:])  # highest class probability in this row
        if row_max > 0.9:
            for i in range(1, 5):
                if row[i] > 0.9:
                    data.iloc[index, i] = 1
                else:
                    data.iloc[index, i] = 0
    print(data)
    data.to_csv(outfilename, index=False)
This improved the score by 70 points.

You can also set the class with the highest probability to 1 and all the others to 0:
def maxmax(infilename="submit.csv", outfilename="submit_maxmax.csv"):
    data = pd.read_csv(infilename)
    print(data)
    for index, row in data.iterrows():
        tmp = list(row)[1:]
        row_max_idx = tmp.index(max(tmp)) + 1  # column index of the max probability
        for i in range(1, 5):
            if i == row_max_idx:
                data.iloc[index, i] = 1
            else:
                data.iloc[index, i] = 0
    print(data)
    data.to_csv(outfilename, index=False)
This improved the score to 410.
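Both functions above loop over the frame row by row with iterrows, which is slow on a 20000-row submission. A hedged vectorized sketch of the same "max class → 1, rest → 0" rule (the column names here are assumptions, since the real submission layout isn't shown):

```python
import numpy as np
import pandas as pd

# Assumed submission layout: id plus one probability column per class
sub = pd.DataFrame({
    "id": [0, 1],
    "label_0": [0.95, 0.10],
    "label_1": [0.02, 0.70],
    "label_2": [0.02, 0.15],
    "label_3": [0.01, 0.05],
})

probs = sub.iloc[:, 1:].to_numpy()
onehot = np.zeros_like(probs)
# Put a 1 at each row's argmax, 0 everywhere else
onehot[np.arange(len(probs)), probs.argmax(axis=1)] = 1.0
sub.iloc[:, 1:] = onehot
```

This replaces the nested Python loops with two numpy operations and produces the same one-hot rows as maxmax.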
References
https://tianchi.aliyun.com/competition/entrance/531883/introduction
https://github.com/datawhalechina/team-learning-data-mining/tree/master/HeartbeatClassification