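Task 2: Data Analysis
Competition: multi-class prediction of ECG heartbeat signals
Learning objectives
- Get familiar with the dataset and validate it, to confirm that it is suitable for the machine learning or deep learning work that follows
- Understand the relationships among the variables, and between the variables and the prediction target
- Carry out exploratory analysis of the data and summarize it with charts or text
- Learn the basic methods for inspecting a dataset's essential information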
2.1 Import the data science and visualization libraries and load the data
import warnings
warnings.filterwarnings('ignore')
import missingno as msno
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
Train_data = pd.read_csv('./train.csv')
Test_data = pd.read_csv('./testA.csv')
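2.2 Get familiar with the basic information of the data
pd.concat([Train_data.head(), Train_data.tail()])  # inspect the first and last rows (DataFrame.append was removed in pandas 2.0)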
| | id | heartbeat_signals | label |
|---|---|---|---|
| 0 | 0 | 0.9912297987616655,0.9435330436439665,0.764677… | 0.0 |
| 1 | 1 | 0.9714822034884503,0.9289687459588268,0.572932… | 0.0 |
| 2 | 2 | 1.0,0.9591487564065292,0.7013782792997189,0.23… | 2.0 |
| 3 | 3 | 0.9757952826275774,0.9340884687738161,0.659636… | 0.0 |
| 4 | 4 | 0.0,0.055816398940721094,0.26129357194994196,0… | 2.0 |
| 99995 | 99995 | 1.0,0.677705342021188,0.22239242747868546,0.25… | 0.0 |
| 99996 | 99996 | 0.9268571578157265,0.9063471198026871,0.636993… | 2.0 |
| 99997 | 99997 | 0.9258351628306013,0.5873839035878395,0.633226… | 3.0 |
| 99998 | 99998 | 1.0,0.9947621698382489,0.8297017704865509,0.45… | 2.0 |
| 99999 | 99999 | 0.9259994004527861,0.916476635326053,0.4042900… | 0.0 |
Train_data.shape  # inspect the number of rows and columns
(100000, 3)
Train_data.describe()  # inspect summary statistics of the numeric columns
| | id | label |
|---|---|---|
| count | 100000.000000 | 100000.000000 |
| mean | 49999.500000 | 0.856960 |
| std | 28867.657797 | 1.217084 |
| min | 0.000000 | 0.000000 |
| 25% | 24999.750000 | 0.000000 |
| 50% | 49999.500000 | 0.000000 |
| 75% | 74999.250000 | 2.000000 |
| max | 99999.000000 | 3.000000 |
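Test_data.info()  # inspect column dtypes; the output below (20000 rows, no label column) belongs to the test set, not to Train_data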
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20000 non-null int64
1 heartbeat_signals 20000 non-null object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB
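The `info()` output lists `heartbeat_signals` as `object`, which means each cell holds one comma-separated string rather than a numeric vector. Below is a minimal sketch of one way to turn that column into a numeric matrix before any modelling. The helper name `signals_to_matrix` and the tiny `demo` frame are made up for illustration and are not part of the competition material.

```python
import numpy as np
import pandas as pd


def signals_to_matrix(df, column="heartbeat_signals"):
    """Split each comma-separated string into floats and stack the rows.

    Raises ValueError if the signals do not all have the same length,
    because a rectangular matrix cannot be built in that case.
    """
    rows = [np.array(s.split(","), dtype=np.float32) for s in df[column]]
    lengths = {r.size for r in rows}
    if len(lengths) != 1:
        raise ValueError(f"signals have differing lengths: {sorted(lengths)}")
    return np.vstack(rows)


# Tiny stand-in frame with the same layout as train.csv (values are invented).
demo = pd.DataFrame(
    {
        "id": [0, 1],
        "heartbeat_signals": ["1.0,0.5,0.25", "0.0,0.75,0.5"],
    }
)

X = signals_to_matrix(demo)
print(X.shape, X.dtype)
```

On the real data the same call would presumably be `signals_to_matrix(Train_data)`. The length check is there because nothing in the output above guarantees that every signal has the same number of samples.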
2.3 Check for missing data
Train_data.isnull().sum()  # count the null values in each column
id 0
heartbeat_signals 0
label 0
dtype: int64
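`isnull()` only counts cells that are actually NaN or None. Because the signals are stored as text, an empty string would slip through this check. The following sketch reports missing counts, missing ratios and blank strings together; `missing_report` and the `demo` frame are hypothetical names and values used only for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def missing_report(df):
    """Return per-column NaN counts, NaN ratios and blank-string counts."""
    n_missing = df.isnull().sum()
    # Blank strings are not NaN, so count them separately for text columns.
    n_blank = pd.Series(0, index=df.columns)
    for col in df.columns:
        if not is_numeric_dtype(df[col]):
            n_blank[col] = (df[col].astype(str).str.strip() == "").sum()
    return pd.DataFrame(
        {
            "n_missing": n_missing,
            "ratio": n_missing / len(df),
            "n_blank": n_blank,
        }
    )


# Invented example with one missing signal, one blank signal and one missing label.
demo = pd.DataFrame(
    {
        "id": [0, 1, 2],
        "heartbeat_signals": ["1.0,0.5", "", None],
        "label": [0.0, None, 2.0],
    }
)

report = missing_report(demo)
print(report)
```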
2.4 Understand the distribution of the prediction target
Train_data['label'].value_counts()  # inspect the distribution of the target
0.0 64327
3.0 17912
2.0 14199
1.0 3562
Name: label, dtype: int64
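The counts above show a clear class imbalance. The sketch below turns them into class proportions and into inverse-frequency class weights. The weight formula `n_samples / (n_classes * count)` is the common "balanced" heuristic; using it here is an assumption made for illustration, not something the original analysis prescribes.

```python
# Class counts copied from the value_counts() output above.
counts = {0.0: 64327, 3.0: 17912, 2.0: 14199, 1.0: 3562}

total = sum(counts.values())
n_classes = len(counts)

# Share of each class in the training set.
proportions = {label: c / total for label, c in counts.items()}

# Inverse-frequency weights: rare classes receive larger weights.
class_weights = {label: total / (n_classes * c) for label, c in counts.items()}

for label in sorted(counts):
    print(
        f"label {label}: share={proportions[label]:.4f}, "
        f"weight={class_weights[label]:.3f}"
    )
```

Weights of this kind can be handed to a loss function or to a classifier's class-weight option, if the chosen model supports one.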
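import scipy.stats as st  # statistics library, used here for distribution fitting
# Note: sns.distplot has been deprecated since seaborn 0.11; histplot and displot are its replacements.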
y = Train_data['label']
plt.figure(1); plt.title('Default')
sns.distplot(y, rug=True, bins=20)
plt.figure(2); plt.title('Normal')
sns.distplot(y, kde=False, fit=st.norm)
plt.figure(3); plt.title('Log Normal')
sns.distplot(y, kde=False, fit=st.lognorm)
[Figures omitted: this markdown file was written in Notepad, and I have not yet worked out how to embed images]
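Since `label` takes only four discrete values, a bar chart of the class counts is arguably easier to read than a fitted density. The sketch below draws one and saves it to disk, so that the image can later be referenced from a markdown file. The file name `label_distribution.png` is a hypothetical choice, and the counts are copied from the `value_counts()` output.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

# Class counts copied from the value_counts() output.
counts = {0.0: 64327, 1.0: 3562, 2.0: 14199, 3.0: 17912}

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar([str(label) for label in counts], list(counts.values()))
ax.set_xlabel("label")
ax.set_ylabel("count")
ax.set_title("Label distribution")
fig.tight_layout()

out_path = "label_distribution.png"  # hypothetical output file name
fig.savefig(out_path, dpi=150)
plt.close(fig)
```

A saved image can then be embedded in markdown with the standard image syntax, for example `![Label distribution](label_distribution.png)`.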
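sns.distplot(Train_data['label'])  # plot the label distribution before checking skewness and kurtosis
print("Skewness: %f" % Train_data['label'].skew())
print("Kurtosis: %f" % Train_data['label'].kurt())
Skewness: 0.871005
Kurtosis: -1.009573
Train_data.skew(numeric_only=True), Train_data.kurt(numeric_only=True)  # skewness and kurtosis of every numeric column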
(id 0.000000
label 0.871005
dtype: float64,
id -1.200000
label -1.009573
dtype: float64)
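To make the two `label` statistics above less of a black box, the following sketch recomputes them with NumPy from the class counts alone. It assumes that pandas reports the bias-adjusted sample skewness and the bias-adjusted excess (Fisher) kurtosis; the function names are made up for this example.

```python
import numpy as np

# Class counts copied from the value_counts() output.
counts = {0.0: 64327, 1.0: 3562, 2.0: 14199, 3.0: 17912}

# Rebuild the label vector: each label repeated as often as it occurs.
labels = np.repeat(list(counts.keys()), list(counts.values()))


def adjusted_skew(x):
    """Bias-adjusted sample skewness (assumed to match pandas' skew())."""
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    g1 = m3 / m2 ** 1.5  # plain moment-based skewness
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


def adjusted_excess_kurtosis(x):
    """Bias-adjusted excess kurtosis (assumed to match pandas' kurt())."""
    n = x.size
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m4 = np.mean(d ** 4)
    g2 = m4 / m2 ** 2 - 3.0  # plain moment-based excess kurtosis
    return (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)


skew = adjusted_skew(labels)
kurt = adjusted_excess_kurtosis(labels)
print(f"Skewness: {skew:.6f}")
print(f"Kurtosis: {kurt:.6f}")
```

Both values should land very close to the ones reported for `label`. For a categorical target like this one, a positive skewness mostly reflects that class 0 dominates, so it is better read as a symptom of class imbalance than as a property of a continuous distribution.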
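sns.distplot(Train_data.kurt(numeric_only=True), color='orange', axlabel='Kurtosis')
[Figure omitted again]
2.5 Summary
The goal of this stage is to understand how the training set is composed and to check for data that would hinder model training. This lays the groundwork for the data preprocessing that follows and settles the overall approach.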