This article uses the iris dataset as an example to summarize the typical data-analysis workflow, demonstrate some of Python's data-analysis libraries, and build an iris classification model.
- Data acquisition and import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
import os
import requests
import numpy as np
#requests.get('URL') fetches the page at URL and returns a Response object, which we store in r
r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
#Returning r shows that it is a Response object
r
<Response [200]>
#os.getcwd() returns the current working directory
path = os.getcwd()
path
'C:\\Users\\44587\\Python机器学习实战指南'
#Use Python's with open in write mode to create iris.data under path and write the data stored in r
#response.text holds the text content of the response
#(os.path.join inserts the path separator that plain string concatenation would miss)
with open(os.path.join(path, 'iris.data'), 'w') as f:
    f.write(r.text)
#After writing, read the CSV file with pandas' read_csv; the names parameter takes a list of column names
df = pd.read_csv(os.path.join(path, 'iris.data'),
                 names=['sepal length', 'sepal width', 'petal length',
                        'petal width', 'Class'])
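The write-then-read pattern above can be sketched end to end with a small in-memory sample (the file name and sample rows here are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# A few sample rows in the same comma-separated, header-less format as iris.data
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Write the text to a file, then read it back with column names supplied via `names`
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, 'iris_sample.data')
with open(data_path, 'w') as f:
    f.write(sample)

df = pd.read_csv(data_path, names=['sepal length', 'sepal width',
                                   'petal length', 'petal width', 'Class'])
print(df.shape)  # (2, 5)
```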
- Exploratory data analysis
The goal of this stage is to build an overall picture of the data, surface any obvious patterns, and clean the data.
#Inspect the DataFrame: check the dtypes and look for missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length 150 non-null float64
sepal width 150 non-null float64
petal length 150 non-null float64
petal width 150 non-null float64
Class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
As the output shows, the data is complete and tidy, with no missing values.
#View summary statistics for the table
df.describe()
 | sepal length | sepal width | petal length | petal width
---|---|---|---|---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
#View the first 5 rows
df.head()
 | sepal length | sepal width | petal length | petal width | Class
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
#Positional indexing with DataFrame.iloc[row index, column index]
df.iloc[:3,:4]
 | sepal length | sepal width | petal length | petal width
---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
#Label-based indexing with loc; note that loc slices include the end label, so :3 returns rows 0 through 3
df.loc[:3,'sepal length']
0 5.1
1 4.9
2 4.7
3 4.6
Name: sepal length, dtype: float64
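The difference in slice semantics between the two indexers is easy to trip over: loc includes the end label while iloc excludes the end position. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50]})

by_label = df.loc[:3, 'a']    # label-based: rows 0..3 inclusive -> 4 values
by_position = df.iloc[:3, 0]  # position-based: rows 0..2 -> 3 values

print(len(by_label), len(by_position))  # 4 3
```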
#View the classes; Series.unique() returns the distinct values in a column, similar to SQL's DISTINCT
df.Class.unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
#View the detailed group membership: the first 50 rows are Setosa, the middle 50 Versicolor, and the last 50 Virginica
df.groupby('Class').groups
{'Iris-setosa': Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
dtype='int64'),
'Iris-versicolor': Int64Index([50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
dtype='int64'),
'Iris-virginica': Int64Index([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
dtype='int64')}
#Count non-null values in each column
df.count()
sepal length 150
sepal width 150
petal length 150
petal width 150
Class 150
dtype: int64
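df.count() counts non-null entries per column; to count rows per class, value_counts (or a groupby aggregation) is handier. A sketch on a toy frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({'Class': ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 2,
                   'petal length': [1.4, 1.3, 1.5, 4.5, 4.7]})

# Number of rows per class
counts = df['Class'].value_counts()

# Per-class mean of a numeric feature
means = df.groupby('Class')['petal length'].mean()

print(counts['Iris-setosa'], means['Iris-versicolor'])  # 3 4.6
```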
#Extract the feature names into a list
labels = list(df.columns[:4])
labels
['sepal length', 'sepal width', 'petal length', 'petal width']
#Select the rows whose class is Virginica and reset the index
#(reset_index returns a new DataFrame; df1 itself is unchanged)
df1 = df[df.Class == 'Iris-virginica']
df1.reset_index()
 | index | sepal length | sepal width | petal length | petal width | Class
---|---|---|---|---|---|---
0 | 100 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
1 | 101 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
2 | 102 | 7.1 | 3.0 | 5.9 | 2.1 | Iris-virginica |
3 | 103 | 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
4 | 104 | 6.5 | 3.0 | 5.8 | 2.2 | Iris-virginica |
5 | 105 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
6 | 106 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica |
7 | 107 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
8 | 108 | 6.7 | 2.5 | 5.8 | 1.8 | Iris-virginica |
9 | 109 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
10 | 110 | 6.5 | 3.2 | 5.1 | 2.0 | Iris-virginica |
11 | 111 | 6.4 | 2.7 | 5.3 | 1.9 | Iris-virginica |
12 | 112 | 6.8 | 3.0 | 5.5 | 2.1 | Iris-virginica |
13 | 113 | 5.7 | 2.5 | 5.0 | 2.0 | Iris-virginica |
14 | 114 | 5.8 | 2.8 | 5.1 | 2.4 | Iris-virginica |
15 | 115 | 6.4 | 3.2 | 5.3 | 2.3 | Iris-virginica |
16 | 116 | 6.5 | 3.0 | 5.5 | 1.8 | Iris-virginica |
17 | 117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
18 | 118 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
19 | 119 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica |
20 | 120 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
21 | 121 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
22 | 122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
23 | 123 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
24 | 124 | 6.7 | 3.3 | 5.7 | 2.1 | Iris-virginica |
25 | 125 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |
26 | 126 | 6.2 | 2.8 | 4.8 | 1.8 | Iris-virginica |
27 | 127 | 6.1 | 3.0 | 4.9 | 1.8 | Iris-virginica |
28 | 128 | 6.4 | 2.8 | 5.6 | 2.1 | Iris-virginica |
29 | 129 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
30 | 130 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
31 | 131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
32 | 132 | 6.4 | 2.8 | 5.6 | 2.2 | Iris-virginica |
33 | 133 | 6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
34 | 134 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
35 | 135 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
36 | 136 | 6.3 | 3.4 | 5.6 | 2.4 | Iris-virginica |
37 | 137 | 6.4 | 3.1 | 5.5 | 1.8 | Iris-virginica |
38 | 138 | 6.0 | 3.0 | 4.8 | 1.8 | Iris-virginica |
39 | 139 | 6.9 | 3.1 | 5.4 | 2.1 | Iris-virginica |
40 | 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
41 | 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
42 | 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
43 | 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
44 | 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
45 | 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
46 | 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
47 | 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
48 | 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
49 | 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
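As the table shows, reset_index keeps the old index as a new 'index' column by default; passing drop=True discards it instead. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=[100, 101, 102])

kept = df.reset_index()              # old index preserved as an 'index' column
dropped = df.reset_index(drop=True)  # old index discarded, fresh 0..n-1 index

print(list(kept.columns), list(dropped.index))  # ['index', 'x'] [0, 1, 2]
```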
#df.corr() returns the pairwise linear correlation coefficients; select the four numeric feature columns first
df.iloc[:,:4].corr()
 | sepal length | sepal width | petal length | petal width
---|---|---|---|---
sepal length | 1.000000 | -0.109369 | 0.871754 | 0.817954 |
sepal width | -0.109369 | 1.000000 | -0.420516 | -0.356544 |
petal length | 0.871754 | -0.420516 | 1.000000 | 0.962757 |
petal width | 0.817954 | -0.356544 | 0.962757 | 1.000000 |
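A correlation matrix is often easier to read as a heatmap via seaborn's heatmap function; a sketch on a toy frame with known correlations (the Agg backend is used here only so the figure renders off-screen):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with a
                   'c': [4.0, 3.0, 2.0, 1.0]})  # perfectly anti-correlated with a

corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.tight_layout()

print(corr.loc['a', 'b'], corr.loc['a', 'c'])  # 1.0 -1.0
```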
#Seaborn is a high-level plotting library built on matplotlib that produces clean, attractive figures concisely
sns.pairplot(df,hue = 'Class')
<seaborn.axisgrid.PairGrid at 0x212708a40b8>
The pair plot shows that petal length and petal width separate the three iris species well; the random forest below confirms this, with each of the two features contributing close to 40% of the total importance.
Panel [1,0] also suggests a roughly linear relationship between Setosa's sepal length and sepal width, which is analyzed with linear regression below.
#Draw violin plots showing the per-class distribution of each feature
fig,ax = plt.subplots(2,2,figsize =(8,8))
sns.set(style='white',palette='muted')
sns.violinplot(x = df['Class'],y=df['sepal length'],ax =ax[0,0])
sns.violinplot(x = df['Class'],y=df['sepal width'],ax =ax[0,1])
sns.violinplot(x = df['Class'],y=df['petal length'],ax =ax[1,0])
sns.violinplot(x = df['Class'],y=df['petal width'],ax =ax[1,1])
plt.tight_layout()
#Draw a histogram of the sepal width distribution
plt.style.use('ggplot')
fig,ax = plt.subplots(1,1,figsize=(4,4))
ax.hist(df['sepal width'],color = 'black')
ax.set_xlabel('sepal width')
plt.tight_layout()
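The binning that ax.hist performs can also be computed directly with numpy.histogram, which is handy when the counts themselves are needed; a sketch on made-up sepal-width values:

```python
import numpy as np

widths = np.array([2.0, 2.5, 3.0, 3.0, 3.1, 3.4, 3.5, 4.0])

# Three equal-width bins over [2.0, 4.0]
counts, edges = np.histogram(widths, bins=3, range=(2.0, 4.0))

print(counts, edges)  # counts per bin, followed by the 4 bin edges
```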
- Linear correlation analysis of Setosa's Sepal Width and Sepal Length
#Scatter plot of the two features
fig,axes = plt.subplots(figsize = (7,7))
axes.scatter(df['sepal width'][df['Class'] == 'Iris-setosa'],df['sepal length'][df['Class'] == 'Iris-setosa'])
axes.set_xlabel('Sepal width')
axes.set_ylabel('Sepal length')
axes.set_title('Setosa Sepal Width vs. Sepal Length',y = 1.02)
Text(0.5, 1.02, 'Setosa Sepal Width vs. Sepal Length')
#Build and fit a linear model
import statsmodels.api as sm
y = df['sepal length'][df['Class'] == 'Iris-setosa']
x = df['sepal width'][df['Class'] == 'Iris-setosa']
X = sm.add_constant(x)
result = sm.OLS(y,X).fit()
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sepal length R-squared: 0.558
Model: OLS Adj. R-squared: 0.548
Method: Least Squares F-statistic: 60.52
Date: Wed, 19 Jun 2019 Prob (F-statistic): 4.75e-10
Time: 09:43:22 Log-Likelihood: 2.0879
No. Observations: 50 AIC: -0.1759
Df Residuals: 48 BIC: 3.648
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const 2.6447 0.305 8.660 0.000 2.031 3.259
sepal width 0.6909 0.089 7.779 0.000 0.512 0.869
==============================================================================
Omnibus: 0.252 Durbin-Watson: 2.517
Prob(Omnibus): 0.882 Jarque-Bera (JB): 0.436
Skew: -0.110 Prob(JB): 0.804
Kurtosis: 2.599 Cond. No. 34.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The fitted regression equation is:
sepal length = 0.6909 * sepal width + 2.6447
The t-test p-values of both coefficients are essentially zero, so the coefficients are significant, and the overall F-test is significant as well. Owing to the limitations of simple linear regression, R-squared and adjusted R-squared are only moderate (about 0.55), so the fit is modest.
#Draw the regression line on the scatter plot
plt.plot(x,result.fittedvalues,label = 'Regression Line')
plt.scatter(x,y,label = 'data point',color = 'red')
plt.xlabel('Sepal Width')
plt.ylabel('Sepal Length')
plt.title('Regression line')
plt.legend(loc = 'best')
<matplotlib.legend.Legend at 0x21274c1ac88>
- Building a classification model with a random forest
#Import the relevant packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#Build and train the classifier
X = df.iloc[:,:4]
y = df.iloc[:,4]
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify = y)
clf = RandomForestClassifier(max_depth=5,n_estimators=10).fit(X_train,y_train)
clf.score(X_train,y_train),clf.score(X_test,y_test)
Training-set and test-set accuracy:
(0.9910714285714286, 0.9736842105263158)
clf.feature_importances_
Feature importances (in column order: sepal length, sepal width, petal length, petal width):
array([0.10363298, 0.03755123, 0.37714949, 0.4816663 ])
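A single train/test split can be noisy on only 150 rows; cross_val_score gives a steadier accuracy estimate. A sketch using scikit-learn's bundled copy of the iris data, so it runs without the downloaded file:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Same hyperparameters as above; random_state fixed for reproducibility
clf = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```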