Introduction to Data Analysis: Iris Classification as an Example

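Using the iris dataset as an example, this article summarizes the general workflow of a data analysis, demonstrates selected features of Python's data analysis libraries, and builds a classification model for iris species.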

  • Data acquisition and import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
import os
import requests
import numpy as np
#requests.get('URL') fetches a web resource and returns a Response object, which is stored here in the variable r.
r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
#Evaluating r shows that it is a Response object
r
<Response [200]>
#os.getcwd() returns the current working directory
path = os.getcwd()
path
'C:\\Users\\44587\\Python机器学习实战指南'
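#Use a with open block in write mode to create iris.data under the current path and write the data held in r to it
#response.text returns the body of the response as text
#os.path.join inserts the path separator that plain concatenation (path+'iris.data') would leave out
data_file = os.path.join(path,'iris.data')
with open(data_file,'w') as f:
    f.write(r.text)
#After writing, read the CSV file with pandas' read_csv; the file has no header row, so the names parameter supplies the column names as a list
df = pd.read_csv(data_file,names = ['sepal length','sepal width','petal length',
                                    'petal width','Class'])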
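Writing the text to disk is not strictly required. As a minimal sketch, assuming the downloaded text is already available as a string (for example r.text), it can be parsed in memory with io.StringIO. The helper name parse_iris and the two-row sample string are made up for this illustration; the sample rows are the first two rows of the dataset.

```python
import io

import pandas as pd

COLUMNS = ['sepal length', 'sepal width', 'petal length', 'petal width', 'Class']


def parse_iris(text):
    """Parse header-less iris CSV text into a DataFrame (hypothetical helper)."""
    # read_csv skips blank lines by default, so trailing empty lines are harmless
    return pd.read_csv(io.StringIO(text), names=COLUMNS)


# Offline demonstration: two data rows followed by a blank line
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n\n"
demo = parse_iris(sample)
print(demo.shape)
print(demo.dtypes)
```

When downloading for real, calling r.raise_for_status() before using r.text makes a failed request raise an error instead of silently saving an error page.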
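  • Exploratory data analysis
    The goal of this part is to get an overall picture of the data, spot any obvious patterns, and clean the data.
#Inspect the DataFrame to check the data types and whether any values are missing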
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length    150 non-null float64
sepal width     150 non-null float64
petal length    150 non-null float64
petal width     150 non-null float64
Class           150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB

The data is complete and tidy, with no missing values: every column has 150 non-null entries.
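The iris data happens to be clean, so the check ends here. As a rough sketch of what the same check looks like when values are missing, the toy frame below (made-up numbers, not part of the iris data) counts missing values per column with isna().sum() and drops incomplete rows with dropna().

```python
import numpy as np
import pandas as pd

# Toy frame with one deliberately missing value (illustrative numbers only)
toy = pd.DataFrame({
    'sepal length': [5.1, np.nan, 4.7],
    'sepal width': [3.5, 3.0, 3.2],
})

missing_per_column = toy.isna().sum()  # number of NaN values in each column
print(missing_per_column)

complete_rows = toy.dropna()  # keep only rows without any NaN
print(len(complete_rows))
```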

#Show summary statistics for the table
df.describe()
       sepal length  sepal width  petal length  petal width
count    150.000000   150.000000    150.000000   150.000000
mean       5.843333     3.054000      3.758667     1.198667
std        0.828066     0.433594      1.764420     0.763161
min        4.300000     2.000000      1.000000     0.100000
25%        5.100000     2.800000      1.600000     0.300000
50%        5.800000     3.000000      4.350000     1.300000
75%        6.400000     3.300000      5.100000     1.800000
max        7.900000     4.400000      6.900000     2.500000
#Show the first 5 rows
df.head()
   sepal length  sepal width  petal length  petal width        Class
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa
2           4.7          3.2           1.3          0.2  Iris-setosa
3           4.6          3.1           1.5          0.2  Iris-setosa
4           5.0          3.6           1.4          0.2  Iris-setosa
#Positional indexing: DataFrame.iloc[row positions, column positions]
df.iloc[:3,:4]
   sepal length  sepal width  petal length  petal width
0           5.1          3.5           1.4          0.2
1           4.9          3.0           1.4          0.2
2           4.7          3.2           1.3          0.2
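#Label-based indexing with row and column labels; unlike iloc, a .loc slice includes its end label, so :3 returns four rows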
df.loc[:3,'sepal length']
0    5.1
1    4.9
2    4.7
3    4.6
Name: sepal length, dtype: float64
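The two outputs above differ in length because the slicing rules differ: iloc slices by position and excludes the end, while loc slices by label and includes it. A small sketch on a Series built from the first five sepal length values makes the difference explicit.

```python
import pandas as pd

# First five sepal length values, with the default integer labels 0..4
s = pd.Series([5.1, 4.9, 4.7, 4.6, 5.0], name='sepal length')

by_position = s.iloc[:3]  # positions 0, 1, 2 (end excluded)
by_label = s.loc[:3]      # labels 0, 1, 2, 3 (end included)

print(len(by_position), len(by_label))
```

With the default integer index the labels and positions coincide, which is why the two forms are easy to confuse; after filtering or sorting they no longer line up.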
#List the classes; column.unique() returns the distinct values in a column, similar to DISTINCT in SQL
df.Class.unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
#Show the row labels in each group: the first 50 rows are Setosa, the middle 50 are Versicolor, and the last 50 are Virginica
df.groupby('Class').groups
{'Iris-setosa': Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
             17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
             34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
            dtype='int64'),
 'Iris-versicolor': Int64Index([50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
             67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
             84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
            dtype='int64'),
 'Iris-virginica': Int64Index([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
             113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
             126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
             139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
            dtype='int64')}
df.count()
sepal length    150
sepal width     150
petal length    150
petal width     150
Class           150
dtype: int64
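Besides listing row labels, groupby is more often used to summarize each class. The sketch below uses four rows copied from the outputs in this article (two setosa, two virginica), so the numbers are illustrative rather than results for the full dataset.

```python
import pandas as pd

# Four rows taken from the tables in this article
sample = pd.DataFrame({
    'petal length': [1.4, 1.4, 6.0, 5.1],
    'Class': ['Iris-setosa', 'Iris-setosa', 'Iris-virginica', 'Iris-virginica'],
})

counts = sample['Class'].value_counts()                 # number of rows per class
means = sample.groupby('Class')['petal length'].mean()  # mean petal length per class

print(counts)
print(means)
```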
#Collect the feature names into a list
labels = list(df.columns[:4])
labels
['sepal length', 'sepal width', 'petal length', 'petal width']
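#Select the rows whose class is Virginica and reset the index
#reset_index() returns a new DataFrame and keeps the old labels in an 'index' column; df1 itself is unchanged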
df1 = df[df.Class == 'Iris-virginica']
df1.reset_index()
    index  sepal length  sepal width  petal length  petal width           Class
0     100           6.3          3.3           6.0          2.5  Iris-virginica
1     101           5.8          2.7           5.1          1.9  Iris-virginica
2     102           7.1          3.0           5.9          2.1  Iris-virginica
3     103           6.3          2.9           5.6          1.8  Iris-virginica
4     104           6.5          3.0           5.8          2.2  Iris-virginica
5     105           7.6          3.0           6.6          2.1  Iris-virginica
6     106           4.9          2.5           4.5          1.7  Iris-virginica
7     107           7.3          2.9           6.3          1.8  Iris-virginica
8     108           6.7          2.5           5.8          1.8  Iris-virginica
9     109           7.2          3.6           6.1          2.5  Iris-virginica
10    110           6.5          3.2           5.1          2.0  Iris-virginica
11    111           6.4          2.7           5.3          1.9  Iris-virginica
12    112           6.8          3.0           5.5          2.1  Iris-virginica
13    113           5.7          2.5           5.0          2.0  Iris-virginica
14    114           5.8          2.8           5.1          2.4  Iris-virginica
15    115           6.4          3.2           5.3          2.3  Iris-virginica
16    116           6.5          3.0           5.5          1.8  Iris-virginica
17    117           7.7          3.8           6.7          2.2  Iris-virginica
18    118           7.7          2.6           6.9          2.3  Iris-virginica
19    119           6.0          2.2           5.0          1.5  Iris-virginica
20    120           6.9          3.2           5.7          2.3  Iris-virginica
21    121           5.6          2.8           4.9          2.0  Iris-virginica
22    122           7.7          2.8           6.7          2.0  Iris-virginica
23    123           6.3          2.7           4.9          1.8  Iris-virginica
24    124           6.7          3.3           5.7          2.1  Iris-virginica
25    125           7.2          3.2           6.0          1.8  Iris-virginica
26    126           6.2          2.8           4.8          1.8  Iris-virginica
27    127           6.1          3.0           4.9          1.8  Iris-virginica
28    128           6.4          2.8           5.6          2.1  Iris-virginica
29    129           7.2          3.0           5.8          1.6  Iris-virginica
30    130           7.4          2.8           6.1          1.9  Iris-virginica
31    131           7.9          3.8           6.4          2.0  Iris-virginica
32    132           6.4          2.8           5.6          2.2  Iris-virginica
33    133           6.3          2.8           5.1          1.5  Iris-virginica
34    134           6.1          2.6           5.6          1.4  Iris-virginica
35    135           7.7          3.0           6.1          2.3  Iris-virginica
36    136           6.3          3.4           5.6          2.4  Iris-virginica
37    137           6.4          3.1           5.5          1.8  Iris-virginica
38    138           6.0          3.0           4.8          1.8  Iris-virginica
39    139           6.9          3.1           5.4          2.1  Iris-virginica
40    140           6.7          3.1           5.6          2.4  Iris-virginica
41    141           6.9          3.1           5.1          2.3  Iris-virginica
42    142           5.8          2.7           5.1          1.9  Iris-virginica
43    143           6.8          3.2           5.9          2.3  Iris-virginica
44    144           6.7          3.3           5.7          2.5  Iris-virginica
45    145           6.7          3.0           5.2          2.3  Iris-virginica
46    146           6.3          2.5           5.0          1.9  Iris-virginica
47    147           6.5          3.0           5.2          2.0  Iris-virginica
48    148           6.2          3.4           5.4          2.3  Iris-virginica
49    149           5.9          3.0           5.1          1.8  Iris-virginica
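#corr() returns the pairwise linear (Pearson) correlation coefficients of the features
#Restrict to the four numeric columns; recent pandas versions raise an error if the string column Class is included
df.iloc[:,:4].corr()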
              sepal length  sepal width  petal length  petal width
sepal length      1.000000    -0.109369      0.871754     0.817954
sepal width      -0.109369     1.000000     -0.420516    -0.356544
petal length      0.871754    -0.420516      1.000000     0.962757
petal width       0.817954    -0.356544      0.962757     1.000000
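Each entry of this matrix is a Pearson correlation coefficient: the covariance of two columns divided by the product of their standard deviations. As a sketch of that calculation, the function below (pearson is a name chosen for this illustration) computes it with NumPy for the five rows shown by df.head(), so the result differs from the full-data values in the matrix.

```python
import numpy as np


def pearson(a, b):
    """Pearson correlation: covariance over the product of standard deviations."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    da = a - a.mean()  # deviations from the mean
    db = b - b.mean()
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())


# First five rows from df.head(): sepal length and sepal width
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]
sepal_width = [3.5, 3.0, 3.2, 3.1, 3.6]

r_value = pearson(sepal_length, sepal_width)
print(round(r_value, 4))
```

The result can be cross-checked against np.corrcoef(sepal_length, sepal_width)[0, 1], which computes the same quantity.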
#Seaborn is a high-level plotting library built on matplotlib that produces clean, attractive figures with little code
sns.pairplot(df,hue = 'Class')
<seaborn.axisgrid.PairGrid at 0x212708a40b8>

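The pair plot shows that petal length and petal width separate the three species well. The random forest later in this article confirms this: the two features have importances of about 0.38 and 0.48, roughly 86% between them.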

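Panel [1,0] of the pair plot suggests a roughly linear relationship between sepal length and sepal width for setosa; a linear regression on these two features follows later in this article.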

#Draw violin plots showing the distribution of each feature by class
fig,ax = plt.subplots(2,2,figsize =(8,8))
sns.set(style='white',palette='muted')
sns.violinplot(x = df['Class'],y=df['sepal length'],ax =ax[0,0])
sns.violinplot(x = df['Class'],y=df['sepal width'],ax =ax[0,1])
sns.violinplot(x = df['Class'],y=df['petal length'],ax =ax[1,0])
sns.violinplot(x = df['Class'],y=df['petal width'],ax =ax[1,1])
plt.tight_layout()

#Draw a histogram to see the distribution of sepal width
plt.style.use('ggplot')
fig,ax = plt.subplots(1,1,figsize=(4,4))
ax.hist(df['sepal width'],color = 'black')
ax.set_xlabel('sepal width')
plt.tight_layout()

  • Linear relationship between sepal width and sepal length for Setosa
#Scatter plot of the two features
fig,axes = plt.subplots(figsize = (7,7))
axes.scatter(df['sepal width'][df['Class'] == 'Iris-setosa'],df['sepal length'][df['Class'] == 'Iris-setosa'])
axes.set_xlabel('Sepal width')
axes.set_ylabel('Sepal length')
axes.set_title('Setosa Sepal Width vs. Sepal Length',y = 1.02)
Text(0.5, 1.02, 'Setosa Sepal Width vs. Sepal Length')

#Fit a linear model
import statsmodels.api as sm
y = df['sepal length'][df['Class'] == 'Iris-setosa']
x = df['sepal width'][df['Class'] == 'Iris-setosa']
X = sm.add_constant(x)

result = sm.OLS(y,X).fit()
print(result.summary())
D:\Anaconda\lib\site-packages\numpy\core\fromnumeric.py:2389: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)


                            OLS Regression Results                            
==============================================================================
Dep. Variable:           sepal length   R-squared:                       0.558
Model:                            OLS   Adj. R-squared:                  0.548
Method:                 Least Squares   F-statistic:                     60.52
Date:                Wed, 19 Jun 2019   Prob (F-statistic):           4.75e-10
Time:                        09:43:22   Log-Likelihood:                 2.0879
No. Observations:                  50   AIC:                           -0.1759
Df Residuals:                      48   BIC:                             3.648
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
const           2.6447      0.305      8.660      0.000       2.031       3.259
sepal width     0.6909      0.089      7.779      0.000       0.512       0.869
==============================================================================
Omnibus:                        0.252   Durbin-Watson:                   2.517
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.436
Skew:                          -0.110   Prob(JB):                        0.804
Kurtosis:                       2.599   Cond. No.                         34.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

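The fitted regression equation is:
sepal length = 0.6909*sepal width + 2.6447
The t-test on the slope gives a very small p-value (reported as 0.000), so the coefficient is statistically significant, and the F-test for the model as a whole is also significant (p = 4.75e-10). Because this is a simple one-variable regression, R-squared (0.558) and adjusted R-squared (0.548) are only moderate: sepal width explains a little over half of the variance in sepal length.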
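For a single predictor, the least-squares fit has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The sketch below (simple_ols is a name chosen for this illustration) implements that with NumPy on the five setosa rows shown by df.head(), so its numbers differ from the 50-row summary.

```python
import numpy as np


def simple_ols(x, y):
    """Return (intercept, slope, r_squared) for y = intercept + slope * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    slope = (dx * dy).sum() / (dx ** 2).sum()  # cov(x, y) / var(x)
    intercept = y.mean() - slope * x.mean()    # line passes through the means
    residuals = y - (intercept + slope * x)
    r_squared = 1.0 - (residuals ** 2).sum() / (dy ** 2).sum()
    return intercept, slope, r_squared


# Five setosa rows from df.head(); the statsmodels summary uses all 50
width = [3.5, 3.0, 3.2, 3.1, 3.6]
length = [5.1, 4.9, 4.7, 4.6, 5.0]

intercept, slope, r_squared = simple_ols(width, length)
print(intercept, slope, r_squared)
```

One consistency check on the summary itself: in a one-predictor regression the F-statistic equals the square of the slope's t-statistic, and 7.779 squared is about 60.51, which matches the reported 60.52 up to rounding.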

#Draw the regression line over the scatter plot
plt.plot(x,result.fittedvalues,label = 'Regression Line')
plt.scatter(x,y,label = 'data point',color = 'red')
plt.xlabel('Sepal Width')
plt.ylabel('Sepal Length')
plt.title('Regression line')
#loc must be the lowercase string 'best'; 'Best' is rejected by matplotlib 3.3 and later
plt.legend(loc = 'best')
<matplotlib.legend.Legend at 0x21274c1ac88>

  • Building a classification model with a random forest
#Import the required packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
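#Build and train the classifier
#Neither the split nor the forest is seeded here, so the scores vary from run to run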
X = df.iloc[:,:4]
y = df.iloc[:,4]
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify = y)
clf = RandomForestClassifier(max_depth=5,n_estimators=10).fit(X_train,y_train)
clf.score(X_train,y_train),clf.score(X_test,y_test)

Accuracy on the training set and on the test set:

(0.9910714285714286, 0.9736842105263158)
clf.feature_importances_

Feature importances:

array([0.10363298, 0.03755123, 0.37714949, 0.4816663 ])
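A single unseeded split gives a score that changes on every run, and with only 38 test rows one misclassification moves the test accuracy by more than two percentage points. A possible variant is sketched below: it fixes random_state so the numbers are reproducible and adds 5-fold cross-validation. It is an illustration rather than the setup used above, and it assumes scikit-learn's bundled copy of the iris data (load_iris), which may differ from the downloaded file in a few values.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Bundled iris data: 150 rows, 4 features, integer class labels
X, y = load_iris(return_X_y=True)

# random_state fixes both the split and the forest, so reruns give identical numbers
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

# 5-fold cross-validation averages over several splits instead of relying on one
cv_scores = cross_val_score(
    RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0),
    X, y, cv=5)

print(test_accuracy, cv_scores.mean())
print(clf.feature_importances_)  # importances are normalized to sum to 1
```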