This article uses the iris dataset as an example to summarize the typical data-analysis workflow, demonstrate some of Python's data-analysis libraries, and build an iris classification model.
- Data acquisition and import
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
import os
import requests
import numpy as np
#requests.get('URL') fetches the page at URL and returns a Response object, which we store in r
r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
#Returning r shows that it is a Response object
r
<Response [200]>
#os.getcwd() returns the current working directory
path = os.getcwd()
path
'C:\\Users\\44587\\Python机器学习实战指南'
#Use Python's with open in write mode to create iris.data under path and write the data stored in r
#response.text holds the text content of the response
#(os.path.join inserts the path separator that plain string concatenation would miss)
with open(os.path.join(path, 'iris.data'), 'w') as f:
    f.write(r.text)
#After writing, read the CSV file with pandas' read_csv; the names parameter takes a list of column names
df = pd.read_csv(os.path.join(path, 'iris.data'),
                 names=['sepal length', 'sepal width', 'petal length',
                        'petal width', 'Class'])
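The write-then-read pattern above can be sketched end to end with a small in-memory sample (the file name and sample rows here are made up for illustration):

```python
import os
import tempfile

import pandas as pd

# A few sample rows in the same comma-separated, header-less format as iris.data
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n"

# Write the text to a file, then read it back with column names supplied via `names`
tmpdir = tempfile.mkdtemp()
data_path = os.path.join(tmpdir, 'iris_sample.data')
with open(data_path, 'w') as f:
    f.write(sample)

df = pd.read_csv(data_path, names=['sepal length', 'sepal width',
                                   'petal length', 'petal width', 'Class'])
print(df.shape)  # (2, 5)
```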
- Exploratory data analysis
The goal of this stage is to build an overall picture of the data, surface any obvious patterns, and clean the data.
#Inspect the DataFrame: check the dtypes and look for missing values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length 150 non-null float64
sepal width 150 non-null float64
petal length 150 non-null float64
petal width 150 non-null float64
Class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
As the output shows, the data is complete and tidy, with no missing values.
#View summary statistics for the table
df.describe()
 | sepal length | sepal width | petal length | petal width
---|---|---|---|---
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
#View the first 5 rows
df.head()
 | sepal length | sepal width | petal length | petal width | Class
---|---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
#Positional indexing with DataFrame.iloc[row index, column index]
df.iloc[:3,:4]
 | sepal length | sepal width | petal length | petal width
---|---|---|---|---
0 | 5.1 | 3.5 | 1.4 | 0.2 |
1 | 4.9 | 3.0 | 1.4 | 0.2 |
2 | 4.7 | 3.2 | 1.3 | 0.2 |
#Label-based indexing with loc; note that loc slices include the end label, so :3 returns rows 0 through 3
df.loc[:3,'sepal length']
0 5.1
1 4.9
2 4.7
3 4.6
Name: sepal length, dtype: float64
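The difference in slice semantics between the two indexers is easy to trip over: loc includes the end label while iloc excludes the end position. A minimal sketch on a toy frame:

```python
import pandas as pd

df = pd.DataFrame({'a': [10, 20, 30, 40, 50]})

by_label = df.loc[:3, 'a']    # label-based: rows 0..3 inclusive -> 4 values
by_position = df.iloc[:3, 0]  # position-based: rows 0..2 -> 3 values

print(len(by_label), len(by_position))  # 4 3
```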
#View the classes; Series.unique() returns the distinct values in a column, similar to SQL's DISTINCT
df.Class.unique()
array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)
#View the detailed group membership: the first 50 rows are Setosa, the middle 50 Versicolor, and the last 50 Virginica
df.groupby('Class').groups
{'Iris-setosa': Int64Index([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
dtype='int64'),
'Iris-versicolor': Int64Index([50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66,
67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83,
84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99],
dtype='int64'),
'Iris-virginica': Int64Index([100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112,
113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125,
126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138,
139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149],
dtype='int64')}
#Count non-null values in each column
df.count()
sepal length 150
sepal width 150
petal length 150
petal width 150
Class 150
dtype: int64
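df.count() counts non-null entries per column; to count rows per class, value_counts (or a groupby aggregation) is handier. A sketch on a toy frame with the same column names:

```python
import pandas as pd

df = pd.DataFrame({'Class': ['Iris-setosa'] * 3 + ['Iris-versicolor'] * 2,
                   'petal length': [1.4, 1.3, 1.5, 4.5, 4.7]})

# Number of rows per class
counts = df['Class'].value_counts()

# Per-class mean of a numeric feature
means = df.groupby('Class')['petal length'].mean()

print(counts['Iris-setosa'], means['Iris-versicolor'])  # 3 4.6
```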
#Extract the feature names into a list
labels = list(df.columns[:4])
labels
['sepal length', 'sepal width', 'petal length', 'petal width']
#Select the rows whose class is Virginica and reset the index
#(reset_index returns a new DataFrame; df1 itself is unchanged)
df1 = df[df.Class == 'Iris-virginica']
df1.reset_index()
 | index | sepal length | sepal width | petal length | petal width | Class
---|---|---|---|---|---|---
0 | 100 | 6.3 | 3.3 | 6.0 | 2.5 | Iris-virginica |
1 | 101 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
2 | 102 | 7.1 | 3.0 | 5.9 | 2.1 | Iris-virginica |
3 | 103 | 6.3 | 2.9 | 5.6 | 1.8 | Iris-virginica |
4 | 104 | 6.5 | 3.0 | 5.8 | 2.2 | Iris-virginica |
5 | 105 | 7.6 | 3.0 | 6.6 | 2.1 | Iris-virginica |
6 | 106 | 4.9 | 2.5 | 4.5 | 1.7 | Iris-virginica |
7 | 107 | 7.3 | 2.9 | 6.3 | 1.8 | Iris-virginica |
8 | 108 | 6.7 | 2.5 | 5.8 | 1.8 | Iris-virginica |
9 | 109 | 7.2 | 3.6 | 6.1 | 2.5 | Iris-virginica |
10 | 110 | 6.5 | 3.2 | 5.1 | 2.0 | Iris-virginica |
11 | 111 | 6.4 | 2.7 | 5.3 | 1.9 | Iris-virginica |
12 | 112 | 6.8 | 3.0 | 5.5 | 2.1 | Iris-virginica |
13 | 113 | 5.7 | 2.5 | 5.0 | 2.0 | Iris-virginica |
14 | 114 | 5.8 | 2.8 | 5.1 | 2.4 | Iris-virginica |
15 | 115 | 6.4 | 3.2 | 5.3 | 2.3 | Iris-virginica |
16 | 116 | 6.5 | 3.0 | 5.5 | 1.8 | Iris-virginica |
17 | 117 | 7.7 | 3.8 | 6.7 | 2.2 | Iris-virginica |
18 | 118 | 7.7 | 2.6 | 6.9 | 2.3 | Iris-virginica |
19 | 119 | 6.0 | 2.2 | 5.0 | 1.5 | Iris-virginica |
20 | 120 | 6.9 | 3.2 | 5.7 | 2.3 | Iris-virginica |
21 | 121 | 5.6 | 2.8 | 4.9 | 2.0 | Iris-virginica |
22 | 122 | 7.7 | 2.8 | 6.7 | 2.0 | Iris-virginica |
23 | 123 | 6.3 | 2.7 | 4.9 | 1.8 | Iris-virginica |
24 | 124 | 6.7 | 3.3 | 5.7 | 2.1 | Iris-virginica |
25 | 125 | 7.2 | 3.2 | 6.0 | 1.8 | Iris-virginica |
26 | 126 | 6.2 | 2.8 | 4.8 | 1.8 | Iris-virginica |
27 | 127 | 6.1 | 3.0 | 4.9 | 1.8 | Iris-virginica |
28 | 128 | 6.4 | 2.8 | 5.6 | 2.1 | Iris-virginica |
29 | 129 | 7.2 | 3.0 | 5.8 | 1.6 | Iris-virginica |
30 | 130 | 7.4 | 2.8 | 6.1 | 1.9 | Iris-virginica |
31 | 131 | 7.9 | 3.8 | 6.4 | 2.0 | Iris-virginica |
32 | 132 | 6.4 | 2.8 | 5.6 | 2.2 | Iris-virginica |
33 | 133 | 6.3 | 2.8 | 5.1 | 1.5 | Iris-virginica |
34 | 134 | 6.1 | 2.6 | 5.6 | 1.4 | Iris-virginica |
35 | 135 | 7.7 | 3.0 | 6.1 | 2.3 | Iris-virginica |
36 | 136 | 6.3 | 3.4 | 5.6 | 2.4 | Iris-virginica |
37 | 137 | 6.4 | 3.1 | 5.5 | 1.8 | Iris-virginica |
38 | 138 | 6.0 | 3.0 | 4.8 | 1.8 | Iris-virginica |
39 | 139 | 6.9 | 3.1 | 5.4 | 2.1 | Iris-virginica |
40 | 140 | 6.7 | 3.1 | 5.6 | 2.4 | Iris-virginica |
41 | 141 | 6.9 | 3.1 | 5.1 | 2.3 | Iris-virginica |
42 | 142 | 5.8 | 2.7 | 5.1 | 1.9 | Iris-virginica |
43 | 143 | 6.8 | 3.2 | 5.9 | 2.3 | Iris-virginica |
44 | 144 | 6.7 | 3.3 | 5.7 | 2.5 | Iris-virginica |
45 | 145 | 6.7 | 3.0 | 5.2 | 2.3 | Iris-virginica |
46 | 146 | 6.3 | 2.5 | 5.0 | 1.9 | Iris-virginica |
47 | 147 | 6.5 | 3.0 | 5.2 | 2.0 | Iris-virginica |
48 | 148 | 6.2 | 3.4 | 5.4 | 2.3 | Iris-virginica |
49 | 149 | 5.9 | 3.0 | 5.1 | 1.8 | Iris-virginica |
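As the table shows, reset_index keeps the old index as a new 'index' column by default; passing drop=True discards it instead. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]}, index=[100, 101, 102])

kept = df.reset_index()              # old index preserved as an 'index' column
dropped = df.reset_index(drop=True)  # old index discarded, fresh 0..n-1 index

print(list(kept.columns), list(dropped.index))  # ['index', 'x'] [0, 1, 2]
```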
#df.corr() returns the pairwise linear correlation coefficients; select the four numeric feature columns first
df.iloc[:,:4].corr()
 | sepal length | sepal width | petal length | petal width
---|---|---|---|---
sepal length | 1.000000 | -0.109369 | 0.871754 | 0.817954 |
sepal width | -0.109369 | 1.000000 | -0.420516 | -0.356544 |
petal length | 0.871754 | -0.420516 | 1.000000 | 0.962757 |
petal width | 0.817954 | -0.356544 | 0.962757 | 1.000000 |
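A correlation matrix is often easier to read as a heatmap via seaborn's heatmap function; a sketch on a toy frame with known correlations (the Agg backend is used here only so the figure renders off-screen):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({'a': [1.0, 2.0, 3.0, 4.0],
                   'b': [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with a
                   'c': [4.0, 3.0, 2.0, 1.0]})  # perfectly anti-correlated with a

corr = df.corr()
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.tight_layout()

print(corr.loc['a', 'b'], corr.loc['a', 'c'])  # 1.0 -1.0
```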
#Seaborn is a high-level plotting library built on matplotlib that produces clean, attractive figures concisely
sns.pairplot(df,hue = 'Class')
<seaborn.axisgrid.PairGrid at 0x212708a40b8>
The pair plot shows that petal length and petal width separate the three iris species well; the random forest below confirms this, with each of the two features contributing close to 40% of the total importance.
Panel [1,0] also suggests a roughly linear relationship between Setosa's sepal length and sepal width, which is analyzed with linear regression below.
#Draw violin plots showing the per-class distribution of each feature
fig,ax = plt.subplots(2,2,figsize =(8,8))
sns.set(style='white',palette='muted')
sns.violinplot(x = df['Class'],y=df['sepal length'],ax =ax[0,0])
sns.violinplot(x = df['Class'],y=df['sepal width'],ax =ax[0,1])
sns.violinplot(x = df['Class'],y=df['petal length'],ax =ax[1,0])
sns.violinplot(x = df['Class'],y=df['petal width'],ax =ax[1,1])
plt.tight_layout()
#Draw a histogram of the sepal width distribution
plt.style.use('ggplot')
fig,ax = plt.subplots(1,1,figsize=(4,4))
ax.hist(df['sepal width'],color = 'black')
ax.set_xlabel('sepal width')
plt.tight_layout()
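The binning that ax.hist performs can also be computed directly with numpy.histogram, which is handy when the counts themselves are needed; a sketch on made-up sepal-width values:

```python
import numpy as np

widths = np.array([2.0, 2.5, 3.0, 3.0, 3.1, 3.4, 3.5, 4.0])

# Three equal-width bins over [2.0, 4.0]
counts, edges = np.histogram(widths, bins=3, range=(2.0, 4.0))

print(counts, edges)  # counts per bin, followed by the 4 bin edges
```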
- Linear correlation analysis of Setosa's Sepal Width and Sepal Length
#Scatter plot of the two features
fig,axes = plt.subplots(figsize = (7,7))
axes.scatter(df['sepal width'][df['Class'] == 'Iris-setosa'],df['sepal length'][df['Class'] == 'Iris-setosa'])
axes.set_xlabel('Sepal width')
axes.set_ylabel('Sepal length')
axes.set_title('Setosa Sepal Width vs. Sepal Length',y = 1.02)
Text(0.5, 1.02, 'Setosa Sepal Width vs. Sepal Length')
#Build and fit a linear model
import statsmodels.api as sm
y = df['sepal length'][df['Class'] == 'Iris-setosa']
x = df['sepal width'][df['Class'] == 'Iris-setosa']
X = sm.add_constant(x)
result = sm.OLS(y,X).fit()
print(result.summary())
OLS Regression Results
==============================================================================
Dep. Variable: sepal length R-squared: 0.558
Model: OLS Adj. R-squared: 0.548
Method: Least Squares F-statistic: 60.52
Date: Wed, 19 Jun 2019 Prob (F-statistic): 4.75e-10
Time: 09:43:22 Log-Likelihood: 2.0879
No. Observations: 50 AIC: -0.1759
Df Residuals: 48 BIC: 3.648
Df Model: 1
Covariance Type: nonrobust
===============================================================================
coef std err t P>|t| [0.025 0.975]
-------------------------------------------------------------------------------
const 2.6447 0.305 8.660 0.000 2.031 3.259
sepal width 0.6909 0.089 7.779 0.000 0.512 0.869
==============================================================================
Omnibus: 0.252 Durbin-Watson: 2.517
Prob(Omnibus): 0.882 Jarque-Bera (JB): 0.436
Skew: -0.110 Prob(JB): 0.804
Kurtosis: 2.599 Cond. No. 34.0
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The fitted regression equation is:
sepal length = 0.6909 * sepal width + 2.6447
The t-test p-values of both coefficients are essentially zero, so the coefficients are significant, and the overall F-test is significant as well. Owing to the limitations of simple linear regression, R-squared and adjusted R-squared are only moderate (about 0.55), so the fit is modest.
#Draw the regression line on the scatter plot
plt.plot(x,result.fittedvalues,label = 'Regression Line')
plt.scatter(x,y,label = 'data point',color = 'red')
plt.xlabel('Sepal Width')
plt.ylabel('Sepal Length')
plt.title('Regression line')
plt.legend(loc = 'best')
<matplotlib.legend.Legend at 0x21274c1ac88>
- Building a classification model with a random forest
#Import the relevant packages
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
#Build and train the classifier
X = df.iloc[:,:4]
y = df.iloc[:,4]
X_train,X_test,y_train,y_test = train_test_split(X,y,stratify = y)
clf = RandomForestClassifier(max_depth=5,n_estimators=10).fit(X_train,y_train)
clf.score(X_train,y_train),clf.score(X_test,y_test)
Training-set and test-set accuracy:
(0.9910714285714286, 0.9736842105263158)
clf.feature_importances_
Feature importances (in column order: sepal length, sepal width, petal length, petal width):
array([0.10363298, 0.03755123, 0.37714949, 0.4816663 ])
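A single train/test split can be noisy on only 150 rows; cross_val_score gives a steadier accuracy estimate. A sketch using scikit-learn's bundled copy of the iris data, so it runs without the downloaded file:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Same hyperparameters as above; random_state fixed for reproducibility
clf = RandomForestClassifier(max_depth=5, n_estimators=10, random_state=0)

# 5-fold cross-validation: five accuracy scores, one per held-out fold
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())
```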