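Using the iris dataset as an example, this post summarizes the general data-analysis workflow, shows some usage of Python's data-analysis libraries, and builds an iris classification model.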
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels
import os
import requests
import numpy as np
r = requests.get('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data')
r
<Response [200]>
path = os.getcwd()
path
'C:\\Users\\44587\\Python机器学习实战指南'
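# Use os.path.join so a path separator sits between the directory and the file name
data_file = os.path.join(path, 'iris.data')
with open(data_file, 'w') as f:
    f.write(r.text)
# iris.data has no header row, so the column names are supplied explicitly
df = pd.read_csv(data_file, names=['sepal length', 'sepal width', 'petal length',
                                   'petal width', 'Class'])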
- Exploratory data analysis
The goal of this step is to get an overall picture of the data, spot any obvious patterns, and clean the data.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 150 entries, 0 to 149
Data columns (total 5 columns):
sepal length 150 non-null float64
sepal width 150 non-null float64
petal length 150 non-null float64
petal width 150 non-null float64
Class 150 non-null object
dtypes: float64(4), object(1)
memory usage: 5.9+ KB
The data is complete and tidy: every column has 150 non-null values, so there are no missing values.
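The absence of missing values can also be checked directly instead of being read off the `df.info()` output. The following is a minimal sketch, assuming a small stand-in frame that holds only the first five rows of the dataset (the name `sample` is hypothetical; on the full data the same calls would be made on `df`):

```python
import pandas as pd

# Stand-in data: the first five rows of the iris dataset (hypothetical name `sample`)
sample = pd.DataFrame({
    'sepal length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal width': [0.2, 0.2, 0.2, 0.2, 0.2],
    'Class': ['Iris-setosa'] * 5,
})

# Count missing values per column; all zeros means nothing is missing
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Collapse to a single yes/no answer for the whole frame
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

A per-column count is usually more useful than the single boolean, because it shows which columns would need cleaning.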
df.describe()
| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
| mean | 5.843333 | 3.054000 | 3.758667 | 1.198667 |
| std | 0.828066 | 0.433594 | 1.764420 | 0.763161 |
| min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
| 25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
| 50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
| 75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
| max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
df.head()
| | sepal length | sepal width | petal length | petal width | Class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
df.iloc[:3,:4]
| | sepal length | sepal width | petal length | petal width |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
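`iloc` selects by integer position and excludes the end of a slice, whereas `loc` selects by label and includes it. The sketch below illustrates the difference on a stand-in frame holding the first five rows of the dataset (the name `sample` and the column slice passed to `loc` are assumptions chosen for illustration). With the default `RangeIndex`, row labels happen to coincide with positions, which makes the off-by-one easy to miss:

```python
import pandas as pd

# Stand-in data: the first five rows of the iris dataset (hypothetical name `sample`)
sample = pd.DataFrame({
    'sepal length': [5.1, 4.9, 4.7, 4.6, 5.0],
    'sepal width': [3.5, 3.0, 3.2, 3.1, 3.6],
    'petal length': [1.4, 1.4, 1.3, 1.5, 1.4],
    'petal width': [0.2, 0.2, 0.2, 0.2, 0.2],
    'Class': ['Iris-setosa'] * 5,
})

# Position-based: rows at positions 0, 1, 2 and the first four columns (end excluded)
by_position = sample.iloc[:3, :4]

# Label-based: rows labelled 0 through 3 and the named column range (both ends included)
by_label = sample.loc[:3, 'sepal length':'petal width']

print(by_position.shape)
print(by_label.shape)
```

The two selections cover the same columns, but the label-based one returns one more row because the slice end `3` is included.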
df.loc[:3,