1. Draw box plots to inspect the data visually
import pandas as pd
import matplotlib.pyplot as plt
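# 1 row, 8 columns: one subplot per numeric feature
fig,axes = plt.subplots(1,8)
# kind='box' draws box plots; ax is the set of axes to draw on; subplots=True puts each column in its own subplot; sym sets the marker style for outliers
df.plot(kind='box', ax=axes, subplots=True, title='Different boxplots', sym='r+')
# Adjust the spacing between subplots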
fig.subplots_adjust(wspace=6,hspace=6)
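For reference, the points drawn with the `sym` marker are the ones matplotlib treats as fliers. With the default setting (`whis=1.5`), that means any value more than 1.5 times the interquartile range beyond the edges of the box. Below is a minimal sketch of that rule on made-up numbers; the helper name `iqr_outliers` and the sample values are illustrative and not part of the original data.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whis: float = 1.5) -> pd.Series:
    """Return the values a default box plot would draw as fliers."""
    q1 = values.quantile(0.25)   # lower edge of the box
    q3 = values.quantile(0.75)   # upper edge of the box
    iqr = q3 - q1                # interquartile range (height of the box)
    lower = q1 - whis * iqr
    upper = q3 + whis * iqr
    # Anything outside [lower, upper] lies beyond the whiskers
    return values[(values < lower) | (values > upper)]

# Hypothetical sample: six ordinary values and one extreme value
sample = pd.Series([0.2, 0.3, 0.35, 0.4, 0.45, 0.5, 3.0])
flagged = iqr_outliers(sample)
print(flagged.tolist())
```

This only mirrors what the plot shows; it does not remove anything from the data.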
2. Draw scatter plots to spot extreme values visually
df1 = df.copy()
## Store the row index of df1 as a column so it can serve as the x-axis
df1["index"] = df1.index
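# Draw a scatter plot for each feature to see whether there are any outliers
# The x-axis is labeled index and the y-axis takes the feature name (Length, etc.), which is also the title; kind='scatter' draws points, and s sets the point size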
features = ["Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings"]
for i in features:
    s = df1.plot(title=i,
                 kind='scatter',
                 x="index",
                 y=i,
                 s=8,
                 figsize=(8, 6))
Summary: DataFrame.plot() draws the contents of a DataFrame as whichever chart type the kind argument selects.
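As a small illustration of that summary, the sketch below calls `DataFrame.plot()` twice on a toy frame, once with `kind="box"` and once with `kind="scatter"`. The frame `demo` and its column names are made up for the example, and the `Agg` backend is chosen only so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, an assumption for headless runs
import pandas as pd

# Hypothetical toy data: column "a" contains one obvious extreme value
demo = pd.DataFrame({"a": [1, 2, 3, 40], "b": [2, 2, 3, 3], "c": [5, 6, 7, 8]})

# kind="box" with subplots=True: one box plot per numeric column
box_axes = demo.plot(kind="box", subplots=True, sym="r+")

# kind="scatter" needs explicit x and y column names,
# so the row index is copied into a column first
demo["index"] = demo.index
scatter_ax = demo.plot(kind="scatter", x="index", y="a", s=8, title="a")
```

The same method produces both charts; only `kind` and the arguments that go with it change.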
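3. Data processing
To simplify processing, we drop every row in which any feature value falls outside (0, 3.5 × median). The factor 3.5 is an arbitrary tuning choice with no principled justification.
# df[condition] keeps only the rows where the condition is True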
features = ["Length","Diameter","Height","Whole weight","Shucked weight","Viscera weight","Shell weight","Rings"]
for i in features:
    df = df[df[i] > 0]
    df = df[df[i] < df[i].median() * 3.5]
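One detail of the loop above is that each median is computed on a frame that earlier iterations have already filtered, so the thresholds depend on the order of `features`. A possible variant, sketched here under the assumption that thresholds should come from the unfiltered data, computes all medians once and applies a single mask. The function name `filter_by_median` and the toy values are hypothetical.

```python
import pandas as pd

def filter_by_median(df, features, factor=3.5):
    """Keep rows where every listed feature lies in (0, factor * median)."""
    # Assumption of this sketch: thresholds are computed once, before any filtering
    upper = df[features].median() * factor
    positive = df[features].gt(0)
    below = df[features].lt(upper, axis="columns")
    # A row survives only if all of its features pass both checks
    mask = (positive & below).all(axis=1)
    return df[mask]

# Hypothetical toy data: one zero Height and one very large Height
toy = pd.DataFrame({
    "Height": [0.0, 0.1, 0.12, 0.15, 1.13],
    "Rings": [9, 10, 8, 11, 10],
})
cleaned = filter_by_median(toy, ["Height", "Rings"])
print(cleaned.index.tolist())
```

On this toy frame both approaches keep the same rows, but on real data the two can differ slightly because the medians are taken at different stages.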