Machine Learning Data Cleaning Notes

Based on Mu Li's (李沐) course materials

After scraping the data from the web, save it to a CSV file, then load it into Jupyter.

# !pip install seaborn pandas matplotlib numpy
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
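from IPython import display
display.set_matplotlib_formats('svg')
# SVG output renders more sharply; the call above is the form used by older Python/Jupyter versions.
# Alternative to set svg for newer versions
# import matplotlib_inline
# matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
# If the call above raises a deprecation warning, uncomment the two lines above and use them instead.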

Reading the data

data = pd.read_csv('house_sales.zip')  # pandas can read a zipped CSV file directly
data.shape  # Output: (rows, columns)
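
As a quick check that `pd.read_csv` reads a zipped CSV directly, the sketch below writes a tiny archive to a temporary directory and reads it back. The file name `toy_sales.zip` and its two columns are made up for illustration and are not part of the original dataset.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Hypothetical toy data standing in for the real house_sales.zip.
csv_text = 'Id,Sold Price\n1,"$1,000,000"\n2,"$750,000"\n'

with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "toy_sales.zip")
    with zipfile.ZipFile(zip_path, "w") as zf:
        zf.writestr("toy_sales.csv", csv_text)
    # Compression is inferred from the .zip extension;
    # the archive must contain exactly one data file.
    toy = pd.read_csv(zip_path)

print(toy.shape)
```

Note that the price column comes back as text, which is why the conversion step later in these notes is needed.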

Show the first five rows

data.head() 

Drop columns with too many missing values (any column in which more than 30% of the values are null)

# We drop columns in which more than 30% of the values are null to simplify our EDA.
null_sum = data.isnull().sum()
data.columns[null_sum <= len(data) * 0.3]  # columns to keep
data.drop(columns=data.columns[null_sum > len(data) * 0.3], inplace=True)  # columns to delete
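
To make the 30% threshold concrete, here is a minimal sketch on a made-up three-column frame (the column names `a`, `b`, `c` are hypothetical). It uses `isnull().mean()` to get the null fraction directly, which is equivalent to comparing the null count against `len(data) * 0.3`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],        # 0% null
    "b": [1.0, np.nan, 3.0, 4.0],     # 25% null
    "c": [np.nan, np.nan, 3.0, 4.0],  # 50% null
})

null_frac = toy.isnull().mean()            # fraction of null values per column
to_drop = toy.columns[null_frac > 0.3]     # only "c" exceeds the threshold
cleaned = toy.drop(columns=to_drop)

print(null_frac.to_dict())
print(list(cleaned.columns))
```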

Find columns whose data types are wrong

data.dtypes  # inspect the data type of each column
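
One possible way to narrow down the suspicious columns is to list everything that is not numeric and then check by eye which of those should be numbers. The sketch below assumes a made-up two-column frame; the helper logic is an illustration, not part of the original notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sold Price": ["$1,000,000", "$750,000"],  # numeric values stored as text
    "Bedrooms": [3, 2],
})

# Columns that pandas did not parse as numbers are candidates for conversion.
non_numeric = [c for c in toy.columns
               if not pd.api.types.is_numeric_dtype(toy[c])]
print(non_numeric)
```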

Data conversion

# Convert currency from string format such as $1,000,000 to float.
currency = ['Sold Price', 'Listed Price', 'Tax assessed value', 'Annual tax amount']
for c in currency:
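    data[c] = data[c].replace(  # regex=True treats the pattern as a regular expression
        r'[$,-]', '', regex=True).replace(  # strip every $ sign and comma; a dash means "no data"
        r'^\s*$', np.nan, regex=True).astype(float)  # empty strings become np.nan (not a number), then astype casts to float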
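
The same two-step replacement can be tried on a small made-up Series to see what each pattern does. One caveat worth knowing: because `-` is in the character class, a minus sign on a negative amount would be stripped as well.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: two prices, a dash for "no data", and a blank.
prices = pd.Series(["$1,000,000", "$750,000", "-", " "])

cleaned = (
    prices.replace(r"[$,-]", "", regex=True)      # strip $, commas and dashes
          .replace(r"^\s*$", np.nan, regex=True)  # empty or blank strings become NaN
          .astype(float)
)
print(cleaned.tolist())
```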

# Also convert areas from string format such as 1000 sqft and 1 Acres to float.
areas = ['Total interior livable area', 'Lot size']
for c in areas:
    acres = data[c].str.contains('Acres') == True  # NaN == True is False, so missing values are not flagged
    col = data[c].replace(r'\b sqft\b|\b Acres\b|\b,\b','',regex=True).astype(float)
    col[acres] *= 43560  # 1 acre = 43,560 sqft
    data[c] = col
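
A minimal sketch of the unit conversion on a made-up Series, assuming the same string formats as above. It uses `na=False` in `str.contains` as an alternative way to treat missing values as "not acres"; the constant name `ACRE_IN_SQFT` is introduced here for readability only.

```python
import numpy as np
import pandas as pd

ACRE_IN_SQFT = 43560  # 1 acre = 43,560 square feet

# Hypothetical sample values, including one missing entry.
lot = pd.Series(["1,000 sqft", "0.5 Acres", "2 Acres", np.nan])

is_acres = lot.str.contains("Acres", na=False)  # missing values count as False
sqft = lot.replace(r"\b sqft\b|\b Acres\b|\b,\b", "", regex=True).astype(float)
sqft[is_acres] *= ACRE_IN_SQFT                  # convert acre rows to sqft

print(sqft.tolist())
```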

Removing noise

Check roughly how much noise the data contains (noise here means information in the dataset that does not help with the task)

# Now we can check values of the numerical columns. You could see the min and max values for several columns do not make sense.
data.describe()

Drop abnormal records, for example houses with an area below 10 sqft (about 1 m²) or above 10,000 sqft (about 930 m²)

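# We filter out houses whose areas are too small or too large to simplify the visualization later.
# Note: areas[1] is 'Lot size'; use areas[0] to filter on the interior livable area instead.
abnormal = (data[areas[1]] < 10) | (data[areas[1]] > 1e4)
data = data[~abnormal]  # ~ is element-wise logical NOT
sum(abnormal)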
41000
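
The boolean-mask pattern can be checked on a made-up column. One detail the sketch makes visible: comparisons against NaN are False, so rows with a missing area are not flagged as abnormal and stay in the data.

```python
import pandas as pd

# Hypothetical lot sizes in sqft; None becomes NaN.
toy = pd.DataFrame({"Lot size": [5.0, 1200.0, 8000.0, 50000.0, None]})

abnormal = (toy["Lot size"] < 10) | (toy["Lot size"] > 1e4)
kept = toy[~abnormal]  # keep every row that is not flagged

print(int(abnormal.sum()), len(kept))
```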

Look at the distribution of the target value

# Let's check the histogram of the 'Sold Price', which is the target we want to predict.
ax = sns.histplot(np.log10(data['Sold Price']))
ax.set_xlim([3, 8])
ax.set_xticks(range(3, 9))
ax.set_xticklabels(['%.0e'%a for a in 10**ax.get_xticks()]);
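
House prices span several orders of magnitude, which is why the histogram is drawn on a log10 scale. A rough numeric counterpart, using made-up prices and `np.histogram` instead of a plot, could look like this:

```python
import numpy as np

# Hypothetical sold prices in dollars.
prices = np.array([5e4, 3e5, 8e5, 1.2e6, 2.5e6, 4e7])

log_prices = np.log10(prices)
# Bin edges at 10^3, 10^4, ..., 10^8, matching the x-axis range used in the plot.
counts, edges = np.histogram(log_prices, bins=np.arange(3, 9))

for lo, hi, n in zip(edges[:-1], edges[1:], counts):
    print(f"1e{lo:.0f} to 1e{hi:.0f}: {n}")
```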

Look at the house types

# A house has different types. Here are the top types:
data['Type'].value_counts()[0:20]
SingleFamily            74318
Condo                   18749
MultiFamily              6586
VacantLand               6199
Townhouse                5846
Unknown                  5390
MobileManufactured       2588
Apartment                1416
Cooperative               161
Residential Lot            75
Single Family              69
Single Family Lot          56
Acreage                    48
2 Story                    39
3 Story                    25
Hi-Rise (9+), Luxury       21
RESIDENTIAL                19
Condominium                19
Duplex                     19
Mid-Rise (4-8)             17
Name: Type, dtype: int64

Plot the price density for each house type

# Price density for different house types.
types = data['Type'].isin(['SingleFamily', 'Condo', 'MultiFamily', 'Townhouse'])
sns.displot(pd.DataFrame({'Sold Price':np.log10(data[types]['Sold Price']),
                          'Type':data[types]['Type']}),
            x='Sold Price', hue='Type', kind='kde');

Draw a box plot

# Another important measurement is the sale price per living sqft. Let's check the differences between different house types.
data['Price per living sqft'] = data['Sold Price'] / data['Total interior livable area']
ax = sns.boxplot(x='Type', y='Price per living sqft', data=data[types], fliersize=0)
ax.set_ylim([0, 2000]);
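
The box plot compares the distribution of price per square foot across types. A numeric summary of the same quantity, sketched here on made-up rows, is the per-type median (the line drawn inside each box):

```python
import pandas as pd

# Hypothetical sales; column names follow the ones used in these notes.
toy = pd.DataFrame({
    "Type": ["SingleFamily", "SingleFamily", "Condo", "Condo"],
    "Sold Price": [1_000_000.0, 600_000.0, 450_000.0, 300_000.0],
    "Total interior livable area": [2000.0, 1500.0, 900.0, 500.0],
})

toy["Price per living sqft"] = toy["Sold Price"] / toy["Total interior livable area"]
medians = toy.groupby("Type")["Price per living sqft"].median()
print(medians.to_dict())
```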

The effect of location on house prices is also worth considering

# We know the location affects the price. Let's check the price for the top 20 zip codes.
d = data[data['Zip'].isin(data['Zip'].value_counts()[:20].keys())]
ax = sns.boxplot(x='Zip', y='Price per living sqft', data=d, fliersize=0)
ax.set_ylim([0, 2000])
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
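
The "top 20 zip codes" selection relies on `value_counts()` returning counts sorted in descending order. A small sketch with made-up zip codes, keeping the top 2 instead of the top 20:

```python
import pandas as pd

# Hypothetical zip codes; "94110" appears 3 times, "95014" twice, "90001" once.
toy = pd.DataFrame({"Zip": ["94110", "94110", "94110", "95014", "95014", "90001"]})

top_zips = toy["Zip"].value_counts().head(2).index  # most frequent values first
subset = toy[toy["Zip"].isin(top_zips)]             # rows in the top zip codes only

print(list(top_zips), len(subset))
```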

Relationships between the features

# Last, we visualize the correlation matrix of several columns.
_, ax = plt.subplots(figsize=(6,6))
columns = ['Sold Price', 'Listed Price', 'Annual tax amount', 'Price per living sqft', 'Elementary School Score', 'High School Score']
sns.heatmap(data[columns].corr(),annot=True,cmap='RdYlGn', ax=ax);
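
The heatmap visualizes the output of `DataFrame.corr()`, which computes pairwise Pearson correlation by default. A minimal sketch on made-up prices shows the shape of that matrix; the numeric columns are selected explicitly because the text column `Type` cannot be correlated.

```python
import pandas as pd

# Hypothetical listed and sold prices that move closely together.
toy = pd.DataFrame({
    "Sold Price": [300.0, 500.0, 700.0, 900.0],
    "Listed Price": [310.0, 490.0, 720.0, 880.0],
    "Type": ["Condo", "Condo", "SingleFamily", "SingleFamily"],
})

corr = toy[["Sold Price", "Listed Price"]].corr()  # symmetric 2x2 matrix, ones on the diagonal
print(corr.round(3))
```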
