Data Cleaning: Outlier Handling

Latest recommended article published on 2024-07-09 15:14:17

若尘

Category: Python数据清洗实战 (Python Data Cleaning in Practice)
Tags: data preprocessing, outlier handling, data cleaning, pandas, numpy

Article link: https://blog.csdn.net/qq_29339467/article/details/105636205

Outlier Handling

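Outliers are values that deviate from the normal range; they are not the same thing as erroneous values.
Outliers occur infrequently, but they can still bias the analysis in a real project.
Outliers are usually identified with a box plot (the interquartile-range method) or a distribution plot (the standard-deviation method).
Detection can use the range of the mean plus or minus two standard deviations, or the interquartile-range rule built from the lower and upper quartiles.
Outliers are typically handled by capping or by discretizing the data.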
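
The walkthrough below demonstrates capping but not discretization, so here is a minimal sketch of the latter using `pd.cut`. The toy prices, the bin edges, and the labels are all made up for illustration; suitable bins depend on the data and the analysis.

```python
import pandas as pd

# Toy prices; the values are hypothetical, not taken from MotorcycleData.csv.
prices = pd.Series([500, 4200, 8000, 13000, 100000])

# Hypothetical bin edges with an open-ended top bin, so an extreme price
# falls into the same category as other expensive items.
bins = [0, 5000, 10000, 20000, float('inf')]
labels = ['low', 'mid', 'high', 'very_high']

# Intervals are right-closed by default; include_lowest also keeps a price of 0.
price_level = pd.cut(prices, bins=bins, labels=labels, include_lowest=True)

print(price_level.tolist())
```

Once the values are binned, the extreme price of 100000 is just another member of the top category and no longer dominates statistics computed on the column.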

import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

def f(x):
    # Remove a leading/trailing '$' and the thousands separators, then convert to float.
    # Missing values pass through: str(NaN) is 'nan', and float('nan') is NaN.
    x = str(x).strip('$').replace(',', '')
    return float(x)

df['Price'] = df['Price'].apply(f)

df['Mileage'] = df['Mileage'].apply(f)
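
As an alternative to applying `f` element by element, the same cleaning can be sketched with pandas string methods. This assumes the column was read in as strings; the sample values below are hypothetical. It also differs from `f` in two ways: `str.replace` removes `$` anywhere in the string rather than only at the ends, and `errors='coerce'` turns unparseable entries into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical raw values in the same style as the Price column.
raw = pd.Series(['$11,412', '17,200', '$3,872.50', None])

# Drop the currency symbol and thousands separators as literal strings,
# then convert; anything that still fails to parse becomes NaN.
cleaned = pd.to_numeric(
    raw.str.replace('$', '', regex=False).str.replace(',', '', regex=False),
    errors='coerce',
)

print(cleaned.tolist())
```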

df.head(5)

	Condition	Condition_Desc	Price	Location	Model_Year	Mileage	Exterior_Color	Make	Warranty	Model	...	Vehicle_Title	OBO	Feedback_Perc	Watch_Count	N_Reviews	Seller_Status	Vehicle_Tile	Auction	Buy_Now	Bid_Count
0	Used	mint!!! very low miles	11412.0	McHenry, Illinois, United States	2013.0	16000.0	Black	Harley-Davidson	Unspecified	Touring	...	NaN	FALSE	8.1	NaN	2427	Private Seller	Clear	True	FALSE	28.0
1	Used	Perfect condition	17200.0	Fort Recovery, Ohio, United States	2016.0	60.0	Black	Harley-Davidson	Vehicle has an existing warranty	Touring	...	NaN	FALSE	100	17	657	Private Seller	Clear	True	TRUE	0.0
2	Used	NaN	3872.0	Chicago, Illinois, United States	1970.0	25763.0	Silver/Blue	BMW	Vehicle does NOT have an existing warranty	R-Series	...	NaN	FALSE	100	NaN	136	NaN	Clear	True	FALSE	26.0
3	Used	CLEAN TITLE READY TO RIDE HOME	6575.0	Green Bay, Wisconsin, United States	2009.0	33142.0	Red	Harley-Davidson	NaN	Touring	...	NaN	FALSE	100	NaN	2920	Dealer	Clear	True	FALSE	11.0
4	Used	NaN	10000.0	West Bend, Wisconsin, United States	2012.0	17800.0	Blue	Harley-Davidson	NO WARRANTY	Touring	...	NaN	FALSE	100	13	271	OWNER	Clear	True	TRUE	0.0

5 rows × 22 columns

# Handle outliers in Price
# Compute the mean price
x_bar = df['Price'].mean()

# Compute the standard deviation of the price
x_std = df['Price'].std()

# Check for outliers above the upper bound (mean + 2 standard deviations)
any(df['Price'] > x_bar + 2 * x_std)

True

# Check for outliers below the lower bound (mean - 2 standard deviations)
any(df['Price'] < x_bar - 2 * x_std)

False
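
The two checks above only say whether any outlier exists. A small sketch of how one might also count and inspect the flagged values follows; `std_outlier_mask` is a hypothetical helper and the toy series is made up. With the mean (about 9968.81) and standard deviation (about 8497.33) reported by the `describe()` output in this post, the upper bound is roughly 26963 and the lower bound is negative, which is consistent with the `True` and `False` results, since the minimum price is 0.

```python
import pandas as pd

def std_outlier_mask(s, k=2.0):
    """Flag values outside mean +/- k * std (hypothetical helper)."""
    mean, std = s.mean(), s.std()
    return (s < mean - k * std) | (s > mean + k * std)

# Toy data with one obviously extreme value.
toy = pd.Series([10, 12, 11, 13, 12, 11, 10, 12, 13, 100])

mask = std_outlier_mask(toy)
print(mask.sum())          # number of flagged values
print(toy[mask].tolist())  # the flagged values themselves
```

On the real data, `std_outlier_mask(df['Price']).sum()` would give the number of prices flagged by this rule.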

# Descriptive statistics
df['Price'].describe()

count      7493.000000
mean       9968.811557
std        8497.326850
min           0.000000
25%        4158.000000
50%        7995.000000
75%       13000.000000
max      100000.000000
Name: Price, dtype: float64

# 25th percentile (first quartile)
Q1 = df['Price'].quantile(q = 0.25)

# 75th percentile (third quartile)
Q3 = df['Price'].quantile(q = 0.75)

# Interquartile range
IQR = Q3 - Q1

any(df['Price'] > Q3 + 1.5 * IQR)

True

any(df['Price'] < Q1 - 1.5 * IQR)

False
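
For this dataset the quartiles from `describe()` give IQR = 13000 - 4158 = 8842, so the upper fence is 13000 + 1.5 × 8842 = 26263 and the lower fence is 4158 - 1.5 × 8842 = -9105. Prices cannot go below the lower fence, which explains the `False` above. The rule can be wrapped in a small function, sketched here on made-up data; `iqr_fences` is a hypothetical name, not something from the original notebook.

```python
import pandas as pd

def iqr_fences(s, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data with one extreme value.
toy = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

lower, upper = iqr_fences(toy)
outliers = toy[(toy < lower) | (toy > upper)]

print(lower, upper)
print(outliers.tolist())
```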

import matplotlib.pyplot as plt

%matplotlib inline

df['Price'].plot(kind='box')

<matplotlib.axes._subplots.AxesSubplot at 0x11ddad20ac8>

[Figure: box plot of Price (output_21_1.png, image not available)]

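# Set the plot style (on matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
# Plot a histogram, normalized to a density
df.Price.plot(kind='hist', bins=30, density=True)
# Overlay a kernel density estimate
df.Price.plot(kind='kde')
# Show the figure
plt.show()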

[Figure: histogram and kernel density estimate of Price (output_22_0.png, image not available)]

# Replace extreme values with the 99th and 1st percentiles
# Compute P1 and P99
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)

P99

39995.32

df['Price_new'] = df['Price']

# Capping: values above P99 are set to P99, values below P1 are set to P1
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1

df[['Price', 'Price_new']].describe()

	Price	Price_new
count	7493.000000	7493.000000
mean	9968.811557	9821.220873
std	8497.326850	7737.092537
min	0.000000	100.000000
25%	4158.000000	4158.000000
50%	7995.000000	7995.000000
75%	13000.000000	13000.000000
max	100000.000000	39995.320000
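
The two `.loc` assignments can also be written in one step with `Series.clip`. The sketch below uses a made-up DataFrame to check that both approaches give the same result; the column names `Price_loc` and `Price_clip` are illustrative only.

```python
import pandas as pd

# Toy prices as floats, so writing float percentiles back does not change the dtype.
toy = pd.DataFrame({'Price': [0.0, 150.0, 4000.0, 8000.0, 13000.0, 40000.0, 100000.0]})
p1 = toy['Price'].quantile(0.01)
p99 = toy['Price'].quantile(0.99)

# Capping with .loc, as in the walkthrough above.
toy['Price_loc'] = toy['Price']
toy.loc[toy['Price'] > p99, 'Price_loc'] = p99
toy.loc[toy['Price'] < p1, 'Price_loc'] = p1

# One-step alternative: clip everything into [p1, p99].
toy['Price_clip'] = toy['Price'].clip(lower=p1, upper=p99)

print((toy['Price_loc'] == toy['Price_clip']).all())
```

On the real data this would read `df['Price_new'] = df['Price'].clip(lower=P1, upper=P99)`.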