主要介绍了Sklearn中常用的数据预处理方法。
数据预处理
1.导入用到的库
import numpy as np
import pandas as pd
from sklearn.preprocessing import Imputer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Binarizer
from sklearn.cluster import KMeans
2.读取数据集
- “teenager_sns”包含30000个样本的美国高中生社交网络信息数据集。每个样本包含40个变量,其中 gradyear, gender, age和friends四个变量代表高中生的毕业年份、性别、年龄和好友数等基本信息。 其余36个变量代表36个词语,代表高中生的5大兴趣。
- “accord_sedan_testing”是一个二手汽车数据集,包含二手汽车的价格、已行驶英里、上市年份、档次、引擎缸数、换挡方式等
teenager_sns = pd.read_csv("./input/teenager_sns.csv")
teenager_sns.head(5)
gradyear | gender | age | friends | basketball | football | soccer | softball | volleyball | swimming | ... | blonde | mall | shopping | clothes | hollister | abercrombie | die | death | drunk | drugs | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2006 | M | 18.980 | 7 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 2006 | F | 18.801 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 2006 | M | 18.335 | 69 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
3 | 2006 | F | 18.875 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 2006 | NaN | 18.995 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
5 rows × 40 columns
test = pd.read_csv('./input/accord_sedan_testing.csv')
# 将数据集中的已行驶英里数“mileage”与价格“price”汇总
data = test[['mileage','price']]
data.head()
mileage | price | |
---|---|---|
0 | 68265 | 12995 |
1 | 92778 | 9690 |
2 | 136000 | 8995 |
3 | 72765 | 11995 |
4 | 36448 | 17999 |
3.缺失值处理
缺失值(NaN)处理:from sklearn.preprocessing import Imputer
"""
使用sklearn中的Imputer方法,将数据集“teenager_sns”中“age”列利用均值“mean”进行填充
"""
imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit(teenager_sns[["age"]])
teenager_sns["age_imputed"]=imp.transform(teenager_sns[["age"]])
# 显示年龄缺失的行,和插补缺失值之后的列"age_imputed"
teenager_sns[teenager_sns['age'].isnull()]
gradyear | gender | age | friends | basketball | football | soccer | softball | volleyball | swimming | ... | mall | shopping | clothes | hollister | abercrombie | die | death | drunk | drugs | age_imputed | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
5 | 2006 | F | NaN | 142 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 17.993949 |
13 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
15 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
16 | 2006 | NaN | NaN | 135 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
26 | 2006 | F | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
38 | 2006 | F | NaN | 17 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 17.993949 |
41 | 2006 | NaN | NaN | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 17.993949 |
49 | 2006 | M | NaN | 35 | 1 | 3 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 |