Categorical Data Handling
Data in a file falls into numeric and non-numeric (categorical) types.
For example:
| Numeric Data | Categorical Data |
|---|---|
| price | ocean_proximity |
| 2,000,000 | near ocean |
| 1,500,000 | inland |
1. Encoding Nominal Categorical Features
Non-numeric data of this kind is usually handled with one-hot encoding, which converts each string into a corresponding binary vector.
# Import libraries
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer  # MultiLabelBinarizer is used later

# Create feature
feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

# Create one-hot encoder
one_hot = LabelBinarizer()

# One-hot encode feature
one_hot.fit_transform(feature)

# View feature classes
one_hot.classes_

# Reverse one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))
After encoding, the feature becomes:
array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])
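Note that LabelBinarizer is really intended for target labels; for feature matrices, scikit-learn also provides OneHotEncoder. A minimal sketch with the same example data:

```python
# Sketch: one-hot encoding a feature column with scikit-learn's OneHotEncoder
import numpy as np
from sklearn.preprocessing import OneHotEncoder

feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

encoder = OneHotEncoder()                       # returns a sparse matrix by default
encoded = encoder.fit_transform(feature).toarray()
print(encoded)
print(encoder.categories_)                      # categories are sorted alphabetically
```

Each row still contains exactly one 1, and the column order follows the alphabetically sorted categories, just as with LabelBinarizer.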
What if each observation has multiple classes?
For example:
('Texas', 'Florida'), ('California', 'Alabama') ...
# Create multiclass feature
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

# Create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()

# One-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

# View classes
one_hot_multiclass.classes_
The encoded result:
array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])
PS: "dummying" is just another name for one-hot encoding.
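The "dummying" name comes from pandas, where the same columns can be produced directly with get_dummies; a sketch (the "state" column name is illustrative):

```python
# Sketch: pandas.get_dummies builds one dummy column per distinct value
import pandas as pd

df = pd.DataFrame({"state": ["Texas", "California", "Texas", "Delaware", "Texas"]})
dummies = pd.get_dummies(df["state"])
print(dummies)   # columns: California, Delaware, Texas
```

This keeps you inside a DataFrame with labeled columns, which is often more convenient for inspection than a bare NumPy array.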
2. Encoding Ordinal Categorical Features
Example: data with ranked levels such as High, Medium, and Low.
Approach: use replace to map each level string to a numeric value that preserves the ordering.
# Load library
import pandas as pd

# Create features
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

# Create mapper (a dict)
scale_mapper = {"Low": 1,
                "Medium": 2,
                "High": 3}

# Replace feature values with scale
dataframe["Score"].replace(scale_mapper)
PS: the key point is df.replace(mapper), where the mapper is a dict.
Now consider a question: if the distances between Low, Medium, and High are equal, then assigning the numbers 1, 2, 3 as above is reasonable. But what if there is also a medium-high level? Could we assign it 2.5?
For reference:
scale_mapper = {"Low": 1,
                "Medium": 2,
                "Barely More Than Medium": 2.1,
                "More Than Medium": 2.5,
                "High": 3}
dataframe["Score"].replace(scale_mapper)
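If you prefer to stay inside scikit-learn, OrdinalEncoder performs the same kind of mapping. One caveat: by default it sorts the strings alphabetically (High < Low < Medium), so pass the category order explicitly. A sketch:

```python
# Sketch: ordinal encoding with an explicit category order
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

# The encoder assigns 0, 1, 2 following the order given in `categories`
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(dataframe[["Score"]])
print(encoded.ravel())  # [0. 0. 1. 1. 2.]
```

Unlike the dict mapper, OrdinalEncoder only produces evenly spaced integers, so the 2.5 trick above still requires replace.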
3. Encoding Dictionaries of Features
How do we encode data that arrives as dictionaries?
Approach: use DictVectorizer from sklearn.feature_extraction.
# Import library
from sklearn.feature_extraction import DictVectorizer

# Create dictionary
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# Create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)  # by default the output is a sparse matrix that stores only nonzero elements

# Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

# View feature matrix
features
The result:
array([[ 4.,  2.,  0.],
       [ 3.,  4.,  0.],
       [ 0.,  1.,  2.],
       [ 0.,  2.,  2.]])
# Get feature names (get_feature_names() was removed in newer scikit-learn;
# use get_feature_names_out() instead)
feature_names = dictvectorizer.get_feature_names_out()

# View feature names
feature_names
The result:
array(['Blue', 'Red', 'Yellow'], dtype=object)
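One more point worth noting: once fitted, the vectorizer maps new dictionaries into the same learned column order, and keys it never saw during fit are silently dropped. A sketch with a hypothetical new observation:

```python
# Sketch: transforming unseen dictionaries with a fitted DictVectorizer
from sklearn.feature_extraction import DictVectorizer

data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

dictvectorizer = DictVectorizer(sparse=False)
dictvectorizer.fit(data_dict)

# "Green" was never seen during fit, so it is ignored
new_obs = [{"Red": 5, "Green": 1}]
print(dictvectorizer.transform(new_obs))  # [[0. 5. 0.]]
```

This fit-once, transform-many pattern is what makes DictVectorizer useful for, e.g., word-count dictionaries where new documents may contain unseen words.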