Categorical Data Handling
Data in a file falls into numeric and non-numeric (categorical) types.
For example:
| Numeric Data | Categorical Data |
|---|---|
| price | ocean_proximity |
| 2,000,000 | near ocean |
| 1,500,000 | inland |
1. Encoding Nominal Categorical Features
Non-numeric data of this kind is usually handled with one-hot encoding, which converts each string into a corresponding binary vector.
# Import libraries
import numpy as np
from sklearn.preprocessing import LabelBinarizer, MultiLabelBinarizer  # MultiLabelBinarizer is used later

# Create feature
feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

# Create one-hot encoder
one_hot = LabelBinarizer()

# One-hot encode feature
one_hot.fit_transform(feature)

# View feature classes
one_hot.classes_

# Reverse one-hot encoding
one_hot.inverse_transform(one_hot.transform(feature))
After encoding, the feature becomes:
array([[0, 0, 1],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 0],
       [0, 0, 1]])
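Note that LabelBinarizer is really intended for target labels; for feature matrices, scikit-learn also provides OneHotEncoder. A minimal sketch with the same example data:

```python
# Sketch: one-hot encoding a feature column with scikit-learn's OneHotEncoder
import numpy as np
from sklearn.preprocessing import OneHotEncoder

feature = np.array([["Texas"],
                    ["California"],
                    ["Texas"],
                    ["Delaware"],
                    ["Texas"]])

encoder = OneHotEncoder()                       # returns a sparse matrix by default
encoded = encoder.fit_transform(feature).toarray()
print(encoded)
print(encoder.categories_)                      # categories are sorted alphabetically
```

Each row still contains exactly one 1, and the column order follows the alphabetically sorted categories, just as with LabelBinarizer.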
What if each observation has multiple classes?
For example:
('Texas', 'Florida'), ('California', 'Alabama') ...
# Create multiclass feature
multiclass_feature = [("Texas", "Florida"),
                      ("California", "Alabama"),
                      ("Texas", "Florida"),
                      ("Delaware", "Florida"),
                      ("Texas", "Alabama")]

# Create multiclass one-hot encoder
one_hot_multiclass = MultiLabelBinarizer()

# One-hot encode multiclass feature
one_hot_multiclass.fit_transform(multiclass_feature)

# View classes
one_hot_multiclass.classes_
The encoded result:
array([[0, 0, 0, 1, 1],
       [1, 1, 0, 0, 0],
       [0, 0, 0, 1, 1],
       [0, 0, 1, 1, 0],
       [1, 0, 0, 0, 1]])
PS: "dummying" is just another name for one-hot encoding.
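The "dummying" name comes from pandas, where the same columns can be produced directly with get_dummies; a sketch (the "state" column name is illustrative):

```python
# Sketch: pandas.get_dummies builds one dummy column per distinct value
import pandas as pd

df = pd.DataFrame({"state": ["Texas", "California", "Texas", "Delaware", "Texas"]})
dummies = pd.get_dummies(df["state"])
print(dummies)   # columns: California, Delaware, Texas
```

This keeps you inside a DataFrame with labeled columns, which is often more convenient for inspection than a bare NumPy array.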
2. Encoding Ordinal Categorical Features
Example: data with ranked levels such as High, Medium, and Low.
Approach: use replace to map each level string to a numeric value that preserves the ordering.
# Load library
import pandas as pd

# Create features
dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

# Create mapper (a dict)
scale_mapper = {"Low": 1,
                "Medium": 2,
                "High": 3}

# Replace feature values with scale
dataframe["Score"].replace(scale_mapper)
PS: the key point is df.replace(mapper), where the mapper is a dict.
Now consider a question: if the distances between Low, Medium, and High are equal, then assigning the numbers 1, 2, 3 as above is reasonable. But what if there is also a medium-high level? Could we assign it 2.5?
For reference:
scale_mapper = {"Low": 1,
                "Medium": 2,
                "Barely More Than Medium": 2.1,
                "More Than Medium": 2.5,
                "High": 3}
dataframe["Score"].replace(scale_mapper)
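If you prefer to stay inside scikit-learn, OrdinalEncoder performs the same kind of mapping. One caveat: by default it sorts the strings alphabetically (High < Low < Medium), so pass the category order explicitly. A sketch:

```python
# Sketch: ordinal encoding with an explicit category order
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

dataframe = pd.DataFrame({"Score": ["Low", "Low", "Medium", "Medium", "High"]})

# The encoder assigns 0, 1, 2 following the order given in `categories`
encoder = OrdinalEncoder(categories=[["Low", "Medium", "High"]])
encoded = encoder.fit_transform(dataframe[["Score"]])
print(encoded.ravel())  # [0. 0. 1. 1. 2.]
```

Unlike the dict mapper, OrdinalEncoder only produces evenly spaced integers, so the 2.5 trick above still requires replace.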
3. Encoding Dictionaries of Features
How do we encode data that arrives as dictionaries?
Approach: use DictVectorizer from sklearn.feature_extraction.
# Import library
from sklearn.feature_extraction import DictVectorizer

# Create dictionary
data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

# Create dictionary vectorizer
dictvectorizer = DictVectorizer(sparse=False)  # by default the output is a sparse matrix that stores only nonzero elements

# Convert dictionary to feature matrix
features = dictvectorizer.fit_transform(data_dict)

# View feature matrix
features
The result:
array([[ 4.,  2.,  0.],
       [ 3.,  4.,  0.],
       [ 0.,  1.,  2.],
       [ 0.,  2.,  2.]])
# Get feature names (get_feature_names() was removed in newer scikit-learn;
# use get_feature_names_out() instead)
feature_names = dictvectorizer.get_feature_names_out()

# View feature names
feature_names
The result:
array(['Blue', 'Red', 'Yellow'], dtype=object)
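One more point worth noting: once fitted, the vectorizer maps new dictionaries into the same learned column order, and keys it never saw during fit are silently dropped. A sketch with a hypothetical new observation:

```python
# Sketch: transforming unseen dictionaries with a fitted DictVectorizer
from sklearn.feature_extraction import DictVectorizer

data_dict = [{"Red": 2, "Blue": 4},
             {"Red": 4, "Blue": 3},
             {"Red": 1, "Yellow": 2},
             {"Red": 2, "Yellow": 2}]

dictvectorizer = DictVectorizer(sparse=False)
dictvectorizer.fit(data_dict)

# "Green" was never seen during fit, so it is ignored
new_obs = [{"Red": 5, "Green": 1}]
print(dictvectorizer.transform(new_obs))  # [[0. 5. 0.]]
```

This fit-once, transform-many pattern is what makes DictVectorizer useful for, e.g., word-count dictionaries where new documents may contain unseen words.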