Feature engineering is the process of transforming raw data into a dataset suitable for machine learning.
Without feature engineering, there is no machine learning.
Feature Engineering: Summary and Examples
Prepare the following 10 rows of data and save them as test.txt:
milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
72993,10.141740,1.032955,1
35948,6.830792,1.213192,3
42666,13.276369,0.543880,3
67497,8.631577,0.749278,1
35483,12.273169,1.508053,3
Splitting the dataset
Split the data into a training part and a test (validation) part.
from sklearn.model_selection import train_test_split
import pandas as pd
data = pd.read_csv("test.txt")
x = data[["milage", "Liters", "Consumtime"]]
y = data[["target"]]
# train_test_split returns four arrays, in the order
# x_train, x_test, y_train, y_test:
# features: x_train (8 rows) and x_test (2 rows)
# targets:  y_train (8 rows) and y_test (2 rows)
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
print(len(x_train), len(x_test), len(y_train), len(y_test)) # 8 2 8 2
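Two keyword arguments are worth knowing here: random_state makes the split reproducible, and stratify keeps the class proportions of y the same in both parts. A minimal self-contained sketch (the toy data below is illustrative, not from test.txt):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]          # 10 feature rows
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]   # balanced binary target

# random_state fixes the shuffle; stratify=y keeps the 50/50 class ratio
# in both the training and test parts
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, random_state=42, stratify=y)

print(len(x_train), len(x_test))  # 8 2
```

Because the classes are balanced and 8 is divisible by 2, each class contributes exactly 4 rows to the training part.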
Normalization
Scale all features into the same range so they share one order of magnitude.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
data = pd.read_csv("test.txt")
x = data[["milage", "Liters", "Consumtime"]]
y = data[["target"]]
mmscl = MinMaxScaler(feature_range=(0,1))
print("before", x[:1]) # 40920 8.326976 0.953952
x = mmscl.fit_transform(x)
print("after", x[:1]) # 0.43582641 0.5817826 0.53237967
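Min-max scaling maps each column with x' = (x - min) / (max - min), so the column minimum becomes 0 and the maximum becomes 1. A small sketch with hand-checkable numbers (illustrative data, not test.txt):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Each column is scaled independently: (x - col_min) / (col_max - col_min)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)
print(scaled)
# first column: (1-1)/(3-1)=0.0, (2-1)/2=0.5, (3-1)/2=1.0
```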
Standardization
Transform the data so each feature follows a standard normal distribution: mean 0, standard deviation 1.
import pandas as pd
from sklearn.preprocessing import StandardScaler
data = pd.read_csv("test.txt")
x = data[["milage", "Liters", "Consumtime"]]
y = data[["target"]]
ss = StandardScaler()
print(x[:1]) # 40920 8.326976 0.953952
x = ss.fit_transform(x)
print(x[:1]) # -0.20889502 0.00935662 0.10942786
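The claimed property (mean 0, standard deviation 1 per column) can be verified directly on a toy example (illustrative data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two columns on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])
Z = StandardScaler().fit_transform(X)

# After standardization each column has mean 0 and (population) std 1
print(Z.mean(axis=0))  # ~[0. 0.]
print(Z.std(axis=0))   # ~[1. 1.]
```

Note that StandardScaler uses the population standard deviation (ddof=0), which matches NumPy's default.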
Text feature extraction – word-count (bag-of-words) encoding
Convert text into a numeric encoding that machine learning algorithms can consume. Strictly speaking, CountVectorizer produces word counts rather than one-hot values: a word that appears twice gets the value 2.
from sklearn.feature_extraction.text import CountVectorizer
text = "machine learning is a kind of technology that make machine became smarter."
cv = CountVectorizer()
res = cv.fit_transform([text])
print(cv.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn
print(res.toarray())
# ['became' 'is' 'kind' 'learning' 'machine' 'make' 'of' 'smarter' 'technology' 'that']
# [[1 1 1 1 2 1 1 1 1 1]]
Printing the sparse matrix itself (print(res)) shows (row, column_index) count triples instead of the dense array, for example:
(0, 4) 2
(0, 3) 1
(0, 1) 1
(0, 2) 1
Chinese word segmentation
Chinese text has no spaces between words, so use jieba to split a sentence into a word list before vectorizing:
import jieba
sentence = "小明吃着方便面看电视"
print(jieba.lcut(sentence)) # ['小明', '吃', '着', '方便面', '看电视']
Merging tables with pandas
Prepare two simple tables:
pa.txt
id,shop_id,shop_name
1,1,aaa
2,2,bbb
pb.txt
shop_id,shop_type
1,tool
2,food
import pandas as pd
pa_data = pd.read_csv("pa.txt")
pb_data = pd.read_csv("pb.txt")
# specify the key column the two tables share
pmerge = pd.merge(pa_data, pb_data, on="shop_id")
print(pmerge)
Result:
id shop_id shop_name shop_type
0 1 1 aaa tool
1 2 2 bbb food
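By default pd.merge performs an inner join and silently drops rows without a match. A sketch with in-memory frames (illustrative data, with one shop missing a type row) showing how="left" and indicator=True:

```python
import pandas as pd

pa = pd.DataFrame({"id": [1, 2, 3],
                   "shop_id": [1, 2, 3],
                   "shop_name": ["aaa", "bbb", "ccc"]})
pb = pd.DataFrame({"shop_id": [1, 2],
                   "shop_type": ["tool", "food"]})

# default how="inner": shop_id 3 has no match, so it is dropped
inner = pd.merge(pa, pb, on="shop_id")

# how="left" keeps every row of pa (NaN where pb has no match);
# indicator=True adds a _merge column telling where each row came from
left = pd.merge(pa, pb, on="shop_id", how="left", indicator=True)
print(inner)
print(left)  # shop_id 3 row: shop_type NaN, _merge "left_only"
```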
Cross-tabulation with pandas
Prepare two simple tables:
pa.txt
id,shop_id,shop_name
1,1,aaa
2,2,bbb
3,3,ccc
pb.txt
shop_id,shop_type
1,tool
2,food
3,food
import pandas as pd
pa_data = pd.read_csv("pa.txt")
pb_data = pd.read_csv("pb.txt")
pmerge = pd.merge(pa_data, pb_data, on="shop_id")
# specify the two columns to cross-tabulate
cross_data = pd.crosstab(pmerge["shop_type"], pmerge["shop_name"])
print(cross_data)
Result:
shop_name aaa bbb ccc
shop_type
food 0 1 1
tool 1 0 0
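crosstab can also append row and column totals via margins=True. The same table built from in-memory data:

```python
import pandas as pd

df = pd.DataFrame({"shop_type": ["tool", "food", "food"],
                   "shop_name": ["aaa", "bbb", "ccc"]})

# margins=True adds an "All" row and column with the totals
ct = pd.crosstab(df["shop_type"], df["shop_name"], margins=True)
print(ct)
```

The "All" column confirms at a glance that food occurs twice and the table covers 3 rows in total.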
Dimensionality reduction with PCA (principal component analysis)
Treat data with many features as points in a high-dimensional space; PCA projects them onto the directions that carry the most variance, keeping the main information in fewer features.
import pandas as pd
from sklearn.decomposition import PCA
# Note: here pa.txt and pb.txt are assumed to contain the numeric columns
# shown in the result below (id,fa..fe and id,ha..he), not the shop tables above.
pa_data = pd.read_csv("pa.txt")
pb_data = pd.read_csv("pb.txt")
pmerge = pd.merge(pa_data, pb_data, on="id")
print(pmerge)
# n_components: a float in (0, 1) means "keep enough components to
# retain this fraction of the variance"
pca = PCA(n_components=0.95)
pmerge = pca.fit_transform(pmerge)
print(pmerge)
Result
Before:
id fa fb fc fd fe ha hb hc hd he
0 1 1 2 2 3 8 1 2 2 1 2
1 1 1 2 2 3 8 1 2 2 1 2
2 1 1 2 2 4 7 1 2 2 1 2
3 1 1 2 2 4 7 1 2 2 1 2
After:
[[ 0.70710678] [ 0.70710678] [-0.70710678] [-0.70710678]]
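The "keep a fraction of the variance" behavior is easiest to see on toy data whose second column is an exact multiple of the first: all the variance lies along one direction, so one component retains 100% of it (illustrative sketch):

```python
import numpy as np
from sklearn.decomposition import PCA

# Second column = 2 * first column, so the data is effectively 1-D
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 6.0],
              [4.0, 8.0]])

pca = PCA(n_components=0.95)   # keep components covering 95% of variance
Xr = pca.fit_transform(X)
print(Xr.shape)                       # (4, 1) -- reduced from 2 to 1 feature
print(pca.explained_variance_ratio_) # first component explains everything
```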
Dictionary feature extraction (feature vectorization)
Convert the non-numeric values in a list of records into a numeric one-hot representation.
Prepare the following data (a.txt):
id,name,age,hobby
1,kity,12,football
2,lili,13,cooking
3,bob,12,cooking
import pandas as pd
from sklearn.feature_extraction import DictVectorizer
data = pd.read_csv("a.txt")
print(data)
dv = DictVectorizer()
data = dv.fit_transform(data.to_dict(orient="records"))
print(dv.get_feature_names_out())  # get_feature_names() was removed in newer scikit-learn
print(data.toarray())
Result:
Original data:
id name age hobby
0 1 kity 12 football
1 2 lili 13 cooking
2 3 bob 12 cooking
The transformed data is in one-hot style; the sparse matrix is converted to a dense array here for readability:
['age', 'hobby=cooking', 'hobby=football', 'id', 'name=bob', 'name=kity', 'name=lili']
[[12. 0. 1. 1. 0. 1. 0.]
[13. 1. 0. 2. 0. 0. 1.]
[12. 1. 0. 3. 1. 0. 0.]]