Feature Engineering: Summary and Examples

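Feature engineering is the process of transforming a raw dataset into one that a machine learning model can use.
Without feature engineering, there is no machine learning.

Prepare the following 10 records and save them as test.txt: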

milage,Liters,Consumtime,target
40920,8.326976,0.953952,3
14488,7.153469,1.673904,2
26052,1.441871,0.805124,1
75136,13.147394,0.428964,1
38344,1.669788,0.134296,1
72993,10.141740,1.032955,1
35948,6.830792,1.213192,3
42666,13.276369,0.543880,3
67497,8.631577,0.749278,1
35483,12.273169,1.508053,3

Splitting the dataset

Split the data into a training part and a validation part.

from sklearn.model_selection import train_test_split
import pandas as pd

data = pd.read_csv("test.txt")
x = data[["milage", "Liters", "Consumtime"]]
y = data[["target"]]

# train_test_split returns the pieces in the order
# x_train, x_test, y_train, y_test.
# With train_size=0.8 the 10 rows are split so that
# the features x give 8 training rows and 2 test rows,
# and the labels y (target) give the matching 8 and 2 rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8)
print(len(x_train), len(x_test), len(y_train), len(y_test)) # 8 2 8 2
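
To make the return order concrete, here is a minimal standard-library sketch of what a shuffled split does. The helper simple_train_test_split and its seed parameter are hypothetical names used for illustration, not scikit-learn's implementation; only the order of the returned pieces mirrors train_test_split.

```python
import random

def simple_train_test_split(x, y, train_size=0.8, seed=0):
    """Hypothetical helper: shuffle the row indices, then cut them at train_size."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of rows")
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)  # a fixed seed keeps the split repeatable
    cut = int(len(indices) * train_size)
    train_idx, test_idx = indices[:cut], indices[cut:]
    x_train = [x[i] for i in train_idx]
    x_test = [x[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    # Same order as scikit-learn: both x pieces first, then both y pieces
    return x_train, x_test, y_train, y_test

# The milage column and the target column of test.txt
milage = [40920, 14488, 26052, 75136, 38344, 72993, 35948, 42666, 67497, 35483]
target = [3, 2, 1, 1, 1, 1, 3, 3, 1, 3]

x = [[m] for m in milage]  # one feature per row, to keep the sketch short
x_train, x_test, y_train, y_test = simple_train_test_split(x, target)
print(len(x_train), len(x_test), len(y_train), len(y_test))
```

Because the same shuffled indices are applied to x and y, every feature row stays paired with its own label.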
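
Normalizing the data

Rescale every feature to the same range (here 0 to 1) so that all features are on the same order of magnitude.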

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.read_csv("test.txt")
x = data[["milage", "Liters", "Consumtime"]]
y = data[["target"]]

mmscl = MinMaxScaler(feature_range=(0,1))
print("before", x[:1]) # 40920  8.326976    0.953952
x = mmscl.fit_transform(x)
print("after", x[:1]) # 0.43582641 0.5817826  0.53237967
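
The arithmetic behind min-max scaling is x' = (x - min) / (max - min), stretched to the requested range. The sketch below applies it by hand to the milage column using only the standard library; min_max_scale is a hypothetical helper name, not part of scikit-learn.

```python
def min_max_scale(values, feature_range=(0.0, 1.0)):
    """Hypothetical helper: map values linearly onto feature_range."""
    low, high = feature_range
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        raise ValueError("cannot scale a constant column")
    return [low + (v - v_min) / (v_max - v_min) * (high - low) for v in values]

# The milage column of test.txt
milage = [40920, 14488, 26052, 75136, 38344, 72993, 35948, 42666, 67497, 35483]

scaled = min_max_scale(milage)
print(scaled[0])  # should agree with the first value MinMaxScaler printed above
```

The smallest value in the column maps to 0 and the largest to 1, so a single outlier can compress all other values into a narrow band.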
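
Standardizing the data

Rescale each feature so that it has mean 0 and standard deviation 1, as a standard normal distribution does. The rescaling is linear, so it does not change the shape of the distribution.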

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.read_csv("test.txt")
x = data[["milage", "Liters", "Consumtime"]]
y = data[["target"]]

ss = StandardScaler()
print( x[:1]) # 40920  8.326976    0.953952
x = ss.fit_transform(x)
print(x[:1]) # -0.20889502  0.00935662  0.10942786
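
Standardization computes z = (x - mean) / std for each column. The following sketch reproduces it for the milage column with the standard library only. It assumes the population standard deviation (dividing by n), which is what StandardScaler uses.

```python
import statistics

# The milage column of test.txt
milage = [40920, 14488, 26052, 75136, 38344, 72993, 35948, 42666, 67497, 35483]

mean = statistics.fmean(milage)
std = statistics.pstdev(milage)  # population standard deviation (divide by n)

z = [(v - mean) / std for v in milage]
print(z[0])  # should agree with the first value StandardScaler printed above
```

After the transformation the column has mean 0 and standard deviation 1, up to floating-point rounding.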
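
Text feature extraction: converting to word-count vectors

Convert text into a numeric encoding that a machine learning model can work with. CountVectorizer counts how often each word occurs, so the result is a count vector; it looks like a one-hot style 0/1 encoding only when no word repeats.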

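from sklearn.feature_extraction.text import CountVectorizer

# Named text rather than str, to avoid shadowing the built-in str type
text = "machine learning is a kind of technology that make machine became smarter."

cv = CountVectorizer()
res = cv.fit_transform([text])

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cv.get_feature_names_out().tolist())
print(res.toarray())
# ['became', 'is', 'kind', 'learning', 'machine', 'make', 'of', 'smarter', 'technology', 'that']
# [[1 1 1 1 2 1 1 1 1 1]]
# The one-letter word "a" is dropped by the default token pattern.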

The same counts in sparse form, as (row, column) count entries (excerpt):

(0, 4)        2
(0, 3)        1
(0, 1)        1
(0, 2)        1
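
What the vectorizer does can be sketched with the standard library: lowercase the text, pull out tokens, sort the vocabulary, and count. This is an illustration rather than scikit-learn's code; the regular expression below is assumed to mirror CountVectorizer's default token pattern, which keeps only tokens of two or more word characters.

```python
import re
from collections import Counter

text = "machine learning is a kind of technology that make machine became smarter."

# Tokens of two or more word characters, so the one-letter word "a" is skipped
tokens = re.findall(r"\b\w\w+\b", text.lower())

word_counts = Counter(tokens)
vocabulary = sorted(word_counts)                 # column order is alphabetical
counts = [word_counts[w] for w in vocabulary]    # one count per vocabulary word

print(vocabulary)
print(counts)
# Sparse view: keep only (row, column) -> count, as in the listing above
sparse = {(0, col): n for col, n in enumerate(counts) if n != 0}
print(sparse[(0, 4)])  # the count stored for "machine"
```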

Chinese word segmentation

Split a Chinese sentence into individual words.

import jieba

# "Xiao Ming watches TV while eating instant noodles"
text = "小明吃着方便面看电视"
print(jieba.lcut(text)) # ['小明', '吃', '着', '方便面', '看电视']
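
To show the idea of dictionary-based segmentation without installing anything, here is a sketch of forward maximum matching, a much simpler technique than the one jieba uses internally. The tiny dictionary TOY_DICT and the function forward_max_match are hypothetical and exist only for this example.

```python
# Hypothetical toy dictionary; a real segmenter ships a far larger one
TOY_DICT = {"小明", "方便", "方便面", "电视", "看电视"}

def forward_max_match(sentence, dictionary):
    """Greedy segmentation: at each position take the longest dictionary word."""
    max_len = max(len(word) for word in dictionary)
    words = []
    pos = 0
    while pos < len(sentence):
        # Try the longest candidate first, then shrink it one character at a time
        for size in range(min(max_len, len(sentence) - pos), 0, -1):
            candidate = sentence[pos:pos + size]
            # A single character is always accepted as a fallback
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    return words

segments = forward_max_match("小明吃着方便面看电视", TOY_DICT)
print(segments)
```

With this toy dictionary the greedy match prefers 方便面 over its prefix 方便, which is why the longest candidate is tried first.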

Merging tables with pandas

Prepare two simple tables.

pa.txt

id,shop_id,shop_name
1,1,aaa
2,2,bbb

pb.txt

shop_id,shop_type
1,tool
2,food

import pandas as pd

pa_data = pd.read_csv("pa.txt")
pb_data = pd.read_csv("pb.txt")
# on: the key column that links the two tables
pmerge = pd.merge(pa_data, pb_data, on="shop_id")
print(pmerge)

Result:

   id  shop_id shop_name shop_type
0   1        1       aaa      tool
1   2        2       bbb      food
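
The merge is an inner join on shop_id. A rough standard-library sketch of the same join is given below, assuming the two tables are small enough to index in memory; the variable names are illustrative, and values stay as strings because the csv module does not infer types.

```python
import csv
import io

# Contents of pa.txt and pb.txt, inlined so the sketch is self-contained
PA_TXT = "id,shop_id,shop_name\n1,1,aaa\n2,2,bbb\n"
PB_TXT = "shop_id,shop_type\n1,tool\n2,food\n"

pa_rows = list(csv.DictReader(io.StringIO(PA_TXT)))
pb_rows = list(csv.DictReader(io.StringIO(PB_TXT)))

# Index the right-hand table by the join key
pb_by_key = {}
for row in pb_rows:
    pb_by_key.setdefault(row["shop_id"], []).append(row)

# Inner join: keep a left row only when its key also exists on the right
merged = []
for left in pa_rows:
    for right in pb_by_key.get(left["shop_id"], []):
        merged.append({**left, **right})

for row in merged:
    print(row)
```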

Cross-tabulation with pandas

Prepare two simple tables.

pa.txt

id,shop_id,shop_name
1,1,aaa
2,2,bbb
3,3,ccc

pb.txt

shop_id,shop_type
1,tool
2,food
3,food

import pandas as pd

pa_data = pd.read_csv("pa.txt")
pb_data = pd.read_csv("pb.txt")

pmerge = pd.merge(pa_data, pb_data, on="shop_id")
# Row labels come from shop_type, column labels from shop_name
cross_data = pd.crosstab(pmerge["shop_type"], pmerge["shop_name"])
print(cross_data)

Result:

shop_name  aaa  bbb  ccc
shop_type
food         0    1    1
tool         1    0    0
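
A cross-tabulation is a count of how often each (row label, column label) pair occurs. The sketch below rebuilds the table above with collections.Counter; the merged rows are typed in by hand here, so no files are needed.

```python
from collections import Counter

# The merged rows from the example above, reduced to the two columns of interest
merged = [
    {"shop_name": "aaa", "shop_type": "tool"},
    {"shop_name": "bbb", "shop_type": "food"},
    {"shop_name": "ccc", "shop_type": "food"},
]

# Count every (shop_type, shop_name) pair
pair_counts = Counter((row["shop_type"], row["shop_name"]) for row in merged)

row_labels = sorted({row["shop_type"] for row in merged})
col_labels = sorted({row["shop_name"] for row in merged})

# A Counter returns 0 for pairs that never occur, which fills the empty cells
table = [[pair_counts[(r, c)] for c in col_labels] for r in row_labels]

print(col_labels)
for label, counts in zip(row_labels, table):
    print(label, counts)
```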
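
Dimensionality reduction with PCA (principal component analysis)

Treat raw data with many features as points in a high-dimensional space, then project them onto the few directions that carry most of the variance. Keeping only those directions reduces the number of features.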

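import pandas as pd
from sklearn.decomposition import PCA

# Note: pa.txt and pb.txt here are different files from the shop tables above;
# their contents are not shown.
pa_data = pd.read_csv("pa.txt")
pb_data = pd.read_csv("pb.txt")

pmerge = pd.merge(pa_data, pb_data, on="id")
print(pmerge)
# n_components: a float between 0 and 1 is the fraction of variance to keep
pca = PCA(n_components=0.95)
pmerge = pca.fit_transform(pmerge)
print(pmerge)

Result:

Before: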
   id  fa  fb  fc  fd  fe  ha  hb  hc  hd  he
0   1   1   2   2   3   8   1   2   2   1   2
1   1   1   2   2   3   8   1   2   2   1   2
2   1   1   2   2   4   7   1   2   2   1   2
3   1   1   2   2   4   7   1   2   2   1   2
After:
[[ 0.70710678] [ 0.70710678] [-0.70710678] [-0.70710678]]
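
The single output column can be reproduced by hand. The sketch below centers the table shown above, finds the dominant direction of its scatter matrix by power iteration, and projects the rows onto it. Power iteration is used here only because it needs nothing beyond the standard library; it is not how scikit-learn computes PCA, and the sign of the result is arbitrary.

```python
import math

# The 4 rows of the "before" table (columns id, fa..fe, ha..he)
rows = [
    [1, 1, 2, 2, 3, 8, 1, 2, 2, 1, 2],
    [1, 1, 2, 2, 3, 8, 1, 2, 2, 1, 2],
    [1, 1, 2, 2, 4, 7, 1, 2, 2, 1, 2],
    [1, 1, 2, 2, 4, 7, 1, 2, 2, 1, 2],
]
n_rows, n_cols = len(rows), len(rows[0])

# Step 1: center every column on its mean
means = [sum(r[j] for r in rows) / n_rows for j in range(n_cols)]
centered = [[r[j] - means[j] for j in range(n_cols)] for r in rows]

# Step 2: scatter matrix S = X^T X of the centered data
scatter = [[sum(r[i] * r[j] for r in centered) for j in range(n_cols)]
           for i in range(n_cols)]

def mat_vec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Step 3: power iteration converges to the direction of largest variance
direction = [float(j + 1) for j in range(n_cols)]  # arbitrary start vector
for _ in range(50):
    product = mat_vec(scatter, direction)
    norm = math.sqrt(sum(p * p for p in product))
    direction = [p / norm for p in product]

# Step 4: project each centered row onto that direction
projected = [sum(a * b for a, b in zip(row, direction)) for row in centered]

# Share of the total variance captured by this one direction
eigenvalue = sum(a * b for a, b in zip(direction, mat_vec(scatter, direction)))
ratio = eigenvalue / sum(scatter[j][j] for j in range(n_cols))
print(projected, ratio)
```

Only the columns fd and fe vary, and they move in opposite directions, so one component carries all of the variance. That is why asking for 95% of the variance leaves a single column.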

Dictionary feature extraction (feature vectorization)

Convert the non-numeric values in a dataset into numeric form.
Prepare the following data (a.txt):

id,name,age,hobby
1,kity,12,football
2,lili,13,cooking
3,bob,12,cooking
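
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

data = pd.read_csv("a.txt")
print(data)

dv = DictVectorizer()
# One dict per row, e.g. {"id": 1, "name": "kity", "age": 12, "hobby": "football"}
data = dv.fit_transform(data.to_dict(orient="records"))
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(dv.get_feature_names_out().tolist())
print(data.toarray())

Result:

Original data: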
   id  name  age     hobby
0   1  kity   12  football
1   2  lili   13   cooking
2   3   bob   12   cooking
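After conversion the string columns are one-hot encoded and the numeric columns are kept as they are. For readability, the sparse result is converted to a dense array here: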
['age', 'hobby=cooking', 'hobby=football', 'id', 'name=bob', 'name=kity', 'name=lili']
[[12.  0.  1.  1.  0.  1.  0.]
 [13.  1.  0.  2.  0.  0.  1.]
 [12.  1.  0.  3.  1.  0.  0.]]
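
The rule behind this output can be sketched in a few lines of plain Python: a numeric value keeps its column name and its value, while a string value becomes a separate "column=value" feature that is set to 1. The helper feature_name is a hypothetical name for this illustration and is not part of scikit-learn.

```python
# The rows of a.txt as dictionaries, one per record
records = [
    {"id": 1, "name": "kity", "age": 12, "hobby": "football"},
    {"id": 2, "name": "lili", "age": 13, "hobby": "cooking"},
    {"id": 3, "name": "bob", "age": 12, "hobby": "cooking"},
]

def feature_name(key, value):
    """Hypothetical helper: strings get a 'key=value' feature, numbers keep the key."""
    return f"{key}={value}" if isinstance(value, str) else key

# Collect every feature that occurs, in alphabetical order
feature_names = sorted({feature_name(k, v) for rec in records for k, v in rec.items()})
column = {name: i for i, name in enumerate(feature_names)}

matrix = []
for rec in records:
    row = [0.0] * len(feature_names)
    for key, value in rec.items():
        # One-hot for strings, the raw number otherwise
        row[column[feature_name(key, value)]] = 1.0 if isinstance(value, str) else float(value)
    matrix.append(row)

print(feature_names)
for row in matrix:
    print(row)
```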
Author: __万波__
