A Brief Look at Entity Embedding

lrhaowx

Published 2021-02-27 22:58:36


Article link: https://blog.csdn.net/weixin_43475854/article/details/114197428


WeChat official account: ChallengeHub

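"Everything can be embedded."
In real life and in competitions we often work with tabular data that contains all kinds of categorical features.
This article briefly introduces Entity Embedding, a way of representing categorical features with a neural network. The method first appeared in the 3rd-place solution to the Kaggle competition "Rossmann Store Sales". After the competition the authors wrote it up as a paper on arXiv, titled "Entity Embeddings of Categorical Variables".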

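1 Common categorical encoding methods

Data mining offers many ways to handle categorical features; the most common approach is to convert them to a one-hot encoding or something similar. As a rule of thumb:

label encoding: the feature has an inherent order (ordinal feature)
one hot encoding: the feature has no inherent order, number of categories < 4
target encoding (mean encoding, likelihood encoding, impact encoding): the feature has no inherent order, number of categories > 4
beta target encoding: the feature has no inherent order, number of categories > 4, combined with K-fold cross validation
no processing (the model encodes the feature itself): CatBoost, LightGBM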
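
The first three encodings above can be illustrated on a toy column. The sketch below uses only NumPy; the data and variable names are made up for illustration. For brevity the target encoding is computed on the full data, whereas in practice it would usually be computed out of fold to limit target leakage.

```python
import numpy as np

# Toy data (made up for illustration): a city column and a binary target.
city = np.array(["bj", "sh", "bj", "gz", "sh", "bj"])
y = np.array([1, 0, 1, 0, 1, 0])

# Label encoding: map each category to an integer index (categories are sorted).
categories, label_encoded = np.unique(city, return_inverse=True)

# One-hot encoding: one column per category, a single 1 per row.
one_hot = np.eye(len(categories), dtype=int)[label_encoded]

# Target (mean) encoding: replace each category by the mean target of that category.
category_means = np.array(
    [y[label_encoded == k].mean() for k in range(len(categories))]
)
target_encoded = category_means[label_encoded]

print(categories)       # the distinct categories, sorted
print(label_encoded)    # integer index of each row's category
print(one_hot)          # shape (6, 3)
print(target_encoded)   # per-row mean target of the row's category
```

Label encoding imposes an order on the categories, which is why the list above reserves it for ordinal features; one-hot encoding avoids that but grows with the number of categories.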

2 Entity Embedding

Core idea: map non-negative integers (category indices) to dense vectors of a fixed size.

# Code adapted from: https://blog.csdn.net/anshuai_aw1/article/details/83586404
# Written for TensorFlow 2.x (tf.keras).
import os
import random as rn

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Embedding
from tensorflow.keras.models import Sequential

# ===================================================================================================
# Make the results reproducible by fixing every random seed
os.environ['PYTHONHASHSEED'] = '0'
np.random.seed(42)
rn.seed(12345)
tf.random.set_seed(1234)
# ===================================================================================================


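'''
The input has shape 32*2: 32 samples and 2 categorical features, each taking one of 10 possible values (0 to 9).
One-hot encoding these 2 features would give a 32*20 matrix.
The embedding turns each feature's 10-dimensional one-hot vector into 3 dimensions (chosen by hand; other sizes work too).
Since there are 2 categorical features, the final result should be 32*6.
'''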
model = Sequential()
model.add(Embedding(10, 3, input_length=2))

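# Build the input data
input_array = np.random.randint(10, size=(32, 2))

# Compile the model
model.compile('rmsprop', 'mse')

# Get the output, which has shape 32*2*3. The shape we ultimately want is 32*6: flatten each
# sample's 2*3 block row by row into 6 values, which is the embedded form of the categorical features.
output_array = model.predict(input_array)

# Inspect the weights (get_weights() returns a list holding a single 10*3 array)
weight = model.get_weights()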
weight = model.get_weights()

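'''
How is output_array obtained?
Look at the array inside weight first: it has shape 10*3 and is simply a lookup table. If the original feature value
is 0, take the 1st row; if the original feature value is 3, take the 4th row.
The values below come from one run; yours may differ.
0.00312117  -0.0475833  0.0386381
0.0153809   -0.0185934  0.0234457
0.0137821   0.00433551  0.018144
0.0468446   -0.00687895 0.0320682
0.0313594   -0.0179525  0.03054
0.00135239  0.0309016   0.0453686
0.0145149   -0.0165581  -0.0280098
0.0370018   -0.0200525  -0.0332663
0.0330335   0.0110769   0.00161555
0.00262188  -0.0495747  -0.0343777
Take the first row of input_array as an example.
In this run it is 7 and 4, so we look up the 8th and the 5th row, which form the first 2*3 block of output_array:
0.0370018   -0.0200525  -0.0332663
0.0313594   -0.0179525  0.03054
Flattened into one vector: 0.0370018  -0.0200525  -0.0332663 0.0313594    -0.0179525  0.03054
This is what the original feature values 7 and 4 become after passing through the embedding layer.
'''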
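
To make the lookup concrete without depending on a particular Keras version, here is a minimal NumPy sketch of what the embedding layer computes. The weight table is random (a stand-in for learned weights), and the variable names simply mirror the Keras example above. The sketch also checks that the lookup equals multiplying the one-hot vector by the weight table.

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, embed_dim = 10, 3
# Stand-in for the weight table of Embedding(10, 3): one row per category value.
weight = rng.uniform(-0.05, 0.05, size=(n_categories, embed_dim))

# 32 samples, 2 categorical features with values in 0..9.
input_array = rng.integers(0, n_categories, size=(32, 2))

# The embedding layer is a row lookup: result has shape (32, 2, 3).
output_array = weight[input_array]

# The same result via one-hot encoding followed by a matrix product.
one_hot = np.eye(n_categories)[input_array]   # shape (32, 2, 10)
via_matmul = one_hot @ weight                 # shape (32, 2, 3)

# Flatten the two 3-d vectors of each sample into one 6-d feature row.
flat = output_array.reshape(len(input_array), -1)

print(output_array.shape, flat.shape)
print(np.allclose(output_array, via_matmul))
```

In this toy example both features share one weight table, because a single Embedding layer handles both columns. The entity embedding approach gives each categorical feature its own table.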

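1. Build one embedding layer for each categorical feature, then concatenate the outputs of the embedding layers.

2. Train the network. The outputs of the trained embedding layers then replace the one-hot encoding of the categorical features; because they are learned from the task, they carry more information than one-hot vectors.

The network in "Entity Embeddings of Categorical Variables" is very simple: the embedding layers are followed by two fully connected layers. The code is written in Keras with the Sequential model, and the model-building part is short.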
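
As a rough illustration of this structure, the following NumPy sketch runs one forward pass: a lookup in a separate table per categorical feature, concatenation, then two fully connected layers and an output layer. All sizes (cardinalities, embedding dimensions, hidden widths) are hypothetical choices for the example, not the values used in the paper, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical setup: two categorical features with 7 and 12 distinct values.
cardinalities = [7, 12]
embed_dims = [4, 6]

# One embedding table per categorical feature (random here; learned in practice).
tables = [
    rng.normal(scale=0.05, size=(n, d)) for n, d in zip(cardinalities, embed_dims)
]


def relu(x):
    return np.maximum(x, 0.0)


def forward(cat_inputs, tables, dense_params):
    # Step 1: look up each feature in its own table, then concatenate the vectors.
    embedded = [table[idx] for table, idx in zip(tables, cat_inputs)]
    x = np.concatenate(embedded, axis=1)
    # Step 2: two fully connected hidden layers followed by a linear output.
    (w1, b1), (w2, b2), (w3, b3) = dense_params
    h = relu(x @ w1 + b1)
    h = relu(h @ w2 + b2)
    return h @ w3 + b3


# Layer widths: concatenated embeddings -> 16 -> 8 -> 1 (arbitrary choices).
sizes = [sum(embed_dims), 16, 8, 1]
dense_params = [
    (rng.normal(scale=0.1, size=(i, o)), np.zeros(o))
    for i, o in zip(sizes[:-1], sizes[1:])
]

batch = 5
cat_inputs = [rng.integers(0, n, size=batch) for n in cardinalities]
pred = forward(cat_inputs, tables, dense_params)
print(pred.shape)
```

During training, gradients flow back through the dense layers into the rows of each table that were looked up, which is how the embeddings come to reflect the prediction task.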

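Several points in the paper's analysis are worth noting.

After projecting the embedding vectors of the store locations to two dimensions with t-SNE, the result closely resembles the locations on the map.
Using the learned embedding vectors as input features can improve the accuracy of other algorithms (KNN, random forest, GBDT).
The authors explore the connection between embeddings and metric spaces, trying to explain what embeddings do at a mathematical level.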
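
A small sketch can hint at why embeddings help distance-based methods such as KNN. With one-hot vectors every pair of distinct categories is equally far apart, while embedding vectors can place related categories close together. The embedding values below are invented for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical 2-d embeddings for four categories (values made up for illustration).
names = ["A", "B", "C", "D"]
emb = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [-0.7, 0.5],
    [-0.6, 0.6],
])


def pairwise_dist(x):
    """Euclidean distance between every pair of rows."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# One-hot: all distinct categories are exactly sqrt(2) apart.
one_hot_dist = pairwise_dist(np.eye(len(names)))

# Embeddings: distances differ, so "nearest category" becomes meaningful.
emb_dist = pairwise_dist(emb)
masked = emb_dist + np.diag([np.inf] * len(names))  # ignore self-distance
nearest = [names[j] for j in masked.argmin(axis=1)]

print(one_hot_dist.round(3))
print(emb_dist.round(3))
print(nearest)
```

With these made-up values, A and B are each other's nearest neighbours, as are C and D, a structure that one-hot encoding cannot express.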

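3 Code practice

The authors' code: https://github.com/entron/entity-embedding-rossmann
My own attempt: https://github.com/yanqiangmiffy/Data-Finance-Cup/, which concatenates the embedding layers of the categorical features with a fully connected layer for the numerical features: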


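from tensorflow.keras.layers import Concatenate, Dense, Dropout, Embedding, Input, Reshape
from tensorflow.keras.models import Model


def build_embedding_network():
    # embed_cols (names of the categorical columns) and col_vals_dict (column name ->
    # its distinct values) are expected to be defined at module level.
    inputs = []
    embeddings = []
    for col in embed_cols:
        cate_input = Input(shape=(1,))
        input_dim = len(col_vals_dict[col])
        # Embedding size: 50 for columns with more than 1000 distinct values,
        # otherwise half the number of distinct values plus one.
        if input_dim > 1000:
            output_dim = 50
        else:
            output_dim = input_dim // 2 + 1

        embedding = Embedding(input_dim, output_dim)(cate_input)
        # (batch, 1, output_dim) -> (batch, output_dim)
        embedding = Reshape(target_shape=(output_dim,))(embedding)
        inputs.append(cate_input)
        embeddings.append(embedding)

    # The 4 numeric features pass through their own dense layer.
    input_numeric = Input(shape=(4,))
    embedding_numeric = Dense(5)(input_numeric)
    inputs.append(input_numeric)
    embeddings.append(embedding_numeric)

    # Concatenate all branches, then two hidden layers with dropout.
    x = Concatenate()(embeddings)
    x = Dense(300, activation='relu')(x)
    x = Dropout(.35)(x)
    x = Dense(100, activation='relu')(x)
    x = Dropout(.15)(x)
    # Sigmoid output for binary classification
    output = Dense(1, activation='sigmoid')(x)

    model = Model(inputs, output)

    model.compile(loss='binary_crossentropy', optimizer='rmsprop')

    return model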
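
The function above relies on `embed_cols` and `col_vals_dict`, which are defined elsewhere in the repository. The following sketch shows one plausible way to build them and to prepare the list of input arrays that a multi-input model of this shape expects. The toy columns and the helper names `embedding_output_dim` and `encode_inputs` are hypothetical, and the layout of `col_vals_dict` (column name mapped to a list of distinct values) is an assumption.

```python
import numpy as np

# Hypothetical toy data: two categorical columns with four rows each.
raw = {
    "city": ["bj", "sh", "bj", "gz"],
    "device": ["ios", "android", "ios", "ios"],
}
embed_cols = list(raw)

# Assumed layout: column name -> sorted list of its distinct values.
col_vals_dict = {c: sorted(set(raw[c])) for c in embed_cols}


def embedding_output_dim(n_values):
    """Same sizing rule as build_embedding_network."""
    return 50 if n_values > 1000 else n_values // 2 + 1


def encode_inputs(raw, numeric):
    """Return one (n, 1) integer array per categorical column, then the numeric block."""
    inputs = []
    for c in embed_cols:
        # Map each category value to its index in col_vals_dict[c].
        index = {v: i for i, v in enumerate(col_vals_dict[c])}
        inputs.append(np.array([index[v] for v in raw[c]]).reshape(-1, 1))
    inputs.append(np.asarray(numeric, dtype="float32"))
    return inputs


numeric = np.zeros((4, 4))  # four numeric features, matching Input(shape=(4,))
model_inputs = encode_inputs(raw, numeric)

for c in embed_cols:
    print(c, len(col_vals_dict[c]), embedding_output_dim(len(col_vals_dict[c])))
print([a.shape for a in model_inputs])
```

The resulting list can be passed to `model.fit` or `model.predict` in the same order in which the inputs were appended inside `build_embedding_network`. Category values unseen when `col_vals_dict` was built would raise a `KeyError` here and need separate handling.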

4 References

Categorical Feature Processing and Entity Embedding (Zhang Yuepeng's blog, CSDN)
Entity Embedding (Vectorization): Handling Structured Data with Deep Learning (Zhihu)
An Introduction to Using Entity Embeddings of Categorical Variables | by David Heffernan | Medium
On learning embeddings for categorical data using Keras
Learning Entity Embeddings in one breath
A Summary of Encoding Categorical Features on Kaggle (Zhihu)
Handling Categorical Features with a Neural Network's Embedding Layer
Using Deep Learning for Structured Data with Entity Embeddings
