Digging into the Implementation of the FM Algorithm


The FM model grew out of LR and POLY2. Compared with POLY2, the key difference is that the single weight coefficient w_{i,j} is replaced by the inner product <v_i, v_j> of two vectors. Concretely, FM learns a latent vector for each feature, and when crossing two features, the inner product of their latent vectors serves as the weight of the cross feature. For more background, see "FM算法解析" (reference 2 below).

FM(w, x) = w_0 + \sum_{i=1}^{n} w_ix_i + \sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{<v_i,v_j>x_ix_j}}

where:

\begin{align*} \sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{<v_i,v_j>x_ix_j}} &= \frac{1}{2}\sum_{i=1}^{n}{\sum_{j=1}^{n}{<v_i,v_j>x_ix_j}} - \frac{1}{2} {\sum_{i=1}^{n}{<v_i,v_i>x_ix_i}} \\ &= \frac{1}{2} \left( \sum_{i=1}^{n}{\sum_{j=1}^{n}{\sum_{f=1}^{k}{v_{i,f}v_{j,f}x_ix_j}}} - \sum_{i=1}^{n}{\sum_{f=1}^{k}{v_{i,f}v_{i,f}x_ix_i}} \right) \\ &= \frac{1}{2}\sum_{f=1}^{k}{\left[ \left( \sum_{i=1}^{n}{v_{i,f}x_i} \right) \cdot \left( \sum_{j=1}^{n}{v_{j,f}x_j} \right) - \sum_{i=1}^{n}{v_{i,f}^2 x_i^2} \right]} \\ &= \frac{1}{2}\sum_{f=1}^{k}{\left[ \left( \sum_{i=1}^{n}{v_{i,f}x_i} \right)^2 - \sum_{i=1}^{n}{v_{i,f}^2 x_i^2} \right]} \end{align*}

In essence, FM's use of latent vectors mirrors how matrix factorization represents users and items with latent vectors. FM takes the matrix-factorization idea one step further, extending it from just a user embedding and an item embedding to an embedding for every feature.

The latent vectors are what let FM handle sparse data well. For example, in a product-recommendation scenario where each sample has two features, channel and brand, suppose one training sample has the feature combination (ESPN, Adidas). In POLY2, the model can learn a weight for this combination only when ESPN and Adidas co-occur in a training sample. In FM, ESPN's latent vector can also be updated by a sample like (ESPN, Nike), and Adidas's latent vector by a sample like (NBC, Adidas), which greatly reduces the model's dependence on dense co-occurrence. Moreover, for a combination never seen in training, such as (NBC, Nike), the model has already learned latent vectors for NBC and Nike separately, so it can still compute a weight for the pair; generalization improves substantially. A minimal numeric sketch of this last point follows below.
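A sketch with made-up latent vectors (the values below are purely illustrative, not learned weights): even for the unseen pair (NBC, Nike), FM can produce a cross weight from the two separately learned vectors.

import numpy as np

# Hypothetical learned latent vectors (illustrative values only)
v_nbc = np.array([0.2, -0.5, 0.1, 0.7])
v_nike = np.array([0.4, 0.3, -0.2, 0.5])

# FM scores the never-seen pair (NBC, Nike) via the inner product <v_NBC, v_Nike>
w_nbc_nike = np.dot(v_nbc, v_nike)
print(w_nbc_nike)   # ≈ 0.26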

Code plan (the snippets below target TensorFlow 1.x graph mode, executed with sessions):

1. First compute the cross term the naive way, i.e. \sum_{i=1}^{n-1}{\sum_{j=i+1}^{n}{<v_i,v_j>x_ix_j}}

# Toy input: 1 sample with 3 feature latent vectors of dimension 4
# (the x_i factors are implicitly 1 here)
a = [
        [[0., 1., 6., 11.], 
         [1., 2., 3., 4.], 
         [4., 5., 6., 1.]]
    ]
from itertools import combinations
import tensorflow as tf
from tensorflow.python.keras.layers import Dot


def raw_fm_cross_layer(embedding_list):
    """Sum <v_i, v_j> over all i < j pairs, the naive O(n^2) way."""
    dot_list = []
    
    for i, j in combinations(embedding_list, 2):
        i = tf.convert_to_tensor([i])
        j = tf.convert_to_tensor([j])
        
        # Dot(axes=1) computes the inner product along the embedding axis
        cur_dot_value = Dot(axes=1)([i, j])
        dot_list.append(cur_dot_value)
        
    return sum(dot_list)
# Compute and inspect the result

embedding_list = a[0]
raw_fm_result = raw_fm_cross_layer(embedding_list)
 
sess = tf.InteractiveSession()
raw_fm_result = sess.run(raw_fm_result)
print(raw_fm_result)
# [[152.]]
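As an independent sanity check (assuming NumPy is available), the same number falls out of a few lines of plain NumPy, computed both ways:

import numpy as np

v = np.array(a[0])                      # shape (3, 4): 3 latent vectors, k = 4
# Naive pairwise sum of inner products
naive = sum(np.dot(v[i], v[j]) for i in range(3) for j in range(i + 1, 3))
# Vectorized identity: 0.5 * sum_f [(sum_i v_if)^2 - sum_i v_if^2]
vectorized = 0.5 * (np.sum(v, axis=0) ** 2 - np.sum(v ** 2, axis=0)).sum()
print(naive, vectorized)                # 152.0 152.0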

2. Then compute the cross term via the latent-vector identity, i.e. \frac{1}{2}\sum_{f=1}^{k}{\left[ \left( \sum_{i=1}^{n}{v_{i,f}x_i} \right)^2 - \sum_{i=1}^{n}{v_{i,f}^2 x_i^2} \right]}

import tensorflow as tf


def fm_cross_layer(embedding_list):
    """Vectorized cross term: 0.5 * sum_f [(sum_i v_{i,f} x_i)^2 - sum_i v_{i,f}^2 x_i^2]."""
    # (sum of embeddings)^2, summed over the field axis
    square_of_sum = tf.square(tf.reduce_sum(
                embedding_list, axis=1, keepdims=True))

    # sum of (embeddings^2) over the field axis
    sum_of_square = tf.reduce_sum(
                embedding_list * embedding_list, axis=1, keepdims=True)

    cross_term = square_of_sum - sum_of_square
    cross_term = 0.5 * tf.reduce_sum(cross_term, axis=2, keepdims=False)
    
    return cross_term

Check that the result matches:

# Compute and inspect the result

embedding_list = tf.convert_to_tensor(a)
fm_result = fm_cross_layer(embedding_list)

sess = tf.InteractiveSession()
fm_result = sess.run(fm_result)
print(fm_result)

# [[152.]]

As expected, method 1 and method 2 produce identical results.

Next, let's apply the FM algorithm to a real dataset.

I. Prepare the data

The Titanic dataset task is to predict, from passenger attributes, whether a passenger survived after the ship struck an iceberg and sank.

Structured data is usually preprocessed with a Pandas DataFrame.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df_data = pd.read_csv('data/train.csv')

Titanic dataset download: https://www.kaggle.com/c/titanic/data
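Before preprocessing, a quick look at column types and missing values (an optional sanity check):

# Optional sanity check: dtypes and missing-value counts
print(df_data.dtypes)
print(df_data.isnull().sum())   # Age, Cabin, and Embarked contain NaN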

Field descriptions:

  • Survived: 0 = died, 1 = survived  [target label]
  • Pclass: ticket class, one of 1, 2, 3  [categorical]
  • Name: passenger name  [dropped]
  • Sex: passenger sex  [categorical]
  • Age: passenger age (has missing values)  [numeric]
  • SibSp: number of siblings/spouses aboard (integer)  [numeric]
  • Parch: number of parents/children aboard (integer)  [numeric]
  • Ticket: ticket number (string)  [dropped]
  • Fare: ticket fare (float, roughly 0-500)  [numeric]
  • Cabin: cabin number (has missing values)  [categorical]
  • Embarked: port of embarkation: S, C, Q (has missing values)  [categorical]
# Re-encode categorical variables as integer ids
# Fill missing values of numeric variables with 0

sparse_feature_list = ["Pclass", "Sex", "Cabin", "Embarked"]
dense_feature_list = ["Age", "SibSp", "Parch", "Fare"]


sparse_feature_reindex_dict = {}
for i in sparse_feature_list:
    cur_sparse_feature_list = df_data[i].unique()
    
    # Map each raw category value (including NaN, which unique() returns
    # for Cabin and Embarked) to an integer id starting from 1;
    # id 0 is left free, matching vocabulary_size = len + 1 below
    sparse_feature_reindex_dict[i] = dict(
        zip(cur_sparse_feature_list, range(1, len(cur_sparse_feature_list) + 1)))
    
    df_data[i] = df_data[i].map(sparse_feature_reindex_dict[i])


for j in dense_feature_list:
    df_data[j] = df_data[j].fillna(0)
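A quick illustrative peek at one learned encoding (the exact integer ids depend on row order in the CSV):

# Illustrative check of one encoding
print(sparse_feature_reindex_dict["Sex"])   # e.g. {'male': 1, 'female': 2}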
# Split the dataset

data = df_data[sparse_feature_list + dense_feature_list]
label = df_data["Survived"].values

xtrain, xtest, ytrain, ytest = train_test_split(data, label, test_size=0.2, random_state=2020)
# Package the features as name -> array dicts, matching the Input layer names
xtrain_data = {name: np.array(xtrain[name])
               for name in sparse_feature_list + dense_feature_list}

xtest_data = {name: np.array(xtest[name])
              for name in sparse_feature_list + dense_feature_list}

 

II. Build the model

(1) Load the Python modules

import tensorflow as tf
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.layers import Input, Embedding, \
    Dot, Flatten, Concatenate, Dense

from tensorflow.keras.models import Model
from tensorflow.python.keras.layers import Layer
from tensorflow.python.keras.initializers import Zeros
from tensorflow.python.keras.optimizers import Adam

from tensorflow.keras.utils import plot_model

(2) Define the input and Embedding layers for categorical features

def input_embedding_layer(
    shape=1,
    name=None,
    vocabulary_size=1,
    embedding_dim=1):
    """Create an Input layer and its Embedding lookup for one categorical feature."""
    input_layer = Input(shape=[shape, ], name=name)
    embedding_layer = Embedding(vocabulary_size, embedding_dim)(input_layer)
    
    return input_layer, embedding_layer
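A quick shape check of this helper (the feature name here is only for the demo):

# Demo: one categorical input with a vocabulary of 3 ids, embedded into 64 dims
demo_input, demo_embedding = input_embedding_layer(
    shape=1, name="demo_feature", vocabulary_size=3, embedding_dim=64)
print(demo_input.shape)       # (?, 1)
print(demo_embedding.shape)   # (?, 1, 64)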

(3) Define the linear layer, the FM second-order cross layer, and the prediction layer

class Linear(Layer):

    def __init__(self, l2_reg=0.0, mode=2, use_bias=True, **kwargs):

self.l2_reg = l2_reg
        if mode not in [0, 1, 2]:
            raise ValueError("mode must be 0, 1 or 2")
        self.mode = mode
        self.use_bias = use_bias
        super(Linear, self).__init__(**kwargs)

    def build(self, input_shape):
        if self.use_bias:
            self.bias = self.add_weight(name='linear_bias',
                                        shape=(1,),
                                        initializer=tf.keras.initializers.Zeros(),
                                        trainable=True)
            
        if self.mode == 1:
            self.kernel = self.add_weight(
                'linear_kernel',
                shape=[int(input_shape[-1]), 1],
                initializer=tf.keras.initializers.glorot_normal(),
                regularizer=tf.keras.regularizers.l2(self.l2_reg),
                trainable=True)
            
        elif self.mode == 2 :
            self.kernel = self.add_weight(
                'linear_kernel',
                shape=[int(input_shape[1][-1]), 1],
                initializer=tf.keras.initializers.glorot_normal(),
                regularizer=tf.keras.regularizers.l2(self.l2_reg),
                trainable=True)

        super(Linear, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, inputs, **kwargs):
        if self.mode == 0:
            sparse_input = inputs
linear_logit = tf.reduce_sum(sparse_input, axis=-1, keep_dims=True)
        elif self.mode == 1:
            dense_input = inputs
            fc = tf.tensordot(dense_input, self.kernel, axes=(-1, 0))
            linear_logit = fc
        else:
            sparse_input, dense_input = inputs
            fc = tf.tensordot(dense_input, self.kernel, axes=(-1, 0))
            linear_logit = tf.reduce_sum(sparse_input, axis=-1, keep_dims=False) + fc
        if self.use_bias:
            linear_logit += self.bias

        return linear_logit

    def compute_output_shape(self, input_shape):
        return (None, 1)

    def compute_mask(self, inputs, mask):
        return None

    def get_config(self, ):
        config = {'mode': self.mode, 'l2_reg': self.l2_reg,'use_bias':self.use_bias}
        base_config = super(Linear, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
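A minimal smoke test of the mode=2 path on dummy tensors (shapes chosen to mirror what the model below feeds in; this is a sketch, not part of the original pipeline):

# Dummy inputs: (batch, 1, concatenated embedding dims) and (batch, n_dense)
sparse_dummy = tf.ones((2, 1, 6))
dense_dummy = tf.ones((2, 4))
demo_logit = Linear(mode=2)([sparse_dummy, dense_dummy])

sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
print(sess.run(demo_logit).shape)   # (2, 1)

Next, the FM layer packages the vectorized cross term from fm_cross_layer above as a reusable Keras layer: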
class FM(Layer):
    """Factorization Machine models pairwise (order-2) feature interactions
     without linear term and bias.

      Input shape
        - 3D tensor with shape: ``(batch_size,field_size,embedding_size)``.

      Output shape
        - 2D tensor with shape: ``(batch_size, 1)``.

      References
        - [Factorization Machines](https://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf)
    """

    def __init__(self, **kwargs):

        super(FM, self).__init__(**kwargs)

    def build(self, input_shape):
        if len(input_shape) != 3:
            raise ValueError("Unexpected inputs dimensions % d,\
                             expect to be 3 dimensions" % (len(input_shape)))

        super(FM, self).build(input_shape)  # Be sure to call this somewhere!

    def call(self, inputs, **kwargs):

        if K.ndim(inputs) != 3:
            raise ValueError(
                "Unexpected inputs dimensions %d, expect to be 3 dimensions"
                % (K.ndim(inputs)))

        concated_embeds_value = inputs

        square_of_sum = tf.square(tf.reduce_sum(
            concated_embeds_value, axis=1, keep_dims=True))
        sum_of_square = tf.reduce_sum(
            concated_embeds_value * concated_embeds_value, axis=1, keep_dims=True)
        cross_term = square_of_sum - sum_of_square
        cross_term = 0.5 * tf.reduce_sum(cross_term, axis=2, keep_dims=False)

        return cross_term

    def compute_output_shape(self, input_shape):
        return (None, 1)
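Since FM has no trainable weights, it can be checked directly against the toy tensor `a` from earlier; it should reproduce the [[152.]] result:

# Consistency check: the FM layer matches the earlier fm_cross_layer output
fm_layer_result = FM()(tf.convert_to_tensor(a))

sess = tf.InteractiveSession()
print(sess.run(fm_layer_result))   # [[152.]]

Finally, the prediction layer adds a global bias and, for the binary task, applies a sigmoid: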
class PredictionLayer(Layer):
    """
      Arguments
         - **task**: str, ``"binary"`` for  binary logloss or  ``"regression"`` for regression loss

         - **use_bias**: bool.Whether add bias term or not.
    """

    def __init__(self, task='binary', use_bias=True, **kwargs):
        if task not in ["binary", "multiclass", "regression"]:
            raise ValueError("task must be binary, multiclass or regression")
        self.task = task
        self.use_bias = use_bias
        super(PredictionLayer, self).__init__(**kwargs)

    def build(self, input_shape):

        if self.use_bias:
            self.global_bias = self.add_weight(
                shape=(1,), initializer=Zeros(), name="global_bias")

        # Be sure to call this somewhere!
        super(PredictionLayer, self).build(input_shape)

    def call(self, inputs, **kwargs):
        x = inputs
        if self.use_bias:
            x = tf.nn.bias_add(x, self.global_bias, data_format='NHWC')
        if self.task == "binary":
            x = tf.sigmoid(x)

        output = tf.reshape(x, (-1, 1))

        return output

    def compute_output_shape(self, input_shape):
        return (None, 1)

    def get_config(self, ):
        config = {'task': self.task, 'use_bias': self.use_bias}
        base_config = super(PredictionLayer, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

(4) Define the FM model structure

def fm(sparse_feature_list, \
       sparse_feature_reindex_dict, \
       dense_feature_list, \
       task='binary'):
    
    sparse_input_layer_list = []
    sparse_embedding_layer_list = []
    
    dense_input_layer_list = []

    
    # 1. Input & Embedding sparse features
    for i in sparse_feature_list:
        shape = 1
        name = i
        vocabulary_size = len(sparse_feature_reindex_dict[i]) + 1
        embedding_dim = 64
        
        cur_sparse_feature_input_layer, cur_sparse_feature_embedding_layer = \
            input_embedding_layer(
                shape=shape,
                name=name,
                vocabulary_size=vocabulary_size,
                embedding_dim=embedding_dim)
        
        sparse_input_layer_list.append(cur_sparse_feature_input_layer)
        sparse_embedding_layer_list.append(cur_sparse_feature_embedding_layer)

    
    # 2. Input dense features
    for j in dense_feature_list:
        dense_input_layer_list.append(Input(shape=(1, ), name=j))
    
    
    # === linear ===
    # Concatenate along the embedding axis for the linear part
    sparse_linear_input = Concatenate(axis=-1)(sparse_embedding_layer_list)
    dense_linear_input = Concatenate(axis=-1)(dense_input_layer_list)
    
    linear_logit = Linear()([sparse_linear_input, dense_linear_input])
    
    
    # === fm cross ===
    # Stack per-feature embeddings along the field axis: (batch, n_fields, k)
    sparse_embedding_concat = Concatenate(axis=1)(sparse_embedding_layer_list)
    
    fm_cross_logit = FM()(sparse_embedding_concat)
    

    # === predict ===
    out = PredictionLayer(task)(tf.keras.layers.add([linear_logit, fm_cross_logit]))

    
    fm_model = Model(inputs= sparse_input_layer_list + dense_input_layer_list, outputs=out)
    
    
    return fm_model

(5) Instantiate the FM model

fm_model = fm(sparse_feature_list, \
              sparse_feature_reindex_dict, \
              dense_feature_list)

(6) Print the FM model summary

fm_model.summary()

(7) Export the FM model architecture diagram

plot_model(fm_model, to_file='fm_model.png')

(8) Compile and train the FM model

fm_model.compile(loss='binary_crossentropy', \
        optimizer=Adam(lr=1e-3), \
        metrics=['accuracy'])

history = fm_model.fit(xtrain_data, ytrain, epochs=5, batch_size=32, validation_data=(xtest_data, ytest))
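Accuracy aside, AUC is a customary metric for this kind of binary model; here is a quick post-training check with scikit-learn (assuming it is installed, as it already was for the train/test split):

from sklearn.metrics import roc_auc_score

# Predicted survival probabilities on the held-out split
ytest_pred = fm_model.predict(xtest_data).ravel()
print("test AUC:", roc_auc_score(ytest, ytest_pred))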

(9) Plot the loss curves

import matplotlib.pyplot as plt
 
loss = history.history['loss']
val_loss = history.history['val_loss']
epochs = range(1, len(loss) + 1)
plt.figure()
plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()
plt.show()

 

References:

1. 《深度学习推荐系统》 (Deep Learning Recommender System), by Wang Zhe (王喆). JD purchase link: https://u.jd.com/j7l2xP

2. FM算法解析,https://zhuanlan.zhihu.com/p/37963267

3. https://github.com/shenweichen/DeepCTR/

4. 结构化数据建模流程范例.md
