Recommender Systems - A Collaborative Recall Algorithm Based on FM

Latest recommended article published on 2022-05-06 11:45:39

马飞飞

Category: Recommender Systems in Practice

Article link: https://blog.csdn.net/maqunfi/article/details/99560292

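For the detailed theory, see: https://zhuanlan.zhihu.com/p/58160982

Notes

1. FM extends LR, a linear model, by additionally mining information from pairwise feature combinations. Because it is efficient, it can serve as either a recall model or a ranking model. With very large datasets it also has an advantage over models such as lightGBM: it can be trained in a streaming fashion and so can handle more training data. In the 2018 Tencent Advertising Algorithm Competition, for example, top contestants facing the large second-round dataset mostly used FM extensions such as FFM and DeepFFM.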

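2. FM adds pairwise feature interactions on top of the linear model. Its formula is:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} w_{ij}\, x_i x_j$$

This form has a serious problem. The user interaction matrix is usually sparse, and w_{ij} can only be learned from samples in which x_i and x_j are both nonzero, so the estimates of w are unreliable. FM therefore borrows the idea of matrix factorization and factorizes the interaction coefficients w as follows:

$$w_{ij} \approx \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i,f}\, v_{j,f}$$

The vectors obtained this way are called latent vectors; column j of the matrix V is the latent vector of feature j. The formula then becomes:

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$

Here <vi,vj> is the dot product and the latent vectors have length k (k<<n). After this change the number of second-order coefficients drops from n(n-1)/2 to kn. The latent-vector matrix has the form:

$$V = \begin{pmatrix} v_{1,1} & v_{2,1} & \cdots & v_{n,1} \\ v_{1,2} & v_{2,2} & \cdots & v_{n,2} \\ \vdots & \vdots & \ddots & \vdots \\ v_{1,k} & v_{2,k} & \cdots & v_{n,k} \end{pmatrix} = (v_1, v_2, \ldots, v_n) \in \mathbb{R}^{k \times n}$$

Combined through V, the new W can be written as:

$$\hat{W} = V^{T} V = \begin{pmatrix} v_1^{T} v_1 & v_1^{T} v_2 & \cdots & v_1^{T} v_n \\ v_2^{T} v_1 & v_2^{T} v_2 & \cdots & v_2^{T} v_n \\ \vdots & \vdots & \ddots & \vdots \\ v_n^{T} v_1 & v_n^{T} v_2 & \cdots & v_n^{T} v_n \end{pmatrix}$$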
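The factorization above can be checked on a toy example. The following is a minimal NumPy sketch, not part of the original code; the sizes `n` and `k` and the random `V` are made-up values chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 4                              # toy sizes: n features, latent length k
V = rng.normal(scale=0.1, size=(k, n))     # column j is the latent vector of feature j

W = V.T @ V                                # n x n matrix of implied interaction weights

# A single weight w_ij is the dot product of columns i and j of V.
i, j = 3, 57
w_ij = float(V[:, i] @ V[:, j])

free_pairwise = n * (n - 1) // 2           # independent w_ij without factorization
factorized = k * n                         # parameters stored in V
print(W.shape, free_pairwise, factorized)
```

With these toy sizes the unfactorized model would need 4950 pairwise weights, while V holds only 400 numbers; the gap widens as n grows.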

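3. The algorithm has two main steps:

(1) Convert the interaction data into its embedding form. The users and items that appear in the data are indexed to build the col_ix array, which is combined with the data and row_ix arrays and compressed with csr_matrix, giving the sparse vector for each user-item interaction row.

(2) Feed each row's vector into the FM model, which computes the interaction below. [The core idea of FM: pairwise interaction between features x_i and x_j.]

$$\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j$$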
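Step (1) can be illustrated on a few made-up interactions. The sketch below is an assumption-laden stand-in, not the article's vectorize_dic: the helper name `one_hot_interactions` is hypothetical, and it lays the columns out as a user block followed by an item block, whereas vectorize_dic assigns columns in order of first appearance. The idea is the same: every interaction becomes a sparse row with one 1 for the user and one 1 for the item.

```python
import numpy as np
from scipy.sparse import csr_matrix

def one_hot_interactions(users, items):
    """Encode each (user, item) pair as a sparse row holding two ones."""
    users = np.asarray(users)
    items = np.asarray(items)
    # return_inverse maps every raw id to its position among the sorted unique ids
    user_ids, user_cols = np.unique(users, return_inverse=True)
    item_ids, item_cols = np.unique(items, return_inverse=True)
    n = len(users)
    rows = np.repeat(np.arange(n), 2)          # two nonzeros per row
    cols = np.empty(2 * n, dtype=int)
    cols[0::2] = user_cols                     # user block comes first
    cols[1::2] = len(user_ids) + item_cols     # item block is offset behind it
    data = np.ones(2 * n)
    shape = (n, len(user_ids) + len(item_ids))
    return csr_matrix((data, (rows, cols)), shape=shape), user_ids, item_ids

# Three toy interactions: user 10 with items 7 and 9, user 42 with item 7.
X, user_ids, item_ids = one_hot_interactions([10, 10, 42], [7, 9, 7])
print(X.toarray())
```

Only the 2n nonzero entries are stored, which is what makes the CSR form practical when there are thousands of users and items.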

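Read directly from the original formula, however, second-order (2-order) feature combination over n distinct features crosses every pair of features, which suggests that the time complexity of FM is quadratic in n. If the story ended there, FM would hardly be used so widely in industry: in real applications n is usually enormous, and a model that is quadratic in n would find few takers.

The second-order term can therefore be rewritten as follows; for the full derivation see https://zhuanlan.zhihu.com/p/58160982

$$\sum_{i=1}^{n-1}\sum_{j=i+1}^{n} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2}\sum_{f=1}^{k}\left[\left(\sum_{i=1}^{n} v_{i,f}\, x_i\right)^{2} - \sum_{i=1}^{n} v_{i,f}^{2}\, x_i^{2}\right]$$

This brings the cost of the second-order term down to O(kn), a large gain in efficiency. Feeding in the encoded features and training the model to predict (the code below regresses the rating with a squared-error loss) gives the final recommendation model.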
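The rewritten second-order term can be verified numerically. The following is a small NumPy sketch with random toy data; the function names `pairwise_naive` and `pairwise_fast` are made up for this illustration and do not appear in the original code.

```python
import numpy as np

def pairwise_naive(x, V):
    """Sum <v_i, v_j> x_i x_j over all pairs i < j (quadratic in n)."""
    n = x.shape[0]
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += (V[:, i] @ V[:, j]) * x[i] * x[j]
    return total

def pairwise_fast(x, V):
    """Same quantity through the O(kn) identity."""
    s = V @ x                       # shape (k,): sum_i v_{i,f} x_i for each f
    s2 = (V ** 2) @ (x ** 2)        # shape (k,): sum_i v_{i,f}^2 x_i^2 for each f
    return 0.5 * float(np.sum(s ** 2 - s2))

rng = np.random.default_rng(0)
n, k = 50, 4
x = rng.normal(size=n)
V = rng.normal(size=(k, n))         # column j is the latent vector of feature j
print(np.isclose(pairwise_naive(x, V), pairwise_fast(x, V)))
```

The fast version is the same expression that the TensorFlow graph later in this post computes for a whole batch with two matrix products.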

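Code

Download the dataset here: https://pan.baidu.com/s/19dvx42ZJwSMUh_bXOAPNlw
Extraction code: znhn

The code follows. It has two main parts: converting the interaction data into its embedding form, and training the FM network. No feature-specific processing is applied to the different features.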

# -*- encoding:utf-8 -*-
from itertools import count
from collections import defaultdict
from scipy.sparse import csr_matrix
import numpy as np
import pandas as pd
from tqdm import tqdm
# The graph code below uses the TensorFlow 1.x API (placeholders, sessions).
# On TensorFlow 1.x itself, a plain `import tensorflow as tf` is enough.
import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()


def vectorize_dic(dic, ix=None, p=None, n=0, g=0):
    """
    Turn a dictionary of feature lists into a sparse one-hot matrix.

    dic -- dictionary of feature lists. Keys are the names of features
    ix -- index generator mapping each feature value to a column (default None)
    p -- dimension of feature space (number of columns in the sparse matrix) (default None)
    n -- number of samples (rows)
    g -- number of feature groups, i.e. len(dic)

    Feeding the raw id dictionary to the network is not practical, so each
    (value, feature name) pair gets its own column and every sample becomes
    a row with g ones. The mostly-zero result is stored compressed with
    scipy's csr_matrix (CSR: Compressed Sparse Row; csc_matrix is the
    column-oriented counterpart).

    The rows are then fed to the model as they are. Fields are not handled
    separately here.
    """
    if ix is None:
        # Hand out the next free column index the first time a key is seen.
        ix = defaultdict(count(0).__next__)

    nz = n * g  # number of nonzero entries

    col_ix = np.empty(nz, dtype=int)

    i = 0
    for k, lis in dic.items():
        for t in range(len(lis)):
            # Entry i + t*g belongs to sample t, feature group i.
            col_ix[i + t * g] = ix[str(lis[t]) + str(k)]
        i += 1

    row_ix = np.repeat(np.arange(0, n), g)
    data = np.ones(nz)
    if p is None:
        p = len(ix)

    # Drop features that were not seen when the index was built (e.g. test-only ids).
    ixx = np.where(col_ix < p)
    mat = csr_matrix((data[ixx], (row_ix[ixx], col_ix[ixx])), shape=(n, p))
    print('csr_matrix shape:', mat.shape)

    return mat, ix


def batcher(X_, y_=None, batch_size=-1):
    """Yield (X, y) mini-batches; y is None when no labels are given."""
    n_samples = X_.shape[0]

    if batch_size == -1:
        batch_size = n_samples  # one batch holding all samples
    if batch_size < 1:
        raise ValueError('Parameter batch_size={} is unsupported'.format(batch_size))

    for i in range(0, n_samples, batch_size):
        upper_bound = min(i + batch_size, n_samples)
        ret_x = X_[i:upper_bound]
        ret_y = None
        if y_ is not None:
            ret_y = y_[i:upper_bound]
        # Yield outside the `if` so unlabeled data is batched too.
        yield (ret_x, ret_y)


cols = ['user','item','rating','timestamp']

train = pd.read_csv('data/ua.base',delimiter='\t',names = cols)
test = pd.read_csv('data/ua.test',delimiter='\t',names = cols)

x_train,ix = vectorize_dic({'users':train['user'].values,
                            'items':train['item'].values},n=len(train.index),g=2)
print('x_train:',x_train)

x_test,ix = vectorize_dic({'users':test['user'].values,
                           'items':test['item'].values},ix,x_train.shape[1],n=len(test.index),g=2)


print(x_train)
y_train = train['rating'].values
y_test = test['rating'].values

# toarray() returns a plain ndarray; todense() would return the legacy np.matrix type.
x_train = x_train.toarray()
x_test = x_test.toarray()

print('x_train  todense:', x_train)

print(x_train.shape)
print (x_test.shape)


n, p = x_train.shape

k = 10  # length of each latent vector

# Inputs: one row of x per (user, item) interaction; y is the rating.
x = tf.placeholder('float', [None, p])
y = tf.placeholder('float', [None, 1])

w0 = tf.Variable(tf.zeros([1]))  # global bias
w = tf.Variable(tf.zeros([p]))   # first-order weights

# Latent vectors: column j of v belongs to feature j.
v = tf.Variable(tf.random_normal([k, p], mean=0, stddev=0.01))

linear_terms = tf.add(w0, tf.reduce_sum(tf.multiply(w, x), 1, keepdims=True))  # n * 1
# Second-order term in its O(kn) form: 0.5 * sum_f [(x v_f)^2 - x^2 v_f^2]
pair_interactions = 0.5 * tf.reduce_sum(
    tf.subtract(
        tf.pow(
            tf.matmul(x, tf.transpose(v)), 2),
        tf.matmul(tf.pow(x, 2), tf.transpose(tf.pow(v, 2)))
    ), axis=1, keepdims=True)

y_hat = tf.add(linear_terms, pair_interactions)

lambda_w = tf.constant(0.001, name='lambda_w')
lambda_v = tf.constant(0.001, name='lambda_v')

# Sum the two penalties separately: adding w (shape [p]) to v (shape [k, p])
# before reducing would broadcast w and count its penalty k times.
l2_norm = tf.add(
    tf.multiply(lambda_w, tf.reduce_sum(tf.pow(w, 2))),
    tf.multiply(lambda_v, tf.reduce_sum(tf.pow(v, 2)))
)

error = tf.reduce_mean(tf.square(y - y_hat))  # mean squared error on the ratings
loss = tf.add(error, l2_norm)


train_op = tf.train.GradientDescentOptimizer(learning_rate=0.01).minimize(loss)


epochs = 10
batch_size = 1000

# Launch the graph
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)

    for epoch in tqdm(range(epochs), unit='epoch'):
        # Shuffle each epoch. np.random.permutation returns a new array and leaves its input unchanged.
        perm = np.random.permutation(x_train.shape[0])
        # iterate over batches
        for bX, bY in batcher(x_train[perm], y_train[perm], batch_size):
            _,t = sess.run([train_op,loss], feed_dict={x: bX.reshape(-1, p), y: bY.reshape(-1, 1)})
            print(t)


    errors = []
    for bX, bY in batcher(x_test, y_test):
        errors.append(sess.run(error, feed_dict={x: bX.reshape(-1, p), y: bY.reshape(-1, 1)}))
        print(errors)
    RMSE = np.sqrt(np.array(errors).mean())
    print (RMSE)
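
The training script stops at the test RMSE. To use the model for recall, one option is to score every candidate item for a user and keep the highest-scoring ones. The following is only a sketch under stated assumptions: the helper `top_n_items` is hypothetical, the parameters are assumed to have been exported as NumPy arrays (for example with `sess.run([w0, w, v])` inside the session), and `user_col` and `item_cols` are assumed to be the column indices that the encoding step assigned to the user and to the candidate items. The toy values below are made up.

```python
import numpy as np

def top_n_items(user_col, item_cols, w0, w, V, n_top=10):
    """Score candidate items for one user and return the best n_top.

    For a row with ones only at the user column u and one item column i,
    the FM prediction reduces to w0 + w[u] + w[i] + <V[:, u], V[:, i]>.
    """
    item_cols = np.asarray(item_cols)
    scores = w0 + w[user_col] + w[item_cols] + V[:, item_cols].T @ V[:, user_col]
    order = np.argsort(-scores)[:n_top]     # highest score first
    return item_cols[order], scores[order]

# Toy parameters: 5 columns, latent length 2 (illustrative values only).
w0_toy = 0.0
w_toy = np.zeros(5)
V_toy = np.array([[1.0, 0.0, 0.5, 2.0, -1.0],
                  [0.0, 0.0, 0.0, 0.0, 0.0]])
best_cols, best_scores = top_n_items(0, [2, 3, 4], w0_toy, w_toy, V_toy, n_top=2)
print(best_cols, best_scores)
```

Because the user's bias and latent vector are fixed within one query, the ranking is driven by the item bias plus one dot product per item, so the scoring is a single matrix-vector product over the candidates.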