1 - Factorization Machines (Steffen Rendle, 2010)


-大道至简-


Article link: https://blog.csdn.net/albert_dong_fang/article/details/74421896


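This post introduces a new model class, Factorization Machines (FM), which combines the advantages of Support Vector Machines (SVM) and factorization models. FMs can handle highly sparse data; in recommender systems in particular, they still work where SVMs cannot estimate parameters reliably. The FM model equation can be computed in linear time and optimized directly, which avoids the dual-form optimization of SVMs and the dependence on support vectors. FMs can also mimic many specialized factorization models such as SVD++, PITF and FPMC, which normally have to be tailored to a specific task. FMs are general predictors that work with any real-valued feature vector, which makes factorization models easier to use.

(Summary auto-generated by CSDN.)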

ABSTRACT

In this paper, we introduce Factorization Machines (FM) which are a new model class that combines the advantages of Support Vector Machines (SVM) with factorization models.

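As stated above, FM is a new model class that combines the advantages of SVMs and factorization models.

A factorization in the algebraic sense writes a function as a product of factors: $f(x) = q_1(x) q_2(x) q_3(x) \cdots q_t(x)$. The factorization models discussed in the paper apply the same idea to parameters: a large parameter matrix or tensor is represented as a product of low-rank factors.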

Like SVMs, FMs are a general predictor working with any real valued feature vector. In contrast to SVMs, FMs model all interactions between variables using factorized parameters. Thus they are able to estimate interactions even in problems with huge sparsity (like recommender systems) where SVMs fail.

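When the features are very sparse, as in recommender systems, SVMs can no longer estimate reliable parameters.
To solve this problem, FMs introduce factorized parameters, which are used to learn the interactions between features.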

We show that the model equation of FMs can be calculated in linear time and thus FMs can be optimized directly. So unlike nonlinear SVMs, a transformation in the dual form is not necessary and the model parameters can be estimated directly without the need of any support vector in the solution. We show the relationship to SVMs and the advantages of FMs for parameter estimation in sparse settings.

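FMs avoid the training drawbacks of non-linear SVMs (dual-form optimization and reliance on support vectors), and the model equation can be evaluated in time linear in the number of features, $O(n)$ for a fixed number of factors.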

On the other hand there are many different factorization models like matrix factorization, parallel factor analysis or specialized models like SVD++, PITF or FPMC. The drawback of these models is that they are not applicable for general prediction tasks but work only with special input data. Furthermore their model equations and optimization algorithms are derived individually for each task.

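Two drawbacks of factorization models:

They restrict the form of the input data; in recommender systems, for example, each input record has the form (uid, sid, score).

The model and its optimization method have to be tailored to the specific task.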

We show that FMs can mimic these models just by specifying the input data (i.e. the feature vectors). This makes FMs easily applicable even for users without expert knowledge in factorization models.

I. INTRODUCTION

Support Vector Machines are one of the most popular predictors in machine learning and data mining. Nevertheless in settings like collaborative filtering, SVMs play no important role and the best models are either direct applications of standard matrix/ tensor factorization models like PARAFAC [1] or specialized models using factorized parameters [2], [3], [4].
In this paper, we show that the only reason why standard SVM predictors are not successful in these tasks is that they cannot learn reliable parameters (‘hyperplanes’) in complex (non-linear) kernel spaces under very sparse data.

Key point to watch for in the paper: how it shows that non-linear SVMs are unsuitable for sparse datasets.

On the other hand, the drawback of tensor factorization models and even more for specialized factorization models is that
(1) they are not applicable to standard prediction data (e.g. a real valued feature vector in $\mathbb{R}^n$ .)
(2) that specialized models are usually derived individually for a specific task requiring effort in modeling and design of a learning algorithm.

In this paper, we introduce a new predictor, the Factorization Machine (FM), that is a general predictor like SVMs but is also able to estimate reliable parameters under very high sparsity.
The factorization machine models all nested variable interactions (comparable to a polynomial kernel in SVM), but uses a factorized parametrization instead of a dense parametrization like in SVMs.

Polynomial kernel: the kernel function $K_n(X, X') = (1 + \gamma X^T X')^n$, where $\gamma > 0$.

A polynomial-kernel SVM uses a dense parametrization for its model parameters, whereas an FM uses a factorized parametrization.
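As a small illustration of the polynomial kernel in the note above, here is a minimal pure-Python sketch. The function name `polynomial_kernel` and its parameters `degree` (the exponent $n$) and `gamma` are names chosen here for illustration; they do not come from the paper.

```python
def polynomial_kernel(x, z, degree=2, gamma=1.0):
    """Polynomial kernel K(x, z) = (1 + gamma * <x, z>) ** degree, gamma > 0."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    # Inner product <x, z> of the two feature vectors.
    inner = sum(a * b for a, b in zip(x, z))
    return (1.0 + gamma * inner) ** degree


# Hypothetical example vectors: <x, z> = 0.5 + 0.0 + 2.0 = 2.5
x = [1.0, 0.0, 2.0]
z = [0.5, 3.0, 1.0]
print(polynomial_kernel(x, z, degree=2))  # (1 + 2.5) ** 2 = 12.25
```

With sparse vectors that share no non-zero coordinate, the inner product is 0 and the kernel value collapses to 1 regardless of the degree.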

We show that the model equation of FMs can be computed in linear time and that it depends only on a linear number of parameters. This allows direct optimization and storage of model parameters without the need of storing any training data (e.g. support vectors) for prediction. In contrast to this, non-linear SVMs are usually optimized in the dual form and computing a prediction (the model equation) depends on parts of the training data (the support vectors).
We also show that FMs subsume many of the most successful approaches for the task of collaborative filtering including biased MF, SVD++ [2], PITF [3] and FPMC [4].

In total, the advantages of our proposed FM are:
1) FMs allow parameter estimation under very sparse data where SVMs fail.
2) FMs have linear complexity, can be optimized in the primal and do not rely on support vectors like SVMs. We show that FMs scale to large datasets like Netflix with 100 millions of training instances.
3) FMs are a general predictor that can work with any real valued feature vector. In contrast to this, other state-of-the-art factorization models work only on very restricted input data. We will show that just by defining the feature vectors of the input data, FMs can mimic state-of-the-art models like biased MF, SVD++, PITF or FPMC.
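A short worked equation may help make the third point concrete. It uses the degree-2 model equation defined in Section III and assumes a feature vector that contains only a one-hot user indicator and a one-hot item indicator, so that $x_u = x_i = 1$ and every other entry is zero. Under that assumption all pairwise terms except the user-item one vanish, and the FM takes the form of biased matrix factorization:

$\widehat{y}(\mathbf{x}) = w_0 + w_u + w_i + \left \langle \mathbf{v}_u, \mathbf{v}_i \right \rangle$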

II. PREDICTION UNDER SPARSITY

The most common prediction task is to estimate a function $y: \mathbb{R}^n \rightarrow T$ from a real valued feature vector $x \in \mathbb{R}^n$ to a target domain $T$ (e.g. $T = \mathbb{R}$ for regression or $T = \{+, -\}$ for classification). In supervised settings, it is assumed that there is a training dataset $D = \{(x^{(1)}, y^{(1)}),(x^{(2)}, y^{(2)}), \ldots\}$ of examples for the target function $y$ given. We also investigate the ranking task where the function $y$ with target $T = \mathbb{R}$ can be used to score feature vectors $x$ and sort them according to their score. Scoring functions can be learned with pairwise training data [5], where a feature tuple $(x^{(A)}, x^{(B)}) \in D$ means that $x^{(A)}$ should be ranked higher than $x^{(B)}$. As the pairwise ranking relation is antisymmetric, it is sufficient to use only positive training instances.

In this paper, we deal with problems where $x$ is highly sparse, i.e. almost all of the elements $x_i$ of a vector $x$ are zero. Let $m(x)$ be the number of non-zero elements in the feature vector $x$ and $\overline{m}_D$ be the average number of non-zero elements $m(x)$ of all vectors $x \in D$ . Huge sparsity ( $\overline{m}_D \ll n$ ) appears in many real-world data like feature vectors of event transactions (e.g. purchases in recommender systems) or text analysis (e.g. bag of word approach). One reason for huge sparsity is that the underlying problem deals with large categorical variable domains.
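The sparsity quantities $m(x)$ and $\overline{m}_D$ defined above can be computed directly. The following is a minimal sketch on a made-up toy dataset; the helper names `num_nonzero` and `mean_nonzero` and the data are assumptions for illustration only.

```python
def num_nonzero(x):
    """m(x): number of non-zero elements in the feature vector x."""
    return sum(1 for value in x if value != 0)


def mean_nonzero(dataset):
    """Average of m(x) over all feature vectors x in the dataset D."""
    return sum(num_nonzero(x) for x in dataset) / len(dataset)


# Hypothetical toy dataset D: three vectors with n = 8 features each.
D = [
    [1, 0, 0, 0, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0.5, 0.5],
    [0, 0, 1, 0, 0, 0, 0, 0],
]
n = len(D[0])
m_bar = mean_nonzero(D)
# Average number of non-zeros, and the fraction of entries that are non-zero.
print(m_bar, m_bar / n)
```

In realistic recommender data $n$ is in the order of the number of users plus items, while $\overline{m}_D$ stays at a handful of entries, which is what $\overline{m}_D \ll n$ expresses.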

Example 1 Assume we have the transaction data of a movie review system. The system records which user $u \in U$ rates a movie (item) $i \in I$ at a certain time $t \in \mathbb{R}$ with a rating $r \in \{1, 2, 3, 4, 5\}$. Let the users $U$ and items $I$ be:

$U = \{\text{Alice (A)}, \text{Bob (B)}, \text{Charlie (C)}, \ldots\}$

$I = \{\text{Titanic (TI)}, \text{Notting Hill (NH)}, \text{Star Wars (SW)}, \text{Star Trek (ST)}, \ldots\}$

Let the observed data $S$ be:

$S = \{(A, TI, \text{2010-1}, 5), (A, NH, \text{2010-2}, 3), (A, SW, \text{2010-4}, 1), (B, SW, \text{2009-5}, 4), (B, ST, \text{2009-8}, 5), (C, TI, \text{2009-9}, 1), (C, SW, \text{2009-12}, 5)\}$

An example for a prediction task using this data, is to estimate a function $\widehat{y}$ that predicts the rating behavior of a user for an item at a certain point in time.

Figure 1 shows one example of how feature vectors can be created from $S$ for this task. Here, first there are $|U|$ binary indicator variables (blue) that represent the active user of a transaction – there is always exactly one active user in each transaction $(u, i, t, r) \in S$, e.g. user Alice in the first one $(x^{(1)}_A = 1)$. The next $|I|$ binary indicator variables (red) hold the active item – again there is always exactly one active item (e.g. $x^{(1)}_{TI} = 1$). The feature vectors in Figure 1 also contain indicator variables (yellow) for all the other movies the user has ever rated. For each user, the variables are normalized such that they sum up to 1. E.g. Alice has rated Titanic, Notting Hill and Star Wars. Additionally the example contains a variable (green) holding the time in months starting from January, 2009. And finally the vector contains information of the last movie (brown) the user has rated before (s)he rated the active one – e.g. for $x^{(2)}$, Alice rated Titanic before she rated Notting Hill. In section V, we show how factorization machines using such feature vectors as input data are related to specialized state-of-the-art factorization models.

Fig. 1
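The construction described for Figure 1 can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not code from the paper: the function names are invented here, only the three users and four items named in the example are included, and the transactions in `S` are assumed to be listed in chronological order per user (as they are in the example).

```python
# Users and items in a fixed order (only those named in the example).
USERS = ["A", "B", "C"]
ITEMS = ["TI", "NH", "SW", "ST"]

# Observed transactions S as (user, item, (year, month), rating).
S = [
    ("A", "TI", (2010, 1), 5),
    ("A", "NH", (2010, 2), 3),
    ("A", "SW", (2010, 4), 1),
    ("B", "SW", (2009, 5), 4),
    ("B", "ST", (2009, 8), 5),
    ("C", "TI", (2009, 9), 1),
    ("C", "SW", (2009, 12), 5),
]


def months_since_jan_2009(year, month):
    """Time feature: January 2009 maps to 1, January 2010 to 13."""
    return (year - 2009) * 12 + month


def build_features(transactions):
    """Return (X, y): one feature vector and one rating per transaction."""
    # All items each user has ever rated (for the normalized block).
    rated = {u: [] for u in USERS}
    for u, i, _, _ in transactions:
        rated[u].append(i)

    X, y = [], []
    last_item = {}  # last item each user rated so far
    for u, i, (year, month), r in transactions:
        user_block = [1.0 if u == v else 0.0 for v in USERS]
        item_block = [1.0 if i == j else 0.0 for j in ITEMS]
        share = 1.0 / len(rated[u])  # normalize so the block sums to 1
        rated_block = [share if j in rated[u] else 0.0 for j in ITEMS]
        time_block = [float(months_since_jan_2009(year, month))]
        prev = last_item.get(u)  # None for a user's first rating
        last_block = [1.0 if prev == j else 0.0 for j in ITEMS]
        X.append(user_block + item_block + rated_block
                 + time_block + last_block)
        y.append(r)
        last_item[u] = i
    return X, y


X, y = build_features(S)
print(len(X[0]))  # 3 + 4 + 4 + 1 + 4 = 16 features
print(X[1])       # Alice rates Notting Hill; last rated movie was Titanic
```

Each vector has 16 entries here, but only a few are non-zero; with realistic numbers of users and items the vectors become far longer while the number of non-zeros per row stays about the same.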

We will use this example data throughout the paper for illustration. However please note that FMs are general predictors like SVMs and thus are applicable to any real valued feature vectors and are not restricted to recommender systems.

III. FACTORIZATION MACHINES (FM)

In this section, we introduce factorization machines. We discuss the model equation in detail and show shortly how to apply FMs to several prediction tasks.

A. Factorization Machine Model

1) Model Equation: The model equation for a factorization machine of degree $d = 2$ is defined as:

$\widehat{y}(\mathbf{x}) := w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \left \langle \mathbf{v}_i, \mathbf{v}_j \right \rangle x_i x_j$
where the model parameters that have to be estimated are:

$w_0 \in \mathbb{R}, \quad \mathbf{w} \in \mathbb{R}^n, \quad \mathbf{V} \in \mathbb{R}^{n \times k}$

A row $\mathbf{v}_i$ within $\mathbf{V}$ describes the $i$ -th variable with $k$ factors. $k \in \mathbb{N}^+_0$ is a hyperparameter that defines the dimensionality of the factorization.

A 2-way FM (degree $d = 2$ ) captures all single and pairwise interactions between variables:

$w_0$ is the global bias.
$w_i$ models the strength of the $i$ -th variable.
$\widehat{w}_{i,j} := \left \langle \mathbf{v}_i, \mathbf{v}_j \right \rangle$ models the interaction between the $i$ -th and $j$ -th variable. Instead of using an own model parameter $w_{i,j} \in \mathbb{R}$ for each interaction, the FM models the interaction by factorizing it. We will see later on, that this is the key point which allows high quality parameter estimates of higher order interactions ( $d \geq 2$ ) under sparsity.
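To make the degree-2 model equation concrete, here is a minimal pure-Python sketch. `fm_predict_naive` evaluates the equation exactly as written, in $O(kn^2)$. `fm_predict_linear` uses the algebraic identity behind the linear-time claim, $\sum_{i<j} \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_i v_{i,f} x_i \right)^2 - \sum_i v_{i,f}^2 x_i^2 \right]$, which costs $O(kn)$. The function names and the parameter values are made up for illustration; real parameters would be learned from data.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(p * q for p, q in zip(a, b))


def fm_predict_naive(x, w0, w, V):
    """Degree-2 FM evaluated term by term: O(k * n^2)."""
    n = len(x)
    y = w0 + dot(w, x)
    for i in range(n):
        for j in range(i + 1, n):
            y += dot(V[i], V[j]) * x[i] * x[j]
    return y


def fm_predict_linear(x, w0, w, V):
    """Same value in O(k * n) via the sum-of-squares identity."""
    n, k = len(x), len(V[0])
    y = w0 + dot(w, x)
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))            # sum_i v_if x_i
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))  # sum_i v_if^2 x_i^2
        y += 0.5 * (s * s - s_sq)
    return y


# Hypothetical parameters for n = 4 features and k = 2 factors.
x = [1.0, 0.0, 0.5, 2.0]
w0 = 0.1
w = [0.2, -0.3, 0.4, 0.0]
V = [[0.1, 0.2], [0.3, -0.1], [-0.2, 0.4], [0.5, 0.0]]

y_naive = fm_predict_naive(x, w0, w, V)
y_linear = fm_predict_linear(x, w0, w, V)
print(y_naive, y_linear)  # the two values agree up to rounding
```

The model stores $1 + n + nk$ parameters ($w_0$, $\mathbf{w}$ and $\mathbf{V}$), compared with the $n(n-1)/2$ separate weights $w_{i,j}$ that an unfactorized pairwise model would need. For sparse inputs both loops can additionally skip the zero entries of `x`.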