Brief notes
- Related course links
- Since the course is taught in English, the notes record the content in English, with brief explanations in Chinese.
- These study notes exist purely to deepen my own understanding of the material; if anything is wrong, I would be grateful for your tolerance and corrections.
- Many thanks to Professor Andrew Ng for his selfless dedication!!!
Terminology

| English | 中文 | English | 中文 |
|---|---|---|---|
| Recommender Systems | 推荐系统 | Collaborative filtering | 协同过滤 |
| feature learning | 特征学习 | low rank matrix factorization | 低秩矩阵分解 |
Recommender Systems
- Recommender systems are an important application of machine learning.
- Feature learning idea: for some problems, there are algorithms that can automatically learn a suitable set of features.
- n_u denotes the number of users; n_m denotes the number of movies.
- r(i, j) = 1 indicates that user j has rated movie i; y^(i,j) denotes the rating user j gave to movie i.
- The recommender-system problem: given the data r(i, j) and y^(i,j), find the movies that have not yet been rated and try to predict their star ratings.
Content-based recommendations
Problem formulation
- Treat each user's rating prediction as a separate linear regression problem → for each user j, learn a parameter vector θ(j); the predicted rating of user j for movie i is (θ(j))^T x(i).
- Parameters
- r(i, j) = 1 if user j has rated movie i (0 otherwise)
- y^(i,j) = rating given by user j to movie i (defined only where r(i, j) = 1)
- θ(j) = parameter vector for user j
- x(i) = feature vector for movie i
- m(j) = number of movies rated by user j
- For user j and movie i, the predicted rating is (θ(j))^T x(i); the goal is to learn θ(j).
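As a quick numeric sketch of this prediction rule (the numbers below are made up for illustration, not from the course):

```python
import numpy as np

# Hypothetical values, for illustration only.
# x(i): feature vector of movie i, e.g. degrees of [romance, action]
x_i = np.array([0.9, 0.1])
# theta(j): how strongly user j responds to each feature
theta_j = np.array([5.0, 0.0])

# Predicted rating of user j for movie i: (theta(j))^T x(i)
print(theta_j @ x_i)  # 4.5 -> user j is predicted to give about 4.5 stars
```

A user who loves romance (θ = [5, 0]) gets a high predicted rating for a strongly romantic movie.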
Optimization objective
- To learn θ(j) (the parameters for user j):
$$\min_{\theta^{(j)}}\frac{1}{2} \sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
Here $\sum_{i:r(i,j)=1}$ means summing over all i with r(i, j) = 1 → i.e., over all the movies that user j has rated.
- To learn θ(1), θ(2), … , θ(n_u):
$$\min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
- Gradient descent update:
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right) x_k^{(i)}\qquad (\text{for } k=0)$$

$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right) x_k^{(i)}+\lambda\theta_k^{(j)}\right)\qquad (\text{for } k\neq 0)$$
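A minimal numpy sketch of this per-user update, assuming a feature matrix `X` with an intercept column, ratings `y_j`, and a boolean mask `rated` marking where r(i, j) = 1 (all names here are mine, not from the course):

```python
import numpy as np

def gradient_step_user(theta_j, X, y_j, rated, alpha, lam):
    """One gradient descent step on theta(j) for a single user j.

    theta_j : (n+1,) parameter vector; theta_j[0] is the k = 0 term
    X       : (n_m, n+1) movie features with X[:, 0] == 1 (intercept)
    y_j     : (n_m,) ratings by user j (valid only where rated is True)
    rated   : (n_m,) boolean mask, True where r(i, j) = 1
    """
    # Error (theta(j))^T x(i) - y(i, j) on the movies user j rated
    err = X[rated] @ theta_j - y_j[rated]
    # Gradient summed over i : r(i, j) = 1
    grad = X[rated].T @ err
    # Regularize every component except k = 0
    grad[1:] += lam * theta_j[1:]
    return theta_j - alpha * grad
```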
Collaborative filtering
- Feature learning → the algorithm learns for itself which features to use.
- Suppose we have a dataset whose feature values are unknown. If we are given how much each user likes different genres of movies, θ(1), θ(2), … , θ(n_u), we can infer the feature values of each movie.
Optimization algorithm
- Given θ(1), θ(2), … , θ(n_u), to learn x(i):
$$\min_{x^{(i)}}\frac{1}{2} \sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{k=1}^n\left(x_k^{(i)}\right)^2$$
- Given θ(1), θ(2), … , θ(n_u), to learn x(1), … , x(n_m):
$$\min_{x^{(1)},\cdots,x^{(n_m)}}\frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n\left(x_k^{(i)}\right)^2$$
Collaborative filtering
- Given x(1), x(2), … , x(n_m), we can estimate θ(1), … , θ(n_u) → estimate θ(j) from x(i).
- Given θ(1), θ(2), … , θ(n_u), we can estimate x(1), … , x(n_m) → estimate x(i) from θ(j).
- Collaborative filtering → when the algorithm runs, it observes the actual behavior of many users, who collaboratively produce better rating predictions for everyone → if every user rates some subset of the movies, then every user is helping the algorithm learn better features, and the learned features in turn give better rating predictions for the other users.
Collaborative filtering algorithm
Given x(1), x(2), … , x(n_m), estimate θ(1), … , θ(n_u):
$$\min_{\theta^{(1)},\cdots,\theta^{(n_u)}}\frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
Given θ(1), θ(2), … , θ(n_u), estimate x(1), … , x(n_m):
$$\min_{x^{(1)},\cdots,x^{(n_m)}}\frac{1}{2} \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n\left(x_k^{(i)}\right)^2$$
Minimizing x(1), x(2), … , x(n_m) and θ(1), … , θ(n_u) simultaneously:
$$J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})=\frac{1}{2} \sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2} \sum_{i=1}^{n_m} \sum_{k=1}^n\left(x_k^{(i)}\right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n\left(\theta_k^{(j)}\right)^2$$
$$\min_{x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)}}J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})$$
- When the features are learned this way, the usual convention of an intercept feature x_0 = 1 can be dropped; x and θ are both simply n-dimensional.
- Initialize x(1), x(2), … , x(n_m), θ(1), … , θ(n_u) to small random values.
- Minimize J(x(1), x(2), … , x(n_m), θ(1), … , θ(n_u)) using gradient descent (or an advanced optimization algorithm).
- For every j = 1, … , n_u and i = 1, … , n_m → since there is no x_0 or θ_0, no special case for k = 0 is needed:
$$x_k^{(i)}:=x_k^{(i)}-\alpha\left(\sum_{j:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)\theta_k^{(j)}+\lambda x_k^{(i)}\right)$$
$$\theta_k^{(j)}:=\theta_k^{(j)}-\alpha\left(\sum_{i:r(i,j)=1}\left((\theta^{(j)})^T x^{(i)}-y^{(i,j)}\right)x_k^{(i)}+\lambda\theta_k^{(j)}\right)$$
- Given a user with learned parameters θ and a movie with learned features x, we can predict that user's rating for the movie as θ^T x.
- If user j has not yet rated movie i, we can predict that user j will rate movie i as (θ(j))^T x(i). A sketch of the whole procedure follows below.
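Putting the pieces together, a minimal sketch of the full algorithm under the conventions above (random initialization, simultaneous gradient updates, no intercept terms); the names `Y`, `R` and all hyperparameter values are illustrative:

```python
import numpy as np

def collaborative_filtering(Y, R, n_features=10, alpha=0.005, lam=0.1, iters=1000):
    """Learn movie features X and user parameters Theta from ratings.

    Y : (n_m, n_u) matrix with Y[i, j] = y(i, j) where defined (else 0)
    R : (n_m, n_u) indicator matrix, R[i, j] = 1 iff user j rated movie i
    """
    n_m, n_u = Y.shape
    rng = np.random.default_rng(0)
    # 1. Initialize x(1..n_m) and theta(1..n_u) to small random values
    X = rng.normal(scale=0.01, size=(n_m, n_features))
    Theta = rng.normal(scale=0.01, size=(n_u, n_features))

    for _ in range(iters):
        # Prediction error, kept only where r(i, j) = 1
        err = (X @ Theta.T - Y) * R
        # 2. Simultaneous regularized gradient steps (no k = 0 special case)
        X_grad = err @ Theta + lam * X
        Theta_grad = err.T @ X + lam * Theta
        X -= alpha * X_grad
        Theta -= alpha * Theta_grad
    return X, Theta

# 3. Predict user j's rating of movie i as (theta(j))^T x(i),
#    i.e. entry (i, j) of X @ Theta.T.
```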
Vectorization: Low rank matrix factorization
- The matrix Y contains all the ratings data, including the missing entries (the question marks).
- y(i,j) is the rating given by user j to movie i.
- The matrix of predicted ratings can be written as a single product XΘ^T → this vectorized form is called low rank matrix factorization.
$$X=\begin{bmatrix} (x^{(1)})^T \\ (x^{(2)})^T \\ \vdots \\ (x^{(n_m)})^T \end{bmatrix} \qquad \Theta=\begin{bmatrix} (\theta^{(1)})^T \\ (\theta^{(2)})^T \\ \vdots \\ (\theta^{(n_u)})^T \end{bmatrix}$$
$$X\Theta^T=\begin{bmatrix} (\theta^{(1)})^T x^{(1)} & (\theta^{(2)})^T x^{(1)} & \cdots & (\theta^{(n_u)})^T x^{(1)} \\ (\theta^{(1)})^T x^{(2)} & (\theta^{(2)})^T x^{(2)} & \cdots & (\theta^{(n_u)})^T x^{(2)} \\ \vdots & \vdots & \ddots & \vdots \\ (\theta^{(1)})^T x^{(n_m)} & (\theta^{(2)})^T x^{(n_m)} & \cdots & (\theta^{(n_u)})^T x^{(n_m)} \end{bmatrix}$$
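In code, the whole predicted-rating matrix is then a single product. A small sketch with hypothetical learned matrices, following the stacking convention above (rows of X are (x^(i))^T, rows of Θ are (θ^(j))^T):

```python
import numpy as np

# Illustrative learned values: 4 movies x 2 features, 3 users x 2 features
X = np.array([[0.9, 0.0],
              [1.0, 0.1],
              [0.1, 1.0],
              [0.0, 0.9]])
Theta = np.array([[5.0, 0.0],
                  [0.0, 5.0],
                  [2.5, 2.5]])

# Entry (i, j) of X @ Theta.T equals (theta(j))^T x(i): all predicted
# ratings at once, as a low rank (here rank <= 2) factorization
predictions = X @ Theta.T
print(predictions.shape)  # (4, 3) = (n_m, n_u)
```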
Mean normalization
- If a new user has rated nothing, the learned parameter vector θ ends up all zeros and every predicted rating is 0 → no movie has a higher predicted score to recommend to the new user.
- Compute the mean rating μ_i of each movie and subtract it from that movie's ratings → normalize each movie's ratings to have mean 0.
- The predicted rating of user j for movie i then becomes
$$(\theta^{(j)})^T x^{(i)}+\mu_i$$
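A minimal sketch of mean normalization, with `Y` and `R` as in the earlier sketch; each movie's mean μ_i is computed only over the ratings that actually exist:

```python
import numpy as np

def mean_normalize(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    counts = R.sum(axis=1)                            # how many users rated movie i
    mu = (Y * R).sum(axis=1) / np.maximum(counts, 1)  # mean rating of movie i
    Y_norm = (Y - mu[:, None]) * R                    # zero-mean ratings, 0 where unrated
    return Y_norm, mu

# Learn X, Theta on Y_norm, then predict (theta(j))^T x(i) + mu[i].
# A brand-new user (theta = 0) is then predicted each movie's mean mu[i]
# instead of 0 for everything.
```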
Quotes from Professor Andrew Ng
- “Over the last few years, occasionally I visit different, you know, technology companies here in Silicon Valley and I often talk to people working on machine learning applications there and so I’ve asked people what are the most important applications of machine learning or what are the machine learning applications that you would most like to get an improvement in the performance of. And one of the most frequent answers I heard was that there are many groups out in Silicon Valley now, trying to build better recommender systems.”