Andrew Ng Machine Learning Notes: A Summary


This article is original work by CDFMLR, hosted on my personal site https://clownote.github.io and also published on CSDN. I cannot guarantee the formatting on CSDN is correct; please visit clownote for a better reading experience.


Machine Learning

Emmmm, this semester I finished Andrew Ng's Machine Learning course on Coursera. I had never taken this course very seriously, but I accidentally signed up, then went ahead and applied for financial aid, which meant I could get the certificate (normally sold for a few hundred yuan) for free after finishing, and the course even comes with licensed Matlab Online. This string of "coincidences" (read: bargain hunting) got me to actually work through the course. The more I learned, the more I had to admit: it's good, really good! It truly is an excellent introduction to machine learning; no wonder so many people recommend it.

The course notes on Coursera include a summary for every chapter, and they are very well written; I recommend reading them after you finish. But I still prefer writing my own, so while watching the videos I took notes covering almost the whole course. Honestly a lot of it was just copying the instructor's words and slides 😂, which I treated as practice for typing, English, and LaTeX. On the train home for the holidays, bored and on a whim, I decided I should organize what I learned in the course, and that is how this article came about.

Here I mainly wrote descriptions of the various algorithms and extracted rough code implementations from the programming assignments, mostly for quick reference later. Regression at the start is fairly simple, so I wrote very little and just piled up some formulas (actually, the air conditioning above the top sleeper bunk was so cold I got sick and couldn't think straight 😷); the later parts such as SVMs and recommender systems are more complex, so I wrote more (actually because I didn't master them well and couldn't distill the key points 🤯). As for the parts the instructor spent great effort on, namely machine learning system design, optimization, debugging, all sorts of practical tricks, and the Octave tutorial, I skip them entirely; you really need to watch the lectures to appreciate them (...an "outstanding" student certainly cannot admit the real reason is laziness 😎).

Since I watched the whole course without Chinese subtitles and rarely consult Chinese references, I am not sure how many of the terms are usually translated; while writing this article I looked up a few I couldn't express, and for the rest I guessed, so I don't guarantee the terminology is standard.

Emmm, I have accidentally written a few hundred words of rambling 😂. Let's get started.

Supervised Learning

In supervised learning, the training data includes both the inputs x and the labels y.

[figure]

Regression

Prediction where the target is a continuous value (e.g., in the interval [0,100]).

Mathematical model:

  • Hypothesis:

h_\theta(x)= \sum_{i=0}^n \theta_i x_i = \left[\begin{array}{cccc}\theta_0 & \theta_1 & \ldots & \theta_n\end{array}\right] \left[\begin{array}{c}x_0 \\ x_1 \\ \vdots \\ x_n \end{array}\right]= \theta^Tx

  • Parameters to learn: $\theta=\left[\begin{array}{cccc}\theta_0 & \theta_1 & \ldots & \theta_n\end{array}\right]$

  • Cost function:
    J(\theta)=\frac{1}{2m}\sum_{i=1}^m\left(h_\theta(x^{(i)})-y^{(i)}\right)^2 =\frac{1}{2m}(X\theta-y)^T(X\theta-y)

👉 Code implementation:

  function J = computeCostMulti(X, y, theta)
  %COMPUTECOSTMULTI Compute cost for linear regression with multiple variables
  %   J = COMPUTECOSTMULTI(X, y, theta) computes the cost of using theta as the
  %   parameter for linear regression to fit the data points in X and y
  
  m = length(y); % number of training examples
  
  J = 0;
  
  J = 1 / (2*m) * (X*theta - y)' * (X*theta - y);
  
  % or less vectorizedly: 
  % predictions = X * theta;
  % sqrErrors = (predictions - y) .^ 2;
  % J = 1 / (2*m) * sum(sqrErrors);
  
  end
  • Optimization objective: find the $\theta$ that minimizes $J$.

Methods for solving it:

Gradient Descent

(Here we only discuss batch gradient descent.)

\begin{array}{ll} \textrm{repeat until convergence } \{ \\ \qquad \theta_j:=\theta_j-\alpha\frac{\partial}{\partial\theta_j}J(\theta)\\ \qquad\quad:= \theta_j - \alpha \frac{1}{m} \sum^m_{i=1}[h_\theta(x^{(i)})-y^{(i)}] \cdot x_j^{(i)}\qquad \textrm{for }j:=0, ..., n \\ \} \end{array}

Vectorized: $\theta=\theta-\frac{\alpha}{m}X^T(h_\theta(X)-Y)$

👉 Code implementation:

function [theta, J_history] = gradientDescentMulti(X, y, theta, alpha, num_iters)
%GRADIENTDESCENTMULTI Performs gradient descent to learn theta
%   theta = GRADIENTDESCENTMULTI(x, y, theta, alpha, num_iters) updates theta by
%   taking num_iters gradient steps with learning rate alpha

m = length(y); % number of training examples
J_history = zeros(num_iters, 1);

for iter = 1:num_iters

    predictions = X * theta;
    errors = (predictions - y);
    theta = theta - alpha / m * (X' * errors);
    
    % Save the cost J in every iteration    
    J_history(iter) = computeCostMulti(X, y, theta);

end

end

Normal Equation

\theta = (X^TX)^{-1}X^Ty
👉 Code implementation:

function [theta] = normalEqn(X, y)
%NORMALEQN Computes the closed-form solution to linear regression 
%   NORMALEQN(X,y) computes the closed-form solution to linear 
%   regression using the normal equations.

theta = zeros(size(X, 2), 1);

theta = pinv(X' * X) * X' * y;

end

Here we use the pseudoinverse, so the computation still works even if $X^TX$ is singular. The usual causes of $X^TX$ being non-invertible are:

  1. Redundant features, i.e., two or more features are linearly dependent; keeping one and deleting the others solves it. (e.g., the size of a house in feet^2 and the size of the house in meter^2, where we know that 1 meter = 3.28 feet)
  2. Too many features ($m \le n$). Delete some less important features (or consider PCA).

Gradient Descent vs. Normal Equation

  • Gradient descent: needs a choice of alpha; needs many iterations; time complexity $O(kn^2)$; still works when n is very large.
  • Normal equation: no alpha to choose; no iteration; computing the pseudoinverse of $X^TX$ costs $O(n^3)$; very slow or infeasible when n is large.

In practice, when $n>10{,}000$ we usually prefer gradient descent; otherwise the normal equation generally performs better.

Note: Feature Scaling

We can make gradient descent run faster by keeping the input values within roughly the same range, for example by scaling all values into $[-1,1]$; we also want to avoid large differences in magnitude between inputs (e.g., having input values as far apart as 0.000001 and 1 hurts the efficiency of gradient descent).

In practice, we usually try to keep each feature's values roughly within $[-3,-\frac{1}{3}) \cup (+\frac{1}{3},+3]$.

To achieve this, we do the following:

  1. Feature scaling

\begin{array}{rl} \textrm{Range:} & s_i = max(x_i)-min(x_i)\\ \textrm{Or Range:} & s_i = \textrm{standard deviation of } x_i\\ \textrm{Scaling:} & x_i:=\frac{x_i}{s_i} \end{array}

  2. Mean normalization

\begin{array}{rl} \textrm{Mean (average):} & \mu_i = \frac{sum(x_i)}{m}\\ \textrm{Normalizing:} & x_i:=x_i-\mu_i \end{array}

Putting the two operations together:
x_i:=\frac{x_i-\mu_i}{s_i}
where $\mu_i$ is the average of the values of feature $i$ and $s_i$ is their range.

👉 Code implementation:

function [X_norm, mu, sigma] = featureNormalize(X)
%FEATURENORMALIZE Normalizes the features in X 
%   FEATURENORMALIZE(X) returns a normalized version of X where
%   the mean value of each feature is 0 and the standard deviation
%   is 1. This is often a good preprocessing step to do when
%   working with learning algorithms.

X_norm = X;
mu = zeros(1, size(X, 2));
sigma = zeros(1, size(X, 2));

mu = mean(X);
sigma = std(X);
X_norm = (X - mu) ./ sigma;

end

Classification

Prediction where the target takes one of a few discrete values (e.g., 0 or 1; or 0/1/2/3).

Logistic Regression

Hypothesis:
\left\{\begin{array}{l} h_\theta(x) = g(\theta^Tx)\\ z = \theta^T x\\ g(z) = \frac{1}{1+e^{-z}}\\ \end{array}\right.
where $g(z)$ is called the sigmoid function (or logistic function); its graph looks like this:

[figure: the sigmoid function]

👉 Code implementation:

function g = sigmoid(z)
%SIGMOID Compute sigmoid function
%   J = SIGMOID(z) computes the sigmoid of z.

g = 1.0 ./ (1.0 + exp(-z));
end

The above simplifies to:
h_\theta(x) = \frac{1}{1+e^{-\theta^Tx}}

The output of $h_\theta$ is the probability that the prediction is 1, and the following two relations hold:
h_\theta(x)=P(y=1 \mid x;\theta)=1-P(y=0 \mid x; \theta)

P(y=0 \mid x;\theta) + P(y=1 \mid x;\theta) = 1

Decision boundary: the decision boundary of logistic regression is the surface that splits the space into the two regions $y=0$ and $y=1$.

The decision boundary is determined by the hypothesis. To classify, we use the output of $h_\theta$ to decide whether the result is 0 or 1, taking 0.5 as the cutoff:
\begin{array}{rcl} h_\theta(x) \ge 0.5 &\Rightarrow& y=1\\ h_\theta(x) < 0.5 &\Rightarrow& y=0 \end{array}
By the properties of the sigmoid function, this is equivalent to:
\begin{array}{rcl} \theta^Tx \ge 0 &\Rightarrow& y=1\\ \theta^Tx < 0 &\Rightarrow& y=0\\ \end{array}
So for a given $\theta$, e.g. $\theta=\left[\begin{array}{c}5\\-1\\0\end{array} \right]$, we have $y=1$ if and only if $5+(-1)x_1+0x_2 \ge 0$, and the decision boundary is $x_1=5$.

[figure: decision boundary $x_1 = 5$]
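A tiny sketch of this prediction rule (the sample points x_a and x_b below are made up for illustration; theta is the example above):

% Predict with theta = [5; -1; 0]: y = 1 iff theta' * x >= 0, i.e. x1 <= 5.
theta = [5; -1; 0];
x_a = [1; 3; 7];                      % x1 = 3 <= 5
x_b = [1; 8; 2];                      % x1 = 8 >  5
pred_a = double(theta' * x_a >= 0)    % prints 1
pred_b = double(theta' * x_b >= 0)    % prints 0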

The decision boundary can also be much more complex, as in the following case:

[figure]

Logistic Regression Model
\begin{array}{rcl} \textrm{Training set} &:& \{(x^{(1)},y^{(1)}), (x^{(2)},y^{(2)}), \ldots, (x^{(m)},y^{(m)})\}\\ \\ \textrm{m examples} &:& x \in \left[\begin{array}{c} x_0\\x_1\\ \vdots \\ x_n \end{array}\right] \textrm{where }(x_0=1) ,\quad y \in \{0,1\}\\ \\ \textrm{Hypothesis} &:& h_\theta(x)=\frac{1}{1+e^{-\theta^Tx}}\\\\ \textrm{Cost Function} &:& J(\theta)=-\frac{1}{m}\sum_{i=1}^m\Bigg[y^{(i)}log\Big(h_\theta(x^{(i)})\Big)+(1-y^{(i)})log\Big(1-h_\theta(x^{(i)})\Big)\Bigg] \end{array}
Vectorized:
\begin{array}{l} h=g(X\theta)\\ J(\theta)=\frac{1}{m}\cdot\big(-y^T log(h) -(1-y)^T log(1-h)\big) \end{array}
Gradient descent:
\begin{array}{l} Repeat \quad \{\\ \qquad \theta_j:=\theta_j-\frac{\alpha}{m}\sum_{i=1}^m(h_\theta(x^{(i)})-y^{(i)})\cdot x_j^{(i)}\\ \} \end{array}
Vectorized:
\theta:=\theta-\frac{\alpha}{m}X^T(g(X\theta)-\overrightarrow{y})
👉 Code implementation (using an advanced optimization routine):

  1. Provide $J(\theta)$ and $\frac{\partial}{\partial\theta_j}J(\theta)$:
function [J, grad] = costFunction(theta, X, y)
%COSTFUNCTION Compute cost and gradient for logistic regression
%   J = COSTFUNCTION(theta, X, y) computes the cost of using theta as the
%   parameter for logistic regression and the gradient of the cost
%   w.r.t. to the parameters.

m = length(y); % number of training examples

J = 0;
grad = zeros(size(theta));

h = sigmoid(X*theta);

J = 1/m * (-y'*log(h) - (1-y)'*log(1-h));

grad = 1/m * X'*(h-y);

end
  2. Call an advanced optimization function to solve the minimization:
options = optimset('GradObj', 'on', 'MaxIter', 100);
initialTheta = zeros(2, 1);

[optTheta, functionVal, exitFlag] = fminunc(@(t)(costFunction(t, X, y)), initialTheta, options);
Multi-class Classification

We use a series of binary (logistic) classifiers, one per class, to do multi-class classification (one-vs-all):
\begin{array}{l} y \in \{0,1,\cdots,n\}\\\\ h_\theta^{(0)}(x)=P(y=0|x;\theta)\\ h_\theta^{(1)}(x)=P(y=1|x;\theta)\\ \vdots\\ h_\theta^{(n)}(x)=P(y=n|x;\theta)\\\\ \textrm{prediction} = \mathop{\textrm{argmax}}\limits_{i}\big(h_\theta^{(i)}(x)\big) \end{array}

[figure]
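A minimal sketch of the one-vs-all prediction step (the matrix all_theta, whose j-th row holds the trained parameters of the classifier for class j, is an assumed input; it is not part of the code shown in this post):

function p = predictOneVsAll(all_theta, X)
% all_theta : num_labels x (n+1) matrix, one row of parameters per class
% X         : m x n data matrix, without the bias column
m = size(X, 1);
h = sigmoid([ones(m, 1) X] * all_theta');   % m x num_labels matrix of probabilities
[~, p] = max(h, [], 2);                     % pick the most probable class for each example
end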

Note: Overfitting

[figure]

An overfit model fits the training data very well but predicts poorly on new, unseen examples.

Ways to address overfitting:

  1. Reduce the number of features (e.g., via PCA)

  2. Regularization: add a penalty on the weights $\theta$ to the cost function:

    \mathop{min}\limits_{\theta} \dfrac{1}{2m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2

    Note that $\theta_0$ multiplies the constant term we added, so it should not be regularized.

    Code implementation:

   function [J, grad] = lrCostFunction(theta, X, y, lambda)
   %LRCOSTFUNCTION Compute cost and gradient for logistic regression with 
   %regularization
   %   J = LRCOSTFUNCTION(theta, X, y, lambda) computes the cost of using
   %   theta as the parameter for regularized logistic regression and the
   %   gradient of the cost w.r.t. to the parameters. 
   
   m = length(y); % number of training examples
   
   J = 0;
   grad = zeros(size(theta));
   
   % Unregularized cost function & gradient for logistic regression
   h = sigmoid(X * theta);
   J = 1/m * (-y'*log(h) - (1-y)'*log(1-h));
   grad = 1/m * X'*(h-y);
   
   % Regularize
   temp = theta;
   temp(1) = 0;
   J = J + lambda/(2*m) * sum(temp.^2);
   grad = grad + lambda/m * temp;
   
   grad = grad(:);
   
   end

Neural Networks

\left[\begin{array}{c}x_0 \\ x_1 \\ x_2 \\ x_3\end{array}\right] \rightarrow \left[\begin{array}{c}a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \\ \end{array}\right] \rightarrow \left[\begin{array}{c}a_1^{(3)} \\ a_2^{(3)} \\ a_3^{(3)} \\ \end{array}\right] \rightarrow h_\theta(x)

The first layer is the data and is called the input layer; it can be viewed as $a^{(1)}$. In the middle are one or more hidden layers. The last layer produces the hypothesis and is called the output layer.

z^{(j)} = \Theta^{(j-1)}a^{(j-1)}

a^{(j)} = g(z^{(j)})

If the output layer is layer $c+1$, then:

h_\Theta(x)=a^{(c+1)}=g(z^{(c+1)})

For example, with a single layer of weights we can build networks that compute logical functions:
\begin{array}{lll}AND: &\Theta^{(1)} &=\begin{bmatrix}-30 & 20 & 20\end{bmatrix} \\ NOR: &\Theta^{(1)} &= \begin{bmatrix}10 & -20 & -20\end{bmatrix} \\ OR: &\Theta^{(1)} &= \begin{bmatrix}-10 & 20 & 20\end{bmatrix} \end{array}
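As a quick sanity check (a sketch, not from the assignments), the AND network can be verified over the whole truth table with one step of forward propagation, using the sigmoid function defined earlier:

Theta_and = [-30 20 20];                       % AND weights from the table above
inputs = [0 0; 0 1; 1 0; 1 1];                 % all (x1, x2) combinations
a = sigmoid([ones(4, 1) inputs] * Theta_and'); % forward propagation: a = g(Theta * [1; x1; x2])
disp(round(a'));                               % prints 0 0 0 1, i.e. x1 AND x2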

Multi-class Classification

y^{(i)}=\begin{bmatrix}1\\0\\0\\0\end{bmatrix},\begin{bmatrix}0\\1\\0\\0\end{bmatrix},\begin{bmatrix}0\\0\\1\\0\end{bmatrix},\begin{bmatrix}0\\0\\0\\1\end{bmatrix}

\left[\begin{array}{c}x_0 \\ x_1 \\ x_2 \\ x_3\end{array}\right] \rightarrow \left[\begin{array}{c}a_1^{(2)} \\ a_2^{(2)} \\ a_3^{(2)} \\ ... \end{array}\right] \rightarrow \left[\begin{array}{c}a_1^{(3)} \\ a_2^{(3)} \\ a_3^{(3)} \\ ... \end{array}\right] \rightarrow \cdots \rightarrow \left[\begin{array}{c}h_\Theta(x)_1 \\ h_\Theta(x)_2 \\ h_\Theta(x)_3 \\ h_\Theta(x)_4 \end{array}\right]

Fitting a Neural Network

Notation:
  • $L$ — total number of layers in the network
  • $s_l$ — number of units in layer $l$ (not counting the bias unit $a_0$)
  • $K$ — number of output units

Cost function:
J(\Theta)=-\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)}log\Big(\big(h_\Theta(x^{(i)})\big)_k\Big)+ (1-y_k^{(i)})log\Big(1-\big(h_\Theta(x^{(i)})\big)_k\Big) \Big]+ \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\Big(\Theta_{j,i}^{(l)}\Big)^2
Backpropagation algorithm:
\begin{array}{lll} \textrm{Given training set }{(x^{(1)},y^{(1)}),...,(x^{(m)},y^{(m)})}\\ \textrm{Set }\Delta_{i,j}^{(l)}:=0\textrm{ for each } l,i,j \textrm{ (a matrix full of zeros)}\\ \textrm{For training example } t=1 \textrm{ to } m:\\ \qquad a^{(1)}:= x^{(t)}\\ \qquad \textrm{Compute } a^{(l)} \textrm{ for } l=2,3,\cdots,L \textrm{ by forward propagation}\\ \qquad \textrm{Use } y^{(t)} \textrm{ to compute } \delta^{(L)}=a^{(L)}-y^{(t)}\\ \qquad \textrm{Compute } \delta^{(l)}=\big((\Theta^{(l)})^T\delta^{(l+1)}\big).*a^{(l)}.*(1-a^{(l)}) \textrm{ for } \delta^{(L-1)},\delta^{(L-2)},...,\delta^{(2)}\\ \qquad \Delta^{(l)}:=\Delta^{(l)}+\delta^{(l+1)}(a^{(l)})^T\\ \textrm{End For}\\ D_{i,j}^{(l)}:=\frac{1}{m}\Delta_{i,j}^{(l)}\textrm{ if } j=0\\ D_{i,j}^{(l)}:=\frac{1}{m}\big(\Delta_{i,j}^{(l)}+\lambda\Theta_{i,j}^{(l)}\big) \textrm{ if } j\neq 0 \\ \textrm{Get } \frac{\partial}{\partial\Theta_{i,j}^{(l)}}J(\Theta)=D_{i,j}^{(l)} \end{array}
Note: in the above, .* denotes element-wise multiplication, as in Matlab/Octave.
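The element-wise derivative $g'(z)=g(z)\big(1-g(z)\big)$ that appears in the $\delta^{(l)}$ computation is used below under the name sigmoidGradient; a minimal version (matching the comment inside the cost-function code later in this section) is:

function g = sigmoidGradient(z)
%SIGMOIDGRADIENT Gradient of the sigmoid function, evaluated element-wise at z
g = sigmoid(z) .* (1 - sigmoid(z));
end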

Using Backpropagation

First, a few techniques that are involved:

  • Unrolling parameters: to use the optimization functions, we unroll all the $\Theta$ matrices and concatenate them into one long vector:
thetaVector = [ Theta1(:); Theta2(:); Theta3(:) ];
deltaVector = [ D1(:); D2(:); D3(:) ];

After optimization, reshape the result back into the original matrices:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
  • Gradient checking: using $\frac{\partial}{\partial\Theta_j}J(\Theta) \approx \frac{J(\Theta_1,...,\Theta_j+\epsilon,...,\Theta_n)-J(\Theta_1,...,\Theta_j-\epsilon,...,\Theta_n)}{2\epsilon}$ with a small $\epsilon$ such as $\epsilon=10^{-4}$, we can check whether the gradients computed by backpropagation are correct (if they are, gradApprox ≈ deltaVector). Code implementation:
epsilon = 1e-4;
for i = 1 : n
	thetaPlus = theta;
	thetaPlus(i) += epsilon;
	thetaMinus = theta;
	thetaMinus(i) -= epsilon;
	gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2*epsilon);
end
  • Random initialization: at the start, initialize each $\Theta_{ij}^{(l)}$ to a random value, keeping the random values within some range $[-\epsilon,\epsilon]$ (this $\epsilon$ is unrelated to the one used in gradient checking). Code implementation:
# If the dimensions of Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11.

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

Combining these techniques with backpropagation, we get a procedure for training a neural network:

  1. Randomly initialize the weights
  2. Forward propagate to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Compute the cost function
  4. Compute the partial derivatives with backpropagation
  5. Use gradient checking to verify the backpropagation gradients, then turn gradient checking off
  6. Use gradient descent or an advanced optimization function to obtain $\Theta$

👉 Code implementation:

  1. Random initialization:
function W = randInitializeWeights(L_in, L_out)
%RANDINITIALIZEWEIGHTS Randomly initialize the weights of a layer with L_in
%incoming connections and L_out outgoing connections
%   W = RANDINITIALIZEWEIGHTS(L_in, L_out) randomly initializes the weights 
%   of a layer with L_in incoming connections and L_out outgoing 
%   connections. 
%
%   Note that W should be set to a matrix of size(L_out, 1 + L_in) as
%   the first column of W handles the "bias" terms
%

W = zeros(L_out, 1 + L_in);

% epsilon_init = 0.12
epsilon_init = sqrt(6 / (L_in + L_out));
W = rand(L_out, 1 + L_in) * (2 * epsilon_init) - epsilon_init;

end
  2. Compute the cost and gradient:
function [J grad] = nnCostFunction(nn_params, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, ...
                                   X, y, lambda)
%NNCOSTFUNCTION Implements the neural network cost function for a two layer
%neural network which performs classification
%   [J grad] = NNCOSTFUNCTON(nn_params, hidden_layer_size, num_labels, ...
%   X, y, lambda) computes the cost and gradient of the neural network. The
%   parameters for the neural network are "unrolled" into the vector
%   nn_params and need to be converted back into the weight matrices. 
% 
%   The returned parameter grad should be a "unrolled" vector of the
%   partial derivatives of the neural network.
%

% Reshape nn_params back into the parameters Theta_1 and Theta_2, the weight matrices
% for our 2 layer neural network
Theta_1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta_2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

% Setup some useful variables
m = size(X, 1);
K = num_labels;

J = 0;
Theta_1_grad = zeros(size(Theta_1));
Theta_2_grad = zeros(size(Theta_2));

% y(5000x1) -> Y(5000x10)
Y = zeros(m, K);
for i = 1 : m
    Y(i, y(i)) = 1;
end

% Feedforward
a_1 = X;
a_1_bias = [ones(m, 1), a_1];

z_2 = a_1_bias * Theta_1';
a_2 = sigmoid(z_2);
a_2_bias = [ones(m, 1), a_2];

z_3 = a_2_bias * Theta_2';
a_3 = sigmoid(z_3);
h = a_3;

% Cost Function
% for i = 1 : K
%     yK = Y(:, i);
%     hK = h(:, i);
%     J += 1/m * (-yK'*log(hK) - (1-yK)'*log(1-hK));

% J can be computed more elegantly element-wise:
J = 1/m * sum(sum((-Y.*log(h) - (1-Y).*log(1-h))));

% Regularize
J = J + lambda/(2*m) * (sum(sum(Theta_1(:, 2:end).^2)) + sum(sum(Theta_2(:, 2:end).^2)));

% Backpropagation

delta_3 = a_3 - Y;
delta_2 = (delta_3 * Theta_2) .* sigmoidGradient([ones(m, 1), z_2]);
% sigmoidGradient: return g = sigmoid(z) .* (1 - sigmoid(z));
delta_2 = delta_2(:, 2:end);

Delta_1 = delta_2' * a_1_bias;
Delta_2 = delta_3' * a_2_bias;

Theta_1_grad = Delta_1 ./ m + lambda/m * [zeros(size(Theta_1, 1), 1), Theta_1(:, 2:end)];
Theta_2_grad = Delta_2 ./ m + lambda/m * [zeros(size(Theta_2, 1), 1), Theta_2(:, 2:end)];

% Unroll gradients
grad = [Theta_1_grad(:) ; Theta_2_grad(:)];

end
  3. Prediction:
function p = predict(Theta1, Theta2, X)
%PREDICT Predict the label of an input given a trained neural network
%   p = PREDICT(Theta1, Theta2, X) outputs the predicted label of X given the
%   trained weights of a neural network (Theta1, Theta2)

% Useful values
m = size(X, 1);
num_labels = size(Theta2, 1);

p = zeros(size(X, 1), 1);

h1 = sigmoid([ones(m, 1) X] * Theta1');
h2 = sigmoid([ones(m, 1) h1] * Theta2');
[dummy, p] = max(h2, [], 2);

end
  4. Driver script:
input_layer_size  = 400;  % 20x20 Input Images of Digits
hidden_layer_size = 25;   % 25 hidden units
num_labels = 10;          % 10 labels, from 1 to 10   
                          % (note that we have mapped "0" to label 10)

fprintf('\nInitializing Neural Network Parameters ...\n')

load('Xy.mat');

% Randomly initialize the weights and unroll them (initial_nn_params is needed by fmincg below)
initial_Theta1 = randInitializeWeights(input_layer_size, hidden_layer_size);
initial_Theta2 = randInitializeWeights(hidden_layer_size, num_labels);
initial_nn_params = [initial_Theta1(:) ; initial_Theta2(:)];

fprintf('\nTraining Neural Network... \n')

%  Change MaxIter to a larger value to see how more training helps.
options = optimset('MaxIter', 1000);

lambda = 1;

% Create "short hand" for the cost function to be minimized
costFunction = @(p) nnCostFunction(p, ...
                                   input_layer_size, ...
                                   hidden_layer_size, ...
                                   num_labels, X, y, lambda);

% Now, costFunction is a function that takes in only one argument (the
% neural network parameters)
[nn_params, cost] = fmincg(costFunction, initial_nn_params, options);

% Obtain Theta1 and Theta2 back from nn_params
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, (input_layer_size + 1));

Theta2 = reshape(nn_params((1 + (hidden_layer_size * (input_layer_size + 1))):end), ...
                 num_labels, (hidden_layer_size + 1));

pred = predict(Theta1, Theta2, X);

fprintf('\nTraining Set Accuracy: %f\n', mean(double(pred == y)) * 100);

Support Vector Machines

Optimization objective:

\min_\theta \; C\sum_{i=1}^m \Big[ y^{(i)} \mathop{\textrm{cost}_1}(\theta^Tx^{(i)}) + (1 - y^{(i)}) \mathop{\textrm{cost}_0}(\theta^Tx^{(i)})\Big] + \frac{1}{2}\sum_{j=1}^n \theta_j^2

When $C$ is large, this objective is approximately equivalent to:

\begin{array}{l} \min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2\\ s.t. \quad \begin{array}{ll} \theta^Tx^{(i)} \ge 1 & \textrm{if } y^{(i)}=1\\ \theta^Tx^{(i)} \le -1 & \textrm{if } y^{(i)}=0 \end{array} \end{array}

From Euclidean geometry:

\begin{array}{ccl} ||u|| &=& \textrm{length of vector } u= \sqrt{u_1^2+u_2^2}\\ p &=& \textrm{length of projection of } v \textrm{ onto } u \textrm{ (signed)} \\ u^Tv &=& p \cdot ||u|| \end{array}

so the objective can be rewritten as:

\begin{array}{l} \min_\theta \frac{1}{2}\sum_{j=1}^{n}\theta_j^2 =\frac{1}{2}\Big(\sqrt{\sum_{j=1}^n\theta_j^2}\Big)^2 =\frac{1}{2}||\theta||^2 \\ s.t. \quad \begin{array}{ll} p^{(i)}\cdot ||\theta|| \ge 1 & \textrm{if } y^{(i)}=1\\ p^{(i)}\cdot ||\theta|| \le -1 & \textrm{if } y^{(i)}=0 \end{array}\\ \textrm{where } p^{(i)} \textrm{ is the projection of } x^{(i)} \textrm{ onto the vector } \theta. \end{array}

The SVM will therefore choose the decision boundary with the largest margin:

[figure: large-margin decision boundary]

Kernels

Consider a classification problem like this:

[figure]

We could fit a polynomial decision boundary, e.g., predict $y=1$ when $\theta_0+\theta_1x_1+\theta_2x_2+\theta_3x_1x_2+\theta_4x_1^2+\theta_5x_2^2+\cdots\ge 0$.

Having that many polynomial terms is unwieldy, so consider instead:
\begin{array}{lcl} \textrm{Predict } y=1 &\textrm{if}&\theta_0+\theta_1f_1+\theta_2f_2+\theta_3f_3+\cdots\ge 0 \end{array}
where so far $f_1=x_1,f_2=x_2,f_3=x_1x_2,f_4=x_1^2,...$

We have replaced the polynomial terms with features $f_i$, avoiding explicit high-order terms. So how should the $f_i$ be chosen? The rough idea is as follows:

[figure]

For simplicity, suppose we only have $x_0,x_1,x_2$ and only plan to construct $f_1,f_2,f_3$. Ignoring $x_0$ (the bias term), we pick 3 points in the $x_1$–$x_2$ plane, denote them $l^{(1)},l^{(2)},l^{(3)}$, and call them landmarks. For any $x$, we obtain the $f_i$ by measuring how close $x$ is to each landmark:
f_i = \mathop{\textrm{similarity}}(x,l^{(i)}) = \exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}) = \exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2})
The specific similarity function used here is called a kernel, and there are many kinds of kernels. The one above, $\exp(-\frac{\sum_{j=1}^n(x_j-l_j^{(i)})^2}{2\sigma^2})$, is the very common Gaussian kernel; its code implementation is given later in this section.

With this construction, for a given $x$, each landmark $l^{(i)}$ yields an $f_i$ satisfying:

  1. when $x$ is close to $l^{(i)}$:

f_i \approx \lim_{x\to l^{(i)}}\exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}) = \exp(-\frac{0^2}{2\sigma^2}) = 1

  2. when $x$ is far from $l^{(i)}$:

f_i \approx \lim_{||x-l^{(i)}||\to +\infty}\exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2}) = \exp(-\frac{\infty^2}{2\sigma^2}) = 0

The choice of $\sigma$ controls how fast $f_i$ falls off as $x$ moves away from $l^{(i)}$: the larger $\sigma^2$ is, the more slowly $f_i$ decreases:

[figure]

Using kernels, we can then make predictions like this:

[figure]

We predict 1 if and only if the given $x$ is close to $l^{(1)}$ or $l^{(2)}$, and predict 0 otherwise.

Using Kernels with SVMs

  1. Given a training set $(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})$

  2. Choose the landmarks: $l^{(i)}=x^{(i)} \quad \textrm{for }i=1,2,\cdots,m$

  3. For an example $x$, compute the kernel features: $f_i=\mathop{\textrm{similarity}}(x,l^{(i)}) \quad \textrm{for }i=1,2,\cdots,m$

  4. Let $f=[f_0,f_1,f_2,\cdots,f_m]^T$, where $f_0 \equiv 1$

Prediction:

  • Given $x$, compute $f\in\R^{m+1}$
  • Predict $y=1$ if $\theta^Tf=\theta_0f_0+\theta_1f_1+\cdots\ge 0$

Training:
\min_\theta C\sum_{i=1}^m \large[ y^{(i)} \mathop{\textrm{cost}_1}(\theta^Tf^{(i)})+ (1 - y^{(i)})\ \mathop{\textrm{cost}_0}(\theta^Tf^{(i)})\large]+ \frac{1}{2}\sum_{j=1}^n \theta_j^2
Implementation:

We can use libraries such as liblinear or libsvm to obtain the SVM parameters $\theta$. To use these libraries we generally need to:

  • Choose the parameter $C$
  • Choose a kernel function
    • No kernel (i.e., a linear kernel, essentially logistic regression: predict $y=1$ if $\theta^Tx\ge0$); suited to large n and small m (to avoid overfitting)
    • Gaussian kernel ($f_i=\exp(-\frac{||x-l^{(i)}||^2}{2\sigma^2})\textrm{ where } l^{(i)}=x^{(i)} \textrm{ for } i=1,\cdots,m$); suited to small n and large m (it can fit more complex non-linear boundaries)
  • Provide the kernel function (Gaussian kernel as an example):

\begin{array}{l} \textrm{function f = kernel(x1, x2)}\\ \qquad \textrm{f} = \exp(-\frac{||\textrm{x1}-\textrm{x2}||^2}{2\sigma^2})\\ \textrm{return} \end{array}

Note: always perform feature scaling before using the Gaussian kernel!

👉 Code implementation (no kernel / linear kernel):

function sim = linearKernel(x1, x2)
%LINEARKERNEL returns a linear kernel between x1 and x2
%   sim = linearKernel(x1, x2) returns a linear kernel between x1 and x2
%   and returns the value in sim

% Ensure that x1 and x2 are column vectors
x1 = x1(:); x2 = x2(:);

% Compute the kernel
sim = x1' * x2;  % dot product

end

👉 Code implementation (Gaussian kernel):

function sim = gaussianKernel(x1, x2, sigma)
%RBFKERNEL returns a radial basis function kernel between x1 and x2
%   sim = gaussianKernel(x1, x2) returns a gaussian kernel between x1 and x2
%   and returns the value in sim

% Ensure that x1 and x2 are column vectors
x1 = x1(:); x2 = x2(:);

sim = 0;

sim = exp(-sum((x1 - x2).^2) / (2 * sigma ^ 2));

end

👉 SVM example:

% Load X, y, Xtest and ytest
load('data.mat');

% SVM Parameters
C = 1;         % C = 1 ~ 100 is fine
sigma = 0.1;    % sigma = 0.03 ~ 0.1 gives somewhat good boundary, less is better

% We set the tolerance and max_passes lower here so that the code will run
% faster. However, in practice, you will want to run the training to
% convergence.
model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma)); 
p = svmPredict(model, Xtest);

fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);
Multi-class Classification

[figure]

Logistic Regression vs. Neural Networks vs. SVM

$n$ = number of features ($x\in\R^{n+1}$)

$m$ = number of training examples

  • n large relative to m (e.g., $n=10{,}000$, $m=10 \sim 1000$)
    • use logistic regression, or SVM without a kernel
  • n small, m intermediate (e.g., $n=1\sim1000$, $m=50{,}000$)
    • use SVM with a Gaussian kernel
  • n small, m large
    • create/add features, then use logistic regression or SVM without a kernel

A neural network can usually handle any of these cases, but it may be comparatively slow.

Unsupervised Learning

In unsupervised learning, only the x data is given; there are no labels y.

[figure]

K-Means Clustering

Automatically partition a pile of items into K clusters.

Input:

  • $K$: the number of clusters
  • $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$: the training set

Output:

  • $K$ clusters

The K-Means algorithm:
\begin{array}{l} \textrm{Randomly initialize } K \textrm{ cluster centroids } \mu_1, \mu_2,\ldots,\mu_K \in \R^n\\ \textrm{Repeat } \{\\ \qquad \textrm{for } i=1 \textrm{ to } m: \qquad\textrm{// Cluster assignment step}\\ \qquad\qquad c^{(i)} := k \ \textrm{ s.t. } \min_k||x^{(i)}-\mu_k||^2 \\ \qquad \textrm{for } k=1 \textrm{ to } K: \qquad\textrm{// Move centroid step}\\ \qquad\qquad \mu_k:= \textrm{average (mean) of points assigned to cluster } k\\ \} \end{array}

Cost function:
J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)=\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}-\mu_{c^{(i)}}||^2
Optimization objective:
\min_{c^{(1)},\cdots,c^{(m)},\,\mu_1,\cdots,\mu_K} J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)
An algorithm (with repeated random initialization) that finds a good, though not necessarily optimal, solution:
\begin{array}{l} \textrm{For } i=1 \textrm{ to } 100 \textrm{ (or 50 to 1000) } \{\\ \qquad\textrm{Randomly initialize K-means.}\\ \qquad\textrm{Run K-means. Get } c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K\\ \qquad\textrm{Compute the cost function (distortion): } J(c^{(1)},\cdots,c^{(m)},\mu_1,\cdots,\mu_K)\\ \}\\ \textrm{Pick the clustering that gave the lowest } J. \end{array}
Choosing $K$:

  1. It is often determined by the needs of the actual downstream problem;
  2. Pick the "elbow" of the cost-vs-K curve: [figure]
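A sketch of the elbow method, assuming the runkMeans and findClosestCentroids functions given below, with K random examples as the initial centroids:

% Plot the distortion J as a function of K and look for the "elbow".
Ks = 1:8;
J = zeros(size(Ks));
for t = 1:numel(Ks)
    K = Ks(t);
    initial_centroids = X(randperm(size(X, 1), K), :);   % K random examples as centroids
    [centroids, ~] = runkMeans(X, initial_centroids, 10);
    idx = findClosestCentroids(X, centroids);             % final assignments
    J(t) = mean(sum((X - centroids(idx, :)).^2, 2));      % distortion for this K
end
plot(Ks, J, '-o'); xlabel('K'); ylabel('cost J');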

👉 Code implementation:

  1. Find the closest centroids:
function idx = findClosestCentroids(X, centroids)
%FINDCLOSESTCENTROIDS computes the centroid memberships for every example
%   idx = FINDCLOSESTCENTROIDS (X, centroids) returns the closest centroids
%   in idx for a dataset X where each row is a single example. idx = m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
%

% Set K
K = size(centroids, 1);

idx = zeros(size(X,1), 1);

for i = 1 : size(X, 1)
    min_j = 0;
    min_l = Inf;
    for j = 1 : size(centroids, 1)
        l = sum((X(i, :) - centroids(j, :)) .^ 2);
        if l <= min_l
            min_j = j;
            min_l = l;
        end
    end
    idx(i) = min_j;
end

end
  2. Compute the centroids:
function centroids = computeCentroids(X, idx, K)
%COMPUTECENTROIDS returns the new centroids by computing the means of the 
%data points assigned to each centroid.
%   centroids = COMPUTECENTROIDS(X, idx, K) returns the new centroids by 
%   computing the means of the data points assigned to each centroid. It is
%   given a dataset X where each row is a single data point, a vector
%   idx of centroid assignments (i.e. each entry in range [1..K]) for each
%   example, and K, the number of centroids. You should return a matrix
%   centroids, where each row of centroids is the mean of the data points
%   assigned to it.
%

% Useful variables
[m n] = size(X);

centroids = zeros(K, n);

for i = 1 : K
    ck = find(idx == i);
    centroids(i, :) = mean(X(ck, :), 1);   % mean of the points assigned to centroid i
end

end
  3. Run K-Means:
function [centroids, idx] = runkMeans(X, initial_centroids, ...
                                      max_iters, plot_progress)
%RUNKMEANS runs the K-Means algorithm on data matrix X, where each row of X
%is a single example
%   [centroids, idx] = RUNKMEANS(X, initial_centroids, max_iters, ...
%   plot_progress) runs the K-Means algorithm on data matrix X, where each 
%   row of X is a single example. It uses initial_centroids used as the
%   initial centroids. max_iters specifies the total number of interactions 
%   of K-Means to execute. plot_progress is a true/false flag that 
%   indicates if the function should also plot its progress as the 
%   learning happens. This is set to false by default. runkMeans returns 
%   centroids, a Kxn matrix of the computed centroids and idx, a m x 1 
%   vector of centroid assignments (i.e. each entry in range [1..K])
% Using plot_progress requires an extra plotting function (plotProgresskMeans), not shown here.
%

% Set default value for plot progress
if ~exist('plot_progress', 'var') || isempty(plot_progress)
    plot_progress = false;
end

% Plot the data if we are plotting progress
if plot_progress
    figure;
    hold on;
end

% Initialize values
[m n] = size(X);
K = size(initial_centroids, 1);
centroids = initial_centroids;
previous_centroids = centroids;
idx = zeros(m, 1);

% Run K-Means
for i=1:max_iters
    
    % Output progress
    fprintf('K-Means iteration %d/%d...\n', i, max_iters);
    if exist('OCTAVE_VERSION')
        fflush(stdout);
    end
    
    % For each example in X, assign it to the closest centroid
    idx = findClosestCentroids(X, centroids);
    
    % Optionally, plot progress here
    if plot_progress
        plotProgresskMeans(X, centroids, previous_centroids, idx, K, i);
        previous_centroids = centroids;
        fprintf('Press enter to continue.\n');
        input("...");
    end
    
    % Given the memberships, compute new centroids
    centroids = computeCentroids(X, idx, K);
end

% Hold off if we are plotting progress
if plot_progress
    hold off;
end

end
  4. Driver script:
% Load an example dataset
load('data.mat');

% Settings for running K-Means
K = 3;
max_iters = 10;

% For consistency, here we set centroids to specific values
% but in practice you want to generate them automatically, such as by
% settings them to be random examples (as can be seen in
% kMeansInitCentroids).
initial_centroids = [3 3; 6 2; 8 5];

% Run K-Means algorithm. The 'true' at the end tells our function to plot
% the progress of K-Means
[centroids, idx] = runkMeans(X, initial_centroids, max_iters, true);
fprintf('\nK-Means Done.\n\n');

PCA (Dimensionality Reduction)

Principal component analysis: project n-dimensional data down to k dimensions (k <= n), dropping the less important components.

The PCA Algorithm

  1. Data preprocessing

    Training set: $x^{(1)},x^{(2)},\cdots,x^{(m)}$

    Preprocessing (feature scaling & mean normalization):

    • $\mu_j=\frac{1}{m}\sum_{i=1}^m x_j^{(i)},\qquad s_j=\textrm{standard deviation of feature }j$

    • Replace each $x_j^{(i)}$ with $\frac{x_j^{(i)}-\mu_j}{s_j}$

  2. Dimensionality reduction

    1. Compute the covariance matrix $\Sigma$ (written as a capital Sigma; do not confuse it with a summation sign):
      \Sigma = \frac{1}{m}\sum_{i=1}^m(x^{(i)})(x^{(i)})^T
    2. Compute the eigenvectors of $\Sigma$ (in practice via singular value decomposition): [U, S, V] = svd(Sigma);
    3. From the svd in the previous step:
      U = \left[\begin{array}{cccc} | & | & & |\\ u^{(1)} & u^{(2)} & \cdots & u^{(n)}\\ | & | & & | \end{array}\right] \in \R^{n\times n} \Rightarrow U_{reduce}=\left[\begin{array}{cccc} | & | & & |\\ u^{(1)} & u^{(2)} & \cdots & u^{(k)}\\ | & | & & | \end{array}\right]
    4. Do the projection $x\in\R^n\to z\in\R^k$:
      z = U_{reduce}^Tx =\left[\begin{array}{ccc} -- & (u^{(1)})^T & --\\ & \vdots & \\ -- & (u^{(k)})^T & --\\ \end{array}\right]x

👉 Code implementation:

% do feature scaling & mean normalization

Sigma = 1/m * X' * X;
[U, S, V] = svd(Sigma);

Ureduce = U(:, 1:K);
Z = X * Ureduce;

Data reconstruction: map the data back to the original dimension ($z\in\R^k \to x_{approx}\in\R^n$):
x_{approx}=U_{reduce}z
In general $x \neq x_{approx}$; we can only hope that $x_{approx}$ is close to $x$.

[figure]
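A minimal sketch of the reconstruction step, reusing the Ureduce and Z computed in the projection code above:

% Recover an n-dimensional approximation of the (normalized) data from Z.
X_approx = Z * Ureduce';                                       % m x n
loss = mean(sum((X - X_approx).^2, 2)) / mean(sum(X.^2, 2))    % fraction of variance not retained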

Choosing $k$ (the number of principal components)

In general, choose the smallest $k$ such that:
\frac{\frac{1}{m}\sum_{i=1}^m||x^{(i)}-x_{approx}^{(i)}||^2}{\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}||^2}\le0.01
Algorithm:
\begin{array}{l} \textrm{Try PCA with } k=1,\cdots,n:\\ \quad \textrm{Compute } U_{reduce},z^{(1)},\cdots,z^{(m)},x_{approx}^{(1)},\cdots,x_{approx}^{(m)}\\ \quad \textrm{Check if } \frac{\frac{1}{m}\sum_{i=1}^m||x^{(i)}-x_{approx}^{(i)}||^2}{\frac{1}{m}\sum_{i=1}^{m}||x^{(i)}||^2}\le0.01 \end{array}
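In practice this check can be done much more cheaply using the matrix S returned by svd(Sigma), since the ratio above equals $1-\frac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}}$. A sketch, reusing Sigma from the code above:

[U, S, V] = svd(Sigma);
s = diag(S);                        % singular values S_11, ..., S_nn
retained = cumsum(s) / sum(s);      % variance retained for k = 1, ..., n
k = find(retained >= 0.99, 1)       % smallest k keeping at least 99% of the variance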

Anomaly Detection

Find the examples in a dataset that are anomalous relative to the others.

Problem statement: given a dataset $\{x^{(1)},x^{(2)},\cdots,x^{(m)}\}$, train a model and decide whether a new example $x_{test}$ is anomalous.

To solve this, we build a model of the probability $p(x)$ and choose a threshold $\epsilon$ such that:
\begin{array}{l} p(x_{test})<\epsilon \Rightarrow \textrm{anomaly}\\ p(x_{test})\ge\epsilon \Rightarrow \textrm{OK} \end{array}
The problem thus becomes density estimation, which we usually do with the Gaussian distribution.

The Gaussian Distribution

If $x$ follows a Gaussian distribution, $x \sim \mathcal{N}(\mu,\sigma^2)$,

then the probability density of $x$ is:
p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
where the parameters $\mu$ and $\sigma$ are estimated as follows (this is the convention commonly used in machine learning; it is not necessarily the same as in statistics):
\mu=\frac{1}{m}\sum_{i=1}^{m}x^{(i)}

\sigma^2=\frac{1}{m}\sum_{i=1}^{m}\left(x^{(i)}-\mu\right)^2

👉 Code implementation:

function [mu sigma2] = estimateGaussian(X)
%ESTIMATEGAUSSIAN This function estimates the parameters of a 
%Gaussian distribution using the data in X
%   [mu sigma2] = estimateGaussian(X), 
%   The input X is the dataset with each n-dimensional data point in one row
%   The output is an n-dimensional vector mu, the mean of the data set
%   and the variances sigma^2, an n x 1 vector
% 

% Useful variables
[m, n] = size(X);

mu = zeros(n, 1);
sigma2 = zeros(n, 1);

mu = mean(X);
sigma2 = var(X) * (m - 1) / m;

end

With this, we obtain the anomaly detection algorithm:

Anomaly Detection Algorithm
  1. Choose features $x_i$ that you think may be indicative of anomalous examples
  2. Fit the parameters $\mu_1,\cdots,\mu_n,\sigma_1^2,\cdots,\sigma_n^2$:

\mu_j=\frac{1}{m}\sum_{i=1}^{m}x_j^{(i)}

\sigma_j^2=\frac{1}{m}\sum_{i=1}^{m}\left(x_j^{(i)}-\mu_j\right)^2

  3. For a new example $x$, compute $p(x)$:

p(x)=\prod_{j=1}^{n}p(x_j;\mu_j,\sigma_j^2)=\prod_{j=1}^{n}\frac{1}{\sqrt{2\pi}\sigma_j}\exp\left(-\frac{(x_j-\mu_j)^2}{2\sigma_j^2}\right)

  4. If $p(x)<\epsilon$, flag an anomaly.
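A usage sketch tying the pieces together (estimateGaussian is given above and multivariateGaussian below; when sigma2 is a vector, multivariateGaussian treats it as a diagonal covariance, which is exactly the per-feature product above; the threshold value here is only illustrative, see selectThreshold further below):

[mu, sigma2] = estimateGaussian(X);         % per-feature mean and variance
p = multivariateGaussian(X, mu, sigma2);    % p(x) for every example
epsilon = 1e-3;                             % illustrative; pick it with selectThreshold in practice
anomalies = find(p < epsilon);              % indices of the examples flagged as anomalous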

The Multivariate Gaussian Distribution

p(x;\mu,\Sigma)=\frac {\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)} {\sqrt{(2\pi)^{n}|\Sigma|}}

Parameters:

  • $\mu\in\R^n$
  • $\Sigma\in\R^{n\times n}$ (the covariance matrix; Sigma = 1/m * X' * X;)

Estimating the parameters:
\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)} \qquad \Sigma=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^T

👉 Code implementation:

function p = multivariateGaussian(X, mu, Sigma2)
%MULTIVARIATEGAUSSIAN Computes the probability density function of the
%multivariate gaussian distribution.
%    p = MULTIVARIATEGAUSSIAN(X, mu, Sigma2) Computes the probability 
%    density function of the examples X under the multivariate gaussian 
%    distribution with parameters mu and Sigma2. If Sigma2 is a matrix, it is
%    treated as the covariance matrix. If Sigma2 is a vector, it is treated
%    as the \sigma^2 values of the variances in each dimension (a diagonal
%    covariance matrix)
%

k = length(mu);

if (size(Sigma2, 2) == 1) || (size(Sigma2, 1) == 1)
    Sigma2 = diag(Sigma2);
end

X = bsxfun(@minus, X, mu(:)');
p = (2 * pi) ^ (- k / 2) * det(Sigma2) ^ (-0.5) * ...
    exp(-0.5 * sum(bsxfun(@times, X * pinv(Sigma2), X), 2));

end
Anomaly Detection with the Multivariate Gaussian
  1. Fit the model $p(x)$ with parameters:

\mu=\frac{1}{m}\sum_{i=1}^mx^{(i)} \qquad \Sigma=\frac{1}{m}\sum_{i=1}^m\left(x^{(i)}-\mu\right)\left(x^{(i)}-\mu\right)^T

  2. For a new example $x$, compute:

p(x)=\frac {\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)} {\sqrt{(2\pi)^{n}|\Sigma|}}

  3. If $p(x)<\epsilon$, flag an anomaly.

Choosing the Threshold

The most suitable $\epsilon$ can be found by maximizing the $F_1$ score.

$F_1$ is computed from precision ($prec$) and recall ($rec$):
F_1=\frac{2\cdot prec \cdot rec}{prec+rec}
where:
prec = \frac{tp}{tp+fp}

rec = \frac{tp}{tp+fn}

  • $tp$ (true positives): predicted positive, actually positive
  • $fp$ (false positives): predicted positive, actually negative
  • $fn$ (false negatives): predicted negative, actually positive

[figure]

👉 Code implementation:

function [bestEpsilon bestF1] = selectThreshold(yval, pval)
%SELECTTHRESHOLD Find the best threshold (epsilon) to use for selecting
%outliers
%   [bestEpsilon bestF1] = SELECTTHRESHOLD(yval, pval) finds the best
%   threshold to use for selecting outliers based on the results from a
%   validation set (pval) and the ground truth (yval).
%

bestEpsilon = 0;
bestF1 = 0;
F1 = 0;

stepsize = (max(pval) - min(pval)) / 1000;
for epsilon = min(pval):stepsize:max(pval)

    cvPredictions = pval < epsilon;
    
    tp = sum((cvPredictions == 1) & (yval == 1));
    fp = sum((cvPredictions == 1) & (yval == 0));
    fn = sum((cvPredictions == 0) & (yval == 1));

    prec = tp / (tp + fp);
    rec = tp / (tp + fn);

    F1 = (2 * prec * rec) / (prec + rec);

    if F1 > bestF1
       bestF1 = F1;
       bestEpsilon = epsilon;
    end
end

end

Recommender Systems

Use existing ratings to recommend new content to users.

Notation (assuming the items we recommend are movies):

  • $n_u$ = number of users
  • $n_m$ = number of movies
  • $r(i,j)=1$ if user $j$ has rated movie $i$, 0 otherwise
  • $y^{(i,j)}$ = the rating user $j$ gave movie $i$ (defined only when $r(i,j)=1$)

Content-Based Recommendation

Prediction model

[figure]

  • $r(i,j)=1$ if user $j$ has rated movie $i$
  • $y^{(i,j)}$ = the rating user $j$ gave movie $i$ (if defined)
  • $\theta^{(j)}$ = the parameter vector for user $j$
  • $x^{(i)}$ = the feature vector for movie $i$

For user $j$ and movie $i$, the predicted rating is $(\theta^{(j)})^T(x^{(i)})$.

Optimization objective

  1. To learn $\theta^{(j)}$ (the parameters for a single user $j$):

\min_{\theta^{(j)}}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{k=1}^n \left(\theta_k^{(j)}\right)^2

  2. To learn $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}$ (for all users):

\min_{\theta^{(1)},\cdots,\theta^{(n_u)}} \sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n \left(\theta_k^{(j)}\right)^2

We can solve this with gradient descent:
\begin{array}{l} Repeat\quad\{\\ \qquad \theta_0^{(j)}:=\theta_0^{(j)}-\alpha\sum_{i:r(i,j)=1} \big((\theta^{(j)})^T(x^{(i)})-y^{(i,j)}\big)x_0^{(i)}\\ \qquad \theta_k^{(j)}:=\theta_k^{(j)}-\alpha\Big[\Big(\sum_{i:r(i,j)=1}\big((\theta^{(j)})^T(x^{(i)})-y^{(i,j)}\big)x_k^{(i)}\Big)+\lambda\theta_k^{(j)}\Big]\qquad (\textrm{for } k \neq 0)\\ \} \end{array}

Collaborative Filtering

In content-based recommendation it can be hard to decide which features $x^{(i)}$ the movies (the items being recommended) should have. We would like the algorithm to learn the features by itself, which is what collaborative filtering does.

The newly added optimization objectives (the objectives from content-based recommendation above still apply):

  • Given $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}$, learn $x^{(i)}$:

\min_{x^{(i)}}\frac{1}{2}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+\frac{\lambda}{2}\sum_{k=1}^n \left(x_k^{(i)}\right)^2

  • Given $\theta^{(1)},\theta^{(2)},\cdots,\theta^{(n_u)}$, learn $x^{(1)},\cdots,x^{(n_m)}$:

\min_{x^{(1)},\cdots,x^{(n_m)}}\frac{1}{2} \sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n \left(x_k^{(i)}\right)^2

Collaborative filtering

Now the problem is that we have neither a trained $\theta$ nor a sufficiently optimized set of $x$, yet learning $\theta$ requires $x$ and learning $x$ requires $\theta$: a chicken-and-egg problem.

We can approach it like this:

First randomly guess a set of $\theta$; with it we can obtain a set of $x$; this $x$ can in turn be used to improve $\theta$, and the improved $\theta$ is used to improve $x$ again... Repeating this process, we can expect to end up with $x$ and $\theta$ that are both well optimized (in fact the iterates converge).

[figure]
\begin{array}{l} \textrm{Given } x^{(1)},\cdots,x^{(n_m)} \textrm{, estimate } \theta^{(1)},\cdots,\theta^{(n_u)}:\\ \quad \min_{\theta^{(1)},\cdots,\theta^{(n_u)}} \sum_{j=1}^{n_u}\sum_{i:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n \left(\theta_k^{(j)}\right)^2\\ \\ \textrm{Given } \theta^{(1)},\cdots,\theta^{(n_u)} \textrm{, estimate } x^{(1)},\cdots,x^{(n_m)}:\\ \quad \min_{x^{(1)},\cdots,x^{(n_m)}}\frac{1}{2} \sum_{i=1}^{n_m}\sum_{j:r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2 + \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n \left(x_k^{(i)}\right)^2 \end{array}
We could randomly initialize the parameters and then alternate between solving for $\theta$ and $x$ until we reach a solution, but that is clumsy; it is more efficient to optimize both at once:

Simultaneously optimize $x^{(1)},\cdots,x^{(n_m)}$ and $\theta^{(1)},\cdots,\theta^{(n_u)}$:
J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})= \frac{1}{2} \sum_{(i,j):r(i,j)=1}\left((\theta^{(j)})^Tx^{(i)}-y^{(i,j)}\right)^2+ \frac{\lambda}{2}\sum_{i=1}^{n_m}\sum_{k=1}^n \left(x_k^{(i)}\right)^2+ \frac{\lambda}{2}\sum_{j=1}^{n_u}\sum_{k=1}^n \left(\theta_k^{(j)}\right)^2

\min_{\begin{array}{c}x^{(1)},\cdots,x^{(n_m)}\\\theta^{(1)},\cdots,\theta^{(n_u)}\end{array}} J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})

The Collaborative Filtering Algorithm

  1. Initialize $x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)}$ to small random values
  2. Minimize $J(x^{(1)},\cdots,x^{(n_m)},\theta^{(1)},\cdots,\theta^{(n_u)})$ (e.g., with gradient descent or an advanced optimizer)
  3. For a given user with parameters $\theta$ and a movie with learned features $x$, predict that the user would rate the movie $\theta^Tx$

Low-Rank Matrix Factorization

Our final predictions form the matrix:
Predict = \left[\begin{array}{ccccc} (x^{(1)})^T(\theta^{(1)}) & \ldots & (x^{(1)})^T(\theta^{(n_u)})\\ \vdots & \ddots & \vdots \\ (x^{(n_m)})^T(\theta^{(1)}) & \ldots & (x^{(n_m)})^T(\theta^{(n_u)})\end{array}\right]
Since hardly any user rates anything close to all the movies, building and storing this matrix entry by entry is wasteful. Instead, let:

X = \left[\begin{array}{ccc} - & (x^{(1)})^T & - \\ & \vdots & \\ - & (x^{(n_m)})^T & - \end{array}\right], \qquad \Theta = \left[\begin{array}{ccc} - & (\theta^{(1)})^T & - \\ & \vdots & \\ - & (\theta^{(n_u)})^T & - \end{array}\right]

Then:

Predict = X\Theta^T

This splits the predictions into two factors. Computing $X$ and $\Theta$ in this way, to obtain the parameters the recommender system needs, is called low-rank matrix factorization. It not only lets us obtain the parameters with fully vectorized code, it also saves memory by storing the two factors instead of the full matrix.

Finding Related Movies

We often want to recommend movies $j$ related to a given movie $i$; these can be found by picking the movies with the
\mathop{\textrm{smallest}} ||x^{(i)}-x^{(j)}||
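A small sketch (assuming X is the learned movie-feature matrix with one movie per row, as in the driver script at the end of this section):

% The 5 movies most similar to movie i, by distance in feature space.
i = 1;
dists = sum(bsxfun(@minus, X, X(i, :)).^2, 2);   % squared distance from movie i to every movie
[~, order] = sort(dists);
related = order(2:6);                            % skip order(1), which is movie i itself
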
Mean Normalization

In the movie rating problem the ratings always fall in a fixed range (e.g., 1 to 5), so feature scaling is unnecessary, but we can still apply mean normalization:
\mu_i=\mathop{\textrm{average}} y^{(i,:)}

Y_i = Y_i-\mu_i

For user $j$ and movie $i$, predict:
\left(\Theta^{(j)}\right)^T\left(x^{(i)}\right)+\mu_i
👉 Code implementation:

  1. Cost function:
function [J, grad] = cofiCostFunc(params, Y, R, num_users, num_movies, ...
                                  num_features, lambda)
%COFICOSTFUNC Collaborative filtering cost function
%   [J, grad] = COFICOSTFUNC(params, Y, R, num_users, num_movies, ...
%   num_features, lambda) returns the cost and gradient for the
%   collaborative filtering problem.
%

% Unfold the U and W matrices from params
X = reshape(params(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(params(num_movies*num_features+1:end), ...
                num_users, num_features);

            
J = 0;
X_grad = zeros(size(X));
Theta_grad = zeros(size(Theta));

h = X * Theta';
er = (h - Y) .* R;

J = 1/2 * sum(sum(er.^2));
X_grad = er * Theta;
Theta_grad = er' * X; 

% Regularized

J += lambda/2 *(sum(sum(Theta.^2)) + sum(sum(X.^2)));
X_grad += lambda * X;
Theta_grad += lambda * Theta;
grad = [X_grad(:); Theta_grad(:)];

end
  2. Mean normalization:
function [Ynorm, Ymean] = normalizeRatings(Y, R)
%NORMALIZERATINGS Preprocess data by subtracting mean rating for every 
%movie (every row)
%   [Ynorm, Ymean] = NORMALIZERATINGS(Y, R) normalized Y so that each movie
%   has a rating of 0 on average, and returns the mean rating in Ymean.
%

[m, n] = size(Y);
Ymean = zeros(m, 1);
Ynorm = zeros(size(Y));
for i = 1:m
    idx = find(R(i, :) == 1);
    Ymean(i) = mean(Y(i, idx));
    Ynorm(i, idx) = Y(i, idx) - Ymean(i);
end

end
  3. Driver script:
%  Normalize Ratings
[Ynorm, Ymean] = normalizeRatings(Y, R);

%  Useful Values
num_users = size(Y, 2);
num_movies = size(Y, 1);
num_features = 10;

% Set Initial Parameters (Theta, X)
X = randn(num_movies, num_features);
Theta = randn(num_users, num_features);

initial_parameters = [X(:); Theta(:)];

% Set options for fmincg
options = optimset('GradObj', 'on', 'MaxIter', 100);

% Set Regularization
lambda = 10;
theta = fmincg (@(t)(cofiCostFunc(t, Ynorm, R, num_users, num_movies, ...
                                num_features, lambda)), ...
                initial_parameters, options);

% Unfold the returned theta back into U and W
X = reshape(theta(1:num_movies*num_features), num_movies, num_features);
Theta = reshape(theta(num_movies*num_features+1:end), ...
                num_users, num_features);

fprintf('Recommender system learning completed.\n');

p = X * Theta';
my_predictions = p(:,1) + Ymean;

EOF
