# Perceptron, k-Nearest Neighbors, and Naive Bayes

## 2. The Idea and Content of Gradient Descent

### Definition of the gradient:

$\nabla f\left( \theta \right) = \frac{df\left( \theta \right)}{d\theta}$

$\theta = \ \theta_{0} - \eta \bullet \nabla f\left( \theta_{0} \right)$

### Why gradient descent works:

By the first-order Taylor expansion around $\theta_{0}$:

$f\left( \theta \right) \approx \ f\left( \theta_{0} \right) + \left( \theta - \theta_{0} \right) \bullet \nabla f\left( \theta_{0} \right)$

Write the step as a step size $\eta > 0$ times a unit direction $v$:

$\theta - \ \theta_{0} = \eta \bullet v$

$f\left( \theta \right) \approx \ f\left( \theta_{0} \right) + \eta v \bullet \nabla f\left( \theta_{0} \right)$

To make the function value decrease, we need

$f\left( \theta \right) - f\left( \theta_{0} \right) \approx \eta v \bullet \nabla f\left( \theta_{0} \right) < 0$

that is, $v \bullet \nabla f\left( \theta_{0} \right) < 0$. The inner product is most negative when $v$ points opposite to the gradient:

$v = - \frac{\ \nabla f\left( \theta_{0} \right)}{\ \left\| \nabla f\left( \theta_{0} \right) \right\|}$

Absorbing the constant $\left\| \nabla f\left( \theta_{0} \right) \right\|$ into the step size gives the update rule

$\theta\ = \theta_{0} - \eta \bullet \nabla f\left( \theta_{0} \right)$

### The gradient descent algorithm:

$\text{grad}f\left( x_{0},x_{1},\cdots,x_{n} \right) = \left( \frac{\ \partial f}{\ \partial x_{0}},\cdots,\frac{\ \partial f}{\ \partial x_{n}} \right)$

Repeat{

$x_{0} : = x_{0} - \alpha\frac{\ \partial f}{\ \partial x_{0}}$

$\vdots$

$x_{j} : = x_{j} - \alpha\frac{\ \partial f}{\ \partial x_{j}}$

$\vdots$

$x_{n} : = x_{n} - \alpha\frac{\ \partial f}{\ \partial x_{n}}$

}
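The coordinate-wise update loop above can be sketched in a few lines. A minimal Python version (the article's own code is MATLAB; the objective $f(x_0, x_1) = x_0^2 + x_1^2$ is a hypothetical choice for illustration):

```python
def gradient_descent(grad, theta0, alpha=0.1, iters=100):
    """Repeat x_j := x_j - alpha * df/dx_j for every coordinate."""
    theta = list(theta0)
    for _ in range(iters):
        g = grad(theta)
        # simultaneous update of all coordinates
        theta = [t - alpha * gi for t, gi in zip(theta, g)]
    return theta

# gradient of f(x0, x1) = x0^2 + x1^2 is (2*x0, 2*x1)
theta = gradient_descent(lambda th: [2 * th[0], 2 * th[1]], [3.0, -4.0])
print(theta)  # both coordinates approach the minimizer 0
```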

## 3. The Principle of the Perceptron

### Model definition:

The function $f\left( x \right) = \operatorname{sign}\left( w \cdot x + b \right)$ is called a perceptron, where $w$ is the weight vector and $b$ the bias.

### Learning strategy:

The distance from a misclassified point $\left( x_{i},y_{i} \right)$ to the separating hyperplane is

$- \frac{1}{\left\| w \right\|}y_{i}\left( w \cdot x_{i} + b \right)$, where $\left\| w \right\|$ is the norm of $w$.

Summing over the set $M$ of misclassified points gives the total distance

$- \frac{1}{\left\| w \right\|}\sum_{x_{i} \in M}^{}{y_{i}\ \left( w \cdot x_{i} + b \right)}$

Dropping the constant factor $\frac{1}{\left\| w \right\|}$ yields the perceptron loss function

$L\left( w,b \right) = - \sum_{x_{i} \in M}^{}{y_{i}\ \left( w \cdot x_{i} + b \right)}$

### Perceptron learning algorithm:

#### Primal form:

a. Choose initial values $w_{0}, b_{0}$

b. Select a sample $\left( x_{i},y_{i} \right)$ from the training set

c. If $y_{i}\left( w \cdot x_{i} + b \right) \leq 0$, update $w \leftarrow w + \eta y_{i}x_{i}$, $b \leftarrow b + \eta y_{i}$

d. Go back to (b) until there are no misclassified points in the training set;

Convergence (Novikoff): if the data are linearly separable, there exist $w_{\text{opt}}, b_{\text{opt}}$ and $\gamma > 0$ such that $y_{i}\left( w_{\text{opt}} \cdot x_{i} + b_{\text{opt}} \right) \geq \gamma$ for all $i$, and the number of updates satisfies $k \leq \left( \frac{R}{\gamma} \right)^{2}$, where $R = \max_{i}\left\| x_{i} \right\|$.
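Steps a–d above can be sketched as follows; a minimal Python version (the article's own code is MATLAB), with a tiny linearly separable toy dataset chosen as an assumption for illustration:

```python
def perceptron_train(X, Y, eta=1.0, max_epochs=100):
    # step a: initial values w = 0, b = 0
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        # step b: sweep through the training samples
        for x, y in zip(X, Y):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # step c: w <- w + eta*y*x, b <- b + eta*y
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
                mistakes += 1
        if mistakes == 0:  # step d: stop once nothing is misclassified
            break
    return w, b

X = [(3, 3), (4, 3), (1, 1)]  # toy separable data
Y = [1, 1, -1]
w, b = perceptron_train(X, Y)
print(w, b)  # converges to w = [1, 1], b = -3 on this data
```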

#### Dual form:

Since every update adds $\eta y_{i}x_{i}$ to $w$ and $\eta y_{i}$ to $b$, the learned parameters can be written as

$w = \sum_{j = 1}^{N}{a_{j}y_{j}x_{j}},\quad b = \sum_{j = 1}^{N}{a_{j}y_{j}}$

a) Initialize $a \leftarrow 0$, $b \leftarrow 0$

b) Select a sample $\left( x_{i},y_{i} \right)$ from the training set

c) If $y_{i}\left( \sum_{j = 1}^{N}{a_{j}y_{j}\left( x_{j} \cdot x_{i} \right)} + b \right) \leq 0$, update $a_{i} \leftarrow a_{i} + \eta$, $b \leftarrow b + \eta y_{i}$

d) Go back to (b) until there are no misclassified points in the training set;

The inner products can be precomputed once as the Gram matrix $G = \left\lbrack x_{i} \cdot x_{j} \right\rbrack_{N \times N}$.
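The dual form with a precomputed Gram matrix can be sketched in Python on the same toy data as above (a hypothetical dataset, not from the source):

```python
def dual_perceptron_train(X, Y, eta=1.0, max_epochs=100):
    N = len(X)
    # precompute the Gram matrix G[i][j] = x_i . x_j
    G = [[sum(u * v for u, v in zip(X[i], X[j])) for j in range(N)] for i in range(N)]
    alpha = [0.0] * N
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(N):
            s = sum(alpha[j] * Y[j] * G[j][i] for j in range(N)) + b
            if Y[i] * s <= 0:
                alpha[i] += eta   # a_i <- a_i + eta
                b += eta * Y[i]   # b <- b + eta*y_i
                mistakes += 1
        if mistakes == 0:
            break
    # recover w = sum_j a_j * y_j * x_j
    w = [sum(alpha[j] * Y[j] * X[j][d] for j in range(N)) for d in range(len(X[0]))]
    return w, b

w, b = dual_perceptron_train([(3, 3), (4, 3), (1, 1)], [1, 1, -1])
print(w, b)  # same separating hyperplane as the primal form: w = [1, 1], b = -3
```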

## 4. The Principle of k-Nearest Neighbors

### Core idea:

1. Using the given distance metric, find the $k$ points in the training set $T$ nearest to $x$; the neighborhood of $x$ covering these $k$ points is denoted $N_{k}\left( x \right)$;

2. Decide the class $y$ of $x$ from $N_{k}\left( x \right)$ according to a classification decision rule (e.g., majority voting).

### Details:

• The special case of k-NN with $k = 1$ is called the nearest-neighbor model.

• k-NN has no explicit learning process.

• Once the training set, the distance metric (e.g., Euclidean distance), the value of $k$, and the decision rule (e.g., majority voting) are fixed, the class of any new input is uniquely determined. Each training point owns a cell, and together the training points partition the feature space; every cell carries the class label of its training point.

• Distance metric: usually the Euclidean distance, or more generally the $L_{p}$ distance (with $p$ chosen as appropriate);

$L_{p}\left( x_{i},x_{j} \right) = \left( \sum_{l = 1}^{n}\left| x_{i}^{\left( l \right)} - x_{j}^{\left( l \right)} \right|^{p} \right)^{\frac{1}{p}}$

• Value of $k$: usually a fairly small number. Too small a $k$ causes overfitting; too large a $k$ makes the overall model too simple. Cross-validation is commonly used to select the optimal $k$.

• **Decision rule:** usually majority voting, which is equivalent to empirical risk minimization;

• The main computational concern in k-NN is fast k-nearest-neighbor search over the training data. To speed it up, the training data can be stored in special structures; among the many options, the kd-tree is one.
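The two steps of the core idea can be sketched with a brute-force search (no kd-tree) using the $L_p$ distance and majority voting; a minimal Python version with a hypothetical toy dataset:

```python
from collections import Counter

def lp_distance(a, b, p=2):
    """L_p distance; p=2 gives the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train, query, k=3, p=2):
    # step 1: find the k training points nearest to the query -> N_k(x)
    neighbors = sorted(train, key=lambda xy: lp_distance(xy[0], query, p))[:k]
    # step 2: majority vote among their labels
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((1.0, 1.0), 'A'), ((1.2, 0.8), 'A'),
         ((5.0, 5.0), 'B'), ((5.2, 4.8), 'B'), ((4.9, 5.1), 'B')]
print(knn_predict(train, (5.0, 4.9), k=3))  # prints B
```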

## 5. The Principle of Naive Bayes

### Basic method:

By the conditional independence assumption,

$P\left( X = x \middle| Y = c_{k} \right) = P\left( X^{\left( 1 \right)} = x^{\left( 1 \right)},\cdots,X^{\left( n \right)} = x^{\left( n \right)} \middle| Y = c_{k} \right) = \prod_{j = 1}^{n}{P\left( X^{\left( j \right)} = x^{\left( j \right)} \middle| Y = c_{k} \right)} \quad (1)$

$P\left( Y = c_{k} \middle| X = x \right) = \frac{P\left( X = x \middle| Y = c_{k} \right)P\left( Y = c_{k} \right)}{\sum_{k}^{}{P\left( X = x \middle| Y = c_{k} \right)P\left( Y = c_{k} \right)}} \quad (2)$

$y = f\left( x \right) = \arg\max_{c_{k}}\frac{P\left( Y = c_{k} \right)\prod_{j = 1}^{n}{P\left( X^{\left( j \right)} = x^{\left( j \right)} \middle| Y = c_{k} \right)}}{\sum_{k}^{}{P\left( Y = c_{k} \right)\prod_{j = 1}^{n}{P\left( X^{\left( j \right)} = x^{\left( j \right)} \middle| Y = c_{k} \right)}}}$

Since the denominator is the same for every $c_{k}$, this simplifies to

$y = \arg\max_{c_{k}}{P\left( Y = c_{k} \right)\prod_{j = 1}^{n}{P\left( X^{\left( j \right)} = x^{\left( j \right)} \middle| Y = c_{k} \right)}}$

### Posterior probability maximization:

With the 0-1 loss

$L\left( Y,f\left( X \right) \right) = \begin{cases} 1, & Y \neq f\left( X \right) \\ 0, & Y = f\left( X \right) \end{cases}$

the expected risk is

$R_{\exp}\left( f \right) = E\left\lbrack L\left( Y,f\left( X \right) \right) \right\rbrack = E_{X}\sum_{k = 1}^{K}{L\left( c_{k},f\left( X \right) \right)P\left( c_{k} \middle| X \right)}$

Minimizing it pointwise for each $x$:

$f\left( x \right) = \arg\min_{y \in \mathcal{Y}}{\sum_{k = 1}^{K}{L\left( c_{k},y \right)P\left( c_{k} \middle| X = x \right)}}$

$= \arg\min_{y \in \mathcal{Y}}{\sum_{k = 1}^{K}{P\left( y \neq c_{k} \middle| X = x \right)}}$

$= \arg\min_{y \in \mathcal{Y}}\left( 1 - P\left( y = c_{k} \middle| X = x \right) \right)$

$= \arg\max_{y \in \mathcal{Y}}{P\left( y = c_{k} \middle| X = x \right)}$

Thus expected risk minimization under the 0-1 loss yields posterior probability maximization:

$f\left( x \right) = \arg\max_{c_{k}}{P\left( Y = c_{k} \middle| X = x \right)}$

### Parameter estimation for naive Bayes:

#### Maximum likelihood estimation:

$P\left( Y = c_{k} \right) = \frac{\sum_{i = 1}^{N}{I\left( y_{i} = c_{k} \right)}}{N}$

$P\left( X^{\left( j \right)} = a_{jl} \middle| Y = c_{k} \right) = \frac{\sum_{i = 1}^{N}{I\left( x_{i}^{\left( j \right)} = a_{jl},\ y_{i} = c_{k} \right)}}{\sum_{i = 1}^{N}{I\left( y_{i} = c_{k} \right)}}$

where $I\left( \cdot \right)$ is the indicator function and $a_{jl}$ is the $l$-th possible value of the $j$-th feature.

#### Bayesian estimation:

Adding a smoothing term $\lambda \geq 0$ ($\lambda = 1$ gives Laplace smoothing) avoids zero probability estimates:

$P\left( X^{\left( j \right)} = a_{jl} \middle| Y = c_{k} \right) = \frac{\sum_{i = 1}^{N}{I\left( x_{i}^{\left( j \right)} = a_{jl},\ y_{i} = c_{k} \right)} + \lambda}{\sum_{i = 1}^{N}{I\left( y_{i} = c_{k} \right)} + S_{j}\lambda}$

where $S_{j}$ is the number of possible values of the $j$-th feature; similarly, with $K$ classes,

$P\left( Y = c_{k} \right) = \frac{\sum_{i = 1}^{N}{I\left( y_{i} = c_{k} \right)} + \lambda}{N + K\lambda}$

### Learning and classification algorithm:

• Compute the prior and conditional probabilities;

• For a given instance $x$, compute

$P\left( Y = c_{k} \right)\prod_{j = 1}^{n}{P\left( X^{\left( j \right)} = x^{\left( j \right)} \middle| Y = c_{k} \right)}$

• Determine the class of instance $x$:

$y = \arg\max_{c_{k}}{P\left( Y = c_{k} \right)\prod_{j = 1}^{n}{P\left( X^{\left( j \right)} = x^{\left( j \right)} \middle| Y = c_{k} \right)}}$
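The three steps above can be sketched compactly in Python on the same toy dataset used by the MATLAB example in the next section (features: $x^{(1)} \in \{1,2,3\}$, $x^{(2)} \in \{4,5,6\}$ coding S/M/L); $\lambda = 0$ gives the maximum likelihood estimates and $\lambda = 1$ the Laplace-smoothed ones:

```python
T = [(1,4,-1),(1,5,-1),(1,5,1),(1,4,1),(1,4,-1),
     (2,4,-1),(2,5,-1),(2,5,1),(2,6,1),(2,6,1),
     (3,6,1),(3,5,1),(3,5,1),(3,6,1),(3,6,-1)]

def nb_predict(T, x, values, classes, lam=0.0):
    """Pick the class maximizing P(Y=c) * prod_j P(X^(j)=x^(j) | Y=c)."""
    N = len(T)
    best, best_p = None, -1.0
    for c in classes:
        rows = [r for r in T if r[-1] == c]
        # prior P(Y=c), smoothed by lambda over the number of classes
        p = (len(rows) + lam) / (N + lam * len(classes))
        for j, xj in enumerate(x):
            match = sum(1 for r in rows if r[j] == xj)
            # conditional P(X^(j)=x^(j) | Y=c), smoothed over S_j values
            p *= (match + lam) / (len(rows) + lam * len(values[j]))
        if p > best_p:
            best, best_p = c, p
    return best

values = [{1, 2, 3}, {4, 5, 6}]
print(nb_predict(T, (2, 4), values, [-1, 1]))           # prints -1 (MLE)
print(nb_predict(T, (2, 4), values, [-1, 1], lam=1.0))  # prints -1 (Laplace)
```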

## 6. A Simple Naive Bayes Example

### Source code:

```matlab
%%--------------------- Entry point ---------------------
clc
clear
% naive Bayes on the textbook example
% value set of feature 1
A1 = [1; 2; 3];
% value set of feature 2 (coding S M L)
A2 = [4; 5; 6];
AValues = {A1; A2};
% value set of Y
YValue = [-1; 1];
% dataset
T = [1,4,-1;
     1,5,-1;
     1,5,1;
     1,4,1;
     1,4,-1;
     2,4,-1;
     2,5,-1;
     2,5,1;
     2,6,1;
     2,6,1;
     3,6,1;
     3,5,1;
     3,5,1;
     3,6,1;
     3,6,-1];
% train the naive Bayes model (maximum likelihood estimation)
theta = NBtrain(T(:, 1:end-1), T(:, end), AValues, YValue);
% train the naive Bayes model with Laplace smoothing (lambda = 1)
ltheta = LaplaceNBtrain(T(:, 1:end-1), T(:, end), AValues, YValue, 1);
% evaluate on the training data; the predictions match the book's answers
correct = 0;
for i = 1:size(T, 1)
    if NBtest(theta, T(i, 1:2), AValues, YValue) * T(i, 3) == 1
        correct = correct + 1;
    end
end
accuracy = correct / size(T, 1);
disp(['Training-set accuracy with maximum likelihood estimation: ' num2str(accuracy)])

correct = 0;
for i = 1:size(T, 1)
    if NBtest(ltheta, T(i, 1:2), AValues, YValue) * T(i, 3) == 1
        correct = correct + 1;
    end
end
accuracy = correct / size(T, 1);
disp(['Training-set accuracy with Bayesian (smoothed) estimation: ' num2str(accuracy)])
```

```matlab
%%--------------------- Prediction with a trained model ---------------------
function y = NBtest(theta, X, AValues, YValue)
Xindice = ones(size(X, 1), size(X, 2));
% convert X into a matrix of indices into each feature's value set
for j = 1:size(X, 2)
    AXi = AValues{j, 1};
    for i = 1:size(X, 1)
        for t = 1:size(AXi, 1)
            if X(i, j) == AXi(t, 1)
                Xindice(i, j) = t;
                break
            end
        end
    end
end
% Ys(i, k) records P(X = x_i | Y = y_k) * P(Y = y_k)
Ys = zeros(size(X, 1), size(YValue, 1));
PX_Y = theta{1, 1};
PY = theta{2, 1};
for i = 1:size(Ys, 1)
    x = Xindice(i, :);
    for k = 1:size(Ys, 2)
        p = PY(k, 1);
        for j = 1:size(x, 2)
            p = p * PX_Y{k, j}(x(1, j), 1);
        end
        Ys(i, k) = p;
    end
end
% posterior probability maximization
y = zeros(size(Ys, 1), 1);
for i = 1:size(Ys, 1)
    best = -1;
    best_indice = 0;
    for j = 1:size(Ys, 2)
        if Ys(i, j) > best
            best = Ys(i, j);
            best_indice = j;
        end
    end
    y(i, 1) = YValue(best_indice, 1);
end
end
```

```matlab
%--------------------- Maximum likelihood estimation ---------------------
function theta = NBtrain(X, Y, AValues, YValue)
% count each class to estimate the prior
TY = zeros(size(YValue, 1), 1);
for i = 1:size(Y, 1)
    for j = 1:size(YValue, 1)
        if Y(i, 1) == YValue(j, 1)
            TY(j, 1) = TY(j, 1) + 1;
        end
    end
end
PY = TY / size(Y, 1);
% estimate the conditional probabilities
PX_Y = cell(size(YValue, 1), size(X, 2));
for k = 1:size(YValue, 1)
    % condition Y = y_k
    for i = 1:size(X, 2)
        % i is the feature index; get the value set of feature i
        XAi = AValues{i, 1};
        TXij_Y = zeros(size(XAi, 1), 1);
        for j = 1:size(XAi, 1)
            % count samples with Y = y_k and feature i equal to A_ij
            for t = 1:size(X, 1)
                if Y(t, 1) == YValue(k, 1) && X(t, i) == XAi(j, 1)
                    TXij_Y(j, 1) = TXij_Y(j, 1) + 1;
                end
            end
        end
        PX_Y{k, i} = TXij_Y / TY(k, 1);
    end
end
theta = cell(2, 1);
theta{1, 1} = PX_Y;
theta{2, 1} = PY;
end
```
```matlab
%--------------------- Bayesian estimation ---------------------
function theta = LaplaceNBtrain(X, Y, AValues, YValue, lambda)
% count each class to estimate the prior
TY = zeros(size(YValue, 1), 1);
for i = 1:size(Y, 1)
    for j = 1:size(YValue, 1)
        if Y(i, 1) == YValue(j, 1)
            TY(j, 1) = TY(j, 1) + 1;
        end
    end
end
% smoothed prior: (count + lambda) / (N + K*lambda)
PY = (TY + lambda) / (size(Y, 1) + lambda * size(YValue, 1));
% estimate the conditional probabilities
PX_Y = cell(size(YValue, 1), size(X, 2));
for k = 1:size(YValue, 1)
    % condition Y = y_k
    for i = 1:size(X, 2)
        % i is the feature index; get the value set of feature i
        XAi = AValues{i, 1};
        TXij_Y = zeros(size(XAi, 1), 1);
        for j = 1:size(XAi, 1)
            % count samples with Y = y_k and feature i equal to A_ij
            for t = 1:size(X, 1)
                if Y(t, 1) == YValue(k, 1) && X(t, i) == XAi(j, 1)
                    TXij_Y(j, 1) = TXij_Y(j, 1) + 1;
                end
            end
        end
        % smoothed conditional: (count + lambda) / (class count + S_j*lambda)
        PX_Y{k, i} = (TXij_Y + lambda) / (TY(k, 1) + lambda * size(XAi, 1));
    end
end
theta = cell(2, 1);
theta{1, 1} = PX_Y;
theta{2, 1} = PY;
end
```


### Experimental results:
