Machine Learning Notes 2: Supervised Learning

爱老虎呦

Article link: https://blog.csdn.net/u013992766/article/details/115719410

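1. Linear Regression

1.1 Three Gradient Descent Strategies

Stochastic gradient descent: apply the squared (or absolute) error to one data point at a time, updating the weights after each point, and repeat this process many times.
Batch gradient descent: apply the squared (or absolute) error to all data points at once, update the weights with the combined result, and repeat this process many times.
Mini-batch gradient descent: in practice, the best approach for linear regression is to split the data into many small batches, each with roughly the same number of points, and update the weights once per batch.
Specifically, applying the squared (or absolute) error to a data point yields values that are added to the model's weights. We can either add those values, update the weights, and move on to the next data point, or compute the values for all points at once, sum them, and update the weights with the sum.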
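The three strategies differ only in how many points feed each weight update. The following is a minimal sketch in plain Python, assuming a one-feature model y ≈ w*x + b trained on squared error; the function name `gradient_descent`, the learning rate, the epoch count, and the toy data are illustrative choices, not part of the original notes.

```python
import random

def gradient_descent(xs, ys, batch_size, lr=0.1, epochs=2000, seed=0):
    """Fit y ~ w*x + b by minimizing the mean squared error.

    batch_size=1        -> stochastic gradient descent
    batch_size=len(xs)  -> batch gradient descent
    anything in between -> mini-batch gradient descent
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the mean squared error over the current batch.
            grad_w = sum(2 * (w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
            grad_b = sum(2 * (w * xs[i] + b - ys[i]) for i in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

# Toy, noiseless data on the line y = 2x + 1 (made up for illustration).
xs = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
ys = [2 * x + 1 for x in xs]

w_sgd, b_sgd = gradient_descent(xs, ys, batch_size=1)
w_mini, b_mini = gradient_descent(xs, ys, batch_size=2)
w_batch, b_batch = gradient_descent(xs, ys, batch_size=len(xs))
```

On this toy data all three variants should recover roughly the same line; they differ in the cost of each update and in how noisy the updates are.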

1.2 Gradient Descent vs. Solving the System of Equations

If there are only two dimensions, we have a system of two equations and two unknowns, which we can easily solve using linear algebra. With n dimensions, we would have n equations with n unknowns, and solving such a system is very expensive: when n is big, at some point in the solution we have to invert an n-by-n matrix, and inverting a huge matrix takes a lot of time and computing power. So this is simply not feasible, and that is why we use gradient descent instead. It will not necessarily give us the exact answer, but it will get us pretty close to the best answer, which gives a solution that fits our data pretty well. If we had infinite computing power, we would just solve the system and do linear regression in one step.
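For the two-unknown case, the system can be written out and solved directly. A minimal sketch, assuming a one-feature model y ≈ w*x + b with squared error; setting both partial derivatives to zero gives two linear equations in w and b, solved here with Cramer's rule. The function name `fit_line_closed_form` and the data are illustrative.

```python
def fit_line_closed_form(xs, ys):
    """Solve the two normal equations of least squares for y ~ w*x + b.

        w * sum(x^2) + b * sum(x) = sum(x*y)
        w * sum(x)   + b * n      = sum(y)
    """
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = sxx * n - sx * sx  # zero only if all x values are identical
    w = (sxy * n - sx * sy) / det
    b = (sxx * sy - sx * sxy) / det
    return w, b

w, b = fit_line_closed_form([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(w, b)  # 2.0 1.0
```

With n features the same idea requires solving an n-by-n system, which is the step that becomes expensive for large n.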

1.3 Regularization

We have a parameter called lambda, which multiplies the complexity error to make it bigger or smaller. With a large lambda we punish complexity heavily and end up picking a simpler model.

Which one should we use, L1 or L2? L1 regularization is actually computationally inefficient even though it seems simpler, because absolute values are hard to differentiate. In L2 regularization, the squares have a very nice derivative, which makes things easy computationally. The only time L1 regularization is faster than L2 is when the data is sparse. Say you have a thousand columns of data but only 10 of them are relevant, with a lot of zeros in between: then L1 is better. So L1 is better for sparse outputs and L2 is better for non-sparse outputs. When many of the columns are zero, L1 is better; when the data is distributed fairly evenly among the columns, L2 is better.

L1 also has one huge benefit: it gives us feature selection. Say we again have data with a thousand columns, but only 10 of them matter and the rest are just noise. L1 will notice this and drive the weights of the irrelevant columns to zero. L2, on the other hand, will not: it keeps all the columns, shrinks their weights without zeroing them, and gives us a combination of all of them as the result.
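The feature-selection effect can be seen in a one-weight toy problem. This is a sketch under a simplifying assumption: each weight is fitted independently, with unregularized optimum `a`, by minimizing 0.5*(w - a)**2 plus the penalty. In that setting the L1 minimizer is the soft-thresholding rule and the L2 minimizer is a plain rescaling. The function names and the numbers are illustrative.

```python
def l1_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*abs(w): soft thresholding."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero

def l2_solution(a, lam):
    """Minimizer of 0.5*(w - a)**2 + lam*w**2: shrinks but never zeroes."""
    return a / (1 + 2 * lam)

# Two relevant weights and three near-zero "noise" weights (made-up values).
unregularized = [3.0, -2.0, 0.05, -0.02, 0.01]
lam = 0.1

l1_weights = [l1_solution(a, lam) for a in unregularized]
l2_weights = [l2_solution(a, lam) for a in unregularized]
```

Here the L1 weights for the three noise columns come out exactly zero, while the L2 weights are all smaller than before but still nonzero.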

2. Decision Trees

2.1 Entropy

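A minimal sketch of entropy for a set of class labels, assuming the standard Shannon definition H = -Σ p_i log2(p_i), where p_i is the proportion of labels in class i. The function name `entropy` and the example labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    proportions = [count / n for count in Counter(labels).values()]
    return sum(-p * math.log2(p) for p in proportions)

mixed = entropy(["red", "red", "blue", "blue"])  # evenly mixed set
pure = entropy(["red", "red", "red", "red"])     # only one class
```

An evenly mixed two-class set has entropy 1 bit, and a pure set has entropy 0, so lower entropy means a more homogeneous node.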

2.2 Information Gain

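A minimal sketch of information gain, assuming the usual definition: the entropy of the parent node minus the size-weighted average entropy of its children after a split. The function names and the toy labels are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the weighted average entropy of the children."""
    n = len(parent)
    weighted_child_entropy = sum(len(ch) / n * entropy(ch) for ch in children)
    return entropy(parent) - weighted_child_entropy

parent = ["red", "red", "blue", "blue"]
perfect_split = information_gain(parent, [["red", "red"], ["blue", "blue"]])
useless_split = information_gain(parent, [["red", "blue"], ["red", "blue"]])
```

A decision tree picks, at each node, the split with the largest information gain: the perfect split above gains a full bit, while the useless split gains nothing.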

2.3 Random Forests

Decision trees tend to overfit. So how do we solve this? In the simplest possible way: pick some of the columns randomly and build a decision tree on those columns. Now pick some other columns randomly and build a decision tree on those, and do it again. Then just let the trees vote: when we have a new data point, we let all the trees make a prediction and pick the one that appears most often. Since we used a bunch of trees on randomly picked columns, this is called a random forest. There are better ways to pick the columns than randomly, and we'll see them under ensemble methods.
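The voting step can be sketched as follows. This is only an illustration of the vote, not of tree building: the three "trees" are hand-written stand-ins, each looking at a different column, and every name here is hypothetical.

```python
from collections import Counter

def tree_on_column_0(point):
    return 1 if point[0] > 0.5 else 0

def tree_on_column_1(point):
    return 1 if point[1] > 0.5 else 0

def tree_on_column_2(point):
    return 1 if point[2] > 0.5 else 0

def forest_predict(trees, point):
    """Let every tree predict, then return the most common prediction."""
    votes = [tree(point) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

trees = [tree_on_column_0, tree_on_column_1, tree_on_column_2]
first = forest_predict(trees, [0.9, 0.8, 0.1])   # votes: 1, 1, 0
second = forest_predict(trees, [0.1, 0.9, 0.2])  # votes: 0, 1, 0
```

In a real random forest each stand-in would be a decision tree trained on its own random subset of columns; the vote works the same way.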

2.4 Decision Trees in sklearn

For decision tree models, you will use scikit-learn's DecisionTreeClassifier class. It provides functions for defining the model and fitting it to data.

>>> from sklearn.tree import DecisionTreeClassifier
>>> model = DecisionTreeClassifier()
>>> model.fit(x_values, y_values)

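In the example above, the model variable is a decision tree model fitted to the data x_values and y_values (assumed to be loaded already as a feature array and a label array). Fitting the model means finding the tree, that is, the sequence of splits, that best fits the training data. We then use the model's predict() function to make two predictions.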

>>> print(model.predict([ [0.2, 0.8], [0.5, 0.4] ]))
[0. 1.]

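The model returns an array of predictions, one for each input. The prediction for the first input, [0.2, 0.8], is 0.0, and the prediction for the second input, [0.5, 0.4], is 1.0.

Hyperparameters

When we define the model, we can specify hyperparameters. In practice, the most common ones are:

max_depth: the maximum number of levels in the tree.
min_samples_leaf: the minimum number of samples allowed in a leaf.
min_samples_split: the minimum number of samples required to split an internal node.
max_features: the number of features to consider when looking for the best split.

For example, here we define a model whose maximum tree depth max_depth is 7 and whose minimum number of samples per leaf min_samples_leaf is 10.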

>>> model = DecisionTreeClassifier(max_depth = 7, min_samples_leaf = 10)

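3. Bayes

3.1 Bayes' Theorem

The "naive" in naive Bayes refers to the assumption that the events (features) are independent of one another given the class, so the joint probability can be split into a product of individual probabilities.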
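As a minimal sketch of how that product is used: Bayes' theorem states P(A|B) = P(B|A) P(A) / P(B), and under the naive assumption the score of each class is its prior times the product of the per-feature likelihoods, normalized over all classes. The spam/ham setting and all the probabilities below are made up for illustration.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior over classes given observed features, assuming independence.

    priors:      {label: P(label)}
    likelihoods: {label: {feature: P(feature | label)}}
    observed:    list of features present in the new example
    """
    scores = {}
    for label, prior in priors.items():
        score = prior
        for feature in observed:
            score *= likelihoods[label][feature]  # the "naive" product
        scores[label] = score
    total = sum(scores.values())  # plays the role of P(B) in Bayes' theorem
    return {label: score / total for label, score in scores.items()}

priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": {"easy": 0.5, "money": 0.4},
    "ham": {"easy": 0.1, "money": 0.05},
}
posterior = naive_bayes_posterior(priors, likelihoods, ["easy", "money"])
```

Here the unnormalized scores are 0.4 * 0.5 * 0.4 = 0.08 for spam and 0.6 * 0.1 * 0.05 = 0.003 for ham, so the posterior probability of spam is 0.08 / 0.083, about 0.96.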

4. Support Vector Machines

Our new error for this algorithm is going to be a classification error plus a margin error. Minimizing this error is what gives us the support vector machine algorithm.

4.1 Perceptron Algorithm vs. Classification Error

This model gives us an error of six, and the idea is to minimize this error using gradient descent in order to find the ideal W and b that give us the best possible cut. That is the perceptron algorithm.
In a nutshell, that is the classification error for support vector machines.

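4.2 Margin Error

We want to penalize small margins, so the margin error should be large when the margin is small. The margin error is just the squared norm of the vector W, |W|² = w1² + w2². A large |W| corresponds to a small margin, so the error grows as the margin shrinks. It is exactly the same error as the regularization term in L2 regularization.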

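4.3 Computing the Margin Error

First, let W=(w1,w2) and x=(x1,x2), so that Wx=w1x1+w2x2. We have three lines with the following equations:

Wx+b=1
Wx+b=0
Wx+b=-1

Since these three lines are parallel and equidistant, to find the distance between the first and third lines we only need to compute the distance between the first two lines and multiply it by two. So what we need is the distance between the first two lines in Figure 1.
Note that since we only need the distance between the lines, we can translate them until one of them passes through the origin. The resulting equations are:

Wx=0
Wx=1
Now the first line has the equation Wx=0, which means it is perpendicular to the vector W=(w1,w2), marked in red.

This vector intersects the line Wx=1 at the blue point. Suppose this point has coordinates (p,q). Then we have the following two facts:
w1p+w2q=1 (since the point lies on this line), and
since the point lies on the vector W=(w1,w2), (p,q) is a multiple of (w1,w2).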
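The two facts above are enough to finish the computation. A sketch of the remaining steps, writing the multiple as k (a symbol introduced here) and |W| for the length of W:

```latex
\begin{aligned}
(p, q) &= k\,(w_1, w_2) \\
w_1 p + w_2 q = 1 \;&\Longrightarrow\; k\,(w_1^2 + w_2^2) = 1
  \;\Longrightarrow\; k = \frac{1}{w_1^2 + w_2^2} = \frac{1}{|W|^2} \\
\text{distance between } Wx = 0 \text{ and } Wx = 1
  &= |(p, q)| = k\,|W| = \frac{1}{|W|} \\
\text{distance between the first and third lines}
  &= \frac{2}{|W|}
\end{aligned}
```

So the margin is 2/|W|: a large |W|² means a small margin, which is why |W|² serves as the margin error.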

4.4 The C Parameter

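A minimal sketch of the role of C, assuming the common formulation in which C multiplies the classification error before the margin error is added: Error = C * classification error + margin error. The function name and the two candidate models (with made-up error values) are hypothetical.

```python
def svm_error(classification_error, margin_error, C):
    """Total error under the assumed form C * classification + margin."""
    return C * classification_error + margin_error

# Model A: classifies every point correctly but has a narrow margin.
# Model B: misclassifies a couple of points but has a wide margin.
model_a = {"classification_error": 0.0, "margin_error": 10.0}
model_b = {"classification_error": 2.0, "margin_error": 1.0}

def preferred_model(C):
    error_a = svm_error(C=C, **model_a)
    error_b = svm_error(C=C, **model_b)
    return "A" if error_a < error_b else "B"

large_c_choice = preferred_model(C=10.0)  # A: 10.0 vs B: 21.0
small_c_choice = preferred_model(C=0.1)   # A: 10.0 vs B: 1.2
```

A large C puts the emphasis on classifying the training points correctly, even at the cost of a narrow margin; a small C favors a wide margin and tolerates some misclassified points.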

4.5 Polynomial Kernel

A kernel is really a set of functions that comes to our aid. In higher dimensions, we can create many more functions and build many more boundaries. We just add some dimensions to the data, find a higher-dimensional surface that separates it, project that surface back down, and we get our curves. In general, the degree of a polynomial kernel is a hyperparameter that we can tune to find the best possible model.
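"Adding dimensions" can be made concrete for degree 2. A minimal sketch, assuming the homogeneous degree-2 polynomial kernel k(x, y) = (x·y)² on 2D points: it equals an ordinary dot product after mapping each point to the three monomials (x1², √2·x1·x2, x2²). The function names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Homogeneous polynomial kernel of degree 2, computed in 2D."""
    return dot(x, y) ** 2

def poly2_features(x):
    """Explicit map from 2D to the 3D space of degree-2 monomials."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x, y = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(x, y)                            # (1*3 + 2*4)^2
via_features = dot(poly2_features(x), poly2_features(y))   # dot product in 3D
```

Both routes give 121 (up to floating-point rounding). The kernel lets the algorithm work in the higher-dimensional space without ever building the extra coordinates.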

4.6 RBF Kernel

Here is how the kernel method works in higher dimensions. First, draw a mountain at every point. In the 3D case, this mountain is a Gaussian paraboloid, and it lifts the point. Then, if we want to separate the point from the rest, we can cut the mountain with a plane. The plane intersects the paraboloid in a circle, and this circle becomes our boundary. If we have more points, we do the same. We use a method similar to the one before to find the right weights for the combination of mountains that brings the majority of the red points up while keeping the majority of the blue points down, and then cut this with a plane. When we project down, the intersections of the surface with the plane give us the boundaries that split our data.

The γ parameter controls the width of each mountain. It is tied to the variance σ² of the normal distribution by γ = 1/(2σ²), so it is inversely related to the variance rather than equal to it: a large γ gives narrow, pointy mountains, and a small γ gives wide ones.
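A minimal sketch of the RBF kernel, assuming the usual form k(x, y) = exp(-γ·|x - y|²). The helper names `rbf_kernel` and `gamma_from_sigma` and the sample points are illustrative.

```python
import math

def rbf_kernel(x, y, gamma):
    """Height at y of the 'mountain' centered at x."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

def gamma_from_sigma(sigma):
    """Gamma for a Gaussian with standard deviation sigma."""
    return 1.0 / (2.0 * sigma ** 2)

center, neighbor = (0.0, 0.0), (1.0, 0.0)
at_center = rbf_kernel(center, center, gamma=1.0)        # peak of the mountain
narrow = rbf_kernel(center, neighbor, gamma=10.0)        # large gamma
wide = rbf_kernel(center, neighbor, gamma=0.1)           # small gamma
```

The peak is always 1. With a large γ the height has almost vanished one unit away from the center, while with a small γ the neighbor is still lifted nearly as high as the center, which is why a large γ tends toward overfitting.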

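4.7 Support Vector Machines in sklearn

In the example below, the model variable is a support vector machine model fitted to the data x_values and y_values. Fitting the model means finding the boundary that best fits the training data. We then use the model's predict() function to make two predictions.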

>>> from sklearn.svm import SVC
>>> model = SVC(kernel='poly', degree=4, C=0.1)
>>> model.fit(x_values, y_values)
>>> print(model.predict([ [0.2, 0.8], [0.5, 0.4] ]))
[0. 1.]

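Hyperparameters
When we define the model, we can specify hyperparameters. As seen in this section, the most common ones are:

C: the C parameter.
kernel: the kernel. The most common kernels are 'linear', 'poly', and 'rbf'.
degree: if the kernel is polynomial, the maximum degree of the monomials in the kernel.
gamma: if the kernel is a radial basis function, the γ parameter.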

5. Ensemble Methods

All we need is that each model does just slightly better than random chance. Then we make the models vote, and voilà, the combined model is what we obtain. That's the bagging algorithm.
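Why voting helps can be checked with a short calculation. This is a sketch under a strong simplifying assumption: the models make their mistakes independently, and each is right with the same probability p. In practice models trained on overlapping data are correlated, so the real gain is smaller. The function name is illustrative.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters is right.

    Each voter is right with probability p; n is assumed odd so there are no ties.
    """
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

single = majority_vote_accuracy(0.6, 1)     # one weak model
three = majority_vote_accuracy(0.6, 3)      # 3 * 0.6^2 * 0.4 + 0.6^3 = 0.648
many = majority_vote_accuracy(0.6, 101)     # a large committee
```

Under this assumption a model that is right only 60% of the time becomes a much stronger classifier once enough copies vote.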

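5.1 AdaBoost in sklearn

>>> from sklearn.ensemble import AdaBoostClassifier
>>> model = AdaBoostClassifier()
>>> model.fit(x_train, y_train)
>>> model.predict(x_test)

The model variable is an AdaBoost model (by default its weak learners are decision trees of depth 1) fitted to the data x_train and y_train. The fit and predict functions work the same way as before.

Hyperparameters
When we define the model, we can specify hyperparameters. In practice, the most common ones are:

estimator: the model used for the weak learners (don't forget to import it). Before scikit-learn 1.2 this parameter was called base_estimator; the old name was removed in 1.4.
n_estimators: the maximum number of weak learners to use.

For example, below we define a model that uses decision trees with max_depth 2 as weak learners and allows at most 4 weak learners.

>>> from sklearn.tree import DecisionTreeClassifier
>>> # On scikit-learn older than 1.2, pass base_estimator instead of estimator.
>>> model = AdaBoostClassifier(estimator = DecisionTreeClassifier(max_depth=2), n_estimators = 4)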