Scikit-learn and Tensorflow (Aurelien, 2017) Study Notes, Chapter 4

Training Models


1. Linear Regression




2. The Normal Equation


Functions: np.linalg.inv(), LinearRegression()
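A minimal sketch of computing θ with the Normal Equation and comparing it to LinearRegression(); the synthetic data (y = 4 + 3x + Gaussian noise) and variable names are illustrative, not copied from the book's exact listing:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative synthetic linear data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Normal Equation: theta_hat = (X_b^T X_b)^(-1) X_b^T y
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 (bias feature) to every instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# The same fit with scikit-learn
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(theta_best.ravel())
print(lin_reg.intercept_, lin_reg.coef_)
```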

3. Computational Complexity



On the positive side, the Normal Equation is linear with regard to the number of instances in the training set (it is O(m)), so it handles large training sets efficiently, provided they can fit in memory. Predictions are very fast: the computational complexity is linear with regard to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or with twice as many features) will take roughly twice as much time.

4. Gradient Descent

The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum.

These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and the learning rate is not too high).

(1) Batch Gradient Descent

Batch Gradient Descent uses the whole batch of training data at every step.


To find a good learning rate, you can use grid search. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance)—because this happens when Gradient Descent has (almost) reached the minimum.
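A minimal Batch Gradient Descent sketch on the same kind of synthetic data, including the tolerance-based interruption described above; the learning rate, iteration count, and tolerance are arbitrary choices for this sketch:

```python
import numpy as np

# Illustrative synthetic data, as in the Normal Equation sketch above
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]   # add the bias column x0 = 1

eta = 0.1            # learning rate
n_iterations = 1000
tolerance = 1e-6     # stop once the gradient norm becomes tiny

theta = np.random.randn(2, 1)     # random initialization
for iteration in range(n_iterations):
    # gradient of the MSE cost computed over the whole training batch
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < tolerance:
        break
    theta = theta - eta * gradients
print(theta.ravel())
```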


(2) Stochastic Gradient Descent

Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm). This algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. The final parameter values are good, but not optimal.


Because of this randomness, the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing (simulated annealing accepts a solution worse than the current one with a certain probability, so it can escape a local optimum and reach the global optimum).

Functions: SGDRegressor()
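A short SGDRegressor() sketch on illustrative data; the max_iter, penalty, and eta0 values are arbitrary choices for this sketch, not something prescribed by the notes:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

# Linear Regression fit by Stochastic Gradient Descent, no regularization
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)
```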


(3) Mini-batch Gradient Descent

Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
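A hand-rolled Mini-batch Gradient Descent sketch; the mini-batch size, learning rate, and epoch count are arbitrary choices, and the data is the same illustrative synthetic set used above:

```python
import numpy as np

# Illustrative synthetic data (not from the notes)
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
minibatch_size = 20   # arbitrary choice for this sketch
eta = 0.1

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)           # shuffle the data each epoch
    X_shuf, y_shuf = X_b[shuffled], y[shuffled]
    for i in range(0, m, minibatch_size):
        xi = X_shuf[i:i + minibatch_size]
        yi = y_shuf[i:i + minibatch_size]
        # gradient of the MSE cost computed on one mini-batch only
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients
print(theta.ravel())
```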





5. Polynomial Regression


A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended
set of features.

Functions: PolynomialFeatures(), fit_transform(), LinearRegression()

PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features.
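A minimal Polynomial Regression sketch on illustrative quadratic data, showing PolynomialFeatures(), fit_transform(), and LinearRegression() working together; the data-generating function 0.5x² + x + 2 is just an example:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative quadratic data (not from the notes)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)   # adds the x^2 column as a new feature

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                    # a linear model on the extended feature set
print(lin_reg.intercept_, lin_reg.coef_)
```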


6. Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression.
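A hedged sketch of how learning curves can be plotted: train the model on larger and larger subsets of the training set and record the training and validation RMSE. The helper name plot_learning_curves and the 80/20 split are my own choices here:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    """Fit the model on ever-larger training subsets and plot train/validation RMSE."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
        val_errors.append(mean_squared_error(y_val, y_val_pred))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="val")
    plt.xlabel("training set size")
    plt.ylabel("RMSE")
    plt.legend()
```

Calling plot_learning_curves(LinearRegression(), X, y) on the quadratic data above (followed by plt.show()) should show the typical underfitting pattern: both curves plateau at a fairly high error, close to each other.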


7. The Bias/Variance Tradeoff

Bias: this part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.

Variance: this part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error: this part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.



Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance. This is why it is called a tradeoff.

Too many variables (features) combined with very little training data will lead to overfitting.

Approach 1: reduce the number of selected variables and keep only the important features.

Approach 2: regularization.

Keep all the features, but shrink the magnitude of the parameter values θ(j).


8. Regularized Linear Models

Ridge Regression and Lasso Regression were introduced to solve two problems with Linear Regression: overfitting, and the case where XᵀX is not invertible when solving for θ via the Normal Equation.

(1) Ridge Regression (also called Tikhonov regularization; uses the L2 norm)

This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.

Note that the bias term θ0 is not regularized (the sum starts at i = 1, not 0).


Functions: Ridge(), SGDRegressor()
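A short sketch showing both ways to fit a Ridge model listed above: the closed-form Ridge() estimator and SGDRegressor with an L2 penalty; alpha=1 and the cholesky solver are illustrative choices, not values taken from the notes:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

ridge_reg = Ridge(alpha=1, solver="cholesky")   # closed-form Ridge solution
ridge_reg.fit(X, y)

sgd_reg = SGDRegressor(penalty="l2")            # L2 penalty == Ridge, fit with SGD
sgd_reg.fit(X, y)

print(ridge_reg.predict([[1.5]]), sgd_reg.predict([[1.5]]))
```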

(2) Lasso Regression (uses the L1 norm)

     

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).


Functions: Lasso()
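A minimal Lasso() sketch on illustrative data; alpha=0.1 is an arbitrary choice for this sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

lasso_reg = Lasso(alpha=0.1)    # L1-regularized Linear Regression
lasso_reg.fit(X, y)
print(lasso_reg.predict([[1.5]]))
```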


(3) Elastic Net

The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r.
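A minimal ElasticNet sketch; in scikit-learn the mix ratio r corresponds to the l1_ratio parameter, and alpha=0.1, l1_ratio=0.5 are arbitrary choices for this sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

# l1_ratio plays the role of the mix ratio r (1.0 = pure Lasso, 0.0 = pure Ridge)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.predict([[1.5]]))
```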

  

(4)Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.

With Stochastic and Mini-batch Gradient Descent the validation error curve is not smooth, so in practice you stop only after the validation error has stayed above its minimum for some time, then roll back to the model parameters that achieved that minimum.
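A hedged early-stopping sketch with SGDRegressor: warm_start=True makes each call to fit() continue from the previous weights, so one call corresponds to one extra epoch. The data, split, epoch budget, and hyperparameters below are illustrative, not taken from the notes:

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data and split (not from the notes)
X = 2 * np.random.rand(200, 1)
y = 4 + 3 * X.ravel() + np.random.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5)

# warm_start=True keeps the learned weights between calls to fit()
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)          # runs one more epoch
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:      # keep the model with the lowest validation error
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)
```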

9. Logistic Regression

(1) Estimating Probabilities

A Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.
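For reference, the logistic (sigmoid) function is σ(t) = 1 / (1 + exp(−t)). A tiny sketch with made-up parameter values (theta and x below are purely illustrative):

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: maps any score to a number between 0 and 1."""
    return 1 / (1 + np.exp(-t))

theta = np.array([-3.0, 2.0])       # hypothetical model parameters (bias, weight)
x = np.array([1.0, 1.7])            # bias input 1 plus one feature value
p_hat = sigmoid(theta.dot(x))       # estimated probability of the positive class
print(p_hat)
```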



(2) Training and Cost Function



This cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum.

(3) Decision Boundaries

Functions: load_iris(), LogisticRegression(), predict_proba()

Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature.
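A sketch of that classifier; it assumes the usual load_iris() layout, where column 3 is petal width and class 2 is Iris-Virginica:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, 3:]                     # petal width (cm) only
y = (iris["target"] == 2).astype(int)       # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Probability estimates and class predictions for two petal widths near the boundary
print(log_reg.predict_proba([[1.7], [1.5]]))
print(log_reg.predict([[1.7], [1.5]]))
```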



(4) Softmax Regression (Multinomial Logistic Regression)

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. When given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.




Notice that when there are just two classes (K = 2), this cost function is equivalent to the Logistic Regression's cost function.


Use Gradient Descent (or any other optimization algorithm) to find the parameter matrix Θ that minimizes the cost function.
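A hedged Softmax Regression sketch using LogisticRegression with the multinomial option on the iris data; C=10 and the lbfgs solver are illustrative choices, and on recent scikit-learn versions the multinomial behaviour is already the default for this solver, so the argument may be unnecessary:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, (2, 3)]   # petal length, petal width
y = iris["target"]            # all three classes

# multi_class="multinomial" switches LogisticRegression to Softmax Regression
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))         # predicted class
print(softmax_reg.predict_proba([[5, 2]]))   # per-class probabilities
```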








