Scikit-learn and Tensorflow (Aurelien, 2017) Study Notes, Chapter 4

Training Models


1. Linear Regression




2. The Normal Equation


Functions: np.linalg.inv(), LinearRegression()
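A minimal sketch of computing θ with the Normal Equation and comparing it to LinearRegression(); the synthetic data (y = 4 + 3x + Gaussian noise) and variable names are illustrative, not copied from the book's exact listing:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative synthetic linear data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X + np.random.randn(100, 1)

# Normal Equation: theta_hat = (X_b^T X_b)^(-1) X_b^T y
X_b = np.c_[np.ones((100, 1)), X]  # add x0 = 1 (bias feature) to every instance
theta_best = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y)

# The same fit with scikit-learn
lin_reg = LinearRegression()
lin_reg.fit(X, y)
print(theta_best.ravel())
print(lin_reg.intercept_, lin_reg.coef_)
```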

3. Computational Complexity



On the positive side, the Normal Equation is linear with regard to the number of instances in the training set (it is O(m)), so it handles large training sets efficiently, provided they can fit in memory. Predictions are very fast: the computational complexity is linear with regard to both the number of instances you want to make predictions on and the number of features. In other words, making predictions on twice as many instances (or with twice as many features) will take roughly twice as much time.

4. Gradient Descent

The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

The MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum.

These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and the learning rate is not too high).

(1) Batch Gradient Descent

Batch Gradient Descent uses the whole batch of training data at every step.


To find a good learning rate, you can use grid search. A simple solution is to set a very large number of iterations but to interrupt the algorithm when the gradient vector becomes tiny—that is, when its norm becomes smaller than a tiny number ϵ (called the tolerance)—because this happens when Gradient Descent has (almost) reached the minimum.
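A minimal Batch Gradient Descent sketch on the same kind of synthetic data, including the tolerance-based interruption described above; the learning rate, iteration count, and tolerance are arbitrary choices for this sketch:

```python
import numpy as np

# Illustrative synthetic data, as in the Normal Equation sketch above
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]   # add the bias column x0 = 1

eta = 0.1            # learning rate
n_iterations = 1000
tolerance = 1e-6     # stop once the gradient norm becomes tiny

theta = np.random.randn(2, 1)     # random initialization
for iteration in range(n_iterations):
    # gradient of the MSE cost computed over the whole training batch
    gradients = 2 / m * X_b.T.dot(X_b.dot(theta) - y)
    if np.linalg.norm(gradients) < tolerance:
        break
    theta = theta - eta * gradients
print(theta.ravel())
```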


(2) Stochastic Gradient Descent

Stochastic Gradient Descent picks a random instance in the training set at every step and computes the gradients based only on that single instance. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm). This algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. The final parameter values are good, but not optimal.


Because of this randomness, the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing (simulated annealing accepts a solution worse than the current one with a certain probability, so it can escape a local optimum and reach the global optimum).

Functions: SGDRegressor()
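A short SGDRegressor() sketch on illustrative data; the max_iter, penalty, and eta0 values are arbitrary choices for this sketch, not something prescribed by the notes:

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

# Linear Regression fit by Stochastic Gradient Descent, no regularization
sgd_reg = SGDRegressor(max_iter=50, penalty=None, eta0=0.1)
sgd_reg.fit(X, y)
print(sgd_reg.intercept_, sgd_reg.coef_)
```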


(3) Mini-batch Gradient Descent

Mini-batch GD computes the gradients on small random sets of instances called mini-batches. The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.
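A hand-rolled Mini-batch Gradient Descent sketch; the mini-batch size, learning rate, and epoch count are arbitrary choices, and the data is the same illustrative synthetic set used above:

```python
import numpy as np

# Illustrative synthetic data (not from the notes)
np.random.seed(42)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]

n_epochs = 50
minibatch_size = 20   # arbitrary choice for this sketch
eta = 0.1

theta = np.random.randn(2, 1)
for epoch in range(n_epochs):
    shuffled = np.random.permutation(m)           # shuffle the data each epoch
    X_shuf, y_shuf = X_b[shuffled], y[shuffled]
    for i in range(0, m, minibatch_size):
        xi = X_shuf[i:i + minibatch_size]
        yi = y_shuf[i:i + minibatch_size]
        # gradient of the MSE cost computed on one mini-batch only
        gradients = 2 / len(xi) * xi.T.dot(xi.dot(theta) - yi)
        theta = theta - eta * gradients
print(theta.ravel())
```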





5. Polynomial Regression


A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended
set of features.

Functions: PolynomialFeatures(), fit_transform(), LinearRegression()

PolynomialFeatures(degree=d) transforms an array containing n features into an array containing (n + d)! / (d! n!) features.
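A minimal Polynomial Regression sketch on illustrative quadratic data, showing PolynomialFeatures(), fit_transform(), and LinearRegression() working together; the data-generating function 0.5x² + x + 2 is just an example:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Illustrative quadratic data (not from the notes)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)   # adds the x^2 column as a new feature

lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)                    # a linear model on the extended feature set
print(lin_reg.intercept_, lin_reg.coef_)
```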


6. Learning Curves

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression.
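A hedged sketch of how learning curves can be plotted: train the model on larger and larger subsets of the training set and record the training and validation RMSE. The helper name plot_learning_curves and the 80/20 split are my own choices here:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

def plot_learning_curves(model, X, y):
    """Fit the model on ever-larger training subsets and plot train/validation RMSE."""
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2)
    train_errors, val_errors = [], []
    for m in range(1, len(X_train)):
        model.fit(X_train[:m], y_train[:m])
        y_train_pred = model.predict(X_train[:m])
        y_val_pred = model.predict(X_val)
        train_errors.append(mean_squared_error(y_train[:m], y_train_pred))
        val_errors.append(mean_squared_error(y_val, y_val_pred))
    plt.plot(np.sqrt(train_errors), "r-+", label="train")
    plt.plot(np.sqrt(val_errors), "b-", label="val")
    plt.xlabel("training set size")
    plt.ylabel("RMSE")
    plt.legend()
```

Calling plot_learning_curves(LinearRegression(), X, y) on the quadratic data above (followed by plt.show()) should show the typical underfitting pattern: both curves plateau at a fairly high error, close to each other.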


7. The Bias/Variance Tradeoff

Bias: this part of the generalization error is due to wrong assumptions, such as assuming that the data is linear when it is actually quadratic. A high-bias model is most likely to underfit the training data.

Variance: this part is due to the model's excessive sensitivity to small variations in the training data. A model with many degrees of freedom (such as a high-degree polynomial model) is likely to have high variance, and thus to overfit the training data.

Irreducible error: this part is due to the noisiness of the data itself. The only way to reduce this part of the error is to clean up the data.



Increasing a model's complexity will typically increase its variance and reduce its bias. Conversely, reducing a model's complexity increases its bias and reduces its variance. This is why it is called a tradeoff.

Too many variables (features) combined with very little training data will lead to overfitting.

Approach 1: reduce the number of selected variables and keep only the important features.

Approach 2: regularization.

Keep all the features, but shrink the magnitude of the parameter values θ(j).


8. Regularized Linear Models

Ridge Regression and Lasso Regression were introduced to solve two problems with Linear Regression: overfitting, and the case where XᵀX is not invertible when solving for θ via the Normal Equation.

(1) Ridge Regression (also called Tikhonov regularization; uses the L2 norm)

This forces the learning algorithm to not only fit the data but also keep the model weights as small as possible.

Note that the bias term θ0 is not regularized (the sum starts at i = 1, not 0).


Functions: Ridge(), SGDRegressor()
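A short sketch showing both ways to fit a Ridge model listed above: the closed-form Ridge() estimator and SGDRegressor with an L2 penalty; alpha=1 and the cholesky solver are illustrative choices, not values taken from the notes:

```python
import numpy as np
from sklearn.linear_model import Ridge, SGDRegressor

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

ridge_reg = Ridge(alpha=1, solver="cholesky")   # closed-form Ridge solution
ridge_reg.fit(X, y)

sgd_reg = SGDRegressor(penalty="l2")            # L2 penalty == Ridge, fit with SGD
sgd_reg.fit(X, y)

print(ridge_reg.predict([[1.5]]), sgd_reg.predict([[1.5]]))
```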

(2) Lasso Regression (uses the L1 norm)

     

An important characteristic of Lasso Regression is that it tends to completely eliminate the weights of the least important features (i.e., set them to zero). Lasso Regression automatically performs feature selection and outputs a sparse model (i.e., with few nonzero feature weights).


Functions: Lasso()
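A minimal Lasso() sketch on illustrative data; alpha=0.1 is an arbitrary choice for this sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

lasso_reg = Lasso(alpha=0.1)    # L1-regularized Linear Regression
lasso_reg.fit(X, y)
print(lasso_reg.predict([[1.5]]))
```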


(3) Elastic Net

The regularization term is a simple mix of both Ridge and Lasso's regularization terms, and you can control the mix ratio r.
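A minimal ElasticNet sketch; in scikit-learn the mix ratio r corresponds to the l1_ratio parameter, and alpha=0.1, l1_ratio=0.5 are arbitrary choices for this sketch:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Illustrative data (not from the notes)
X = 2 * np.random.rand(100, 1)
y = 4 + 3 * X.ravel() + np.random.randn(100)

# l1_ratio plays the role of the mix ratio r (1.0 = pure Lasso, 0.0 = pure Ridge)
elastic_net = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic_net.fit(X, y)
print(elastic_net.predict([[1.5]]))
```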

  

(4)Early Stopping

A very different way to regularize iterative learning algorithms such as Gradient Descent is to stop training as soon as the validation error reaches a minimum. This is called early stopping.

With Stochastic and Mini-batch Gradient Descent the validation error curve is not smooth, so in practice you stop only after the validation error has stayed above its minimum for some time, then roll back to the model parameters that achieved that minimum.
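A hedged early-stopping sketch with SGDRegressor: warm_start=True makes each call to fit() continue from the previous weights, so one call corresponds to one extra epoch. The data, split, epoch budget, and hyperparameters below are illustrative, not taken from the notes:

```python
import numpy as np
from copy import deepcopy
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data and split (not from the notes)
X = 2 * np.random.rand(200, 1)
y = 4 + 3 * X.ravel() + np.random.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.5)

# warm_start=True keeps the learned weights between calls to fit()
sgd_reg = SGDRegressor(max_iter=1, tol=None, warm_start=True,
                       penalty=None, learning_rate="constant", eta0=0.0005)

minimum_val_error = float("inf")
best_model = None
for epoch in range(1000):
    sgd_reg.fit(X_train, y_train)          # runs one more epoch
    val_error = mean_squared_error(y_val, sgd_reg.predict(X_val))
    if val_error < minimum_val_error:      # keep the model with the lowest validation error
        minimum_val_error = val_error
        best_model = deepcopy(sgd_reg)
```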

9. Logistic Regression

(1) Estimating Probabilities

A Logistic Regression model computes a weighted sum of the input features (plus a bias term), but instead of outputting the result directly like the Linear Regression model does, it outputs the logistic of this result.
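For reference, the logistic (sigmoid) function is σ(t) = 1 / (1 + exp(−t)). A tiny sketch with made-up parameter values (theta and x below are purely illustrative):

```python
import numpy as np

def sigmoid(t):
    """Logistic (sigmoid) function: maps any score to a number between 0 and 1."""
    return 1 / (1 + np.exp(-t))

theta = np.array([-3.0, 2.0])       # hypothetical model parameters (bias, weight)
x = np.array([1.0, 1.7])            # bias input 1 plus one feature value
p_hat = sigmoid(theta.dot(x))       # estimated probability of the positive class
print(p_hat)
```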



(2) Training and Cost Function



This cost function is convex, so Gradient Descent (or any other optimization algorithm) is guaranteed to find the global minimum.

(3) Decision Boundaries

Functions: load_iris(), LogisticRegression(), predict_proba()

Let’s try to build a classifier to detect the Iris-Virginica type based only on the petal width feature.
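A sketch of that classifier; it assumes the usual load_iris() layout, where column 3 is petal width and class 2 is Iris-Virginica:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, 3:]                     # petal width (cm) only
y = (iris["target"] == 2).astype(int)       # 1 if Iris-Virginica, else 0

log_reg = LogisticRegression()
log_reg.fit(X, y)

# Probability estimates and class predictions for two petal widths near the boundary
print(log_reg.predict_proba([[1.7], [1.5]]))
print(log_reg.predict([[1.7], [1.5]]))
```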



(4) Softmax Regression (Multinomial Logistic Regression)

The Logistic Regression model can be generalized to support multiple classes directly, without having to train and combine multiple binary classifiers. When given an instance x, the Softmax Regression model first computes a score s_k(x) for each class k, then estimates the probability of each class by applying the softmax function (also called the normalized exponential) to the scores.




Notice that when there are just two classes (K = 2), this cost function is equivalent to the Logistic Regression's cost function.


Use Gradient Descent (or any other optimization algorithm) to find the parameter matrix Θ that minimizes the cost function.
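A hedged Softmax Regression sketch using LogisticRegression with the multinomial option on the iris data; C=10 and the lbfgs solver are illustrative choices, and on recent scikit-learn versions the multinomial behaviour is already the default for this solver, so the argument may be unnecessary:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X = iris["data"][:, (2, 3)]   # petal length, petal width
y = iris["target"]            # all three classes

# multi_class="multinomial" switches LogisticRegression to Softmax Regression
softmax_reg = LogisticRegression(multi_class="multinomial", solver="lbfgs", C=10)
softmax_reg.fit(X, y)

print(softmax_reg.predict([[5, 2]]))         # predicted class
print(softmax_reg.predict_proba([[5, 2]]))   # per-class probabilities
```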








