Regularization - Regularized logistic regression

Abstract: This article is the original video transcript of Lecture 58, "Regularized logistic regression," from Chapter 8, "Regularization," of Andrew Ng's Machine Learning course. I recorded it while studying the videos and revised it to make it more concise and easier to read, for later reference, and I'm sharing it here. If there are any errors, corrections are welcome and sincerely appreciated. I hope it is also helpful for your own studies.

For logistic regression, we previously talked about two types of optimization algorithms. We talked about how to use gradient descent to optimize the cost function J(\theta ). And we also talked about advanced optimization methods: ones that require that you provide a way to compute the cost function J(\theta ) and a way to compute the derivatives. In this video, we’ll show how you can adapt both of those techniques, both gradient descent and the more advanced optimization methods, in order to have them work for regularized logistic regression. So, here is the idea.

We saw earlier that logistic regression can also be prone to overfitting. If you fit it with very high-order polynomial features like this, where g is the sigmoid function, you may end up with a hypothesis whose decision boundary is an overly complex, contorted function that really isn’t such a great hypothesis for this training set. And more generally, if you have logistic regression with a lot of features, not necessarily polynomial ones, then just with a lot of features you can end up with overfitting. This was our cost function for logistic regression. If we want to modify it to use regularization, all we need to do is add to it the following term: \frac{\lambda }{2m}\sum_{j=1}^{n}\theta _{j}^{2}. This has the effect of penalizing the parameters \theta _{1}, \theta _{2}, \theta _{3} up to \theta _{n} for being too large. And if you do this, then even though you’re fitting a very high-order polynomial with a lot of parameters, so long as you apply regularization and keep the parameters small, you’re more likely to get a decision boundary that looks more like this, one that looks more reasonable for separating out the positive and negative examples. So when using regularization, even when you have a lot of features, the regularization can help take care of the overfitting problem. How do we actually implement this?
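For reference, and consistent with the lecture’s earlier definitions (h_{\theta } being the sigmoid hypothesis), the full regularized cost function being described here is:

J(\theta )=-\frac{1}{m}\sum_{i=1}^{m}\left [ y^{(i)}\log h_{\theta }(x^{(i)})+(1-y^{(i)})\log (1-h_{\theta }(x^{(i)})) \right ]+\frac{\lambda }{2m}\sum_{j=1}^{n}\theta _{j}^{2}

Note that the regularization sum starts at j=1, so \theta _{0} is not penalized.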

Well, for the original gradient descent algorithm, this was the update we had: we repeatedly perform the following update to \theta _{j}. This slide looks a lot like the previous one for linear regression, but what I’m going to do is write the update for \theta _{0} separately. So the first line is the update for \theta _{0}, and the second line is the update for \theta _{1} up to \theta _{n}, because I’m going to treat \theta _{0} separately. In order to modify this algorithm to use a regularized cost function, all I need to do, pretty similar to what we did for linear regression, is modify this second update rule as follows. Once again, this cosmetically looks identical to what we had for linear regression, but of course it is not the same algorithm, because now the hypothesis is defined using the sigmoid function. So this is not the same algorithm as regularized linear regression, because the hypothesis is different, even though the update I wrote down looks cosmetically the same as what we had earlier; what we’re working out is gradient descent for regularized logistic regression. And just to wrap up this discussion: the term in square brackets is of course the new partial derivative with respect to \theta _{j} of the new cost function J(\theta ), where J(\theta ) is the cost function we defined on the previous slide that uses regularization. So, that’s gradient descent for regularized logistic regression.
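Written out explicitly in the same notation (this is the standard form the slide refers to), the update rules are:

Repeat until convergence {
\theta _{0}:=\theta _{0}-\alpha \frac{1}{m}\sum_{i=1}^{m}(h_{\theta }(x^{(i)})-y^{(i)})x_{0}^{(i)}
\theta _{j}:=\theta _{j}-\alpha \left [ \frac{1}{m}\sum_{i=1}^{m}(h_{\theta }(x^{(i)})-y^{(i)})x_{j}^{(i)}+\frac{\lambda }{m}\theta _{j} \right ] \qquad (j=1,2,\ldots ,n)
}

where h_{\theta }(x)=\frac{1}{1+e^{-\theta ^{T}x}}, which is exactly what makes this different from regularized linear regression despite the identical-looking update.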

Let’s talk about how to get regularized logistic regression to work using the more advanced optimization methods. Just to remind you, for those methods what we needed to do was define a function called costFunction that takes as input the parameter vector \theta. Once again, in the equations we’ve been writing here we used 0-indexed vectors, so we had \theta _{0} up to \theta _{n}. But because Octave indexes vectors starting from 1, \theta _{0} is written in Octave as theta(1), \theta _{1} is written in Octave as theta(2), and so on, up to theta(n+1). What we needed to do was provide this costFunction and pass it to fminunc, as in fminunc(@costFunction, …). fminunc stands for function minimization unconstrained, and it is what takes the costFunction and minimizes it for us. The two main things that costFunction needs to return are, first, jVal, and for that we need to write code to compute the cost function J(\theta ). Now that we’re using regularized logistic regression, the cost function J(\theta ) changes; in particular, it needs to include the additional regularization term at the end as well, so when you compute J(\theta ), be sure to include that term. The other thing costFunction needs to return is the gradient: gradient(1) needs to be set to the partial derivative of J(\theta ) with respect to \theta _{0}, gradient(2) needs to be set to the partial derivative with respect to \theta _{1}, and so on. Once again, the index is off by one because of the 1-based indexing that Octave uses. Looking at these terms: this term \frac{\partial }{\partial \theta _{0}}J(\theta ), which we actually worked out on the previous slide, doesn’t change, because the derivative with respect to \theta _{0} is the same as in the version without regularization. The other terms do change. In particular, the derivative with respect to \theta _{1}, which we also worked out on the previous slide, is equal to the original term plus \frac{\lambda }{m}\theta _{1}. Just to make sure this parses correctly, we add parentheses here so the summation doesn’t extend over the new term. And similarly, the other terms look like this, with the additional term from the previous slide that corresponds to the gradient of the regularization objective. So if you implement this costFunction and pass it into fminunc or one of those other advanced optimization techniques, it will minimize the new regularized cost function J(\theta ), and the parameters you get out will be the ones that correspond to logistic regression with regularization. So now you know how to implement regularized logistic regression.
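As a concrete illustration, here is a minimal Octave sketch of such a costFunction, assuming a design matrix X whose first column is all ones, a label vector y, and a regularization parameter lambda (these variable names are mine, not from the lecture); it returns jVal and gradient in the form fminunc expects:

```octave
function [jVal, gradient] = costFunction(theta, X, y, lambda)
  % theta(1) corresponds to theta_0 in the lecture's notation,
  % theta(2) to theta_1, and so on, because Octave indexes from 1.
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));              % sigmoid hypothesis h_theta(x)

  % Regularized cost: the penalty sum skips theta_0, i.e. theta(1).
  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h)) ...
         + (lambda / (2*m)) * sum(theta(2:end) .^ 2);

  % Unregularized gradient for every component...
  gradient = (1/m) * (X' * (h - y));
  % ...then add (lambda/m)*theta_j for j >= 1, leaving theta_0's entry unchanged.
  gradient(2:end) = gradient(2:end) + (lambda / m) * theta(2:end);
end
```

Since this sketch takes extra arguments beyond \theta, one way to hand it to fminunc is through an anonymous function, e.g. fminunc(@(t) costFunction(t, X, y, lambda), initialTheta, optimset('GradObj', 'on', 'MaxIter', 400)).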

When I walk around Silicon Valley (I live here in Silicon Valley), there are a lot of engineers who are frankly making a ton of money for their companies using machine learning algorithms. And I know we’ve only been studying this stuff for a little while, but if you understand linear regression, logistic regression, the advanced optimization algorithms, and regularization, then frankly you probably know quite a lot more machine learning right now than many of those Silicon Valley engineers, even though they’re having very successful careers, making tons of money for their companies and building great products with machine learning algorithms. So congratulations: you’ve actually come a long way, and you already know enough to apply this stuff and get it to work on many problems. But of course, there’s still a lot more that we want to teach you, and in the next set of videos we’ll start to talk about a very powerful class of non-linear classifiers. Whereas with linear regression and logistic regression you can form polynomial terms, it turns out there are much more powerful non-linear classifiers that go well beyond polynomial regression. In the next set of videos after this one, I’ll start telling you about them, so that you have even more powerful learning algorithms to apply to different problems.

<end>

