Univariate Linear Regression (单变量线性回归)

Let us start with a motivating example of predicting housing prices. We will use a data set of houses of different sizes that were sold for a range of different prices, and plot that data set.

Given that data set, suppose a house's size is 1250 square feet. One thing you could do is fit a model, perhaps a straight line, to this data, and read off a predicted price for the house of around 220,000.

This is an example of a supervised learning problem, because we are given the "right answer" for each example in the data; it is also a regression problem, since we predict a continuous-valued output.

So what exactly is defined by the training set?

Here is how this supervised learning algorithm works: we take the training set (our training set of house prices) and feed it to the learning algorithm. The job of the learning algorithm is to output a function, which by convention is usually denoted by a lowercase h, where h stands for hypothesis. The job of the hypothesis is to take as input the size of a house and output the predicted price for that input.

Hypothesis: h_\theta(x) = \theta_0 + \theta_1 x, parameters: \theta_0, \theta_1.
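To make the hypothesis concrete, here is a minimal sketch in Python; the function name h and the parameter values in the example are made up for illustration, not taken from the course data:

```python
def h(x, theta0, theta1):
    """Univariate linear hypothesis: h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

# Made-up parameter values: predict the price (in thousands) of a 1250 sq ft house.
print(h(1250, 50.0, 0.1))  # 50 + 0.1 * 1250 = 175.0
```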

Now the question is how to choose \theta_0 and \theta_1.

The idea is to choose \theta_0, \theta_1 so that h_\theta(x) is close to y for our training examples (x, y).

Given the x's in the training set, we want to make reasonably accurate predictions for the y values. Let's formalize this: in linear regression, what we are going to do is solve a minimization problem, minimizing over \theta_0 and \theta_1.

J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}

\min_{\theta_0,\, \theta_1} J(\theta_0, \theta_1)

This function J is called the cost function (also known as the squared error function).
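Here is a minimal sketch of the squared error cost in Python; the helper name compute_cost and the tiny toy training set are assumptions for illustration, not the housing data from the lecture:

```python
def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = (1 / 2m) * sum over i of (h_theta(x_i) - y_i)^2."""
    m = len(x)
    total = 0.0
    for xi, yi in zip(x, y):
        prediction = theta0 + theta1 * xi   # h_theta(x_i)
        total += (prediction - yi) ** 2     # squared error for one training example
    return total / (2 * m)

# Toy training set (made up): y happens to equal x, so theta0 = 0, theta1 = 1 fits perfectly.
x_train = [1.0, 2.0, 3.0]
y_train = [1.0, 2.0, 3.0]
print(compute_cost(x_train, y_train, 0.0, 1.0))  # 0.0   (perfect fit)
print(compute_cost(x_train, y_train, 0.0, 0.5))  # ~0.583 (worse fit)
```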

Example: to build intuition, let's work with a simplified hypothesis function h_\theta(x) = \theta_1 x (that is, \theta_0 = 0), so the cost function depends on a single parameter \theta_1.

What does the fit look like when \theta_1 = 0.5? And when \theta_1 = 0?

After computing J(\theta_1) for a range of values of \theta_1, the conclusion is that \theta_1 = 1 gives the lowest cost; that is indeed the best possible straight line for this data.

So for each value of \theta_1 we end up with a different value of J(\theta_1), and we can use this to trace out the plot of J against \theta_1 shown in the picture above. The optimization objective for our learning algorithm is to choose the value of \theta_1 that minimizes J(\theta_1); this is our objective function for linear regression.
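The "trace out the plot" idea can be sketched by sweeping \theta_1 over a range and recording J(\theta_1), reusing compute_cost and the toy data from the sketch above (again just an illustration, not the original plot):

```python
# Sweep theta1 with theta0 fixed at 0 and record the cost at each value.
thetas = [i / 10 for i in range(-5, 26)]   # theta1 from -0.5 to 2.5 in steps of 0.1
costs = [compute_cost(x_train, y_train, 0.0, t) for t in thetas]

# The value of theta1 with the smallest cost is the best straight line for this toy data.
best_theta1 = thetas[costs.index(min(costs))]
print(best_theta1)  # 1.0
```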


Gradient Descent (梯度下降算法)

  • Purpose: we have some function J(\theta_0, \theta_1)
  • Want: \min_{\theta_0, \theta_1} J(\theta_0, \theta_1)

Outline 

  • Start with some \theta_0 and \theta_1.
  • Keep changing \theta_0 and \theta_1 to reduce J(\theta_0, \theta_1),

until we hopefully end up at a minimum.

So here is the problem setup, and we want to come up with an algorithm for minimizing J as a function of \theta_0 and \theta_1.

Here is the idea behind gradient descent: we start off with some initial guesses for \theta_0 and \theta_1 (a common choice is to set both \theta_0 and \theta_1 to zero), and we keep changing \theta_0 and \theta_1 a little bit to try to reduce J, until we wind up at a minimum.

The definition of the gradient descent algorithm:

repeat until convergence {
    \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)    (for j = 0 and j = 1)
}

Correct: simultaneous update of \theta_0 and \theta_1 (compute both new values before assigning either one).

":="为C语言中的赋值符号.

The symbol \alpha (alpha) is called the learning rate. If \alpha is very large, that corresponds to a very aggressive gradient descent procedure that takes big steps downhill; if \alpha is small, the steps are small and descent is slower.
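Putting the update rule and the learning rate together, here is a minimal sketch of batch gradient descent for univariate linear regression; the function name, starting values, and the reuse of the toy x_train/y_train data from earlier are assumptions for illustration:

```python
def gradient_descent(x, y, alpha=0.1, num_iters=2000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x.

    For the squared error cost J, the partial derivatives are:
      dJ/dtheta0 = (1/m) * sum(h_theta(x_i) - y_i)
      dJ/dtheta1 = (1/m) * sum((h_theta(x_i) - y_i) * x_i)
    """
    m = len(x)
    theta0, theta1 = 0.0, 0.0                      # common starting point: both zero
    for _ in range(num_iters):
        errors = [(theta0 + theta1 * xi) - yi for xi, yi in zip(x, y)]
        grad0 = sum(errors) / m
        grad1 = sum(e * xi for e, xi in zip(errors, x)) / m
        # Simultaneous update: compute both new values before assigning either.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

theta0, theta1 = gradient_descent(x_train, y_train, alpha=0.1)
print(theta0, theta1)  # approaches (0.0, 1.0) for the toy y = x data
```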


In order to convey these intuitions, I want to use a slightly simpler example where we minimize a function of just one parameter: \theta_1 is a real number, so we can use 1-D plots.

In the top picture the curve has a positive slope at the current point, so the derivative is positive and the update decreases \theta_1; in the bottom picture the slope is negative, so the derivative is negative and the update increases \theta_1. In either case \theta_1 gradually moves closer to the minimum.


Now let us see what happens as the learning rate \alpha varies: if \alpha is too small, gradient descent takes tiny steps and is slow to converge; if \alpha is too large, it can overshoot the minimum and may fail to converge, or even diverge.


What if your parameter \theta_1 is already at a local minimum?

What do you think one step of gradient descent will do?

The result is that \theta_1 stops changing, because the slope (the derivative) at that point is equal to zero.

Gradient descent can converge to a local minimum even with the learning rate \alpha fixed.

As we approach a local minimum, gradient descent will automatically take smaller steps (because the derivative shrinks), so there is no need to decrease \alpha over time.

That's all for this note. Thank you for reading.
