Machine Learning Day 1 (English)

What is machine learning?

Field of study that gives computers the ability to learn without being explicitly programmed.

  • Supervised learning 监督学习
  • Unsupervised learning 无监督学习
  • Reinforcement learning 强化学习

Supervised learning

  • definition: learns from being given the “right answers”

Two main categories

Regression 回归
  • definition: try to predict a number from infinitely many possible outputs
Classification 分类
  • definition: try to predict a category from a small number of possible outputs

Unsupervised learning

  • definition: data only comes with inputs x, but not output labels y. The algorithm has to find structure in the data.

Three main categories

Clustering 聚类
  • definition: group similar data points together.
Anomaly detection 异常检测
  • definition: find unusual data points.
Dimensionality reduction 降维
  • definition: compress data using fewer numbers.

Cost function 成本函数

Definition

The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$
where
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w,b$.
  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
  • These differences are summed over all the $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.

Note: in the lectures, summation ranges are typically from 1 to m, while in code they run from 0 to m-1.
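
As a rough sketch (not the lab's official code), equation (1) could be computed with NumPy as follows, assuming `x` and `y` are 1-D arrays holding the m training inputs and targets:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Compute J(w,b) of equation (1) for one feature; x and y have shape (m,)."""
    m = x.shape[0]
    f_wb = w * x + b                         # equation (2) for every example at once
    return np.sum((f_wb - y) ** 2) / (2 * m) # sum of squared errors divided by 2m
```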

Gradient descent 梯度下降

Summary

So far in this course, you have developed a linear model that predicts $f_{w,b}(x^{(i)})$:

$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$$

In linear regression, you utilize input training data to fit the parameters $w$, $b$ by minimizing a measure of the error between your predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the cost, $J(w,b)$. In training you measure the cost over all of your training samples $x^{(i)}, y^{(i)}$:

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$

More General

Have some function $J(w,b)$; we just want to minimize $J(w,b)$.

  • In lecture, gradient descent was described as:

$$\begin{align*} \text{repeat} & \text{ until convergence:} \; \lbrace \\ \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \\ b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \\ \rbrace \end{align*}$$

    where parameters $w$ and $b$ are updated simultaneously.

    The gradient is defined as:

$$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4} \\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5} \end{align}$$

    Here, simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.


Implement Gradient Descent

You will implement the gradient descent algorithm for one feature. You will need three functions.

  • compute_gradient implementing equations (4) and (5) above
  • compute_cost implementing equation (2) above (code from previous lab)
  • gradient_descent, utilizing compute_gradient and compute_cost

Conventions:

  • The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t. means With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.
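
A minimal sketch of compute_gradient and gradient_descent under these conventions, reusing the compute_cost sketch from the cost-function section above; names such as w_init, alpha, num_iters, and J_history are my own choices, not required by the lab:

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Compute dj_dw and dj_db, the gradients of equations (4) and (5)."""
    m = x.shape[0]
    err = (w * x + b) - y          # prediction error for every example, shape (m,)
    dj_dw = np.sum(err * x) / m    # equation (4)
    dj_db = np.sum(err) / m        # equation (5)
    return dj_dw, dj_db

def gradient_descent(x, y, w_init, b_init, alpha, num_iters):
    """Run num_iters updates of w and b; return fitted parameters and cost history."""
    w, b = w_init, b_init
    J_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)  # gradients first ...
        w = w - alpha * dj_dw                        # ... then update simultaneously
        b = b - alpha * dj_db
        J_history.append(compute_cost(x, y, w, b))   # track cost to check convergence
    return w, b, J_history
```

A call such as `gradient_descent(x_train, y_train, 0.0, 0.0, 1.0e-2, 1000)` would then return the fitted w, b and the recorded cost history (here x_train and y_train are assumed to be 1-D NumPy arrays).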

Multiple Features 多类特征


Vector representation 向量表示

Matrix X containing our examples

Examples are stored in a NumPy matrix X_train. Each row of the matrix represents one example. When you have $m$ training examples ($m$ is three in our example) and there are $n$ features (four in our example), $\mathbf{X}$ is a matrix with dimensions ($m$, $n$) (m rows, n columns).

$$\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}$$
notation:

  • $\mathbf{x}^{(i)}$ is a vector containing example i: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
  • $x^{(i)}_j$ is element j in example i. The superscript in parentheses indicates the example number while the subscript represents an element.
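
For illustration only (the numbers below are hypothetical housing-style data, not taken from the course), such a matrix could be defined as:

```python
import numpy as np

# hypothetical data: m = 3 examples (rows), n = 4 features (columns)
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [ 852, 2, 1, 35]])
y_train = np.array([460, 232, 178])   # one target value per example

print(X_train.shape)                  # (3, 4) -> (m, n)
```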

Parameter vector w, b

  • $\mathbf{w}$ is a vector with $n$ elements.
    • Each element contains the parameter associated with one feature.
    • in our dataset, n is 4.
    • notionally, we draw this as a column vector

$$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots \\ w_{n-1} \end{pmatrix}$$

  • $b$ is a scalar parameter.

Model Prediction With Multiple Variables

The model’s prediction with multiple variables is given by the linear model:

$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + ... + w_{n-1}x_{n-1} + b \tag{1}$$
or in vector notation:
$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}$$
where $\cdot$ is a vector dot product.
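
A small sketch of equation (2) in NumPy, assuming the X_train matrix defined earlier; the parameter values here are placeholders for illustration, not learned values:

```python
import numpy as np

# placeholder parameters, purely for illustration
w = np.array([0.1, 4.0, -10.0, -2.0])   # one weight per feature, shape (n,)
b = 80.0                                 # scalar bias

x_vec = X_train[0]                       # first training example, shape (n,)
f_wb = np.dot(w, x_vec) + b              # equation (2): w . x + b
print(f_wb)
```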

Compute Cost With Multiple Variables

The equation for the cost function with multiple variables $J(\mathbf{w},b)$ is:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$
where:
$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4}$$
In contrast to previous labs, $\mathbf{w}$ and $\mathbf{x}^{(i)}$ are vectors rather than scalars, supporting multiple features.
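
A minimal vectorized sketch of equations (3) and (4), assuming X has shape (m, n), y has shape (m,), and w has shape (n,):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Cost J(w,b) of equation (3), with w a vector of n weights."""
    m = X.shape[0]
    f_wb = X @ w + b                          # equation (4) for all m examples at once
    return np.sum((f_wb - y) ** 2) / (2 * m)  # squared errors summed and divided by 2m
```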

Gradient Descent With Multiple Variables

Gradient descent for multiple variables:

$$\begin{align*} \text{repeat} & \text{ until convergence:} \; \lbrace \\ \; & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \; \text{for } j = 0 \ldots n-1 \tag{5} \\ & b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ \rbrace \end{align*}$$

where $n$ is the number of features, the parameters $w_j$, $b$ are updated simultaneously, and where

$$\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align}$$

  • m is the number of training examples in the data set

  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target value
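
Likewise, a sketch of the gradients in equations (6) and (7), computed in vectorized form (an explicit per-example loop would give the same result):

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of J(w,b): dj_dw has shape (n,), dj_db is a scalar."""
    m = X.shape[0]
    err = X @ w + b - y            # prediction errors, shape (m,)
    dj_dw = (X.T @ err) / m        # equation (6)
    dj_db = np.sum(err) / m        # equation (7)
    return dj_dw, dj_db
```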

Learning Rate 学习率

The learning rate $\alpha$ controls the size of the update to the parameters.

Checking Gradient descent for convergence

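The lecture figure is not reproduced here. As a rough sketch, one common convergence check is to plot the cost history recorded during training (J_history from the gradient_descent sketch above is assumed) and confirm that the cost decreases and levels off:

```python
import matplotlib.pyplot as plt

# J_history: list of costs recorded by the gradient_descent sketch above
plt.plot(J_history)
plt.xlabel("iteration")
plt.ylabel("cost J(w,b)")
plt.title("Cost versus iteration of gradient descent")
plt.show()
```

If the cost ever increases from one iteration to the next, the learning rate $\alpha$ is likely too large.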

Feature Scaling 特征缩放

outline

The lectures discussed three different techniques:

- Feature scaling, essentially dividing each feature by a user-selected value to result in a range between -1 and 1.

- Mean normalization: $x_i := \dfrac{x_i - \mu_i}{max - min} $

- Z-score normalization which we will explore below.

z-score normalization

After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.

To implement z-score normalization, adjust your input values as shown in this formula:
$$x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j} \tag{4}$$
where $j$ selects a feature or a column in the $\mathbf{X}$ matrix. $\mu_j$ is the mean of all the values for feature (j) and $\sigma_j$ is the standard deviation of feature (j).
$$\begin{align} \mu_j &= \frac{1}{m} \sum_{i=0}^{m-1} x^{(i)}_j \tag{5} \\ \sigma^2_j &= \frac{1}{m} \sum_{i=0}^{m-1} (x^{(i)}_j - \mu_j)^2 \tag{6} \end{align}$$

Implementation Note:

When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
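
A minimal sketch of z-score normalization following equations (4)–(6); the function name zscore_normalize_features is my own choice, and it returns mu and sigma so they can be stored and reused on new examples as described above:

```python
import numpy as np

def zscore_normalize_features(X):
    """Normalize each column of X to zero mean and unit standard deviation."""
    mu = np.mean(X, axis=0)        # per-feature mean, shape (n,)
    sigma = np.std(X, axis=0)      # per-feature standard deviation, shape (n,)
    X_norm = (X - mu) / sigma      # equation (4), applied column-wise
    return X_norm, mu, sigma

# a new example must be scaled with the training-set mu and sigma:
# x_norm = (x_new - mu) / sigma
```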

