[Andrew Ng Machine Learning] Week 1 Condensed Course Notes: Fundamentals and Univariate Linear Regression

Introduction

1. What is Machine Learning?

Tom Mitchell provides a more modern definition: “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.”


Experience E: the data that the computer can process and learn from
Task T: what the machine learning system is expected to do with each example
Performance measure P: a quantitative measure of how well the task is solved; the basis for model evaluation and model selection

2. What is Supervised Learning?

Supervised learning problems are categorized into "regression" and "classification" problems.

  • Regression: a continuous output
    input variables —map—> some continuous function
  • Classification: a discrete output
    input variables —map—> discrete categories

3. What is Unsupervised Learning?

We can derive structure from data where we don't necessarily know the effect of the variables. We can derive this structure by clustering the data based on relationships among the variables in the data. With unsupervised learning there is no feedback based on the prediction results.

  • Clustering
  • Non-clustering


Model and Cost Function

1. Model Representation

(1) Input: x^{(i)}; output: y^{(i)}; training example: (x^{(i)}, y^{(i)}), i = 1, 2, …, m
Tip: the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation.

(2) Learn a function h (the hypothesis): x -> y, i.e. h(x) -> y
Different target variables: continuous -> regression; discrete -> classification
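As a minimal sketch (illustrative Python, not course code), the univariate hypothesis above can be written as a one-line function:

```python
# Minimal sketch: the univariate hypothesis h_theta(x) = theta0 + theta1 * x
# maps an input x to a predicted output y.
def hypothesis(x, theta0, theta1):
    return theta0 + theta1 * x

# With theta0 = 1 and theta1 = 2, the hypothesis predicts y = 2x + 1:
print(hypothesis(3.0, theta0=1.0, theta1=2.0))  # 7.0
```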



2. Cost Function

(1) Definition

We can measure the accuracy of our hypothesis function by using a cost function.

Hypothesis: h_\theta(x) = \theta_0 + \theta_1 x
Parameters: \theta_0, \theta_1
Cost Function: J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
This function is otherwise called the "squared error function" or "mean squared error."

Goal: \min_{\theta_0, \theta_1} J(\theta_0, \theta_1)
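A minimal sketch in Python/NumPy (the helper name and toy data are assumptions, not course code) that evaluates this squared-error cost for a univariate hypothesis:

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) for h(x) = theta0 + theta1 * x."""
    m = len(y)                           # number of training examples
    errors = (theta0 + theta1 * x) - y   # h_theta(x^(i)) - y^(i)
    return np.sum(errors ** 2) / (2 * m)

# Hypothetical toy data lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(compute_cost(x, y, 1.0, 2.0))  # 0.0 at the true parameters
print(compute_cost(x, y, 0.0, 1.0))  # larger cost for a worse fit
```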

(2) θ0 = 0

If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by h_\theta(x)) which passes through these scattered data points.

Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distance of the scattered points from the line is the least.
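A minimal sketch (toy data, illustrative only) that fixes θ0 = 0 and sweeps θ1, tracing out the bowl-shaped curve of J(θ1) described above:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])   # toy data on the line y = 2x through the origin

for theta1 in (0.0, 1.0, 2.0, 3.0):
    J = np.sum((theta1 * x - y) ** 2) / (2 * len(y))
    print(f"theta1 = {theta1}: J = {J:.3f}")   # J is smallest at theta1 = 2
```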

(3) θ0 ≠ 0

A contour plot is a graph that contains many contour lines. A contour line of a two-variable function has a constant value at all points on the same line.

Taking any color and going along the ‘circle’, one would expect to get the same value of the cost function.
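A minimal sketch (assumed toy data) that evaluates J over a coarse grid of (θ0, θ1); parameter pairs with equal cost would lie on the same contour line of such a plot:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])   # roughly y = 2x + 1

for theta0 in (0.0, 1.0, 2.0):
    for theta1 in (1.0, 2.0, 3.0):
        J = np.sum((theta0 + theta1 * x - y) ** 2) / (2 * len(y))
        print(f"theta0 = {theta0}, theta1 = {theta1}: J = {J:.2f}")
```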

Parameter Learning

(1) Gradient Descent

Now we need to estimate the parameters in the hypothesis function. That’s where gradient descent comes in.

Imagine that we graph our hypothesis function based on its fields \theta_0 and \theta_1. We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.
We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph.

The way we do this is by taking the derivative (the tangential line to a function) of our cost function. We make steps down the cost function in the direction with the steepest descent. The size of each step is determined by the parameter α, which is called the learning rate.

The distance between each 'star' in the graph above represents a step determined by our parameter α. A smaller α would result in a smaller step and a larger α results in a larger step. Depending on where one starts on the graph, one could end up at different points. The image above shows us two different starting points that end up in two different places.

The gradient descent algorithm

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)

where j = 0, 1 represents the feature index number

At each iteration j, one should simultaneously update the parameters \theta_1, \theta_2, ..., \theta_n.
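A minimal sketch of one update step; the derivative helpers dJ_dtheta0 and dJ_dtheta1 are hypothetical callables, and the point is that both parameters are updated simultaneously:

```python
def gradient_descent_step(theta0, theta1, dJ_dtheta0, dJ_dtheta1, alpha):
    # Evaluate both partial derivatives at the *current* (theta0, theta1) first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...then assign, so neither update sees the other's new value.
    return temp0, temp1
```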



(2) Gradient Descent Intuition

We used one parameter \theta_1 and plotted its cost function to implement a gradient descent. Our formula for a single parameter was:

Repeat until convergence:
\theta_1 := \theta_1 - \alpha \frac{d}{d\theta_1} J(\theta_1)

The intuition behind the convergence is that \frac{d}{d\theta_1} J(\theta_1) approaches 0 as we approach the bottom of our convex function. At the minimum, the derivative will always be 0 and thus we get:

\theta_1 := \theta_1 - \alpha * 0
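A minimal sketch (a toy convex function, not from the course) showing that the derivative, and therefore the step size, shrinks as θ1 approaches the minimum even though α stays fixed:

```python
# Gradient descent on J(theta1) = (theta1 - 3)^2, whose derivative is 2 * (theta1 - 3).
theta1 = 0.0
alpha = 0.1
for _ in range(25):
    grad = 2.0 * (theta1 - 3.0)       # approaches 0 near the minimum at theta1 = 3
    theta1 = theta1 - alpha * grad    # steps shrink automatically as grad shrinks
print(theta1)                         # close to 3.0; a much larger alpha could overshoot and diverge
```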

(3) Gradient Descent for Linear Regression (Batch)

We apply the gradient descent algorithm to minimize our squared error cost function.

h_\theta(x) = \theta_0 + \theta_1 x
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2

When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived.

Repeat until convergence:
\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)
\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x^{(i)}

Note that we have separated out the two cases for \theta_j into separate equations for \theta_0 and \theta_1. For a single training example, the derivative of the cost works out to:
\begin{aligned}
\frac{\partial}{\partial\theta_j} J(\theta) &= \frac{\partial}{\partial\theta_j} \frac{1}{2} (h_\theta(x) - y)^2 \\
&= 2 \cdot \frac{1}{2} (h_\theta(x) - y) \cdot \frac{\partial}{\partial\theta_j} (h_\theta(x) - y) \\
&= (h_\theta(x) - y) \cdot \frac{\partial}{\partial\theta_j} \left( \sum_{i=0}^{n} \theta_i x_i - y \right) \\
&= (h_\theta(x) - y) \, x_j
\end{aligned}

This method looks at every example in the entire training set on every step, and is called batch gradient descent. Note that, while gradient descent can be susceptible to local minima in general, the optimization problem we have posed here for linear regression has only one global, and no other local, optima; thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum.
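A minimal sketch of batch gradient descent for univariate linear regression in Python/NumPy (toy data and function names are assumptions); every iteration sums over the whole training set, matching the update rules above:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=1000):
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y          # uses all m examples on every step
        grad0 = np.sum(errors) / m                  # partial derivative w.r.t. theta0
        grad1 = np.sum(errors * x) / m              # partial derivative w.r.t. theta1
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1  # simultaneous update
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.1, 4.9, 7.2, 8.8])                  # roughly y = 2x + 1 with noise
print(batch_gradient_descent(x, y))                 # approximately (1.15, 1.94)
```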

Linear Algebra

(1) Matrices and Vectors

Dimension of matrix: the number of rows times the number of columns.

A_{ij}: the element in the ith row and jth column of matrix A.

\vec{x}: a vector with n rows is referred to as an n-dimensional vector.

R refers to the set of scalar real numbers.

R^n refers to the set of n-dimensional vectors of real numbers.
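A minimal sketch in NumPy (an assumption for illustration; the course itself uses Octave/MATLAB) of matrix dimensions, element access, and an n-dimensional vector:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])            # a 2 x 3 matrix
print(A.shape)                        # (2, 3): rows x columns
print(A[0, 1])                        # A_12 in 1-based math notation -> 2 (NumPy indexing is 0-based)

x = np.array([1.0, 2.0, 3.0, 4.0])    # a 4-dimensional vector, an element of R^4
print(x.shape)                        # (4,)
```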


(2) Addition and Scalar Multiplication

To add or subtract two matrices, their dimensions must be the same.

Addition and subtraction are element-wise, so you simply add or subtract each corresponding element.

\begin{bmatrix} a & b \end{bmatrix} + \begin{bmatrix} c & d \end{bmatrix} = \begin{bmatrix} a+c & b+d \end{bmatrix}

In scalar multiplication, we simply multiply every element by the scalar value:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x = \begin{bmatrix} a*x & b*x \\ c*x & d*x \end{bmatrix}

In scalar division, we simply divide every element by the scalar value:
\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a/x & b/x \\ c/x & d/x \end{bmatrix}
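The same operations sketched in NumPy (illustrative, not course code); +, -, and scalar * or / all act element-wise:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.5, 0.5],
              [1.0, 1.0]])

print(A + B)      # element-wise addition (dimensions must match)
print(A - B)      # element-wise subtraction
print(A * 2.0)    # scalar multiplication: every element times 2
print(A / 2.0)    # scalar division: every element divided by 2
```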



(3) Matrix-Vector Multiplication

An m × n matrix multiplied by an n × 1 vector results in an m × 1 vector.

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a*x + b*y \\ c*x + d*y \\ e*x + f*y \end{bmatrix}
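A quick NumPy check (illustrative) that a 3 × 2 matrix times a 2 × 1 vector gives a 3 × 1 result:

```python
import numpy as np

M = np.array([[1, 2],
              [3, 4],
              [5, 6]])          # 3 x 2
v = np.array([10, 1])           # a length-2 vector
print(M @ v)                    # [1*10 + 2*1, 3*10 + 4*1, 5*10 + 6*1] = [12 34 56]
print((M @ v).shape)            # (3,)
```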


(4) Matrix-Matrix Multiplication

An m × n matrix multiplied by an n × p matrix results in an m × p matrix.

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix} * \begin{bmatrix} 2 & 5 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 1*2 + 2*3 & 1*5 + 2*1 \\ 3*2 + 4*3 & 3*5 + 4*1 \end{bmatrix} = \begin{bmatrix} 8 & 7 \\ 18 & 19 \end{bmatrix}
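The same worked example checked in NumPy (illustrative):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[2, 5],
              [3, 1]])
print(A @ B)      # [[ 8  7]
                  #  [18 19]] -- matches the hand computation above
```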


(5) Matrix Multiplication Properties

Matrices are not commutative:
A * B ≠ B * A
Matrices are associative:
(A * B) * C = A * (B * C)

The identity matrix is denoted I_n.

The identity matrix, when multiplied by any matrix of the same dimensions, results in the original matrix.
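A quick NumPy illustration (assumed, not course code) of non-commutativity, associativity, and the identity matrix:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])

print(np.array_equal(A @ B, B @ A))              # False: multiplication is not commutative in general
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: multiplication is associative
I2 = np.eye(2, dtype=int)                        # the 2 x 2 identity matrix I_2
print(np.array_equal(A @ I2, A))                 # True: multiplying by I leaves A unchanged
```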



(6) Inverse and Transpose

The inverse of a matrix A is denoted A^{-1}. Multiplying A by its inverse results in the identity matrix.

A non-square matrix does not have an inverse matrix. Matrices that don't have an inverse are singular or degenerate.

The transposition of a matrix is like rotating the matrix 90° in the clockwise direction and then reversing it. It is denoted A^T.
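A NumPy illustration (assumed): the inverse of an invertible square matrix and the transpose:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
A_inv = np.linalg.inv(A)
print(A @ A_inv)        # approximately the 2 x 2 identity matrix (up to floating-point error)
print(A.T)              # transpose: rows become columns

S = np.array([[1.0, 2.0],
              [2.0, 4.0]])   # singular (second row is twice the first), so it has no inverse
# np.linalg.inv(S) would raise numpy.linalg.LinAlgError here
```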


