Machine Learning Day 1 (English)

What is machine learning?

Field of study that gives computers the ability to learn without being explicitly programmed.

  • Supervised learning 监督学习
  • Unsupervised learning 无监督学习
  • Reinforcement learning 强化学习

Supervised learning

  • definition: learns from being given the “right answers”

Two main categories

Regression 回归
  • definition: try to predict a number from infinitely many possible outputs
Classification 分类
  • definition: try to predict a category from a small number of possible outputs

Unsupervised learning

  • definition: data only comes with inputs x, but not output labels y. The algorithm has to find structure in the data.

Three main categories

Clustering 聚类
  • definition: group similar data points together.
Anomaly detection 异常检测
  • definition: find unusual data points.
Dimensionality reduction 降维
  • definition: compress data using fewer numbers.

Cost function 成本函数

Definition

The equation for cost with one variable is:
$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$
where
$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w,b$.
  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
  • These differences are summed over all the $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.

Note: in the lectures, summation ranges are typically from 1 to m, while in code they run from 0 to m-1.
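
As a rough sketch (not the lab's official code), equation (1) could be computed with NumPy as follows, assuming `x` and `y` are 1-D arrays holding the m training inputs and targets:

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Compute J(w,b) of equation (1) for one feature; x and y have shape (m,)."""
    m = x.shape[0]
    f_wb = w * x + b                         # equation (2) for every example at once
    return np.sum((f_wb - y) ** 2) / (2 * m) # sum of squared errors divided by 2m
```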

Gradient descent 梯度下降

Summary

So far in this course, you have developed a linear model that predicts $f_{w,b}(x^{(i)})$:

$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$$

In linear regression, you utilize input training data to fit the parameters $w$, $b$ by minimizing a measure of the error between your predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the cost, $J(w,b)$. In training you measure the cost over all of your training samples $x^{(i)}, y^{(i)}$:

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$

More General

Have some function $J(w,b)$; we just want to minimize $J(w,b)$.

  • In lecture, gradient descent was described as:

$$\begin{align*} \text{repeat} & \text{ until convergence:} \; \lbrace \\ \; w &= w - \alpha \frac{\partial J(w,b)}{\partial w} \tag{3} \\ b &= b - \alpha \frac{\partial J(w,b)}{\partial b} \\ \rbrace \end{align*}$$

    where parameters $w$ and $b$ are updated simultaneously.

    The gradient is defined as:

$$\begin{align} \frac{\partial J(w,b)}{\partial w} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4} \\ \frac{\partial J(w,b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5} \end{align}$$

    Here, simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.


Implement Gradient Descent

You will implement the gradient descent algorithm for one feature. You will need three functions.

  • compute_gradient implementing equations (4) and (5) above
  • compute_cost implementing equation (2) above (code from previous lab)
  • gradient_descent, utilizing compute_gradient and compute_cost

Conventions:

  • The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t. means With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.
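
A minimal sketch of compute_gradient and gradient_descent under these conventions, reusing the compute_cost sketch from the cost-function section above; names such as w_init, alpha, num_iters, and J_history are my own choices, not required by the lab:

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Compute dj_dw and dj_db, the gradients of equations (4) and (5)."""
    m = x.shape[0]
    err = (w * x + b) - y          # prediction error for every example, shape (m,)
    dj_dw = np.sum(err * x) / m    # equation (4)
    dj_db = np.sum(err) / m        # equation (5)
    return dj_dw, dj_db

def gradient_descent(x, y, w_init, b_init, alpha, num_iters):
    """Run num_iters updates of w and b; return fitted parameters and cost history."""
    w, b = w_init, b_init
    J_history = []
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(x, y, w, b)  # gradients first ...
        w = w - alpha * dj_dw                        # ... then update simultaneously
        b = b - alpha * dj_db
        J_history.append(compute_cost(x, y, w, b))   # track cost to check convergence
    return w, b, J_history
```

A call such as `gradient_descent(x_train, y_train, 0.0, 0.0, 1.0e-2, 1000)` would then return the fitted w, b and the recorded cost history (here x_train and y_train are assumed to be 1-D NumPy arrays).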

Multiple Features 多类特征


Vector representation 向量表示

Matrix X containing our examples

Examples are stored in a NumPy matrix X_train. Each row of the matrix represents one example. When you have $m$ training examples ($m$ is three in our example) and there are $n$ features (four in our example), $\mathbf{X}$ is a matrix with dimensions ($m$, $n$) (m rows, n columns).

$$\mathbf{X} = \begin{pmatrix} x^{(0)}_0 & x^{(0)}_1 & \cdots & x^{(0)}_{n-1} \\ x^{(1)}_0 & x^{(1)}_1 & \cdots & x^{(1)}_{n-1} \\ \cdots \\ x^{(m-1)}_0 & x^{(m-1)}_1 & \cdots & x^{(m-1)}_{n-1} \end{pmatrix}$$
notation:

  • $\mathbf{x}^{(i)}$ is a vector containing example i: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
  • $x^{(i)}_j$ is element j in example i. The superscript in parentheses indicates the example number while the subscript represents an element.
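
For illustration only (the numbers below are hypothetical housing-style data, not taken from the course), such a matrix could be defined as:

```python
import numpy as np

# hypothetical data: m = 3 examples (rows), n = 4 features (columns)
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [ 852, 2, 1, 35]])
y_train = np.array([460, 232, 178])   # one target value per example

print(X_train.shape)                  # (3, 4) -> (m, n)
```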

Parameter vector w, b

  • $\mathbf{w}$ is a vector with $n$ elements.
    • Each element contains the parameter associated with one feature.
    • in our dataset, n is 4.
    • notionally, we draw this as a column vector

$$\mathbf{w} = \begin{pmatrix} w_0 \\ w_1 \\ \cdots \\ w_{n-1} \end{pmatrix}$$

  • $b$ is a scalar parameter.

Model Prediction With Multiple Variables

The model’s prediction with multiple variables is given by the linear model:

$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + ... + w_{n-1}x_{n-1} + b \tag{1}$$
or in vector notation:
$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b \tag{2}$$
where $\cdot$ is a vector dot product.
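
A small sketch of equation (2) in NumPy, assuming the X_train matrix defined earlier; the parameter values here are placeholders for illustration, not learned values:

```python
import numpy as np

# placeholder parameters, purely for illustration
w = np.array([0.1, 4.0, -10.0, -2.0])   # one weight per feature, shape (n,)
b = 80.0                                 # scalar bias

x_vec = X_train[0]                       # first training example, shape (n,)
f_wb = np.dot(w, x_vec) + b              # equation (2): w . x + b
print(f_wb)
```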

Compute Cost With Multiple Variables

The equation for the cost function with multiple variables $J(\mathbf{w},b)$ is:
$$J(\mathbf{w},b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})^2 \tag{3}$$
where:
$$f_{\mathbf{w},b}(\mathbf{x}^{(i)}) = \mathbf{w} \cdot \mathbf{x}^{(i)} + b \tag{4}$$
In contrast to previous labs, $\mathbf{w}$ and $\mathbf{x}^{(i)}$ are vectors rather than scalars, supporting multiple features.
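
A minimal vectorized sketch of equations (3) and (4), assuming X has shape (m, n), y has shape (m,), and w has shape (n,):

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Cost J(w,b) of equation (3), with w a vector of n weights."""
    m = X.shape[0]
    f_wb = X @ w + b                          # equation (4) for all m examples at once
    return np.sum((f_wb - y) ** 2) / (2 * m)  # squared errors summed and divided by 2m
```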

Gradient Descent With Multiple Variables

Gradient descent for multiple variables:

$$\begin{align*} \text{repeat} & \text{ until convergence:} \; \lbrace \\ \; & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \; \text{for } j = 0 \ldots n-1 \tag{5} \\ & b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ \rbrace \end{align*}$$

where $n$ is the number of features, the parameters $w_j$, $b$ are updated simultaneously, and where

$$\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \tag{6} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \tag{7} \end{align}$$

  • m is the number of training examples in the data set

  • $f_{\mathbf{w},b}(\mathbf{x}^{(i)})$ is the model’s prediction, while $y^{(i)}$ is the target value
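
Likewise, a sketch of the gradients in equations (6) and (7), computed in vectorized form (an explicit per-example loop would give the same result):

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of J(w,b): dj_dw has shape (n,), dj_db is a scalar."""
    m = X.shape[0]
    err = X @ w + b - y            # prediction errors, shape (m,)
    dj_dw = (X.T @ err) / m        # equation (6)
    dj_db = np.sum(err) / m        # equation (7)
    return dj_dw, dj_db
```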

Learning Rate 学习率

The learning rate $\alpha$ controls the size of the update to the parameters.

Checking Gradient descent for convergence

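The lecture figure is not reproduced here. As a rough sketch, one common convergence check is to plot the cost history recorded during training (J_history from the gradient_descent sketch above is assumed) and confirm that the cost decreases and levels off:

```python
import matplotlib.pyplot as plt

# J_history: list of costs recorded by the gradient_descent sketch above
plt.plot(J_history)
plt.xlabel("iteration")
plt.ylabel("cost J(w,b)")
plt.title("Cost versus iteration of gradient descent")
plt.show()
```

If the cost ever increases from one iteration to the next, the learning rate $\alpha$ is likely too large.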

Feature Scaling 特征缩放

outline

The lectures discussed three different techniques:

- Feature scaling, essentially dividing each feature by a user-selected value to result in a range between -1 and 1.

- Mean normalization: $x_i := \dfrac{x_i - \mu_i}{max - min} $

- Z-score normalization which we will explore below.

z-score normalization

After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.

To implement z-score normalization, adjust your input values as shown in this formula:
$$x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j} \tag{4}$$
where $j$ selects a feature or a column in the $\mathbf{X}$ matrix. $\mu_j$ is the mean of all the values for feature (j) and $\sigma_j$ is the standard deviation of feature (j).
$$\begin{align} \mu_j &= \frac{1}{m} \sum_{i=0}^{m-1} x^{(i)}_j \tag{5} \\ \sigma^2_j &= \frac{1}{m} \sum_{i=0}^{m-1} (x^{(i)}_j - \mu_j)^2 \tag{6} \end{align}$$

Implementation Note:

When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations. After learning the parameters from the model, we often want to predict the prices of houses we have not seen before. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
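
A minimal sketch of z-score normalization following equations (4)–(6); the function name zscore_normalize_features is my own choice, and it returns mu and sigma so they can be stored and reused on new examples as described above:

```python
import numpy as np

def zscore_normalize_features(X):
    """Normalize each column of X to zero mean and unit standard deviation."""
    mu = np.mean(X, axis=0)        # per-feature mean, shape (n,)
    sigma = np.std(X, axis=0)      # per-feature standard deviation, shape (n,)
    X_norm = (X - mu) / sigma      # equation (4), applied column-wise
    return X_norm, mu, sigma

# a new example must be scaled with the training-set mu and sigma:
# x_norm = (x_new - mu) / sigma
```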

