Machine Learning (Stanford University) Week 1 Notes


I. Introduction

1. What is machine learning

1) “The field of study that gives computers the ability to learn without being explicitly programmed.” This is Arthur Samuel’s older, informal definition.

2)Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

In other words: a computer program learns from experience E with respect to a task T, where its performance on T is measured by P; if its performance on T improves with experience E, the program is said to learn.

Example: playing checkers.

E = the experience of playing many games of checkers

T = the task of playing checkers.

P = the probability that the program will win the next game.

2. Machine learning algorithms

1) Supervised learning

Supervised learning problems are categorized into “regression” and “classification” problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.

Supervised learning problems fall into regression and classification: regression predicts a continuous output; classification predicts a discrete output.

2) Unsupervised learning

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don’t necessarily know the effect of the variables.

We can derive this structure by clustering the data based on relationships among the variables in the data.

With unsupervised learning there is no feedback based on the prediction results.

  • Unsupervised learning lets us derive structure from the relationships among variables in the data, even when we cannot specify in advance what the results should look like.

  • There is no feedback measure for evaluating the predictions of unsupervised learning.

Example:

Clustering: Take a collection of 1,000,000 different genes, and find a way to automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, and so on.

Non-clustering: The “Cocktail Party Algorithm” allows you to find structure in a chaotic environment (i.e. identifying individual voices and music from a mesh of sounds at a cocktail party).

3) Others: reinforcement learning, recommender systems

II. Model & Cost Function

1. Model Representation

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a “good” predictor for the corresponding value of y. For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is: the learning algorithm takes the training set and produces the hypothesis h; feeding an input x into h yields the predicted y.

h is the hypothesis function that fits the sample points; for linear regression it is a function of x parameterized by $\theta_0$ and $\theta_1$: $h_\theta(x) = \theta_0 + \theta_1 x$.

When the target variable that we’re trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

2. Cost Function

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2$

We use the squared error averaged over the m training examples; dividing by the number of examples removes the effect of sample size, and the extra 2 in 2m is there to cancel the factor of 2 that appears when differentiating the square, which simplifies the later gradient computation.
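To make this concrete, here is a minimal Octave sketch of the cost function (the function name computeCost and the toy data are my own, not from the course):

% computeCost.m -- squared-error cost for linear regression
% J(theta) = 1/(2m) * sum((X*theta - y).^2)
function J = computeCost(X, y, theta)
  m = length(y);               % number of training examples
  predictions = X * theta;     % h_theta(x) for all examples at once
  J = sum((predictions - y) .^ 2) / (2 * m);
end

% Usage, e.g. from the Octave prompt:
% X = [1 1; 1 2; 1 3];        % first column of ones multiplies theta0
% y = [2; 4; 6];
% computeCost(X, y, [0; 2])   % exact fit, so J = 0
% computeCost(X, y, [0; 0])   % (4 + 16 + 36) / (2*3) = 9.3333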

3. Gradient Descent

Note: J is a function of the parameters theta, while h is a function of x and is called the hypothesis function.

Once we have the hypothesis function and the cost function J, we can measure how well a hypothesis fits the data. What we still need is a way to estimate the parameters theta of the hypothesis function; that is the purpose of gradient descent.

1) The gradient descent algorithm

repeat until convergence:
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$

(Note: α does not multiply J itself; it multiplies the partial derivative of J with respect to each parameter.)
where

j=0,1 represents the feature index number.

At each iteration j, one should simultaneously update the parameters $\theta_1, \theta_2, \ldots, \theta_n$. Updating a specific parameter prior to calculating another one on the $j^{th}$



  • The two theta parameters must be updated simultaneously (important!).

  • (If the two values are not updated in sync, the parameter updated first gets fed into the computation of the second parameter and distorts its value; see the sketch below.)
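A minimal Octave sketch of the difference (the derivative helpers dJ_dtheta0 and dJ_dtheta1 are hypothetical placeholders for the two partial derivatives):

% Correct: compute both updates from the OLD values, then assign
temp0  = theta0 - alpha * dJ_dtheta0(theta0, theta1);  % hypothetical helper
temp1  = theta1 - alpha * dJ_dtheta1(theta0, theta1);  % hypothetical helper
theta0 = temp0;
theta1 = temp1;

% Wrong: theta0 is overwritten first, so the second derivative
% is evaluated at the new theta0 instead of the old one
theta0 = theta0 - alpha * dJ_dtheta0(theta0, theta1);
theta1 = theta1 - alpha * dJ_dtheta1(theta0, theta1);  % sees updated theta0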

2) Gradient descent graph

If we put $\theta_0$ on the x axis, $\theta_1$ on the y axis, and the cost function on the z axis, we can draw the surface in the figure below.

  • When the cost function is at the bottom of this surface, i.e. at its minimum value, we have succeeded.

  • The slope of the tangent is the derivative of the cost function at that point, and it gives us a direction to move toward; we then step down the cost function in the direction of steepest descent.

  • The size of each step is determined by the parameter α, which is called the learning rate.

(Figure: bowl-shaped surface of $J(\theta_0, \theta_1)$ with gradient descent paths stepping downhill toward a minimum.)

3) Gradient descent intuition

  • Whatever the sign of the derivative of the cost curve at the current point, $\theta_1$ always moves toward the minimum of the cost function: a positive slope pushes $\theta_1$ down, a negative slope pushes it up.


  • Even with a fixed learning rate, the steps become smaller and smaller as we descend, because the derivative of the curve shrinks as we approach the minimum.

The intuition behind the convergence is that $\frac{d}{d\theta_1} J(\theta_1)$ approaches 0 as we approach the bottom of our convex function (the derivative is exactly 0 at the local minimum). At the minimum the derivative is always 0, and thus we get:

$\theta_1 := \theta_1 - \alpha \cdot 0$
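The shrinking steps can be checked numerically. A toy Octave run (my own example, not from the course) on the convex function $J(\theta_1) = \theta_1^2$, whose derivative is $2\theta_1$, with a fixed learning rate:

theta1 = 4;      % starting point
alpha  = 0.3;    % fixed learning rate
for iter = 1:5
  step   = alpha * 2 * theta1;   % alpha * dJ/dtheta1 for J = theta1^2
  theta1 = theta1 - step;
  printf("iter %d: step = %.4f, theta1 = %.4f\n", iter, step, theta1);
end
% Each step (2.4, 0.96, 0.384, ...) is smaller than the last even
% though alpha is constant, because the derivative itself shrinks.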

4) Gradient descent for linear regression

When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function, $h_\theta(x^{(i)}) = \theta_0 + \theta_1 x^{(i)}$, and modify the equation to:
repeat until convergence: {
$\theta_0 := \theta_0 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)$
$\theta_1 := \theta_1 - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( (h_\theta(x^{(i)}) - y^{(i)}) \, x^{(i)} \right)$
}

These follow from writing out the derivatives of the cost function with respect to each linear regression parameter explicitly and substituting them into the general update rule.
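Putting the pieces together, here is a minimal vectorized Octave sketch of batch gradient descent for linear regression (the function name gradientDescent and the toy data are my own):

% gradientDescent.m -- batch gradient descent for linear regression
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);
  for iter = 1:num_iters
    errors = X * theta - y;        % h_theta(x_i) - y_i for every example
    grad   = (X' * errors) / m;    % both partial derivatives at once
    theta  = theta - alpha * grad; % simultaneous update of theta0 and theta1
  end
end

% Usage:
% X = [ones(5,1), (1:5)'];       % column of ones for the intercept theta0
% y = [3; 5; 7; 9; 11];          % generated from y = 1 + 2x
% gradientDescent(X, y, zeros(2,1), 0.05, 2000)   % approaches [1; 2]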

III. Matrix

1. Matrix Multiplication Properties

  • Matrix multiplication is not commutative: $A*B \neq B*A$.

  • Matrix multiplication is associative: $(A*B)*C = A*(B*C)$.

  • The identity matrix, when multiplied by any matrix of the same dimensions, results in the original matrix. It’s just like multiplying numbers by 1. The identity matrix simply has 1’s on the diagonal (upper left to lower right diagonal) and 0’s elsewhere.


  • When multiplying the identity matrix after some matrix (A∗I), the square identity matrix’s dimension should match the other matrix’s columns. When multiplying the identity matrix before some other matrix (I∗A), the square identity matrix’s dimension should match the other matrix’s rows.

    Mind the dimensions when multiplying by an identity matrix; ***I*** denotes the identity matrix.

% Initialize random matrices A and B 
A = [1,2;4,5]
B = [1,1;0,2]

% Initialize a 2 by 2 identity matrix
I = eye(2)  % the 2 x 2 identity matrix

% The above notation is the same as I = [1,0;0,1]

% What happens when we multiply I*A ? 
IA = I*A 

% How about A*I ? 
AI = A*I 

% Compute A*B 
AB = A*B 

% Is it equal to B*A? 
BA = B*A 

% Note that IA = AI but AB != BA

2. Matrix Inverse and Transpose

The inverse of a matrix A is denoted $A^{-1}$.

  • Multiplying a matrix by its inverse results in the identity matrix: $A A^{-1} = A^{-1} A = I$.

  • A non-square matrix (one whose row and column counts differ) does not have an inverse.

  • The transpose of A, written $A^T$, swaps rows and columns: $(A^T)_{ij} = A_{ji}$.
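A short Octave sketch of these facts (the matrix values are my own):

A = [1 2; 3 4];

A_inv = inv(A)   % the inverse; pinv(A) is the numerically safer alternative
A * A_inv        % yields the 2 x 2 identity matrix (up to rounding error)

A'               % the transpose: (A')(i,j) == A(j,i)

% inv([1 2 3; 4 5 6]) errors out, since a non-square matrix
% has no inverse (pinv would still return a pseudoinverse).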
