Cnblogs link: Coursera Study Notes | Machine Learning by Stanford University - Andrew Ng
Chapter 1 - Introduction
1.1 Definition
- Arthur Samuel
The field of study that gives computers the ability to learn without being explicitly programmed.
- Tom Mitchell
A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E.
1.2 Concepts
1.2.1 Classification of Machine Learning
- Supervised Learning: given a labeled data set, where we already know what a correct output/result should look like
  - Regression: continuous output
  - Classification: discrete output
- Unsupervised Learning: given an unlabeled data set, or a data set in which every example has the same label; we have to group the data ourselves
  - Clustering: group the data into different clusters
  - Non-Clustering
- Others: Reinforcement Learning, Recommender Systems…
1.2.2 Model Representation
- Training Set

$$
\begin{matrix}
x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n&&y^{(1)}\\
x^{(2)}_1&x^{(2)}_2&\cdots&x^{(2)}_n&&y^{(2)}\\
\vdots&\vdots&\ddots&\vdots&&\vdots\\
x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n&&y^{(m)}
\end{matrix}
$$
- Notation
$m=$ the number of training examples (the number of rows)
$n=$ the number of features (the number of columns)
$x=$ input variable/feature
$y=$ output variable/target variable
$(x^{(i)}_j,y^{(i)})$: $x^{(i)}_j$ is the value of the $j$-th feature in the $i$-th training example and $y^{(i)}$ is that example's target, where $i=1,\dots,m$ and $j=1,\dots,n$
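To make the notation concrete, here is a minimal sketch that stores a training set as a list of rows. The numbers and the helper name `feature` are made up for illustration and are not from the course.

```python
# Hypothetical toy training set: one row per training example,
# one column per feature (values are invented for illustration).
X = [
    [2104.0, 3.0],   # x^(1)
    [1600.0, 3.0],   # x^(2)
    [2400.0, 4.0],   # x^(3)
]
y = [400.0, 330.0, 369.0]  # y^(1), y^(2), y^(3)

m = len(X)      # number of training examples (rows)
n = len(X[0])   # number of features (columns)


def feature(i, j):
    """Return x_j^(i) using the notes' 1-based indices."""
    return X[i - 1][j - 1]


print(m, n, feature(2, 1))
```

The only subtlety is the index shift: the notes count examples and features from 1, while Python lists count from 0.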
1.2.3 Cost Function
1.2.4 Gradient Descent
Chapter 2 - Linear Regression
$$
\begin{matrix}
x_0&x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n&&y^{(1)}\\
x_0&x^{(2)}_1&x^{(2)}_2&\cdots&x^{(2)}_n&&y^{(2)}\\
\vdots&\vdots&\vdots&\ddots&\vdots&&\vdots\\
x_0&x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n&&y^{(m)}\\
\\
\theta_0&\theta_1&\theta_2&\cdots&\theta_n&&
\end{matrix}
$$
2.1 Linear Regression with One Variable
- Hypothesis Function

$$
h_{\theta}(x)=\theta_0+\theta_1x
$$

- Cost Function: Squared Error Cost Function

$$
J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2
$$

- Goal

$$
\min_{(\theta_0,\theta_1)}J(\theta_0,\theta_1)
$$
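The squared error cost above can be evaluated directly. The following is a minimal sketch in plain Python; the function names and the three data points are assumptions chosen for illustration.

```python
def hypothesis(theta0, theta1, x):
    """h_theta(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x


def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum of squared residuals."""
    m = len(xs)
    total = sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)


# Invented data lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))  # perfect fit, so the cost is 0.0
print(cost(0.0, 0.5, xs, ys))  # a worse slope gives a larger cost
```

Minimizing J then means searching for the pair (theta0, theta1) that makes this number as small as possible.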
2.2 Multivariate Linear Regression
- Hypothesis Function

$$
\theta=\left[\begin{matrix}\theta_0\\ \theta_1\\ \vdots\\ \theta_n\end{matrix}\right],\ 
x=\left[\begin{matrix}x_0\\ x_1\\ \vdots\\ x_n\end{matrix}\right]
$$

where $x_0=1$, so that

$$
\begin{aligned}
h_\theta(x)&=\theta_0+\theta_1x_1+\theta_2x_2+\cdots+\theta_nx_n\\
&=\theta^Tx
\end{aligned}
$$

- Cost Function

$$
J(\theta^T)=\frac{1}{2m}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)^2
$$

- Goal

$$
\min_{\theta^T}J(\theta^T)
$$
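With several features, the hypothesis is a dot product between the parameter vector and a feature vector whose first entry is the constant 1. The sketch below illustrates this in plain Python; the names `h` and `cost` and the toy data are assumptions for illustration.

```python
def h(theta, x):
    """h_theta(x) = theta^T x, where x[0] is the constant feature x_0 = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def cost(theta, X, y):
    """J = 1/(2m) * sum over examples of (h_theta(x^(i)) - y^(i))^2."""
    m = len(X)
    return sum((h(theta, row) - target) ** 2 for row, target in zip(X, y)) / (2 * m)


# Invented data: each row is [x_0, x_1, x_2] with x_0 = 1.
X = [
    [1.0, 1.0, 2.0],
    [1.0, 2.0, 1.0],
    [1.0, 3.0, 3.0],
]
theta_true = [1.0, 2.0, 3.0]
y = [h(theta_true, row) for row in X]  # targets generated from theta_true

print(cost(theta_true, X, y))       # parameters that generated y
print(cost([0.0, 0.0, 0.0], X, y))  # all-zero parameters
```

In practice a numerical library would usually replace the explicit loops, but the arithmetic is the same.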
2.3 Algorithm Optimization
2.3.1 Gradient Descent
- Algorithm
Repeat until convergence (simultaneously update $\theta_j$ for every $j=0,\dots,n$, with $x^{(i)}_0=1$):

$$
\begin{aligned}
\theta_j &:=\theta_j-\alpha{\partial\over\partial\theta_j}J(\theta^T)\\
&:=\theta_j-\alpha{1\over{m}}\sum_{i=1}^m\left(h_{\theta}(x^{(i)})-y^{(i)}\right)x^{(i)}_j
\end{aligned}
$$

- Feature Scaling
For each feature $x_j$, replace

$$
x_j:={{x_j-\mu_j}\over{s_j}}
$$

where $\mu_j$ is the mean of feature $x_j$ over the $m$ training examples, and $s_j$ is either the range of $x_j$ (maximum minus minimum) or its standard deviation.
- Learning Rate
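The update rule and the scaling step above can be combined into a short sketch. It makes several assumptions that are not in the notes: a fixed number of iterations stands in for a convergence test, the population standard deviation is used as $s_j$, no feature is constant (otherwise $s_j$ would be zero), and the data, learning rate and function names are invented for illustration.

```python
import statistics


def scale_features(rows):
    """Return rows with each column mapped to (x - mean) / std, plus the means and stds."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) for col in columns]  # assumes no constant column
    scaled = [[(v - mu) / s for v, mu, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def predict(theta, x):
    """h_theta(x) = theta^T x, where x[0] = 1."""
    return sum(t * xi for t, xi in zip(theta, x))


def gradient_descent(X, y, alpha=0.1, iterations=1000):
    """Batch gradient descent; each row of X must start with x_0 = 1."""
    m = len(X)
    theta = [0.0] * len(X[0])
    for _ in range(iterations):
        errors = [predict(theta, row) - target for row, target in zip(X, y)]
        gradient = [
            sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))
        ]
        # Build the new list from the old theta so all theta_j update simultaneously.
        theta = [t - alpha * g for t, g in zip(theta, gradient)]
    return theta


# Invented data with features on very different scales.
raw = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]
y = [6.0, 10.0, 11.0, 15.0]

scaled, means, stds = scale_features(raw)
X_design = [[1.0] + row for row in scaled]  # prepend x_0 = 1
theta = gradient_descent(X_design, y)
print(theta)
```

The learned parameters apply to the scaled features, so a new input has to be scaled with the same `means` and `stds` before calling `predict`.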
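2.3.2 Normal Equation(s)
Let

$$
X=\left[\begin{matrix}
x_0&x^{(1)}_1&x^{(1)}_2&\cdots&x^{(1)}_n\\
x_0&x^{(2)}_1&x^{(2)}_2&\cdots&x^{(2)}_n\\
\vdots&\vdots&\vdots&\ddots&\vdots\\
x_0&x^{(m)}_1&x^{(m)}_2&\cdots&x^{(m)}_n
\end{matrix}\right],\ 
y=\left[\begin{matrix}y^{(1)}\\ y^{(2)}\\ \vdots\\ y^{(m)}\end{matrix}\right]
$$

where $X$ is an $m\times(n+1)$ matrix (with $x_0=1$) and $y$ is an $m$-dimensional column vector. Then

$$
\theta=(X^TX)^{-1}X^Ty
$$

If $X^TX$ is noninvertible, the likely causes are:
- Redundant features: two features are linearly dependent, so one of them should be deleted;
- Too many features, e.g. $m\leq n$: delete some features, or apply regularization.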
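The closed-form solution can be sketched without any external library by solving the linear system $(X^TX)\theta=X^Ty$, which gives the same $\theta$ as applying the inverse whenever $X^TX$ is invertible. The helper names, the singularity threshold `1e-12` and the data are assumptions for illustration; a numerical library would normally be used instead.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:  # arbitrary threshold, an assumption
            raise ValueError("X^T X is singular or nearly singular")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        z[i] = (M[i][n] - sum(M[i][j] * z[j] for j in range(i + 1, n))) / M[i][i]
    return z


def normal_equation(X, y):
    """Return theta satisfying (X^T X) theta = X^T y."""
    Xt = transpose(X)
    XtX = matmul(Xt, X)
    Xty = [sum(a * b for a, b in zip(row, y)) for row in Xt]
    return solve(XtX, Xty)


# Invented data generated from y = 3 + 2*x_1 + 0.01*x_2; first column is x_0 = 1.
X = [[1.0, 1.0, 100.0], [1.0, 2.0, 300.0], [1.0, 3.0, 200.0], [1.0, 4.0, 400.0]]
y = [6.0, 10.0, 11.0, 15.0]
theta = normal_equation(X, y)
print(theta)  # close to [3, 2, 0.01] up to rounding
```

A redundant feature, such as a column that is exactly twice another one, makes `solve` raise `ValueError`, which corresponds to the noninvertible case described above.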
2.4 Polynomial Regression
If a linear $h_\theta(x)$ can’t fit the data well, we can change the behavior or curve of $h_\theta(x)$ by making it a quadratic, cubic or square root function (or any other form).
e.g.
- $h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_1^2,\ x_2=x_1^2$
- $h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2x_1^2+\theta_3x_1^3,\ x_2=x_1^2,\ x_3=x_1^3$
- $h_{\theta}(x)=\theta_0+\theta_1x_1+\theta_2\sqrt{x_1},\ x_2=\sqrt{x_1}$
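The examples above all reduce to ordinary linear regression once the extra features are computed from $x_1$. The following minimal sketch shows that construction; the function names are hypothetical and the parameter values are invented.

```python
import math


def polynomial_features(x1, degree):
    """Return [x1, x1^2, ..., x1^degree], i.e. x_2 = x1^2, x_3 = x1^3, and so on."""
    return [x1 ** k for k in range(1, degree + 1)]


def sqrt_features(x1):
    """Return [x1, sqrt(x1)], i.e. x_2 = sqrt(x1); assumes x1 >= 0."""
    return [x1, math.sqrt(x1)]


def predict(theta, features):
    """h_theta(x) = theta_0 + theta_1 * x_1 + ... using the derived features."""
    return theta[0] + sum(t * f for t, f in zip(theta[1:], features))


print(polynomial_features(2.0, 3))                            # [2.0, 4.0, 8.0]
print(sqrt_features(9.0))                                     # [9.0, 3.0]
print(predict([1.0, 2.0, 3.0], polynomial_features(2.0, 2)))  # 1 + 2*2 + 3*4 = 17.0
```

Because powers of $x_1$ can span very different ranges, feature scaling is usually worth applying to the derived features before running gradient descent.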