Week two: Regression with multiple input variables
1、Multiple linear regression
The model's prediction with multiple variables is given by the linear model:
$$f_{\mathbf{w},b}(\mathbf{x}) = w_0x_0 + w_1x_1 + \dots + w_{n-1}x_{n-1} + b$$
or in vector notation:
$$f_{\mathbf{w},b}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b$$
1.1、Vectorization
f = np.dot(w, x) + b
- Shorter code.
- Faster than an explicit loop over the elements.
- The speed comes from parallel hardware that multiplies and adds all the elements together, rather than processing them one by one.
- Parameter updates benefit as well: an update such as $w_j = w_j - 0.1d_j$ can be applied to every element of w in a single vectorized operation.
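To make the comparison concrete, here is a minimal sketch that computes the same prediction with an explicit loop and with np.dot, then applies the update from the last bullet to all of w at once. The values of w, x, b and d are illustrative assumptions, and dot_loop is a helper written only for this comparison.

```python
import numpy as np

def dot_loop(w, x):
    """Dot product with an explicit Python loop, one product at a time."""
    total = 0.0
    for j in range(w.shape[0]):
        total += w[j] * x[j]
    return total

# Illustrative values (assumed, not taken from a dataset)
w = np.array([1.0, 2.5, -3.3])
x = np.array([10.0, 20.0, 30.0])
b = 4.0

f_loop = dot_loop(w, x) + b   # element by element
f_vec = np.dot(w, x) + b      # one vectorized call, same result

# Vectorized parameter update: every w_j changes in one statement
d = np.array([0.3, 0.2, 0.4])  # hypothetical derivative values d_j
w_new = w - 0.1 * d

print(f_loop, f_vec)
print(w_new)
```

On arrays this small the timing difference is negligible; the gap only becomes visible when n is large.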
1.2、Lab of python and numpy
- 1-D array, shape (n,): n elements indexed [0] through [n-1].
"1-D" does not mean that the array holds a single element. On the contrary, it means that the array has one axis, i.e. a single row of elements (a one-dimensional array).
1.2.1、The creation
import numpy as np
a = np.zeros(4); print(f"np.zeros(4) : a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,2)); print(f"np.zeros(4,2) : a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.zeros((4,)); print(f"np.zeros(4,) : a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.random.random_sample(4); print(f"np.random.random_sample(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
'''
np.zeros(4) : a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.zeros(4,2) : a = [[0. 0.]
[0. 0.]
[0. 0.]
[0. 0.]], a shape = (4, 2), a data type = float64
np.zeros(4,) : a = [0. 0. 0. 0.], a shape = (4,), a data type = float64
np.random.random_sample(4): a = [0.93520726 0.67662427 0.59914887 0.55925697], a shape = (4,), a data type = float64
'''
a = np.arange(4.); print(f"np.arange(4.): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
# Passing the float 4. makes np.arange return float64 values.
a = np.random.rand(4); print(f"np.random.rand(4): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
'''
np.arange(4.): a = [0. 1. 2. 3.], a shape = (4,), a data type = float64
np.random.rand(4): a = [0.62211656 0.87796894 0.86304218 0.80868278], a shape = (4,), a data type = float64
'''
a = np.array([5,4,3,2]); print(f"np.array([5,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
a = np.array([5.,4,3,2]); print(f"np.array([5.,4,3,2]): a = {a}, a shape = {a.shape}, a data type = {a.dtype}")
'''
np.array([5,4,3,2]): a = [5 4 3 2], a shape = (4,), a data type = int64
np.array([5.,4,3,2]): a = [5. 4. 3. 2.], a shape = (4,), a data type = float64
'''
1.2.2、The operations on vectors
- Indexing means referring to an element of an array by its position within the array.
- a.shape[0] returns the length of the array.
- Slicing means getting a subset of elements from an array based on their indices.
a = np.arange(10)
c = a[2:7:2]; print("a[2:7:2] = ", c) # a[2:7:2] = [2 4 6]
c = a[3:]; print("a[3:] = ", c) # a[3:] = [3 4 5 6 7 8 9]
- Operations on a single vector.
a = np.array([1, 2, 3, 4])
b = -a
print(f"b = -a : {b}")
b = np.sum(a)
print(f"b = np.sum(a) : {b}")
b = np.mean(a)
print(f"b = np.mean(a): {b}")
b = a**2
print(f"b = a**2 : {b}")
'''
b = -a : [-1 -2 -3 -4]
b = np.sum(a) : 10
b = np.mean(a): 2.5
b = a**2 : [ 1 4 9 16]
'''
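- On two vectors
- a + b adds the two vectors element by element; both must have the same shape.
- b = 5 * a multiplies every element of a by the scalar 5.
- c = np.dot(a, b) returns the dot product of a and b, which is a scalar.
- shape
- For a scalar, .shape returns (). For a 1-D vector a, a.shape returns (length of a,).
- For a 2-D array such as a = [[0 1] [2 3] [4 5]], a.shape is (3, 2) and a[2].shape is (2,).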
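The operations and shapes listed above can be checked with a short sketch. The vectors a and b below are illustrative assumptions; the 2-D array m matches the [[0 1] [2 3] [4 5]] example.

```python
import numpy as np

# Illustrative vectors (assumed values)
a = np.array([1, 2, 3, 4])
b = np.array([-1, -2, 3, 4])

total = a + b          # element-wise sum, shape (4,)
scaled = 5 * a         # scalar multiplication, shape (4,)
c = np.dot(a, b)       # dot product, a scalar

scalar_shape = np.array(7).shape   # () for a scalar
vector_shape = a.shape             # (4,) for a 1-D vector

m = np.arange(6).reshape(-1, 2)    # [[0 1] [2 3] [4 5]]
print(total, scaled, c)
print(scalar_shape, vector_shape, m.shape, m[2].shape)
```

Indexing a 2-D array with a single index, as in m[2], drops one axis and returns a 1-D row.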
1.2.3、Matrices
Matrices are two-dimensional arrays. m is often the number of rows and n the number of columns. Matrices have a two-dimensional (2-D) index [m,n].
- Creation
a = np.zeros((1, 5))
print(f"a shape = {a.shape}, a = {a},{a.dtype}")
a = np.zeros((2, 1))
print(f"a shape = {a.shape}, a = {a}")
a = np.random.random_sample((1, 1))
print(f"a shape = {a.shape}, a = {a}")
"""
a shape = (1, 5), a = [[0. 0. 0. 0. 0.]],float64
a shape = (2, 1), a = [[0.]
[0.]]
# Notice further that NumPy, when printing, prints one row per line.
a shape = (1, 1), a = [[0.04997798]]
"""
a = np.array([[5], [4], [3]]); print(f" a shape = {a.shape}, np.array: a = {a}")
# a shape = (3, 1), np.array: a = [[5]
# [4]
# [3]]
a = np.arange(6).reshape(-1, 2) # equivalent to a = np.arange(6).reshape(3, 2)
"""
a= [[0 1]
[2 3]
[4 5]]
"""
# The -1 argument tells the routine to compute the number of rows given the size of the array and the number of columns.
- Slicing
Assuming a = np.arange(20).reshape(-1, 10), which is consistent with the values shown:
a[:, 2:7:1] = [[ 2 3 4 5 6] [12 13 14 15 16]]
a[:, 2:7:1].shape = (2, 5), a 2-D array
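A small sketch of 2-D slicing, under the assumption that the array is np.arange(20).reshape(-1, 10) (two rows of ten values, which reproduces the numbers above). It also contrasts slicing all rows with picking a single row.

```python
import numpy as np

a = np.arange(20).reshape(-1, 10)   # assumed array: 2 rows, 10 columns

cols = a[:, 2:7:1]    # all rows, columns 2 through 6: stays 2-D
row = a[1, 2:7:1]     # a single row index drops an axis: result is 1-D
whole_row = a[1]      # same as a[1, :]

print(cols, cols.shape)
print(row, row.shape)
print(whole_row.shape)
```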
1.3、Gradient descent for multiple linear regression
$$\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \\ & w_j = w_j - \alpha \frac{\partial J(\mathbf{w},b)}{\partial w_j} \quad \text{for j = 0..n-1} \\ & b = b - \alpha \frac{\partial J(\mathbf{w},b)}{\partial b} \\ & \rbrace \end{align*}$$
where:
$$\begin{align} \frac{\partial J(\mathbf{w},b)}{\partial w_j} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)})x_{j}^{(i)} \\ \frac{\partial J(\mathbf{w},b)}{\partial b} &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)}) \end{align}$$
1.4、The Lab
- $\mathbf{x}^{(i)}$ is a vector containing example $i$: $\mathbf{x}^{(i)} = (x^{(i)}_0, x^{(i)}_1, \cdots, x^{(i)}_{n-1})$
- The code
def compute_cost(X, y, w, b):
    """
    compute cost
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b           # (n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2       # scalar
    cost = cost / (2 * m)                      # scalar
    return cost
def compute_gradient(X, y, w, b):
    """
    Computes the gradient for linear regression
    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter
    Returns:
      dj_dw (ndarray (n,)): The gradient of the cost w.r.t. the parameters w.
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b.
    """
    m, n = X.shape           # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.
    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m
    return dj_db, dj_dw
import copy
import math

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha
    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute cost
      gradient_function   : function to compute the gradient
      alpha (float)       : Learning rate
      num_iters (int)     : number of iterations to run gradient descent
    Returns:
      w (ndarray (n,)) : Updated values of parameters
      b (scalar)       : Updated value of parameter
      J_history (list) : Cost at each iteration, for graphing
    """
    # A list to store the cost J at each iteration, primarily for graphing later
    J_history = []
    w = copy.deepcopy(w_in)  # avoid modifying the caller's w within the function
    b = b_in
    for i in range(num_iters):
        # Calculate the gradient
        dj_db, dj_dw = gradient_function(X, y, w, b)
        # Update the parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db
        # Save cost J at each iteration
        if i < 100000:  # prevent resource exhaustion
            J_history.append(cost_function(X, y, w, b))
        # Print the cost 10 times over the run, or every iteration if num_iters < 10
        if i % math.ceil(num_iters / 10) == 0:
            print(f"Iteration {i:4d}: Cost {J_history[-1]:8.2f} ")
    return w, b, J_history  # return final w, b and J history for graphing
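The double loop in compute_gradient can also be written with matrix operations. The following is a sketch of that idea, not the lab's own code: compute_gradient_vectorized is a hypothetical name, and the tiny dataset (generated from y = 2*x0 + 3*x1 + 1) and the learning rate 0.05 are assumptions chosen so that gradient descent converges.

```python
import numpy as np

def compute_gradient_vectorized(X, y, w, b):
    """Same gradient as the loop version, computed with matrix products."""
    m = X.shape[0]
    err = X @ w + b - y        # (m,) prediction error for every example
    dj_dw = X.T @ err / m      # (n,) sums err * x_j over the examples
    dj_db = np.sum(err) / m    # scalar
    return dj_db, dj_dw

# Assumed synthetic data: y = 2*x0 + 3*x1 + 1
X = np.array([[1., 2.], [2., 1.], [3., 4.], [4., 3.]])
y = X @ np.array([2., 3.]) + 1.

w = np.zeros(2)
b = 0.
alpha = 0.05                   # assumed learning rate
for _ in range(20000):
    dj_db, dj_dw = compute_gradient_vectorized(X, y, w, b)
    w = w - alpha * dj_dw
    b = b - alpha * dj_db

print(w, b)   # should approach the generating values [2, 3] and 1
```

X.T @ err performs, for every feature j, the same sum over examples that the inner loop accumulates one term at a time.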
2、Gradient descent in practice
2.1、Feature scaling
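As the figures show, when X1 and X2 are on different scales, they affect the final loss differently. Picture the left figure as a three-dimensional terrain, like a rugged mountain road. In the right figure the inputs share the same scale, so although it is also a three-dimensional terrain, its contours are regular. When you try to reach the lowest point in the left figure, the direction given by gradient descent does not point at the minimum, so the path may pass through many different points. In the right figure, the surrounding terrain is similar in every direction, so the gradient keeps pointing toward the minimum. This is why feature scaling speeds up convergence. (Quoted from "Feature Scaling 的意义".)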
- Mean normalization
$$x_i := \frac{x_i - \mu_i}{max - min}$$
- Z-score normalization
$$x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j}$$
where $j$ selects a feature or a column in the $\mathbf{X}$ matrix and:
$$\sigma^2_j = \frac{1}{m} \sum_{i=0}^{m-1} (x^{(i)}_j - \mu_j)^2$$
After z-score normalization, all features will have a mean of 0 and a standard deviation of 1. Given a new x value (living room area and number of bedrooms), we must first normalize x using the mean and standard deviation that we previously computed from the training set.
- The code:
def zscore_normalize_features(X):
    """
    computes X, z-score normalized by column
    Args:
      X (ndarray (m,n))     : input data, m examples, n features
    Returns:
      X_norm (ndarray (m,n)): input normalized by column
      mu (ndarray (n,))     : mean of each feature
      sigma (ndarray (n,))  : standard deviation of each feature
    """
    # find the mean of each column/feature
    mu = np.mean(X, axis=0)      # mu will have shape (n,)
    # axis=0 aggregates down the rows (one result per column);
    # axis=1 would aggregate across the columns (one result per row)
    # find the standard deviation of each column/feature
    sigma = np.std(X, axis=0)    # sigma will have shape (n,)
    # element-wise, subtract mu for that column from each example, divide by std for that column
    X_norm = (X - mu) / sigma
    return (X_norm, mu, sigma)
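As a usage sketch, the snippet below normalizes a small training set and then applies the stored mu and sigma to a new example, as the text above requires. The training values and x_new are made-up numbers for illustration (size and number of bedrooms), and the function is repeated in compact form so the snippet runs on its own.

```python
import numpy as np

def zscore_normalize_features(X):
    """Z-score normalize each column; also return the statistics used."""
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    return (X - mu) / sigma, mu, sigma

# Hypothetical training set: [size, number of bedrooms]
X_train = np.array([[2104., 5.], [1416., 3.], [852., 2.]])
X_norm, mu, sigma = zscore_normalize_features(X_train)

# A new example must be scaled with the TRAINING mu and sigma,
# not with statistics computed from the new example itself
x_new = np.array([1200., 3.])
x_new_norm = (x_new - mu) / sigma

print(X_norm.mean(axis=0), X_norm.std(axis=0))
print(x_new_norm)
```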
2.2、The learning rate
With a small enough $\alpha$, $J(\mathbf{w},b)$ should decrease on every iteration.
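The effect of the learning rate can be seen on a toy problem. This sketch runs gradient descent on three assumed data points with two assumed learning rates, 0.1 and 0.5; run_gd and cost are hypothetical helper names. For this particular dataset 0.1 is small enough, while 0.5 is too large.

```python
import numpy as np

def cost(X, y, w, b):
    """Squared-error cost J(w, b)."""
    err = X @ w + b - y
    return np.sum(err ** 2) / (2 * X.shape[0])

def run_gd(X, y, alpha, num_iters):
    """Run gradient descent from w = 0, b = 0 and return the cost history."""
    m = X.shape[0]
    w = np.zeros(X.shape[1])
    b = 0.0
    history = [cost(X, y, w, b)]
    for _ in range(num_iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * np.sum(err) / m
        history.append(cost(X, y, w, b))
    return np.array(history)

# Assumed toy data: y = 2x
X = np.array([[1.], [2.], [3.]])
y = np.array([2., 4., 6.])

small = run_gd(X, y, alpha=0.1, num_iters=50)   # cost falls at every step
large = run_gd(X, y, alpha=0.5, num_iters=50)   # cost grows: alpha is too large

print(small[:3], small[-1])
print(large[:3], large[-1])
```

Plotting the cost history against the iteration number is a practical way to spot a learning rate that is too large.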
2.3、Feature engineering
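Feature engineering means using your intuition to design new features by transforming or combining the original ones, so that the learning algorithm can make accurate predictions more easily.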
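As a small illustration of combining features, suppose each example describes a plot of land by its frontage and depth. That scenario and its numbers are assumptions made for this sketch. Their product, the area, may be a more useful predictor than either one alone, so it is appended as a third column.

```python
import numpy as np

# Hypothetical data: columns are frontage and depth of a lot
X = np.array([[20., 50.],
              [30., 40.],
              [25., 60.]])

# Engineered feature: area = frontage * depth
area = X[:, 0] * X[:, 1]

# Append it as a new column; the model stays linear in its inputs
X_eng = np.c_[X, area]
print(X_eng)
```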
2.4、Polynomial regression
When the target is a simple quadratic such as $y = 1 + x^2$, fitting a straight line to the raw $x$ values is not a great fit.
If we swap the original data with a version that squares the $x$ value, then we can achieve $y = w_0x_0^2 + b$. Let's try it by swapping X for x**2. Then we can use $y = w_0 X + b$ to fit the data.
As we can see, the fit is nearly perfect. In this case we knew that an $x^2$ term was required. It may not always be obvious which features are required. One could add a variety of potential features to try and find the most useful. For example, what if we had instead tried $y = w_0x_0 + w_1x_1^2 + w_2x_2^3 + b$?
X = np.c_[x, x**2, x**3]   # <-- added engineered features
Then we attain $0.08x + 0.54x^2 + 0.03x^3 + 0.0106$.
Gradient descent has emphasized the feature that best fits the $x^2$ data by increasing the $w_1$ (0.54) term relative to the others.
Another way to think about this is to note that we are still using linear regression once we have created new features. Given that, the best features will be linear relative to the target. This is best understood with an example.
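One way to see which engineered feature is "linear relative to the target" is to measure how strongly each candidate correlates with y. The sketch below assumes x = 0..19 and the quadratic target from earlier in this section. It also fits the weights with closed-form least squares (np.linalg.lstsq), which is swapped in for gradient descent only to keep the example short, so the weights it finds differ from the gradient-descent values quoted above.

```python
import numpy as np

x = np.arange(0, 20, 1, dtype=float)   # assumed input range
y = 1 + x**2                           # quadratic target

X = np.c_[x, x**2, x**3]               # candidate features

# Correlation of each candidate feature with the target
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Closed-form least squares; the column of ones plays the role of b
A = np.c_[X, np.ones(len(x))]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w, b = coef[:3], coef[3]

print(corr)   # the x**2 column has the highest correlation with y
print(w, b)   # nearly all of the weight lands on the x**2 term
```

Because y is an exact linear function of the x**2 column, that feature dominates both the correlation and the fitted weights.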
2.5、Scikit-Learn
Please refer to sklearn