1 Simple Linear Regression
Characteristics of the linear regression algorithm:
1. It solves regression problems.
2. The idea is simple and easy to implement.
3. It is the foundation of many powerful nonlinear models.
4. Its results are highly interpretable.
5. It embodies many important ideas in machine learning.
When each sample has only one feature, the method is called simple linear regression.
When each sample has multiple features, it is called multiple linear regression; both forms are written out below.
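For reference, the two model forms can be written as follows (standard notation, added here for comparison; this article only covers the simple case):

\hat{y} = a x + b \quad \text{(simple linear regression, one feature)}

\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n \quad \text{(multiple linear regression, } n \text{ features)}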
Suppose we have found the best-fitting line y = ax + b. Then for each sample point x(i), the value predicted by this line is y_hat(i) = a*x(i) + b, while the true value is y(i).
We want the gap between y(i) and y_hat(i) to be as small as possible, and we measure that gap with the squared error (y(i) - y_hat(i))^2.
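Summing the squared error over all m samples gives the least-squares objective; minimizing it with respect to a and b (the standard least-squares derivation, written out here for completeness) yields the closed-form solution that the code below implements:

\min_{a,\,b} \sum_{i=1}^{m} \left( y^{(i)} - a x^{(i)} - b \right)^2

a = \frac{\sum_{i=1}^{m} \left( x^{(i)} - \bar{x} \right)\left( y^{(i)} - \bar{y} \right)}{\sum_{i=1}^{m} \left( x^{(i)} - \bar{x} \right)^2}, \qquad b = \bar{y} - a\,\bar{x}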
2 Implementing Simple Linear Regression
Implement Simple Linear Regression:
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1, 2, 3, 4, 5])
y = np.array([1, 3, 2, 3, 5])
plt.scatter(x, y)
plt.axis([0, 6, 0, 6])
plt.show()
Output: a scatter plot of the five sample points.
# compute the means of x and y
x_mean = np.mean(x)
y_mean = np.mean(y)
# num accumulates the numerator of a, d the denominator
num = 0
d = 0
for x_i, y_i in zip(x, y):
    num += (x_i - x_mean) * (y_i - y_mean)
    d += (x_i - x_mean) ** 2
a = num / d
b = y_mean - a * x_mean
print(a)
print(b)
Output:
0.8
0.39999999999999947
Plot the fitted line:
# plot the line given by the fitted equation
y_hat = a * x + b
plt.scatter(x, y)
plt.plot(x, y_hat, color = 'r')
plt.axis([0, 6, 0, 6])
plt.show()
Output: the scatter plot with the fitted line drawn in red.
# when a new sample arrives, use the fitted model to predict its value
x_predict = 6
y_predict = a * x_predict + b
y_predict
>>>5.2
Organizing the SimpleLinearRegression algorithm in PyCharm:
import numpy as np

class SimpleLinearRegression1:

    def __init__(self):
        self.a_ = None
        self.b_ = None

    def fit(self, x_train, y_train):
        """Fit the Simple Linear Regression model using the training data x_train, y_train"""
        assert x_train.ndim == 1, 'simple linear regression can only solve single feature training data'
        assert len(x_train) == len(y_train), 'the size of x_train must be equal to y_train'
        x_mean = np.mean(x_train)
        y_mean = np.mean(y_train)
        num = 0
        d = 0
        for x_i, y_i in zip(x_train, y_train):
            num += (x_i - x_mean) * (y_i - y_mean)
            d += (x_i - x_mean) ** 2
        self.a_ = num / d
        self.b_ = y_mean - self.a_ * x_mean
        return self

    def predict(self, x_predict):
        """Given a dataset x_predict to be predicted, return the vector of predicted values"""
        assert x_predict.ndim == 1, 'simple linear regression can only solve single feature training data'
        assert self.a_ is not None and self.b_ is not None, 'must fit before predict'
        return np.array([self._predict(x_i) for x_i in x_predict])

    def _predict(self, x_i):
        """Predict the value for a single sample x_i"""
        return self.a_ * x_i + self.b_

    def __repr__(self):
        return 'SimpleLinearRegression1()'
Next, use our own SimpleLinearRegression in a Jupyter notebook:
from simple_linear_regression.SimpleLinearRegression import SimpleLinearRegression1
reg = SimpleLinearRegression1()
reg.fit(x, y)
>>>SimpleLinearRegression1()
reg.predict(np.array([x_predict]))
>>>array([5.2])
# inspect the fitted parameters
print(reg.a_)
print(reg.b_)
Output:
0.8
0.39999999999999947
3 Vectorization
In this section we replace the for loop used above to compute a and b with vectorized operations. Copy the entire SimpleLinearRegression1 class, name the copy SimpleLinearRegression2, and change its fit method to the following:
def fit(self, x_train, y_train):
    """Fit the Simple Linear Regression model using the training data x_train, y_train"""
    assert x_train.ndim == 1, 'simple linear regression can only solve single feature training data'
    assert len(x_train) == len(y_train), 'the size of x_train must be equal to y_train'
    x_mean = np.mean(x_train)
    y_mean = np.mean(y_train)
    # vectorized computation: dot products replace the explicit for loop
    num = (x_train - x_mean).dot(y_train - y_mean)
    d = (x_train - x_mean).dot(x_train - x_mean)
    self.a_ = num / d
    self.b_ = y_mean - self.a_ * x_mean
    return self
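The vectorized form works because the numerator and denominator of a are simply dot products of the de-meaned vectors:

\sum_{i=1}^{m} \left( x^{(i)} - \bar{x} \right)\left( y^{(i)} - \bar{y} \right) = (x - \bar{x}) \cdot (y - \bar{y}), \qquad \sum_{i=1}^{m} \left( x^{(i)} - \bar{x} \right)^2 = (x - \bar{x}) \cdot (x - \bar{x})

NumPy evaluates these dot products in compiled code, which is why the vectorized fit is much faster than the Python loop.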
Call the new SimpleLinearRegression2 in the Jupyter notebook, continuing from the previous notebook:
from simple_linear_regression.SimpleLinearRegression import SimpleLinearRegression2
reg2 = SimpleLinearRegression2()
reg2.fit(x, y)
>>>SimpleLinearRegression2()
print(reg2.a_)
print(reg2.b_)
Output:
0.8
0.39999999999999947
Next, benchmark the vectorized implementation against the loop-based one:
m = 1000000
big_x = np.random.random(size = m)
big_y = big_x * 2 + 3
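A minimal sketch of how this timing comparison could be run in a Jupyter cell, assuming fresh instances reg1 and reg2 of the two classes and using IPython's %timeit magic:

reg1 = SimpleLinearRegression1()
reg2 = SimpleLinearRegression2()

# time the loop-based fit
%timeit reg1.fit(big_x, big_y)

# time the vectorized fit; the dot-product version is expected to be far faster
%timeit reg2.fit(big_x, big_y)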