Machine Learning Notes (1): Linear Regression (Univariate Linear Regression)


口袋的天空Zard


Linear Regression

Univariate linear regression in practice: predicting an industry's profit from city population

1. Python packages to import:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression  # linear regression; requires the scikit-learn package
from mpl_toolkits.mplot3d import axes3d  # 3D plotting; requires the matplotlib package

2. pandas.set_option

# pandas.set_option(pat, value)

Use the call above to set pandas options, which control how data is displayed:

pd.set_option('display.notebook_repr_html', False) 
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)
pd.set_option('display.max_seq_items', None)
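One way to confirm that the options took effect is to read them back with pd.get_option. A minimal sketch, assuming pandas is installed; pd.option_context is included only to show a temporary override and is not used in the original notes.

```python
import pandas as pd

# Set two of the display options used above.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 150)

# Read an option back to confirm it was applied.
max_rows = pd.get_option('display.max_rows')
print(max_rows)

# Temporarily override an option; it is restored when the block exits.
with pd.option_context('display.max_rows', 10):
    inside = pd.get_option('display.max_rows')
after = pd.get_option('display.max_rows')
print(inside, after)
```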

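3. %matplotlib

%matplotlib inline

The %matplotlib magic command embeds matplotlib figures directly in the Notebook, or displays them with a specified GUI backend; its single argument selects how figures are shown.

inline embeds the figures in the Notebook.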

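4. Import seaborn

import seaborn as sns # seaborn is used for data visualization
sns.set_context('notebook') # set the default plotting context
sns.set_style('white') # control the chart style

seaborn builds a series of improvements on top of matplotlib: the plotting workflow is simple and the figures look good.

The official documentation is also clear and straightforward.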

Univariate linear regression

5. Load the data: open the data file and set the delimiter used when reading

data=np.loadtxt('linear_regression_data1.txt',delimiter=',')

print(repr(data))

Output (a set of 2-D points):

array([[  6.1101 ,  17.592  ],
       [  5.5277 ,   9.1302 ],
       [  8.5186 ,  13.662  ],
       [  7.0032 ,  11.854  ],
       [  5.7737 ,   2.4406 ],
       [  7.8247 ,   6.7318 ],
       [  7.0931 ,   1.0463 ],
       [  5.0702 ,   5.1337 ],
       ……
       [  5.0594 ,   2.8214 ],
       [  5.7077 ,   1.8451 ],
       [  7.6366 ,   4.2959 ],
       [  5.8707 ,   7.2029 ],
       [  5.3054 ,   1.9869 ],
       [  8.2934 ,   0.14454],
       [ 13.394  ,   9.0551 ],
       [  5.4369 ,   0.61705]])

6. Build the initial matrices

X = np.c_[np.ones(data.shape[0]),data[:,0]]  # matrix holding the x values; each row is [1, x]
y = np.c_[data[:,1]]                         # the y values as a column

(1) numpy.ones: build an array of a given size

np.ones(5)

Output:

array([1., 1., 1., 1., 1.])

Passing a tuple to np.ones:

np.ones((2,3))

Output:

array([[1.,1.,1.],
       [1.,1.,1.]])

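# The parameters of ones are as follows
numpy.ones(shape, dtype=None, order='C')
# It returns an array of the given type and shape

(2) Python list slicing:

# Suppose sequence is a one-dimensional array
sequence[2:5]   # elements at indices 2 through 4
sequence[::2]   # every other element, starting at index 0
sequence[::-1]  # the sequence reversed
sequence[a:b:c] # from index a up to (not including) b, in steps of c
sequence[:]     # the whole array
sequence[-1]    # the last element

# Suppose data is a two-dimensional array
data[0,1] # same as data[0][1]
data[:,0] # array of all elements in column 0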
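The slicing rules above can be tried on concrete arrays. A small sketch, assuming NumPy; sequence is built with np.arange for illustration, and data holds only the first three rows of the output in step 5 rather than the full file.

```python
import numpy as np

sequence = np.arange(10)          # the integers 0 through 9

middle = sequence[2:5]            # indices 2, 3 and 4
every_other = sequence[::2]       # indices 0, 2, 4, 6, 8
reversed_seq = sequence[::-1]     # same elements in reverse order
last = sequence[-1]               # final element

# First three rows of the data set printed in step 5.
data = np.array([[6.1101, 17.592],
                 [5.5277, 9.1302],
                 [8.5186, 13.662]])
first_col = data[:, 0]            # every row, column 0

print(middle, every_other, reversed_seq, last)
print(first_col)
```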

(3) Using numpy.c_:

np.c_[[1,2,3],[4,5,6]]   # np.c_ is indexed with square brackets, not called

Output:

array([[1,4],
       [2,5],
       [3,6]])

(4) Using shape:
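shape is an attribute (not a function) that holds an array's dimensions as a tuple; data.shape[0] in step 6 is the number of rows. A small sketch, assuming NumPy and using three rows copied from the output in step 5 in place of the full file:

```python
import numpy as np

# Three rows from the data set printed in step 5.
data = np.array([[6.1101, 17.592],
                 [5.5277, 9.1302],
                 [8.5186, 13.662]])

print(data.shape)       # (number of rows, number of columns)
print(data.shape[0])    # number of rows

# Same construction as step 6: a column of ones next to the x values.
X = np.c_[np.ones(data.shape[0]), data[:, 0]]
y = np.c_[data[:, 1]]
print(X.shape, y.shape)
```

Wrapping a 1-D slice in np.c_ turns it into a column, which is why y ends up two-dimensional here.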

7. Draw a scatter plot with pyplot

plt.scatter(X[:,-1], y, s=30, c='r', marker='x', linewidths=1)
plt.xlim(4,24)
plt.xlabel('Population of City in 10,000s')        # x-axis label
plt.ylabel('Profit in $10,000s')                   # y-axis label

Output:

[Figure: scatter plot of profit against city population]

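(1) The pyplot.scatter function

matplotlib.pyplot.scatter(x, y, s=20, c='b', marker='o', cmap=None, norm=None,
                          vmin=None, vmax=None, alpha=None, linewidths=None,
                          verts=None, hold=None, **kwargs)

This is the signature from an older matplotlib release; verts and hold have been removed in recent versions.

Common markers:

[Figure: table of common markers]

Color codes: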

b -> blue
c -> cyan
g -> green
k -> black
m -> magenta
r -> red
w -> white
y -> yellow

(2) The xlim() and ylim() functions:

xlim() gets or sets the x-axis range of the current figure.

ylim() gets or sets the y-axis range of the current figure:

xmin, xmax = plt.xlim()   # return the current xlim
plt.xlim( (xmin, xmax) )  # set the xlim to xmin, xmax
plt.xlim( xmin, xmax )    # set the xlim to xmin, xmax

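8. Cost function

The cost function measures the gap between the values predicted under a given set of parameters and the actual values.

For linear regression, the cost function is defined as:

$$J(\theta) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where $m$ is the number of $(x, y)$ pairs.

For a straight line, the cost function is:

$$J(\theta_0, \theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$$

where $h_\theta(x) =\theta_0+\theta_1x$

Code (straight line):

theta holds the parameters [ $\theta_0$ , $\theta_1$ ]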

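def computeCost(X, y, theta=[[0],[0]]):
    m = y.size                                # number of training examples
    h = X.dot(theta)                          # predictions h_theta(x) for every row of X
    J = 1.0/(2*m)*(np.sum(np.square(h-y)))    # half the mean squared error
    return J

np.sum(A)    # sums every element of the matrix (numpy array) and returns the total
np.square(A) # squares every element of the matrix (numpy array) and returns a matrix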

Compute the cost:

computeCost(X,y)  # evaluated with the default parameters, both initially 0

Result: 32.072733877455676
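Two quick sanity checks make the formula concrete: with all parameters at zero every prediction is zero, so the cost reduces to the sum of y squared divided by 2m; and when the targets lie exactly on the line, the cost is zero. A sketch assuming NumPy; compute_cost is an illustrative restatement of computeCost, and the three data rows are copied from step 5, so the numbers differ from the full-data result above.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Half the mean squared error of the predictions X.dot(theta)."""
    m = y.size
    h = X.dot(theta)
    return 1.0 / (2 * m) * np.sum(np.square(h - y))

# Three rows from the data set printed in step 5 (not the full file).
data = np.array([[6.1101, 17.592],
                 [5.5277, 9.1302],
                 [8.5186, 13.662]])
X = np.c_[np.ones(data.shape[0]), data[:, 0]]
y = np.c_[data[:, 1]]

# Check 1: theta = 0 gives h = 0, so J = sum(y^2) / (2m).
cost_zero = compute_cost(X, y, np.zeros((2, 1)))
by_hand = np.sum(y ** 2) / (2 * y.size)
print(cost_zero, by_hand)

# Check 2: targets generated from a known line give zero cost at that line.
theta_true = np.array([[1.0], [2.0]])   # hypothetical parameters
y_exact = X.dot(theta_true)
cost_exact = compute_cost(X, y_exact, theta_true)
print(cost_exact)
```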

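9. Gradient descent

Gradient descent is one of the most basic methods for adjusting the parameters $\theta$ so that the cost function $J(\theta)$ reaches its minimum.

Intuitively, we pick an initial value on a convex function and then move it step by step toward the lowest point, as shown:

[Figure: stepping toward the minimum of a convex function]

If $\theta_0$ is fixed at 0, $J$ as a function of $\theta_1$ is:

[Figure: $J$ plotted against $\theta_1$]

If neither $\theta_0$ nor $\theta_1$ is fixed, $J$ as a function of $\theta_0$ and $\theta_1$ is:

[Figure: $J$ plotted against $\theta_0$ and $\theta_1$]

The gradient descent process:

[Figure: gradient descent process]

The algorithm is as follows:

[Figure: gradient descent algorithm]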
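The update rule can be written out as follows. This is a reconstruction inferred from the vectorized implementation in this section, so the notation is an assumption rather than the author's original figure:

```latex
\text{repeat until convergence:}\quad
\theta_j := \theta_j - \alpha \, \frac{1}{m} \sum_{i=1}^{m}
\left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
\qquad (j = 0, 1,\ \text{updated simultaneously})
```

In matrix form this is $\theta := \theta - \frac{\alpha}{m} X^{T}(X\theta - y)$, where $\alpha$ is the learning rate and $x_0^{(i)} = 1$ for every example.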

Implementation:

def gradientDescent(X, y, theta=[[0],[0]], alpha=0.01, num_iters=1500):
    m = y.size
    # num_iters: number of iterations
    J_history = np.zeros(num_iters)                    # cost after each iteration
    for i in np.arange(num_iters):                     # i avoids shadowing the built-in iter
        h = X.dot(theta)                               # current predictions
        theta = theta - alpha*(1.0/m)*(X.T.dot(h-y))   # update both parameters at once
        J_history[i] = computeCost(X, y, theta)
    return theta, J_history

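np.arange() returns the range to loop over.

np.zeros(m) creates an array of length m filled with zeros.

J_history stores the cost after each iteration so that its change can be observed.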
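One way to check an implementation like this is to run it on data with a known answer and compare against a closed-form least-squares solve. A sketch under stated assumptions: gradient_descent is an illustrative restatement of the function above, the data is synthetic and noise-free (y = 1 + 2x), and alpha=0.05 with 5000 iterations is chosen for this toy data, not taken from the original.

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1500):
    """Batch gradient descent for linear regression, starting from theta = 0."""
    m = y.size
    theta = np.zeros((X.shape[1], 1))
    J_history = np.zeros(num_iters)
    for i in range(num_iters):
        error = X.dot(theta) - y
        theta = theta - alpha * (1.0 / m) * X.T.dot(error)
        # Record the cost after this update.
        J_history[i] = np.sum(np.square(X.dot(theta) - y)) / (2 * m)
    return theta, J_history

# Synthetic, noise-free data on the line y = 1 + 2x.
x = np.linspace(0, 5, 50)
X = np.c_[np.ones(x.size), x]
y = np.c_[1.0 + 2.0 * x]

theta_gd, J_history = gradient_descent(X, y, alpha=0.05, num_iters=5000)

# Closed-form least-squares solution for comparison.
theta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

print(theta_gd.ravel())
print(theta_ls.ravel())
```

If the learning rate is too large the recorded cost grows instead of shrinking, so plotting or inspecting J_history is a cheap first diagnostic.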

10. Plot the cost against the number of iterations

theta , Cost_J = gradientDescent(X, y)
print('theta: ',theta.ravel()) # print the fitted theta values
# theta:  [-3.63029144  1.16636235]

plt.plot(Cost_J)
plt.ylabel('Cost J')
plt.xlabel('Iterations')

Plot:

[Figure: cost J against iterations]

By 1500 iterations the cost has almost stopped changing.

11. Plot the fitted regression line

and compare it with the linear regression in scikit-learn

xx = np.arange(5,23)    # x coordinates for drawing the fitted line
yy = theta[0]+theta[1]*xx   # matching y coordinates on the line

# Plot the result of our own gradient-descent linear regression
plt.scatter(X[:,1], y, s=30, c='r', marker='x', linewidths=1)
plt.plot(xx,yy, label='Linear regression (Gradient descent)')

# Compare with the linear regression in Scikit-learn
regr = LinearRegression()
regr.fit(X[:,1].reshape(-1,1), y.ravel())
plt.plot(xx, regr.intercept_+regr.coef_*xx, label='Linear regression (Scikit-learn GLM)')

plt.xlim(4,24)
plt.xlabel('Population of City in 10,000s')
plt.ylabel('Profit in $10,000s')
plt.legend(loc=4)

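pyplot.plot draws 2-D lines

plt.plot(x, y, label='') # x coordinates, y coordinates, legend label; points are joined by straight line segments

pyplot.legend draws the legend

plt.legend(loc=4)  # location code 4 places the legend at the lower right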

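12. Make predictions

Predict the profit for a city whose population lies between 40,000 and 240,000 (here 46,000, i.e. x = 4.6):

print(theta.T.dot([1, 4.6])*10000)

Output: [ 17349.75372139]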
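The same calculation extends to several cities at once by stacking the inputs into a matrix. A minimal sketch, assuming NumPy; predict_profit is a hypothetical helper, theta is hard-coded from the values printed in step 10, and the populations other than 4.6 are made up for illustration.

```python
import numpy as np

# Fitted parameters as printed in step 10.
theta = np.array([[-3.63029144], [1.16636235]])

def predict_profit(populations_10k, theta):
    """Populations are in units of 10,000 people; returns profit in dollars."""
    x = np.asarray(populations_10k, dtype=float)
    X_new = np.c_[np.ones(x.size), x]        # each row is [1, x]
    return X_new.dot(theta).ravel() * 10000  # model output is in $10,000s

profits = predict_profit([4.6, 7.0, 10.0], theta)
print(profits)
```

The first entry should agree with the single prediction above up to the rounding of theta.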

See https://github.com/icepoint666/MachineLearning for details.
