Andrew Ng's Machine Learning Course, Exercise 1: Linear Regression

u010660276

Category: Machine Learning

Article link: https://blog.csdn.net/u010660276/article/details/93381966

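Data preparation
Cost function
Linear regression with one variable
Linear regression with multiple variables
Linear regression with sklearn
Summary of functions used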

Data preparation

import numpy as np
import pandas as pd

path = 'ex1data1.txt'
data = pd.read_csv(path, header=None, names=['Population', 'Profit'])

Cost function

$J\left( \theta \right)=\frac{1}{2m}\sum\limits_{i=1}^{m}{{{\left( {{h}_{\theta }}\left( {{x}^{(i)}} \right)-{{y}^{(i)}} \right)}^{2}}}$

where:
${{h}_{\theta }}\left( x \right)={{\theta }^{T}}x={{\theta }_{0}}{{x}_{0}}+{{\theta }_{1}}{{x}_{1}}+{{\theta }_{2}}{{x}_{2}}+...+{{\theta }_{n}}{{x}_{n}}$
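
To make the cost formula concrete, here is a minimal sketch that evaluates $J(\theta)$ on three made-up data points. The function name compute_cost and the toy values are assumptions for illustration only; they are not part of the exercise data.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Squared-error cost J(theta) for a design matrix X of shape (m, n+1)."""
    m = X.shape[0]
    residuals = X @ theta - y              # h_theta(x^(i)) - y^(i) for every sample
    return float(residuals @ residuals) / (2 * m)

# Toy data (hypothetical): the points lie exactly on the line y = x
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # first column is x_0 = 1
y = np.array([1.0, 2.0, 3.0])

cost_zero = compute_cost(X, y, np.array([0.0, 0.0]))    # (1 + 4 + 9) / (2 * 3)
cost_exact = compute_cost(X, y, np.array([0.0, 1.0]))   # perfect fit, so zero cost
print(cost_zero, cost_exact)
```

With all parameters at zero the cost is 14/6, and it drops to 0 once the parameters match the line that generated the points.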

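Linear regression with one variable

Cost computation function

def computeCost(X, y, theta):
    # X: (m, n+1) np.matrix, y: (m, 1) np.matrix, theta: (1, n+1) np.matrix
    inner = np.power(((X * theta.T) - y), 2)  # squared error of each sample
    return np.sum(inner) / (2 * len(X))       # sum over samples, divided by 2m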

Gradient descent function
${{\theta }_{j}}:={{\theta }_{j}}-\alpha \frac{\partial }{\partial {{\theta }_{j}}}J\left( \theta \right)$
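
Assuming the squared-error cost $J(\theta)$ and the linear hypothesis $h_\theta$ defined in the cost function section, the partial derivative in the update rule can be expanded with the chain rule. The worked form below is a sketch of that step; the final sum is what the inner loop of gradientDescent evaluates for each $j$, with all $\theta_j$ updated simultaneously.

```latex
\frac{\partial}{\partial \theta_j} J(\theta)
  = \frac{1}{2m} \sum_{i=1}^{m} 2 \left( h_\theta(x^{(i)}) - y^{(i)} \right)
    \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})
  = \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}

\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m}
            \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```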

def gradientDescent(X, y, theta, alpha, iters):
    temp = np.matrix(np.zeros(theta.shape))       # holds the updated parameters
    parameters = int(theta.ravel().shape[1])      # number of parameters
    cost = np.zeros(iters)                        # cost after each iteration

    for i in range(iters):
        error = (X * theta.T) - y                 # prediction error, shape (m, 1)

        for j in range(parameters):
            term = np.multiply(error, X[:,j])     # error times feature j, element-wise
            temp[0,j] = theta[0,j] - ((alpha / len(X)) * np.sum(term))

        theta = temp
        cost[i] = computeCost(X, y, theta)

    return theta, cost

Variable initialization

data.insert(0, 'Ones', 1)
cols = data.shape[1]
X = data.iloc[:,0:cols-1]
y = data.iloc[:,cols-1:cols]
X = np.matrix(X.values)
y = np.matrix(y.values)
theta = np.matrix(np.array([0,0]))
alpha = 0.01
iters = 1000

Running gradient descent

g, cost = gradientDescent(X, y, theta, alpha, iters)

Visualizing the linear model

import matplotlib.pyplot as plt

x = np.linspace(data.Population.min(), data.Population.max(), 100)
f = g[0, 0] + (g[0, 1] * x)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(x, f, 'r', label='Prediction')
ax.scatter(data.Population, data.Profit, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Population')
ax.set_ylabel('Profit')
ax.set_title('Predicted Profit vs. Population Size')
plt.show()

Visualizing the cost function during training

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(np.arange(iters), cost, 'r')
ax.set_xlabel('Iterations')
ax.set_ylabel('Cost')
ax.set_title('Error vs. Training Epoch')
plt.show()

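Vectorizing the gradient descent step makes it run faster than the loop-based version above. The vectorized functions and variables get their own names here so that they do not overwrite the matrix-based versions, which the later sections still call.

import time

filepath = 'ex1data1.txt'
data = pd.read_csv(filepath, header=None, names=['Population', 'Profit'])
data.insert(0, 'Theta0', 1)
cols = data.shape[1]
X_vec = np.array(data.iloc[:, 0:cols-1]).T      # shape (2, m): one column per sample
y_vec = np.array(data.iloc[:, cols-1:cols]).T   # shape (1, m)
theta_vec = np.array([[0, 0]]).T                # shape (2, 1)

def computeCostVec(X, y, theta):
    return np.sum(np.power((np.dot(theta.T, X) - y), 2)) / (2 * X.shape[1])

def gradientDescentVec(X, y, theta, alpha, iters):
    cost = np.zeros(iters)
    for i in range(iters):
        error = np.dot(theta.T, X) - y          # shape (1, m)
        term = error * X                        # broadcast to shape (2, m)
        theta = theta - (alpha / len(X.T)) * np.sum(term, axis=1).reshape(theta.shape)
        cost[i] = computeCostVec(X, y, theta)
    return theta, cost

alpha = 0.01
iters = 1000
before = time.time()
g_vec, cost_vec = gradientDescentVec(X_vec, y_vec, theta_vec, alpha, iters)
after = time.time()
print(after-before)                             # elapsed time in seconds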
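
As a sanity check on the idea of vectorization, the sketch below computes one gradient with an explicit loop over parameters and once more with a single matrix product, then confirms that both agree. It uses synthetic data and a row-per-sample layout (X of shape (m, 2)), which differs from the transposed layout used above; the function names gradient_loop and gradient_vectorized are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 50
# Synthetic data (assumption): a noisy line, with a leading column of ones
X = np.column_stack([np.ones(m), rng.uniform(5, 25, m)])
y = 1.5 * X[:, 1] - 4 + rng.normal(0, 1, m)
theta = np.zeros(2)

def gradient_loop(X, y, theta):
    """Gradient of the squared-error cost, one parameter at a time."""
    error = X @ theta - y
    grad = np.zeros_like(theta)
    for j in range(theta.size):
        grad[j] = np.sum(error * X[:, j]) / len(y)
    return grad

def gradient_vectorized(X, y, theta):
    """The same gradient as a single matrix product."""
    return X.T @ (X @ theta - y) / len(y)

same = np.allclose(gradient_loop(X, y, theta), gradient_vectorized(X, y, theta))
print(same)
```

Both functions return the same vector; the vectorized one simply pushes the loop over parameters into NumPy.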

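Linear regression with multiple variables

Multivariate linear regression follows the same procedure as the single-variable case, with an added feature normalization step.

path =  'ex1data2.txt'
data2 = pd.read_csv(path, header=None, names=['Size', 'Bedrooms', 'Price'])
data2 = (data2 - data2.mean()) / data2.std()   # normalize every column to mean 0, std 1
data2.insert(0, 'Ones', 1)
cols = data2.shape[1]
X2 = data2.iloc[:,0:cols-1]
y2 = data2.iloc[:,cols-1:cols]

X2 = np.matrix(X2.values)
y2 = np.matrix(y2.values)
theta2 = np.matrix(np.array([0,0,0]))

# These calls need the matrix-based gradientDescent and computeCost
# from the single-variable section
g2, cost2 = gradientDescent(X2, y2, theta2, alpha, iters)
computeCost(X2, y2, g2)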
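
The normalization step can be checked in isolation. The sketch below applies the same (value - mean) / std scaling to a small made-up table and reuses the stored statistics to scale a new input; the four rows and the new size of 1650 are illustrative values, not a claim about the contents of ex1data2.txt.

```python
import pandas as pd

# Hypothetical housing rows, used only to illustrate the scaling
houses = pd.DataFrame({
    'Size': [2104, 1600, 2400, 1416],
    'Bedrooms': [3, 3, 3, 2],
    'Price': [399900, 329900, 369000, 232000],
})

mean, std = houses.mean(), houses.std()
normalized = (houses - mean) / std          # every column now has mean 0 and std 1

# A new input must be scaled with the statistics of the training data
new_size_scaled = (1650 - mean['Size']) / std['Size']
print(normalized.round(3))
print(new_size_scaled)
```

Keeping mean and std around matters because a model trained on normalized features can only interpret inputs that were scaled the same way.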

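Linear regression with sklearn

from sklearn import linear_model

# Build plain NumPy arrays from data so the fit does not depend on which
# X and y are currently defined (scikit-learn expects arrays, not np.matrix)
X_sk = data.iloc[:, :-1].values   # the column of ones and Population
y_sk = data.iloc[:, -1].values    # Profit
model = linear_model.LinearRegression()
model.fit(X_sk, y_sk)
x = X_sk[:, 1]
f = model.predict(X_sk)

fig, ax = plt.subplots(figsize=(12,8))
ax.plot(x, f, 'r', label='Prediction')
ax.scatter(x, y_sk, label='Training Data')
ax.legend(loc=2)
ax.set_xlabel('Population')
ax.set_ylabel('Profit')
ax.set_title('Predicted Profit vs. Population Size')
plt.show()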

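Summary of functions used

DataFrame

DataFrame.loc

1. Selecting values

Inside .loc[ ] the row comes first and the column second, separated by a comma; both are given as labels (a row label and a column label).

data.loc[1,'B']

If the DataFrame has an explicit index, the index value goes in the first position: with index=['a', 'b', 'c'], write data.loc['b','B'].
If the DataFrame's columns were never named, the default integer labels are used: data.loc[1,1]

2. Slicing

data.loc[1:2,'B':'C']

.loc always selects by row and column label. If the rows or columns were never given names, their default integer labels are used instead.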

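DataFrame.iloc

Like loc, iloc takes the row first and the column second, but it cannot use labels: values and slices are selected by integer row and column position only. To retrieve the same value 5 as before, you have to write

data.iloc[1,1]

Note:

data.iloc[1:2,:]

The slice data.iloc[1:2,:] returns only the row at position 1, not the row at position 2, because the end of an iloc slice is exclusive.
The slice data.loc[1:2,:] returns the rows labeled 1 and 2, because the end of a loc slice is inclusive.

.iloc always selects by integer position; keep in mind that its slicing behaves differently from loc.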
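
The difference between the two indexers is easiest to see side by side. The sketch below uses a hypothetical 3x3 frame with default integer row labels and columns named A, B and C; the values are made up so that the element at row 1, column B is 5.

```python
import pandas as pd

# Hypothetical frame: default row labels 0, 1, 2 and columns 'A', 'B', 'C'
data = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    columns=['A', 'B', 'C'],
)

by_label = data.loc[1, 'B']       # row label 1, column label 'B'
by_position = data.iloc[1, 1]     # second row, second column

loc_rows = data.loc[1:2, :]       # labels 1 and 2: the end of the slice is included
iloc_rows = data.iloc[1:2, :]     # position 1 only: the end of the slice is excluded

print(by_label, by_position)
print(len(loc_rows), len(iloc_rows))
```

Both lookups return the same element, while the two slices return two rows and one row respectively.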

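DataFrame.ix

Mixed indexing: selects data by label and by row number at the same time. ix also takes two arguments, which control row and column selection in that order.

Note: within each of the two arguments, use either labels only or row numbers only; mixing them inside one argument returns a result that is partly null. Also note that .ix was deprecated in pandas 0.20 and removed in pandas 1.0, so current code should use .loc or .iloc instead.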

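DataFrame.insert

DataFrame.insert(loc, column, value, allow_duplicates=False)

Inserts a column at the given position. If the DataFrame already contains a column with the same label, the insert only succeeds when allow_duplicates is set to True.

Parameters	Notes
loc : int	Insertion index; must satisfy 0 <= loc <= len(columns). The position at which the column is inserted.
column : string, number, or hashable object	Label of the inserted column.
value : int, Series, or array-like	The values of the inserted column.
allow_duplicates : bool, optional	Boolean, optional; whether a duplicate column label is allowed.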
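
A short sketch of how insert behaves, including the duplicate-label case described above. The two-row frame is made up for illustration.

```python
import pandas as pd

# Hypothetical two-row frame
df = pd.DataFrame({'Population': [6.1, 5.5], 'Profit': [17.6, 9.1]})

df.insert(0, 'Ones', 1)                 # a scalar value is repeated for every row
columns_after = list(df.columns)

# Inserting an existing label is rejected unless allow_duplicates=True
try:
    df.insert(1, 'Ones', 0)
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True

df.insert(1, 'Ones', 0, allow_duplicates=True)
print(columns_after, duplicate_rejected, list(df.columns))
```

Note that insert modifies the DataFrame in place and returns None, which is why the exercise code calls it as a statement.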

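numpy

numpy.power

Raises each element to the given power.

numpy.sum

Sums all elements; pass axis to sum along a specific axis instead.

sum(a, axis=None, dtype=None, out=None, keepdims=np._NoValue)

In the parameter list:

a is the vector/array/matrix whose elements are summed.

axis can take three kinds of values: 1. None, 2. an integer, 3. a tuple of integers.
(By default, axis is None.)

If axis is None, all elements of the array/matrix are added together into a single sum.

If axis is an integer, it must be smaller than the number of dimensions of the array/matrix, and different values give different results: for a 2-D array, axis=0 gives one sum per column and axis=1 gives one sum per row.

If axis is a tuple of integers (x, y), the sum is taken over both axis x and axis y.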
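
The cases above can be tried on small arrays. This is a minimal sketch with made-up values; the variable names are arbitrary.

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

squared = np.power(a, 2)          # element-wise square
total = np.sum(a)                 # axis=None: one sum over all elements
col_sums = np.sum(a, axis=0)      # one sum per column
row_sums = np.sum(a, axis=1)      # one sum per row

# A tuple of axes sums over several axes at once
b = np.arange(24).reshape(2, 3, 4)
over_two_axes = np.sum(b, axis=(0, 2))   # axes 0 and 2 are summed away, axis 1 remains

print(squared)
print(total, col_sums, row_sums)
print(over_two_axes)
```

Here total is 21, col_sums is [5 7 9], row_sums is [6 15], and over_two_axes is [60 92 124].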

numpy.multiply

Element-wise multiplication of arrays.

import numpy as np
a = np.array([[1,2],[3,4]])
b = np.array([[5,6],[7,8]])
np.multiply(a,b)

Output

array([[ 5, 12],
       [21, 32]])

np.multiply(2, a)

Output

array([[2, 4],
       [6, 8]])

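numpy.matrix

The NumPy matrix type.

Note that * between matrix objects performs matrix multiplication, whereas * between arrays (equivalent to numpy.multiply) multiplies corresponding elements.

numpy.dot

Matrix multiplication between numpy.array objects.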
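
The following sketch contrasts the three operations on the same pair of 2x2 values as in the numpy.multiply example; it is only meant to show which operator does what.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b                        # arrays: same result as np.multiply(a, b)
matrix_product = np.dot(a, b)              # arrays: true matrix multiplication
matrix_star = np.matrix(a) * np.matrix(b)  # np.matrix: * is matrix multiplication

print(elementwise)
print(matrix_product)
print(matrix_star)
```

The last two results hold the same numbers, [[19, 22], [43, 50]], which is why the matrix-based exercise code can write X * theta.T while the array-based code needs np.dot.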

numpy.ravel

Flattens a multi-dimensional array into one dimension.
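
One detail is worth a quick sketch: ravel on a plain array gives a 1-D result, but on an np.matrix the result stays two-dimensional with shape (1, N). That is why the matrix-based gradientDescent reads the parameter count from shape[1]. The values below are made up.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
flat = a.ravel()                      # 1-D array with shape (4,)

theta = np.matrix(np.array([0, 0]))   # same construction as in the exercise code
matrix_flat = theta.ravel()           # still a matrix, shape (1, 2)
n_parameters = int(matrix_flat.shape[1])

print(flat, flat.shape)
print(matrix_flat.shape, n_parameters)
```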

numpy.linspace

numpy.linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None)

Returns evenly spaced numbers over a specified interval: num evenly spaced samples computed over [start, stop]. The endpoint of the interval can optionally be excluded.

Parameters:

start : scalar

The starting value of the sequence.

stop : scalar

The end value of the sequence, unless endpoint is set to False. In that case the sequence consists of all but the last of num + 1 evenly spaced samples, so that stop is excluded. Note that the step size changes when endpoint is False.

num : int, optional

Number of samples to generate. Default is 50. Must be non-negative.

endpoint : bool, optional

If True, stop is the last sample; if False, stop is not included.

retstep : bool, optional

If True, return (samples, step), where step is the spacing between samples.

dtype : dtype, optional

The type of the output array. If dtype is not given, the data type is inferred from the other input arguments.

New in version 1.9.0.

Returns:

samples : ndarray

There are num equally spaced samples in the closed interval [start, stop] or the half-open interval [start, stop) (depending on whether endpoint is True or False).

step : float (only returned when retstep is True)

Size of spacing between samples.
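
The effect of endpoint and retstep is easiest to see on a tiny interval. A minimal sketch with arbitrary values:

```python
import numpy as np

closed = np.linspace(0, 1, 5)                        # stop is the last sample
half_open = np.linspace(0, 1, 5, endpoint=False)     # stop is excluded, step shrinks
samples, step = np.linspace(0, 1, 5, retstep=True)   # also return the spacing

print(closed)
print(half_open)
print(step)
```

With endpoint=True the five samples are spaced 0.25 apart and end at 1; with endpoint=False the spacing becomes 0.2 and the last sample is 0.8.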
