线性回归李宏毅 HW1 PM2.5

最新推荐文章于 2022-09-30 20:57:34 发布

郡酱~

最新推荐文章于 2022-09-30 20:57:34 发布

阅读量690

点赞数 2

分类专栏：李宏毅机器学习作业

本文链接：https://blog.csdn.net/weixin_40977054/article/details/93062627

版权

李宏毅机器学习作业专栏收录该内容

2 篇文章 0 订阅

订阅专栏

一、作业描述（预测PM2.5）

给定数据训练集train.csv，每个月前20天每个小时的气象资料（每小时有18种测资）共12个月。根据给定数据集根据前9个小时的空气检测情况预测第10个小时的PM2.5含量。
训练集解析：

CSV文件，除去第一行的标题栏共4320行（4320维）4320=12个月×20天×18种监测物
每天的监测时间点为0时，1时…到23时，共24个时间节点;（24列数据）
可以吧数据集看成一个4320×24的矩阵

1.1数据处理

通过观察数据可以发现数据中RAINFALL一列有‘NR’字符表示’No Rain’所以我们可以把NR转换为数字。'NR’为0
然后把每天的数据分块，每18行为一个数据帧（因为有18种元素）
在这里插入图片描述

#read data
n_row = 0
text = open('data/train.csv', 'r', encoding='big5')
row = csv.reader(text, delimiter=',')
for r in row: 
    if n_row != 0: 
        for i in range(3,27): #第三列到第27列是数据
            if r[i] != "NR":
                data[(n_row-1)%18].append(float(r[i])) #每18行分一类
            else:
                data[(n_row-1)%18].append(float(0)) #让NR=0
    n_row = n_row + 1
text.close

训练集中每个月20天，每天24个小时。根据前9个小时观测的所有数据（共18中污染物），也就是189=162个feature输入来预测第10个小时的PM2.5值。由于数据的不连续性，在连续取九个小时的过程中，每个月第20天的后9个小时是没有对应的实际对应的测量值的，所以实际上每个月有 2024-9=471笔连续9小时，也就是每个月有471笔训练数据，一年就有 12471=5652笔训练数据;每笔训练数据包含9个小时的18种污染物189=162，也就是每笔数据有162个feature，需要162个参数外加一个bais需要163各参数。

实际上蓝色方块数据帧为一维数组，及[第一个月、第二个月…、第12个月]
在这里插入图片描述

#parse data to trainX and trainY
x = []
y = []
for i in range(12): #一年12个月
    for j in range(471): #每个月的数据数量： 471=20*24-9
        x.append([]) 
        for t in range(18): #18种观测数据
            for s in range(9): #前9个小时
                x[471*i + j].append(data[t][480*i+j+s]) 
        y.append(data[9][480*i+j+9]) #第十行为PM2.5所以为data[9][]
trainX = np.array(x) #每一行有9*18个数 每9个代表9天的某一种污染物
trainY = np.array(y)

二、模型选择与实现

根据要求，采用逻辑回归模型进行预测

step1：chose model

用前每个种类九个小时的观测值去预测第十个小时的PM2.5每笔训练数据包含9个小时的18种污染物18*9=162，
也就是每笔数据有162个feature，需要162个参数外加一个bais需要163各参数。
$y=\sum_{i=0}^{18*9=162}w_{(i)(j)}x_{(i)(j)}+b$
$x_{(i)(j)}为第 i 个小时第 j 个种类的数据$ ， $w_{(i)(j)}为对应第 i 个小时第 j 个种类的权重$ ， $b 为对应偏置值$ ， $y$ 为对应的预测结果。化为向量形式就是163个元素：162+1（bias）特别注意bias的转化！！矩阵首行添加一列单位向量

w = np.zeros(len(trainX[0]))

在这里插入图片描述

# add bias 
test_x = np.concatenate((np.ones((test_x.shape[0],1)),test_x), axis=1)
trainX = np.concatenate((np.ones((trainX.shape[0],1)), trainX), axis=1)

step2：Goodness of Function

(1)计算LOSS的值：
$L(w,b)=\sum_{n}^{N}(\hat{y}^{n}-y^{n})^{2}$ Y-f(X)表示的是残差，整个式子表示的是残差的平方和，而我们的目的就是最小化这个目标函数值（注：该式子未加入正则项），也就是最小化残差的平方和（residual sum of squares，RSS）。而在实际应用中，通常会使用均方差（MSE）作为一项衡量指标，且为了求导后方便计算，式子会 $*\frac{1}{2}$ 公式如下: $L(w,b)=\frac{1}{2n}\sum_{n}^{8}(\hat{y}^{n}-y^{n})^{2}$

(2)为了让Loss更为平滑，加入正则化系数
$L(w,b)=\frac{1}{2}[ \frac{1}{n}\sum_{n}^{N}(\hat{y}^{n}-y^{n})^{2}+\beta\sum_{i}^{n}w_i^{2} ]$ $\beta$ 为正则项系数

step3：Best Function

对方程求偏微分，找到让Loss的偏微分为0的式子: $arg_{min_{(w,b)}}L=\frac{1}{2}[ \frac{1}{n}\sum_{n}^{8}(\hat{y}^{n}-y^{n})^{2}+\beta\sum_{i}^{n}w_i^{2} ]$
梯度计算：
需明确此时的目标是使Loss最小，而可优化的参数为权重w和偏置值b，因此需要求Loss在w上的偏微分和Loss在b上的偏微分。
在这里插入图片描述
同时更新w和b的值

在更新参数时固定的 learning rate 在训练的时候效果并不好，所以我们使用 Adaptive learning rate

1、Adagrad

" Divid the learning rate of each parameter by the root mean square of its previous derivatives. "
$w^{t+1}=w^t-\frac{n^t}{\sigma^t}g^t$
$n^t=\frac{n}{\sqrt{t+1}}, \sigma^t过去所有微分值得均方根$
在这里插入图片描述

在这里插入图片描述
学习率更新：
采用Adagrad的方法更新训练参数为：

def ada(X, Y, w, eta, iteration, lambdaL2):
    s_grad = np.zeros(len(X[0]))
    list_cost = []
    for i in range(iteration):
        hypo = np.dot(X,w)
        loss = hypo - Y
        cost = np.sum(loss**2)/len(X) #average
        list_cost.append(cost)

        grad = np.dot(X.T, loss)/len(X) + lambdaL2*w
        s_grad += grad**2 #二阶倒约等于一阶导的平方和开更
        ada = np.sqrt(s_grad) 
        w = w - eta*grad/ada
    return w, list_cost

2、Stochastic Gradient Descent (SGD) 随机梯度下降法

上一种方法中，Loss值要涵盖所有的训练集，所以训练速度比较慢。为了让速度提高，我们可以用随机梯度下降法。该方法随机选取一个样本进行梯度下降。

def SGD(X, Y, w, eta, iteration, lambdaL2):
    list_cost = []
    for i in range(iteration):
        hypo = np.dot(X,w)
        loss = hypo - Y
        cost = np.sum(loss**2)/len(X)
        list_cost.append(cost)

        rand = np.random.randint(0, len(X)) #随机取值
        grad = X[rand]*loss[rand]/len(X) + lambdaL2*w
        w = w - eta*grad
    return w, list_cost

训练过程：
可以用close-form-solutuion进行验证，及用求解析解。

#train data
w = np.zeros(len(trainX[0]))
w_sgd, cost_list_sgd = SGD(trainX, trainY, w, eta=0.0001, iteration=20000, lambdaL2=0)
# w_sgd50, cost_list_sgd50 = SGD(trainX, trainY, w, eta=0.0001, iteration=20000, lambdaL2=50)
w_ada, cost_list_ada = ada(trainX, trainY, w, eta=1, iteration=20000, lambdaL2=0)
# w_gd, cost_list_gd = SGD(trainX, trainY, w, eta=0.0001, iteration=20000, lambdaL2=0)

#close form 可以用clos formsolution来检验答案的正确性
w_cf = inv(trainX.T.dot(trainX)).dot(trainX.T).dot(trainY)
cost_wcf = np.sum((trainX.dot(w_cf)-trainY)**2) / len(trainX)
hori = [cost_wcf for i in range(20000-3)]



#output testdata
y_ada = np.dot(test_x, w_ada)
y_sgd = np.dot(test_x, w_sgd)
y_cf = np.dot(test_x, w_cf)

#csv format
ans = []
for i in range(len(test_x)):
    ans.append(["id_"+str(i)])
    a = np.dot(w_ada,test_x[i])
    ans[i].append(a)

filename = "result/predict.csv"
text = open(filename, "w+")
s = csv.writer(text,delimiter=',',lineterminator='\n')
s.writerow(["id","value"])
for i in range(len(ans)):
    s.writerow(ans[i])
text.close()

画出结果图便于观察

#plot training data with different gradiant method
plt.plot(np.arange(len(cost_list_ada[3:])), cost_list_ada[3:], 'b', label="ada")
plt.plot(np.arange(len(cost_list_sgd[3:])), cost_list_sgd[3:], 'g', label='sgd')
# plt.plot(np.arange(len(cost_list_sgd50[3:])), cost_list_sgd50[3:], 'c', label='sgd50')
# plt.plot(np.arange(len(cost_list_gd[3:])), cost_list_gd[3:], 'r', label='gd')
plt.plot(np.arange(len(cost_list_ada[3:])), hori, 'y--', label='close-form')
plt.title('Train Process')
plt.xlabel('Iteration')
plt.ylabel('Loss Function(Quadratic)')
plt.legend()
plt.savefig(os.path.join(os.path.dirname('__file__'), "figures/TrainProcess"))
plt.show()

#plot fianl answer
plt.figure()
plt.subplot(131)
plt.title('CloseForm')
plt.xlabel('dataset')
plt.ylabel('pm2.5')
plt.plot(np.arange((len(ans_y))), ans_y, 'r,')
plt.plot(np.arange(240), y_cf, 'b')
plt.subplot(132)
plt.title('ada')
plt.xlabel('dataset')
plt.ylabel('pm2.5')
plt.plot(np.arange((len(ans_y))), ans_y, 'r,')
plt.plot(np.arange(240), y_ada, 'g')
plt.subplot(133)
plt.title('sgd')
plt.xlabel('dataset')
plt.ylabel('pm2.5')
plt.plot(np.arange((len(ans_y))), ans_y, 'r,')
plt.plot(np.arange(240), y_sgd, 'b')
plt.tight_layout()
plt.savefig(os.path.join(os.path.dirname('__file__'), "figures/Compare"))
plt.show()

三、参考链接

https://www.bilibili.com/video/av10590361?from=search&seid=1336383683270560723
https://www.cnblogs.com/HL-space/p/10676637.html
https://blog.csdn.net/ZHOUJIAN_TANK/article/details/83752987