李宏毅2017HW1 PM2.5 预测

最新推荐文章于 2022-04-02 10:33:22 发布

旋转吧！风火轮

最新推荐文章于 2022-04-02 10:33:22 发布

阅读量1k

点赞数 2

分类专栏：机器学习文章标签： pm2.5预测

本文链接：https://blog.csdn.net/qq_36183881/article/details/91874895

版权

机器学习专栏收录该内容

2 篇文章 0 订阅

订阅专栏

李宏毅2017HW1 PM2.5 预测

实现

引入工具

import csv
import numpy as np
import matplotlib.pyplot as plt

读取数据

需要从train.csv读取出矩阵trainX，trainY。
train.csv有
12个月 x 每个月20天 x 每天24个小时 = 5760个小时的数据
每个小时有18种污染物数据。
首先预处理数据，将所有数据读入。

#read data
data = []
for i in range(18):
    data.append([])

with open('data/train.csv', 'r', encoding='big5') as text:
    row_i=0
    rows=csv.reader(text,delimiter =',')
    for row in rows:
      if row_i!=0:
          #读取第4-27列数据，即colomn[3]-colomn[26]
          for colomn_i in range(3,27):
            if(row[colomn_i] == 'NR'):
                data[(row_i-1)%18].append(float(0))
            else:
                data[(row_i-1)%18].append(float(row[colomn_i]))
      row_i += 1;

整理出来数据像这样：

	0	1	…	…	5760
AMB_TEMP	1月第一天0时	1月第一天1时	…	…	12月第20天23时
CH4	1月第一天0时	1月第一天1时	…	…	12月第20天23时
CO	1月第一天0时	1月第一天1时	…	…	12月第20天23时
…	…	…	…	…	…
WS_HR	1月第一天0时	1月第一天1时	…	…	12月第20天23时

我们的目的是通过前面的9个小时数据，估计第10个小时的pm2.5，因此还需要进一步整理。
由于每个月给了20天的数据，也就是每个月最后9个小时不能构成连续的10个小时。每个月就有20*24-9=471笔数据。

#parse data to trainX and trainY
trainX=[]
trainY=[]
#m*480表示找到对应的那一个月，m*480+d表示找到对应的开始时间
for m in range(0,12):
    for d in range(0,471):
        trainX.append([])
        for row in range(0,18):
            for colomn in range(0,9):
                trainX[m*471+d].append(data[row][m*480+d+colomn])
        #下标为9是pm2.5
        trainY.append(data[9][m*480+d+9])
trainX=np.array(trainX)
trainY=np.array(trainY)

从test.csv中读出矩阵testX。
testX的格式应该和trainX相同。

testX=[]
with open('data/test.csv', 'r', encoding='big5') as text:
    rows=csv.reader(text,delimiter =',')
    row_n=0
    for row in rows:
        #每18行加一个list
        if(row_n % 18 == 0):
            testX.append([])
            for i in range(2,11):
                testX[row_n//18].append(float(row[i]))
        else:
            for i in range(2,11):
                if(row[i]=='NR'):
                    testX[row_n//18].append(float(0))
                else:
                    testX[row_n // 18].append(float(row[i]))
        row_n += 1
testX=np.array(testX)
#test.csv中有240笔数据，按照格式整理之后， 应该是4320/18=240行，18*9=162组数据     
testX=np.array(testX)

	0	1	…	…	162
0	id_0 AMB_TEMP 0时浓度	id_0 AMB_TEMP 1时浓度	…	…	id_0 WS_HR 9时浓度
1	id_1 AMB_TEMP 0时浓度	id_1 AMB_TEMP 1时浓度	…	…	id_1 WS_HR 9时浓度
2	id_2 AMB_TEMP 0时浓度	id_2 AMB_TEMP 1时浓度	…	…	id_2 WS_HR 9时浓度
…	…	…	…	…	…
239	id_239 AMB_TEMP 0时浓度	id_239 AMB_TEMP 1时浓度	…	…	id_239 WS_HR 9时浓度

也就是说，一行里面包含了下图框住的信息。在这里插入图片描述

从ans.csv中读出答案ansY。稍后用来对比预测结果。
ansY就比较简单了，读出来就完事儿。

#parse anser
y=[]
with open("data/ans.csv",'r',encoding='big5') as text:
    rows=csv.reader(text,delimiter=',')
    row_n = 0
    for row in rows:
        if row_n!=0:
            y.append(row[1])
        row_n +=1
y = np.array(list(map(int, y)))
y=y.tolist()

训练模型

设计函数

def ada(trainX,trainY,w,repeat,l_rate):
    loss_process=[]
    s_gra = np.zeros(len(trainX[0]))
    for i in range(repeat):
        hypo = np.dot(trainX, w)
        loss = hypo - trainY
        cost = np.sum(loss ** 2) / len(trainX)
        loss_process.append(cost)

        gra = np.dot(trainX.transpose(), loss)
        s_gra += gra ** 2
        ada = np.sqrt(s_gra)
        w = w - l_rate * gra / ada
        print('iteration : %d | Cost: %f ' % (i + 1, cost))
    return  w,loss_process

获得结果

调用函数，分析结果。

testX = np.concatenate((np.ones((testX.shape[0],1)),testX), axis=1)
trainX = np.concatenate((np.ones((trainX.shape[0],1)), trainX), axis=1)
#train data
w = np.zeros(len(trainX[0]))
w1,loss_list1 = ada(trainX,trainY,w,1000,1)

#save as csv,get result
predictY1=test(testX,w1,"mygd/mylearn_ada")

save_csv定义：

def test(testX,w,res_loc):
    predictY=[]
    ans=[]
    for i in range(len(testX)):
        ans.append(['id_' + str(i)])
        predicty = np.dot(testX[i], w)
        predictY.append(predicty)
        ans[i].append(predicty)

    text = open(res_loc, 'w+')
    s = csv.writer(text, delimiter=',', lineterminator='\n')
    s.writerow(['id', 'value'])
    for i in range(len(ans)):
        s.writerow(ans[i])
    text.close()
    return predictY

画图分析

plt.figure(figsize = (13, 7))
plt.plot(np.arange(0,240,1),predictY1,'r',label = 'predict pm2.5,ada')
# plt.plot(np.arange(0,240,1),predictY2,'y',label = 'predict pm2.5,gd')
# plt.plot(np.arange(0,240,1),predictY3,'p',label = 'predict pm2.5,sgd')
plt.plot(np.arange(0,240,1),y,'b',label='ans pm2.5')
plt.legend()
plt.show()

在这里插入图片描述

附加

另外给出stotastic gradient descent和gradient descent的函数，自己调用的时候总是overflow，再学一学python之后，回来看。

def SGD(trainX,trainY,w,repeat,l_rate):
    list_cost = []
    for i in range(repeat):
        hypo = np.dot(trainX,w)
        loss = hypo - trainY
        # cost = np.sum(loss**2)/len(trainX)
        cost = np.sum(abs(loss)) / len(trainX)
        list_cost.append(cost)

        rand = np.random.randint(0, len(trainX))
        grad = trainX[rand]*loss[rand]/len(trainX) #overflow
        w = w - l_rate*grad
    return w, list_cost
def GD(trainX,trainY,w,repeat,l_rate):
    list_cost = []
    for i in range(repeat):
        hypo = np.dot(trainX,w)
        loss = hypo - trainY
        cost = np.sum(abs(loss))/len(trainX)
        list_cost.append(cost)

        grad = np.dot(trainX.T, loss)/len(trainX)
        w = w - l_rate*grad
    return w, list_cost

参考资料

资料一

https://www.jianshu.com/p/9a76d453be6d
这个老哥讲的很清楚。
但是代码细节有猫病。再读取ans.csv的时候，没有把数据转化成int型的。应该添上下面的代码。

y = np.array(list(map(int, y)))
y=y.tolist()

在这里插入图片描述
不加的话就会变成下面这样子，改变learning rate和迭代次数当然毫无卵用。

资料二

链接：https://pan.baidu.com/s/1T5FnFvjcPZ-J-fNVCT76ow
提取码：qpjn
这个是github上面fork下来的，提供了测试数据，训练数据，最后还有训练速度的对比。
简洁明了。

旋转吧！风火轮

关注

2
点赞
踩
8

收藏

觉得还不错? 一键收藏
5
评论
李宏毅2017HW1 PM2.5 预测

李宏毅2017HW1 PM2.5 预测实现读取数据训练模型获得结果画图分析参考资料实现读取数据需要从train.csv读取出矩阵trainX，trainY。train.csv有12个月 x 每个月20天 x 每天24个小时 = 5760个小时的数据每个小时有18种污染物数据。首先预处理数据，将所有数据读入。#read datan_row = 0text = open('da...
复制链接

扫一扫