Hung-yi Lee Machine Learning Bootcamp - PM2.5 Prediction

PM2.5 Prediction

Course link: Hung-yi Lee's course, Machine Learning

Dataset overview

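  • This assignment uses observation records from the Fengyuan (豐原) station, split into a train set and a test set. The train set contains all data from the first 20 days of each month; the test set is sampled from the remaining data.
  • train.csv: the complete data for the first 20 days of each month.
  • test.csv: each sample is 10 consecutive hours drawn from the remaining data. All observations from the first nine hours are the features, and the PM2.5 value of the tenth hour is the answer. There are 240 distinct test samples in total; predict the PM2.5 value of each one from its features.
  • The data contains 18 observation items: AMB_TEMP, CH4, CO, NMHC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR.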
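The sizes implied by this description can be worked out before touching the data. The sketch below is arithmetic only, and it assumes the 20 training days of each month are treated as one continuous 480-hour series, which is how the sample-building code in this post handles them. The constant names are illustrative.

```python
# Sizes implied by the dataset description (arithmetic only; no data is read).
N_FEATURES = 18             # observation items recorded every hour
WINDOW = 9                  # hours of history used as input
HOURS_PER_MONTH = 20 * 24   # 20 training days per month

# Length of one flattened feature vector: 18 items x 9 hours.
input_dim = N_FEATURES * WINDOW
# Number of 10-hour windows (9 input hours + 1 target hour) inside one month.
windows_per_month = HOURS_PER_MONTH - WINDOW
# Training samples across all 12 months.
train_samples = 12 * windows_per_month

print(input_dim, windows_per_month, train_samples)  # 162 471 5652
```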

1. Data analysis

import os
import pandas as pd
import numpy as np

data=pd.read_csv("work/hw1_data/train.csv", encoding="big5")

1.1 Inspecting the data

data.head(20)
    日期        測站  測項             0     1     2     3     4     5     6   ...    14    15    16    17    18    19    20    21    22    23
0   2014/1/1  豐原  AMB_TEMP      14    14    14    13    12    12    12   ...    22    22    21    19    17    16    15    15    15    15
1   2014/1/1  豐原  CH4          1.8   1.8   1.8   1.8   1.8   1.8   1.8   ...   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8
2   2014/1/1  豐原  CO          0.51  0.41  0.39  0.37  0.35   0.3  0.37   ...  0.37  0.37  0.47  0.69  0.56  0.45  0.38  0.35  0.36  0.32
3   2014/1/1  豐原  NMHC         0.2  0.15  0.13  0.12  0.11  0.06   0.1   ...   0.1  0.13  0.14  0.23  0.18  0.12   0.1  0.09   0.1  0.08
4   2014/1/1  豐原  NO           0.9   0.6   0.5   1.7   1.8   1.5   1.9   ...   2.5   2.2   2.5   2.3   2.1   1.9   1.5   1.6   1.8   1.5
5   2014/1/1  豐原  NO2           16   9.2   8.2   6.9   6.8   3.8   6.9   ...    11    11    22    28    19    12   8.1     7   6.9     6
6   2014/1/1  豐原  NOx           17   9.8   8.7   8.6   8.5   5.3   8.8   ...    14    13    25    30    21    13   9.7   8.6   8.7   7.5
7   2014/1/1  豐原  O3            16    30    27    23    24    28    24   ...    65    64    51    34    33    34    37    38    38    36
8   2014/1/1  豐原  PM10          56    50    48    35    25    12     4   ...    52    51    66    85    85    63    46    36    42    42
9   2014/1/1  豐原  PM2.5         26    39    36    35    31    28    25   ...    36    45    42    49    45    44    41    30    24    13
10  2014/1/1  豐原  RAINFALL      NR    NR    NR    NR    NR    NR    NR   ...    NR    NR    NR    NR    NR    NR    NR    NR    NR    NR
11  2014/1/1  豐原  RH            77    68    67    74    72    73    74   ...    47    49    56    67    72    69    70    70    70    69
12  2014/1/1  豐原  SO2          1.8     2   1.7   1.6   1.9   1.4   1.5   ...   3.9   4.4   9.9   5.1   3.4   2.3     2   1.9   1.9   1.9
13  2014/1/1  豐原  THC            2     2     2   1.9   1.9   1.8   1.9   ...   1.9   1.9   1.9   2.1     2   1.9   1.9   1.9   1.9   1.9
14  2014/1/1  豐原  WD_HR         37    80    57    76   110   106   101   ...   307   304   307   124   118   121   113   112   106   110
15  2014/1/1  豐原  WIND_DIREC    35    79   2.4    55    94   116   106   ...   313   305   291   124   119   118   114   108   102   111
16  2014/1/1  豐原  WIND_SPEED   1.4   1.8     1   0.6   1.7   2.5   2.5   ...   2.5   2.2   1.4   2.2   2.8     3   2.6   2.7   2.1   2.1
17  2014/1/1  豐原  WS_HR        0.5   0.9   0.6   0.3   0.6   1.9     2   ...   2.1   2.1   1.9     1   2.5   2.5   2.8   2.6   2.4   2.3
18  2014/1/2  豐原  AMB_TEMP      16    15    15    14    14    15    16   ...    24    24    23    21    20    19    18    18    18    18
19  2014/1/2  豐原  CH4          1.8   1.8   1.8   1.8   1.8   1.8   1.8   ...   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8

20 rows × 27 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4320 entries, 0 to 4319
Data columns (total 27 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   日期      4320 non-null   object
 1   測站      4320 non-null   object
 2   測項      4320 non-null   object
 3   0       4320 non-null   object
 4   1       4320 non-null   object
 5   2       4320 non-null   object
 6   3       4320 non-null   object
 7   4       4320 non-null   object
 8   5       4320 non-null   object
 9   6       4320 non-null   object
 10  7       4320 non-null   object
 11  8       4320 non-null   object
 12  9       4320 non-null   object
 13  10      4320 non-null   object
 14  11      4320 non-null   object
 15  12      4320 non-null   object
 16  13      4320 non-null   object
 17  14      4320 non-null   object
 18  15      4320 non-null   object
 19  16      4320 non-null   object
 20  17      4320 non-null   object
 21  18      4320 non-null   object
 22  19      4320 non-null   object
 23  20      4320 non-null   object
 24  21      4320 non-null   object
 25  22      4320 non-null   object
 26  23      4320 non-null   object
dtypes: object(27)
memory usage: 911.4+ KB
data.describe()
        日期         測站   測項              0     1     2     3     4     5     6   ...    14    15    16    17    18    19    20    21    22    23
count   4320       4320  4320         4320  4320  4320  4320  4320  4320  4320   ...  4320  4320  4320  4320  4320  4320  4320  4320  4320  4320
unique  240        1     18            369   361   351   355   353   342   356   ...   423   411   409   423   405   374   366   374   382   370
top     2014/1/20  豐原  WIND_SPEED     NR    NR    NR    NR    NR    NR    NR   ...    NR    NR    NR    NR    NR    NR    NR    NR    NR    NR
freq    18         4320  240           221   225   229   226   229   230   226   ...   220   219   221   221   222   223   225   224   226   224

4 rows × 27 columns

1.2 Feature extraction

# Keep the fourth column onward (the 24 hourly values) and replace the RAINFALL marker 'NR' with 0
data=data.iloc[:,3:]
data[data=='NR']=0
numpy_data=data.to_numpy()
# The RAINFALL row (index 10) now shows 0 wherever it was 'NR'
data.head(18)
         0     1     2     3     4     5     6     7     8     9   ...    14    15    16    17    18    19    20    21    22    23
0       14    14    14    13    12    12    12    12    15    17   ...    22    22    21    19    17    16    15    15    15    15
1      1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   ...   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8
2     0.51  0.41  0.39  0.37  0.35   0.3  0.37  0.47  0.78  0.74   ...  0.37  0.37  0.47  0.69  0.56  0.45  0.38  0.35  0.36  0.32
3      0.2  0.15  0.13  0.12  0.11  0.06   0.1  0.13  0.26  0.23   ...   0.1  0.13  0.14  0.23  0.18  0.12   0.1  0.09   0.1  0.08
4      0.9   0.6   0.5   1.7   1.8   1.5   1.9   2.2   6.6   7.9   ...   2.5   2.2   2.5   2.3   2.1   1.9   1.5   1.6   1.8   1.5
5       16   9.2   8.2   6.9   6.8   3.8   6.9   7.8    15    21   ...    11    11    22    28    19    12   8.1     7   6.9     6
6       17   9.8   8.7   8.6   8.5   5.3   8.8   9.9    22    29   ...    14    13    25    30    21    13   9.7   8.6   8.7   7.5
7       16    30    27    23    24    28    24    22    21    29   ...    65    64    51    34    33    34    37    38    38    36
8       56    50    48    35    25    12     4     2    11    38   ...    52    51    66    85    85    63    46    36    42    42
9       26    39    36    35    31    28    25    20    19    30   ...    36    45    42    49    45    44    41    30    24    13
10       0     0     0     0     0     0     0     0     0     0   ...     0     0     0     0     0     0     0     0     0     0
11      77    68    67    74    72    73    74    73    66    56   ...    47    49    56    67    72    69    70    70    70    69
12     1.8     2   1.7   1.6   1.9   1.4   1.5   1.6   5.1    15   ...   3.9   4.4   9.9   5.1   3.4   2.3     2   1.9   1.9   1.9
13       2     2     2   1.9   1.9   1.8   1.9   1.9   2.1     2   ...   1.9   1.9   1.9   2.1     2   1.9   1.9   1.9   1.9   1.9
14      37    80    57    76   110   106   101   104   124    46   ...   307   304   307   124   118   121   113   112   106   110
15      35    79   2.4    55    94   116   106    94   232   153   ...   313   305   291   124   119   118   114   108   102   111
16     1.4   1.8     1   0.6   1.7   2.5   2.5     2   0.6   0.8   ...   2.5   2.2   1.4   2.2   2.8     3   2.6   2.7   2.1   2.1
17     0.5   0.9   0.6   0.3   0.6   1.9     2     2   0.5   0.3   ...   2.1   2.1   1.9     1   2.5   2.5   2.8   2.6   2.4   2.3

18 rows × 24 columns
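Since data.info() reports every column as object, the values are still strings at this point. One way to make the numeric intent explicit is to replace the marker and cast in a single step. This is a minimal sketch on a made-up two-row frame, not the code used in this post; the name toy and its values are illustrative.

```python
import pandas as pd

# Toy frame standing in for two rows of the hourly columns (values are made up).
toy = pd.DataFrame({"0": ["14", "NR"], "1": ["1.8", "NR"]})

# Replace the "no rain" marker with "0", then cast every column to float.
toy_numeric = toy.replace("NR", "0").astype(float)

print(toy_numeric.dtypes.tolist())
print(toy_numeric.to_numpy())
```

After the cast, to_numpy() returns a float array, so later assignments into float buffers need no implicit string conversion.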

# 4320 rows (RangeIndex 0 to 4319), 24 columns (hours), 18 features per day
# 4320 / 18 = 240 days = 12 months x 20 days; each month has 20 x 24 = 480 hours
month_data={}
for month in range(12):
    # one month of data: 18 features x 480 hours
    sample=np.empty([18,480])
    # fill in one day at a time
    for day in range(20):
        # each day contributes an 18 x 24 block (18 features, 24 hours)
        sample[:, day*24:(day+1)*24]=numpy_data[18*(20*month +day): 18*(20*month +day+1),:]
    month_data[month]=sample
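The same regrouping can be written as a single reshape and transpose. The sketch below is an alternative, not the code used in this post. It assumes the rows are ordered by day with one block of features per day, as in train.csv, and it uses small made-up dimensions so the result is easy to check. The helper name to_monthly is hypothetical.

```python
import numpy as np

def to_monthly(raw, n_months, n_days, n_feat, n_hours):
    """Regroup (n_months*n_days*n_feat, n_hours) rows into (n_months, n_feat, n_days*n_hours)."""
    # Axes after reshape: month, day, feature, hour.
    blocks = raw.reshape(n_months, n_days, n_feat, n_hours)
    # Put feature before day, then merge day and hour into one time axis.
    return blocks.transpose(0, 2, 1, 3).reshape(n_months, n_feat, n_days * n_hours)

# Toy input: 2 months, 3 days per month, 4 features, 5 hours per day.
raw = np.arange(2 * 3 * 4 * 5, dtype=float).reshape(2 * 3 * 4, 5)
monthly = to_monthly(raw, n_months=2, n_days=3, n_feat=4, n_hours=5)
print(monthly.shape)  # (2, 4, 15)
```

For the real data the call would be to_monthly(numpy_data.astype(float), 12, 20, 18, 24), giving an array of shape (12, 18, 480) whose first axis plays the role of the month_data dictionary.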

1.3 Building training samples

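# features: one row per sample, 18 features x 9 hours
x=np.empty([12*471,18*9],dtype=float)
# targets: PM2.5 at the 10th hour
y=np.empty([12*471,1],dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            # on the last day, skip start hours after 14: no 10th hour is left in the month
            if day==19 and hour>14:
                continue
            # the 18 features over the 9-hour window starting here, flattened to 162 values
            x[month*471+day*24+hour,:]=month_data[month][:,day*24+hour:day*24+hour+9].reshape(1,-1)
            # PM2.5 (row 9) at the 10th hour is the target
            y[month*471+day*24+hour,0]=month_data[month][9,day*24+hour+9]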
print(x)
print(y)
[[14.  14.  14.  ...  2.   2.   0.5]
 [14.  14.  13.  ...  2.   0.5  0.3]
 [14.  13.  12.  ...  0.5  0.3  0.8]
 ...
 [17.  18.  19.  ...  1.1  1.4  1.3]
 [18.  19.  18.  ...  1.4  1.3  1.6]
 [19.  18.  17.  ...  1.3  1.6  1.8]]
[[30.]
 [41.]
 [44.]
 ...
 [17.]
 [24.]
 [29.]]
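The triple loop above is a sliding window over each month's 480-hour series. The sketch below shows one way to express that with numpy.lib.stride_tricks.sliding_window_view (available in NumPy 1.20 and later). It is an alternative under that assumption, not the code used in this post, and the helper name make_samples is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def make_samples(month, target_row, window=9):
    """Turn one (n_feat, n_hours) month into flattened windows x and next-hour targets y."""
    n_feat, n_hours = month.shape
    n = n_hours - window  # windows that still have a following hour to predict
    # Shape (n_feat, n_hours - window + 1, window): every run of `window` hours.
    views = sliding_window_view(month, window, axis=1)
    # Drop the last window (no target), put the sample axis first, flatten feature-major.
    x = views[:, :n, :].transpose(1, 0, 2).reshape(n, n_feat * window)
    # The target is the chosen row at the hour right after each window.
    y = month[target_row, window:].reshape(n, 1)
    return x, y

# Toy month: 3 features, 12 hours, so 3 samples of 3 * 9 = 27 values each.
month_toy = np.arange(36, dtype=float).reshape(3, 12)
x_toy, y_toy = make_samples(month_toy, target_row=1)
print(x_toy.shape, y_toy.shape)  # (3, 27) (3, 1)
```

With an 18 x 480 month and target_row=9 (the PM2.5 row), this yields 471 samples of 162 values, matching the shapes used in the loop.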

1.4 Normalization

mean_x = np.mean(x, axis = 0) #18 * 9 
std_x = np.std(x, axis = 0) #18 * 9 
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9 
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))
[[-1.35825331 -1.35883937 -1.359222   ...  0.26650729  0.2656797
  -1.14082131]
 [-1.35825331 -1.35883937 -1.51819928 ...  0.26650729 -1.13963133
  -1.32832904]
 [-1.35825331 -1.51789368 -1.67717656 ... -1.13923451 -1.32700613
  -0.85955971]
 ...
 [ 0.86929969  0.70886668  0.38952809 ...  1.39110073  0.2656797
  -0.39079039]
 [ 0.71018876  0.39075806  0.07157353 ...  0.26650729 -0.39013211
  -0.39079039]
 [ 0.3919669   0.07264944  0.07157353 ... -0.38950555 -0.39013211
  -0.85955971]]
[[30.]
 [41.]
 [44.]
 ...
 [ 7.]
 [ 5.]
 [14.]]
[[ 0.07374504  0.07264944  0.07157353 ... -0.38950555 -0.85856912
  -0.57829812]
 [ 0.07374504  0.07264944  0.23055081 ... -0.85808615 -0.57750692
   0.54674825]
 [ 0.07374504  0.23170375  0.23055081 ... -0.57693779  0.54674191
  -0.1095288 ]
 ...
 [-0.88092053 -0.72262212 -0.56433559 ... -0.57693779 -0.29644471
  -0.39079039]
 [-0.7218096  -0.56356781 -0.72331287 ... -0.29578943 -0.39013211
  -0.1095288 ]
 [-0.56269867 -0.72262212 -0.88229015 ... -0.38950555 -0.10906991
   0.07797893]]
[[13.]
 [24.]
 [22.]
 ...
 [17.]
 [24.]
 [29.]]
4521
4521
1131
1131
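The element-by-element loop can also be written with NumPy broadcasting. The sketch below keeps the same rule as the loop (columns with zero standard deviation are left unchanged) and returns the statistics so that the same mean and standard deviation can be reused on other data. The helper name standardize is hypothetical.

```python
import numpy as np

def standardize(x, mean=None, std=None):
    """Column-wise (x - mean) / std; columns with std == 0 are returned unchanged."""
    if mean is None:
        mean = x.mean(axis=0)
    if std is None:
        std = x.std(axis=0)
    # Avoid dividing by zero; those columns are replaced by the original values below.
    safe_std = np.where(std == 0, 1.0, std)
    out = np.where(std == 0, x, (x - mean) / safe_std)
    return out, mean, std

# Toy data: the second column is constant, so it has zero standard deviation.
toy_x = np.array([[1.0, 5.0], [3.0, 5.0]])
toy_out, toy_mean, toy_std = standardize(toy_x)
print(toy_out)
```

Passing the returned mean and std back in, as in standardize(other, toy_mean, toy_std), applies the training statistics to new rows instead of recomputing them.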

2. Training

# the extra 1 is for the bias (constant) term
dim = 18 * 9 + 1
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)
learning_rate = 100
iter_time = 3000
adagrad = np.zeros([dim, 1])
# small constant that keeps the divisor from being 0
eps = 0.0000000001
for t in range(iter_time):
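    # root-mean-square error over all 12 * 471 samples
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse
    # print the loss every 100 iterations
    if(t%100==0):
        print(str(t) + ":" + str(loss))
    # gradient of the summed squared error with respect to w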
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) #dim*1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
0:27.071214829194115
100:33.78905859777454
200:19.913751298197102
300:13.531068193689693
400:10.645466158446172
500:9.277353455475067
600:8.518042045956506
700:8.014061987588427
800:7.636756824775698
900:7.336563740371129
1000:7.090968643947224
1100:6.887311480324104
1200:6.717116295730699
1300:6.574102121171872
1400:6.453381172520272
1500:6.351062466046938
1600:6.264012287766374
1700:6.18968838345321
1800:6.126016764546809
1900:6.071297135545992
2000:6.024129170225593
2100:5.98335455636348
2200:5.94801117011856
2300:5.9172966606664765
2400:5.890539375339805
2500:5.867175036840055
2600:5.846727947724756
2700:5.82879577424176
2800:5.8130371731415105
2900:5.799161687153943
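The loop above fits on all 12 * 471 samples, so the printed loss is a training error. The sketch below wraps the same Adagrad update in a function and adds an RMSE helper, so that a held-out split can be scored separately. It runs on synthetic data with made-up coefficients; the helper names add_bias, rmse and fit_adagrad are hypothetical, and the default settings are illustrative rather than tuned.

```python
import numpy as np

def add_bias(x):
    """Prepend a column of ones for the constant term."""
    return np.concatenate((np.ones((len(x), 1)), x), axis=1)

def rmse(xb, w, y):
    """Root-mean-square error of the linear model xb @ w against y."""
    return float(np.sqrt(np.mean((xb @ w - y) ** 2)))

def fit_adagrad(xb, y, lr=100.0, iters=500, eps=1e-10):
    """Linear regression on summed squared error with the Adagrad step used above."""
    w = np.zeros((xb.shape[1], 1))
    grad_sq_sum = np.zeros_like(w)  # running sum of squared gradients
    for _ in range(iters):
        grad = 2 * xb.T @ (xb @ w - y)
        grad_sq_sum += grad ** 2
        w = w - lr * grad / np.sqrt(grad_sq_sum + eps)
    return w

# Synthetic, noise-free data: y = 4 + 1*x0 - 2*x1 + 0.5*x2 (coefficients are made up).
rng = np.random.default_rng(0)
x_syn = rng.normal(size=(200, 3))
y_syn = 4.0 + x_syn @ np.array([[1.0], [-2.0], [0.5]])

xb_syn = add_bias(x_syn)
rmse_before = rmse(xb_syn, np.zeros((4, 1)), y_syn)
w_fit = fit_adagrad(xb_syn, y_syn)
rmse_after = rmse(xb_syn, w_fit, y_syn)
print(rmse_before, rmse_after)
```

With the arrays from section 1.4, fitting on add_bias(x_train_set) with y_train_set and then calling rmse on add_bias(x_validation) with y_validation would give a validation error to compare against the training loss.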

3. Testing

test_data=pd.read_csv("work/hw1_data/test.csv",header=None, encoding="big5")
test_data.head()
      0         1     2     3     4     5     6     7     8     9    10
0  id_0  AMB_TEMP    21    21    20    20    19    19    19    18    17
1  id_0       CH4   1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.8
2  id_0        CO  0.39  0.36  0.36   0.4  0.53  0.55  0.34  0.31  0.23
3  id_0      NMHC  0.16  0.24  0.22  0.27  0.27  0.26  0.27  0.29   0.1
4  id_0        NO   1.3   1.3   1.3   1.3   1.4   1.6   1.2   1.1   0.9
test_data=test_data.iloc[:,2:]
test_data.head(18)
         2     3     4     5     6     7     8     9    10
0       21    21    20    20    19    19    19    18    17
1      1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.8
2     0.39  0.36  0.36   0.4  0.53  0.55  0.34  0.31  0.23
3     0.16  0.24  0.22  0.27  0.27  0.26  0.27  0.29   0.1
4      1.3   1.3   1.3   1.3   1.4   1.6   1.2   1.1   0.9
5       17    14    13    14    18    21   8.9   9.4     5
6       18    16    14    15    20    23    10    10   5.8
7       32    31    31    26    16    12    27    20    26
8       62    50    44    39    38    32    48    36    25
9       33    39    39    25    18    18    17     9     4
10      NR    NR    NR    NR    NR    NR    NR    NR    NR
11      83    85    87    87    86    85    78    81    80
12       2   1.8   1.8   1.8   2.1   2.6     2   2.3   2.4
13     1.8   1.9   1.9     2     2     2     2     2   1.9
14      58    53    67    59    59    73    79    82   104
15      57    44    73    44    56   115    45   107   103
16     1.4   1.3   1.5   1.4   1.6   1.6   1.2   1.8   2.3
17       1   0.9   0.9   0.9   1.2   0.7     1   0.6   1.8
# replace the RAINFALL marker 'NR' with 0
test_data[test_data == 'NR'] = 0
test_data=test_data.to_numpy()
# 240 samples, each flattened to 18 * 9 values
test_x=np.empty([240,18*9], dtype=float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)
# normalize with the mean and standard deviation computed on the training data
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
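Because each test sample already occupies 18 consecutive rows of 9 hourly values, the per-sample loop is equivalent to one row-major reshape. The sketch below checks this on a toy array with made-up dimensions; it is an alternative to the loop, not the code used in this post.

```python
import numpy as np

# Toy input: 2 samples, 3 features per sample, 4 hours per feature.
n_samples, n_feat, n_hours = 2, 3, 4
raw_test = np.arange(n_samples * n_feat * n_hours, dtype=float).reshape(n_samples * n_feat, n_hours)

# Row-major reshape: each block of n_feat rows becomes one flattened sample.
flat = raw_test.reshape(n_samples, n_feat * n_hours)

# Reference result built the same way as the loop over samples.
loop = np.empty((n_samples, n_feat * n_hours))
for i in range(n_samples):
    loop[i, :] = raw_test[n_feat * i: n_feat * (i + 1), :].reshape(1, -1)

print(np.array_equal(flat, loop))  # True
```

For the real data this corresponds to test_data.astype(float).reshape(240, 18 * 9), applied after the 'NR' entries have been replaced.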

4. Prediction

w=np.load("weight.npy")
ans_y=np.dot(test_x,w)

5. Saving the results

import csv
with open("submit.csv", mode="w",newline='') as submit_file:
    csv_writer=csv.writer(submit_file)
    header=['id','value']
    csv_writer.writerow(header)
    for i in range(240):
        row=["id_" +str (i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)
['id_0', 6.664170072810432]
['id_1', 18.06982655390777]
['id_2', 22.749942554291376]
['id_3', 8.180003749823934]
['id_4', 27.299558542667313]
['id_5', 21.792597812878892]
['id_6', 23.292841352717286]
['id_7', 30.78108127047032]
['id_8', 16.586402887372365]
['id_9', 60.38550637230346]
['id_10', 12.993646232319968]
['id_11', 10.70200092475075]
['id_12', 65.64215329134743]
['id_13', 52.053427445578166]
['id_14', 21.375482365639467]
['id_15', 11.574783938138383]
['id_16', 32.25504500615686]
['id_17', 67.31566462493525]
['id_18', -0.6161598259829901]
['id_19', 17.665748375976165]
['id_20', 40.895099614627014]
['id_21', 72.26013598885945]
['id_22', 9.638987238738466]
['id_23', 17.84024484244622]
['id_24', 15.127968656872923]
['id_25', 36.92049576207732]
['id_26', 15.646177020329155]
['id_27', 72.87300224422786]
['id_28', 8.418221455802819]
['id_29', 55.413280515043375]
['id_30', 24.051900611720047]
['id_31', 7.595635842604071]
['id_32', 2.80866765727627]
['id_33', 20.849380390807408]
['id_34', 28.71170267153394]
['id_35', 37.875874390298236]
['id_36', 44.1847957487677]
['id_37', 30.146475294376234]
['id_38', 42.03786649980424]
['id_39', 34.65230321491354]
['id_40', 7.9113030707433]
['id_41', 41.395781258982645]
['id_42', 31.05374516132014]
['id_43', 50.00315759679335]
['id_44', 17.211685655310127]
['id_45', 35.19296162566461]
['id_46', 25.11387615254853]
['id_47', 9.720234223141482]
['id_48', 24.515130798666405]
['id_49', 31.47494455467578]
['id_50', 19.730949751475126]
['id_51', 8.913291561410212]
['id_52', 20.08798021632116]
['id_53', 53.27254568096532]
['id_54', 15.401087303450515]
['id_55', 37.67053940110959]
['id_56', 32.77899247186974]
['id_57', 21.41123889586001]
['id_58', 57.44806119989436]
['id_59', 21.678187560593535]
['id_60', 13.924095056569524]
['id_61', 41.16552226689072]
['id_62', 12.220232179497945]
['id_63', 48.60016487461208]
['id_64', 15.449295044541138]
['id_65', 15.038868567741854]
['id_66', 14.77659312157298]
['id_67', -0.12393970622780448]
['id_68', 43.013878736068186]
['id_69', 29.769997719853897]
['id_70', 19.4049186045943]
['id_71', 41.43270974491587]
['id_72', 60.011021614481734]
['id_73', 5.790473833735811]
['id_74', 15.060835293176906]
['id_75', 4.607634748943204]
['id_76', 42.4516627197642]
['id_77', 14.519597292642706]
['id_78', 21.144383868224526]
['id_79', 21.458453159307826]
['id_80', 23.47307510296709]
['id_81', 37.15080977541598]
['id_82', 23.196667102835374]
['id_83', 91.56436218225427]
['id_84', 37.01454023799413]
['id_85', 27.381368065697778]
['id_86', 22.51258704891688]
['id_87', 33.82908925642835]
['id_88', 23.7163185502162]
['id_89', 20.662333868645504]
['id_90', 27.16043037321573]
['id_91', 41.32360066432416]
['id_92', 4.950227076793279]
['id_93', 37.37826866950743]
['id_94', 45.66315465680254]
['id_95', 17.302825821514663]
['id_96', 32.2655664599684]
['id_97', 12.80342575817804]
['id_98', 24.514520107876013]
['id_99', 3.341263213367584]
['id_100', 17.7448081070595]
['id_101', 26.872663627227507]
['id_102', 13.432976335449231]
['id_103', 16.18361580153564]
['id_104', 23.745951452171397]
['id_105', 39.561980585215764]
['id_106', 30.569407437373993]
['id_107', 5.820516753820944]
['id_108', 7.776261212910277]
['id_109', 77.0360134901067]
['id_110', 48.51823236080682]
['id_111', 16.64399607916922]
['id_112', 27.89744678811886]
['id_113', 14.83503023603293]
['id_114', 12.350386773467099]
['id_115', 25.546454972791217]
['id_116', 27.295384379560794]
['id_117', 8.96325256218069]
['id_118', 17.270871337434464]
['id_119', 19.504275336220687]
['id_120', 80.42775154412193]
['id_121', 24.520714650707642]
['id_122', 36.60616009374052]
['id_123', 25.656459693387426]
['id_124', 6.756581151491991]
['id_125', 36.283040396068415]
['id_126', 8.449660049798918]
['id_127', 20.78541125464159]
['id_128', 28.07463452536681]
['id_129', 62.379890206239324]
['id_130', 20.29238599569938]
['id_131', 22.28870198252149]
['id_132', 59.33642653123038]
['id_133', 15.045482592933437]
['id_134', 14.1280592394024]
['id_135', 2.100755891474563]
['id_136', 11.483021758532646]
['id_137', 58.69987491139861]
['id_138', 21.30603819937202]
['id_139', 5.202835758613455]
['id_140', 30.614122638778852]
['id_141', 25.47776196375223]
['id_142', 46.22992006563068]
['id_143', 33.11002087024068]
['id_144', 17.46932186570128]
['id_145', 26.410956914472152]
['id_146', 13.159193439136454]
['id_147', 52.625202084698806]
['id_148', 23.811496585530058]
['id_149', 38.662302933242884]
['id_150', 11.044546263121934]
['id_151', 7.924147960146024]
['id_152', 23.872212660168767]
['id_153', 6.645445699285511]
['id_154', 15.452061334036236]
['id_155', 42.18119392984227]
['id_156', 7.590869211709671]
['id_157', 35.85064322159984]
['id_158', 11.648548436310495]
['id_159', 18.140214754345795]
['id_160', 40.82481616283134]
['id_161', 19.500250826726067]
['id_162', 11.927610907650287]
['id_163', 8.715880232728168]
['id_164', 51.66032451877395]
['id_165', 30.757728029528145]
['id_166', 0.7046788784008329]
['id_167', 15.69909391829567]
['id_168', 64.85026914636217]
['id_169', 14.091863832826492]
['id_170', 64.63390138328151]
['id_171', 41.1034153470575]
['id_172', 26.10220500513233]
['id_173', 21.306744952037718]
['id_174', 60.772880949567615]
['id_175', 25.640875539198568]
['id_176', 20.607724023722504]
['id_177', 36.440294733916055]
['id_178', 11.811390300876912]
['id_179', 31.04992122482525]
['id_180', 16.039969277294592]
['id_181', 11.636852886319232]
['id_182', 54.69463439993815]
['id_183', 45.554294800332826]
['id_184', 15.832874697329283]
['id_185', 35.30002592484912]
['id_186', 25.875100126915225]
['id_187', 69.12454857221925]
['id_188', 9.494422607611392]
['id_189', 56.09961011371668]
['id_190', 39.136486183970376]
['id_191', 17.750604163833337]
['id_192', 29.47332438184137]
['id_193', 1.0621027165500536]
['id_194', 19.635975194399606]
['id_195', 0.7239894381238666]
['id_196', 33.07818517010954]
['id_197', 10.39208734094487]
['id_198', 20.276530494790233]
['id_199', 60.57784848719553]
['id_200', 24.181583155767832]
['id_201', 25.220427389246172]
['id_202', 63.91231956611171]
['id_203', 8.6049510235601]
['id_204', 10.774123100084637]
['id_205', 11.05959353984231]
['id_206', 8.319178490346888]
['id_207', 1.996366494421347]
['id_208', 123.6426703940643]
['id_209', 21.014115178472828]
['id_210', 16.35701579048604]
['id_211', 15.284669239242278]
['id_212', 36.57007486198091]
['id_213', 36.122765008736714]
['id_214', 19.133595484703864]
['id_215', 35.399719868445146]
['id_216', 79.16726976217178]
['id_217', 0.7588151400164959]
['id_218', 13.148987152389058]
['id_219', 33.17931057833607]
['id_220', 15.085274432347417]
['id_221', 12.104912457818461]
['id_222', 113.28037023558156]
['id_223', 13.251132251849445]
['id_224', 16.92991057921663]
['id_225', 62.99821540914401]
['id_226', 15.707022994664912]
['id_227', 22.866534471291352]
['id_228', 10.651456054090744]
['id_229', 3.9686051537628377]
['id_230', 46.1859017733578]
['id_231', 12.635141586554013]
['id_232', 52.86065960966461]
['id_233', 42.90487239996108]
['id_234', 23.100383883121143]
['id_235', 41.10669330061684]
['id_236', 69.54906403996627]
['id_237', 41.822846361366885]
['id_238', 11.851913749450345]
['id_239', 17.43921590024002]