Because the test set has no 10th hour and I do not plan to validate on Kaggle, the code was modified to predict the 9th hour's PM2.5 from the previous 8 hours; some comments still describe the original setup of predicting the 10th hour from 9 hours.
Original goal of this assignment: use the 18 features (including PM2.5) of the previous 9 hours to predict the PM2.5 of the 10th hour.
Load 'train.csv'
train.csv contains data for 20 days of each of the 12 months, 24 hours per day, with 18 features per hour.
cd /data/jupyter/root/MachineLearning/Lee/wk1/
/data/jupyter/root/MachineLearning/Lee/wk1
import sys
import pandas as pd
import numpy as np
# from google.colab import drive
# !gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip
# !unzip data.zip
data = pd.read_csv('./data/train.csv', header = 0, encoding = 'big5')
# data = pd.read_csv('./train.csv', encoding = 'big5')
print(len(data))
data[:22]
4320
日期 | 測站 | 測項 | 0 | 1 | 2 | 3 | 4 | 5 | 6 | ... | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014/1/1 | 豐原 | AMB_TEMP | 14 | 14 | 14 | 13 | 12 | 12 | 12 | ... | 22 | 22 | 21 | 19 | 17 | 16 | 15 | 15 | 15 | 15 |
1 | 2014/1/1 | 豐原 | CH4 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | ... | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 |
2 | 2014/1/1 | 豐原 | CO | 0.51 | 0.41 | 0.39 | 0.37 | 0.35 | 0.3 | 0.37 | ... | 0.37 | 0.37 | 0.47 | 0.69 | 0.56 | 0.45 | 0.38 | 0.35 | 0.36 | 0.32 |
3 | 2014/1/1 | 豐原 | NMHC | 0.2 | 0.15 | 0.13 | 0.12 | 0.11 | 0.06 | 0.1 | ... | 0.1 | 0.13 | 0.14 | 0.23 | 0.18 | 0.12 | 0.1 | 0.09 | 0.1 | 0.08 |
4 | 2014/1/1 | 豐原 | NO | 0.9 | 0.6 | 0.5 | 1.7 | 1.8 | 1.5 | 1.9 | ... | 2.5 | 2.2 | 2.5 | 2.3 | 2.1 | 1.9 | 1.5 | 1.6 | 1.8 | 1.5 |
5 | 2014/1/1 | 豐原 | NO2 | 16 | 9.2 | 8.2 | 6.9 | 6.8 | 3.8 | 6.9 | ... | 11 | 11 | 22 | 28 | 19 | 12 | 8.1 | 7 | 6.9 | 6 |
6 | 2014/1/1 | 豐原 | NOx | 17 | 9.8 | 8.7 | 8.6 | 8.5 | 5.3 | 8.8 | ... | 14 | 13 | 25 | 30 | 21 | 13 | 9.7 | 8.6 | 8.7 | 7.5 |
7 | 2014/1/1 | 豐原 | O3 | 16 | 30 | 27 | 23 | 24 | 28 | 24 | ... | 65 | 64 | 51 | 34 | 33 | 34 | 37 | 38 | 38 | 36 |
8 | 2014/1/1 | 豐原 | PM10 | 56 | 50 | 48 | 35 | 25 | 12 | 4 | ... | 52 | 51 | 66 | 85 | 85 | 63 | 46 | 36 | 42 | 42 |
9 | 2014/1/1 | 豐原 | PM2.5 | 26 | 39 | 36 | 35 | 31 | 28 | 25 | ... | 36 | 45 | 42 | 49 | 45 | 44 | 41 | 30 | 24 | 13 |
10 | 2014/1/1 | 豐原 | RAINFALL | NR | NR | NR | NR | NR | NR | NR | ... | NR | NR | NR | NR | NR | NR | NR | NR | NR | NR |
11 | 2014/1/1 | 豐原 | RH | 77 | 68 | 67 | 74 | 72 | 73 | 74 | ... | 47 | 49 | 56 | 67 | 72 | 69 | 70 | 70 | 70 | 69 |
12 | 2014/1/1 | 豐原 | SO2 | 1.8 | 2 | 1.7 | 1.6 | 1.9 | 1.4 | 1.5 | ... | 3.9 | 4.4 | 9.9 | 5.1 | 3.4 | 2.3 | 2 | 1.9 | 1.9 | 1.9 |
13 | 2014/1/1 | 豐原 | THC | 2 | 2 | 2 | 1.9 | 1.9 | 1.8 | 1.9 | ... | 1.9 | 1.9 | 1.9 | 2.1 | 2 | 1.9 | 1.9 | 1.9 | 1.9 | 1.9 |
14 | 2014/1/1 | 豐原 | WD_HR | 37 | 80 | 57 | 76 | 110 | 106 | 101 | ... | 307 | 304 | 307 | 124 | 118 | 121 | 113 | 112 | 106 | 110 |
15 | 2014/1/1 | 豐原 | WIND_DIREC | 35 | 79 | 2.4 | 55 | 94 | 116 | 106 | ... | 313 | 305 | 291 | 124 | 119 | 118 | 114 | 108 | 102 | 111 |
16 | 2014/1/1 | 豐原 | WIND_SPEED | 1.4 | 1.8 | 1 | 0.6 | 1.7 | 2.5 | 2.5 | ... | 2.5 | 2.2 | 1.4 | 2.2 | 2.8 | 3 | 2.6 | 2.7 | 2.1 | 2.1 |
17 | 2014/1/1 | 豐原 | WS_HR | 0.5 | 0.9 | 0.6 | 0.3 | 0.6 | 1.9 | 2 | ... | 2.1 | 2.1 | 1.9 | 1 | 2.5 | 2.5 | 2.8 | 2.6 | 2.4 | 2.3 |
18 | 2014/1/2 | 豐原 | AMB_TEMP | 16 | 15 | 15 | 14 | 14 | 15 | 16 | ... | 24 | 24 | 23 | 21 | 20 | 19 | 18 | 18 | 18 | 18 |
19 | 2014/1/2 | 豐原 | CH4 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | ... | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 |
20 | 2014/1/2 | 豐原 | CO | 0.26 | 0.25 | 0.28 | 0.27 | 0.24 | 0.26 | 0.34 | ... | 0.34 | 0.35 | 0.38 | 0.61 | 0.44 | 0.4 | 0.4 | 0.55 | 0.41 | 0.33 |
21 | 2014/1/2 | 豐原 | NMHC | 0.06 | 0.05 | 0.06 | 0.05 | 0.05 | 0.07 | 0.09 | ... | 0.12 | 0.16 | 0.23 | 0.32 | 0.18 | 0.15 | 0.23 | 0.29 | 0.17 | 0.12 |
22 rows × 27 columns
Preprocessing
Keep only the numeric part, and fill the entire 'RAINFALL' column with 0.
Also, if you rerun this cell on Colab, start over from the top (rerun everything above) to avoid getting unexpected results. (This does not happen when running a standalone script, but rerunning this cell on Colab keeps slicing further into the data: the first run takes everything after column 3 of the original data, the second run takes everything after column 3 of the already-sliced data, and so on.)
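The rerun pitfall described above can be sketched on a toy frame (the column names here are made up for illustration): rerunning a cell that reassigns `data = data.iloc[:, 3:]` slices the already-sliced frame again, while selecting the columns to keep by name is safe to rerun.

```python
import pandas as pd

# Toy frame (hypothetical columns) demonstrating why rerunning the slicing
# cell keeps eating columns: iloc slices whatever the variable currently holds.
df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))
once = df.iloc[:, 3:]     # first run: drops a, b, c -> columns d, e, f
twice = once.iloc[:, 3:]  # rerun: drops d, e, f as well -> empty frame
print(list(once.columns), list(twice.columns))

# Idempotent alternative: select the columns to keep by name.
keep = df.columns[3:]
same = df[keep]           # rerunning this line always yields d, e, f
```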
data = data.iloc[:, 3:]
data[data == 'NR'] = 0
raw_data = data.to_numpy()
raw_data[0 : 18, :]  # the 18 features of the first day
array([['14', '14', '14', '13', '12', '12', '12', '12', '15', '17', '20',
'22', '22', '22', '22', '22', '21', '19', '17', '16', '15', '15',
'15', '15'],
['1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8',
'1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8',
'1.8', '1.8', '1.8', '1.8', '1.8', '1.8'],
['0.51', '0.41', '0.39', '0.37', '0.35', '0.3', '0.37', '0.47',
'0.78', '0.74', '0.59', '0.52', '0.41', '0.4', '0.37', '0.37',
'0.47', '0.69', '0.56', '0.45', '0.38', '0.35', '0.36', '0.32'],
['0.2', '0.15', '0.13', '0.12', '0.11', '0.06', '0.1', '0.13',
'0.26', '0.23', '0.2', '0.18', '0.12', '0.11', '0.1', '0.13',
'0.14', '0.23', '0.18', '0.12', '0.1', '0.09', '0.1', '0.08'],
['0.9', '0.6', '0.5', '1.7', '1.8', '1.5', '1.9', '2.2', '6.6',
'7.9', '4.2', '2.9', '3.4', '3', '2.5', '2.2', '2.5', '2.3',
'2.1', '1.9', '1.5', '1.6', '1.8', '1.5'],
['16', '9.2', '8.2', '6.9', '6.8', '3.8', '6.9', '7.8', '15',
'21', '14', '11', '14', '12', '11', '11', '22', '28', '19', '12',
'8.1', '7', '6.9', '6'],
['17', '9.8', '8.7', '8.6', '8.5', '5.3', '8.8', '9.9', '22',
'29', '18', '14', '17', '15', '14', '13', '25', '30', '21', '13',
'9.7', '8.6', '8.7', '7.5'],
['16', '30', '27', '23', '24', '28', '24', '22', '21', '29', '44',
'58', '50', '57', '65', '64', '51', '34', '33', '34', '37', '38',
'38', '36'],
['56', '50', '48', '35', '25', '12', '4', '2', '11', '38', '56',
'64', '56', '57', '52', '51', '66', '85', '85', '63', '46', '36',
'42', '42'],
['26', '39', '36', '35', '31', '28', '25', '20', '19', '30', '41',
'44', '33', '37', '36', '45', '42', '49', '45', '44', '41', '30',
'24', '13'],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0],
['77', '68', '67', '74', '72', '73', '74', '73', '66', '56', '45',
'37', '40', '42', '47', '49', '56', '67', '72', '69', '70', '70',
'70', '69'],
['1.8', '2', '1.7', '1.6', '1.9', '1.4', '1.5', '1.6', '5.1',
'15', '4.5', '2.7', '3.5', '3.6', '3.9', '4.4', '9.9', '5.1',
'3.4', '2.3', '2', '1.9', '1.9', '1.9'],
['2', '2', '2', '1.9', '1.9', '1.8', '1.9', '1.9', '2.1', '2',
'2', '2', '1.9', '1.9', '1.9', '1.9', '1.9', '2.1', '2', '1.9',
'1.9', '1.9', '1.9', '1.9'],
['37', '80', '57', '76', '110', '106', '101', '104', '124', '46',
'241', '280', '297', '305', '307', '304', '307', '124', '118',
'121', '113', '112', '106', '110'],
['35', '79', '2.4', '55', '94', '116', '106', '94', '232', '153',
'283', '269', '290', '316', '313', '305', '291', '124', '119',
'118', '114', '108', '102', '111'],
['1.4', '1.8', '1', '0.6', '1.7', '2.5', '2.5', '2', '0.6', '0.8',
'1.6', '1.9', '2.1', '3.3', '2.5', '2.2', '1.4', '2.2', '2.8',
'3', '2.6', '2.7', '2.1', '2.1'],
['0.5', '0.9', '0.6', '0.3', '0.6', '1.9', '2', '2', '0.5', '0.3',
'0.8', '1.2', '2', '2.6', '2.1', '2.1', '1.9', '1', '2.5', '2.5',
'2.8', '2.6', '2.4', '2.3']], dtype=object)
Extract Features (1)
Reshape the original data (4320 rows of 18 feature rows per day, 24 hour columns) month by month into 12 blocks of 18 (features) × 480 (hours); 480 = 24 hours × 20 days per month.
raw_data[0 : 18, :][0]  # 480 such columns make up one month's block; element [0] is the row of the first feature (AMB_TEMP)
array(['14', '14', '14', '13', '12', '12', '12', '12', '15', '17', '20',
'22', '22', '22', '22', '22', '21', '19', '17', '16', '15', '15',
'15', '15'], dtype=object)
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample
month_data[0][0][:24]  # month 0, feature 0 (AMB_TEMP), first 24 hours
array([14., 14., 14., 13., 12., 12., 12., 12., 15., 17., 20., 22., 22.,
22., 22., 22., 21., 19., 17., 16., 15., 15., 15., 15.])
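The day-by-day copy above can also be written as a single reshape/transpose. A hedged sketch on synthetic data (the real `raw_data` would drop in unchanged):

```python
import numpy as np

# Synthetic stand-in for raw_data: 12 months x 20 days x 18 feature rows, 24 hour columns.
raw = np.arange(4320 * 24, dtype=float).reshape(4320, 24)

# Loop version, as in the notebook cell above.
month_loop = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24:(day + 1) * 24] = raw[18 * (20 * month + day):18 * (20 * month + day + 1), :]
    month_loop[month] = sample

# Vectorized version: split the rows into (month, day, feature), move the
# feature axis forward, then glue each month's 20 days along the hour axis.
month_vec = raw.reshape(12, 20, 18, 24).transpose(0, 2, 1, 3).reshape(12, 18, 480)
```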
Extract Features (2)
Each month has 480 hours. Here every 8 consecutive hours form one sample, with the PM2.5 of the 9th hour as the target, so each month yields 472 samples and the total is 472 * 12. Each sample has 8 * 18 = 144 features (18 features per hour × 8 hours).
(In the original 9-hour setup there are 471 samples per month, each with 9 * 18 features, and the target is the 10th hour's PM2.5.)
np.set_printoptions(suppress = True)
x = np.empty([12 * 472, 18 * 8], dtype = float)
y = np.empty([12 * 472, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 15:  # the last valid 8-hour window starts at hour 19 * 24 + 15 = 471 (the condition `hour > 14` would leave each month's last row unfilled)
                continue
            x[month * 472 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour : day * 24 + hour + 8].reshape(1, -1)  # vector dim: 18 * 8 = 144
            y[month * 472 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 8]  # PM2.5 of the following (9th) hour
print(x)
print(y)
print(x)
print(y)
[[14. 14. 14. ... 1.9 2. 2. ]
[14. 14. 13. ... 2. 2. 0.5]
[14. 13. 12. ... 2. 0.5 0.3]
...
[18. 19. 18. ... 1.1 1.4 1.3]
[19. 18. 17. ... 1.4 1.3 1.6]
[ 0. 0. 0. ... 0. 0. 0. ]]
[[19.]
[30.]
[41.]
...
[17.]
[24.]
[ 0.]]
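The triple loop above can also be expressed with numpy's `sliding_window_view` (available since numpy 1.20). A hedged sketch for a single month's 18 × 480 block, synthetic here:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Synthetic stand-in for one month's 18 x 480 block of month_data.
month = np.arange(18 * 480, dtype=float).reshape(18, 480)

# All 8-hour windows along the hour axis: shape (18, 473, 8); only the first
# 472 window starts still have a following hour to predict.
wins = sliding_window_view(month, 8, axis=1)
x_month = wins.transpose(1, 0, 2).reshape(-1, 18 * 8)[:472]  # (472, 144)
y_month = month[9, 8:]  # PM2.5 (row 9) one hour after each window: 472 targets
```

The flattening order (feature-major, then hour) matches the notebook's `reshape(1, -1)` of each 18 × 8 slice.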
x[25]  # one sample: 18 features × 8 hours = 144 values, ordered feature by feature starting with AMB_TEMP
array([ 15. , 15. , 14. , 14. , 15. , 16. , 16. , 17. ,
1.8 , 1.8 , 1.8 , 1.8 , 1.8 , 1.8 , 1.8 , 1.8 ,
0.25, 0.28, 0.27, 0.24, 0.26, 0.34, 0.56, 0.79,
0.05, 0.06, 0.05, 0.05, 0.07, 0.09, 0.19, 0.31,
1.1 , 1.3 , 1. , 1.2 , 1.1 , 1.6 , 8.4 , 17. ,
3.2 , 3.3 , 3.1 , 3.1 , 4.3 , 9.4 , 19. , 26. ,
4.3 , 4.7 , 4.1 , 4.3 , 5.5 , 11. , 27. , 43. ,
38. , 39. , 39. , 34. , 31. , 30. , 18. , 17. ,
34. , 31. , 16. , 18. , 8. , 16. , 24. , 37. ,
23. , 30. , 30. , 22. , 18. , 13. , 13. , 11. ,
0. , 0. , 0. , 0. , 0. , 0. , 0. , 0. ,
66. , 70. , 71. , 70. , 67. , 59. , 60. , 55. ,
0.8 , 0.9 , 0.8 , 0.8 , 0.7 , 0.8 , 1.2 , 3.3 ,
1.8 , 1.8 , 1.8 , 1.8 , 1.8 , 1.8 , 2. , 2.1 ,
117. , 113. , 110. , 100. , 64. , 80. , 88. , 50. ,
113. , 115. , 102. , 87. , 79. , 252. , 90. , 216. ,
2.7 , 2.9 , 2.3 , 1.5 , 1. , 0.8 , 4. , 1. ,
2.4 , 2.8 , 2.7 , 2. , 0.5 , 0.8 , 1.5 , 0.7 ])
Normalize (1)
mean_x = np.mean(x, axis = 0)  # 18 * 8
std_x = np.std(x, axis = 0)  # 18 * 8
for i in range(len(x)):  # 12 * 472
    for j in range(len(x[0])):  # 18 * 8
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
len(x)  # 5664 = 12 * 472
5664
len(x[0])  # 144 = 18 * 8
144
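The element-wise loop above can be replaced by broadcasting. A hedged sketch on a small synthetic matrix with one constant (zero-std) column, matching the loop's behavior of leaving such columns untouched:

```python
import numpy as np

# Synthetic stand-in for x: three samples, and the middle column is constant.
x = np.array([[1.0, 5.0, 2.0],
              [3.0, 5.0, 4.0],
              [5.0, 5.0, 6.0]])
mean_x = x.mean(axis=0)
std_x = x.std(axis=0)

# Broadcasting normalizes the whole matrix at once; np.where leaves the
# zero-std columns as they were, just like the `if std_x[j] != 0` guard.
safe_std = np.where(std_x == 0, 1, std_x)
x_norm = np.where(std_x == 0, x, (x - mean_x) / safe_std)
```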
tag = x[0]  # keep a copy of the first normalized sample for a later spot check
Split Training Data Into "train_set" and "validation_set"
This part is a simple demonstration for questions 2 and 3 of the report: it produces the train_set used for training in the comparison, and a validation_set that is never trained on and is used only for validation.
import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))
[[-1.33404075 -1.33462207 -1.33500832 ... 0.17629941 0.26946316
0.26863726]
[-1.33404075 -1.33462207 -1.49203094 ... 0.26985884 0.26946316
-1.13432695]
[-1.33404075 -1.49171996 -1.64905356 ... 0.26985884 -1.1339323
-1.32138884]
...
[ 0.70895574 0.39345479 0.07819527 ... 1.39257204 0.26946316
-0.38607937]
[ 0.39464859 0.07925899 0.07819527 ... 0.26985884 -0.38545472
-0.38607937]
[ 0.08034144 0.07925899 0.07819527 ... -0.38505719 -0.38545472
-0.85373411]]
[[19.]
[30.]
[41.]
...
[ 7.]
[ 5.]
[14.]]
[[ 0.08034144 0.07925899 0.23521789 ... -0.38505719 -0.85325321
-0.57314126]
[ 0.08034144 0.23635689 0.23521789 ... -0.85285436 -0.57257412
0.5492301 ]
[ 0.23749501 0.23635689 -0.07882735 ... -0.57217606 0.55014225
-0.10548653]
...
[-0.70542645 -0.54913259 -0.70691783 ... -0.57217606 -0.29189502
-0.38607937]
[-0.54827287 -0.70623048 -0.86394045 ... -0.29149776 -0.38545472
-0.10548653]
[-3.53419082 -3.53399261 -3.53332501 ... -1.60132983 -1.60173079
-1.60198169]]
[[13.]
[24.]
[22.]
...
[17.]
[24.]
[ 0.]]
4531
4531
1133
1133
Training
(Difference from the figure above: the code below uses Root Mean Square Error.)
Because of the constant (bias) term, the dimension (dim) needs one extra column; eps is a tiny value added so that adagrad's denominator never becomes 0.
Each dimension (dim) has its own gradient and weight (w), learned through successive iterations (iter_time).
dim = 18 * 8 + 1
w = np.zeros([dim, 1])
x_ = np.concatenate((np.ones([12 * 472, 1]), x), axis = 1).astype(float)
print(len(x_[0]))
print(len(w))
145
145
Here a column of 1s is prepended for the constant term, so that for the bias dimension $x_{np} \cdot w_p = w_p$.
learning_rate = 200
iter_time = 20000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x_, w) - y, 2)) / 472 / 12)  # RMSE over the 12 * 472 samples
    if t % 100 == 0:
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x_.transpose(), np.dot(x_, w) - y)  # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
w
0:27.04269876504886
100:42.316253745299576
200:31.074186243413685
300:25.537280583796065
400:22.165757929684172
500:19.75254567886226
…
…
98300:5.688854049407231
98400:5.688854025502277
98500:5.68885400177681
98600:5.6888539782294805
98700:5.6888539548589465
98800:5.688853931663879
98900:5.688853908642958
99000:5.688853885794873
99100:5.688853863118321
99200:5.688853840612015
99300:5.688853818274671
99400:5.68885379610502
99500:5.688853774101799
99600:5.688853752263754
99700:5.6888537305896465
99800:5.68885370907824
99900:5.688853687728311
array([[21.32627119],
[-0.59770122],
[ 1.19618126],
[-1.71854272],
[-0.00412458],
[ 1.06928241],
[-0.36250559],
[-1.81932088],
[ 2.10286256],
[ 0.12625626],
[ 0.10239455],
[ 0.04337644],
[ 0.05290508],
[-0.0207053 ],
[-0.25997183],
[ 0.15053346],
[ 0.58737417],
[ 0.01482249],
[ 0.06264181],
[-0.16256801],
[ 0.14738827],
[-0.02307035],
[-0.16147156],
[ 0.07498768],
[ 0.40381712],
[ 0.31507161],
[-0.24065449],
[ 0.23966163],
[ 0.38239296],
[-0.52786859],
[ 0.0581854 ],
[ 0.16077347],
[ 0.06112612],
[ 0.11976572],
[ 0.23739248],
[-1.04772983],
[-0.47176797],
[-0.55928703],
[ 0.0159055 ],
[-0.54088112],
[-0.68640924],
[-0.06704892],
[-0.1638642 ],
[-2.79479872],
[-1.46477859],
[-0.90830161],
[-0.4895747 ],
[-2.71160536],
[ 0.13711248],
[-0.38775076],
[ 0.12351668],
[ 3.22246928],
[ 1.65049504],
[ 1.25627249],
[ 0.14705583],
[ 2.61715673],
[ 1.9780861 ],
[ 0.1623215 ],
[-0.16639199],
[-0.36364337],
[ 0.01706278],
[-0.56897905],
[-0.42067013],
[-0.17519396],
[ 1.8326884 ],
[ 0.36716326],
[-0.81083076],
[ 1.0596038 ],
[-0.10137155],
[-1.05794694],
[ 1.02079922],
[-0.05240129],
[ 0.91110234],
[-0.82112217],
[ 3.55240057],
[-3.51869125],
[-0.69883282],
[ 8.19423616],
[-9.11851928],
[ 0.28782482],
[15.86037477],
[ 0.00647815],
[-0.0904938 ],
[ 0.00927876],
[-0.1027838 ],
[ 0.10683582],
[ 0.03512709],
[-0.08788882],
[-0.15754034],
[ 0.12550395],
[ 0.37267731],
[-0.85912138],
[-0.58034475],
[ 1.18502834],
[-1.32860977],
[ 0.05769156],
[ 0.6495768 ],
[ 0.22441773],
[-0.08086894],
[-0.12145987],
[-0.03584999],
[ 0.10052267],
[-0.2054299 ],
[ 0.28685996],
[ 0.22959199],
[-0.32587512],
[ 0.12923094],
[-0.40514046],
[-0.175628 ],
[ 0.37578853],
[ 0.15293797],
[-0.35953503],
[-0.1440169 ],
[ 0.17412323],
[-0.06657272],
[ 0.18199105],
[ 0.05810032],
[ 0.16962856],
[-0.24043043],
[ 0.10899224],
[ 0.04145799],
[-0.09666332],
[ 0.09489148],
[-0.18459977],
[ 0.03905182],
[-0.00918566],
[ 0.00566057],
[-0.24082881],
[ 0.02100525],
[-0.14123642],
[ 0.19429907],
[ 0.0552319 ],
[-0.11646291],
[-0.08053281],
[-0.0443063 ],
[-0.0314302 ],
[-0.13354943],
[ 0.0625202 ],
[-0.10887952],
[-0.21179497],
[-0.02862832],
[ 0.34464294],
[-0.02245315],
[-0.29397586],
[ 0.18813077]])
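Since this is plain linear regression, the adagrad result can be sanity-checked against the closed-form least-squares solution. A hedged sketch on synthetic data (the notebook's `x_` and `y` would drop in directly):

```python
import numpy as np

# Synthetic regression problem shaped like the notebook's: a bias column of
# ones plus a few features, targets generated from known weights plus noise.
rng = np.random.default_rng(0)
x_ = np.concatenate((np.ones([200, 1]), rng.normal(size=(200, 5))), axis=1)
true_w = rng.normal(size=(6, 1))
y = x_ @ true_w + 0.01 * rng.normal(size=(200, 1))

# Closed form: np.linalg.lstsq minimizes ||x_ w - y||^2 directly.
w_closed, *_ = np.linalg.lstsq(x_, y, rcond=None)

# Miniature adagrad loop mirroring the training cell above.
w = np.zeros([6, 1])
adagrad = np.zeros([6, 1])
eps = 1e-10
for t in range(5000):
    gradient = 2 * x_.T @ (x_ @ w - y)
    adagrad += gradient ** 2
    w -= 1.0 * gradient / np.sqrt(adagrad + eps)

rmse_gd = np.sqrt(np.mean((x_ @ w - y) ** 2))
rmse_closed = np.sqrt(np.mean((x_ @ w_closed - y) ** 2))
```

The closed form gives the exact minimizer; gradient descent should approach it, never beat it.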
Testing
![illustration](https://drive.google.com/uc?id=1165ETzZyE6HStqKvgR0gKrJwgFLK6-CW)
Load the test data and process it in the same way as the training data (preprocessing and feature extraction), so that the test data becomes 240 samples of dimension 18 * 8 + 1.
# testdata = pd.read_csv('gdrive/My Drive/hw1-regression/test.csv', header = None, encoding = 'big5')
testdata = pd.read_csv('data/test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:]
testdata[:20]
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | id_0 | AMB_TEMP | 21 | 21 | 20 | 20 | 19 | 19 | 19 | 18 | 17 |
1 | id_0 | CH4 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.8 |
2 | id_0 | CO | 0.39 | 0.36 | 0.36 | 0.4 | 0.53 | 0.55 | 0.34 | 0.31 | 0.23 |
3 | id_0 | NMHC | 0.16 | 0.24 | 0.22 | 0.27 | 0.27 | 0.26 | 0.27 | 0.29 | 0.1 |
4 | id_0 | NO | 1.3 | 1.3 | 1.3 | 1.3 | 1.4 | 1.6 | 1.2 | 1.1 | 0.9 |
5 | id_0 | NO2 | 17 | 14 | 13 | 14 | 18 | 21 | 8.9 | 9.4 | 5 |
6 | id_0 | NOx | 18 | 16 | 14 | 15 | 20 | 23 | 10 | 10 | 5.8 |
7 | id_0 | O3 | 32 | 31 | 31 | 26 | 16 | 12 | 27 | 20 | 26 |
8 | id_0 | PM10 | 62 | 50 | 44 | 39 | 38 | 32 | 48 | 36 | 25 |
9 | id_0 | PM2.5 | 33 | 39 | 39 | 25 | 18 | 18 | 17 | 9 | 4 |
10 | id_0 | RAINFALL | NR | NR | NR | NR | NR | NR | NR | NR | NR |
11 | id_0 | RH | 83 | 85 | 87 | 87 | 86 | 85 | 78 | 81 | 80 |
12 | id_0 | SO2 | 2 | 1.8 | 1.8 | 1.8 | 2.1 | 2.6 | 2 | 2.3 | 2.4 |
13 | id_0 | THC | 1.8 | 1.9 | 1.9 | 2 | 2 | 2 | 2 | 2 | 1.9 |
14 | id_0 | WD_HR | 58 | 53 | 67 | 59 | 59 | 73 | 79 | 82 | 104 |
15 | id_0 | WIND_DIREC | 57 | 44 | 73 | 44 | 56 | 115 | 45 | 107 | 103 |
16 | id_0 | WIND_SPEED | 1.4 | 1.3 | 1.5 | 1.4 | 1.6 | 1.6 | 1.2 | 1.8 | 2.3 |
17 | id_0 | WS_HR | 1 | 0.9 | 0.9 | 0.9 | 1.2 | 0.7 | 1 | 0.6 | 1.8 |
18 | id_1 | AMB_TEMP | 14 | 13 | 13 | 13 | 13 | 13 | 13 | 12 | 13 |
19 | id_1 | CH4 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.7 | 1.7 | 1.8 |
test_data[:20]
2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
---|---|---|---|---|---|---|---|---|---|
0 | 21 | 21 | 20 | 20 | 19 | 19 | 19 | 18 | 17 |
1 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.7 | 1.8 |
2 | 0.39 | 0.36 | 0.36 | 0.4 | 0.53 | 0.55 | 0.34 | 0.31 | 0.23 |
3 | 0.16 | 0.24 | 0.22 | 0.27 | 0.27 | 0.26 | 0.27 | 0.29 | 0.1 |
4 | 1.3 | 1.3 | 1.3 | 1.3 | 1.4 | 1.6 | 1.2 | 1.1 | 0.9 |
5 | 17 | 14 | 13 | 14 | 18 | 21 | 8.9 | 9.4 | 5 |
6 | 18 | 16 | 14 | 15 | 20 | 23 | 10 | 10 | 5.8 |
7 | 32 | 31 | 31 | 26 | 16 | 12 | 27 | 20 | 26 |
8 | 62 | 50 | 44 | 39 | 38 | 32 | 48 | 36 | 25 |
9 | 33 | 39 | 39 | 25 | 18 | 18 | 17 | 9 | 4 |
10 | NR | NR | NR | NR | NR | NR | NR | NR | NR |
11 | 83 | 85 | 87 | 87 | 86 | 85 | 78 | 81 | 80 |
12 | 2 | 1.8 | 1.8 | 1.8 | 2.1 | 2.6 | 2 | 2.3 | 2.4 |
13 | 1.8 | 1.9 | 1.9 | 2 | 2 | 2 | 2 | 2 | 1.9 |
14 | 58 | 53 | 67 | 59 | 59 | 73 | 79 | 82 | 104 |
15 | 57 | 44 | 73 | 44 | 56 | 115 | 45 | 107 | 103 |
16 | 1.4 | 1.3 | 1.5 | 1.4 | 1.6 | 1.6 | 1.2 | 1.8 | 2.3 |
17 | 1 | 0.9 | 0.9 | 0.9 | 1.2 | 0.7 | 1 | 0.6 | 1.8 |
18 | 14 | 13 | 13 | 13 | 13 | 13 | 13 | 12 | 13 |
19 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.8 | 1.7 | 1.7 | 1.8 |
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18 * 8], dtype = float)
test_y = np.empty([240, 1], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i : 18 * (i + 1), : 8].reshape(1, -1)
    test_y[i, :] = test_data[18 * i + 9, 8]  # PM2.5 (row 9 within sample i) at the 9th hour; `9 * (i + 1)` would index the wrong row for every i > 0
test_x[0]
test_y[0]
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3414: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._where(-key, value, inplace=True)
array([4.])
# Recompute mean and std from the test set itself (note: this overwrites the training statistics; standard practice would normalize the test data with the training-set mean_x and std_x instead)
mean_x = np.mean(test_x, axis = 0)  # 18 * 8
std_x = np.std(test_x, axis = 0)  # 18 * 8
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x[0]
array([-0.36400913, -0.37580621, -0.58294053, -0.6039988 , -0.79754419,
-0.77993183, -0.77134213, -0.93924642, -0.23381365, -0.20614316,
-0.178965 , -0.19738551, -0.20112982, -0.17849306, -0.21800613,
-0.1462465 , -0.02243175, -0.21006416, -0.19137731, 0.00237043,
0.64996798, 0.8347128 , -0.24763691, -0.26774851, 0.12113192,
0.72194031, 0.62642991, 1.1193606 , 1.07661245, 0.99170142,
1.14457532, 1.59800227, -0.41421696, -0.37903818, -0.44066817,
-0.3619202 , -0.40277453, -0.27806221, -0.42701048, -0.42302961,
0.93257409, 0.38666369, 0.29567325, 0.40431516, 0.932154 ,
1.38609225, -0.35672738, -0.21797962, 0.66821673, 0.27394276,
0.11390661, 0.16298197, 0.74432972, 1.04369186, -0.44653418,
-0.37384103, -0.01811478, -0.05063733, -0.10172555, -0.35003493,
-0.83314626, -1.02814938, -0.37646249, -0.70842781, 0.35455966,
0.00370494, -0.20993074, -0.35090948, -0.3798995 , -0.57390832,
-0.14049582, -0.48849244, 0.27232352, 0.56559302, 0.54673056,
-0.10839156, -0.44431028, -0.46020247, -0.50975521, -0.91310287,
-0.11549831, -0.15822222, -0.14537882, -0.19986573, -0.12898544,
-0.12412819, -0.1111852 , -0.13155091, 0.7839987 , 0.94178976,
1.11502694, 1.11585946, 1.03145997, 0.94421659, 0.44932091,
0.67745778, -0.66448801, -0.69458922, -0.8241389 , -0.76852537,
-0.60883941, -0.3483424 , -0.81324231, -0.52086313, -0.39850542,
0.13751952, 0.20242608, 0.79950013, 0.80387309, 0.83182892,
0.89047392, 0.97892759, -1.25158954, -1.32292147, -1.08841118,
-1.13878394, -1.23022993, -1.13992527, -1.05895382, -0.98101608,
-1.25641454, -1.38157623, -1.0202544 , -1.36540596, -1.26620751,
-0.68274526, -1.44856806, -0.69165696, -0.71738483, -0.74234011,
-0.60566186, -0.6923032 , -0.59214802, -0.56088831, -0.96087669,
-0.36666637, -0.6863133 , -0.77984735, -0.76161203, -0.75062533,
-0.49862637, -1.04516832, -0.70121844, -1.02610681])
tag
len(test_x[0])
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
Prediction
The illustration is the same as above.
With the trained weights and the test data, we can predict the target.
w = np.load('weight.npy')
w
array([[21.32627119],
[-0.59770122],
[ 1.19618126],
[-1.71854272],
[-0.00412458],
[ 1.06928241],
[-0.36250559],
[-1.81932088],
[ 2.10286256],
[ 0.12625626],
[ 0.10239455],
[ 0.04337644],
[ 0.05290508],
[-0.0207053 ],
[-0.25997183],
[ 0.15053346],
[ 0.58737417],
[ 0.01482249],
[ 0.06264181],
[-0.16256801],
[ 0.14738827],
[-0.02307035],
[-0.16147156],
[ 0.07498768],
[ 0.40381712],
[ 0.31507161],
[-0.24065449],
[ 0.23966163],
[ 0.38239296],
[-0.52786859],
[ 0.0581854 ],
[ 0.16077347],
[ 0.06112612],
[ 0.11976572],
[ 0.23739248],
[-1.04772983],
[-0.47176797],
[-0.55928703],
[ 0.0159055 ],
[-0.54088112],
[-0.68640924],
[-0.06704892],
[-0.1638642 ],
[-2.79479872],
[-1.46477859],
[-0.90830161],
[-0.4895747 ],
[-2.71160536],
[ 0.13711248],
[-0.38775076],
[ 0.12351668],
[ 3.22246928],
[ 1.65049504],
[ 1.25627249],
[ 0.14705583],
[ 2.61715673],
[ 1.9780861 ],
[ 0.1623215 ],
[-0.16639199],
[-0.36364337],
[ 0.01706278],
[-0.56897905],
[-0.42067013],
[-0.17519396],
[ 1.8326884 ],
[ 0.36716326],
[-0.81083076],
[ 1.0596038 ],
[-0.10137155],
[-1.05794694],
[ 1.02079922],
[-0.05240129],
[ 0.91110234],
[-0.82112217],
[ 3.55240057],
[-3.51869125],
[-0.69883282],
[ 8.19423616],
[-9.11851928],
[ 0.28782482],
[15.86037477],
[ 0.00647815],
[-0.0904938 ],
[ 0.00927876],
[-0.1027838 ],
[ 0.10683582],
[ 0.03512709],
[-0.08788882],
[-0.15754034],
[ 0.12550395],
[ 0.37267731],
[-0.85912138],
[-0.58034475],
[ 1.18502834],
[-1.32860977],
[ 0.05769156],
[ 0.6495768 ],
[ 0.22441773],
[-0.08086894],
[-0.12145987],
[-0.03584999],
[ 0.10052267],
[-0.2054299 ],
[ 0.28685996],
[ 0.22959199],
[-0.32587512],
[ 0.12923094],
[-0.40514046],
[-0.175628 ],
[ 0.37578853],
[ 0.15293797],
[-0.35953503],
[-0.1440169 ],
[ 0.17412323],
[-0.06657272],
[ 0.18199105],
[ 0.05810032],
[ 0.16962856],
[-0.24043043],
[ 0.10899224],
[ 0.04145799],
[-0.09666332],
[ 0.09489148],
[-0.18459977],
[ 0.03905182],
[-0.00918566],
[ 0.00566057],
[-0.24082881],
[ 0.02100525],
[-0.14123642],
[ 0.19429907],
[ 0.0552319 ],
[-0.11646291],
[-0.08053281],
[-0.0443063 ],
[-0.0314302 ],
[-0.13354943],
[ 0.0625202 ],
[-0.10887952],
[-0.21179497],
[-0.02862832],
[ 0.34464294],
[-0.02245315],
[-0.29397586],
[ 0.18813077]])
ans_y = np.dot(test_x, w)
print(ans_y[:10])
print(test_y[:10])
[[ 3.49897912]
[10.71431377]
[24.25967974]
[ 4.60526241]
[25.49323832]
[15.20700794]
[18.56378292]
[28.2902962 ]
[22.1783548 ]
[46.30543741]]
[[ 4.]
[13.]
[18.]
[28.]
[26.]
[14.]
[ 4.]
[31.]
[30.]
[20.]]
sum(abs(ans_y - test_y)) / len(ans_y)  # mean absolute error
array([16.0961944])
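The cell above reports the mean absolute error; since training minimized RMSE, it can be useful to look at both metrics. A hedged sketch on toy vectors standing in for ans_y and test_y (the values are hypothetical):

```python
import numpy as np

# Toy predictions and targets, same shape convention as ans_y / test_y
# (column vectors).
ans = np.array([[3.5], [10.7], [24.3]])
target = np.array([[4.0], [13.0], [18.0]])

mae = np.mean(np.abs(ans - target))            # what the cell above computes
rmse = np.sqrt(np.mean((ans - target) ** 2))   # the metric minimized in training
```

RMSE is never smaller than MAE and penalizes large misses more heavily.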
Save Prediction to CSV File
import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)
Related references:
Adagrad:
https://youtu.be/yKKNr-QKz2Q?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&t=705
RMSprop:
https://www.youtube.com/watch?v=5Yt-obwvMHI
Adam:
https://www.youtube.com/watch?v=JXQT_vxqwIs
The print statements above are mainly for inspecting the data and results, and can be removed. Also, on your own Linux system you can replace the hard-coded file paths with sys.argv (so that the files and their locations can be supplied on the command line).
Finally, you can beat the baseline by tuning the learning rate and iter_time (number of iterations), by changing which features are used (how many hours, which feature columns), or even by trying a different model.
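As one example of the feature-selection idea above, a hedged sketch of keeping only the PM2.5 row (feature index 9) of each window; `x` here is synthetic but shaped like this notebook's design matrix:

```python
import numpy as np

# Synthetic stand-in for the design matrix: 10 samples, each a flattened
# 18-feature x 8-hour window.
x = np.arange(10 * 18 * 8, dtype=float).reshape(10, 144)

# Un-flatten to (sample, feature, hour) and keep only the PM2.5 feature.
windows = x.reshape(-1, 18, 8)
x_pm25 = windows[:, 9, :]  # the 8 hourly PM2.5 readings per sample
```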
For the report question template, see: https://docs.google.com/document/d/1s84RXs2AEgZr54WCK9IgZrfTF-6B1td-AlKR9oqYa4g/edit