Notes: ML-LHY-HW1_regression

Because the test set has no 10th hour and I don't plan to validate on Kaggle, the code was modified to predict the 9th hour's PM2.5 from the first 8 hours. Some comments still describe the original setup of predicting the 10th hour from 9.

Original goal: predict the PM2.5 of the 10th hour from the 18 features (including PM2.5) of the preceding 9 hours.

Load 'train.csv'

train.csv contains, for each of the 12 months, 20 days of data, 24 hours per day (18 features per hour).

cd /data/jupyter/root/MachineLearning/Lee/wk1/
/data/jupyter/root/MachineLearning/Lee/wk1
import sys
import pandas as pd
import numpy as np
# from google.colab import drive 
# !gdown --id '1wNKAxQ29G15kgpBy_asjTcZRRgmsCZRm' --output data.zip
# !unzip data.zip
data = pd.read_csv('./data/train.csv', header = 0, encoding = 'big5')
# data = pd.read_csv('./train.csv', encoding = 'big5')

print(len(data))
data[:22]
4320
(DataFrame preview, flattened here: columns are 日期 (date), 測站 (station), 測項 (item), followed by the 24 hourly values 0–23. Rows cycle through the 18 items — AMB_TEMP, CH4, CO, NMHC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR — for station 豐原, starting 2014/1/1. RAINFALL entries are the string 'NR'.)

22 rows × 27 columns

Preprocessing

Keep only the numeric part and fill the whole 'RAINFALL' row with 0.
Also, if you want to re-run this cell in Colab, re-run everything from the top; otherwise the slice keeps eating columns: the first run takes everything after column 3 of the original data, a second run takes everything after column 3 of the already-sliced data, and so on. (This won't happen if you run the script fresh each time.)
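The re-running pitfall above can be demonstrated on a toy frame (hypothetical column names, just to illustrate the repeated slicing):

```python
import pandas as pd

# Toy stand-in for train.csv: 3 label columns + 2 value columns.
df = pd.DataFrame([['2014/1/1', 'X', 'PM2.5', 1, 2]],
                  columns=['date', 'station', 'item', 0, 1])

df = df.iloc[:, 3:]   # first run: drops the 3 label columns -> 2 columns left
assert df.shape[1] == 2

df = df.iloc[:, 3:]   # re-running the same cell slices again -> 0 columns left
assert df.shape[1] == 0
```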

data = data.iloc[:, 3:]
data[data == 'NR'] = 0
raw_data = data.to_numpy()
raw_data[0 : 18, :]  # the 18 features of day 1
array([['14', '14', '14', '13', '12', '12', '12', '12', '15', '17', '20',
        '22', '22', '22', '22', '22', '21', '19', '17', '16', '15', '15',
        '15', '15'],
       ['1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8',
        '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8', '1.8',
        '1.8', '1.8', '1.8', '1.8', '1.8', '1.8'],
       ['0.51', '0.41', '0.39', '0.37', '0.35', '0.3', '0.37', '0.47',
        '0.78', '0.74', '0.59', '0.52', '0.41', '0.4', '0.37', '0.37',
        '0.47', '0.69', '0.56', '0.45', '0.38', '0.35', '0.36', '0.32'],
       ['0.2', '0.15', '0.13', '0.12', '0.11', '0.06', '0.1', '0.13',
        '0.26', '0.23', '0.2', '0.18', '0.12', '0.11', '0.1', '0.13',
        '0.14', '0.23', '0.18', '0.12', '0.1', '0.09', '0.1', '0.08'],
       ['0.9', '0.6', '0.5', '1.7', '1.8', '1.5', '1.9', '2.2', '6.6',
        '7.9', '4.2', '2.9', '3.4', '3', '2.5', '2.2', '2.5', '2.3',
        '2.1', '1.9', '1.5', '1.6', '1.8', '1.5'],
       ['16', '9.2', '8.2', '6.9', '6.8', '3.8', '6.9', '7.8', '15',
        '21', '14', '11', '14', '12', '11', '11', '22', '28', '19', '12',
        '8.1', '7', '6.9', '6'],
       ['17', '9.8', '8.7', '8.6', '8.5', '5.3', '8.8', '9.9', '22',
        '29', '18', '14', '17', '15', '14', '13', '25', '30', '21', '13',
        '9.7', '8.6', '8.7', '7.5'],
       ['16', '30', '27', '23', '24', '28', '24', '22', '21', '29', '44',
        '58', '50', '57', '65', '64', '51', '34', '33', '34', '37', '38',
        '38', '36'],
       ['56', '50', '48', '35', '25', '12', '4', '2', '11', '38', '56',
        '64', '56', '57', '52', '51', '66', '85', '85', '63', '46', '36',
        '42', '42'],
       ['26', '39', '36', '35', '31', '28', '25', '20', '19', '30', '41',
        '44', '33', '37', '36', '45', '42', '49', '45', '44', '41', '30',
        '24', '13'],
       [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0],
       ['77', '68', '67', '74', '72', '73', '74', '73', '66', '56', '45',
        '37', '40', '42', '47', '49', '56', '67', '72', '69', '70', '70',
        '70', '69'],
       ['1.8', '2', '1.7', '1.6', '1.9', '1.4', '1.5', '1.6', '5.1',
        '15', '4.5', '2.7', '3.5', '3.6', '3.9', '4.4', '9.9', '5.1',
        '3.4', '2.3', '2', '1.9', '1.9', '1.9'],
       ['2', '2', '2', '1.9', '1.9', '1.8', '1.9', '1.9', '2.1', '2',
        '2', '2', '1.9', '1.9', '1.9', '1.9', '1.9', '2.1', '2', '1.9',
        '1.9', '1.9', '1.9', '1.9'],
       ['37', '80', '57', '76', '110', '106', '101', '104', '124', '46',
        '241', '280', '297', '305', '307', '304', '307', '124', '118',
        '121', '113', '112', '106', '110'],
       ['35', '79', '2.4', '55', '94', '116', '106', '94', '232', '153',
        '283', '269', '290', '316', '313', '305', '291', '124', '119',
        '118', '114', '108', '102', '111'],
       ['1.4', '1.8', '1', '0.6', '1.7', '2.5', '2.5', '2', '0.6', '0.8',
        '1.6', '1.9', '2.1', '3.3', '2.5', '2.2', '1.4', '2.2', '2.8',
        '3', '2.6', '2.7', '2.1', '2.1'],
       ['0.5', '0.9', '0.6', '0.3', '0.6', '1.9', '2', '2', '0.5', '0.3',
        '0.8', '1.2', '2', '2.6', '2.1', '2.1', '1.9', '1', '2.5', '2.5',
        '2.8', '2.6', '2.4', '2.3']], dtype=object)

Extract Features (1)


Reshape the original 4320 * 18 data, month by month, into 12 blocks of 18 (features) * 480 (hours). 480 = 24 * 20 (20 days per month).

raw_data[0 : 18, :][0]  # 480 such hourly columns make up one month's sample; [0] selects the first feature's (AMB_TEMP) 24 hours of day 1
array(['14', '14', '14', '13', '12', '12', '12', '12', '15', '17', '20',
       '22', '22', '22', '22', '22', '21', '19', '17', '16', '15', '15',
       '15', '15'], dtype=object)
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample
month_data[0][0][:24]  # month 0, feature 0 (AMB_TEMP), first 24 hours
array([14., 14., 14., 13., 12., 12., 12., 12., 15., 17., 20., 22., 22.,
       22., 22., 22., 21., 19., 17., 16., 15., 15., 15., 15.])
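The reshaping loop above can be sanity-checked on synthetic data (a standalone sketch, not part of the pipeline — the values are just row indices so we can see where each one lands):

```python
import numpy as np

# Toy stand-in for raw_data: 12 months x 20 days x 18 rows, 24 hourly columns.
raw = np.arange(12 * 20 * 18 * 24, dtype=float).reshape(12 * 20 * 18, 24)

month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24:(day + 1) * 24] = \
            raw[18 * (20 * month + day):18 * (20 * month + day + 1), :]
    month_data[month] = sample

assert month_data[0].shape == (18, 480)
# Hour 24 of feature 0 in month 0 is hour 0 of day 1, i.e. raw row 18, column 0.
assert month_data[0][0, 24] == raw[18, 0]
```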

Extract Features (2)


Each month has 480 hours. In the original assignment every 9 consecutive hours form one sample, giving 480 - 9 = 471 samples per month (471 * 12 in total), each with 9 * 18 features. With the 8-hour modification used here, each month yields 480 - 8 = 472 samples, each with 8 * 18 = 144 features.

The corresponding targets — the PM2.5 of the hour immediately after each window — number 472 * 12.
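The window count above can be checked with a quick standalone sketch:

```python
# Each start hour h uses hours h..h+7 as input and hour h+8 as the target,
# so h + 8 must still be a valid hour index (<= 479).
hours_per_month = 480
window = 8  # input hours (the 8-hour modification; the original assignment uses 9)

starts = [h for h in range(hours_per_month) if h + window <= hours_per_month - 1]
assert len(starts) == 472  # 472 samples per month -> 12 * 472 overall
```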

np.set_printoptions(suppress = True)
x = np.empty([12 * 472, 18 * 8], dtype = float)
y = np.empty([12 * 472, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 15:  # last valid start hour is 471 = 19 * 24 + 15; the original `hour > 14` left row 471 of each month unfilled (the zero rows in the printout below)
                continue
            x[month * 472 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour : day * 24 + hour + 8].reshape(1, -1)  # vector dim: 18 * 8 = 144
            y[month * 472 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 8]  # target: PM2.5 (row 9) of the following hour
print(x)
print(y)

[[14.  14.  14.  ...  1.9  2.   2. ]
 [14.  14.  13.  ...  2.   2.   0.5]
 [14.  13.  12.  ...  2.   0.5  0.3]
 ...
 [18.  19.  18.  ...  1.1  1.4  1.3]
 [19.  18.  17.  ...  1.4  1.3  1.6]
 [ 0.   0.   0.  ...  0.   0.   0. ]]
[[19.]
 [30.]
 [41.]
 ...
 [17.]
 [24.]
 [ 0.]]
x[25]  # sample 25: 144 values = 18 features * 8 hours, feature-major
array([ 15.  ,  15.  ,  14.  ,  14.  ,  15.  ,  16.  ,  16.  ,  17.  ,
         1.8 ,   1.8 ,   1.8 ,   1.8 ,   1.8 ,   1.8 ,   1.8 ,   1.8 ,
         0.25,   0.28,   0.27,   0.24,   0.26,   0.34,   0.56,   0.79,
         0.05,   0.06,   0.05,   0.05,   0.07,   0.09,   0.19,   0.31,
         1.1 ,   1.3 ,   1.  ,   1.2 ,   1.1 ,   1.6 ,   8.4 ,  17.  ,
         3.2 ,   3.3 ,   3.1 ,   3.1 ,   4.3 ,   9.4 ,  19.  ,  26.  ,
         4.3 ,   4.7 ,   4.1 ,   4.3 ,   5.5 ,  11.  ,  27.  ,  43.  ,
        38.  ,  39.  ,  39.  ,  34.  ,  31.  ,  30.  ,  18.  ,  17.  ,
        34.  ,  31.  ,  16.  ,  18.  ,   8.  ,  16.  ,  24.  ,  37.  ,
        23.  ,  30.  ,  30.  ,  22.  ,  18.  ,  13.  ,  13.  ,  11.  ,
         0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,   0.  ,
        66.  ,  70.  ,  71.  ,  70.  ,  67.  ,  59.  ,  60.  ,  55.  ,
         0.8 ,   0.9 ,   0.8 ,   0.8 ,   0.7 ,   0.8 ,   1.2 ,   3.3 ,
         1.8 ,   1.8 ,   1.8 ,   1.8 ,   1.8 ,   1.8 ,   2.  ,   2.1 ,
       117.  , 113.  , 110.  , 100.  ,  64.  ,  80.  ,  88.  ,  50.  ,
       113.  , 115.  , 102.  ,  87.  ,  79.  , 252.  ,  90.  , 216.  ,
         2.7 ,   2.9 ,   2.3 ,   1.5 ,   1.  ,   0.8 ,   4.  ,   1.  ,
         2.4 ,   2.8 ,   2.7 ,   2.  ,   0.5 ,   0.8 ,   1.5 ,   0.7 ])

Normalize (1)

mean_x = np.mean(x, axis = 0)  # 18 * 8
std_x = np.std(x, axis = 0)  # 18 * 8
for i in range(len(x)):  # 12 * 472
    for j in range(len(x[0])):  # 18 * 8
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]

len(x)  # 5664 = 12 * 472
5664
len(x[0])  # 144 = 18 * 8
144
tag = x[0]  # keep the first normalized training sample for later comparison

Split Training Data into "train_set" and "validation_set"

This is a simple demonstration for report questions 2 and 3: it produces a train_set used for training in the comparison, and a validation_set that is never trained on and used only for validation.
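A hedged sketch of how such a split estimates generalization error; closed-form least squares stands in here for the gradient-descent training used later, and all data is synthetic:

```python
import numpy as np

rng = np.random.default_rng(0)
X_tr = rng.normal(size=(80, 3))   # training split
X_va = rng.normal(size=(20, 3))   # held-out validation split
w_true = np.array([[1.0], [2.0], [3.0]])
y_tr = X_tr @ w_true              # noiseless targets
y_va = X_va @ w_true

# Fit on the training split only, then score on the untouched validation split.
w_trained, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
rmse = np.sqrt(np.mean((X_va @ w_trained - y_va) ** 2))
assert rmse < 1e-6  # noiseless data, so the fit transfers exactly
```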

import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))
[[-1.33404075 -1.33462207 -1.33500832 ...  0.17629941  0.26946316
   0.26863726]
 [-1.33404075 -1.33462207 -1.49203094 ...  0.26985884  0.26946316
  -1.13432695]
 [-1.33404075 -1.49171996 -1.64905356 ...  0.26985884 -1.1339323
  -1.32138884]
 ...
 [ 0.70895574  0.39345479  0.07819527 ...  1.39257204  0.26946316
  -0.38607937]
 [ 0.39464859  0.07925899  0.07819527 ...  0.26985884 -0.38545472
  -0.38607937]
 [ 0.08034144  0.07925899  0.07819527 ... -0.38505719 -0.38545472
  -0.85373411]]
[[19.]
 [30.]
 [41.]
 ...
 [ 7.]
 [ 5.]
 [14.]]
[[ 0.08034144  0.07925899  0.23521789 ... -0.38505719 -0.85325321
  -0.57314126]
 [ 0.08034144  0.23635689  0.23521789 ... -0.85285436 -0.57257412
   0.5492301 ]
 [ 0.23749501  0.23635689 -0.07882735 ... -0.57217606  0.55014225
  -0.10548653]
 ...
 [-0.70542645 -0.54913259 -0.70691783 ... -0.57217606 -0.29189502
  -0.38607937]
 [-0.54827287 -0.70623048 -0.86394045 ... -0.29149776 -0.38545472
  -0.10548653]
 [-3.53419082 -3.53399261 -3.53332501 ... -1.60132983 -1.60173079
  -1.60198169]]
[[13.]
 [24.]
 [22.]
 ...
 [17.]
 [24.]
 [ 0.]]
4531
4531
1133
1133

Training

(Note: unlike the lecture slides, the code below prints Root Mean Square Error as the loss.)

Because of the constant term, the dimension (dim) needs one extra column; eps is a tiny value added so the adagrad denominator never becomes 0.

Each dimension (dim) has its own gradient and weight (w), learned iteration by iteration (iter_time).
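As a minimal 1-D sketch of the Adagrad update used below (on f(w) = (w - 3)^2, with an assumed learning rate of 1 — toy values, not the homework settings):

```python
import numpy as np

w, lr, eps = 0.0, 1.0, 1e-10
accum = 0.0                        # running sum of squared gradients
for _ in range(1000):
    g = 2 * (w - 3)                # gradient of (w - 3)^2
    accum += g ** 2
    w -= lr * g / np.sqrt(accum + eps)   # Adagrad: per-step scaled learning rate
assert abs(w - 3) < 0.1            # converges close to the minimum at w = 3
```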

dim = 18 * 8 + 1  # 144 features + 1 bias term
w = np.zeros([dim, 1])
x_ = np.concatenate((np.ones([12 * 472, 1]), x), axis = 1).astype(float)  # prepend the bias column
print(len(x_[0]))
print(len(w))
145
145

The prepended column of 1s serves the constant term: for that column x_np * w_p = w_p, so w_p acts as the bias.
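A minimal sketch of why the bias column works (toy numbers, not the homework data):

```python
import numpy as np

X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
w = np.array([[10.0],   # weight for the ones column, i.e. the bias
              [1.0],
              [1.0]])
X_ = np.concatenate((np.ones([2, 1]), X), axis=1)  # prepend the ones column
pred = X_ @ w
# Each prediction is (sum of features) + 10, the bias.
assert np.allclose(pred, np.array([[15.0], [19.0]]))
```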



learning_rate = 200
iter_time = 100000  # the loss printout below runs to iteration 99900
adagrad = np.zeros([dim, 1])
eps = 0.0000000001  # keeps the adagrad denominator nonzero
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x_, w) - y, 2)) / 472 / 12)  # RMSE
    if t % 100 == 0:
        print(str(t) + ":" + str(loss))
    gradient = 2 * np.dot(x_.transpose(), np.dot(x_, w) - y)  # dim * 1
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
w
0:27.04269876504886
100:42.316253745299576
200:31.074186243413685
300:25.537280583796065
400:22.165757929684172
500:19.75254567886226
...
98300:5.688854049407231
98400:5.688854025502277
98500:5.68885400177681
98600:5.6888539782294805
98700:5.6888539548589465
98800:5.688853931663879
98900:5.688853908642958
99000:5.688853885794873
99100:5.688853863118321
99200:5.688853840612015
99300:5.688853818274671
99400:5.68885379610502
99500:5.688853774101799
99600:5.688853752263754
99700:5.6888537305896465
99800:5.68885370907824
99900:5.688853687728311

array([[21.32627119],
       [-0.59770122],
       [ 1.19618126],
       [-1.71854272],
       [-0.00412458],
       [ 1.06928241],
       [-0.36250559],
       [-1.81932088],
       [ 2.10286256],
       [ 0.12625626],
       [ 0.10239455],
       [ 0.04337644],
       [ 0.05290508],
       [-0.0207053 ],
       [-0.25997183],
       [ 0.15053346],
       [ 0.58737417],
       [ 0.01482249],
       [ 0.06264181],
       [-0.16256801],
       [ 0.14738827],
       [-0.02307035],
       [-0.16147156],
       [ 0.07498768],
       [ 0.40381712],
       [ 0.31507161],
       [-0.24065449],
       [ 0.23966163],
       [ 0.38239296],
       [-0.52786859],
       [ 0.0581854 ],
       [ 0.16077347],
       [ 0.06112612],
       [ 0.11976572],
       [ 0.23739248],
       [-1.04772983],
       [-0.47176797],
       [-0.55928703],
       [ 0.0159055 ],
       [-0.54088112],
       [-0.68640924],
       [-0.06704892],
       [-0.1638642 ],
       [-2.79479872],
       [-1.46477859],
       [-0.90830161],
       [-0.4895747 ],
       [-2.71160536],
       [ 0.13711248],
       [-0.38775076],
       [ 0.12351668],
       [ 3.22246928],
       [ 1.65049504],
       [ 1.25627249],
       [ 0.14705583],
       [ 2.61715673],
       [ 1.9780861 ],
       [ 0.1623215 ],
       [-0.16639199],
       [-0.36364337],
       [ 0.01706278],
       [-0.56897905],
       [-0.42067013],
       [-0.17519396],
       [ 1.8326884 ],
       [ 0.36716326],
       [-0.81083076],
       [ 1.0596038 ],
       [-0.10137155],
       [-1.05794694],
       [ 1.02079922],
       [-0.05240129],
       [ 0.91110234],
       [-0.82112217],
       [ 3.55240057],
       [-3.51869125],
       [-0.69883282],
       [ 8.19423616],
       [-9.11851928],
       [ 0.28782482],
       [15.86037477],
       [ 0.00647815],
       [-0.0904938 ],
       [ 0.00927876],
       [-0.1027838 ],
       [ 0.10683582],
       [ 0.03512709],
       [-0.08788882],
       [-0.15754034],
       [ 0.12550395],
       [ 0.37267731],
       [-0.85912138],
       [-0.58034475],
       [ 1.18502834],
       [-1.32860977],
       [ 0.05769156],
       [ 0.6495768 ],
       [ 0.22441773],
       [-0.08086894],
       [-0.12145987],
       [-0.03584999],
       [ 0.10052267],
       [-0.2054299 ],
       [ 0.28685996],
       [ 0.22959199],
       [-0.32587512],
       [ 0.12923094],
       [-0.40514046],
       [-0.175628  ],
       [ 0.37578853],
       [ 0.15293797],
       [-0.35953503],
       [-0.1440169 ],
       [ 0.17412323],
       [-0.06657272],
       [ 0.18199105],
       [ 0.05810032],
       [ 0.16962856],
       [-0.24043043],
       [ 0.10899224],
       [ 0.04145799],
       [-0.09666332],
       [ 0.09489148],
       [-0.18459977],
       [ 0.03905182],
       [-0.00918566],
       [ 0.00566057],
       [-0.24082881],
       [ 0.02100525],
       [-0.14123642],
       [ 0.19429907],
       [ 0.0552319 ],
       [-0.11646291],
       [-0.08053281],
       [-0.0443063 ],
       [-0.0314302 ],
       [-0.13354943],
       [ 0.0625202 ],
       [-0.10887952],
       [-0.21179497],
       [-0.02862832],
       [ 0.34464294],
       [-0.02245315],
       [-0.29397586],
       [ 0.18813077]])

Testing


Load the test data and process it analogously to the training data (preprocessing and feature extraction), producing 240 samples of dimension 18 * 8 + 1 (the original assignment uses 18 * 9 + 1).

# testdata = pd.read_csv('gdrive/My Drive/hw1-regression/test.csv', header = None, encoding = 'big5')
testdata = pd.read_csv('data/test.csv', header = None, encoding = 'big5')
test_data = testdata.iloc[:, 2:].copy()  # .copy() avoids the SettingWithCopyWarning printed below
testdata[:20]
(DataFrame preview, flattened here: columns 0–10 are the sample id (id_0, id_1, …), the item name, and 9 hourly values. Rows cycle through the same 18 items per id; RAINFALL entries are 'NR'.)
test_data[:20]
(Same preview with the first two columns dropped: only the 9 hourly value columns, labeled 2–10, remain.)

test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18*8], dtype = float)
test_y = np.empty([240, 1], dtype = float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18 * (i + 1), :8].reshape(1, -1)
    test_y[i, :] = test_data[18 * i + 9, 8]  # PM2.5 is row offset 9 within each 18-row sample; the original `9 * (i+1)` only indexes the right row for i == 0
test_x[0]
test_y[0]
/root/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
/root/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3414: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self._where(-key, value, inplace=True)





array([4.])

# Recompute mean and std on the test set (note: strictly, the training-set
# mean_x and std_x should be reused here instead of recomputing)
mean_x = np.mean(test_x, axis = 0)  # 18 * 8
std_x = np.std(test_x, axis = 0)  # 18 * 8

for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]

test_x[0]
array([-0.36400913, -0.37580621, -0.58294053, -0.6039988 , -0.79754419,
       -0.77993183, -0.77134213, -0.93924642, -0.23381365, -0.20614316,
       -0.178965  , -0.19738551, -0.20112982, -0.17849306, -0.21800613,
       -0.1462465 , -0.02243175, -0.21006416, -0.19137731,  0.00237043,
        0.64996798,  0.8347128 , -0.24763691, -0.26774851,  0.12113192,
        0.72194031,  0.62642991,  1.1193606 ,  1.07661245,  0.99170142,
        1.14457532,  1.59800227, -0.41421696, -0.37903818, -0.44066817,
       -0.3619202 , -0.40277453, -0.27806221, -0.42701048, -0.42302961,
        0.93257409,  0.38666369,  0.29567325,  0.40431516,  0.932154  ,
        1.38609225, -0.35672738, -0.21797962,  0.66821673,  0.27394276,
        0.11390661,  0.16298197,  0.74432972,  1.04369186, -0.44653418,
       -0.37384103, -0.01811478, -0.05063733, -0.10172555, -0.35003493,
       -0.83314626, -1.02814938, -0.37646249, -0.70842781,  0.35455966,
        0.00370494, -0.20993074, -0.35090948, -0.3798995 , -0.57390832,
       -0.14049582, -0.48849244,  0.27232352,  0.56559302,  0.54673056,
       -0.10839156, -0.44431028, -0.46020247, -0.50975521, -0.91310287,
       -0.11549831, -0.15822222, -0.14537882, -0.19986573, -0.12898544,
       -0.12412819, -0.1111852 , -0.13155091,  0.7839987 ,  0.94178976,
        1.11502694,  1.11585946,  1.03145997,  0.94421659,  0.44932091,
        0.67745778, -0.66448801, -0.69458922, -0.8241389 , -0.76852537,
       -0.60883941, -0.3483424 , -0.81324231, -0.52086313, -0.39850542,
        0.13751952,  0.20242608,  0.79950013,  0.80387309,  0.83182892,
        0.89047392,  0.97892759, -1.25158954, -1.32292147, -1.08841118,
       -1.13878394, -1.23022993, -1.13992527, -1.05895382, -0.98101608,
       -1.25641454, -1.38157623, -1.0202544 , -1.36540596, -1.26620751,
       -0.68274526, -1.44856806, -0.69165696, -0.71738483, -0.74234011,
       -0.60566186, -0.6923032 , -0.59214802, -0.56088831, -0.96087669,
       -0.36666637, -0.6863133 , -0.77984735, -0.76161203, -0.75062533,
       -0.49862637, -1.04516832, -0.70121844, -1.02610681])
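Since the cell above recomputes mean/std on the test set, here is a sketch of the stricter alternative — normalizing test data with the training-set statistics (toy numbers):

```python
import numpy as np

train = np.array([[1.0, 10.0], [3.0, 30.0]])
test = np.array([[2.0, 20.0]])

mu = train.mean(axis=0)            # statistics come from the TRAINING data only
sigma = train.std(axis=0)
sigma_safe = np.where(sigma == 0, 1, sigma)  # guard constant columns

test_norm = (test - mu) / sigma_safe
assert np.allclose(test_norm, [[0.0, 0.0]])  # test point sits at the training mean
```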
tag  # the first normalized training sample saved earlier, for comparison
len(test_x[0])  # 144 before the bias column
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)  # 240 * 145

Prediction

(The illustration is the same as above.)

With the weights and the test data we can predict the targets.

w = np.load('weight.npy')
w
array([[21.32627119],
       [-0.59770122],
       [ 1.19618126],
       [-1.71854272],
       [-0.00412458],
       [ 1.06928241],
       [-0.36250559],
       [-1.81932088],
       [ 2.10286256],
       [ 0.12625626],
       [ 0.10239455],
       [ 0.04337644],
       [ 0.05290508],
       [-0.0207053 ],
       [-0.25997183],
       [ 0.15053346],
       [ 0.58737417],
       [ 0.01482249],
       [ 0.06264181],
       [-0.16256801],
       [ 0.14738827],
       [-0.02307035],
       [-0.16147156],
       [ 0.07498768],
       [ 0.40381712],
       [ 0.31507161],
       [-0.24065449],
       [ 0.23966163],
       [ 0.38239296],
       [-0.52786859],
       [ 0.0581854 ],
       [ 0.16077347],
       [ 0.06112612],
       [ 0.11976572],
       [ 0.23739248],
       [-1.04772983],
       [-0.47176797],
       [-0.55928703],
       [ 0.0159055 ],
       [-0.54088112],
       [-0.68640924],
       [-0.06704892],
       [-0.1638642 ],
       [-2.79479872],
       [-1.46477859],
       [-0.90830161],
       [-0.4895747 ],
       [-2.71160536],
       [ 0.13711248],
       [-0.38775076],
       [ 0.12351668],
       [ 3.22246928],
       [ 1.65049504],
       [ 1.25627249],
       [ 0.14705583],
       [ 2.61715673],
       [ 1.9780861 ],
       [ 0.1623215 ],
       [-0.16639199],
       [-0.36364337],
       [ 0.01706278],
       [-0.56897905],
       [-0.42067013],
       [-0.17519396],
       [ 1.8326884 ],
       [ 0.36716326],
       [-0.81083076],
       [ 1.0596038 ],
       [-0.10137155],
       [-1.05794694],
       [ 1.02079922],
       [-0.05240129],
       [ 0.91110234],
       [-0.82112217],
       [ 3.55240057],
       [-3.51869125],
       [-0.69883282],
       [ 8.19423616],
       [-9.11851928],
       [ 0.28782482],
       [15.86037477],
       [ 0.00647815],
       [-0.0904938 ],
       [ 0.00927876],
       [-0.1027838 ],
       [ 0.10683582],
       [ 0.03512709],
       [-0.08788882],
       [-0.15754034],
       [ 0.12550395],
       [ 0.37267731],
       [-0.85912138],
       [-0.58034475],
       [ 1.18502834],
       [-1.32860977],
       [ 0.05769156],
       [ 0.6495768 ],
       [ 0.22441773],
       [-0.08086894],
       [-0.12145987],
       [-0.03584999],
       [ 0.10052267],
       [-0.2054299 ],
       [ 0.28685996],
       [ 0.22959199],
       [-0.32587512],
       [ 0.12923094],
       [-0.40514046],
       [-0.175628  ],
       [ 0.37578853],
       [ 0.15293797],
       [-0.35953503],
       [-0.1440169 ],
       [ 0.17412323],
       [-0.06657272],
       [ 0.18199105],
       [ 0.05810032],
       [ 0.16962856],
       [-0.24043043],
       [ 0.10899224],
       [ 0.04145799],
       [-0.09666332],
       [ 0.09489148],
       [-0.18459977],
       [ 0.03905182],
       [-0.00918566],
       [ 0.00566057],
       [-0.24082881],
       [ 0.02100525],
       [-0.14123642],
       [ 0.19429907],
       [ 0.0552319 ],
       [-0.11646291],
       [-0.08053281],
       [-0.0443063 ],
       [-0.0314302 ],
       [-0.13354943],
       [ 0.0625202 ],
       [-0.10887952],
       [-0.21179497],
       [-0.02862832],
       [ 0.34464294],
       [-0.02245315],
       [-0.29397586],
       [ 0.18813077]])

ans_y = np.dot(test_x, w)
print(ans_y[:10])
print(test_y[:10])
[[ 3.49897912]
 [10.71431377]
 [24.25967974]
 [ 4.60526241]
 [25.49323832]
 [15.20700794]
 [18.56378292]
 [28.2902962 ]
 [22.1783548 ]
 [46.30543741]]
[[ 4.]
 [13.]
 [18.]
 [28.]
 [26.]
 [14.]
 [ 4.]
 [31.]
 [30.]
 [20.]]
sum(abs(ans_y - test_y)) / len(ans_y)  # mean absolute error against the 9th-hour PM2.5 targets
array([16.0961944])
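The cell above reports MAE, while training minimized RMSE. A quick sketch of both on toy numbers (RMSE always upper-bounds MAE):

```python
import numpy as np

ans = np.array([[3.5], [10.7], [24.3]])     # toy predictions
truth = np.array([[4.0], [13.0], [18.0]])   # toy targets

mae = np.mean(np.abs(ans - truth))
rmse = np.sqrt(np.mean((ans - truth) ** 2))
assert rmse >= mae   # RMSE weights large errors more heavily
```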

Save Prediction to CSV File

import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)

For related references, see:

Adagrad :
https://youtu.be/yKKNr-QKz2Q?list=PLJV_el3uVTsPy9oCRY30oBPNLCo89yu49&t=705

RMSprop :
https://www.youtube.com/watch?v=5Yt-obwvMHI

Adam :
https://www.youtube.com/watch?v=JXQT_vxqwIs

The print statements above are mainly for inspecting the data and results; removing them is fine. Also, on your own Linux system you can replace the hard-coded file paths with sys.argv (so file names and locations can be supplied on the terminal).

Finally, you can try to beat the baseline by tuning the learning rate, iter_time (number of iterations), the features used (how many hours, which feature columns), or even a different model.

For the report question template, see: https://docs.google.com/document/d/1s84RXs2AEgZr54WCK9IgZrfTF-6B1td-AlKR9oqYa4g/edit

