Hung-yi Lee 2020 Machine Learning - Homework 1: Regression
0 References
- 2020 Machine Learning course homepage: https://speech.ee.ntu.edu.tw/~hylee/ml/2020-spring.html
(contains the videos, slides, and assignments with requirements and reference code)
- Videos: https://www.bilibili.com/video/BV1JE411g7XF
- Blogs:
(1) https://mrsuncodes.github.io/2020/03/15/%E6%9D%8E%E5%AE%8F%E6%AF%85%E6%9C%BA%E5%99%A8%E5%AD%A6%E4%B9%A0-%E7%AC%AC%E4%B8%80%E8%AF%BE%E4%BD%9C%E4%B8%9A/
(2) https://www.cnblogs.com/HL-space/p/10676637.html
1 Import the required packages
(I forget which blog this package summary was taken from; if anyone recognizes it, let me know.)
- sys: provides access to variables used or maintained by the interpreter, and to functions that interact closely with the interpreter
- pandas: a powerful toolkit for analyzing structured data
- numpy: a Python extension library supporting large multi-dimensional arrays and matrix operations
- math: standard mathematical functions
import sys
import pandas as pd
import numpy as np
2 Data preprocessing
2.1 train.csv
Read the data with pandas' read_csv and store it in data; the first row of the table is consumed automatically as the header.
data = pd.read_csv('D:/MyProgram/Python/forJUPYTER/data/train.csv', encoding='big5')
The first three columns of each row are the date, the station, and the measured item, so each day's 24 hourly values start from the fourth column; extract them with iloc.
data = data.iloc[:, 3:] # iloc: integer-position indexing; keep columns 3 onward
print(data)
0 1 2 3 4 5 6 7 8 9 ... 14 \
0 14 14 14 13 12 12 12 12 15 17 ... 22
1 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 ... 1.8
2 0.51 0.41 0.39 0.37 0.35 0.3 0.37 0.47 0.78 0.74 ... 0.37
3 0.2 0.15 0.13 0.12 0.11 0.06 0.1 0.13 0.26 0.23 ... 0.1
4 0.9 0.6 0.5 1.7 1.8 1.5 1.9 2.2 6.6 7.9 ... 2.5
... ... ... ... ... ... ... ... ... ... ... ... ...
4315 1.8 1.8 1.8 1.8 1.8 1.7 1.7 1.8 1.8 1.8 ... 1.8
4316 46 13 61 44 55 68 66 70 66 85 ... 59
4317 36 55 72 327 74 52 59 83 106 105 ... 18
4318 1.9 2.4 1.9 2.8 2.3 1.9 2.1 3.7 2.8 3.8 ... 2.3
4319 0.7 0.8 1.8 1 1.9 1.7 2.1 2 2 1.7 ... 1.3
15 16 17 18 19 20 21 22 23
0 22 21 19 17 16 15 15 15 15
1 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8
2 0.37 0.47 0.69 0.56 0.45 0.38 0.35 0.36 0.32
3 0.13 0.14 0.23 0.18 0.12 0.1 0.09 0.1 0.08
4 2.2 2.5 2.3 2.1 1.9 1.5 1.6 1.8 1.5
... ... ... ... ... ... ... ... ... ...
4315 1.8 2 2.1 2 1.9 1.9 1.9 2 2
4316 308 327 21 100 109 108 114 108 109
4317 311 52 54 121 97 107 118 100 105
4318 2.6 1.3 1 1.5 1 1.7 1.5 2 2
4319 1.7 0.7 0.4 1.1 1.4 1.3 1.6 1.8 2
[4320 rows x 24 columns]
A RAINFALL value of NR means no rainfall, so replace NR with 0 to simplify later processing.
data[data == 'NR'] = 0
print(data)
0 1 2 3 4 5 6 7 8 9 ... 14 \
0 14 14 14 13 12 12 12 12 15 17 ... 22
1 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 ... 1.8
2 0.51 0.41 0.39 0.37 0.35 0.3 0.37 0.47 0.78 0.74 ... 0.37
3 0.2 0.15 0.13 0.12 0.11 0.06 0.1 0.13 0.26 0.23 ... 0.1
4 0.9 0.6 0.5 1.7 1.8 1.5 1.9 2.2 6.6 7.9 ... 2.5
... ... ... ... ... ... ... ... ... ... ... ... ...
4315 1.8 1.8 1.8 1.8 1.8 1.7 1.7 1.8 1.8 1.8 ... 1.8
4316 46 13 61 44 55 68 66 70 66 85 ... 59
4317 36 55 72 327 74 52 59 83 106 105 ... 18
4318 1.9 2.4 1.9 2.8 2.3 1.9 2.1 3.7 2.8 3.8 ... 2.3
4319 0.7 0.8 1.8 1 1.9 1.7 2.1 2 2 1.7 ... 1.3
15 16 17 18 19 20 21 22 23
0 22 21 19 17 16 15 15 15 15
1 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8 1.8
2 0.37 0.47 0.69 0.56 0.45 0.38 0.35 0.36 0.32
3 0.13 0.14 0.23 0.18 0.12 0.1 0.09 0.1 0.08
4 2.2 2.5 2.3 2.1 1.9 1.5 1.6 1.8 1.5
... ... ... ... ... ... ... ... ... ...
4315 1.8 2 2.1 2 1.9 1.9 1.9 2 2
4316 308 327 21 100 109 108 114 108 109
4317 311 52 54 121 97 107 118 100 105
4318 2.6 1.3 1 1.5 1 1.7 1.5 2 2
4319 1.7 0.7 0.4 1.1 1.4 1.3 1.6 1.8 2
[4320 rows x 24 columns]
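As an aside, the same NR cleanup can be done without boolean masking, via pandas' `replace` plus a numeric conversion; this is a sketch on a tiny frame mimicking the layout:

```python
import pandas as pd

# Tiny frame mimicking the layout: string readings with 'NR' in the RAINFALL row
df = pd.DataFrame([['1.8', '1.8'], ['NR', 'NR'], ['0.51', '0.41']])

# Replace the 'NR' marker with 0, then coerce every column to float
df = df.replace('NR', 0).apply(pd.to_numeric)
```

This also gives you numeric dtypes right away, so the later `to_numpy()` yields floats instead of strings.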
Then convert the DataFrame into a 2-D NumPy matrix and store it in raw_data.
raw_data = data.to_numpy()
print('raw_data:', raw_data)
print('raw_data shape:', raw_data.shape)
raw_data: [['14' '14' '14' ... '15' '15' '15']
['1.8' '1.8' '1.8' ... '1.8' '1.8' '1.8']
['0.51' '0.41' '0.39' ... '0.35' '0.36' '0.32']
...
['36' '55' '72' ... '118' '100' '105']
['1.9' '2.4' '1.9' ... '1.5' '2' '2']
['0.7' '0.8' '1.8' ... '1.6' '1.8' '2']]
raw_data shape: (4320, 24)
Group the data by month: each month's 20 days are packed into one $18\times(24\times20)$ matrix (18 features by 480 hours), giving 12 such matrices in total.
month_data = {}
for month in range(12):
    sample = np.empty([18, 24*20])
    for day in range(20):
        sample[:, day * 24 : (day+1) * 24] = raw_data[18*(20*month+day):18*(20*month+day+1), :]
    month_data[month] = sample
print('month_data:', month_data)
month_data: {0: array([[14. , 14. , 14. , ..., 14. , 13. , 13. ],
[ 1.8 , 1.8 , 1.8 , ..., 1.8 , 1.8 , 1.8 ],
[ 0.51, 0.41, 0.39, ..., 0.34, 0.41, 0.43],
...,
[35. , 79. , 2.4 , ..., 48. , 63. , 53. ],
[ 1.4 , 1.8 , 1. , ..., 1.1 , 1.9 , 1.9 ],
[ 0.5 , 0.9 , 0.6 , ..., 1.2 , 1.2 , 1.3 ]]), 1: array([[ 15. , 14. , 14. , ..., 8.4 , 8. , 7.6 ],
[ 1.8 , 1.8 , 1.7 , ..., 1.7 , 1.7 , 1.7 ],
[ 0.27, 0.26, 0.25, ..., 0.36, 0.35, 0.32],
...,
[113. , 109. , 104. , ..., 72. , 65. , 69. ],
[ 2.3 , 2.2 , 2.6 , ..., 1.9 , 2.9 , 1.5 ],
[ 2.5 , 2.2 , 2.2 , ..., 0.9 , 1.6 , 1.1 ]]), 2: array([[ 18. , 18. , 18. , ..., 14. , 13. , 13. ],
[ 1.8 , 1.8 , 1.8 , ..., 1.8 , 1.8 , 1.8 ],
[ 0.39, 0.36, 0.4 , ..., 0.42, 0.47, 0.49],
...,
[103. , 128. , 115. , ..., 60. , 94. , 53. ],
[ 1.7 , 1.4 , 1.8 , ..., 4.2 , 3.5 , 4.3 ],
[ 1.9 , 0.8 , 1.5 , ..., 3.1 , 2.4 , 2.4 ]]), 3: array([[ 19. , 18. , 17. , ..., 24. , 24. , 23. ],
[ 1.7 , 1.7 , 1.7 , ..., 1.8 , 1.8 , 1.9 ],
[ 0.42, 0.42, 0.42, ..., 0.41, 0.46, 0.42],
...,
[308. , 308. , 320. , ..., 331. , 261. , 273. ],
[ 1.7 , 2.2 , 2. , ..., 1. , 1. , 0.8 ],
[ 1.5 , 1.5 , 1.2 , ..., 0.6 , 1.1 , 0.9 ]]), 4: array([[1.90e+01, 1.90e+01, 2.00e+01, ..., 2.60e+01, 2.60e+01, 2.50e+01],
[1.80e+00, 1.80e+00, 1.80e+00, ..., 1.60e+00, 1.60e+00, 1.60e+00],
[4.80e-01, 4.70e-01, 4.50e-01, ..., 1.50e-01, 1.50e-01, 1.30e-01],
...,
[2.90e+02, 6.90e+01, 2.50e+02, ..., 1.74e+02, 1.95e+02, 1.69e+02],
[1.50e+00, 1.90e+00, 1.70e+00, ..., 3.10e+00, 3.10e+00, 2.90e+00],
[4.00e-01, 5.00e-01, 1.00e+00, ..., 2.90e+00, 2.40e+00, 3.10e+00]]), 5: array([[2.60e+01, 2.50e+01, 2.50e+01, ..., 2.70e+01, 2.70e+01, 2.80e+01],
[1.70e+00, 1.70e+00, 1.70e+00, ..., 1.60e+00, 1.60e+00, 1.60e+00],
[3.50e-01, 3.40e-01, 3.40e-01, ..., 2.60e-01, 1.90e-01, 1.60e-01],
...,
[1.18e+02, 1.22e+02, 1.19e+02, ..., 1.16e+02, 1.59e+02, 1.62e+02],
[1.60e+00, 1.40e+00, 1.30e+00, ..., 1.70e+00, 1.00e+00, 2.40e+00],
[1.50e+00, 1.50e+00, 1.30e+00, ..., 1.30e+00, 1.30e+00, 1.70e+00]]), 6: array([[2.60e+01, 2.50e+01, 2.60e+01, ..., 2.80e+01, 2.80e+01, 2.80e+01],
[1.60e+00, 1.60e+00, 1.60e+00, ..., 1.60e+00, 1.60e+00, 1.70e+00],
[1.40e-01, 1.30e-01, 1.30e-01, ..., 3.10e-01, 3.00e-01, 2.70e-01],
...,
[1.06e+02, 1.24e+02, 1.17e+02, ..., 1.27e+02, 1.33e+02, 1.72e+02],
[1.60e+00, 1.80e+00, 1.20e+00, ..., 1.60e+00, 1.40e+00, 1.70e+00],
[2.00e+00, 2.20e+00, 1.70e+00, ..., 1.70e+00, 1.30e+00, 1.60e+00]]), 7: array([[2.80e+01, 2.80e+01, 2.80e+01, ..., 2.60e+01, 2.60e+01, 2.60e+01],
[1.60e+00, 1.60e+00, 1.60e+00, ..., 1.70e+00, 1.70e+00, 1.70e+00],
[2.60e-01, 2.00e-01, 1.60e-01, ..., 1.60e-01, 1.40e-01, 1.30e-01],
...,
[2.04e+02, 1.77e+02, 1.72e+02, ..., 1.68e+02, 1.80e+02, 1.62e+02],
[2.90e+00, 2.80e+00, 2.70e+00, ..., 2.90e+00, 2.80e+00, 2.50e+00],
[3.00e+00, 2.80e+00, 2.70e+00, ..., 3.10e+00, 2.90e+00, 2.50e+00]]), 8: array([[ 25. , 25. , 25. , ..., 26. , 26. , 26. ],
[ 1.7 , 1.7 , 1.7 , ..., 1.6 , 1.6 , 1.7 ],
[ 0.28, 0.27, 0.26, ..., 0.28, 0.24, 0.23],
...,
[ 98. , 109. , 108. , ..., 163. , 71. , 55. ],
[ 1.8 , 1.9 , 1.1 , ..., 1.2 , 1.1 , 0.7 ],
[ 1.4 , 1.9 , 1.7 , ..., 3.4 , 1. , 0.7 ]]), 9: array([[ 25. , 25. , 25. , ..., 23. , 22. , 22. ],
[ 1.7 , 1.7 , 1.7 , ..., 1.8 , 1.7 , 1.7 ],
[ 0.24, 0.26, 0.27, ..., 0.42, 0.35, 0.26],
...,
[ 72. , 100. , 68. , ..., 109. , 110. , 107. ],
[ 1.1 , 1.4 , 1.1 , ..., 2.2 , 2.4 , 2.5 ],
[ 1.8 , 1.2 , 0.9 , ..., 2.1 , 2.2 , 2.3 ]]), 10: array([[ 22. , 21. , 21. , ..., 19. , 18. , 18. ],
[ 1.9 , 1.9 , 1.9 , ..., 1.7 , 1.7 , 1.7 ],
[ 0.79, 0.71, 0.61, ..., 0.36, 0.36, 0.37],
...,
[100. , 117. , 110. , ..., 117. , 117. , 114. ],
[ 1.1 , 1.9 , 1.7 , ..., 2.1 , 2.2 , 1.9 ],
[ 0.7 , 1.1 , 1.2 , ..., 1.8 , 2.1 , 1.9 ]]), 11: array([[ 23. , 23. , 23. , ..., 13. , 13. , 13. ],
[ 1.6 , 1.7 , 1.7 , ..., 1.8 , 1.8 , 1.8 ],
[ 0.22, 0.2 , 0.18, ..., 0.51, 0.57, 0.56],
...,
[ 93. , 50. , 99. , ..., 118. , 100. , 105. ],
[ 1.8 , 2.1 , 3.2 , ..., 1.5 , 2. , 2. ],
[ 1.3 , 0.9 , 1. , ..., 1.6 , 1.8 , 2. ]])}
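Incidentally, the nested loop that builds month_data can be collapsed into a single vectorized reshape, since the rows of raw_data are ordered month → day → feature; a sketch on synthetic data:

```python
import numpy as np

# Synthetic stand-in for raw_data: 12 months x 20 days x 18 features, 24 hours each
raw = np.arange(4320 * 24, dtype=float).reshape(4320, 24)

# Split out (month, day, feature), then glue each month's 20 days
# back together along the hour axis: (12, 18, 480)
stacked = raw.reshape(12, 20, 18, 24).transpose(0, 2, 1, 3).reshape(12, 18, 480)

# Cross-check against the original loop for month 0
sample = np.empty([18, 480])
for day in range(20):
    sample[:, day * 24:(day + 1) * 24] = raw[18 * day:18 * (day + 1), :]
assert np.array_equal(stacked[0], sample)
```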
The task is to predict the 10th hour's PM2.5 from the previous 9 hours, so raw_data is processed further: every 9 consecutive hours form one sample, with the PM2.5 value of the 10th hour as its target.
Since the 20 days within a month are contiguous, each month yields $20\times24-9=471$ samples, each of dimension $18\times9$.
Note that the processed x is a 2-D matrix, not a 3-D one: its shape is $\left(12\times(20\times24-9),\,18\times9\right)$, where each $18\times9$ window is flattened into a row with reshape(1, -1).
x = np.empty([12*(24*20-9), 18*9], dtype=float)
y = np.empty([12*(24*20-9), 1], dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue  # last day: hours 15-23 have no complete 9h window + target within the month
            x[month*(24*20-9) + day*24 + hour, :] = month_data[month][:, day*24+hour : day*24+hour+9].reshape(1, -1)
            y[month*(24*20-9) + day*24 + hour, 0] = month_data[month][9, day*24+hour+9]  # row 9 is PM2.5
print('x:', x)
print('x shape:', x.shape)
print('y:', y)
print('y shape:', y.shape)
x: [[14. 14. 14. ... 2. 2. 0.5]
[14. 14. 13. ... 2. 0.5 0.3]
[14. 13. 12. ... 0.5 0.3 0.8]
...
[17. 18. 19. ... 1.1 1.4 1.3]
[18. 19. 18. ... 1.4 1.3 1.6]
[19. 18. 17. ... 1.3 1.6 1.8]]
x shape: (5652, 162)
y: [[30.]
[41.]
[44.]
...
[17.]
[24.]
[29.]]
y shape: (5652, 1)
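The triple loop that slices out the 9-hour windows can be cross-checked (or replaced) with NumPy's `sliding_window_view` (available since NumPy 1.20); a sketch on one synthetic month:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# One synthetic month: 18 features x 480 hours
sample = np.arange(18 * 480, dtype=float).reshape(18, 480)

# All 9-hour windows along the hour axis: shape (18, 472, 9);
# only the first 471 starts leave a 10th hour to predict
windows = sliding_window_view(sample, 9, axis=1)
x_month = windows[:, :471, :].transpose(1, 0, 2).reshape(471, 18 * 9)
y_month = sample[9, 9:480]  # row 9 is PM2.5; target is the 10th hour

# Cross-check one window against direct slicing
assert np.array_equal(x_month[5], sample[:, 5:14].reshape(-1))
```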
- Data standardization
The most common method is z-score standardization (standardization by the standard deviation): the transformed data has mean 0 and standard deviation 1 (note that this alone does not make it normally distributed).
Note: strictly speaking, the z-score is standardization rather than normalization; normalization is just one special case of standardization. On normalization vs. standardization:
Standardization scales data proportionally so that it falls into a small, specified range. It is common when comparing or weighting indicators: it removes the units, turning values into dimensionless numbers so that indicators with different units or magnitudes can be compared.
Normalization maps data into the interval (0, 1), likewise turning dimensioned expressions into dimensionless ones. It can speed up model convergence, improve accuracy, and help prevent exploding gradients.
The z-score transform is
$$x^*=\frac{x-\mu}{\sigma}$$
where $\mu$ is the mean over all samples and $\sigma$ their standard deviation.
mean_x = np.mean(x, axis=0)
std_x = np.std(x, axis=0)
for i in range(len(x)):
    for j in range(len(x[0])):
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
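The element-wise loop is correct but slow in pure Python; the same z-score can be applied in one shot with broadcasting, guarding the zero-variance columns; a sketch:

```python
import numpy as np

x_demo = np.array([[1.0, 5.0, 2.0],
                   [3.0, 5.0, 4.0],
                   [5.0, 5.0, 6.0]])  # middle column is constant

mean = x_demo.mean(axis=0)
std = x_demo.std(axis=0)
# Replace zero std with 1 so constant columns pass through unchanged
x_norm = (x_demo - mean) / np.where(std == 0, 1, std)
```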
Split the training data into a train_set and a validation_set: train on the train_set, then check the model's performance on the validation_set.
train_set : validation_set = 8 : 2
import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8):, :]
y_validation = y[math.floor(len(y) * 0.8):, :]
print('x_train_set:',x_train_set)
print('len(x_train_set):',len(x_train_set))
print('y_train_set:',y_train_set)
print('len(y_train_set):',len(y_train_set))
print('x_validation',x_validation)
print('len(x_validation):',len(x_validation))
print('y_validation:', y_validation)
print('len(y_validation):',len(y_validation))
x_train_set: [[-1.35825331 -1.35883937 -1.359222 ... 0.26650729 0.2656797
-1.14082131]
[-1.35825331 -1.35883937 -1.51819928 ... 0.26650729 -1.13963133
-1.32832904]
[-1.35825331 -1.51789368 -1.67717656 ... -1.13923451 -1.32700613
-0.85955971]
...
[ 0.86929969 0.70886668 0.38952809 ... 1.39110073 0.2656797
-0.39079039]
[ 0.71018876 0.39075806 0.07157353 ... 0.26650729 -0.39013211
-0.39079039]
[ 0.3919669 0.07264944 0.07157353 ... -0.38950555 -0.39013211
-0.85955971]]
len(x_train_set): 4521
y_train_set: [[30.]
[41.]
[44.]
...
[ 7.]
[ 5.]
[14.]]
len(y_train_set): 4521
x_validation [[ 0.07374504 0.07264944 0.07157353 ... -0.38950555 -0.85856912
-0.57829812]
[ 0.07374504 0.07264944 0.23055081 ... -0.85808615 -0.57750692
0.54674825]
[ 0.07374504 0.23170375 0.23055081 ... -0.57693779 0.54674191
-0.1095288 ]
...
[-0.88092053 -0.72262212 -0.56433559 ... -0.57693779 -0.29644471
-0.39079039]
[-0.7218096 -0.56356781 -0.72331287 ... -0.29578943 -0.39013211
-0.1095288 ]
[-0.56269867 -0.72262212 -0.88229015 ... -0.38950555 -0.10906991
0.07797893]]
len(x_validation): 1131
y_validation: [[13.]
[24.]
[22.]
...
[17.]
[24.]
[29.]]
len(y_validation): 1131
2.2 test.csv
header = None tells read_csv that the file has no header row.
testdata = pd.read_csv('D:/MyProgram/Python/forJUPYTER/data/test.csv', header = None, encoding='big5')
As with the training set, drop the descriptor columns at the start of each row, replace NR with 0, and convert to a NumPy matrix.
testdata = testdata.iloc[:,2:]
testdata[testdata == 'NR'] = 0
test_data = testdata.to_numpy()
print('test_data:', test_data)
print('test_data shape:', test_data.shape)
test_data: [['21' '21' '20' ... '19' '18' '17']
['1.7' '1.7' '1.7' ... '1.7' '1.7' '1.8']
['0.39' '0.36' '0.36' ... '0.34' '0.31' '0.23']
...
['76' '99' '93' ... '98' '97' '65']
['2.2' '3.2' '2.5' ... '5.7' '4.9' '3.6']
['1.7' '2.8' '2.6' ... '4.9' '5.2' '3.6']]
test_data shape: (4320, 9)
The test set contains 240 days of data (240 samples, each 18 features × 9 hours).
test_x = np.empty([240, 18*9], dtype=float)
for i in range(240):
    test_x[i, :] = test_data[18*i:18*(i+1), :].reshape(1, -1)
Standardize the data, reusing the training set's mean_x and std_x.
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis=1).astype(float)
print(test_x)
[[ 1. -0.24447681 -0.24545919 ... -0.67065391 -1.04594393
0.07797893]
[ 1. -1.35825331 -1.51789368 ... 0.17279117 -0.10906991
-0.48454426]
[ 1. 1.5057434 1.34508393 ... -1.32666675 -1.04594393
-0.57829812]
...
[ 1. 0.3919669 0.54981237 ... 0.26650729 -0.20275731
1.20302531]
[ 1. -1.8355861 -1.8360023 ... -1.04551839 -1.13963133
-1.14082131]
[ 1. -1.35825331 -1.35883937 ... 2.98427476 3.26367657
1.76554849]]
3 Model training
Add a bias column.
dim = 18*9+1
w = np.zeros([dim, 1])
x_train_set = np.concatenate((np.ones([len(x_train_set), 1]), x_train_set), axis=1).astype(float)
Define the hyperparameters and variables.
learning_rate = 10 # learning rate
iter_time = 50000 # number of iterations
adagrad = np.zeros([dim, 1]) # gradient accumulator for the Adagrad algorithm
eps = 0.0000000001 # the effective learning rate is learning_rate/sqrt(sum of squared past gradients); that sum sits in the denominator and can be 0 early on, so a tiny eps is added to avoid division by zero
- Loss function: root mean square error (RMSE)
$$\begin{aligned} L(w)&=\sqrt{\frac{1}{n}\sum_{i=0}^{n-1}\left(y_i-\hat{y}_i\right)^2}\\ y_i&=\sum_{j=0}^{m}w^jx_i^j+b=\theta\cdot x_i+b \end{aligned}$$
where $n=12\times(20\times24-9)$, $m=18\times9$, $y_i$ is the $i$-th PM2.5 prediction, and $x_i^j$ is the $j$-th feature of the $i$-th sample.
- Adagrad update:
$$\begin{aligned} w_{t+1}&=w_{t}-\frac{\eta}{\sqrt{\sum_{i=0}^{t}\left(g_i\right)^2}}\,g_{t}\\ g_t&=\frac{\partial L(w_t)}{\partial w_t} \end{aligned}$$
Here,
$$g_t=\frac{\partial}{\partial w_t}\sqrt{\frac{1}{n}\sum_{i=0}^{n-1}\left(w_tx_i+b-\hat{y}_i\right)^2}=\frac{\sum_{i=0}^{n-1}x_i\left(w_tx_i+b-\hat{y}_i\right)}{n\,L(w_t)}$$
for t in range(iter_time):
    loss = np.sqrt(np.sum(np.power(np.dot(x_train_set, w) - y_train_set, 2)) / len(x_train_set))
    if t % 100 == 0:
        print('iteration: %i, loss: %f' % (t, loss))
    gradient = (np.dot(x_train_set.transpose(), np.dot(x_train_set, w) - y_train_set)) / (loss * len(x_train_set))
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
iteration: 0, loss: 27.239592
iteration: 100, loss: 598.991742
iteration: 200, loss: 96.973083
iteration: 300, loss: 240.807182
iteration: 400, loss: 71.607934
iteration: 500, loss: 212.116933
iteration: 600, loss: 117.461546
……
iteration: 49400, loss: 15.226941
iteration: 49500, loss: 15.212356
iteration: 49600, loss: 15.197824
iteration: 49700, loss: 15.183339
iteration: 49800, loss: 15.168902
iteration: 49900, loss: 15.154507
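As a sanity check on gradient descent: plain linear regression has a closed-form optimum, so `np.linalg.lstsq` directly gives the lowest RMSE any number of iterations could reach on the training set. A sketch on synthetic data (the real x_train_set/y_train_set would be passed the same way):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.concatenate([np.ones([200, 1]), rng.normal(size=(200, 5))], axis=1)
true_w = rng.normal(size=(6, 1))
y = X @ true_w + 0.1 * rng.normal(size=(200, 1))

# Least-squares solution: the loss any gradient method converges toward
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
rmse = np.sqrt(np.mean((X @ w_ls - y) ** 2))
```

If the Adagrad loss stays well above this closed-form RMSE, the learning rate or iteration count is the bottleneck, not the model.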
4 Evaluation on the validation set
w = np.load('weight.npy')
x_validation = np.concatenate((np.ones([len(x_validation), 1]),x_validation), axis=1).astype(float)
ans_y = np.dot(x_validation, w)
loss = np.sqrt(np.sum(np.power(ans_y-y_validation, 2))/len(y_validation))
print('loss on the validation set:', loss)
loss on the validation set: 13.145704209176895
5 Prediction on the test set
w = np.load('weight.npy')
ans_y = np.dot(test_x,w)
print('PM2.5 predictions on the test set:', ans_y)
PM2.5 predictions on the test set: [[ 4.18940895]
[21.15516933]
[ 3.36186293]
[ 5.46244159]
[28.63633728]
……
[37.49079383]
[23.43782156]
[10.07970318]
[29.31665506]]
Save the predictions.
import csv
with open('submit.csv', mode='w', newline='') as submit_file:
    csv_writer = csv.writer(submit_file)
    header = ['id', 'value']
    print(header)
    csv_writer.writerow(header)
    for i in range(240):
        row = ['id_' + str(i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)
['id', 'value']
['id_0', 4.18940895110925]
['id_1', 21.155169332936023]
['id_2', 3.361862929935379]
['id_3', 5.46244158523443]
……
['id_237', 23.437821562479797]
['id_238', 10.079703183805389]
['id_239', 29.316655061609495]
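For reference, the same submission file can be written in a couple of lines with pandas (ans_demo below is a stand-in for ans_y):

```python
import numpy as np
import pandas as pd

ans_demo = np.array([[4.19], [21.16], [3.36]])  # stand-in for ans_y

# Build the id/value table and write it without the index column
submit = pd.DataFrame({'id': ['id_' + str(i) for i in range(len(ans_demo))],
                       'value': ans_demo[:, 0]})
submit.to_csv('submit.csv', index=False)
```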
Honestly the result isn't that great; I'm not sure how others managed to get the loss to converge to around five…
Jupyter notebook 👉 extraction code: bul3