飞桨李宏毅机器学习特训营 Linear Regression（线性回归）项目实践

最新推荐文章于 2023-08-27 20:18:30 发布

摸鱼高手学ML

最新推荐文章于 2023-08-27 20:18:30 发布

阅读量517

点赞数 2

文章标签：机器学习 python 人工智能

本文链接：https://blog.csdn.net/qq_41690864/article/details/115912962

版权

本文是飞桨李宏毅机器学习训练营的作业1-PM2.5的详细实现的全过程，主要用于涉及线性回归。项目的数据集及项目实现可以直接访问AI Studio中的项目：https://aistudio.baidu.com/aistudio/projectdetail/1834942?channelType=0&channel=0
飞桨李宏毅机器学习特训营的课程链接为：https://aistudio.baidu.com/aistudio/course/introduce/1978

作业1-PM2.5预测

项目描述

本次作业的资料是从行政院环境环保署空气品质监测网所下载的观测资料。
希望大家能在本作业实现 linear regression 预测出 PM2.5 的数值。

数据集介绍

本次作业使用丰原站的观测记录，分成 train set 跟 test set，train set 是丰原站每个月的前 20 天所有资料。test set 则是从丰原站剩下的资料中取样出来。
train.csv: 每个月前 20 天的完整资料。
test.csv : 从剩下的资料当中取样出连续的 10 小时为一笔，前九小时的所有观测数据当作 feature，第十小时的 PM2.5 当作 answer。一共取出 240 笔不重複的 test data，请根据 feature 预测这 240 笔的 PM2.5。
Data 含有 18 项观测数据 AMB_TEMP, CH4, CO, NHMC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR。

项目要求

请手动实现 linear regression，方法限使用 gradient descent。
禁止使用 numpy.linalg.lstsq

数据准备

无

环境配置/安装

!pip install --upgrade pandas

Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already up-to-date: pandas in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.2.3)
Requirement already satisfied, skipping upgrade: numpy>=1.16.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas) (1.20.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas) (2019.3)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas) (2.8.0)
Requirement already satisfied, skipping upgrade: six>=1.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

1 数据准备

1.1 数据导入

import numpy as np
import pandas as pd
data = pd.read_csv('work/hw1_data/train.csv', encoding = 'big5')  # 使用'big5'进行编码才能让文件中的中文不乱码 big5为繁体字编码

print(data.head(15))        # 查看数据
print(data.shape)  # 查看数据大小

          日期  測站        測項     0     1     2     3     4     5     6  ...  \
0   2014/1/1  豐原  AMB_TEMP    14    14    14    13    12    12    12  ...   
1   2014/1/1  豐原       CH4   1.8   1.8   1.8   1.8   1.8   1.8   1.8  ...   
2   2014/1/1  豐原        CO  0.51  0.41  0.39  0.37  0.35   0.3  0.37  ...   
3   2014/1/1  豐原      NMHC   0.2  0.15  0.13  0.12  0.11  0.06   0.1  ...   
4   2014/1/1  豐原        NO   0.9   0.6   0.5   1.7   1.8   1.5   1.9  ...   
5   2014/1/1  豐原       NO2    16   9.2   8.2   6.9   6.8   3.8   6.9  ...   
6   2014/1/1  豐原       NOx    17   9.8   8.7   8.6   8.5   5.3   8.8  ...   
7   2014/1/1  豐原        O3    16    30    27    23    24    28    24  ...   
8   2014/1/1  豐原      PM10    56    50    48    35    25    12     4  ...   
9   2014/1/1  豐原     PM2.5    26    39    36    35    31    28    25  ...   
10  2014/1/1  豐原  RAINFALL    NR    NR    NR    NR    NR    NR    NR  ...   
11  2014/1/1  豐原        RH    77    68    67    74    72    73    74  ...   
12  2014/1/1  豐原       SO2   1.8     2   1.7   1.6   1.9   1.4   1.5  ...   
13  2014/1/1  豐原       THC     2     2     2   1.9   1.9   1.8   1.9  ...   
14  2014/1/1  豐原     WD_HR    37    80    57    76   110   106   101  ...   

      14    15    16    17    18    19    20    21    22    23  
0     22    22    21    19    17    16    15    15    15    15  
1    1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8  
2   0.37  0.37  0.47  0.69  0.56  0.45  0.38  0.35  0.36  0.32  
3    0.1  0.13  0.14  0.23  0.18  0.12   0.1  0.09   0.1  0.08  
4    2.5   2.2   2.5   2.3   2.1   1.9   1.5   1.6   1.8   1.5  
5     11    11    22    28    19    12   8.1     7   6.9     6  
6     14    13    25    30    21    13   9.7   8.6   8.7   7.5  
7     65    64    51    34    33    34    37    38    38    36  
8     52    51    66    85    85    63    46    36    42    42  
9     36    45    42    49    45    44    41    30    24    13  
10    NR    NR    NR    NR    NR    NR    NR    NR    NR    NR  
11    47    49    56    67    72    69    70    70    70    69  
12   3.9   4.4   9.9   5.1   3.4   2.3     2   1.9   1.9   1.9  
13   1.9   1.9   1.9   2.1     2   1.9   1.9   1.9   1.9   1.9  
14   307   304   307   124   118   121   113   112   106   110  

[15 rows x 27 columns]
(4320, 27)

1.2 数据预处理

data = data.iloc[:, 3:]      #去除数据中前3列的说明项
data[data=='NR']=0           #将数据中无效项“NR”替换为0
print(data.head(15))
numpy_data=data.to_numpy()   #将数据转换为numpy数组
print(numpy_data.shape)

       0     1     2     3     4     5     6     7     8     9  ...    14  \
0     14    14    14    13    12    12    12    12    15    17  ...    22   
1    1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8  ...   1.8   
2   0.51  0.41  0.39  0.37  0.35   0.3  0.37  0.47  0.78  0.74  ...  0.37   
3    0.2  0.15  0.13  0.12  0.11  0.06   0.1  0.13  0.26  0.23  ...   0.1   
4    0.9   0.6   0.5   1.7   1.8   1.5   1.9   2.2   6.6   7.9  ...   2.5   
5     16   9.2   8.2   6.9   6.8   3.8   6.9   7.8    15    21  ...    11   
6     17   9.8   8.7   8.6   8.5   5.3   8.8   9.9    22    29  ...    14   
7     16    30    27    23    24    28    24    22    21    29  ...    65   
8     56    50    48    35    25    12     4     2    11    38  ...    52   
9     26    39    36    35    31    28    25    20    19    30  ...    36   
10     0     0     0     0     0     0     0     0     0     0  ...     0   
11    77    68    67    74    72    73    74    73    66    56  ...    47   
12   1.8     2   1.7   1.6   1.9   1.4   1.5   1.6   5.1    15  ...   3.9   
13     2     2     2   1.9   1.9   1.8   1.9   1.9   2.1     2  ...   1.9   
14    37    80    57    76   110   106   101   104   124    46  ...   307   

      15    16    17    18    19    20    21    22    23  
0     22    21    19    17    16    15    15    15    15  
1    1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8  
2   0.37  0.47  0.69  0.56  0.45  0.38  0.35  0.36  0.32  
3   0.13  0.14  0.23  0.18  0.12   0.1  0.09   0.1  0.08  
4    2.2   2.5   2.3   2.1   1.9   1.5   1.6   1.8   1.5  
5     11    22    28    19    12   8.1     7   6.9     6  
6     13    25    30    21    13   9.7   8.6   8.7   7.5  
7     64    51    34    33    34    37    38    38    36  
8     51    66    85    85    63    46    36    42    42  
9     45    42    49    45    44    41    30    24    13  
10     0     0     0     0     0     0     0     0     0  
11    49    56    67    72    69    70    70    70    69  
12   4.4   9.9   5.1   3.4   2.3     2   1.9   1.9   1.9  
13   1.9   1.9   2.1     2   1.9   1.9   1.9   1.9   1.9  
14   304   307   124   118   121   113   112   106   110  

[15 rows x 24 columns]
(4320, 24)

# 4320行 = 12个月*每个月20天*每天监测18种  24列 = 24小时
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])  # [18, 480]  480 = 20*24
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = numpy_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]    # 18, 24
    month_data[month] = sample
print(month_data[0])       #查看第一月的数据
print(month_data[0].shape)

[[14.   14.   14.   ... 14.   13.   13.  ]
 [ 1.8   1.8   1.8  ...  1.8   1.8   1.8 ]
 [ 0.51  0.41  0.39 ...  0.34  0.41  0.43]
 ...
 [35.   79.    2.4  ... 48.   63.   53.  ]
 [ 1.4   1.8   1.   ...  1.1   1.9   1.9 ]
 [ 0.5   0.9   0.6  ...  1.2   1.2   1.3 ]]
(18, 480)

1.3 特征提取

每个月480小时，9小时一个data，一个月有480 - 9 = 471个data。一年有471 * 12个，每个data有9 * 18个特征

# 数据
x=np.empty([12*471,18*9],dtype=float)
# pm2.5
y=np.empty([12*471,1],dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            # 如果是最后一天，最后一个包结束，则返回
            if day==19 and hour>14:
                continue
            # 每个小时的18项数据
            x[month*471+day*24+hour,:]=month_data[month][:,day*24+hour:day*24+hour+9].reshape(1,-1)
            # pm值
            y[month*471+day*24+hour,0]=month_data[month][9,day*24+hour+9]
print(x)
print(y)

[[14.  14.  14.  ...  2.   2.   0.5]
 [14.  14.  13.  ...  2.   0.5  0.3]
 [14.  13.  12.  ...  0.5  0.3  0.8]
 ...
 [17.  18.  19.  ...  1.1  1.4  1.3]
 [18.  19.  18.  ...  1.4  1.3  1.6]
 [19.  18.  17.  ...  1.3  1.6  1.8]]
[[30.]
 [41.]
 [44.]
 ...
 [17.]
 [24.]
 [29.]]

1.4 归一化

mean_x = np.mean(x, axis = 0) #18 * 9 平均值
std_x = np.std(x, axis = 0) #18 * 9 方差
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9 
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x.shape    # (12*471,18*9)

(5652, 162)

1.5 训练集和验证集的划分

import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print('x_train:', x_train_set)
print('y_train:', y_train_set)
print('x_validation:', x_validation)
print('y_validation:', y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))

x_train: [[-1.35825331 -1.35883937 -1.359222   ...  0.26650729  0.2656797
  -1.14082131]
 [-1.35825331 -1.35883937 -1.51819928 ...  0.26650729 -1.13963133
  -1.32832904]
 [-1.35825331 -1.51789368 -1.67717656 ... -1.13923451 -1.32700613
  -0.85955971]
 ...
 [ 0.86929969  0.70886668  0.38952809 ...  1.39110073  0.2656797
  -0.39079039]
 [ 0.71018876  0.39075806  0.07157353 ...  0.26650729 -0.39013211
  -0.39079039]
 [ 0.3919669   0.07264944  0.07157353 ... -0.38950555 -0.39013211
  -0.85955971]]
y_train: [[30.]
 [41.]
 [44.]
 ...
 [ 7.]
 [ 5.]
 [14.]]
x_validation: [[ 0.07374504  0.07264944  0.07157353 ... -0.38950555 -0.85856912
  -0.57829812]
 [ 0.07374504  0.07264944  0.23055081 ... -0.85808615 -0.57750692
   0.54674825]
 [ 0.07374504  0.23170375  0.23055081 ... -0.57693779  0.54674191
  -0.1095288 ]
 ...
 [-0.88092053 -0.72262212 -0.56433559 ... -0.57693779 -0.29644471
  -0.39079039]
 [-0.7218096  -0.56356781 -0.72331287 ... -0.29578943 -0.39013211
  -0.1095288 ]
 [-0.56269867 -0.72262212 -0.88229015 ... -0.38950555 -0.10906991
   0.07797893]]
y_validation: [[13.]
 [24.]
 [22.]
 ...
 [17.]
 [24.]
 [29.]]
4521
4521
1131
1131

2.训练

# 线性拟合 y = w1*x1 + w2*x2 + ... + w152*x152 + w153   w153为常数项
dim = 18 * 9 + 1           #153个权重系数
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)  #数组拼接 (5652, 163)
learning_rate = 100
iter_time = 2000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001        # 防止被除数为0
Loss = []
for t in range(iter_time):
    # 平方差
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse
    Loss.append(loss)
    # 每100次输出一次
    if(t%100==0):
        print(str(t) + ":" + str(loss))
    # 梯度
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) #dim*1 transpose()调换数组的行列索引 类似于求矩阵转置
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)

0:27.071214829194115
100:33.78905859777454
200:19.913751298197095
300:13.531068193689693
400:10.645466158446172
500:9.277353455475065
600:8.518042045956502
700:8.014061987588425
800:7.636756824775692
900:7.336563740371125
1000:7.090968643947219
1100:6.8873114803241
1200:6.717116295730694
1300:6.574102121171868
1400:6.453381172520268
1500:6.351062466046933
1600:6.26401228776637
1700:6.189688383453206
1800:6.126016764546804
1900:6.071297135545988

import matplotlib.pyplot as plt
%matplotlib inline   
plt.plot(Loss)
plt.xlabel('Iter_time')
plt.ylabel('Loss')

Text(0,0.5,'Loss')

在这里插入图片描述

3 测试

test_data = pd.read_csv('work/hw1_data/test.csv',header=None, encoding="big5")

test_data.head(20)

	0	1	2	3	4	5	6	7	8	9	10
0	id_0	AMB_TEMP	21	21	20	20	19	19	19	18	17
1	id_0	CH4	1.7	1.7	1.7	1.7	1.7	1.7	1.7	1.7	1.8
2	id_0	CO	0.39	0.36	0.36	0.4	0.53	0.55	0.34	0.31	0.23
3	id_0	NMHC	0.16	0.24	0.22	0.27	0.27	0.26	0.27	0.29	0.1
4	id_0	NO	1.3	1.3	1.3	1.3	1.4	1.6	1.2	1.1	0.9
5	id_0	NO2	17	14	13	14	18	21	8.9	9.4	5
6	id_0	NOx	18	16	14	15	20	23	10	10	5.8
7	id_0	O3	32	31	31	26	16	12	27	20	26
8	id_0	PM10	62	50	44	39	38	32	48	36	25
9	id_0	PM2.5	33	39	39	25	18	18	17	9	4
10	id_0	RAINFALL	NR	NR	NR	NR	NR	NR	NR	NR	NR
11	id_0	RH	83	85	87	87	86	85	78	81	80
12	id_0	SO2	2	1.8	1.8	1.8	2.1	2.6	2	2.3	2.4
13	id_0	THC	1.8	1.9	1.9	2	2	2	2	2	1.9
14	id_0	WD_HR	58	53	67	59	59	73	79	82	104
15	id_0	WIND_DIREC	57	44	73	44	56	115	45	107	103
16	id_0	WIND_SPEED	1.4	1.3	1.5	1.4	1.6	1.6	1.2	1.8	2.3
17	id_0	WS_HR	1	0.9	0.9	0.9	1.2	0.7	1	0.6	1.8
18	id_1	AMB_TEMP	14	13	13	13	13	13	13	12	13
19	id_1	CH4	1.8	1.8	1.8	1.8	1.8	1.8	1.7	1.7	1.8

test_data=test_data.iloc[:, 2:]
test_data[test_data=='NR'] = 0
print(test_data.head(20))
print(test_data.shape)

      2     3     4     5     6     7     8     9     10
0     21    21    20    20    19    19    19    18    17
1    1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.8
2   0.39  0.36  0.36   0.4  0.53  0.55  0.34  0.31  0.23
3   0.16  0.24  0.22  0.27  0.27  0.26  0.27  0.29   0.1
4    1.3   1.3   1.3   1.3   1.4   1.6   1.2   1.1   0.9
5     17    14    13    14    18    21   8.9   9.4     5
6     18    16    14    15    20    23    10    10   5.8
7     32    31    31    26    16    12    27    20    26
8     62    50    44    39    38    32    48    36    25
9     33    39    39    25    18    18    17     9     4
10     0     0     0     0     0     0     0     0     0
11    83    85    87    87    86    85    78    81    80
12     2   1.8   1.8   1.8   2.1   2.6     2   2.3   2.4
13   1.8   1.9   1.9     2     2     2     2     2   1.9
14    58    53    67    59    59    73    79    82   104
15    57    44    73    44    56   115    45   107   103
16   1.4   1.3   1.5   1.4   1.6   1.6   1.2   1.8   2.3
17     1   0.9   0.9   0.9   1.2   0.7     1   0.6   1.8
18    14    13    13    13    13    13    13    12    13
19   1.8   1.8   1.8   1.8   1.8   1.8   1.7   1.7   1.8
(4320, 9)

test_data=test_data.to_numpy()
# 240个记录,18*9
test_x=np.empty([240,18*9], dtype=float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)   # reshape(1,-1)将数据转换为一行 列数自动求解
# 归一化
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x

array([[ 1.        , -0.24447681, -0.24545919, ..., -0.67065391,
        -1.04594393,  0.07797893],
       [ 1.        , -1.35825331, -1.51789368, ...,  0.17279117,
        -0.10906991, -0.48454426],
       [ 1.        ,  1.5057434 ,  1.34508393, ..., -1.32666675,
        -1.04594393, -0.57829812],
       ...,
       [ 1.        ,  0.3919669 ,  0.54981237, ...,  0.26650729,
        -0.20275731,  1.20302531],
       [ 1.        , -1.8355861 , -1.8360023 , ..., -1.04551839,
        -1.13963133, -1.14082131],
       [ 1.        , -1.35825331, -1.35883937, ...,  2.98427476,
         3.26367657,  1.76554849]])

4.预测

w=np.load("weight.npy")
ans_y=np.dot(test_x,w)
ans_y

5.保存

import csv
with open("submit.csv", mode="w",newline='') as submit_file:
    csv_writer=csv.writer(submit_file)
    header=['id','value']
    csv_writer.writerow(header)
    for i in range(240):
        row=["id_" +str (i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)