飞桨 李宏毅机器学习特训营 Linear Regression(线性回归)项目实践

本文是飞桨李宏毅机器学习训练营的作业1-PM2.5的详细实现的全过程,主要用于涉及线性回归。项目的数据集及项目实现可以直接访问AI Studio中的项目:https://aistudio.baidu.com/aistudio/projectdetail/1834942?channelType=0&channel=0
飞桨 李宏毅机器学习特训营的课程链接为:https://aistudio.baidu.com/aistudio/course/introduce/1978

作业1-PM2.5预测

项目描述

  • 本次作业的资料是从行政院环境环保署空气品质监测网所下载的观测资料。
  • 希望大家能在本作业实现 linear regression 预测出 PM2.5 的数值。

数据集介绍

  • 本次作业使用丰原站的观测记录,分成 train set 跟 test set,train set 是丰原站每个月的前 20 天所有资料。test set 则是从丰原站剩下的资料中取样出来。
  • train.csv: 每个月前 20 天的完整资料。
  • test.csv : 从剩下的资料当中取样出连续的 10 小时为一笔,前九小时的所有观测数据当作 feature,第十小时的 PM2.5 当作 answer。一共取出 240 笔不重複的 test data,请根据 feature 预测这 240 笔的 PM2.5。
  • Data 含有 18 项观测数据 AMB_TEMP, CH4, CO, NHMC, NO, NO2, NOx, O3, PM10, PM2.5, RAINFALL, RH, SO2, THC, WD_HR, WIND_DIREC, WIND_SPEED, WS_HR。

项目要求

  • 请手动实现 linear regression,方法限使用 gradient descent。
  • 禁止使用 numpy.linalg.lstsq

数据准备

环境配置/安装

!pip install --upgrade pandas
Looking in indexes: https://mirror.baidu.com/pypi/simple/
Requirement already up-to-date: pandas in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (1.2.3)
Requirement already satisfied, skipping upgrade: numpy>=1.16.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas) (1.20.1)
Requirement already satisfied, skipping upgrade: pytz>=2017.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas) (2019.3)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.7.3 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from pandas) (2.8.0)
Requirement already satisfied, skipping upgrade: six>=1.5 in /opt/conda/envs/python35-paddle120-env/lib/python3.7/site-packages (from python-dateutil>=2.7.3->pandas) (1.15.0)

1 数据准备

1.1 数据导入

import numpy as np
import pandas as pd
data = pd.read_csv('work/hw1_data/train.csv', encoding = 'big5')  # 使用'big5'进行编码才能让文件中的中文不乱码 big5为繁体字编码
print(data.head(15))        # 查看数据
print(data.shape)  # 查看数据大小
          日期  測站        測項     0     1     2     3     4     5     6  ...  \
0   2014/1/1  豐原  AMB_TEMP    14    14    14    13    12    12    12  ...   
1   2014/1/1  豐原       CH4   1.8   1.8   1.8   1.8   1.8   1.8   1.8  ...   
2   2014/1/1  豐原        CO  0.51  0.41  0.39  0.37  0.35   0.3  0.37  ...   
3   2014/1/1  豐原      NMHC   0.2  0.15  0.13  0.12  0.11  0.06   0.1  ...   
4   2014/1/1  豐原        NO   0.9   0.6   0.5   1.7   1.8   1.5   1.9  ...   
5   2014/1/1  豐原       NO2    16   9.2   8.2   6.9   6.8   3.8   6.9  ...   
6   2014/1/1  豐原       NOx    17   9.8   8.7   8.6   8.5   5.3   8.8  ...   
7   2014/1/1  豐原        O3    16    30    27    23    24    28    24  ...   
8   2014/1/1  豐原      PM10    56    50    48    35    25    12     4  ...   
9   2014/1/1  豐原     PM2.5    26    39    36    35    31    28    25  ...   
10  2014/1/1  豐原  RAINFALL    NR    NR    NR    NR    NR    NR    NR  ...   
11  2014/1/1  豐原        RH    77    68    67    74    72    73    74  ...   
12  2014/1/1  豐原       SO2   1.8     2   1.7   1.6   1.9   1.4   1.5  ...   
13  2014/1/1  豐原       THC     2     2     2   1.9   1.9   1.8   1.9  ...   
14  2014/1/1  豐原     WD_HR    37    80    57    76   110   106   101  ...   

      14    15    16    17    18    19    20    21    22    23  
0     22    22    21    19    17    16    15    15    15    15  
1    1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8  
2   0.37  0.37  0.47  0.69  0.56  0.45  0.38  0.35  0.36  0.32  
3    0.1  0.13  0.14  0.23  0.18  0.12   0.1  0.09   0.1  0.08  
4    2.5   2.2   2.5   2.3   2.1   1.9   1.5   1.6   1.8   1.5  
5     11    11    22    28    19    12   8.1     7   6.9     6  
6     14    13    25    30    21    13   9.7   8.6   8.7   7.5  
7     65    64    51    34    33    34    37    38    38    36  
8     52    51    66    85    85    63    46    36    42    42  
9     36    45    42    49    45    44    41    30    24    13  
10    NR    NR    NR    NR    NR    NR    NR    NR    NR    NR  
11    47    49    56    67    72    69    70    70    70    69  
12   3.9   4.4   9.9   5.1   3.4   2.3     2   1.9   1.9   1.9  
13   1.9   1.9   1.9   2.1     2   1.9   1.9   1.9   1.9   1.9  
14   307   304   307   124   118   121   113   112   106   110  

[15 rows x 27 columns]
(4320, 27)

1.2 数据预处理

data = data.iloc[:, 3:]      #去除数据中前3列的说明项
data[data=='NR']=0           #将数据中无效项“NR”替换为0
print(data.head(15))
numpy_data=data.to_numpy()   #将数据转换为numpy数组
print(numpy_data.shape)
       0     1     2     3     4     5     6     7     8     9  ...    14  \
0     14    14    14    13    12    12    12    12    15    17  ...    22   
1    1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8  ...   1.8   
2   0.51  0.41  0.39  0.37  0.35   0.3  0.37  0.47  0.78  0.74  ...  0.37   
3    0.2  0.15  0.13  0.12  0.11  0.06   0.1  0.13  0.26  0.23  ...   0.1   
4    0.9   0.6   0.5   1.7   1.8   1.5   1.9   2.2   6.6   7.9  ...   2.5   
5     16   9.2   8.2   6.9   6.8   3.8   6.9   7.8    15    21  ...    11   
6     17   9.8   8.7   8.6   8.5   5.3   8.8   9.9    22    29  ...    14   
7     16    30    27    23    24    28    24    22    21    29  ...    65   
8     56    50    48    35    25    12     4     2    11    38  ...    52   
9     26    39    36    35    31    28    25    20    19    30  ...    36   
10     0     0     0     0     0     0     0     0     0     0  ...     0   
11    77    68    67    74    72    73    74    73    66    56  ...    47   
12   1.8     2   1.7   1.6   1.9   1.4   1.5   1.6   5.1    15  ...   3.9   
13     2     2     2   1.9   1.9   1.8   1.9   1.9   2.1     2  ...   1.9   
14    37    80    57    76   110   106   101   104   124    46  ...   307   

      15    16    17    18    19    20    21    22    23  
0     22    21    19    17    16    15    15    15    15  
1    1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8   1.8  
2   0.37  0.47  0.69  0.56  0.45  0.38  0.35  0.36  0.32  
3   0.13  0.14  0.23  0.18  0.12   0.1  0.09   0.1  0.08  
4    2.2   2.5   2.3   2.1   1.9   1.5   1.6   1.8   1.5  
5     11    22    28    19    12   8.1     7   6.9     6  
6     13    25    30    21    13   9.7   8.6   8.7   7.5  
7     64    51    34    33    34    37    38    38    36  
8     51    66    85    85    63    46    36    42    42  
9     45    42    49    45    44    41    30    24    13  
10     0     0     0     0     0     0     0     0     0  
11    49    56    67    72    69    70    70    70    69  
12   4.4   9.9   5.1   3.4   2.3     2   1.9   1.9   1.9  
13   1.9   1.9   2.1     2   1.9   1.9   1.9   1.9   1.9  
14   304   307   124   118   121   113   112   106   110  

[15 rows x 24 columns]
(4320, 24)
# 4320行 = 12个月*每个月20天*每天监测18种  24列 = 24小时
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])  # [18, 480]  480 = 20*24
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = numpy_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]    # 18, 24
    month_data[month] = sample
print(month_data[0])       #查看第一月的数据
print(month_data[0].shape)
[[14.   14.   14.   ... 14.   13.   13.  ]
 [ 1.8   1.8   1.8  ...  1.8   1.8   1.8 ]
 [ 0.51  0.41  0.39 ...  0.34  0.41  0.43]
 ...
 [35.   79.    2.4  ... 48.   63.   53.  ]
 [ 1.4   1.8   1.   ...  1.1   1.9   1.9 ]
 [ 0.5   0.9   0.6  ...  1.2   1.2   1.3 ]]
(18, 480)

1.3 特征提取

每个月480小时,9小时一个data,一个月有480 - 9 = 471个data。一年有471 * 12个,每个data有9 * 18个特征

# 数据
x=np.empty([12*471,18*9],dtype=float)
# pm2.5
y=np.empty([12*471,1],dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            # 如果是最后一天,最后一个包结束,则返回
            if day==19 and hour>14:
                continue
            # 每个小时的18项数据
            x[month*471+day*24+hour,:]=month_data[month][:,day*24+hour:day*24+hour+9].reshape(1,-1)
            # pm值
            y[month*471+day*24+hour,0]=month_data[month][9,day*24+hour+9]
print(x)
print(y)
[[14.  14.  14.  ...  2.   2.   0.5]
 [14.  14.  13.  ...  2.   0.5  0.3]
 [14.  13.  12.  ...  0.5  0.3  0.8]
 ...
 [17.  18.  19.  ...  1.1  1.4  1.3]
 [18.  19.  18.  ...  1.4  1.3  1.6]
 [19.  18.  17.  ...  1.3  1.6  1.8]]
[[30.]
 [41.]
 [44.]
 ...
 [17.]
 [24.]
 [29.]]

1.4 归一化

mean_x = np.mean(x, axis = 0) #18 * 9 平均值
std_x = np.std(x, axis = 0) #18 * 9 方差
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9 
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
x.shape    # (12*471,18*9)
(5652, 162)

1.5 训练集和验证集的划分

import math
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8): , :]
y_validation = y[math.floor(len(y) * 0.8): , :]
print('x_train:', x_train_set)
print('y_train:', y_train_set)
print('x_validation:', x_validation)
print('y_validation:', y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))
x_train: [[-1.35825331 -1.35883937 -1.359222   ...  0.26650729  0.2656797
  -1.14082131]
 [-1.35825331 -1.35883937 -1.51819928 ...  0.26650729 -1.13963133
  -1.32832904]
 [-1.35825331 -1.51789368 -1.67717656 ... -1.13923451 -1.32700613
  -0.85955971]
 ...
 [ 0.86929969  0.70886668  0.38952809 ...  1.39110073  0.2656797
  -0.39079039]
 [ 0.71018876  0.39075806  0.07157353 ...  0.26650729 -0.39013211
  -0.39079039]
 [ 0.3919669   0.07264944  0.07157353 ... -0.38950555 -0.39013211
  -0.85955971]]
y_train: [[30.]
 [41.]
 [44.]
 ...
 [ 7.]
 [ 5.]
 [14.]]
x_validation: [[ 0.07374504  0.07264944  0.07157353 ... -0.38950555 -0.85856912
  -0.57829812]
 [ 0.07374504  0.07264944  0.23055081 ... -0.85808615 -0.57750692
   0.54674825]
 [ 0.07374504  0.23170375  0.23055081 ... -0.57693779  0.54674191
  -0.1095288 ]
 ...
 [-0.88092053 -0.72262212 -0.56433559 ... -0.57693779 -0.29644471
  -0.39079039]
 [-0.7218096  -0.56356781 -0.72331287 ... -0.29578943 -0.39013211
  -0.1095288 ]
 [-0.56269867 -0.72262212 -0.88229015 ... -0.38950555 -0.10906991
   0.07797893]]
y_validation: [[13.]
 [24.]
 [22.]
 ...
 [17.]
 [24.]
 [29.]]
4521
4521
1131
1131

2.训练

# 线性拟合 y = w1*x1 + w2*x2 + ... + w152*x152 + w153   w153为常数项
dim = 18 * 9 + 1           #153个权重系数
w = np.zeros([dim, 1])
x = np.concatenate((np.ones([12 * 471, 1]), x), axis = 1).astype(float)  #数组拼接 (5652, 163)
learning_rate = 100
iter_time = 2000
adagrad = np.zeros([dim, 1])
eps = 0.0000000001        # 防止被除数为0
Loss = []
for t in range(iter_time):
    # 平方差
    loss = np.sqrt(np.sum(np.power(np.dot(x, w) - y, 2))/471/12)#rmse
    Loss.append(loss)
    # 每100次输出一次
    if(t%100==0):
        print(str(t) + ":" + str(loss))
    # 梯度
    gradient = 2 * np.dot(x.transpose(), np.dot(x, w) - y) #dim*1 transpose()调换数组的行列索引 类似于求矩阵转置
    adagrad += gradient ** 2
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)
np.save('weight.npy', w)
0:27.071214829194115
100:33.78905859777454
200:19.913751298197095
300:13.531068193689693
400:10.645466158446172
500:9.277353455475065
600:8.518042045956502
700:8.014061987588425
800:7.636756824775692
900:7.336563740371125
1000:7.090968643947219
1100:6.8873114803241
1200:6.717116295730694
1300:6.574102121171868
1400:6.453381172520268
1500:6.351062466046933
1600:6.26401228776637
1700:6.189688383453206
1800:6.126016764546804
1900:6.071297135545988
import matplotlib.pyplot as plt
%matplotlib inline   
plt.plot(Loss)
plt.xlabel('Iter_time')
plt.ylabel('Loss')
Text(0,0.5,'Loss')

在这里插入图片描述

3 测试

test_data = pd.read_csv('work/hw1_data/test.csv',header=None, encoding="big5")

test_data.head(20)
012345678910
0id_0AMB_TEMP212120201919191817
1id_0CH41.71.71.71.71.71.71.71.71.8
2id_0CO0.390.360.360.40.530.550.340.310.23
3id_0NMHC0.160.240.220.270.270.260.270.290.1
4id_0NO1.31.31.31.31.41.61.21.10.9
5id_0NO21714131418218.99.45
6id_0NOx18161415202310105.8
7id_0O3323131261612272026
8id_0PM10625044393832483625
9id_0PM2.53339392518181794
10id_0RAINFALLNRNRNRNRNRNRNRNRNR
11id_0RH838587878685788180
12id_0SO221.81.81.82.12.622.32.4
13id_0THC1.81.91.9222221.9
14id_0WD_HR5853675959737982104
15id_0WIND_DIREC574473445611545107103
16id_0WIND_SPEED1.41.31.51.41.61.61.21.82.3
17id_0WS_HR10.90.90.91.20.710.61.8
18id_1AMB_TEMP141313131313131213
19id_1CH41.81.81.81.81.81.81.71.71.8
test_data=test_data.iloc[:, 2:]
test_data[test_data=='NR'] = 0
print(test_data.head(20))
print(test_data.shape)
      2     3     4     5     6     7     8     9     10
0     21    21    20    20    19    19    19    18    17
1    1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.7   1.8
2   0.39  0.36  0.36   0.4  0.53  0.55  0.34  0.31  0.23
3   0.16  0.24  0.22  0.27  0.27  0.26  0.27  0.29   0.1
4    1.3   1.3   1.3   1.3   1.4   1.6   1.2   1.1   0.9
5     17    14    13    14    18    21   8.9   9.4     5
6     18    16    14    15    20    23    10    10   5.8
7     32    31    31    26    16    12    27    20    26
8     62    50    44    39    38    32    48    36    25
9     33    39    39    25    18    18    17     9     4
10     0     0     0     0     0     0     0     0     0
11    83    85    87    87    86    85    78    81    80
12     2   1.8   1.8   1.8   2.1   2.6     2   2.3   2.4
13   1.8   1.9   1.9     2     2     2     2     2   1.9
14    58    53    67    59    59    73    79    82   104
15    57    44    73    44    56   115    45   107   103
16   1.4   1.3   1.5   1.4   1.6   1.6   1.2   1.8   2.3
17     1   0.9   0.9   0.9   1.2   0.7     1   0.6   1.8
18    14    13    13    13    13    13    13    12    13
19   1.8   1.8   1.8   1.8   1.8   1.8   1.7   1.7   1.8
(4320, 9)
test_data=test_data.to_numpy()
# 240个记录,18*9
test_x=np.empty([240,18*9], dtype=float)
for i in range(240):
    test_x[i, :] = test_data[18 * i: 18* (i + 1), :].reshape(1, -1)   # reshape(1,-1)将数据转换为一行 列数自动求解
# 归一化
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis = 1).astype(float)
test_x

array([[ 1.        , -0.24447681, -0.24545919, ..., -0.67065391,
        -1.04594393,  0.07797893],
       [ 1.        , -1.35825331, -1.51789368, ...,  0.17279117,
        -0.10906991, -0.48454426],
       [ 1.        ,  1.5057434 ,  1.34508393, ..., -1.32666675,
        -1.04594393, -0.57829812],
       ...,
       [ 1.        ,  0.3919669 ,  0.54981237, ...,  0.26650729,
        -0.20275731,  1.20302531],
       [ 1.        , -1.8355861 , -1.8360023 , ..., -1.04551839,
        -1.13963133, -1.14082131],
       [ 1.        , -1.35825331, -1.35883937, ...,  2.98427476,
         3.26367657,  1.76554849]])

4.预测

w=np.load("weight.npy")
ans_y=np.dot(test_x,w)
ans_y

5.保存

import csv
with open("submit.csv", mode="w",newline='') as submit_file:
    csv_writer=csv.writer(submit_file)
    header=['id','value']
    csv_writer.writerow(header)
    for i in range(240):
        row=["id_" +str (i), ans_y[i][0]]
        csv_writer.writerow(row)
        print(row)
  • 2
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 1
    评论
评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值