Hung-yi Lee's ML HW1: PM2.5 Prediction (and what NR means in the rainfall data)

Article link: https://blog.csdn.net/Moelimoe/article/details/104750201

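Pre-processing the data
Inspecting the downloaded data file train.csv
train.csv contains only the first 20 days of each month. The remaining 10 days of each month are held out for grading the assignment and are not visible to students.
The rainfall entries contain many NR values, which mean "no rain". Since we need numbers only, NR can be replaced with 0 (for example with Excel's find-and-replace) before any further processing.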
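The same replacement can also be done in code rather than in Excel. The following is a minimal sketch with pandas; the tiny `toy` table and its three hourly columns are made up for illustration and only mimic the layout of train.csv.

```python
import pandas as pd

# Hypothetical miniature of train.csv: one RAINFALL row and one PM2.5 row,
# with three hourly columns instead of 24.
toy = pd.DataFrame(
    {
        "item": ["RAINFALL", "PM2.5"],
        "0": ["NR", "26"],
        "1": ["0.2", "39"],
        "2": ["NR", "36"],
    }
)

hours = ["0", "1", "2"]
# Swap the "NR" marker for "0", then convert the hourly columns to floats.
toy[hours] = toy[hours].replace("NR", "0").astype(float)
print(toy)
```

Doing the replacement in code keeps the raw file untouched and makes the step repeatable.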
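Assignment requirement: use the first 9 hours as the input features and the PM2.5 value of the 10th hour as the ground truth. The data has 18 features in total (CH4, CO, CO2, NO, and so on), but here we use only PM2.5 itself, the feature most correlated with the target. If you know well which factors influence PM2.5, you can also select from the other 17 features.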
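One simple way to compare candidate features is to rank them by their Pearson correlation with the target. The sketch below uses synthetic data only; the three candidate names are hypothetical stand-ins, not columns from train.csv.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (NOT real measurements): a target and three candidates.
target = rng.normal(size=n)
candidates = {
    "strongly_related": target + 0.1 * rng.normal(size=n),
    "weakly_related": target + 2.0 * rng.normal(size=n),
    "unrelated": rng.normal(size=n),
}

# Pearson correlation of each candidate with the target.
corr = {name: np.corrcoef(values, target)[0, 1] for name, values in candidates.items()}

# Rank by absolute correlation, strongest first.
ranked = sorted(corr, key=lambda name: abs(corr[name]), reverse=True)
for name in ranked:
    print(f"{name}: {corr[name]:.3f}")
```

Correlation only captures linear relationships, so it is a first filter rather than a complete feature-selection method.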

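The first sample uses PM2.5 at hours 0 to 8 as the feature vector and PM2.5 at hour 9 as its label. The second uses hours 1 to 9 as features and hour 10 as the label, and so on, until the features cover hours 14 to 22 and the label is hour 23. That gives 15 samples per day, so over 240 days (12 months × 20 days) features is a (3600, 9) matrix and labels is a (3600, 1) matrix.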
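The windowing described above can be checked on a single day. This is a small sketch, assuming NumPy 1.20 or later for `sliding_window_view`; the `day` array is a hypothetical series whose values equal their hour index so the windows are easy to read.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Hypothetical PM2.5 readings for one day: hour h holds the value h.
day = np.arange(24)

windows = sliding_window_view(day, 10)  # shape (15, 10): 10 consecutive hours per row
features = windows[:, :9]               # first 9 hours of each window
labels = windows[:, 9:]                 # the 10th hour, kept as a column

print(features.shape, labels.shape)
print(features[0], labels[0])    # hours 0..8 with label hour 9
print(features[-1], labels[-1])  # hours 14..22 with label hour 23
print(240 * features.shape[0])   # samples across 240 days
```

Each day yields 15 windows, and 240 days × 15 windows gives the 3600 samples mentioned above.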
Predicting PM2.5 with gradient descent

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

np.set_printoptions(precision=3)  # display precision for arrays; affects printing only, not the stored values
np.set_printoptions(suppress=True)  # suppress scientific notation for small numbers

df = pd.read_csv("HW1_data_1.csv")


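# Data processing: the dataset has 18 features; this run uses only PM2.5 for training and prediction
def dataprocess():
    # feature:
    # all rows, columns "0".."22"; each 9-hour window starts at hour 0..14 and ends at hour 8..22
    feature_data = np.array(df.loc[:, "0":"22"]).reshape([240, 18, 23])
    feature = np.zeros((3600, 18, 9))  # preallocated feature container
    for i in range(0, 15):
        # rows 240*i .. 240*(i+1)-1 hold window i for all 240 days (the slice end is exclusive)
        feature[240 * i:240 * (i + 1)] = feature_data[:, :, i:9 + i]
    feature = feature[:, 9, :]  # keep row index 9 of every sample, i.e. PM2.5 (240 x 15 samples, 9 values each)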

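    # label:
    # Data for hours 9..23. df.loc cannot slice this finely, so convert to an array first.
    # [:, 9] keeps row index 9 (PM2.5, the 10th of the 18 features), giving shape (240, 15).
    label_data = np.array(df.loc[:, "9":"23"]).reshape([240, 18, 15])[:, 9]
    # feature rows are ordered window-first (row 240 * i + day), so transpose to (15, 240)
    # before flattening; a plain reshape would order the labels day-first and misalign them.
    label = label_data.T.reshape(3600, 1)
    return feature, label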



class Regression:


    def gradientdescent(self, x, y, epoch=1000, l=10, reg_rate=0.001):
        '''Initialize 3600 biases and 9 weights; epoch (number of iterations) and l (learning rate)
        are tuned by hand; reg_rate is the regularization coefficient (not used below).'''
        n = len(x)  # number of samples, n=3600
        weights = np.random.randn(x.shape[1], 1)  # x.shape[1] rows, 1 column, drawn from a normal distribution with mean 0 and variance 1
        biases = np.random.randn(y.shape[0], 1)  # biases are initialized the same way
        # biases = np.zeros((y.shape[0], 1))  # alternative: all biases identical

        # Fold x*weights+biases into X_new*theta so both are updated together and the cost is easier to differentiate
        X_new = np.ones((x.shape[0], x.shape[1] + 1))  # X_new has one more column than x: the first column is all ones
        X_new[:, 1:] = x  # (3600, 10); apart from the leading column of ones, identical to x
        theta = np.full((weights.shape[0] + 1, weights.shape[1]), biases[0])  # one more row than weights; only biases[0] is used
        theta[1:, ] = weights  # row 0 stays the bias, the remaining 9 rows are the weights
        grad_sum = np.zeros(theta.shape)
        # print("weights:", weights.shape, "biases:", biases.shape)  # weights: (9, 1) biases: (3600, 1)
        # print("X_new:", X_new.shape, "theta:", theta.shape)  # X_new:(3600, 10)  theta: (10, 1)
        # print(f"theta{theta}, and weights{weights}{theta[1:]==weights}")

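        for i in range(epoch):
            # Step 1: y = w1*x1 + w2*x2 + ... + w9*x9 + b for all 3600 samples gives the current prediction, refined every epoch
            # y_hat = np.dot(x, weights) + biases   # x first, then weights, so the shapes line up
            y_hat1 = np.dot(X_new, theta)   # equivalent to xw + b
            loss = y_hat1 - y  # residual: prediction minus label

            # Adagrad update of theta (weights and bias)
            # grad is the current gradient; grad_sum accumulates the squares of all gradients so far
            grad = 2 * (X_new.transpose().dot(loss))
            grad_sum += grad ** 2
            theta -= l * grad / np.sqrt(grad_sum)

            # cost: the Euclidean (L2) norm of the residual scaled by 1/(2*n); note this is not the mean squared error
            cost = (1 / (2 * n)) * np.linalg.norm(loss)  # np.linalg.norm computes the Euclidean norm
            # cost = (1/(2*n))*np.sqrt(np.sum(np.square(y - y_hat)))  # the same norm written out by hand, scaled by 1/(2*n)
            if (i + 1) % 100 == 0:
                print(f"After {i + 1} epochs, cost = {cost}")
                print(f"After {i + 1} epochs, mean y_hat1 = {round(np.sum(y_hat1)/3600, 4)}, mean y = {round(np.sum(y)/3600, 4)}, "
                      f"mean residual = {np.sum(loss) / 3600}")
                print(f"After {i + 1} epochs, sum(grad)/3600 = {np.sum(grad)/3600}")
        return theta  # fitted parameters: theta[0] is the bias, theta[1:] are the weights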


DP = dataprocess()
R = Regression()
R.gradientdescent(DP[0], DP[1])
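Written as equations, the update inside the training loop appears to be the following. Here $X$ stands for X_new, $\eta$ for the learning rate l, and the square, square root, and division are all elementwise. The symbols $g_t$ and $G_t$ are introduced only for this note; they correspond to grad and grad_sum in the code.

```latex
\begin{aligned}
g_t &= 2\,X^{\top}\left(X\theta_t - y\right) \\
G_t &= \sum_{k=0}^{t} g_k \odot g_k \\
\theta_{t+1} &= \theta_t - \eta\,\frac{g_t}{\sqrt{G_t}}
\end{aligned}
```

Note that $g_t$ is the gradient of the squared error sum $\lVert X\theta - y\rVert_2^2$, while the printed cost is $\lVert X\theta - y\rVert_2 / (2n)$, so the cost serves as a progress indicator rather than the exact function being differentiated.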

Sample output (the weights are randomly initialized, so exact numbers vary between runs):

After 100 epochs, cost = 0.17832930592812254
After 200 epochs, cost = 0.1660618393695226
After 300 epochs, cost = 0.15927505672924216
After 400 epochs, cost = 0.15489108318605818
After 500 epochs, cost = 0.15184467687762343
After 600 epochs, cost = 0.14964991588379611
After 700 epochs, cost = 0.14804105402447518
After 800 epochs, cost = 0.14685216677724566
After 900 epochs, cost = 0.14597040063815991
After 1000 epochs, cost = 0.14531528068944696

After 100 epochs, mean y_hat1 = 17.8139, mean y = 24.0569, mean residual = -6.24304257009874
After 200 epochs, mean y_hat1 = 18.653, mean y = 24.0569, mean residual = -5.403915186992251
After 300 epochs, mean y_hat1 = 19.4186, mean y = 24.0569, mean residual = -4.638329558163197
After 400 epochs, mean y_hat1 = 20.0769, mean y = 24.0569, mean residual = -3.9800706681641165
After 500 epochs, mean y_hat1 = 20.6393, mean y = 24.0569, mean residual = -3.417681572977715
After 600 epochs, mean y_hat1 = 21.12, mean y = 24.0569, mean residual = -2.9369228516813766
After 700 epochs, mean y_hat1 = 21.5316, mean y = 24.0569, mean residual = -2.5253134626240024
After 800 epochs, mean y_hat1 = 21.8845, mean y = 24.0569, mean residual = -2.172397192822715
After 900 epochs, mean y_hat1 = 22.1875, mean y = 24.0569, mean residual = -1.8694547147842282
After 1000 epochs, mean y_hat1 = 22.4478, mean y = 24.0569, mean residual = -1.609182858159369

After 100 epochs, sum(grad)/3600 = 0.15936844318025478
After 200 epochs, sum(grad)/3600 = 0.05629405865738792
After 300 epochs, sum(grad)/3600 = 0.03547007720060113
After 400 epochs, sum(grad)/3600 = 0.025188272217026478
After 500 epochs, sum(grad)/3600 = 0.01861384588607305
After 600 epochs, sum(grad)/3600 = 0.014144000392948834
After 700 epochs, sum(grad)/3600 = 0.011006343985549796
After 800 epochs, sum(grad)/3600 = 0.008740552807814514
After 900 epochs, sum(grad)/3600 = 0.0070594976666507134
After 1000 epochs, sum(grad)/3600 = 0.005780481608253467
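The mean residual is still about -1.6 after 1000 epochs, which suggests the fit has not fully converged: a least-squares solution with a bias term always has a mean residual of zero. As a reference point, the same linear model can be solved in closed form with `np.linalg.lstsq`. The sketch below runs on synthetic data (the sample count, `true_w`, and the bias of 3.0 are assumptions for illustration), not on the PM2.5 data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (an assumption for illustration): 200 samples, 9 features,
# generated from known weights, a bias of 3.0, and a little noise.
x = rng.normal(size=(200, 9))
true_w = np.arange(1, 10, dtype=float).reshape(9, 1)
y = x @ true_w + 3.0 + 0.01 * rng.normal(size=(200, 1))

# Same trick as X_new in the post: a leading column of ones turns the bias into theta[0].
X_new = np.hstack([np.ones((x.shape[0], 1)), x])

# Closed-form least-squares solution of X_new @ theta ≈ y.
theta, *_ = np.linalg.lstsq(X_new, y, rcond=None)

residual = X_new @ theta - y
print("bias:", theta[0, 0])
print("weights:", theta[1:].ravel())
print("mean residual:", residual.mean())
```

Comparing the gradient descent parameters against such a closed-form solution on the real data would show how far the iterative fit still is from the optimum.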