Machine Learning

1. Homework 1: Predicting PM2.5

References: https://blog.csdn.net/weixin_42447868/article/details/105261672?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-2.channel_param

https://www.cnblogs.com/HL-space/p/10676637.html

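Task (homework requirement): given 18 air-quality features (e.g. NO, CO, SO2, PM2.5) over 9 consecutive hours, predict the PM2.5 value of the 10th hour. train.csv holds one year of data: 20 days from each month, 24 hours per day.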
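To make the task concrete, here is a minimal sketch of how one month of data can be cut into 9-hour windows. It is an illustration only, not code from the referenced blogs: `make_windows` and the toy array are hypothetical names, and the toy values are just counters. It assumes the layout used in this post, i.e. one month stored as 18 features by 480 hours with PM2.5 in row 9.

```python
import numpy as np

def make_windows(month, window=9, target_row=9):
    """Slide a `window`-hour frame over one month of data.

    month: array of shape (n_features, n_hours).
    Returns x with shape (n_hours - window, n_features * window)
    and y with shape (n_hours - window, 1).
    """
    n_features, n_hours = month.shape
    n_samples = n_hours - window
    x = np.empty((n_samples, n_features * window))
    y = np.empty((n_samples, 1))
    for start in range(n_samples):
        # Flatten the 18 x 9 window row by row into one 162-value sample.
        x[start] = month[:, start:start + window].reshape(-1)
        # The label is the target feature in the hour right after the window.
        y[start, 0] = month[target_row, start + window]
    return x, y

# Toy month with the same layout as the homework: 18 features x 480 hours.
toy_month = np.arange(18 * 480, dtype=float).reshape(18, 480)
x_toy, y_toy = make_windows(toy_month)
print(x_toy.shape, y_toy.shape)
```

Each month yields 480 - 9 = 471 samples, which is where the 12*471 rows in the code further down come from, and why that code skips hours after 14 on the last day of a month.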

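The source code below was copied from the blogs above. There are parts of it I don't really understand, and I would appreciate it if someone could explain them. Thanks in advance.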

import sys
import pandas as pd
import numpy as np
import math

# Load the data (the first argument is the file path; encoding='big5' decodes the Big5-encoded text in the file)
data = pd.read_csv('F:/data/hw1/train.csv', encoding='big5')

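# Drop the first three columns and keep everything from the fourth column on
# iloc indexes by position, whereas loc indexes by label
data = data.iloc[:, 3:]  # all rows, columns from the fourth onward
data[data == 'NR'] = 0  # this row records rainfall; 'NR' means no rain, so fill it with 0 to keep every value numeric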
raw_data = data.to_numpy()
print(raw_data)

# Regroup the data: the 4320*24 array (12 months * 20 days * 18 features, by 24 hours) becomes 12 arrays of 18*480
month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24: (day + 1) * 24] = raw_data[18 * (20 * month + day): 18 * (20 * month + day + 1), :]
    month_data[month] = sample

x = np.empty([12*471, 18*9], dtype=float)
y = np.empty([12*471, 1], dtype=float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:, day * 24 + hour: day * 24 + hour + 9].reshape(1, -1)
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9]
print(x)
print(y)

# Normalize each feature (subtract the mean, divide by the standard deviation)
mean_x = np.mean(x, axis=0)
std_x = np.std(x,axis=0)
for i in range(len(x)):
    for j in range(len(x[0])):
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]

# Split the training data into a training set and a validation set, used at the end to check the model
x_train_set = x[: math.floor(len(x) * 0.8), :]
y_train_set = y[: math.floor(len(y) * 0.8), :]
x_validation = x[math.floor(len(x) * 0.8):, :]
y_validation = y[math.floor(len(y) * 0.8):, :]
print(x_train_set)
print(y_train_set)
print(x_validation)
print(y_validation)
print(len(x_train_set))
print(len(y_train_set))
print(len(x_validation))
print(len(y_validation))

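# ---- Start of the part I don't understand ----
# A bias term is included, so dim is 18 * 9 + 1
dim = 18 * 9 + 1
# w has shape 163*1
w = np.zeros([dim, 1])
# Prepend a column of ones for the bias; x_train_set then has shape 4521*163
x_train_set = np.concatenate((np.ones([len(x_train_set), 1]), x_train_set), axis=1).astype(float)
# Learning rate
learning_rate = 10
# Number of iterations
iter_time = 30000
# Adagrad accumulator: running sum of squared gradients, one entry per weight
adagrad = np.zeros([dim, 1])
eps = 0.0000000001  # small constant that keeps the division below away from zero
# ---- End of the part I don't understand ----
# beta = 0.9
# Training loop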
for t in range(iter_time):
    # RMSE on the training set
    loss = np.sqrt(np.sum(np.power(np.dot(x_train_set, w)-y_train_set, 2))/len(x_train_set))
    if t % 100 == 0:
        print("iteration: %i, loss: %f" % (t, loss))
    # gradient = 2*np.dot(x.transpose(),np.dot(x,w)-y)
    # Gradient of the RMSE loss with respect to w (computed on every iteration, not only when logging)
    gradient = (np.dot(x_train_set.transpose(), np.dot(x_train_set, w)-y_train_set))/(loss*len(x_train_set))
    adagrad += (gradient ** 2)
    # Update w with a per-weight Adagrad step
    w = w - learning_rate * gradient / np.sqrt(adagrad + eps)

# Save the weights
np.save('weight.npy', w)

testdata = pd.read_csv('F:/data/hw1/test.csv', header=None, encoding='big5')
test_data = testdata.iloc[:, 2:]
test_data[test_data == 'NR'] = 0
test_data = test_data.to_numpy()
test_x = np.empty([240, 18 * 9], dtype=float)
for i in range(240):
    test_x[i, :] = test_data[18 * i:18 * (i + 1), :].reshape(1, -1)
for i in range(len(test_x)):
    for j in range(len(test_x[0])):
        if std_x[j] != 0:
            test_x[i][j] = (test_x[i][j] - mean_x[j]) / std_x[j]
test_x = np.concatenate((np.ones([240, 1]), test_x), axis=1).astype(float)
print(test_x)

# Evaluate on the validation set
w = np.load('weight.npy')
x_validation = np.concatenate((np.ones([len(x_validation), 1]), x_validation), axis=1).astype(float)
# RMSE over the whole validation set; a single vectorized computation is enough, no loop needed
Loss = np.sqrt(np.sum(np.power(np.dot(x_validation, w)-y_validation, 2))/len(x_validation))

print("the Loss on val data is %f" % ( Loss ))
# Predict
ans_y = np.dot(test_x, w)
print('Predicted PM2.5 values')
print(ans_y)
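
As a possible explanation of the setup before the training loop (the extra dimension for the bias, the `adagrad` array and `eps`), here is a small self-contained sketch on synthetic data. It is not taken from the referenced blogs: the data, `true_w`, `grad_sq_sum` and the hyperparameters are assumptions chosen for illustration. It also differentiates the mean squared error rather than the RMSE used in the post; the two gradients differ only by a positive scalar factor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the homework data (an assumption for illustration):
# 200 samples, 3 features, targets generated by a known linear rule.
features = rng.normal(size=(200, 3))
true_w = np.array([[4.0], [1.5], [-2.0], [0.5]])  # first entry is the bias

# Prepending a column of ones lets w[0] act as the bias term,
# which is why the post uses dim = 18 * 9 + 1.
x_demo = np.concatenate((np.ones((200, 1)), features), axis=1)
y_demo = x_demo @ true_w

w_demo = np.zeros((4, 1))
grad_sq_sum = np.zeros((4, 1))  # Adagrad state: running sum of squared gradients
learning_rate = 1.0
eps = 1e-10  # avoids dividing by zero while grad_sq_sum is still tiny

for t in range(2000):
    residual = x_demo @ w_demo - y_demo
    # Gradient of the mean squared error with respect to w.
    gradient = 2 * x_demo.T @ residual / len(x_demo)
    grad_sq_sum += gradient ** 2
    # Each weight gets its own step size: large past gradients shrink the step.
    w_demo -= learning_rate * gradient / np.sqrt(grad_sq_sum + eps)

rmse = np.sqrt(np.mean((x_demo @ w_demo - y_demo) ** 2))
print(w_demo.ravel(), rmse)
```

Because every step is divided by the square root of the accumulated squared gradients, the first update of each weight has magnitude close to `learning_rate` regardless of the gradient's scale. That is presumably why the post can use a learning rate as large as 10 without diverging.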