Homework 1: Predicting PM2.5
In this assignment we use gradient descent to predict PM2.5 values.
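hw1 requirements:
1. Python 3.5+ is required.
2. Only (1) numpy, (2) scipy, and (3) pandas may be used.
3. Implement linear regression by hand with gradient descent.
4. Beat the public simple baseline.
5. For those who want to load a model rather than run the whole training process:
upload your training code and name it train.py;
it is enough for train.py to contain the gradient descent code.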
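A minimal sketch of what requirement 3 asks for: linear regression fitted by batch gradient descent using only numpy. The function name `fit_linear_gd`, its parameters, and the synthetic data are illustrative assumptions, not part of the assignment's sample code.

```python
import numpy as np

def fit_linear_gd(X, y, learn_rate=0.1, n_iter=500):
    """Fit y ~ X @ w[1:] + w[0] by batch gradient descent.

    Hypothetical helper for illustration; X has shape (n, d), y has shape (n,).
    The cost being minimized is (1 / 2n) * sum(residual**2).
    """
    n, d = X.shape
    Xb = np.hstack([np.ones((n, 1)), X])  # prepend a bias column of ones
    w = np.zeros(d + 1)
    for _ in range(n_iter):
        residual = Xb @ w - y             # prediction minus target, shape (n,)
        grad = Xb.T @ residual / n        # gradient of the cost with respect to w
        w -= learn_rate * grad            # step against the gradient
    return w

# Synthetic, noise-free data with known coefficients (made up for the demo).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
true_w = np.array([0.5, 2.0, -1.0, 0.3])  # [bias, w1, w2, w3]
y = true_w[0] + X @ true_w[1:]

w = fit_linear_gd(X, y)
```

On this toy problem `w` should end up close to `true_w`; on real data the learning rate and iteration count usually need tuning, and standardizing the features first tends to make a fixed learning rate behave better.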
hw_best requirements:
1. Python 3.5+ is required.
2. Any library may be used.
3. On Kaggle, reach the higher score you choose.
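Data description:
This assignment uses observation records from the Fengyuan (豐原) station, split into a train set and a test set. The train set is all data from the first 20 days of each month at the station; the test set is sampled from the station's remaining data.
train.csv: hourly meteorological data for the first 20 days of each month (18 measurements per hour), covering 12 months.
test.csv: consecutive 10-hour windows are sampled from the remaining data, one window per record. All observations from the first nine hours serve as the features, and the PM2.5 value of the tenth hour is the answer. There are 240 distinct test records in total; predict the PM2.5 of these 240 records from the features.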
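The windowing described for test.csv can be mirrored on the training series. A minimal sketch, assuming the PM2.5 readings are already available as a one-dimensional hourly array; the helper name `make_windows` is hypothetical. It slides over the whole array, so windows can span the gap between two months unless each month is processed separately.

```python
import numpy as np

def make_windows(series, n_features=9):
    """Split a 1-D hourly series into (features, answer) pairs.

    Each row of X holds n_features consecutive hours; y holds the hour
    that immediately follows each row. Hypothetical helper for illustration.
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - n_features          # number of complete windows
    X = np.stack([series[j:j + n_features] for j in range(n)])
    y = series[n_features:]
    return X, y

# Toy series standing in for PM2.5 readings (values are made up).
hours = np.arange(12)
X, y = make_windows(hours)
```

With 12 toy readings there are three complete windows, so `X` has shape (3, 9) and `y` holds the readings at positions 9, 10, and 11.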
After finishing, consult the following materials:
Sample_code:https://ntumlta.github.io/2017fall-ml-hw1/code.html
Supplementary_Slide:https://docs.google.com/presentation/d/1WwIQAVI0RRA6tpcieynPVoYDuMmuVKGvVNF_DSKIiDI/edit#slide=id.g1ef6d808f1_2_0
For the answers, see answer.csv.
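Predictions can be scored against the reference values with a mean squared error. A minimal sketch, assuming the predictions and the answers are arrays with the same number of elements; the function name `mean_squared_error` and the numbers are illustrative and are not taken from answer.csv.

```python
import numpy as np

def mean_squared_error(pred, target):
    """Average of the squared differences between pred and target."""
    pred = np.asarray(pred, dtype=float).ravel()
    target = np.asarray(target, dtype=float).ravel()
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same number of elements")
    return float(np.mean((pred - target) ** 2))

# Made-up values: squared errors are 4, 0 and 9, so the mean is 13 / 3.
mse = mean_squared_error([10.0, 20.0, 30.0], [12.0, 20.0, 27.0])
rmse = mse ** 0.5  # root mean squared error, in the same units as the target
```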
#coding=utf-8
#Version:python3.6.0
#Tools:Pycharm 2017.3.2
__date__ = '2019/5/22 13:51'
__author__ = 'ranchunfu'
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
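# Load the data; the filters below keep only the PM2.5 rows
train = pd.read_csv("./Dataset/train.csv")
test = pd.read_csv("./Dataset/test(1).csv")
train = train[train['observation'] == 'PM2.5']
# In the test table the measurement names sit in the column pandas labels 'AMB_TEMP'
test = test[test['AMB_TEMP'] == 'PM2.5']
# Drop the leading non-numeric columns so only the hourly readings remain
train = train.iloc[:,3:]
test = test.iloc[:,2:]
train = np.array(train, dtype = 'float32')
test = np.array(test, dtype = 'float32')
# Flatten the training readings into one row: a single hourly PM2.5 series
train = train.reshape(1,train.shape[0]*train.shape[1])
PM = train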
# Standardize the data (following 追风者)
PM_mean = int(PM.mean())
PM_theta = int(PM.var()**0.5)
PM = (PM - PM_mean) / PM_theta
np.random.seed(3)
W = np.random.randn(1,10) * 0.01
# b = np.zeros((1,1))
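# Forward pass and gradient descent
costs = []
learn_rate = 0.1
m = PM.shape[1] - 9  # number of 9-hour windows that have a following target hour
for i in range(150):
    cost = 0
    grad = 0
    for j in range(m):
        # Features: bias term 1 followed by 9 consecutive hourly PM2.5 values
        x = np.array(PM[:,j:j+9])
        x = np.insert(x,0,1).reshape(10,1)
        # Residual between the 10th hour and the linear prediction
        error = PM[:,j+9] - np.dot(W,x)
        cost += (error**2).item()
        grad += error * x.T
    cost = cost / (2*m)
    costs.append(cost)
    dW = grad/m
    if i % 10 == 0:
        print(cost)
    # dW is the negative gradient of the cost, so adding it moves downhill
    W = W + learn_rate*dW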
plt.plot(costs)
plt.xlabel("iteration")
plt.ylabel("cost")
plt.title("learning rate = 0.1")
plt.show()
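# Prepare the test data
test = pd.read_csv("./Dataset/test(1).csv")
test = test[test['AMB_TEMP'] == 'PM2.5']
test = np.array(test.iloc[:,2:], dtype = 'float32')  # one row per record, 9 hourly values
# W was trained on standardized values, so standardize the test features the same way
test = (test - PM_mean) / PM_theta
test = np.insert(test, 0, 1, axis=1).T  # prepend the bias term, then put records in columns
# Forward pass, then map the predictions back to the original PM2.5 scale
test_pred = np.dot(W,test) * PM_theta + PM_mean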
np.set_printoptions(precision=3)
np.set_printoptions(suppress=True)
# print(test_pred)
answer = pd.read_csv("answer.csv")
answer = answer["value"].values
answer = answer.reshape(1,240)
# Mean squared error over the 240 test records
print(np.sum((test_pred - answer)**2)/240)