李宏毅_Machine Learning_2019 Task 3

Study Check-in Content

Assignment

  • Complete this assignment following the requirements in Homework1_Introduction.txt
  • Homework 1: predict PM2.5 values
  • In this assignment we use gradient descent to predict PM2.5 values (a regression problem)
  • Homework 1 requirements:
    • Python 3.5+
    • Only the following libraries may be used:
      • numpy
      • scipy
      • pandas
    • Implement linear regression by hand with gradient descent
    • Ideally you should pass the Public Simple Baseline
    • For those who want to load a trained model instead of re-running the whole training process:
      • Upload your training code and name it train.py
      • Only the gradient-descent code is required
  • Homework_best requirements:
    • Python 3.5+
    • Any library may be used
    • Achieve a higher score of your choice on Kaggle
  • Data description:
    This assignment uses observation records from the 豐原 station, split into a train set and a test set. The train set contains all data from the first 20 days of each month; the test set is sampled from the station's remaining data.
    train.csv: hourly weather data for the first 20 days of each month (18 measurements per hour), covering all 12 months.
    test.csv: from the remaining data, consecutive 10-hour windows are sampled; all observations of the first nine hours serve as the features, and the PM2.5 of the tenth hour is the answer. A total of 240 non-overlapping test records are extracted; predict the PM2.5 of these 240 records from the features. A minimal sketch of this windowing scheme follows this list.
  • After finishing, refer to the following reference solutions:
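
Before the solutions, here is a minimal sketch of that windowing scheme (the function name, the random stand-in data, and the assumption that PM2.5 sits in row 9 of the alphabetically ordered measurements are all illustrative):

import numpy as np

def windows_for_day(day, pm25_row=9):
    """Yield (features, label) pairs: 9 consecutive hours of all 18
    measurements, plus the PM2.5 value of the following hour."""
    samples = []
    for i in range(15):                  # hours 0..14 can start a window
        x = day[:, i:i + 9]              # 18 x 9 feature block
        y = day[pm25_row, i + 9]         # PM2.5 at the 10th hour
        samples.append((x, y))
    return samples

day = np.random.rand(18, 24)             # stand-in for one day's real data
print(len(windows_for_day(day)))         # 15 windows per day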

Task 3 Implementation

Solution 1


'''
Predict PM2.5 with Linear Regression.
This approach is based on the excellent homework of 黑桃大哥 -|vv|-
'''

# Import the required packages: numpy, pandas, and scikit-learn's StandardScaler for preprocessing
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Relative path to the dataset
path = "./Dataset/"

# Read the files with pandas
train = pd.read_csv(path + 'train.csv', engine='python', encoding='utf-8')
test = pd.read_csv(path + 'test.csv', engine='python', encoding='gbk')
# Keep only the PM2.5 rows
train = train[train['observation'] == 'PM2.5']
# print(train)
# test.csv has no header row, so pandas promotes the first record to column
# names; its measurement column therefore reads 'AMB_TEMP'
test = test[test['AMB_TEMP'] == 'PM2.5']
# Drop features irrelevant to this model
train = train.drop(['Date', 'stations', 'observation'], axis=1)
test_x = test.iloc[:, 2:]

train_x = []
train_y = []

for i in range(15):
    x = train.iloc[:, i:i + 9]
    # notice: if we don't reset the column names here, each iteration carries
    # different column names and the concat below would misalign them
    x.columns = np.array(range(9))
    y = train.iloc[:, i + 9]   # a Series, so no column renaming is needed
    train_x.append(x)
    train_y.append(y)

# see "Python for Data Analysis" for the concat operation
# train_x and train_y are lists of DataFrames/Series at this point
# Extract the PM2.5 data: the train set covers 240 days, and each day yields
# 15 windows of 9 features plus 1 label, i.e. 240*15 samples of 9 features
train_x = pd.concat(train_x) # (3600, 9) DataFrame
train_y = pd.concat(train_y)

# Convert the string values into numpy ndarrays of floats
train_y = np.array(train_y, float)
test_x = np.array(test_x, float)
# print(train_x.shape, train_y.shape)

# Standard scaling, i.e. feature normalization
ss = StandardScaler()

# Fit the scaler on the training data
ss.fit(train_x)
# Transform the training data
train_x = ss.transform(train_x)

# Reuse the statistics fitted on the training data for the test data;
# fitting a separate scaler on the test set would leak test statistics
test_x = ss.transform(test_x)

# Evaluation helper
# Computes the mean squared error (MSE) first;
# R^2 measures the fraction of the target's variance that the features
# explain, and typically lies between 0 and 1
def r2_score(y_true, y_predict):
    # MSE between y_true and y_predict
    MSE = np.sum((y_true - y_predict) ** 2) / len(y_true)
    # R Square between y_true and y_predict
    return 1 - MSE / np.var(y_true)
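
A quick sanity check of r2_score on made-up numbers (the arrays are illustrative only): perfect predictions give an R^2 of 1, and always predicting the mean gives 0.

y_true_demo = np.array([1.0, 2.0, 3.0, 4.0])
print(r2_score(y_true_demo, y_true_demo))                     # 1.0
print(r2_score(y_true_demo, np.full(4, y_true_demo.mean())))  # 0.0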

# Linear regression implemented from scratch
class LinearRegression:

    def __init__(self):
        # Initialize the Linear Regression model
        self.coef_ = None
        self.intercept_ = None
        self._theta = None

    def fit_normal(self, X_train, y_train):
        # Fit the Linear Regression model on X_train, y_train via the normal equation
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        # Prepend a bias column to the training data
        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        self._theta = np.linalg.inv(X_b.T.dot(X_b)).dot(X_b.T).dot(y_train)

        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]

        return self

    def fit_gd(self, X_train, y_train, eta=0.01, n_iters=1e4):
        '''
        :param X_train: training set
        :param y_train: labels
        :param eta: learning rate
        :param n_iters: number of iterations
        :return: self (the fitted parameters are stored in self._theta)
        '''
        # Fit the Linear Regression model on X_train, y_train with gradient descent
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must be equal to the size of y_train"

        # Loss function (MSE)
        def J(theta, X_b, y):
            try:
                return np.sum((y - X_b.dot(theta)) ** 2) / len(y)
            except OverflowError:
                return float('inf')
        # Gradient of the loss function
        def dJ(theta, X_b, y):
            return X_b.T.dot(X_b.dot(theta) - y) * 2. / len(y)

        def gradient_descent(X_b, y, initial_theta, eta, n_iters=1e4, epsilon=1e-8):
            '''
            :param X_b: feature matrix with bias column
            :param y: labels
            :param initial_theta: initial parameters
            :param eta: step size (learning rate)
            :param n_iters: maximum number of iterations
            :param epsilon: convergence tolerance
            :return: theta, the fitted model parameters
            '''
            theta = initial_theta
            cur_iter = 0

            while cur_iter < n_iters:
                gradient = dJ(theta, X_b, y)
                last_theta = theta
                theta = theta - eta * gradient
                # Stop once the loss barely changes between iterations
                if (abs(J(theta, X_b, y) - J(last_theta, X_b, y)) < epsilon):
                    break
                cur_iter += 1
            return theta

        X_b = np.hstack([np.ones((len(X_train), 1)), X_train])
        initial_theta = np.zeros(X_b.shape[1]) # initialize theta to zeros
        self._theta = gradient_descent(X_b, y_train, initial_theta, eta, n_iters)
        self.intercept_ = self._theta[0]
        self.coef_ = self._theta[1:]
        return self

    def predict(self, X_predict):
        # Given a dataset X_predict, return the vector of predictions
        assert self.intercept_ is not None and self.coef_ is not None, \
            "must fit before predict!"
        assert X_predict.shape[1] == len(self.coef_), \
            "the feature number of X_predict must be equal to X_train"
        X_b = np.hstack([np.ones((len(X_predict), 1)), X_predict])
        return X_b.dot(self._theta)

    def score(self, X_test, y_test):
        # Score the current model on X_test and y_test with R^2
        y_predict = self.predict(X_test)
        return r2_score(y_test, y_predict)

    def __repr__(self):
        return "LinearRegression()"

# Train the model
LR = LinearRegression().fit_gd(train_x, train_y)
# Score it on the training data
LR.score(train_x, train_y)
# Predict
result = LR.predict(test_x)

# Save the results
sampleSubmission = pd.read_csv(path + 'sampleSubmission.csv', engine='python', encoding='gbk')
sampleSubmission['value'] = result
sampleSubmission.to_csv(path + 'result.csv', index=False)
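
Since the class also implements the closed-form normal equation, a quick cross-check (a sketch; it reuses the scaled train_x/train_y from above) can confirm that gradient descent converged to a similar solution:

LR_normal = LinearRegression().fit_normal(train_x, train_y)
print(LR.score(train_x, train_y), LR_normal.score(train_x, train_y))
# The two R^2 scores should agree closely if gradient descent converged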

Solution 2

#-*- coding:utf-8 -*-
# @File    : Predict_PM2dot5.py
# @Date    : 2019-05-19
# @Author  : 追风者
# @Software: PyCharm
# @Python Version: python 3.6

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Data loading and preprocessing
train_data = pd.read_csv("./Dataset/train.csv")
train_data.drop(['Date', 'stations'], axis=1, inplace=True)
column = train_data['observation'].unique()
# print(column)
# One column per measurement, 24 hours * 240 days = 5760 rows
new_train_data = pd.DataFrame(np.zeros([24*240, 18]), columns=column)

for i in column:
    train_data1 = train_data[train_data['observation'] == i]
    # Be careful with inplace=True, as it destroys any data that is dropped!
    train_data1.drop(['observation'], axis=1, inplace=True)
    train_data1 = np.array(train_data1)
    train_data1[train_data1 == 'NR'] = '0'   # 'NR' (no rain) becomes 0
    train_data1 = train_data1.astype('float')
    train_data1 = train_data1.reshape(1, 5760)
    train_data1 = train_data1.T
    new_train_data[i] = train_data1

# Label: PM2.5 from hour 9 onward (each label's features are the 9 preceding hours)
label = np.array(new_train_data['PM2.5'][9:], dtype='float32')

# Exploratory data analysis (EDA)
# The quickest approach is a correlation heatmap of all measurements
f, ax = plt.subplots(figsize=(9, 6))
sns.heatmap(new_train_data.corr(), linewidths=0.5, ax=ax)
plt.show()

# Model selection
# a. Normalize the data
# Use the previous nine hours of PM2.5 to predict the tenth hour,
# with a linear regression model
PM = new_train_data['PM2.5']
PM_mean = int(PM.mean())
PM_theta = int(PM.var()**0.5)   # (integer-truncated) standard deviation
PM = (PM - PM_mean) / PM_theta
w = np.random.rand(1, 10)       # 9 weights plus a bias term
theta = 0.1                     # learning rate
m = len(label)
for i in range(100):
    loss = 0
    gradient = 0
    for j in range(m):
        x = np.array(PM[j : j + 9])
        x = np.insert(x, 0, 1)             # prepend the bias input
        error = label[j] - np.matmul(w, x)
        loss += error**2
        gradient += error * x

    loss = loss/(2*m)
    print(loss)
    w = w+theta*gradient/m
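
The inner Python loop above touches one sample at a time, which is slow; a vectorized equivalent (a sketch, reusing the PM series, label array, m, and theta defined above) performs the same batch update with numpy:

# Build the design matrix once: row j is [1, PM[j], ..., PM[j+8]]
X = np.stack([np.insert(np.array(PM[j:j + 9]), 0, 1) for j in range(m)])
w_vec = np.random.rand(10)
for _ in range(100):
    error = label - X.dot(w_vec)            # all residuals at once
    print((error ** 2).sum() / (2 * m))     # same loss as the loop above
    w_vec += theta * X.T.dot(error) / m     # same batch gradient step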

[Figure: correlation heatmap]
From the heatmap it is clear that the measurements most strongly correlated with PM2.5 are PM10, NO2, SO2, NOX, O3, and THC.

# Loss values printed during training:
[301.76621466]
[228.34845921]
[189.62343004]
[159.73518205]
[135.63649698]
[116.10717244]
[100.25294032]
[87.36170872]
[76.8617656]
[68.2934691]
[61.28698881]
[55.54453984]
[50.82617376]
[46.93840439]
[43.72509587]
......
[22.17827769]
[22.1427191]
[22.10786632]
[22.07369828]
[22.04019474]
[22.00733629]
[21.97510423]
[21.94348063]
[21.91244821]
[21.88199035]
[21.85209108]
[21.822735]
[21.79390728]

Solution 3

Mind map: Task-Map

[Figure: PM2.5 Prediction mind map]

Model design
  • Data preprocessing

    Shape each training sample into an 18*9 matrix

#-*- coding:utf-8 -*-
# @File    : PM2.5Prediction.py
# @Date    : 2019-05-19
# @Author  : 追风者
# @Software: PyCharm
# @Python Version: python 3.6

'''
Predict PM2.5 with a linear regression model.

Feature selection via feature engineering and visual inspection of the data;
the selected features are then used to build the regression model.
'''
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

'''Data loading and preprocessing'''
# DataFrame
train_data = pd.read_csv("./Dataset/train.csv")
train_data.drop(['Date', 'stations', 'observation'], axis=1, inplace=True)

ItemNum=18
# training features
X_Train=[]
# training targets (PM2.5)
Y_Train=[]

for i in range(int(len(train_data)/ItemNum)):
    observation_data = train_data[i*ItemNum:(i+1)*ItemNum] # one day of observations
    for j in range(15):
        x = observation_data.iloc[:, j:j + 9]
        y = int(observation_data.iloc[9,j+9])  # row 9 is PM2.5
        # store the samples in X_Train and Y_Train
        X_Train.append(x)
        Y_Train.append(y)
# print(X_Train)
# print(Y_Train)
  • Data visualization

    Draw scatter plots to inspect each feature's relationship with PM2.5

'''Scatter plots'''
# Average each measurement over the nine hours of a window and plot the
# average against the PM2.5 label, four measurements at a time. The
# remaining groups (NO/NO2/NOX/O3, PM10/PM2.5/RAINFALL/RH,
# SO2/THC/WD_HR/WIND_DIREC, WIND_SPEED/WS_HR) are plotted the same way
# with the corresponding row indices (note that RAINFALL's 'NR' entries
# must be replaced with 0 before converting to float)
x_AMB=[]
x_CH4=[]
x_CO=[]
x_NMHC=[]
y=[]

for i in range(len(Y_Train)):
    y.append(Y_Train[i])
    x=X_Train[i]
    # average the first four measurements (rows 0-3: AMB_TEMP, CH4, CO, NMHC)
    x_AMB_sum=0
    x_CH4_sum=0
    x_CO_sum=0
    x_NMHC_sum=0
    for j in range(9):
        x_AMB_sum = x_AMB_sum + float(x.iloc[0,j])
        x_CH4_sum = x_CH4_sum + float(x.iloc[1, j])
        x_CO_sum = x_CO_sum + float(x.iloc[2, j])
        x_NMHC_sum = x_NMHC_sum + float(x.iloc[3, j])
    x_AMB.append(x_AMB_sum / 9)
    x_CH4.append(x_CH4_sum / 9)
    x_CO.append(x_CO_sum / 9)
    x_NMHC.append(x_NMHC_sum / 9)
plt.figure(figsize=(10, 6))
plt.subplot(2, 2, 1)
plt.title('AMB')
plt.scatter(x_AMB, y)
plt.subplot(2, 2, 2)
plt.title('CH4')
plt.scatter(x_CH4, y)
plt.subplot(2, 2, 3)
plt.title('CO')
plt.scatter(x_CO, y)
plt.subplot(2, 2, 4)
plt.title('NMHC')
plt.scatter(x_NMHC, y)
plt.show()

[Figures: Group 1-5 scatter plots of the averaged measurements against PM2.5]

  • Feature selection

    Choose the most representative features: PM10, PM2.5, and SO2

  • Model definition

    Build a linear regression model:
    $y = b + \sum_{i=1}^{27} w_i \times x_i$, equivalently $y = b + w_1 \times x_1 + w_2 \times x_2 + \cdots + w_{27} \times x_{27}$
    where $x_1$ through $x_9$ are the PM10 values of the previous nine hours, $x_{10}$ through $x_{18}$ the PM2.5 values, and $x_{19}$ through $x_{27}$ the SO2 values; the $w_i$ are the corresponding weights and $b$ is the bias

  • Loss function (for gradient descent)
    $\mathrm{Loss} = \frac{1}{2}\sum_{i=1}^{m}\left(y_i - y_{i,\text{real}}\right)^2$
    where $m$ is the number of samples used per parameter update, $y_i$ is the predicted value, and $y_{i,\text{real}}$ is the ground truth

    Mini-batch gradient descent is used with a batch size of 50, i.e. each update draws 50 random training samples (see the sketch after this list)

    Learning rates of 0.000000001, 0.0000001, and 0.000001 are compared to see how the learning rate affects how fast the loss converges

  • Model evaluation
    $\mathrm{Evaluation} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_{i,\text{real}}\right)^2$
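
A minimal mini-batch sketch of this training scheme (a sketch only: it reuses the X_Train/Y_Train lists from the preprocessing step, and the row indices for PM10, PM2.5, and SO2 assume the alphabetical measurement order):

import numpy as np

# Rows 8, 9, 12 of each 18x9 sample are PM10, PM2.5, SO2 (alphabetical order)
rows = [8, 9, 12]
X = np.array([np.asarray(x.iloc[rows, :], dtype=float).flatten() for x in X_Train])
Y = np.array(Y_Train, dtype=float)

w = np.zeros(27)                              # 3 measurements x 9 hours
b = 0.0
lr = 1e-7                                     # one of the compared learning rates
batch = 50                                    # batch size from the text above
for step in range(10000):
    idx = np.random.choice(len(Y), batch)     # draw 50 random samples
    err = Y[idx] - (X[idx].dot(w) + b)        # y_real - y_pred
    w += lr * X[idx].T.dot(err)               # descent step on 1/2 * sum(err^2)
    b += lr * err.sum()
print(np.mean((Y - (X.dot(w) + b)) ** 2))     # evaluation: mean squared error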

Summary

  • In mini-batch gradient descent the choice of initial parameters matters: different initializations lead to different convergence speeds for the loss
  • The learning rate here is chosen with a decaying schedule, and picking it well takes experience. It sounds like an easy thing, but in practice you run into walls everywhere, so keep accumulating and summarizing lessons from your experiments. Machine learning really is a cycle of digging pits, filling them, and digging again v_v (a sketch of one simple decaying scheme follows)
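
One common decaying scheme is Adagrad, where each parameter's effective step size shrinks with its accumulated squared gradients. The text above only says the learning rate decays, so this particular scheme, the toy objective, and all constants below are illustrative:

import numpy as np

w = np.zeros(3)
g_sq = np.zeros(3)                             # accumulated squared gradients
eta0 = 0.1
target = np.array([1.0, 2.0, 3.0])             # toy optimum, for illustration
for t in range(1000):
    grad = 2 * (w - target)                    # gradient of ||w - target||^2
    g_sq += grad ** 2
    w -= eta0 * grad / (np.sqrt(g_sq) + 1e-8)  # per-parameter decaying step
print(w)                                       # approaches [1, 2, 3]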