Hung-yi Lee Machine Learning Homework Notes 1+2

新火之光

Tags: machine learning

Article link: https://blog.csdn.net/qq_45878378/article/details/108079271

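This post records my notes on Homework 1 and Homework 2 of Hung-yi Lee's machine learning course. Homework 1 is an air quality prediction task whose main work is data processing: reading the data, normalizing it, and extracting features. Homework 2 is a binary classification problem solved in two ways, gradient descent and a generative model, with a look at the activation function and at how the class probabilities are computed.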

Homework 1
- Requirements
- Key points
Homework 2

Data required for the homework akti
Reference answers for the homework

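Homework 1

Requirements

The data for this homework is a set of air quality measurements. train.csv covers the first 20 days of every month of 2014 and records 18 features for every hour of those days. The task is to train a model on this data that takes the previous nine hours of measurements and predicts the PM2.5 value of the tenth hour.

Key points

This homework is about regression. The model itself is simple and is not the focus; in my view the focus is the data processing, that is, turning the two provided data files into the shape the model needs.

Reading the data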

import numpy as np
import pandas as pd
data = pd.read_csv('./train.csv', encoding = 'big5')

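Data extraction and transformation

data = data.iloc[:, 3:]   # keep the columns from index 3 onward (the hourly readings)
data[data == 'NR'] = 0    # replace the non-numeric marker 'NR' with 0
raw_data = data.to_numpy()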

Splitting the data by month

month_data = {}
for month in range(12):
    sample = np.empty([18, 480])
    for day in range(20):
        sample[:, day * 24 : (day + 1) * 24] = raw_data[18 * (20 * month + day) : 18 * (20 * month + day + 1), :]
    month_data[month] = sample

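After this, month_data[month] holds the data for that month as an 18 x 480 array: 18 features by 20 days x 24 hours.

The task asks us to predict the tenth hour from the previous nine. The 18 features of the first 9 hours form the input and the PM2.5 value of the tenth hour is the target of the regression, so what we ultimately need to learn is a one-dimensional array of 18*9=162 coefficients. We also need to build the training set and the test set from the data we have.

Feature extraction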

x = np.empty([12 * 471, 18 * 9], dtype = float)
y = np.empty([12 * 471, 1], dtype = float)
for month in range(12):
    for day in range(20):
        for hour in range(24):
            if day == 19 and hour > 14:
                continue
            x[month * 471 + day * 24 + hour, :] = month_data[month][:,day * 24 + hour : day * 24 + hour + 9].reshape(1, -1) 
            y[month * 471 + day * 24 + hour, 0] = month_data[month][9, day * 24 + hour + 9] #value
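
To make the windowing easier to follow, here is a small sketch of the same indexing on a toy array. The sizes (3 features, 12 hours) and the `target_row` index are made up for illustration and are not the homework's values.

```python
import numpy as np

n_features, n_hours, window = 3, 12, 9   # toy sizes, not the homework's 18 x 480
target_row = 1                            # hypothetical row playing the role of PM2.5

# Toy "month": entry [f, h] = 100 * f + h, so every value can be traced by eye.
toy_month = np.array(
    [[100 * f + h for h in range(n_hours)] for f in range(n_features)],
    dtype=float,
)

n_samples = n_hours - window              # 12 - 9 = 3 usable windows
x_toy = np.empty([n_samples, n_features * window], dtype=float)
y_toy = np.empty([n_samples, 1], dtype=float)
for start in range(n_samples):
    # Flatten the 3 x 9 window row by row: all hours of feature 0, then feature 1, ...
    x_toy[start, :] = toy_month[:, start:start + window].reshape(1, -1)
    # The target is the hour right after the window.
    y_toy[start, 0] = toy_month[target_row, start + window]

print(x_toy.shape)
print(y_toy.ravel())
```

With 480 hours per month the same arithmetic gives 480 - 9 = 471 windows, which is where the 471 in the code above comes from.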

Normalization

mean_x = np.mean(x, axis = 0) #18 * 9 
std_x = np.std(x, axis = 0) #18 * 9 
for i in range(len(x)): #12 * 471
    for j in range(len(x[0])): #18 * 9 
        if std_x[j] != 0:
            x[i][j] = (x[i][j] - mean_x[j]) / std_x[j]
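
The double loop can also be written with NumPy broadcasting. The sketch below compares the two on random data; `x_demo` and the constant column are made up for the comparison, and this is only one possible way to vectorize it.

```python
import numpy as np

rng = np.random.default_rng(0)
x_demo = rng.normal(size=(50, 6))   # made-up data standing in for x
x_demo[:, 2] = 3.0                  # a constant column, so its std is 0

mean_demo = np.mean(x_demo, axis=0)
std_demo = np.std(x_demo, axis=0)

# Loop version, same logic as in the post.
x_loop = x_demo.copy()
for i in range(len(x_loop)):
    for j in range(len(x_loop[0])):
        if std_demo[j] != 0:
            x_loop[i][j] = (x_loop[i][j] - mean_demo[j]) / std_demo[j]

# Vectorized version: only touch the columns whose std is non-zero.
nonzero = std_demo != 0
x_vec = x_demo.copy()
x_vec[:, nonzero] = (x_demo[:, nonzero] - mean_demo[nonzero]) / std_demo[nonzero]

print(np.allclose(x_loop, x_vec))
```

Both versions leave zero-variance columns unchanged; the vectorized one avoids 12 * 471 * 162 Python-level iterations.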

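At this point the data processing is essentially done. What follows is the training process, which is fairly simple, so I will not go into it.

Homework 2

Requirements

This homework is a simple binary classification problem. The reference answer gives two approaches: the first is logistic regression with an activation function, and the second uses a generative model.

Data processing

Reading the data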
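
For completeness, here is a minimal sketch of what such a training step could look like: linear regression fitted with plain batch gradient descent on synthetic data. The data, `learning_rate`, and `n_iters` are all assumptions chosen for the demo, and the reference answer may well use a different optimizer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 200, 5                  # toy sizes; the homework has 12 * 471 samples, 18 * 9 features
x_syn = rng.normal(size=(n_samples, dim))
true_w = np.array([[1.0], [-2.0], [0.5], [0.0], [3.0]])
y_syn = x_syn @ true_w + 4.0             # noiseless synthetic targets with bias 4.0

# Prepend a column of ones so the bias is learned as w[0].
x_aug = np.concatenate((np.ones([n_samples, 1]), x_syn), axis=1)
w = np.zeros([dim + 1, 1])

learning_rate = 0.1                      # assumed value
n_iters = 1000                           # assumed value
for t in range(n_iters):
    residual = x_aug @ w - y_syn
    gradient = 2 * x_aug.T @ residual / n_samples   # gradient of the mean squared error
    w -= learning_rate * gradient

rmse = np.sqrt(np.mean((x_aug @ w - y_syn) ** 2))
print(rmse)
```

On the real data the inputs would be the normalized x and the targets y built above.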

import numpy as np

X_train_fpath = './data/X_train'
Y_train_fpath = './data/Y_train'
X_test_fpath = './data/X_test'
output_fpath = './output_{}.csv'

# Parse csv files to numpy array
with open(X_train_fpath) as f:
    next(f)
    X_train = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)
with open(Y_train_fpath) as f:
    next(f)
    Y_train = np.array([line.strip('\n').split(',')[1] for line in f], dtype = float)
with open(X_test_fpath) as f:
    next(f)
    X_test = np.array([line.strip('\n').split(',')[1:] for line in f], dtype = float)

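Normalization function

def _normalize(X, train = True, specified_column = None, X_mean = None, X_std = None):
    # Normalize the given columns of X in place.
    # When train is True, the mean and standard deviation are computed from X;
    # otherwise the supplied X_mean and X_std (from the training set) are reused.
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std  = np.std(X[:, specified_column], 0).reshape(1, -1)

    X[:,specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)

    return X, X_mean, X_std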
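
A usage sketch: normalize the training set first, then reuse its statistics for the test set. The function is repeated here (using `is None` for the default check) so the snippet runs on its own, and the arrays `X_train_demo` and `X_test_demo` are made-up stand-ins for the real data.

```python
import numpy as np

def _normalize(X, train=True, specified_column=None, X_mean=None, X_std=None):
    # Same logic as the function in the post, repeated to keep this snippet self-contained.
    if specified_column is None:
        specified_column = np.arange(X.shape[1])
    if train:
        X_mean = np.mean(X[:, specified_column], 0).reshape(1, -1)
        X_std = np.std(X[:, specified_column], 0).reshape(1, -1)
    X[:, specified_column] = (X[:, specified_column] - X_mean) / (X_std + 1e-8)
    return X, X_mean, X_std

rng = np.random.default_rng(0)
X_train_demo = rng.normal(loc=5.0, scale=2.0, size=(100, 4))   # made-up training data
X_test_demo = rng.normal(loc=5.0, scale=2.0, size=(20, 4))     # made-up test data

# Training set: compute the statistics and normalize.
X_train_demo, demo_mean, demo_std = _normalize(X_train_demo, train=True)
# Test set: reuse the training statistics instead of recomputing them.
X_test_demo, _, _ = _normalize(X_test_demo, train=False, X_mean=demo_mean, X_std=demo_std)

print(X_train_demo.mean(axis=0), X_train_demo.std(axis=0))
```

Reusing the training mean and standard deviation keeps the test features on the same scale the model was trained on.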

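Method 1: gradient descent

Mini-batch gradient descent is used: the gradient of the loss is computed on each batch and used to update w and b. This is similar to Homework 1; the differences are that the linear output is passed through an activation function at the end and that the loss is the cross entropy.

Activation function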

def _sigmoid(z):
    # Sigmoid function can be used to calculate probability.
    # To avoid overflow, minimum/maximum output value is set.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))

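def _f(X, w, b):
    # This is the logistic regression function, parameterized by w and b
    #
    # Arguments:
    #     X: input data, shape = [batch_size, data_dimension]
    #     w: weight vector, shape = [data_dimension, ]
    #     b: bias, scalar
    # Output:
    #     predicted probability of each row of X being positively labeled, shape = [batch_size, ]
    # (The listing is cut off in the source; the return below is completed as sigmoid(Xw + b).)
    return _sigmoid(np.matmul(X, w) + b)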
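
A quick check of what the clipping in `_sigmoid` buys us. The inputs in `z_demo` are made up; the point is that the clipped output never reaches exactly 0 or 1, so the logarithms taken later in the cross entropy stay finite.

```python
import numpy as np

def _sigmoid(z):
    # Same definition as in the post: sigmoid followed by clipping.
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))

z_demo = np.array([-50.0, 0.0, 50.0])      # made-up inputs, two of them extreme
p_raw = 1 / (1.0 + np.exp(-z_demo))        # sigmoid without clipping
p_clipped = _sigmoid(z_demo)

# Without clipping, the largest input rounds to exactly 1.0 in float64,
# so log(1 - p) would be -inf. With clipping both logs are finite.
log_p = np.log(p_clipped)
log_one_minus_p = np.log(1 - p_clipped)

print(p_raw)
print(p_clipped)
print(np.isfinite(log_p).all(), np.isfinite(log_one_minus_p).all())
```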

Loss function

def _cross_entropy_loss(y_pred, Y_label):
    # This function computes the cross entropy.
    #
    # Arguments:
    #     y_pred: probabilistic predictions, float vector
    #     Y_label: ground truth labels, bool vector
    # Output:
    #     cross entropy, scalar
    cross_entropy = -np.dot(Y_label, np.log(y_pred)) - np.dot((1 - Y_label), np.log(1 - y_pred))
    return cross_entropy
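
Putting the pieces together, a mini-batch training loop could look like the sketch below. The `_gradient` helper, the synthetic data, and the hyperparameters (`batch_size`, `learning_rate`, `n_epochs`) are my own assumptions for illustration and are not taken from the reference answer.

```python
import numpy as np

def _sigmoid(z):
    return np.clip(1 / (1.0 + np.exp(-z)), 1e-8, 1 - (1e-8))

def _f(X, w, b):
    return _sigmoid(np.matmul(X, w) + b)

def _cross_entropy_loss(y_pred, Y_label):
    return -np.dot(Y_label, np.log(y_pred)) - np.dot((1 - Y_label), np.log(1 - y_pred))

def _gradient(X, Y_label, w, b):
    # Gradient of the summed cross entropy with respect to w and b.
    pred_error = Y_label - _f(X, w, b)
    w_grad = -np.sum(pred_error * X.T, 1)
    b_grad = -np.sum(pred_error)
    return w_grad, b_grad

# Synthetic, linearly separable data (made up for the demo).
rng = np.random.default_rng(0)
n_samples, dim = 400, 3
X_syn = rng.normal(size=(n_samples, dim))
Y_syn = (X_syn @ np.array([2.0, -1.0, 0.5]) + 0.3 > 0).astype(float)

w = np.zeros(dim)
b = 0.0
batch_size, learning_rate, n_epochs = 20, 0.1, 30   # assumed values

initial_loss = _cross_entropy_loss(_f(X_syn, w, b), Y_syn) / n_samples
for epoch in range(n_epochs):
    order = rng.permutation(n_samples)              # reshuffle every epoch
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        w_grad, b_grad = _gradient(X_syn[idx], Y_syn[idx], w, b)
        # Average the gradient over the batch before stepping.
        w = w - learning_rate * w_grad / len(idx)
        b = b - learning_rate * b_grad / len(idx)

final_loss = _cross_entropy_loss(_f(X_syn, w, b), Y_syn) / n_samples
accuracy = np.mean(np.round(_f(X_syn, w, b)) == Y_syn)
print(initial_loss, final_loss, accuracy)
```

The gradient has the same form as in linear regression, prediction error times input; only the prediction now goes through the sigmoid.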

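Method 2: generative model

A generative model, also called a probabilistic generative model, first assumes a probability distribution for the data and then uses probability formulas to compute the probability that x belongs to each class. Put simply, we assume the data follow some distribution (for example a Gaussian), find the parameters of that distribution under which the observed data are most probable, and use the resulting probability formula as our model.
Let $P(C_{1}|x)$ be the probability that x belongs to the positive class $C_{1}$. If x is a sample from the positive class, then $P(C_{1}|x)$ should be as close to 1 as possible.
When the data are assumed to follow a Gaussian distribution, the corresponding formulas are: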

$P(C_{1}|x)=\sigma(z)$

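$z=(\mu^1-\mu^2)^T\Sigma^{-1}x-\frac{1}{2}(\mu^1)^T\Sigma^{-1}\mu^1+\frac{1}{2}(\mu^2)^T\Sigma^{-1}\mu^2+\ln\frac{N_{1}}{N_{2}}$

$\mu^1$ and $\mu^2$ are the means of the two classes, $\Sigma$ is the covariance matrix shared by both classes, and $N_{1}$ and $N_{2}$ are the numbers of samples in each class.

Taking $(\mu^1-\mu^2)^T\Sigma^{-1}$ as w and $-\frac{1}{2}(\mu^1)^T\Sigma^{-1}\mu^1+\frac{1}{2}(\mu^2)^T\Sigma^{-1}\mu^2+\ln\frac{N_{1}}{N_{2}}$ as ***b***, z becomes a linear function of x, the same form as logistic regression. All that remains is to plug the values into the formula.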
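
For reference, with a covariance matrix shared by both classes, the parameters are usually estimated by maximum likelihood as sketched below. The symbols $x^n$ (the n-th training sample) and $\Sigma^k$ (the covariance of class $C_k$ on its own) are notation I introduce here for illustration.

$\mu^k=\frac{1}{N_k}\sum_{x^n\in C_k}x^n,\quad \Sigma^k=\frac{1}{N_k}\sum_{x^n\in C_k}(x^n-\mu^k)(x^n-\mu^k)^T,\quad k=1,2$

$\Sigma=\frac{N_1}{N_1+N_2}\Sigma^1+\frac{N_2}{N_1+N_2}\Sigma^2$

The shared $\Sigma$ is the average of the two per-class covariances weighted by class size.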

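The corresponding code:

# cov is the shared covariance matrix; mean_0 and mean_1 are the class means of
# X_train_0 and X_train_1 (all computed beforehand).
# Invert cov through its SVD: cov = u @ diag(s) @ v, so inv = v.T @ diag(1 / s) @ u.T
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)

# Directly compute weights and bias.
# Class 0 plays the role of C_1 here, so sigmoid(w . x + b) is P(class 0 | x).
w = np.dot(inv, mean_0 - mean_1)
b =  (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1))\
    + np.log(float(X_train_0.shape[0]) / X_train_1.shape[0])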
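
The snippet above relies on `cov`, `mean_0`, and `mean_1` being computed beforehand. Below is a self-contained sketch of the whole pipeline on synthetic Gaussian data. The data, the size-weighted covariance average, and the final thresholding step are assumptions made for this demo rather than the reference answer's exact code.

```python
import numpy as np

# Two synthetic Gaussian classes sharing one covariance (made up for the demo).
rng = np.random.default_rng(0)
true_cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X_train_0 = rng.multivariate_normal([-2.0, 0.0], true_cov, size=300)   # class 0
X_train_1 = rng.multivariate_normal([2.0, 1.0], true_cov, size=200)    # class 1
n0, n1 = X_train_0.shape[0], X_train_1.shape[0]

# Class means.
mean_0 = np.mean(X_train_0, axis=0)
mean_1 = np.mean(X_train_1, axis=0)

# Per-class covariances (divide by N), then a size-weighted average as the shared covariance.
cov_0 = (X_train_0 - mean_0).T @ (X_train_0 - mean_0) / n0
cov_1 = (X_train_1 - mean_1).T @ (X_train_1 - mean_1) / n1
cov = (cov_0 * n0 + cov_1 * n1) / (n0 + n1)

# Invert the shared covariance through its SVD.
u, s, v = np.linalg.svd(cov, full_matrices=False)
inv = np.matmul(v.T * 1 / s, u.T)

# Closed-form weights and bias; class 0 takes the role of C_1.
w = np.dot(inv, mean_0 - mean_1)
b = (-0.5) * np.dot(mean_0, np.dot(inv, mean_0)) + 0.5 * np.dot(mean_1, np.dot(inv, mean_1)) \
    + np.log(float(n0) / n1)

# sigmoid(Xw + b) is P(class 0 | x), so predict label 1 when that probability is below 0.5.
X_all = np.concatenate((X_train_0, X_train_1), axis=0)
Y_all = np.concatenate((np.zeros(n0), np.ones(n1)))
prob_class_0 = 1 / (1.0 + np.exp(-(X_all @ w + b)))
Y_pred = (prob_class_0 < 0.5).astype(float)
accuracy = np.mean(Y_pred == Y_all)
print(accuracy)
```

Because w and b come from a closed-form expression, no iterative training is needed for this method.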