pmf-automl Source Code Analysis

数学工具构造器

Published 2020-06-08 11:37:47

Article link: https://blog.csdn.net/TQCAI666/article/details/106598379

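This article takes a close look at the source code of the PMF-automl project, covering data splitting, latent-variable initialization, model training, the Gaussian process definition, and solving for the covariance matrix of the posterior distribution. Walking through the code, it explains how the RBF kernel and the White noise term are used when computing the covariance matrix, and what the GP forward function returns.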

arXiv paper (includes the appendix, but the font is small)
Probabilistic Matrix Factorization for Automated Machine Learning
NIPS 2018 paper (larger font, but no appendix)
Probabilistic Matrix Factorization for Automated Machine Learning
Code
https://github.com/rsheth80/pmf-automl

A First Look at the Project Files

Open all_normalized_accuracy_with_pipelineID.csv in JupyterLab.

all_normalized_accuracy_with_pipelineID.zip contains the performance observations from running 42K pipelines on 553 OpenML datasets. The task was classification and the performance metric was balanced accuracy. Unzip prior to running code.

Rows are pipeline ids, columns are dataset ids, and each entry is a balanced accuracy.
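To make the layout concrete, here is a minimal NumPy sketch on a toy matrix (the values are made up, not taken from the CSV). It assumes that unevaluated pipeline/dataset pairs are stored as NaN, which is what the imputer used later in the training code suggests.

```python
import numpy as np

# Toy stand-in for the real matrix: 4 pipelines (rows) x 3 datasets (columns).
# NaN marks a pipeline that was not evaluated on that dataset (assumption).
Y = np.array([
    [0.90,   np.nan, 0.40],
    [0.70,   0.55,   np.nan],
    [np.nan, 0.80,   0.65],
    [0.60,   0.75,   0.50],
])

# Fraction of pipelines actually observed for each dataset (column-wise).
observed_fraction = (~np.isnan(Y)).mean(axis=0)

# Row index of the best observed pipeline for each dataset, ignoring NaN.
best_pipeline = np.nanargmax(Y, axis=0)

print(observed_fraction)  # [0.75 0.75 0.75]
print(best_pipeline)      # [0 2 2]
```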

A quick look at pipelines.json shows that there are essentially only two preprocessors: pca and polynomial.

PMF Model Training

Data Split

Ytrain, Ytest, Ftrain, Ftest = get_data()

>>> Ytrain.shape
Out[2]: (42000, 464)
>>> Ytest.shape
Out[3]: (42000, 89)
>>> Ftrain.shape
Out[4]: (464, 46)
>>> Ftest.shape
Out[5]: (89, 46)

Train/test split: of the 553 datasets, 89 are held out as the test set and 464 are used for training.
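The shapes above imply that datasets are split along the columns of Y and along the rows of F. The sketch below reproduces that pattern on toy sizes. How get_data() actually picks the held-out dataset ids is not shown here, so the random permutation is purely an assumption, as is reading F as one feature row per dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes, not the real 42000 pipelines / 553 datasets / 89 test datasets.
n_pipelines, n_datasets, n_test = 6, 10, 3
Y = rng.random((n_pipelines, n_datasets))  # performance matrix
F = rng.random((n_datasets, 4))            # one feature row per dataset (assumption)

# Hypothetical choice of held-out datasets.
perm = rng.permutation(n_datasets)
test_idx, train_idx = perm[:n_test], perm[n_test:]

# Datasets are columns of Y but rows of F.
Ytrain, Ytest = Y[:, train_idx], Y[:, test_idx]
Ftrain, Ftest = F[train_idx], F[test_idx]

print(Ytrain.shape, Ytest.shape, Ftrain.shape, Ftest.shape)
# (6, 7) (6, 3) (7, 4) (3, 4)
```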

Initializing the Latent Variables

    imp = sklearn.impute.SimpleImputer(missing_values=np.nan, strategy='mean')
    X = sklearn.decomposition.PCA(Q).fit_transform(
                                            imp.fit(Ytrain).transform(Ytrain))

>>> X.shape
Out[7]: (42000, 20)

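As I currently understand it, the whole training process uses a GP to learn the latent variables X, and those latent variables are initialized with PCA.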

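The code fills in the missing values of the training set and then reduces it to 20 dimensions (for the 42K pipelines, the 464 training-dataset columns are reduced to 20 latent variables).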
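For intuition, here is a NumPy-only sketch of what the two sklearn calls compute: column-mean imputation followed by a projection onto the top Q principal components. The helper names mean_impute and pca_project are made up for illustration, and the result may differ from sklearn's PCA in the sign of each component.

```python
import numpy as np

def mean_impute(Y):
    """Replace NaN entries with the mean of the observed values in their column."""
    col_mean = np.nanmean(Y, axis=0)
    return np.where(np.isnan(Y), col_mean, Y)

def pca_project(Y, Q):
    """Project the rows of Y onto the top-Q principal components via SVD."""
    Yc = Y - Y.mean(axis=0)                        # center each column
    U, S, _ = np.linalg.svd(Yc, full_matrices=False)
    return U[:, :Q] * S[:Q]                        # scores, shape (N, Q)

rng = np.random.default_rng(0)
Y = rng.random((50, 8))                  # toy: 50 pipelines x 8 datasets
Y[rng.random(Y.shape) < 0.2] = np.nan    # knock out roughly 20% of the entries

X0 = pca_project(mean_impute(Y), Q=3)
print(X0.shape)  # (50, 3)
```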

From the paper: the elements of $Y$ are given by a nonlinear function of the latent variables, $y_{n,d}=f_d(x_n)+\epsilon$, where $\epsilon$ is independent Gaussian noise.

Here $Y$ is the full $42000\times464$ matrix, so $X$ holds the latent variables of the pipeline space. With latent dimension $Q = 20$, $X$ has shape $42000\times20$.
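Written out under the usual GPLVM assumption of a zero-mean GP prior on each $f_d$ (this is my reading of the model; the symbols $k$ for the kernel, $\sigma^2$ for the noise variance, $\mathbf{y}_d$ for the $d$-th column of $Y$ and $D$ for the number of columns are introduced here for illustration), the likelihood factorizes over datasets:

$$
f_d \sim \mathcal{GP}(0, k), \qquad
p(\mathbf{y}_d \mid X) = \mathcal{N}\!\left(\mathbf{y}_d \mid \mathbf{0},\; K(X, X) + \sigma^2 I\right), \qquad
p(Y \mid X) = \prod_{d=1}^{D} p(\mathbf{y}_d \mid X)
$$

Since $Y$ has missing entries, each factor would in practice involve only the rows of $X$ whose pipelines were actually evaluated on dataset $d$.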

Model Definition and Training

Top-level model definition:

    kernel = kernels.Add(kernels.RBF(Q, lengthscale=None), kernels.White(Q))
    m = gplvm.GPLVM(Q, X, Ytrain, kernel, N_max=N_max, D_max=batch_size)
    optimizer = torch.optim.SGD(m.parameters(), lr=lr)
    m = train(m, optimizer, f_callback=f_callback, f_stop=f_stop)
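To show what summing an RBF kernel and a White kernel amounts to, here is a NumPy sketch that builds the covariance matrix and evaluates a Gaussian negative log-likelihood for one column. The parametrization (a single variance and lengthscale for the RBF, a scalar variance for the White term) is an assumption and may not match the repository's kernels module; rbf, white and gp_nll are illustrative names.

```python
import numpy as np

def rbf(X, variance=1.0, lengthscale=1.0):
    """Squared-exponential kernel matrix between all rows of X."""
    sq = np.sum(X**2, axis=1)
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def white(X, variance=0.1):
    """White-noise kernel: variance on the diagonal, zero elsewhere."""
    return variance * np.eye(X.shape[0])

def gp_nll(y, K):
    """Negative log density of y under N(0, K), computed via Cholesky."""
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # K^{-1} y
    n = y.shape[0]
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))   # 5 toy latent points with Q = 2
y = rng.standard_normal(5)        # one toy column of observations

K = rbf(X) + white(X)             # the sum of the two kernels
nll = gp_nll(y, K)
print(K.shape, nll)
```

The White term only adds to the diagonal, which is also what keeps the Cholesky factorization numerically stable.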

f_callback and f_stop are both local functions:

    def f_callback(m, v, it, t):
        varn_list.append(transform_forward(m.variance).item())
        logpr_list.append(m().item()/m.D)
        if it == 1:
            t_list.append(t)
        else:
            t_list.append(t_list[-1] + t)

        if save_checkpoint and not (it % checkpoint_period):
            torch.save(m.state_dict(), fn_checkpoint + '_it%d.pt' % it)

        print('it=%d, f=%g, varn=%g, t: %g'
              % (it, logpr_list[-1], transform_forward(m.variance), t_list[-1]))

    def f_stop(m, v, it, t):

        if it >= maxiter-1:
            print('maxiter (%d) reached' % maxiter)
            return True

        return False
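The bookkeeping inside f_callback can be isolated in a few lines. This sketch keeps only the cumulative-time list and the checkpoint condition; the record helper, the fixed 0.5-second step time, and the period of 3 are hypothetical.

```python
checkpoint_period = 3   # hypothetical value
t_list = []             # cumulative wall-clock time after each iteration
checkpoint_its = []     # iterations at which a checkpoint would be written

def record(it, t):
    # Same accumulation as f_callback: the first entry is t itself,
    # later entries add t to the running total.
    t_list.append(t if it == 1 else t_list[-1] + t)
    # `not (it % period)` is True exactly when `it` is a multiple of `period`.
    if not (it % checkpoint_period):
        checkpoint_its.append(it)

for it in range(1, 8):
    record(it, 0.5)

print(t_list)          # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
print(checkpoint_its)  # [3, 6]
```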

Next, look at the training function train:

def train(m, optimizer, f_callback=None, f_stop=None):

    it = 0
    while True:

        try:
            t = time.time()

            optimizer.zero_grad()
            nll = m()
            nll.backward()
            optimizer.step()

            it += 1
            t = time.time() - t

            if f_callback is not None:
                f_callback(m, nll, it, t)

            # f_stop should not be a substantial portion of total iteration time
            if f_stop is not None and f_stop(m, nll, it, t):
                break

        except KeyboardInterrupt:
            break

    return m
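One detail worth checking is how many updates the loop performs. Because it is incremented before f_stop is called and f_stop tests it >= maxiter-1, the loop appears to stop after maxiter-1 updates. The sketch below mirrors the control flow with a plain-Python stand-in for the model and optimizer; run_loop and step are illustrative names, and the callback signature is simplified from (m, v, it, t) to (value, it).

```python
def run_loop(step, f_callback=None, f_stop=None):
    """Same control flow as train(), with the optimizer calls folded into step()."""
    it = 0
    while True:
        value = step()
        it += 1
        if f_callback is not None:
            f_callback(value, it)
        if f_stop is not None and f_stop(value, it):
            break
    return it

maxiter = 5
w = [10.0]  # single parameter, kept in a list so step() can update it

def step():
    # Evaluate f(w) = w**2, then take one gradient-descent step with lr = 0.1.
    loss = w[0] ** 2
    w[0] -= 0.1 * 2.0 * w[0]
    return loss

def f_stop(value, it):
    # Same test as in the source.
    return it >= maxiter - 1

n_updates = run_loop(step, f_stop=f_stop)
print(n_updates)  # 4
```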