The papers I read this week all seemed fairly interesting, but their practical value is limited, so this is just a brief archive.
The first two are the first two installments of Zhi-Hua Zhou's three deep forest papers; the third installment was covered here before. I read them mostly to follow the trend, since Zhi-Hua Zhou is quite well known. The ideas are genuinely interesting, but they are still some distance from practical use. In short: the first paper borrows the multi-layer idea from neural networks and stacks random forests into multiple layers, with each layer also performing representation learning. The second is an autoencoder that reconstructs the original image using the Maximal-Compatible Rule (MCR). The third borrows from backpropagation, constructing inverse functions to pass the error back for more precise learning.
Ladder Network and PU learning (Positive and Unlabeled Learning) both belong to semi-supervised learning, where labeled data makes up only a small fraction of the whole.
Deep Forest
作者:
Zhi-Hua Zhou, Ji Feng
National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210023, China
{zhouzh, fengj}@lamda.nju.edu.cn
Abstract
Explores how to build deep models from non-differentiable modules.
The success of deep learning is attributed to three characteristics: layer-by-layer processing, in-model feature transformation, and sufficient model complexity.
The proposed gcForest builds a deep model while preserving these three characteristics of deep learning.
Moreover, compared with deep learning, tree models have far fewer hyper-parameters, which greatly reduces the effort of model tuning.
1 Introduction
2 Inspiration
2.1 Inspiration from DNNs
representation learning (layer-by-layer processing)
2.2 Inspiration from Ensemble Learning
It is well known that an ensemble can usually achieve better generalization performance than single learners.
To construct a good ensemble, the individual learners should be accurate and diverse.
3 The gcForest Approach
3.1 Cascade Forest Structure
we include different types of forests to encourage the diversity, because diversity is crucial for ensemble construction.
Random forests + completely-random tree forests -> each forest outputs a per-class probability vector; these class vectors, concatenated with the raw input, become the features for the next level. At the final level the class vectors are averaged.
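A rough sketch of one such cascade level (my own illustration, not the authors' code), using scikit-learn; `cascade_level` is an illustrative name, and `ExtraTreesClassifier` with `max_features=1` is my stand-in for the paper's completely-random tree forests:

```python
# Minimal sketch of one cascade level, assuming scikit-learn estimators.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def cascade_level(X, y, n_forests=2):
    """Augment X with out-of-fold class-probability vectors."""
    forests = (
        [RandomForestClassifier(n_estimators=500) for _ in range(n_forests)]
        # ExtraTreesClassifier with max_features=1 approximates the
        # completely-random tree forests used in the paper.
        + [ExtraTreesClassifier(n_estimators=500, max_features=1)
           for _ in range(n_forests)]
    )
    # 3-fold cross validation for class vector generation, as in the paper.
    class_vectors = [
        cross_val_predict(f, X, y, cv=3, method="predict_proba")
        for f in forests
    ]
    # The next level's input: class vectors concatenated with raw features.
    return np.hstack(class_vectors + [X])
```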
3.2 Multi-Grained Scanning
Windows of multiple sizes scan the raw features to produce instances; the resulting outputs are concatenated and appended as that layer's output.
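A hedged sketch of this step for a 1-D feature vector (illustrative only; `sliding_windows` and `forests_by_window` are my names, and each window size is assumed to have its own pre-trained forests):

```python
# Illustrative sketch of multi-grained scanning over a 1-D feature vector.
import numpy as np

def sliding_windows(x, window):
    """All contiguous windows of the given size, stride 1."""
    return np.array([x[i:i + window] for i in range(len(x) - window + 1)])

def multi_grained_features(x, forests_by_window):
    """forests_by_window: {window_size: [forests trained on that size]}.
    Each window is treated as an instance; the per-window class
    probabilities are flattened and concatenated into one long vector."""
    parts = []
    for window, forests in forests_by_window.items():
        instances = sliding_windows(x, window)
        for forest in forests:
            parts.append(forest.predict_proba(instances).ravel())
    return np.concatenate(parts)
```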
4 Experiments
4.1 Configuration
In all experiments gcForest is using the same cascade structure: Each level consists of 4 completely-random tree forests and 4 random forests, each containing 500 trees.
Three-fold cross validation is used for class vector generation.
The number of cascade levels is automatically determined.
The training data are split into an 80% growing set and a 20% estimating set; when accuracy on the estimating set no longer improves, the cascade stops adding levels.
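A sketch of that stopping rule (my own framing; `train_level` and `evaluate` are hypothetical helpers standing in for training and scoring one cascade level):

```python
# Hedged sketch of automatic cascade-depth selection via an
# 80/20 growing/estimating split; helpers are hypothetical.
from sklearn.model_selection import train_test_split

def grow_cascade(X, y, train_level, evaluate, max_levels=20):
    X_grow, X_est, y_grow, y_est = train_test_split(
        X, y, test_size=0.2, stratify=y)
    best_acc, levels = 0.0, []
    for _ in range(max_levels):
        # train_level fits one level on the growing set and returns the
        # level plus both sets transformed with its class vectors.
        level, X_grow, X_est = train_level(X_grow, y_grow, X_est)
        acc = evaluate(level, X_est, y_est)
        if acc <= best_acc:   # accuracy stopped improving
            break             # stop growing the cascade
        best_acc = acc
        levels.append(level)
    return levels
```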
For d raw features, we use feature windows with sizes of ⌊d/16⌋, ⌊d/8⌋, and ⌊d/4⌋.
4.2 Results
Image Categorization
Comparison of test accuracy on MNIST:

| Model | Test accuracy |
|---|---|
| gcForest | 99.26% |
| LeNet-5 | 99.05% |
| Deep Belief Net | 98.75% |
| SVM (rbf kernel) | 98.60% |
| Random Forest | 96.80% |
Comparison of test accuracy on CIFAR-10:

| Model | Test accuracy |
|---|---|
| ResNet | 93.57% |
| AlexNet | 83.00% |
| gcForest(gbdt) | 69.00% |
| gcForest(5grains) | 63.37% |
| Deep Belief Net | 62.20% |
| gcForest(default) | 61.78% |
| Random Forest | 50.17% |
| MLP | 42.20% |
| Logistic Regression | 37.32% |
| SVM (linear kernel) | 16.32% |
AutoEncoder by Forest
Abstract
Experiments show that, compared with DNN autoencoders, eForest is able to obtain lower reconstruction error with fast training speed, while the model itself is reusable and damage-tolerable.
1. Introduction
In this paper, we present the EncoderForest (abbrv. eForest), which enables a tree ensemble to perform forward encoding and backward decoding operations; it can be trained in either a supervised or an unsupervised fashion. Experiments show that the eForest approach has the following advantages:
- Accurate: Its experimental reconstruction error is lower than that of MLP- or CNN-based autoencoders.
- Efficient: eForest training on a single KNL (many-core CPU) runs even faster than a CNN autoencoder training on a Titan-X GPU.
- Damage-tolerable: The trained model works well even when it is partially damaged.
- Reusable: A model trained on one dataset can be directly applied to other datasets in the same domain.
2. Related Work
3. The Proposed Method
An autoencoder has two basic functions: encoding and decoding. For a random forest, encoding poses no difficulty, since the leaf-node information can already be regarded as a kind of encoding; not to mention that subsets of nodes, or even the branches along the paths, can provide still more encoding information.
eForest encoding: given a trained forest, feed the input into every tree in the forest, and take the collection of indices of the leaf nodes each tree lands in as the encoded features. This encoding procedure does not depend on the particular learning rule used to split the tree nodes.
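Conveniently, scikit-learn's forests already expose this forward-encoding step via `apply`, which returns the index of the leaf each sample reaches in every tree; a toy sketch with illustrative data:

```python
# Toy sketch of eForest-style forward encoding: the per-tree leaf
# indices returned by `apply` are exactly the encoding described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 20)                 # illustrative data
y = rng.randint(0, 2, 100)

forest = RandomForestClassifier(n_estimators=50).fit(X, y)
codes = forest.apply(X)               # shape: (n_samples, n_trees)
print(codes.shape)                    # each row is one sample's encoding
```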