The papers I read this week all seemed fairly interesting, but their practical value is limited, so this is just a brief archive.
The first two are the first two installments of Zhi-Hua Zhou's three deep forest papers; the third installment was covered here before. I read them mostly to follow the trend, since Zhi-Hua Zhou is quite well known. The ideas are genuinely interesting, but they are still some distance from practical use. In short: the first paper borrows the multi-layer idea from neural networks and stacks random forests into multiple layers, with each layer also performing representation learning. The second is an autoencoder that reconstructs the original image using the Maximal-Compatible Rule (MCR). The third borrows from backpropagation, constructing inverse functions to pass the error back for more precise learning.
Ladder Network and PU learning (Positive and Unlabeled Learning) both belong to semi-supervised learning, where labeled data makes up only a small fraction of the whole.
Deep Forest
作者:
Zhi-Hua Zhou, Ji Feng
National Key Laboratory for Novel Software Technology,
Nanjing University, Nanjing 210023, China
{zhouzh, fengj}@lamda.nju.edu.cn
Abstract
Explores how to build deep models from non-differentiable modules.
The success of deep learning is attributed to three characteristics: layer-by-layer processing, in-model feature transformation, and sufficient model complexity.
The proposed gcForest builds a deep model while preserving these three characteristics of deep learning.
Moreover, compared with deep learning, tree models have far fewer hyper-parameters, which greatly reduces the effort of model tuning.
1 Introduction
2 Inspiration
2.1 Inspiration from DNNs
representation learning (layer-by-layer processing)
2.2 Inspiration from Ensemble Learning
It is well known that an ensemble can usually achieve better generalization performance than single learners.
To construct a good ensemble, the individual learners should be accurate and diverse.
3 The gcForest Approach
3.1 Cascade Forest Structure
we include different types of forests to encourage the diversity, because diversity is crucial for ensemble construction.
Random forests + completely-random tree forests -> each forest outputs a per-class probability vector; these class vectors, concatenated with the raw input, become the features for the next level. At the final level the class vectors are averaged.
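A rough sketch of one such cascade level (my own illustration, not the authors' code), using scikit-learn; `cascade_level` is an illustrative name, and `ExtraTreesClassifier` with `max_features=1` is my stand-in for the paper's completely-random tree forests:

```python
# Minimal sketch of one cascade level, assuming scikit-learn estimators.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.model_selection import cross_val_predict

def cascade_level(X, y, n_forests=2):
    """Augment X with out-of-fold class-probability vectors."""
    forests = (
        [RandomForestClassifier(n_estimators=500) for _ in range(n_forests)]
        # ExtraTreesClassifier with max_features=1 approximates the
        # completely-random tree forests used in the paper.
        + [ExtraTreesClassifier(n_estimators=500, max_features=1)
           for _ in range(n_forests)]
    )
    # 3-fold cross validation for class vector generation, as in the paper.
    class_vectors = [
        cross_val_predict(f, X, y, cv=3, method="predict_proba")
        for f in forests
    ]
    # The next level's input: class vectors concatenated with raw features.
    return np.hstack(class_vectors + [X])
```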
3.2 Multi-Grained Scanning
Windows of multiple sizes scan the raw features to produce instances; the resulting outputs are concatenated and appended as that layer's output.
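A hedged sketch of this step for a 1-D feature vector (illustrative only; `sliding_windows` and `forests_by_window` are my names, and each window size is assumed to have its own pre-trained forests):

```python
# Illustrative sketch of multi-grained scanning over a 1-D feature vector.
import numpy as np

def sliding_windows(x, window):
    """All contiguous windows of the given size, stride 1."""
    return np.array([x[i:i + window] for i in range(len(x) - window + 1)])

def multi_grained_features(x, forests_by_window):
    """forests_by_window: {window_size: [forests trained on that size]}.
    Each window is treated as an instance; the per-window class
    probabilities are flattened and concatenated into one long vector."""
    parts = []
    for window, forests in forests_by_window.items():
        instances = sliding_windows(x, window)
        for forest in forests:
            parts.append(forest.predict_proba(instances).ravel())
    return np.concatenate(parts)
```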
4 Experiments
4.1 Configuration
In all experiments gcForest is using the same cascade structure: Each level consists of 4 completely-random tree forests and 4 random forests, each containing 500 trees.
Three-fold cross validation is used for class vector generation.
The number of cascade levels is automatically determined.
The training data are split into an 80% growing set and a 20% estimating set; when accuracy on the estimating set no longer improves, the cascade stops adding levels.
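A sketch of that stopping rule (my own framing; `train_level` and `evaluate` are hypothetical helpers standing in for training and scoring one cascade level):

```python
# Hedged sketch of automatic cascade-depth selection via an
# 80/20 growing/estimating split; helpers are hypothetical.
from sklearn.model_selection import train_test_split

def grow_cascade(X, y, train_level, evaluate, max_levels=20):
    X_grow, X_est, y_grow, y_est = train_test_split(
        X, y, test_size=0.2, stratify=y)
    best_acc, levels = 0.0, []
    for _ in range(max_levels):
        # train_level fits one level on the growing set and returns the
        # level plus both sets transformed with its class vectors.
        level, X_grow, X_est = train_level(X_grow, y_grow, X_est)
        acc = evaluate(level, X_est, y_est)
        if acc <= best_acc:   # accuracy stopped improving
            break             # stop growing the cascade
        best_acc = acc
        levels.append(level)
    return levels
```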
For d raw features, we use feature windows with sizes of ⌊d/16⌋, ⌊d/8⌋, and ⌊d/4⌋.
4.2 Results
Image Categorization
Comparison of test accuracy on MNIST:

| Model | Test accuracy |
|---|---|
| gcForest | 99.26% |
| LeNet-5 | 99.05% |
| Deep Belief Net | 98.75% |
| SVM (rbf kernel) | 98.60% |
| Random Forest | 96.80% |
Comparison of test accuracy on CIFAR-10:

| Model | Test accuracy |
|---|---|
| ResNet | 93.57% |
| AlexNet | 83.00% |
| gcForest(gbdt) | 69.00% |
| gcForest(5grains) | 63.37% |
| Deep Belief Net | 62.20% |
| gcForest(default) | 61.78% |
| Random Forest | 50.17% |
| MLP | 42.20% |
| Logistic Regression | 37.32% |
| SVM (linear kernel) | 16.32% |
AutoEncoder by Forest
Abstract
Experiments show that, compared with DNN autoencoders, eForest is able to obtain lower reconstruction error with fast training speed, while the model itself is reusable and damage-tolerable.
1. Introduction
In this paper, we present the EncoderForest (abbrv. eForest), which enables a tree ensemble to perform forward encoding and backward decoding operations; it can be trained in either a supervised or an unsupervised fashion. Experiments show that the eForest approach has the following advantages:
- Accurate: Its experimental reconstruction error is lower than that of MLP- or CNN-based autoencoders.
- Efficient: eForest training on a single KNL (many-core CPU) runs even faster than a CNN autoencoder training on a Titan-X GPU.
- Damage-tolerable: The trained model works well even when it is partially damaged.
- Reusable: A model trained on one dataset can be directly applied to other datasets in the same domain.
2. Related Work
3. The Proposed Method
An autoencoder has two basic functions: encoding and decoding. For a random forest, encoding poses no difficulty, since the leaf-node information can already be regarded as a kind of encoding; not to mention that subsets of nodes, or even the branches along the paths, can provide still more encoding information.
eForest encoding: given a trained forest, feed the input into every tree in the forest, and take the collection of indices of the leaf nodes each tree lands in as the encoded features. This encoding procedure does not depend on the particular learning rule used to split the tree nodes.
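Conveniently, scikit-learn's forests already expose this forward-encoding step via `apply`, which returns the index of the leaf each sample reaches in every tree; a toy sketch with illustrative data:

```python
# Toy sketch of eForest-style forward encoding: the per-tree leaf
# indices returned by `apply` are exactly the encoding described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(100, 20)                 # illustrative data
y = rng.randint(0, 2, 100)

forest = RandomForestClassifier(n_estimators=50).fit(X, y)
codes = forest.apply(X)               # shape: (n_samples, n_trees)
print(codes.shape)                    # each row is one sample's encoding
```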