autosklearn 源码理解

self.steps

0 ['categorical_encoding', <autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.OHEChoice object at 0x7f7a74dfa8d0>]
1 ['imputation', Imputation(random_state=None, strategy='median')]
2 ['variance_threshold', VarianceThreshold(random_state=None)]
3 ['rescaling', <autosklearn.pipeline.components.data_preprocessing.rescaling.RescalingChoice object at 0x7f7a74dfa780>]
4 ['balancing', Balancing(random_state=None, strategy='none')]
5 ['preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f7a74dc6390>]
6 ['classifier', <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f7a74dc6048>]
pipenode class is choice
OHEChoice choice
Imputation not choice
VarianceThreshold not choice
RescalingChoice choice
Balancing not choice
FeaturePreprocessorChoice choice
ClassifierChoice choice
from collections import OrderedDict
import importlib
import inspect
import pkgutil
import sys

def find_components(package, directory, base_class):
    components = OrderedDict()

    for module_loader, module_name, ispkg in pkgutil.iter_modules([directory]):
        full_module_name = "%s.%s" % (package, module_name)
        if full_module_name not in sys.modules and not ispkg:
            module = importlib.import_module(full_module_name)

            for member_name, obj in inspect.getmembers(module):
                if inspect.isclass(obj) and issubclass(obj, base_class) and \
                        obj != base_class:
                    # TODO test if the obj implements the interface
                    # Keep in mind that this only instantiates the ensemble_wrapper,
                    # but not the real target classifier
                    classifier = obj
                    components[module_name] = classifier

    return components

_classifiers = find_components(__package__,
                               classifier_directory,
                               AutoSklearnClassificationAlgorithm)

autosklearn/ensemble_builder.py:389

            self.logger.warning("No models better than random - "
                                "using Dummy Score!")

automl.py文件中_fit函数self._proc_ensemble = self._get_ensemble_process(time_left_for_ensembles)获取的是EnsembleBuilder对象,代码文件是~/ensemble_builder.py
_proc_ensemble继承了多进程类,会单独开个进程运行run方法。
run中调用了EnsembleBuilder.main
重点研究:autosklearn.ensembles.ensemble_selection.EnsembleSelection#_fast
_ensemble这个变量可能是通过load_model 加载的。

We experimented with different approaches to optimize these weights: stacking [26], gradient-free numerical optimization, and the method ensemble selection [24].

we found both numerical optimization and stacking to overfit to the validation set and to be
computationally costly

In a nutshell, ensemble selection (introduced by Caruana et al. [24]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance (with uniform weight, but allowing for repetitions)

We note that SMAC [9] can handle this conditionality
natively

The 14 possible feature preprocessing methods can be categorized into
feature selection (2), kernel approximation (2), matrix decomposition (3), embeddings (1), feature
clustering (1), polynomial feature expansion (1) and methods that use a classifier for feature selection
(2).

auto-sklearn 基本运行流程梳理

基本运行入口:autosklearn.automl.AutoMLClassifier#fit
在子类配置了一些必要的参数之后,调用父类的fit方法,即autosklearn.automl.AutoML#fit
在调用loaded_data_manager = XYDataManager(...将X y进行管理之后,调用return self._fit(...

创建搜索空间

self.configuration_space, configspace_path = self._create_search_space(
进入对应区域:autosklearn.automl.AutoML#_create_search_space
看到configuration_space = pipeline.get_configuration_space(
进入对应区域:autosklearn.util.pipeline.get_configuration_space
在这个函数中,配置了info字典之后,最后一段代码:

    if info['task'] in REGRESSION_TASKS:
        return _get_regression_configuration_space(info, include, exclude)
    else:
        return _get_classification_configuration_space(info, include, exclude)
  • 进入对应区域:autosklearn.util.pipeline._get_classification_configuration_space

最后一段代码:

    return SimpleClassificationPipeline(
        dataset_properties=dataset_properties,
        include=include, exclude=exclude).\
        get_hyperparameter_search_space()
  • 进入对应区域:autosklearn.pipeline.base.BasePipeline#get_hyperparameter_search_space

最后一段代码:

        if not hasattr(self, 'config_space') or self.config_space is None:
            self.config_space = self._get_hyperparameter_search_space(
                include=self.include_, exclude=self.exclude_,
                dataset_properties=self.dataset_properties_)
        return self.config_space
  • 进入对应区域:autosklearn.pipeline.classification.SimpleClassificationPipeline#_get_hyperparameter_search_space

至此,进过多次跳转与入栈,我们终于进入了”干货“最为丰富的区域了。
看到如下代码:

        cs = self._get_base_search_space(
            cs=cs, dataset_properties=dataset_properties,
            exclude=exclude, include=include, pipeline=self.steps)

注意,这里的self.steps表示autosklearn想要优化出的Pipeline的所有节点。

  • 进入对应区域:autosklearn.pipeline.base.BasePipeline#_get_base_search_space

看到要获取matches,我们想知道matches是怎么来的:

  • 进入对应区域:autosklearn.pipeline.create_searchspace_util.get_match_array

for node_name, node in pipeline:这个循环中,构造了一个很重要的变量:node_i_choices,他是一个2维列表。在原生形式中,维度1为7,表示7个Pipeline的结点。其中每个子列表表示可以选择的所有option
我取前4个作为样例

node_i_choices[0]
Out[16]: 
[autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.no_encoding.NoEncoding,
 autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.one_hot_encoding.OneHotEncoder]
node_i_choices[1]
Out[17]: [Imputation(random_state=None, strategy='median')]
node_i_choices[2]
Out[18]: [VarianceThreshold(random_state=None)]
node_i_choices[3]
Out[19]: 
[autosklearn.pipeline.components.data_preprocessing.rescaling.minmax.MinMaxScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.none.NoRescalingComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.normalize.NormalizerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.quantile_transformer.QuantileTransformerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.robust_scaler.RobustScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.standardize.StandardScalerComponent]

之后,matches_dimensions表示每个子列表的长度,用来构造一个高维张量matches

matches_dimensions
Out[20]: [2, 1, 1, 6, 1, 15, 15]
matches = np.ones(matches_dimensions, dtype=int)

看到:

    pipeline_idxs = [range(dim) for dim in matches_dimensions]
    for pipeline_instantiation_idxs in itertools.product(*pipeline_idxs):

可以理解为遍历这条Pipeline中所有的可能。
pipeline_instantiation_idxs表示某个Pipeline在matches中的坐标

pipeline_instantiation_idxs
Out[25]: (0, 0, 0, 0, 0, 0, 0)
            node_input = node.get_properties()['input']
            node_output = node.get_properties()['output']
node_input
Out[26]: (5, 6, 10)
node_output
Out[27]: (8,)

这个操作乍一看不理解,跳转get_properties函数我们看到:

                'input': (DENSE, SPARSE, UNSIGNED_DATA),
                'output': (PREDICTIONS,)}

应该是适应哪些类型。
首先判断sparse与dense是否check:

            # First check if these two instantiations of this node can work
            # together. Do this in multiple if statements to maintain
            # readability
            if (data_is_sparse and SPARSE not in node_input) or \
                    not data_is_sparse and DENSE not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break
            # No need to check if the node can handle SIGNED_DATA; this is
            # always assumed to be true
            elif not dataset_is_signed and UNSIGNED_DATA not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break

后面的操作也差不多,反正就是检查这个Pipeline是否合理。源码很sophisticated,我暂时跳过。
最后返回matches

  • 返回对应区域:autosklearn.pipeline.base.BasePipeline#_get_base_search_space:293
            if not is_choice:
                cs.add_configuration_space(node_name,
                                           node.get_hyperparameter_search_space(dataset_properties))
            # If the node isn't a choice, we have to figure out which of it's
            #  choices are actually legal choices
            else:
                choices_list = \
                    autosklearn.pipeline.create_searchspace_util.find_active_choices(
                        matches, node, node_idx,
                        dataset_properties,
                        include.get(node_name),
                        exclude.get(node_name)
                    )
                sub_config_space = node.get_hyperparameter_search_space(
                    dataset_properties, include=choices_list)
                cs.add_configuration_space(node_name, sub_config_space)

如果是选择性的结点,则进入else的部分,choices_list是所有的候选项

choices_list
Out[29]: ['no_encoding', 'one_hot_encoding']

我们再打印一下

sub_config_space
Out[30]: 
Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {
   no_encoding, one_hot_encoding}, Default: one_hot_encoding
    one_hot_encoding:minimum_fraction, Type: UniformFloat, Range: [0.0001, 0.5], Default: 0.01, on log-scale
    one_hot_encoding:use_minimum_fraction, Type: Categorical, Choices: {
   True, False}, Default: True
  Conditions:
    one_hot_encoding:minimum_fraction | one_hot_encoding:use_minimum_fraction == 'True'
    one_hot_encoding:use_minimum_fraction | __choice__ == 'one_hot_encoding'

我们打印一下特征处理部分:

Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {
   extra_trees_preproc_for_classification, fast_ica, feature_agglomeration, kernel_pca, kitchen_sinks, liblinear_svc_preprocessor, no_preprocessing, nystroem_sampler, pca, polynomial, random_trees_embedding, select_percentile_classification, select_rates}, Default: no_preprocessing
    extra_trees_preproc_for_classification:bootstrap, Type: Categorical, Choices: {
   True, False}, Default: False
    extra_trees_preproc_for_classification:criterion, Type: Categorical, Choices: {
   gini, entropy}, Default: gini
    extra_trees_preproc_for_classification:max_depth, Type: Constant, Value: None
    extra_trees_preproc_for_classification:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    extra_trees_preproc_for_classification:max_leaf_nodes, Type: Constant, Value: None
    extra_trees_preproc_for_classification:min_impurity_decrease, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    extra_trees_preproc_for_classification:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    extra_trees_preproc_for_classification:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:n_estimators, Type: Constant, Value: 100
    fast_ica:algorithm, Type: Categorical, Choices: {
   parallel, deflation}, Default: parallel
    fast_ica:fun, Type: Categorical, Choices: {
   logcosh, exp, cube}, Default: logcosh
    fast_ica:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
    fast_ica:whiten, Type: Categorical, Choices: {
   False, True}, Default: False
    feature_agglomeration:affinity, Type: Categorical, Choices: {
   euclidean, manhattan, cosine}, Default: euclidean
    feature_agglomeration:linkage, Type: Categorical, Choices: {
   ward, complete, average}, Default: ward
    feature_agglomeration:n_clusters, Type: UniformInteger, Range: [2, 400], Default: 25
    feature_agglomeration:pooling_func, Type: Categorical, Choices: 
  • 0
    点赞
  • 4
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值