Understanding the autosklearn source code

Latest recommended article published 2024-03-28 12:04:08

数学工具构造器

Category: machine learning, automl

Article link: https://blog.csdn.net/TQCAI666/article/details/104138996

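This article digs into the autosklearn source code and walks through its basic execution flow, including creating the search space and the ensemble_selection process. It focuses on how meta-learning (metalearning) is computed, including how meta-features are calculated and used and how models are recommended from the meta-dataset. It also discusses how metalearn is integrated into SMAC.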

self.steps

0 ['categorical_encoding', <autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.OHEChoice object at 0x7f7a74dfa8d0>]
1 ['imputation', Imputation(random_state=None, strategy='median')]
2 ['variance_threshold', VarianceThreshold(random_state=None)]
3 ['rescaling', <autosklearn.pipeline.components.data_preprocessing.rescaling.RescalingChoice object at 0x7f7a74dfa780>]
4 ['balancing', Balancing(random_state=None, strategy='none')]
5 ['preprocessor', <autosklearn.pipeline.components.feature_preprocessing.FeaturePreprocessorChoice object at 0x7f7a74dc6390>]
6 ['classifier', <autosklearn.pipeline.components.classification.ClassifierChoice object at 0x7f7a74dc6048>]

pipenode class	is choice
OHEChoice	choice
Imputation	not choice
VarianceThreshold	not choice
RescalingChoice	choice
Balancing	not choice
FeaturePreprocessorChoice	choice
ClassifierChoice	choice

from collections import OrderedDict
import importlib
import inspect
import pkgutil
import sys

def find_components(package, directory, base_class):
    components = OrderedDict()

    for module_loader, module_name, ispkg in pkgutil.iter_modules([directory]):
        full_module_name = "%s.%s" % (package, module_name)
        if full_module_name not in sys.modules and not ispkg:
            module = importlib.import_module(full_module_name)

            for member_name, obj in inspect.getmembers(module):
                if inspect.isclass(obj) and issubclass(obj, base_class) and \
                        obj != base_class:
                    # TODO test if the obj implements the interface
                    # Keep in mind that this only instantiates the ensemble_wrapper,
                    # but not the real target classifier
                    classifier = obj
                    components[module_name] = classifier

    return components

_classifiers = find_components(__package__,
                               classifier_directory,
                               AutoSklearnClassificationAlgorithm)
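
To see the discovery mechanism in isolation, here is a minimal sketch that builds a throwaway package on disk and scans it the same way. The package name demo_components, the class BaseAlgorithm and the helper discover are all hypothetical stand-ins, and unlike find_components this version does not skip modules that are already in sys.modules.

```python
import importlib
import inspect
import pkgutil
import sys
import tempfile
from collections import OrderedDict
from pathlib import Path


def discover(package, directory, base_class):
    """Map module name -> subclass of base_class found in that module."""
    found = OrderedDict()
    for _, module_name, ispkg in pkgutil.iter_modules([directory]):
        if ispkg:
            continue
        module = importlib.import_module(f"{package}.{module_name}")
        for _, obj in inspect.getmembers(module, inspect.isclass):
            # Skip the base class itself; keep only real components.
            if issubclass(obj, base_class) and obj is not base_class:
                found[module_name] = obj
    return found


# Build a tiny package on disk (every name below is hypothetical).
root = Path(tempfile.mkdtemp())
pkg_dir = root / "demo_components"
pkg_dir.mkdir()
(pkg_dir / "__init__.py").write_text("")
(pkg_dir / "base.py").write_text("class BaseAlgorithm:\n    pass\n")
for module_name, class_name in [("svm", "SVM"), ("tree", "Tree")]:
    (pkg_dir / f"{module_name}.py").write_text(
        "from .base import BaseAlgorithm\n"
        f"class {class_name}(BaseAlgorithm):\n    pass\n"
    )

sys.path.insert(0, str(root))
importlib.invalidate_caches()
base_class = importlib.import_module("demo_components.base").BaseAlgorithm
components = discover("demo_components", str(pkg_dir), base_class)
print(sorted(components))
```

The point of the pattern is that dropping a new module into the directory is enough to register a component; the key in the resulting dict is the module name, not the class name.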

autosklearn/ensemble_builder.py:389

            self.logger.warning("No models better than random - "
                                "using Dummy Score!")

In the _fit function of automl.py, self._proc_ensemble = self._get_ensemble_process(time_left_for_ensembles) obtains an EnsembleBuilder object, defined in ~/ensemble_builder.py.
_proc_ensemble inherits from a multiprocessing class, so its run method executes in a separate process.
run calls EnsembleBuilder.main.
Key thing to study: autosklearn.ensembles.ensemble_selection.EnsembleSelection#_fast
The _ensemble variable is probably loaded via load_model.

We experimented with different approaches to optimize these weights: stacking [26], gradient-free numerical optimization, and the method ensemble selection [24].

we found both numerical optimization and stacking to overfit to the validation set and to be computationally costly

In a nutshell, ensemble selection (introduced by Caruana et al. [24]) is a greedy procedure that starts from an empty ensemble and then iteratively adds the model that maximizes ensemble validation performance (with uniform weight, but allowing for repetitions)
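
The greedy procedure described in that quote can be sketched in a few lines. This is a simplified illustration and not the code in EnsembleSelection#_fast: the function name, the binary-classification setup, the 0.5 threshold and the accuracy metric are all assumptions made for the example.

```python
from collections import Counter


def ensemble_selection(predictions, y_true, ensemble_size):
    """Greedy ensemble selection with replacement and uniform weights.

    predictions: one list per model with P(class == 1) on the validation set.
    Returns the indices of the chosen models, in order; repeats are allowed.
    """
    def accuracy(probs):
        hits = sum((p >= 0.5) == bool(t) for p, t in zip(probs, y_true))
        return hits / len(y_true)

    chosen = []
    running_sum = [0.0] * len(y_true)  # sum of predictions of chosen models
    for _ in range(ensemble_size):
        best_idx, best_score = None, -1.0
        for idx, preds in enumerate(predictions):
            # Validation score of the ensemble if this model were added.
            size = len(chosen) + 1
            candidate = [(s + p) / size for s, p in zip(running_sum, preds)]
            score = accuracy(candidate)
            if score > best_score:  # ties keep the earlier model
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        running_sum = [s + p for s, p in zip(running_sum, predictions[best_idx])]
    return chosen


y_valid = [1, 0, 1, 0]
model_predictions = [
    [0.9, 0.2, 0.4, 0.1],  # model 0: wrong on the third sample
    [0.6, 0.7, 0.9, 0.3],  # model 1: wrong on the second sample
    [0.2, 0.8, 0.3, 0.9],  # model 2: wrong everywhere
]
chosen = ensemble_selection(model_predictions, y_valid, ensemble_size=3)
weights = {idx: count / len(chosen) for idx, count in Counter(chosen).items()}
print(chosen, weights)
```

Because a model may be picked more than once, the final weights are simply pick counts divided by the ensemble size, which is what "uniform weight, but allowing for repetitions" amounts to.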

We note that SMAC [9] can handle this conditionality natively

The 14 possible feature preprocessing methods can be categorized into feature selection (2), kernel approximation (2), matrix decomposition (3), embeddings (1), feature clustering (1), polynomial feature expansion (1) and methods that use a classifier for feature selection (2).

Walking through the basic auto-sklearn execution flow

Main entry point: autosklearn.automl.AutoMLClassifier#fit
After the subclass sets up a few required parameters, it calls the parent class's fit method, autosklearn.automl.AutoML#fit
Once loaded_data_manager = XYDataManager(... has wrapped X and y, it calls return self._fit(...

Creating the search space

self.configuration_space, configspace_path = self._create_search_space(
Step into: autosklearn.automl.AutoML#_create_search_space
There we see configuration_space = pipeline.get_configuration_space(
Step into: autosklearn.util.pipeline.get_configuration_space
In this function, after the info dict has been set up, the final piece of code is:

    if info['task'] in REGRESSION_TASKS:
        return _get_regression_configuration_space(info, include, exclude)
    else:
        return _get_classification_configuration_space(info, include, exclude)

Step into: autosklearn.util.pipeline._get_classification_configuration_space

The final piece of code:

    return SimpleClassificationPipeline(
        dataset_properties=dataset_properties,
        include=include, exclude=exclude).\
        get_hyperparameter_search_space()

Step into: autosklearn.pipeline.base.BasePipeline#get_hyperparameter_search_space

The final piece of code:

        if not hasattr(self, 'config_space') or self.config_space is None:
            self.config_space = self._get_hyperparameter_search_space(
                include=self.include_, exclude=self.exclude_,
                dataset_properties=self.dataset_properties_)
        return self.config_space

Step into: autosklearn.pipeline.classification.SimpleClassificationPipeline#_get_hyperparameter_search_space

At this point, after many jumps and nested calls, we finally reach the most substantive part.
We see the following code:

        cs = self._get_base_search_space(
            cs=cs, dataset_properties=dataset_properties,
            exclude=exclude, include=include, pipeline=self.steps)

Note that self.steps here holds all the nodes of the Pipeline that autosklearn wants to optimize.

Step into: autosklearn.pipeline.base.BasePipeline#_get_base_search_space

This function needs matches, so we want to know where matches comes from:

Step into: autosklearn.pipeline.create_searchspace_util.get_match_array

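The loop for node_name, node in pipeline: builds an important variable, node_i_choices, which is a two-dimensional list. In the default setup its first dimension has length 7, one entry per Pipeline node, and each sub-list holds all the options that can be chosen for that node.
I take the first four as examples: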

node_i_choices[0]
Out[16]: 
[autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.no_encoding.NoEncoding,
 autosklearn.pipeline.components.data_preprocessing.one_hot_encoding.one_hot_encoding.OneHotEncoder]
node_i_choices[1]
Out[17]: [Imputation(random_state=None, strategy='median')]
node_i_choices[2]
Out[18]: [VarianceThreshold(random_state=None)]
node_i_choices[3]
Out[19]: 
[autosklearn.pipeline.components.data_preprocessing.rescaling.minmax.MinMaxScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.none.NoRescalingComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.normalize.NormalizerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.quantile_transformer.QuantileTransformerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.robust_scaler.RobustScalerComponent,
 autosklearn.pipeline.components.data_preprocessing.rescaling.standardize.StandardScalerComponent]

Next, matches_dimensions records the length of each sub-list and is used to build a high-dimensional tensor, matches:

matches_dimensions
Out[20]: [2, 1, 1, 6, 1, 15, 15]

matches = np.ones(matches_dimensions, dtype=int)

We see:

    pipeline_idxs = [range(dim) for dim in matches_dimensions]
    for pipeline_instantiation_idxs in itertools.product(*pipeline_idxs):

This can be read as iterating over every possible instantiation of the Pipeline.
pipeline_instantiation_idxs is the coordinate of one particular Pipeline inside matches:

pipeline_instantiation_idxs
Out[25]: (0, 0, 0, 0, 0, 0, 0)
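
As a quick sanity check on the size of this enumeration, the sketch below reuses the matches_dimensions value printed above and counts the coordinates that itertools.product yields. It uses only the standard library and does not touch autosklearn itself.

```python
import itertools
import math

# Number of candidate options per pipeline node, as printed above.
matches_dimensions = [2, 1, 1, 6, 1, 15, 15]

# One cell of the matches tensor per combination of options.
total_pipelines = math.prod(matches_dimensions)

pipeline_idxs = [range(dim) for dim in matches_dimensions]
all_coordinates = list(itertools.product(*pipeline_idxs))
first_coordinate = all_coordinates[0]

print(total_pipelines, first_coordinate)
```

The count is the product of the per-node option counts, so every extra choice node multiplies the number of cells that have to be checked.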

            node_input = node.get_properties()['input']
            node_output = node.get_properties()['output']

node_input
Out[26]: (5, 6, 10)
node_output
Out[27]: (8,)

This is puzzling at first glance; jumping to the get_properties function, we see:

                'input': (DENSE, SPARSE, UNSIGNED_DATA),
                'output': (PREDICTIONS,)}

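These presumably declare which data types the node accepts as input and produces as output; the integers printed above look like the values of these constants.
First comes the sparse-versus-dense check: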

            # First check if these two instantiations of this node can work
            # together. Do this in multiple if statements to maintain
            # readability
            if (data_is_sparse and SPARSE not in node_input) or \
                    not data_is_sparse and DENSE not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break

            # No need to check if the node can handle SIGNED_DATA; this is
            # always assumed to be true
            elif not dataset_is_signed and UNSIGNED_DATA not in node_input:
                matches[pipeline_instantiation_idxs] = 0
                break

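The remaining checks are similar: they all verify whether this Pipeline instantiation is valid. The source is fairly intricate, so I skip it for now.
Finally, matches is returned.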
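
To make the idea concrete, here is a toy version of the compatibility check for a two-node pipeline. It is a sketch under several assumptions: the component names are invented, the integer constants are copied from the debugger output above, matches is a plain dict instead of a NumPy array, and, unlike the real get_match_array, it does not update the sparse/dense state from each node's output.

```python
import itertools

# Integer values as they appear in the debugger output above (assumed mapping).
DENSE, SPARSE, UNSIGNED_DATA = 5, 6, 10

# Hypothetical candidates per node: component name -> accepted input types.
node_choices = [
    {"no_rescaling": (DENSE, SPARSE, UNSIGNED_DATA),
     "dense_only_scaler": (DENSE, UNSIGNED_DATA)},
    {"sparse_ok_clf": (DENSE, SPARSE, UNSIGNED_DATA),
     "dense_only_clf": (DENSE, UNSIGNED_DATA),
     "signed_only_clf": (DENSE, SPARSE)},
]


def build_matches(node_choices, data_is_sparse, dataset_is_signed):
    """Return {coordinate: 1 or 0} for every combination of node options."""
    inputs_per_node = [list(options.values()) for options in node_choices]
    idx_ranges = [range(len(inputs)) for inputs in inputs_per_node]
    matches = {}
    for idxs in itertools.product(*idx_ranges):
        valid = 1
        for inputs, idx in zip(inputs_per_node, idxs):
            node_input = inputs[idx]
            # Same two checks as in the snippet quoted above.
            if (data_is_sparse and SPARSE not in node_input) or \
                    (not data_is_sparse and DENSE not in node_input):
                valid = 0
                break
            if not dataset_is_signed and UNSIGNED_DATA not in node_input:
                valid = 0
                break
        matches[idxs] = valid
    return matches


sparse_matches = build_matches(node_choices, data_is_sparse=True,
                               dataset_is_signed=False)
dense_matches = build_matches(node_choices, data_is_sparse=False,
                              dataset_is_signed=True)
print(sum(sparse_matches.values()), sum(dense_matches.values()))
```

With sparse, unsigned data only the combination of the two components that accept everything survives, while dense, signed data leaves all six combinations valid.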

Back in: autosklearn.pipeline.base.BasePipeline#_get_base_search_space:293

            if not is_choice:
                cs.add_configuration_space(node_name,
                                           node.get_hyperparameter_search_space(dataset_properties))
            # If the node is a choice, we have to figure out which of its
            #  choices are actually legal choices
            else:
                choices_list = \
                    autosklearn.pipeline.create_searchspace_util.find_active_choices(
                        matches, node, node_idx,
                        dataset_properties,
                        include.get(node_name),
                        exclude.get(node_name)
                    )
                sub_config_space = node.get_hyperparameter_search_space(
                    dataset_properties, include=choices_list)
                cs.add_configuration_space(node_name, sub_config_space)

For a choice node, execution enters the else branch; choices_list holds all the candidate options:

choices_list
Out[29]: ['no_encoding', 'one_hot_encoding']

Next we print sub_config_space:

sub_config_space
Out[30]: 
Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {no_encoding, one_hot_encoding}, Default: one_hot_encoding
    one_hot_encoding:minimum_fraction, Type: UniformFloat, Range: [0.0001, 0.5], Default: 0.01, on log-scale
    one_hot_encoding:use_minimum_fraction, Type: Categorical, Choices: {True, False}, Default: True
  Conditions:
    one_hot_encoding:minimum_fraction | one_hot_encoding:use_minimum_fraction == 'True'
    one_hot_encoding:use_minimum_fraction | __choice__ == 'one_hot_encoding'
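
The two lines under Conditions read as "child | parent == value": the child hyperparameter only takes part in the search when its parent has the given value. The sketch below models that rule with plain dictionaries. It is not the ConfigSpace API; the conditions table and both helper functions are hypothetical and only mirror the printout above.

```python
# child hyperparameter -> (parent hyperparameter, value the parent must have)
conditions = {
    "one_hot_encoding:minimum_fraction":
        ("one_hot_encoding:use_minimum_fraction", "True"),
    "one_hot_encoding:use_minimum_fraction":
        ("__choice__", "one_hot_encoding"),
}
all_hyperparameters = ["__choice__", *conditions]


def is_active(name, config):
    """Active if unconditioned, or if the parent is active with the right value."""
    if name not in conditions:
        return True
    parent, required_value = conditions[name]
    return is_active(parent, config) and config.get(parent) == required_value


def active_hyperparameters(config):
    return {name for name in all_hyperparameters if is_active(name, config)}


no_encoding = active_hyperparameters({"__choice__": "no_encoding"})
ohe_without_fraction = active_hyperparameters({
    "__choice__": "one_hot_encoding",
    "one_hot_encoding:use_minimum_fraction": "False",
})
ohe_with_fraction = active_hyperparameters({
    "__choice__": "one_hot_encoding",
    "one_hot_encoding:use_minimum_fraction": "True",
})
print(no_encoding, ohe_without_fraction, ohe_with_fraction)
```

The conditions chain: minimum_fraction depends on use_minimum_fraction, which in turn depends on __choice__, so choosing no_encoding switches off both.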

Printing the feature preprocessing part:

Configuration space object:
  Hyperparameters:
    __choice__, Type: Categorical, Choices: {extra_trees_preproc_for_classification, fast_ica, feature_agglomeration, kernel_pca, kitchen_sinks, liblinear_svc_preprocessor, no_preprocessing, nystroem_sampler, pca, polynomial, random_trees_embedding, select_percentile_classification, select_rates}, Default: no_preprocessing
    extra_trees_preproc_for_classification:bootstrap, Type: Categorical, Choices: {True, False}, Default: False
    extra_trees_preproc_for_classification:criterion, Type: Categorical, Choices: {gini, entropy}, Default: gini
    extra_trees_preproc_for_classification:max_depth, Type: Constant, Value: None
    extra_trees_preproc_for_classification:max_features, Type: UniformFloat, Range: [0.0, 1.0], Default: 0.5
    extra_trees_preproc_for_classification:max_leaf_nodes, Type: Constant, Value: None
    extra_trees_preproc_for_classification:min_impurity_decrease, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:min_samples_leaf, Type: UniformInteger, Range: [1, 20], Default: 1
    extra_trees_preproc_for_classification:min_samples_split, Type: UniformInteger, Range: [2, 20], Default: 2
    extra_trees_preproc_for_classification:min_weight_fraction_leaf, Type: Constant, Value: 0.0
    extra_trees_preproc_for_classification:n_estimators, Type: Constant, Value: 100
    fast_ica:algorithm, Type: Categorical, Choices: {parallel, deflation}, Default: parallel
    fast_ica:fun, Type: Categorical, Choices: {logcosh, exp, cube}, Default: logcosh
    fast_ica:n_components, Type: UniformInteger, Range: [10, 2000], Default: 100
    fast_ica:whiten, Type: Categorical, Choices: {False, True}, Default: False
    feature_agglomeration:affinity, Type: Categorical, Choices: {euclidean, manhattan, cosine}, Default: euclidean
    feature_agglomeration:linkage, Type: Categorical, Choices: {ward, complete, average}, Default: ward
    feature_agglomeration:n_clusters, Type: UniformInteger, Range: [2, 400], Default: 25
    feature_agglomeration:pooling_func, Type: Categorical, Choices: