Understanding sklearn.compose.make_column_transformer

梅津太郎

Category: sklearn

Article link: https://blog.csdn.net/gaocui883/article/details/111474422

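The problem:

After building a transformer with make_column_transformer and applying it to the data, the output has a different number of columns than the input.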

sklearn.compose.make_column_transformer(*transformers, **kwargs)

Parameters
*transformers : tuples
Tuples of the form (transformer, columns) specifying the transformer objects to be applied to subsets of the data.

transformer : {'drop', 'passthrough'} or estimator
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

columns : str, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.

remainder : {'drop', 'passthrough'} or estimator, default='drop'
By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.

sparse_threshold : float, default=0.3
If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.

n_jobs : int, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbose : bool, default=False
If True, the time elapsed while fitting each transformer will be printed as it is completed.

Returns
ct : ColumnTransformer
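
The columns entry is easy to get wrong, so here is a minimal sketch of two of the forms described above: a scalar string for a transformer that expects 1d input, and make_column_selector for picking columns by dtype. The DataFrame and its column names (review, price, weight) are made up for illustration, and CountVectorizer is only one example of a transformer that wants a 1d sequence.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: one text column and two numeric columns.
df = pd.DataFrame({
    "review": ["red apple", "green apple", "red pear"],
    "price": [1.0, 2.0, 3.0],
    "weight": [0.5, 0.7, 0.9],
})

ct = make_column_transformer(
    # Scalar string: CountVectorizer receives a 1d sequence of documents.
    (CountVectorizer(), "review"),
    # Selector: every numeric column is passed together as a 2d block.
    (StandardScaler(), make_column_selector(dtype_include="number")),
)

out = ct.fit_transform(df)
n_words = len(ct.named_transformers_["countvectorizer"].vocabulary_)
print(n_words)    # size of the vocabulary learned from "review"
print(out.shape)  # rows x (vocabulary size + number of numeric columns)
```

Writing ["review"] (a list) instead of "review" would hand CountVectorizer a 2d one-column selection, which is not the input it expects.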

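The official documentation is fairly clear: the cause is the remainder parameter. Its default is 'drop', which means that any column not assigned to a transformer is discarded automatically. That is why the input has a different number of columns before and after the transformation.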
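
To make the effect concrete, here is a small sketch on a toy DataFrame. The column names other_a and other_b and all the values are invented for illustration; only the default remainder='drop' behaviour comes from the documentation quoted above.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical data: two columns we transform and two we never mention.
df = pd.DataFrame({
    "numerical_column": [1.0, 2.0, 3.0],
    "categorical_column": ["a", "b", "a"],
    "other_a": [10.0, 20.0, 30.0],
    "other_b": [0.1, 0.2, 0.4],
})

ct = make_column_transformer(
    (StandardScaler(), ["numerical_column"]),
    (OneHotEncoder(), ["categorical_column"]),
)  # remainder is left at its default, 'drop'

out = ct.fit_transform(df)
print(df.shape)   # input: 3 rows, 4 columns
print(out.shape)  # output: 1 scaled column + one column per category
```

Two things change the width here: other_a and other_b are dropped, and OneHotEncoder expands the single categorical column into one column per category. So the column count can differ even when nothing is dropped.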

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
make_column_transformer(
    (StandardScaler(), ['numerical_column']),
    (OneHotEncoder(), ['categorical_column']))

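Here the numerical column is standardized (not regularized), the categorical column is one-hot encoded, and every other column is dropped by default. We choose the transformers, for example StandardScaler() and OneHotEncoder(), and the columns we care about. If you do not want to lose the remaining columns, set remainder='passthrough' to keep them unchanged, or pass a transformer such as remainder=MinMaxScaler() to apply it to them.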
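
The three remainder settings can be compared side by side. This is a sketch under the same assumptions as before: the DataFrame, the extra columns other_a and other_b, and the helper build are hypothetical and exist only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder, StandardScaler

# Hypothetical data; the remaining columns are numeric so MinMaxScaler applies.
df = pd.DataFrame({
    "numerical_column": [1.0, 2.0, 3.0],
    "categorical_column": ["a", "b", "a"],
    "other_a": [10.0, 20.0, 30.0],
    "other_b": [0.1, 0.2, 0.4],
})

def build(remainder):
    """Illustrative helper: same transformers, different remainder setting."""
    return make_column_transformer(
        (StandardScaler(), ["numerical_column"]),
        (OneHotEncoder(), ["categorical_column"]),
        remainder=remainder,
    )

out_drop = build("drop").fit_transform(df)            # unlisted columns removed
out_pass = build("passthrough").fit_transform(df)     # unlisted columns kept as-is
out_scaled = build(MinMaxScaler()).fit_transform(df)  # unlisted columns rescaled

for name, out in [("drop", out_drop), ("passthrough", out_pass),
                  ("MinMaxScaler", out_scaled)]:
    print(name, out.shape)
```

The remaining columns are appended after the outputs of the listed transformers, so in the last two results they are the final two columns.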