sklearn.compose.make_column_transformer 解读

遇到的问题:

构建一个make_column_transformer 后,对数据进行使用,前后的维度不一样问题。

sklearn.compose.make_column_transformer(*transformers, **kwargs)
Parameters
*transformerstuples
Tuples of the form (transformer, columns) specifying the transformer objects to be applied to subsets of the data.

transformer{‘drop’, ‘passthrough’} or estimator
Estimator must support fit and transform. Special-cased strings ‘drop’ and ‘passthrough’ are accepted as well, to indicate to drop the columns or to pass them through untransformed, respectively.

columnsstr, array-like of str, int, array-like of int, slice, array-like of bool or callable
Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. A callable is passed the input data X and can return any of the above. To select multiple columns by name or dtype, you can use make_column_selector.

remainder{‘drop’, ‘passthrough’} or estimator, default=’drop’
By default, only the specified columns in transformers are transformed and combined in the output, and the non-specified columns are dropped. (default of 'drop'). By specifying remainder='passthrough', all remaining columns that were not specified in transformers will be automatically passed through. This subset of columns is concatenated with the output of the transformers. By setting remainder to be an estimator, the remaining non-specified columns will use the remainder estimator. The estimator must support fit and transform.

sparse_thresholdfloat, default=0.3
If the transformed output consists of a mix of sparse and dense data, it will be stacked as a sparse matrix if the density is lower than this value. Use sparse_threshold=0 to always return dense. When the transformed output consists of all sparse or all dense data, the stacked result will be sparse or dense, respectively, and this keyword will be ignored.

n_jobsint, default=None
Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.

verbosebool, default=False
If True, the time elapsed while fitting each transformer will be printed as it is completed.

Returns
ctColumnTransformer

官方文档说的比较清楚,在于remainder这个参数,默认是dropout的,意思就是没有进行转化的列,自动丢弃,所以就导致了input转换 前后维度不一样的结果。

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
make_column_transformer(
    (StandardScaler(), ['numerical_column']),
    (OneHotEncoder(), ['categorical_column']))

数字列正则化,分类的列onehot编码,其他列就默认丢弃了,当然,我们可以指定Transformer :比如StandardScaler(),OneHotEncoder(),和自己关心的那些列。当然,如果不想丢弃剩下的列,也可以将remainder = ‘passthhrough’,不丢弃。或者remainder = MinMaxScaler() 之类的transformer。

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值