Data Preprocessing: QuantileTransformer
class sklearn.preprocessing.QuantileTransformer(*, n_quantiles=1000, output_distribution='uniform', ignore_implicit_zeros=False, subsample=100000, random_state=None, copy=True)
Transform features using quantiles information.
This method transforms the features to follow a uniform or a normal distribution. Therefore, for a given feature, this transformation tends to spread out the most frequent values. It also reduces the impact of (marginal) outliers: this is therefore a robust preprocessing scheme.
The transformation is applied on each feature independently. First, an estimate of the cumulative distribution function of a feature is used to map the original values to a uniform distribution. The obtained values are then mapped to the desired output distribution using the associated quantile function. Feature values of new/unseen data that fall below or above the fitted range will be mapped to the bounds of the output distribution. Note that this transform is non-linear. It may distort linear correlations between variables measured at the same scale, but it renders variables measured at different scales more directly comparable.
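The two-step mapping described above (empirical CDF to uniform, then quantile function to the target distribution) can be sketched by hand. This is a simplified illustration, not scikit-learn's actual implementation: the helper names `fit_quantiles`, `to_uniform` and `to_normal` and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np
from scipy.stats import norm


def fit_quantiles(x, n_quantiles=10):
    """Estimate the empirical CDF of x at evenly spaced probability levels."""
    references = np.linspace(0.0, 1.0, n_quantiles)  # probability levels
    quantiles = np.quantile(x, references)           # matching feature values
    return references, quantiles


def to_uniform(x, references, quantiles):
    """Map values to [0, 1] by piecewise-linear interpolation of the CDF.

    np.interp clamps inputs outside the fitted range to the end points,
    so unseen values below/above the range map to 0 and 1.
    """
    return np.interp(x, quantiles, references)


def to_normal(u, eps=1e-7):
    """Apply the normal quantile function; eps (assumed) avoids +/- infinity."""
    return norm.ppf(np.clip(u, eps, 1.0 - eps))


rng = np.random.RandomState(0)
x_train = rng.exponential(size=1000)  # a skewed feature
references, quantiles = fit_quantiles(x_train)

# One value below the fitted range, the median, and one value far above it
x_new = np.array([-5.0, np.median(x_train), 1e6])
u = to_uniform(x_new, references, quantiles)
z = to_normal(u)
print(u)
print(z)
```

The out-of-range inputs land on the bounds of the uniform output, and the median lands near the middle, which is the behaviour the description above attributes to the transformer.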

Code example:
import numpy as np
from sklearn.preprocessing import QuantileTransformer
rng = np.random.RandomState(0)
X = np.sort(rng.normal(loc=0.5, scale=0.25, size=(25, 1)), axis=0)
qt = QuantileTransformer(n_quantiles=10, random_state=0)
qt.fit_transform(X)
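As a complement to the example above, the following sketch shows how values far outside the fitted range are handled for both output distributions. The data and the variable names (`X_fit`, `X_new`, `qt_uniform`, `qt_normal`) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
X_fit = rng.normal(loc=0.5, scale=0.25, size=(25, 1))
X_new = np.array([[-10.0], [10.0]])  # far outside the fitted range

# Uniform output: out-of-range values are mapped to the bounds 0 and 1
qt_uniform = QuantileTransformer(n_quantiles=10, random_state=0)
qt_uniform.fit(X_fit)
out_uniform = qt_uniform.transform(X_new)

# Normal output: out-of-range values are mapped to finite extreme values
qt_normal = QuantileTransformer(
    n_quantiles=10, output_distribution="normal", random_state=0
)
qt_normal.fit(X_fit)
out_normal = qt_normal.transform(X_new)

print(out_uniform.ravel())
print(out_normal.ravel())
```

With the normal output, the extremes come out as finite values of large magnitude rather than infinities, so downstream models still receive usable numbers.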
The code used for the MoA competition is shown below. This step has a large effect on the results.
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# GENES, CELLS, IS_TRAIN, MODEL_DIR, NB, train_features and test_features
# are defined earlier in the notebook.
use_test_for_preprocessing = False

for col in (GENES + CELLS):
    # Column lengths must be known before the reshape calls below
    vec_len = len(train_features[col].values)
    vec_len_test = len(test_features[col].values)
    if IS_TRAIN:
        transformer = QuantileTransformer(n_quantiles=100, random_state=0, output_distribution="normal")
        if use_test_for_preprocessing:
            # Fit on train and test together
            raw_vec = pd.concat([train_features, test_features])[col].values.reshape(vec_len + vec_len_test, 1)
        else:
            # Fit on the training data only
            raw_vec = train_features[col].values.reshape(vec_len, 1)
        transformer.fit(raw_vec)
        pd.to_pickle(transformer, f'{MODEL_DIR}/{NB}_{col}_quantile_transformer.pkl')
    else:
        # Inference: reuse the transformer fitted during training
        transformer = pd.read_pickle(f'{MODEL_DIR}/{NB}_{col}_quantile_transformer.pkl')
    train_features[col] = transformer.transform(train_features[col].values.reshape(vec_len, 1)).reshape(1, vec_len)[0]
    test_features[col] = transformer.transform(test_features[col].values.reshape(vec_len_test, 1)).reshape(1, vec_len_test)[0]
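Because the transformation is applied to each feature independently, a single transformer fitted on all columns at once should give the same result as the per-column loop, at least when the number of rows is below `subsample` so that no subsampling takes place. The sketch below checks this on synthetic data. The column names and DataFrames are hypothetical stand-ins, not the competition data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

rng = np.random.RandomState(0)
cols = ["g-0", "g-1", "c-0"]  # hypothetical column names
train = pd.DataFrame(rng.lognormal(size=(200, 3)), columns=cols)
test = pd.DataFrame(rng.lognormal(size=(50, 3)), columns=cols)

# One transformer for all columns: fit on train, apply to train and test
qt_all = QuantileTransformer(
    n_quantiles=100, output_distribution="normal", random_state=0
)
train_all = qt_all.fit_transform(train[cols].values)
test_all = qt_all.transform(test[cols].values)

# Per-column loop, one transformer per feature
train_loop = np.empty_like(train_all)
for j, col in enumerate(cols):
    qt_col = QuantileTransformer(
        n_quantiles=100, output_distribution="normal", random_state=0
    )
    train_loop[:, j] = qt_col.fit_transform(train[[col]].values)[:, 0]

same = np.allclose(train_all, train_loop)
print(same)
```

A single fitted object is also simpler to persist than one pickle per column, although the per-column files make it possible to inspect or replace individual transformers.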

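This article introduces the sklearn.preprocessing.QuantileTransformer class, which transforms features to follow a uniform or normal distribution. The transformation is applied to each feature independently and reduces the impact of marginal outliers. Example code shows how it is used to preprocess a real dataset.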




