Data Mining (Data Preprocessing, Feature Engineering)

濯君

Categories: Data Mining, Machine Learning

Article link: https://blog.csdn.net/zzldm/article/details/100186903

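1.StandardScaler
Ensures that every feature has mean 0 and variance 1 after scaling, but does not guarantee any particular minimum or maximum value for the feature.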
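
As a minimal sketch of this behavior, the snippet below scales a small made-up array (the values are illustrative, not from the original post) and prints the per-feature mean, standard deviation, and range after scaling.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (made up for illustration): two features on very different scales
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0],
              [4.0, 900.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for each feature
print(X_scaled.std(axis=0))   # approximately 1 for each feature
# The minimum and maximum depend on the data; they are not pinned to a fixed range
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```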

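2.RobustScaler
Similar to StandardScaler in that it brings all features onto the same scale, but it uses the median and quartiles instead of the mean and variance, so outliers have little influence on the result.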
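
The difference is easiest to see on data with one extreme value. The sketch below uses a made-up single-feature array and compares the two scalers; with StandardScaler the outlier inflates the mean and variance, which squeezes the four ordinary values together.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Toy data (made up for illustration): four ordinary values and one outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# RobustScaler subtracts the median and divides by the interquartile range
robust = RobustScaler().fit_transform(X)
# StandardScaler subtracts the mean and divides by the standard deviation
standard = StandardScaler().fit_transform(X)

print(robust.ravel())    # ordinary values stay well spread around 0
print(standard.ravel())  # ordinary values are squeezed into a tiny interval
```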

3.MinMaxScaler
Scales each feature so that its values fall in the range [0, 1].
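
A small sketch with made-up numbers: the scaler learns the minimum and maximum from the training data only, so values in new data that lie outside the training range map outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data (made up for illustration)
X_train = np.array([[1.0, 10.0],
                    [2.0, 20.0],
                    [5.0, 50.0]])
X_test = np.array([[0.0, 60.0]])

# Learn per-feature min and max on the training set only
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)  # every column spans exactly 0 to 1
print(X_test_scaled)   # values outside the training range leave [0, 1]
```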

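4.Normalizer
Scales each data point (each row, not each feature) so that its feature vector has Euclidean length 1, i.e. it projects every point onto a circle of radius 1 (a sphere in higher dimensions); generally used when only the direction of the data, not its magnitude, matters for the prediction.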
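
The following sketch, on made-up points, shows that Normalizer works row by row: every sample ends up with Euclidean norm 1 while its direction is preserved.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data (made up for illustration): three 2-D points of different lengths
X = np.array([[3.0, 4.0],
              [10.0, 0.0],
              [0.5, 0.5]])

# The default norm is "l2", i.e. Euclidean length
X_norm = Normalizer().fit_transform(X)

print(X_norm)
print(np.linalg.norm(X_norm, axis=1))  # every row now has length 1
```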

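5.PCA
Used for dimensionality reduction. Strictly speaking this is feature extraction rather than feature selection: the new features are linear combinations of the original ones, not a subset of them;
scale the data before applying PCA, otherwise features with large numeric ranges dominate the components.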
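
One way to make sure scaling always happens before PCA is to chain the two steps in a pipeline. The sketch below assumes the iris dataset bundled with scikit-learn purely as example data; the choice of two components is arbitrary.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Example data only: 150 samples, 4 features
X, _ = load_iris(return_X_y=True)

# Scale first, then project onto the first two principal components
pipe = make_pipeline(StandardScaler(), PCA(n_components=2))
X_pca = pipe.fit_transform(X)

print(X.shape, X_pca.shape)
# Fraction of the total variance captured by each component
ratio = pipe.named_steps["pca"].explained_variance_ratio_
print(ratio)
```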

6.Non-negative matrix factorization (NMF)
Extracts useful features;
the input data must be non-negative.
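
A minimal sketch on random non-negative data (the data, the number of components, and the `init` and `max_iter` settings are all arbitrary choices for illustration). It factorizes X into two non-negative matrices and shows that input with negative entries is rejected.

```python
import numpy as np
from sklearn.decomposition import NMF

# Random non-negative data (made up for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, size=(20, 6))

nmf = NMF(n_components=3, init="nndsvda", random_state=0, max_iter=500)
W = nmf.fit_transform(X)  # per-sample weights of the 3 components
H = nmf.components_       # the 3 components expressed in the original features

print(W.shape, H.shape)

# Shifting the data below zero makes NMF raise an error
rejected = False
try:
    NMF(n_components=3).fit(X - 0.5)
except ValueError as err:
    rejected = True
    print("NMF rejected negative input:", err)
```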

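7.Binning, Discretization (converting continuous values to discrete ones)
Helps linear models but does nothing for decision trees (DT);
effective when the dataset is large and high-dimensional and some features have a nonlinear relationship with the output.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import OneHotEncoder

# X (shape (n_samples, 1)) and y are assumed to be defined already
bins = np.linspace(-3, 3, 11)  # 11 edges define 10 bins
which_bin = np.digitize(X, bins=bins)  # index of the bin each value of X falls into

# one-hot encode the bin indices
# (scikit-learn versions before 1.2 use sparse=False instead of sparse_output=False)
encoder = OneHotEncoder(sparse_output=False)
# encoder.fit finds the unique values that appear in which_bin
encoder.fit(which_bin)
# transform creates the one-hot encoding
X_binned = encoder.transform(which_bin)
print(X_binned[:5])

# the linear model now learns one constant value per bin
reg = LinearRegression().fit(X_binned, y)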

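8.PolynomialFeatures(degree=2)
Polynomial features generally only help linear models and naive Bayes; tree-based models can find these interactions between features on their own, so the data does not need to be transformed for them.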
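
To make the expansion concrete, the sketch below applies `PolynomialFeatures(degree=2)` to a single made-up sample with two features. The output contains the bias term, the original features, their squares, and their product.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One sample with two features (made up for illustration)
X = np.array([[2.0, 3.0]])

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)

# Columns are: 1, x0, x1, x0^2, x0*x1, x1^2
print(X_poly)
```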

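9.Consider applying a log or exp transformation to the data; for example, log(x + 1) pulls in the long tail of a heavily right-skewed, non-negative feature.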
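
A hedged sketch of one way to do this inside scikit-learn: wrapping `np.log1p` (log(x + 1)) and its inverse `np.expm1` in a `FunctionTransformer`. The data is made up, and the choice of `log1p` over a plain log is an assumption that keeps zeros valid.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Toy skewed feature (made up for illustration), shape (n_samples, 1)
X = np.array([[0.0], [1.0], [9.0], [99.0], [999.0]])

# log1p(x) = log(x + 1); expm1 is its exact inverse
log_transformer = FunctionTransformer(np.log1p, inverse_func=np.expm1)
X_log = log_transformer.fit_transform(X)
X_back = log_transformer.inverse_transform(X_log)

print(X_log.ravel())   # the large values are compressed
print(X_back.ravel())  # the original values are recovered
```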

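10.Selecting features by percentile

from sklearn.feature_selection import SelectPercentile

# train_x and train_y are assumed to be defined already
# keep the top 75% of features, ranked by a univariate score (f_classif by default)
select = SelectPercentile(percentile=75)
select.fit(train_x, train_y)
# transform training set
train_x_selected = select.transform(train_x)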

11.Selecting features with a model

from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier

# X_train and y_train are assumed to be defined already
select = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold="median")  # threshold at the median importance, so about half of the features are kept

select.fit(X_train, y_train)
X_train_l1 = select.transform(X_train)
print("X_train.shape: {}".format(X_train.shape))
print("X_train_l1.shape: {}".format(X_train_l1.shape))

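12.Recursive feature elimination (RFE)

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# X_train and y_train are assumed to be defined already
select = RFE(RandomForestClassifier(n_estimators=100, random_state=42),
             n_features_to_select=40)  # keep 40 features
select.fit(X_train, y_train)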
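
For a version that runs on its own, the sketch below applies RFE to a small synthetic dataset. The dataset, the forest size, and the choice of keeping 3 features are assumptions made for illustration, not values from the original post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic data: 10 features, of which 3 are informative
X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE repeatedly fits the model and drops the least important feature
select = RFE(RandomForestClassifier(n_estimators=20, random_state=0),
             n_features_to_select=3)
X_selected = select.fit_transform(X, y)

print(X_selected.shape)
print(select.support_)  # boolean mask of the features that were kept
print(select.ranking_)  # rank 1 marks a kept feature; higher ranks were dropped earlier
```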

13.Selecting features by Pearson correlation

import numpy as np

def cor_selector(X, y):
    feature_name = X.columns.tolist()
    cor_list = []
    # calculate the correlation with y for each feature
    for i in feature_name:
        cor = np.corrcoef(X[i], y)[0, 1]
        cor_list.append(cor)
    # replace NaN with 0
    cor_list = [0 if np.isnan(i) else i for i in cor_list]
    # names of the 100 features with the largest absolute correlation
    cor_feature = X.iloc[:, np.argsort(np.abs(cor_list))[-100:]].columns.tolist()
    # feature selection mask: False for not selected, True for selected
    cor_support = [True if i in cor_feature else False for i in feature_name]
    return cor_support, cor_feature

# train_x and train_y are assumed to be DataFrames that are defined already
_, cor_feature = cor_selector(train_x, train_y['isFraud'])
train_x = train_x[cor_feature]

14.Using Label Encoding to convert categorical feature values to numeric ones

from sklearn import preprocessing

# Label Encoding
for f in df_train.drop('isFraud', axis=1).columns:
    if df_train[f].dtype=='object' or df_test[f].dtype=='object': 
        lbl = preprocessing.LabelEncoder()
        lbl.fit(list(df_train[f].values) + list(df_test[f].values))
        df_train[f] = lbl.transform(list(df_train[f].values))
        df_test[f] = lbl.transform(list(df_test[f].values))

15.Reducing the memory used by a DataFrame

import numpy as np

## Function to reduce the DF size
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if str(col_type) in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                # pick the smallest integer type whose range covers [c_min, c_max]
                if c_min >= np.iinfo(np.int8).min and c_max <= np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min >= np.iinfo(np.int16).min and c_max <= np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min >= np.iinfo(np.int32).min and c_max <= np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                # otherwise the column already needs int64 and is left unchanged
            else:
                # note: float16 keeps only about 3 significant decimal digits,
                # so downcasting floats trades precision for memory
                if c_min >= np.finfo(np.float16).min and c_max <= np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min >= np.finfo(np.float32).min and c_max <= np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
