Filtering
The goal of univariate feature selection is to evaluate each feature's individual contribution to the model's predictive power. When choosing a model for univariate feature selection, consider the following factors:
- Model simplicity: univariate feature selection is usually a preliminary screening step, so the model should be simple and easy to interpret. Linear regression is the most common choice because its coefficients can be read directly as each feature's effect on the target variable.
- Computational efficiency: since a separate model is trained for every feature, the model should be cheap to fit so that a large number of features can be processed within limited time and resources.
- Suitability for the data: the model should match the data distribution and the problem type. For example, if the relationships are non-linear, consider models such as decision trees or support vector machines.
- A clear evaluation metric: the model should expose an explicit metric, such as mean squared error (MSE) or R-squared, so that the performance of different features can be compared.
- Independence between features: if the features are multicollinear, you may need a model that can handle relationships between features, such as a regularized linear regression (Lasso or Ridge).
Based on these considerations, the following models are commonly used for univariate feature selection:
- Linear regression: suitable when the relationships are clearly linear; simple and easy to interpret. Each feature's importance can be assessed from its coefficient.
- Decision trees: can handle non-linear relationships; a feature's importance can be read from its position in the tree and its importance score.
- Support vector machines (SVM): suitable for high-dimensional data; a feature's importance can be determined from its weight in the support vectors.
- Lasso regression: L1 regularization shrinks the coefficients of unimportant features to zero, performing feature selection directly.
- Random forests: build many decision trees and average each feature's importance across all of them.
When choosing a model, also consider the characteristics of the data and the specific requirements of the problem. For example, a very large dataset may call for a model that handles big data efficiently, while if interpretability of the predictions matters, a linear model may be preferable. Ultimately, the purpose of univariate feature selection is to find the most influential features and improve the model's predictive performance and interpretability.
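As a concrete illustration of this filter approach, scikit-learn exposes several univariate scoring functions behind a single interface. The following is a minimal sketch on a synthetic dataset; the choice of k=10 and the ANOVA F-test are illustrative assumptions, not prescriptions from the text above.

# Minimal univariate filter sketch with SelectKBest; the synthetic dataset
# and k=10 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 25 features, only 5 of which are informative
X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)

# Score each feature independently with the ANOVA F-test; keep the top 10
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # (500, 10)
print(selector.get_support())  # boolean mask over the original columns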
Feature Selection by ROC AUC for Classification (Univariate Feature Selection)
Import the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.feature_selection import VarianceThreshold
Load the data
data = pd.read_csv('',nrows=2000)
data.head()
X = data.drop(['TARGET'],axis=1)
y = data['TARGET']
X.shape,y.shape
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
Remove constant, quasi-constant, and duplicated features
# Remove constant and quasi-constant features (variance below 0.01)
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
print(x_train_filter.shape,x_test_filter.shape)
# Remove duplicated features
# Transpose so that columns become rows, which lets pandas' duplicated() find identical features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
x_train_T = pd.DataFrame(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
print(x_train_T.duplicated().sum())
duplicated_features = x_train_T.duplicated()
print(duplicated_features)
# Keep only the features that are not duplicates, then transpose back
features_to_keep = [not dup for dup in duplicated_features]
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
print(x_train_unique.shape,x_train.shape)
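Assuming x_train_filter fits in memory as a DataFrame, the transpose-and-deduplicate step above can also be written more compactly with pandas' drop_duplicates; this is just an equivalent idiom, not a different method.

# Equivalent one-liner: transpose so features become rows, drop duplicated
# rows (keep='first', the default, retains the first of each identical group),
# then transpose back
x_train_unique_alt = pd.DataFrame(x_train_filter).T.drop_duplicates().T
print(x_train_unique_alt.shape)

For the test set, reuse the columns kept on the training set (as the mask-based code above does) rather than deduplicating it independently.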
Compute the ROC AUC scores
# Compute a ROC AUC score for each feature individually
roc_auc = []
for feature in x_train_unique.columns:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # to_frame() turns the single column into the 2-D input that fit() expects
    clf.fit(x_train_unique[feature].to_frame(), y_train)
    y_pred = clf.predict(x_test_unique[feature].to_frame())
    roc_auc.append(roc_auc_score(y_test, y_pred))
print(roc_auc)
roc_value = pd.Series(roc_auc)
roc_value.index = x_train_unique.columns
roc_value.sort_values(ascending= False,inplace=True)
print(roc_value)
# In a binary classification problem, a feature whose ROC AUC is below 0.5 carries no information for predicting the output
roc_value.plot.bar()
sel = roc_value[roc_value>0.5]
print(sel)
x_train_roc = x_train_unique[sel.index]
x_test_roc = x_test_unique[sel.index]
Build the model and compare
# Build the model and compare
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set:', accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_roc,x_test_roc,y_train,y_test)
print(x_train_roc.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
Feature Selection by Mean Squared Error (MSE) for Regression (Univariate Feature Selection)
In regression problems, using the mean squared error (MSE) as the selection criterion is a common practice, especially for univariate feature selection, where each feature is considered on its own to evaluate its contribution to the model's predictive power. The general steps for univariate feature selection with MSE are:
- Data preparation: first, prepare the dataset, including all candidate features and the target variable.
- Model choice: pick a suitable regression model. For univariate feature selection a simple linear regression is usually enough, but any other appropriate regression model also works.
- Model training: for each candidate feature, train a regression model on that feature and the target variable; with \( n \) features, you will train \( n \) models.
- MSE computation: for each trained model, compute its MSE on the validation set. The MSE is the mean of the squared differences between predicted and actual values:
  \( \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \)
  where \( y_i \) is the actual value, \( \hat{y}_i \) is the model's prediction, and \( n \) is the number of samples.
- Feature evaluation: judge each feature by its MSE. In general, features with a smaller MSE contribute more, because they let the model predict more accurately.
- Best-feature selection: pick the feature with the smallest MSE as the best single feature; it is considered the most influential for predicting the target variable.
- Iteration: to select several features, repeat the process, adding one new feature at a time and recomputing the MSE, until the model contains all the features you want.
Note that univariate feature selection does not account for interactions between features, so in some cases a combination of features may predict better. It also may not find the optimal feature subset, but it offers a simple and fast way to gauge the importance of individual features.
Below is a simple Python example using univariate feature selection and MSE:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.model_selection import train_test_split
boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this example needs scikit-learn < 1.2
print(boston.DESCR)
X = pd.DataFrame(boston.data,columns=boston.feature_names)
X.head()
y = boston.target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
mse = []
for feature in x_train.columns:
clf = LinearRegression()
clf.fit(x_train[feature].to_frame(),y_train)
y_pred = clf.predict(x_test[feature].to_frame())
mse.append(mean_squared_error(y_test,y_pred))
print(mse)  # the MSE of each individual feature
mse = pd.Series(mse,index=x_train.columns)
mse.sort_values(ascending=False,inplace=True)
print(mse)  # a higher MSE means a larger error, so the features with the lowest MSE matter most for prediction
mse.plot.bar()
x_train_2 = x_train[['RM','LSTAT']]
x_test_2 = x_test[['RM','LSTAT']]
# %%time
model = LinearRegression()
model.fit(x_train_2,y_train)
y_pred = model.predict(x_test_2)
print('r2_score:',r2_score(y_test,y_pred))
print('rmse:',np.sqrt(mean_squared_error(y_test,y_pred)))
print('sd of house price: ',np.std(y))  # check whether the RMSE of the predictions is below the standard deviation of house prices
# %%time
model = LinearRegression()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print('r2_score:',r2_score(y_test,y_pred))
print('rmse:',np.sqrt(mean_squared_error(y_test,y_pred)))
print('sd of house price: ',np.std(y))  # check whether the RMSE of the predictions is below the standard deviation of house prices
In this example we load the Boston housing data, fit a univariate regression for each feature, and compute its MSE on the test set; the feature with the smallest MSE is taken as the best one. This process shows how much each individual feature contributes to the model's predictive power and which features are most helpful for predicting the target variable.
If the features are multicollinear, further treatment may be needed, such as regularization (e.g., Lasso or Ridge regression) or principal component analysis (PCA), to improve the model's stability and interpretability. A Lasso regression code example follows.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
data = pd.read_csv('PATH')
features = data.drop(['TARGET'], axis=1)
target = data['TARGET']
x_train,x_test,y_train,y_test = train_test_split(features,target,test_size=0.2,random_state=0)
# Lasso regression performs feature selection by shrinking coefficients to zero,
# so features whose coefficient is zero can be treated as unimportant.
# Adjust the regularization parameter alpha as needed.
# Initialize the Lasso regression model
lasso = Lasso(alpha=0.1)  # tune the regularization parameter alpha as needed
# Train the Lasso regression model
lasso.fit(x_train, y_train)
# Get the feature coefficients; a coefficient of 0 means the feature is unimportant
coefficients = lasso.coef_
# Build a dictionary mapping each feature to its coefficient
coef_dict = {feature: coef for feature, coef in zip(x_train.columns, coefficients)}
# Convert the dictionary to a DataFrame and sort by absolute coefficient
coef_df = pd.DataFrame.from_dict(coef_dict, orient='index', columns=['Coefficient'])
coef_df['Absolute Coefficient'] = np.abs(coef_df['Coefficient'])
coef_df = coef_df.sort_values(by='Absolute Coefficient', ascending=False)
# Print all features and their coefficients
print(coef_df.to_string())  # to_string() shows every row, not a truncated view
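The text above also mentions PCA as a way to handle multicollinearity. Below is a minimal sketch; note that PCA transforms the features into new components rather than selecting original ones, and n_components=0.95 (keep enough components to explain 95% of the variance) is an illustrative assumption, not a recommendation from the text.

# Minimal PCA sketch for multicollinear features. PCA is scale-sensitive,
# so standardize first (fitting the scaler on the training set only).
# n_components=0.95 keeps enough components to explain 95% of the variance;
# it is an illustrative choice.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)
pca = PCA(n_components=0.95)
x_train_pca = pca.fit_transform(scaler.transform(x_train))
x_test_pca = pca.transform(scaler.transform(x_test))
print(x_train_pca.shape)                    # reduced feature space
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained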
Feature Selection with Univariate ANOVA Tests
In theory, features whose p-value is below 0.05 are significant. But when the dataset is very high-dimensional, an extremely low p-value does not necessarily mark an important feature: with a very large sample and many features, plenty of features survive even a 0.01 cutoff.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import f_classif,f_regression
from sklearn.feature_selection import SelectKBest,SelectPercentile
data = pd.read_csv('',nrows=2000)
data.head()
X = data.drop(['TARGET'],axis=1)
y = data['TARGET']
X.shape,y.shape
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
# Remove constant and quasi-constant features
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
print(x_train_filter.shape,x_test_filter.shape)
x_train_filter_features = x_train.columns[constant_filter.get_support()]  # names of the retained columns
# Remove duplicated features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
x_train_T = pd.DataFrame(x_train_T)
# Rename the row index back to the original feature names with rename()
x_train_T.rename(index=dict(zip(x_train_T.index, x_train_filter_features)), inplace=True)
# Print the renamed DataFrame
# print(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
x_test_T.rename(index=dict(zip(x_test_T.index, x_train_filter_features)), inplace=True)
print(x_train_T.duplicated().sum())
duplicated_features = x_train_T.duplicated()
print(duplicated_features)
features_to_keep = [not dup for dup in duplicated_features]
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
print(x_train_unique.shape,x_train.shape)
# Univariate ANOVA F-test
# For a categorical target, f_classif is the ANOVA F-test (f_regression, also
# imported above, is the analogue for a continuous target)
sel = f_classif(x_train_unique, y_train)
sel
p_value = pd.Series(sel[1], index=x_train_unique.columns)  # sel[1] holds the p-values
p_value.sort_values(ascending=False,inplace=True)
p_value
p_value.plot.bar(figsize = (16,5))
# In theory p < 0.05 marks a significant feature, but in a very high-dimensional dataset
# extremely low p-values do not guarantee importance: with a large sample and many
# features, plenty remain even at a 0.01 cutoff
p_value = p_value[p_value < 0.01]
p_value.index
x_train_p = x_train_unique[p_value.index]
x_test_p = x_test_unique[p_value.index]
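Alternatively, the SelectPercentile class imported above wraps this scoring and filtering in one step. A minimal sketch follows; keeping the top 10% of features by F-score is an arbitrary illustrative cutoff, and note that SelectPercentile ranks by score rather than thresholding p-values.

# Alternative: SelectPercentile (imported above) scores and filters in one step.
# percentile=10 (keep the top 10% of features by F-score) is an illustrative
# cutoff, not a tuned value
sel_pct = SelectPercentile(f_classif, percentile=10)
x_train_pct = sel_pct.fit_transform(x_train_unique, y_train)
x_test_pct = sel_pct.transform(x_test_unique)
print(x_train_pct.shape, x_test_pct.shape)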
# Build the classifier and compare
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set:', accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_p,x_test_p,y_train,y_test)
print(x_train_p.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
Recursive Feature Elimination (RFE) Using Tree-Based and Gradient-Boosting Estimators
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data.keys())
print(data.DESCR)
X = pd.DataFrame(data=data.data, columns=data.feature_names)
X.head()
y = data.target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
print(x_train.shape,x_test.shape)
# Feature selection by the feature importances of a random forest classifier
sel = SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(x_train,y_train)
print(sel.get_support())
features = x_train.columns[sel.get_support()]
print(features)
print(sel.estimator_.feature_importances_)
x_train_rfc = sel.transform(x_train)
x_test_rfc = sel.transform(x_test)
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set:', accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_rfc,x_test_rfc,y_train,y_test)
print(x_train_rfc.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
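By default, SelectFromModel keeps the features whose importance exceeds the mean importance; its threshold parameter adjusts that cutoff. The sketch below uses 'median' purely as an illustrative alternative, not a recommendation from the text.

# SelectFromModel's default threshold is the mean feature importance;
# threshold='median' keeps the top half of features instead ('median' is an
# illustrative alternative)
sel_med = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
    threshold='median')
sel_med.fit(x_train, y_train)
print(x_train.columns[sel_med.get_support()])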
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0,n_jobs=-1),n_features_to_select=15)
sel.fit(x_train,y_train)
features = x_train.columns[sel.get_support()]
print(features)
print(sel.estimator_.feature_importances_)
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)
%%time
run_randomForest(x_train_rfe,x_test_rfe,y_train,y_test)
print(x_train_rfe.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
# Feature selection by gradient boosting tree importances
from sklearn.ensemble import GradientBoostingClassifier
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0),n_features_to_select=12)
sel.fit(x_train,y_train)
features = x_train.columns[sel.get_support()]
print(features)
print(sel.estimator_.feature_importances_)
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)
%%time
run_randomForest(x_train_rfe,x_test_rfe,y_train,y_test)
print(x_train_rfe.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
# Sweep the number of selected features from 1 to 30 and compare test accuracy
for index in range(1, 31):
    sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select=index)
    sel.fit(x_train, y_train)
    x_train_rfe = sel.transform(x_train)
    x_test_rfe = sel.transform(x_test)
    print('Selected features: ', index)
    run_randomForest(x_train_rfe, x_test_rfe, y_train, y_test)
    print()
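Rather than sweeping n_features_to_select by hand as above, scikit-learn's RFECV can pick the number of features by cross-validation. A minimal sketch follows; cv=5 and scoring='accuracy' are illustrative choices.

# RFECV automates the sweep above by choosing the number of features via
# cross-validation; cv=5 and scoring='accuracy' are illustrative choices
from sklearn.feature_selection import RFECV

rfecv = RFECV(GradientBoostingClassifier(n_estimators=100, random_state=0),
              step=1, cv=5, scoring='accuracy')
rfecv.fit(x_train, y_train)
print('Optimal number of features:', rfecv.n_features_)
print(x_train.columns[rfecv.get_support()])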
# Refit with a fixed choice of n_features_to_select=6
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select=6)
sel.fit(x_train, y_train)
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)
print('Selected features: ', 6)
run_randomForest(x_train_rfe, x_test_rfe, y_train, y_test)
print()
features = x_train.columns[sel.get_support()]
print(features)