Filtering
The goal of univariate feature selection is to evaluate each feature's individual contribution to the model's predictive power. When choosing a model for univariate feature selection, consider the following factors:
- Model simplicity: univariate feature selection is usually a preliminary screening step, so the model should be simple and easy to interpret. Linear regression is the most common choice because its coefficients can be read directly as each feature's effect on the target variable.
- Computational efficiency: since a separate model is trained for every feature, the model should be cheap to fit so that a large number of features can be processed within limited time and resources.
- Suitability for the data: the model should match the data distribution and the problem type. For example, if the relationships are non-linear, consider models such as decision trees or support vector machines.
- A clear evaluation metric: the model should expose an explicit metric, such as mean squared error (MSE) or R-squared, so that the performance of different features can be compared.
- Independence between features: if the features are multicollinear, you may need a model that can handle relationships between features, such as a regularized linear regression (Lasso or Ridge).
Based on these considerations, the following models are commonly used for univariate feature selection:
- Linear regression: suitable when the relationships are clearly linear; simple and easy to interpret. Each feature's importance can be assessed from its coefficient.
- Decision trees: can handle non-linear relationships; a feature's importance can be read from its position in the tree and its importance score.
- Support vector machines (SVM): suitable for high-dimensional data; a feature's importance can be determined from its weight in the support vectors.
- Lasso regression: L1 regularization shrinks the coefficients of unimportant features to zero, performing feature selection directly.
- Random forests: build many decision trees and average each feature's importance across all of them.
When choosing a model, also consider the characteristics of the data and the specific requirements of the problem. For example, a very large dataset may call for a model that handles big data efficiently, while if interpretability of the predictions matters, a linear model may be preferable. Ultimately, the purpose of univariate feature selection is to find the most influential features and improve the model's predictive performance and interpretability.
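As a concrete illustration of this filter approach, scikit-learn exposes several univariate scoring functions behind a single interface. The following is a minimal sketch on a synthetic dataset; the choice of k=10 and the ANOVA F-test are illustrative assumptions, not prescriptions from the text above.

# Minimal univariate filter sketch with SelectKBest; the synthetic dataset
# and k=10 are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic data: 25 features, only 5 of which are informative
X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)

# Score each feature independently with the ANOVA F-test; keep the top 10
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
print(X_selected.shape)        # (500, 10)
print(selector.get_support())  # boolean mask over the original columns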
Feature Selection by ROC AUC for Classification (Univariate Feature Selection)
Import the required packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score,roc_auc_score
from sklearn.feature_selection import VarianceThreshold
Load the data
data = pd.read_csv('',nrows=2000)
data.head()
X = data.drop(['TARGET'],axis=1)
y = data['TARGET']
X.shape,y.shape
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
Remove constant, quasi-constant, and duplicated features
# Remove constant and quasi-constant features (variance below 0.01)
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
print(x_train_filter.shape,x_test_filter.shape)
# Remove duplicated features
# Transpose so that columns become rows, which lets pandas' duplicated() find identical features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
x_train_T = pd.DataFrame(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
print(x_train_T.duplicated().sum())
duplicated_features = x_train_T.duplicated()
print(duplicated_features)
# Keep only the features that are not duplicates, then transpose back
features_to_keep = [not dup for dup in duplicated_features]
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
print(x_train_unique.shape,x_train.shape)
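Assuming x_train_filter fits in memory as a DataFrame, the transpose-and-deduplicate step above can also be written more compactly with pandas' drop_duplicates; this is just an equivalent idiom, not a different method.

# Equivalent one-liner: transpose so features become rows, drop duplicated
# rows (keep='first', the default, retains the first of each identical group),
# then transpose back
x_train_unique_alt = pd.DataFrame(x_train_filter).T.drop_duplicates().T
print(x_train_unique_alt.shape)

For the test set, reuse the columns kept on the training set (as the mask-based code above does) rather than deduplicating it independently.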
Compute the ROC AUC scores
# Compute a ROC AUC score for each feature individually
roc_auc = []
for feature in x_train_unique.columns:
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # to_frame() turns the single column into the 2-D input that fit() expects
    clf.fit(x_train_unique[feature].to_frame(), y_train)
    y_pred = clf.predict(x_test_unique[feature].to_frame())
    roc_auc.append(roc_auc_score(y_test, y_pred))
print(roc_auc)
roc_value = pd.Series(roc_auc)
roc_value.index = x_train_unique.columns
roc_value.sort_values(ascending= False,inplace=True)
print(roc_value)
# In a binary classification problem, a feature whose ROC AUC is below 0.5 carries no information for predicting the output
roc_value.plot.bar()
sel = roc_value[roc_value>0.5]
print(sel)
x_train_roc = x_train_unique[sel.index]
x_test_roc = x_test_unique[sel.index]
Build the model and compare
# Build the model and compare
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set:', accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_roc,x_test_roc,y_train,y_test)
print(x_train_roc.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
Feature Selection by Mean Squared Error (MSE) for Regression (Univariate Feature Selection)
In regression problems, using the mean squared error (MSE) as the selection criterion is a common practice, especially for univariate feature selection, where each feature is considered on its own to evaluate its contribution to the model's predictive power. The general steps for univariate feature selection with MSE are:
- Data preparation: first, prepare the dataset, including all candidate features and the target variable.
- Model choice: pick a suitable regression model. For univariate feature selection a simple linear regression is usually enough, but any other appropriate regression model also works.
- Model training: for each candidate feature, train a regression model on that feature and the target variable; with \( n \) features, you will train \( n \) models.
- MSE computation: for each trained model, compute its MSE on the validation set. The MSE is the mean of the squared differences between predicted and actual values:
  \( \mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \)
  where \( y_i \) is the actual value, \( \hat{y}_i \) is the model's prediction, and \( n \) is the number of samples.
- Feature evaluation: judge each feature by its MSE. In general, features with a smaller MSE contribute more, because they let the model predict more accurately.
- Best-feature selection: pick the feature with the smallest MSE as the best single feature; it is considered the most influential for predicting the target variable.
- Iteration: to select several features, repeat the process, adding one new feature at a time and recomputing the MSE, until the model contains all the features you want.
Note that univariate feature selection does not account for interactions between features, so in some cases a combination of features may predict better. It also may not find the optimal feature subset, but it offers a simple and fast way to gauge the importance of individual features.
Below is a simple Python example using univariate feature selection and MSE:
import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error,mean_absolute_error,r2_score
from sklearn.model_selection import train_test_split
boston = load_boston()  # note: load_boston was removed in scikit-learn 1.2, so this example needs scikit-learn < 1.2
print(boston.DESCR)
X = pd.DataFrame(boston.data,columns=boston.feature_names)
X.head()
y = boston.target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
mse = []
for feature in x_train.columns:
clf = LinearRegression()
clf.fit(x_train[feature].to_frame(),y_train)
y_pred = clf.predict(x_test[feature].to_frame())
mse.append(mean_squared_error(y_test,y_pred))
print(mse)  # the MSE of each individual feature
mse = pd.Series(mse,index=x_train.columns)
mse.sort_values(ascending=False,inplace=True)
print(mse)  # a higher MSE means a larger error, so the features with the lowest MSE matter most for prediction
mse.plot.bar()
x_train_2 = x_train[['RM','LSTAT']]
x_test_2 = x_test[['RM','LSTAT']]
# %%time
model = LinearRegression()
model.fit(x_train_2,y_train)
y_pred = model.predict(x_test_2)
print('r2_score:',r2_score(y_test,y_pred))
print('rmse:',np.sqrt(mean_squared_error(y_test,y_pred)))
print('sd of house price: ',np.std(y))  # check whether the RMSE of the predictions is below the standard deviation of house prices
# %%time
model = LinearRegression()
model.fit(x_train,y_train)
y_pred = model.predict(x_test)
print('r2_score:',r2_score(y_test,y_pred))
print('rmse:',np.sqrt(mean_squared_error(y_test,y_pred)))
print('sd of house price: ',np.std(y))  # check whether the RMSE of the predictions is below the standard deviation of house prices
In this example we load the Boston housing data, fit a univariate regression for each feature, and compute its MSE on the test set; the feature with the smallest MSE is taken as the best one. This process shows how much each individual feature contributes to the model's predictive power and which features are most helpful for predicting the target variable.
If the features are multicollinear, further treatment may be needed, such as regularization (e.g., Lasso or Ridge regression) or principal component analysis (PCA), to improve the model's stability and interpretability. A Lasso regression code example follows.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso
data = pd.read_csv('PATH')
features = data.drop(['TARGET'], axis=1)
target = data['TARGET']
x_train,x_test,y_train,y_test = train_test_split(features,target,test_size=0.2,random_state=0)
# Lasso regression performs feature selection by shrinking coefficients to zero,
# so features whose coefficient is zero can be treated as unimportant.
# Adjust the regularization parameter alpha as needed.
# Initialize the Lasso regression model
lasso = Lasso(alpha=0.1)  # tune the regularization parameter alpha as needed
# Train the Lasso regression model
lasso.fit(x_train, y_train)
# Get the feature coefficients; a coefficient of 0 means the feature is unimportant
coefficients = lasso.coef_
# Build a dictionary mapping each feature to its coefficient
coef_dict = {feature: coef for feature, coef in zip(x_train.columns, coefficients)}
# Convert the dictionary to a DataFrame and sort by absolute coefficient
coef_df = pd.DataFrame.from_dict(coef_dict, orient='index', columns=['Coefficient'])
coef_df['Absolute Coefficient'] = np.abs(coef_df['Coefficient'])
coef_df = coef_df.sort_values(by='Absolute Coefficient', ascending=False)
# Print all features and their coefficients
print(coef_df.to_string())  # to_string() shows every row, not a truncated view
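The text above also mentions PCA as a way to handle multicollinearity. Below is a minimal sketch; note that PCA transforms the features into new components rather than selecting original ones, and n_components=0.95 (keep enough components to explain 95% of the variance) is an illustrative assumption, not a recommendation from the text.

# Minimal PCA sketch for multicollinear features. PCA is scale-sensitive,
# so standardize first (fitting the scaler on the training set only).
# n_components=0.95 keeps enough components to explain 95% of the variance;
# it is an illustrative choice.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler().fit(x_train)
pca = PCA(n_components=0.95)
x_train_pca = pca.fit_transform(scaler.transform(x_train))
x_test_pca = pca.transform(scaler.transform(x_test))
print(x_train_pca.shape)                    # reduced feature space
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained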
Feature Selection with Univariate ANOVA Tests
In theory, features whose p-value is below 0.05 are significant. But when the dataset is very high-dimensional, an extremely low p-value does not necessarily mark an important feature: with a very large sample and many features, plenty of features survive even a 0.01 cutoff.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_selection import VarianceThreshold
from sklearn.feature_selection import f_classif,f_regression
from sklearn.feature_selection import SelectKBest,SelectPercentile
data = pd.read_csv('',nrows=2000)
data.head()
X = data.drop(['TARGET'],axis=1)
y = data['TARGET']
X.shape,y.shape
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0,stratify=y)
# Remove constant and quasi-constant features
constant_filter = VarianceThreshold(threshold=0.01)
constant_filter.fit(x_train)
x_train_filter = constant_filter.transform(x_train)
x_test_filter = constant_filter.transform(x_test)
print(x_train_filter.shape,x_test_filter.shape)
x_train_filter_features = x_train.columns[constant_filter.get_support()]  # names of the retained columns
# Remove duplicated features
x_train_T = x_train_filter.T
x_test_T = x_test_filter.T
x_train_T = pd.DataFrame(x_train_T)
# Rename the row index back to the original feature names with rename()
x_train_T.rename(index=dict(zip(x_train_T.index, x_train_filter_features)), inplace=True)
# Print the renamed DataFrame
# print(x_train_T)
x_test_T = pd.DataFrame(x_test_T)
x_test_T.rename(index=dict(zip(x_test_T.index, x_train_filter_features)), inplace=True)
print(x_train_T.duplicated().sum())
duplicated_features = x_train_T.duplicated()
print(duplicated_features)
features_to_keep = [not dup for dup in duplicated_features]
x_train_unique = x_train_T[features_to_keep].T
x_test_unique = x_test_T[features_to_keep].T
print(x_train_unique.shape,x_train.shape)
# Univariate ANOVA F-test
# For a categorical target, f_classif is the ANOVA F-test (f_regression, also
# imported above, is the analogue for a continuous target)
sel = f_classif(x_train_unique, y_train)
sel
p_value = pd.Series(sel[1], index=x_train_unique.columns)  # sel[1] holds the p-values
p_value.sort_values(ascending=False,inplace=True)
p_value
p_value.plot.bar(figsize = (16,5))
# In theory p < 0.05 marks a significant feature, but in a very high-dimensional dataset
# extremely low p-values do not guarantee importance: with a large sample and many
# features, plenty remain even at a 0.01 cutoff
p_value = p_value[p_value < 0.01]
p_value.index
x_train_p = x_train_unique[p_value.index]
x_test_p = x_test_unique[p_value.index]
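Alternatively, the SelectPercentile class imported above wraps this scoring and filtering in one step. A minimal sketch follows; keeping the top 10% of features by F-score is an arbitrary illustrative cutoff, and note that SelectPercentile ranks by score rather than thresholding p-values.

# Alternative: SelectPercentile (imported above) scores and filters in one step.
# percentile=10 (keep the top 10% of features by F-score) is an illustrative
# cutoff, not a tuned value
sel_pct = SelectPercentile(f_classif, percentile=10)
x_train_pct = sel_pct.fit_transform(x_train_unique, y_train)
x_test_pct = sel_pct.transform(x_test_unique)
print(x_train_pct.shape, x_test_pct.shape)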
# Build the classifier and compare
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set:', accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_p,x_test_p,y_train,y_test)
print(x_train_p.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
Recursive Feature Elimination (RFE) Using Tree-Based and Gradient-Boosting Estimators
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
print(data.keys())
print(data.DESCR)
X = pd.DataFrame(data=data.data, columns=data.feature_names)
X.head()
y = data.target
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
print(x_train.shape,x_test.shape)
# Feature selection by the feature importances of a random forest classifier
sel = SelectFromModel(RandomForestClassifier(n_estimators=100,random_state=0,n_jobs=-1))
sel.fit(x_train,y_train)
print(sel.get_support())
features = x_train.columns[sel.get_support()]
print(features)
print(sel.estimator_.feature_importances_)
x_train_rfc = sel.transform(x_train)
x_test_rfc = sel.transform(x_test)
def run_randomForest(x_train, x_test, y_train, y_test):
    clf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)
    clf.fit(x_train, y_train)
    y_pred = clf.predict(x_test)
    print('Accuracy on test set:', accuracy_score(y_test, y_pred))
%%time
run_randomForest(x_train_rfc,x_test_rfc,y_train,y_test)
print(x_train_rfc.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
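By default, SelectFromModel keeps the features whose importance exceeds the mean importance; its threshold parameter adjusts that cutoff. The sketch below uses 'median' purely as an illustrative alternative, not a recommendation from the text.

# SelectFromModel's default threshold is the mean feature importance;
# threshold='median' keeps the top half of features instead ('median' is an
# illustrative alternative)
sel_med = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1),
    threshold='median')
sel_med.fit(x_train, y_train)
print(x_train.columns[sel_med.get_support()])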
# Recursive Feature Elimination (RFE)
from sklearn.feature_selection import RFE
sel = RFE(RandomForestClassifier(n_estimators=100, random_state=0,n_jobs=-1),n_features_to_select=15)
sel.fit(x_train,y_train)
features = x_train.columns[sel.get_support()]
print(features)
print(sel.estimator_.feature_importances_)
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)
%%time
run_randomForest(x_train_rfe,x_test_rfe,y_train,y_test)
print(x_train_rfe.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
# Feature selection by gradient boosting tree importances
from sklearn.ensemble import GradientBoostingClassifier
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0),n_features_to_select=12)
sel.fit(x_train,y_train)
features = x_train.columns[sel.get_support()]
print(features)
print(sel.estimator_.feature_importances_)
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)
%%time
run_randomForest(x_train_rfe,x_test_rfe,y_train,y_test)
print(x_train_rfe.shape)
%%time
run_randomForest(x_train,x_test,y_train,y_test)
# Sweep the number of selected features from 1 to 30 and compare test accuracy
for index in range(1, 31):
    sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select=index)
    sel.fit(x_train, y_train)
    x_train_rfe = sel.transform(x_train)
    x_test_rfe = sel.transform(x_test)
    print('Selected features: ', index)
    run_randomForest(x_train_rfe, x_test_rfe, y_train, y_test)
    print()
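Rather than sweeping n_features_to_select by hand as above, scikit-learn's RFECV can pick the number of features by cross-validation. A minimal sketch follows; cv=5 and scoring='accuracy' are illustrative choices.

# RFECV automates the sweep above by choosing the number of features via
# cross-validation; cv=5 and scoring='accuracy' are illustrative choices
from sklearn.feature_selection import RFECV

rfecv = RFECV(GradientBoostingClassifier(n_estimators=100, random_state=0),
              step=1, cv=5, scoring='accuracy')
rfecv.fit(x_train, y_train)
print('Optimal number of features:', rfecv.n_features_)
print(x_train.columns[rfecv.get_support()])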
# Refit with a fixed choice of n_features_to_select=6
sel = RFE(GradientBoostingClassifier(n_estimators=100, random_state=0), n_features_to_select=6)
sel.fit(x_train, y_train)
x_train_rfe = sel.transform(x_train)
x_test_rfe = sel.transform(x_test)
print('Selected features: ', 6)
run_randomForest(x_train_rfe, x_test_rfe, y_train, y_test)
print()
features = x_train.columns[sel.get_support()]
print(features)