Feature Selection with a Genetic Algorithm: A Python Implementation

Contents

1. Basic Principles of Genetic-Algorithm Feature Selection

2. Fitness Function Choice and Environment Requirements

(1) Choosing a fitness function

(2) Required third-party packages

3. Python Implementation


1. Basic Principles of Genetic-Algorithm Feature Selection

The basic idea of genetic-algorithm feature selection is to use a GA to search for an optimal binary string in which each bit corresponds to one feature: if bit i is "1", the corresponding feature is selected and enters the estimator; if it is "0", the feature is excluded from the model. The basic steps are:

(1) Encoding. Binary encoding is used: a bit value of "0" means the feature is not selected, and "1" means it is selected.

(2) Initial population. N random binary strings are generated to form the initial population; a population size of 50 to 100 is typical.

(3) Fitness function. The fitness function measures how good an individual (candidate solution) is. For feature selection its design is critical; it is usually based on a class-separability criterion or on the predictive power of the selected features.

(4) The fittest individual in the population is copied unchanged into the next generation. Selection, crossover, and mutation are then applied to the parent population to breed the remaining N-1 strings of the new generation. Roulette-wheel selection is commonly used: strings with higher fitness are more likely to be selected and passed on to the next generation, while low-fitness strings are more likely to be eliminated. Crossover and mutation are the operators that create new individuals. If the crossover rate is too high, high-fitness string structures are destroyed quickly; if too low, the search stagnates; values of 0.5 to 0.9 are typical. If the mutation rate is too high, the GA degenerates into random search; if too low, no new individuals are produced; values of 0.01 to 0.1 are typical.

(5) If the preset number of generations has been reached, return the best string found and use it as the basis for feature selection; the algorithm terminates. Otherwise, return to step (4) and breed the next generation. A minimal code sketch of this loop follows.
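To make the loop concrete, here is a minimal sketch of steps (1) to (5) on a synthetic regression problem. It assumes a LinearRegression estimator and cross-validated negative MSE as the fitness; the helpers fitness and roulette_select and all parameter values are illustrative choices, not part of any library.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                          # 200 samples, 30 candidate features
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=200)   # only the first 5 features are informative

POP_SIZE, N_GEN, P_CROSS, P_MUT = 50, 30, 0.8, 0.05     # illustrative GA settings


def fitness(mask):
    # (3) fitness of one binary string: cross-validated negative MSE of the selected features
    if mask.sum() == 0:
        return -1e9                                      # an empty feature set gets the worst fitness
    return cross_val_score(LinearRegression(), X[:, mask.astype(bool)], y,
                           cv=5, scoring="neg_mean_squared_error").mean()


def roulette_select(pop, fits):
    # roulette-wheel selection: probability proportional to (shifted) fitness
    shifted = fits - fits.min() + 1e-12
    return pop[rng.choice(len(pop), size=len(pop), p=shifted / shifted.sum())]


pop = rng.integers(0, 2, size=(POP_SIZE, X.shape[1]))    # (1)+(2) random binary initial population

for gen in range(N_GEN):
    fits = np.array([fitness(ind) for ind in pop])
    elite = pop[fits.argmax()].copy()                    # (4) keep the best individual unchanged
    parents = roulette_select(pop, fits)
    children = []
    for i in range(0, POP_SIZE, 2):                      # single-point crossover on parent pairs
        a, b = parents[i].copy(), parents[(i + 1) % POP_SIZE].copy()
        if rng.random() < P_CROSS:
            cut = rng.integers(1, X.shape[1])
            a[cut:], b[cut:] = b[cut:].copy(), a[cut:].copy()
        children.extend([a, b])
    pop = np.array(children)[:POP_SIZE]
    flip = rng.random(pop.shape) < P_MUT                 # bit-flip mutation
    pop[flip] = 1 - pop[flip]
    pop[0] = elite                                       # elitism

fits = np.array([fitness(ind) for ind in pop])           # (5) report the best string found
print("selected feature indices:", np.where(pop[fits.argmax()] == 1)[0])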

2. Fitness Function Choice and Environment Requirements

(1) Choosing a fitness function

For the fitness you can refer to the classification and regression metrics in scikit-learn's model evaluation documentation; this article uses the mean_squared_error metric (exposed as the scorer neg_mean_squared_error).

Many scorers are available to choose from, as listed in scikit-learn's SCORERS dictionary below (a short usage sketch follows the dictionary):

SCORERS = dict(explained_variance=explained_variance_scorer,
               r2=r2_scorer,
               max_error=max_error_scorer,
               neg_median_absolute_error=neg_median_absolute_error_scorer,
               neg_mean_absolute_error=neg_mean_absolute_error_scorer,
               neg_mean_squared_error=neg_mean_squared_error_scorer,
               neg_mean_squared_log_error=neg_mean_squared_log_error_scorer,
               accuracy=accuracy_scorer, roc_auc=roc_auc_scorer,
               balanced_accuracy=balanced_accuracy_scorer,
               average_precision=average_precision_scorer,
               neg_log_loss=neg_log_loss_scorer,
               brier_score_loss=brier_score_loss_scorer,
               # Cluster metrics that use supervised evaluation
               adjusted_rand_score=adjusted_rand_scorer,
               homogeneity_score=homogeneity_scorer,
               completeness_score=completeness_scorer,
               v_measure_score=v_measure_scorer,
               mutual_info_score=mutual_info_scorer,
               adjusted_mutual_info_score=adjusted_mutual_info_scorer,
               normalized_mutual_info_score=normalized_mutual_info_scorer,
               fowlkes_mallows_score=fowlkes_mallows_scorer)
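As a quick illustration (on synthetic data), any of these names can be passed directly as the scoring argument of scikit-learn's cross-validation utilities (and of GeneticSelectionCV below), or looked up explicitly with sklearn.metrics.get_scorer:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 10)
y = np.random.rand(100)

# pass the scorer name as a string ...
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="neg_mean_squared_error")
print(scores.mean())

# ... or look the scorer object up by name and call it on a fitted estimator
scorer = get_scorer("neg_mean_squared_error")
print(scorer(LinearRegression().fit(X, y), X, y))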

(2) Required third-party packages

Genetic-algorithm package: sklearn-genetic (it provides the genetic_selection module used below and can normally be installed with pip install sklearn-genetic)

Other Python packages: scikit-learn, numpy, scipy, matplotlib
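A quick way to confirm that the environment is in place (package and module names as assumed above):

import numpy, scipy, sklearn, matplotlib
import genetic_selection   # provided by the sklearn-genetic package
print(numpy.__version__, scipy.__version__, sklearn.__version__, matplotlib.__version__)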

3. Python Implementation

from __future__ import print_function
from genetic_selection import GeneticSelectionCV
import numpy as np
from sklearn.neural_network import MLPRegressor
import scipy.io as sio
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error
import matplotlib.pyplot as plt


def main():
    # 1. Load the data
    mat = sio.loadmat('NDFNDF_smote.mat')
    data = mat['NDFNDF_smote']
    x, y = data[:, :1050], data[:, 1050]
    print(x.shape, y.shape)

    # 2. Split the dataset into train/test and standardize
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)

    x_scale, y_scale = StandardScaler(), StandardScaler()
    x_train_scaled = x_scale.fit_transform(x_train)
    x_test_scaled = x_scale.transform(x_test)
    y_train_scaled = y_scale.fit_transform(y_train.reshape(-1, 1))
    y_test_scaled = y_scale.transform(y_test.reshape(-1, 1))
    print(x_train_scaled.shape, y_train_scaled.shape)
    print(x_test_scaled.shape, y_test_scaled.shape)

    # 3. Tune the hyperparameter (number of hidden neurons)
    base, size = 30, 21
    wavelengths_save, wavelengths_size, r2_test_save, mse_test_save = [], [], [], []
    for hidden_size in range(base, base+size):
        print('Number of hidden neurons:', hidden_size)
        estimator = MLPRegressor(hidden_layer_sizes=hidden_size,
                                 activation='relu',
                                 solver='adam',
                                 alpha=0.0001,
                                 batch_size='auto',
                                 learning_rate='constant',
                                 learning_rate_init=0.001,
                                 power_t=0.5,
                                 max_iter=1000,
                                 shuffle=True,
                                 random_state=1,
                                 tol=0.0001,
                                 verbose=False,
                                 warm_start=False,
                                 momentum=0.9,
                                 nesterovs_momentum=True,
                                 early_stopping=False,
                                 validation_fraction=0.1,
                                 beta_1=0.9, beta_2=0.999,
                                 epsilon=1e-08)

        selector = GeneticSelectionCV(estimator,
                                      cv=5,
                                      verbose=1,
                                      scoring="neg_mean_squared_error",
                                      max_features=5,
                                      n_population=200,
                                      crossover_proba=0.5,
                                      mutation_proba=0.2,
                                      n_generations=200,
                                      crossover_independent_proba=0.5,
                                      mutation_independent_proba=0.05,
                                      tournament_size=3,
                                      n_gen_no_change=10,
                                      caching=True,
                                      n_jobs=-1)
        selector = selector.fit(x_train_scaled, y_train_scaled.ravel())
        print('Number of selected variables:', selector.n_features_)
        print(np.array(selector.population_).shape)
        print(selector.generation_scores_)

        x_train_s, x_test_s = x_train_scaled[:, selector.support_], x_test_scaled[:, selector.support_]
        estimator.fit(x_train_s, y_train_scaled.ravel())

        # y_train_pred = estimator.predict(x_train_s)
        y_test_pred = estimator.predict(x_test_s)
        # y_train_pred = y_scale.inverse_transform(y_train_pred.reshape(-1, 1)).ravel()
        y_test_pred = y_scale.inverse_transform(y_test_pred.reshape(-1, 1)).ravel()
        r2_test = r2_score(y_test, y_test_pred)
        mse_test = mean_squared_error(y_test, y_test_pred)

        wavelengths_save.append(list(selector.support_))  
        wavelengths_size.append(selector.n_features_)  
        r2_test_save.append(r2_test)
        mse_test_save.append(mse_test)
        print('R2:', r2_test, 'MSE:', mse_test)

    print('Numbers of selected variables per run:', wavelengths_size)

    # 4. Save the intermediate results
    dict_name = {'wavelengths_size': wavelengths_size, 'r2_test_save': r2_test_save,
                 'mse_test_save': mse_test_save, 'wavelengths_save': wavelengths_save}
    with open('bpnn_ga.txt', 'w') as f:
        f.write(str(dict_name))

    # 5. Plot the curves
    plt.figure(figsize=(6, 4), dpi=300)
    fonts = 8
    xx = np.arange(base, base+size)
    plt.plot(xx, r2_test_save, color='r', linewidth=2, label='r2')
    plt.plot(xx, mse_test_save, color='k', linewidth=2, label='mse')
    plt.xlabel('hidden layer size', fontsize=fonts)
    plt.ylabel('R2 / MSE', fontsize=fonts)
    plt.grid(True)
    plt.legend(fontsize=fonts)
    plt.tight_layout(pad=0.3)
    plt.show()


if __name__ == "__main__":
    main()
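Because step 4 writes the results dictionary as a plain string, it can be read back later with ast.literal_eval. A small sketch, assuming bpnn_ga.txt was produced by the script above and that the saved booleans render as True/False:

import ast
import numpy as np

with open('bpnn_ga.txt') as f:
    results = ast.literal_eval(f.read())

best_run = int(np.argmax(results['r2_test_save']))                       # run with the highest test R2
selected = np.where(np.array(results['wavelengths_save'][best_run]))[0]  # indices of the selected wavelengths
print('best run:', best_run, 'selected wavelength indices:', selected)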

 

 

