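Introduction
As defined on Wikipedia, the genetic algorithm is inspired by the process of natural selection proposed by Charles Darwin. More generally, the following description of the natural process shows how it relates to genetic algorithms.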
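We start with an initial population that has certain traits, as shown in Figure 1. This initial population is tested in a specific environment to see how well its individuals (the parents) perform against a predefined fitness criterion. In machine learning, the fitness can be any performance metric: accuracy, precision, recall, F1 score, AUC, and so on. Based on the fitness values, we select the best-performing parents ("survival of the fittest") as the surviving population (Figure 2).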
Figure 1. Initial population
Figure 2. Survival of the fittest
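Next, the parents in the surviving population mate to produce offspring through a combination of two steps: crossover (recombination) and mutation. In crossover, the genes (parameters) of the mating parents are recombined to produce offspring, so that each child inherits some genes (parameters) from each parent (Figure 3).
Figure 3. Crossover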
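Finally, in mutation, the values of some genes (parameters) are altered to maintain genetic diversity (Figure 4). This diversity is what usually allows natural selection, and genetic algorithms, to arrive at better solutions.
Figure 4. Mutation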
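Figure 5 shows the second-generation population, which contains both the surviving parents and their children. We keep the surviving parents so that the best fitness parameters are retained in case the offspring turn out to have worse fitness values than their parents.
Figure 5. Second-generation population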
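The cycle described above (initial population, survival of the fittest, crossover, mutation, next generation) can be sketched end to end on a toy problem. This is a minimal illustration and not the XGBoost module built later: the fitness function, the gene range [0, 10], and names such as `evolve` are assumptions made up for the example.

```python
import random


def fitness(individual):
    """Toy fitness (an assumption): highest, at 0, when every gene equals 3."""
    return -sum((gene - 3.0) ** 2 for gene in individual)


def evolve(pop_size=8, n_genes=4, n_survivors=4, n_generations=30, seed=0):
    rng = random.Random(seed)
    # Initial population: random genes in [0, 10].
    population = [[rng.uniform(0, 10) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best_per_generation = []
    for _ in range(n_generations):
        # Survival of the fittest: keep the top n_survivors individuals.
        population.sort(key=fitness, reverse=True)
        survivors = population[:n_survivors]
        best_per_generation.append(fitness(survivors[0]))
        children = []
        for i in range(pop_size - n_survivors):
            mother = survivors[i % n_survivors]
            father = survivors[(i + 1) % n_survivors]
            # Crossover: each gene comes from one of the two parents.
            child = [rng.choice(pair) for pair in zip(mother, father)]
            # Mutation: nudge one randomly chosen gene, staying in [0, 10].
            g = rng.randrange(n_genes)
            child[g] = min(10.0, max(0.0, child[g] + rng.uniform(-1, 1)))
            children.append(child)
        # Next generation = surviving parents + mutated children.
        population = survivors + children
    return population, best_per_generation


population, history = evolve()
print("best fitness, first vs. last generation:", history[0], history[-1])
```

Because the surviving parents are carried over unchanged, the best fitness recorded in `history` can never decrease from one generation to the next, which is the point made about Figure 5.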
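Genetic algorithm module for XGBoost
We will create a genetic algorithm module tailored to XGBoost. Here is the XGBoost project's own description:
XGBoost is an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable. It implements machine learning algorithms under the Gradient Boosting framework.
The module will have functions for the following four steps: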
(i) Initialization
(ii) Selection
(iii) Crossover
(iv) Mutation
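Initialization:
The first step is initialization, in which the parameters are randomly initialized to create the population. This corresponds to the first-generation population shown in Figure 1. The initialization code generates a vector containing the parameters. For XGBoost, we chose seven parameters to optimize: learning_rate, n_estimators, max_depth, min_child_weight, subsample, colsample_bytree, and gamma.
The limits for each parameter are either based on the limits described in the XGBoost documentation or, where the upper limit is infinity, on a reasonable guess. We first create an empty array for each parameter and then fill it with random values.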
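One way to keep the limits readable is to declare them in a single table and sample from it. The sketch below is an illustration only: the names `SEARCH_SPACE`, `sample_individual`, and `sample_population` are hypothetical, and the ranges approximate the ones used in this article's initialization code (integer parameters are drawn from every integer in the range, which is a simplification).

```python
import random

# (low, high, is_integer) per parameter. The ranges are assumptions that
# approximate the initialization ranges used in this article.
SEARCH_SPACE = {
    "learning_rate": (0.01, 1.0, False),
    "n_estimators": (10, 1500, True),
    "max_depth": (1, 9, True),
    "min_child_weight": (0.01, 10.0, False),
    "gamma": (0.01, 10.0, False),
    "subsample": (0.01, 1.0, False),
    "colsample_bytree": (0.01, 1.0, False),
}


def sample_individual(rng):
    """Draw one random parameter set from SEARCH_SPACE."""
    individual = {}
    for name, (low, high, is_integer) in SEARCH_SPACE.items():
        if is_integer:
            individual[name] = rng.randint(low, high)  # inclusive bounds
        else:
            individual[name] = round(rng.uniform(low, high), 2)
    return individual


def sample_population(size, seed=None):
    """Create `size` random individuals; a seed makes the draw repeatable."""
    rng = random.Random(seed)
    return [sample_individual(rng) for _ in range(size)]


population = sample_population(8, seed=723)
print(population[0])
```

Storing each individual as a dictionary keeps parameter names attached to their values; the module in this article uses a NumPy array with one column per parameter instead.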
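Parent selection (survival of the fittest):
In the second step, we train a model for each member of the initial population and compute its fitness value. In this case, the fitness is the F1 score. We then define how many parents to select and create an array of the selected parents based on their fitness values.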
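Selecting the fittest parents amounts to ranking the individuals by fitness and keeping the top few. The following is a minimal sketch of that idea; the function name `select_fittest` and the sample scores are made up for illustration and are not results from this article.

```python
def select_fittest(population, fitness, num_parents):
    """Return the num_parents individuals with the highest fitness."""
    # Indices sorted from best to worst; the inputs are left unmodified.
    ranked = sorted(range(len(fitness)), key=lambda i: fitness[i], reverse=True)
    return [population[i] for i in ranked[:num_parents]]


# Hypothetical individuals and scores, for illustration only.
candidates = ["a", "b", "c", "d"]
scores = [0.2, 0.9, 0.5, 0.7]
print(select_fittest(candidates, scores, num_parents=2))
```

Ranking by index avoids overwriting fitness values to mark an individual as already chosen, so the same scores can still be logged afterwards.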
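Crossover:
There are several ways to define crossover in a genetic algorithm, such as single-point, two-point, k-point, uniform, and ordered crossover. We will use uniform crossover, in which each parameter of a child is chosen independently from one of the parents according to some distribution. In our case, we use the "discrete uniform" distribution from NumPy's random functions.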
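To make "chosen independently" concrete, here is a small sketch of uniform crossover in which every gene of every child is decided by its own fair coin flip. It is a variant for illustration, not the module's function: `uniform_crossover` and `make_children` are hypothetical names, and the parents are dummy vectors.

```python
import random


def uniform_crossover(parent_a, parent_b, rng):
    """Build one child, taking each gene from either parent with probability 0.5."""
    return [a if rng.random() < 0.5 else b for a, b in zip(parent_a, parent_b)]


def make_children(parents, num_children, seed=None):
    """Pair neighbouring parents cyclically: (0, 1), (1, 2), ..., (last, 0)."""
    rng = random.Random(seed)
    children = []
    for i in range(num_children):
        parent_a = parents[i % len(parents)]
        parent_b = parents[(i + 1) % len(parents)]
        children.append(uniform_crossover(parent_a, parent_b, rng))
    return children


# Hypothetical parents with easily distinguishable genes.
parents = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]]
children = make_children(parents, num_children=4, seed=0)
print(children)
```

Drawing a fresh choice per gene and per child gives more varied offspring than reusing one random split of the gene indexes for all children; which behaviour is preferable depends on how much diversity the search needs.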
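Mutation:
The final step introduces diversity into the children by randomly selecting one parameter and randomly changing its value. We also apply limits so that the changed value stays within a valid range. Skipping these limits can lead to errors, for example when a mutated value falls outside the range XGBoost accepts (such as a subsample above 1).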
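The "change one value, then limit it" step can be sketched as follows. This is a simplified illustration under stated assumptions: `BOUNDS` and `STEP` copy the limits and step sizes used in this article's mutation code, while the names `mutate`, `BOUNDS`, `STEP`, and `INTEGER_GENES` are hypothetical.

```python
import random

# Gene order: learning_rate, n_estimators, max_depth, min_child_weight,
# gamma, subsample, colsample_bytree.
BOUNDS = [(0.01, 1.0), (10, 2000), (1, 15), (0.0, 10.0),
          (0.01, 10.0), (0.01, 1.0), (0.01, 1.0)]
STEP = [0.5, 200, 5, 5.0, 2.0, 0.5, 0.5]  # largest change allowed per gene
INTEGER_GENES = {1, 2}  # n_estimators and max_depth must stay whole numbers


def mutate(individual, rng):
    """Return a copy of `individual` with one random gene shifted and clipped."""
    mutated = list(individual)
    gene = rng.randrange(len(mutated))
    if gene in INTEGER_GENES:
        delta = rng.randint(-STEP[gene], STEP[gene])
    else:
        delta = round(rng.uniform(-STEP[gene], STEP[gene]), 2)
    low, high = BOUNDS[gene]
    # Clip to the allowed range so the model never sees an invalid value.
    mutated[gene] = min(high, max(low, mutated[gene] + delta))
    return mutated


rng = random.Random(42)
child = [0.3, 500, 6, 1.0, 0.5, 0.8, 0.8]
print(mutate(child, rng))
```

Clipping to the nearest bound is one choice among several; redrawing the value until it lands inside the range would be an alternative.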
The code is as follows:
Initialization:
import random
import numpy as np

def initilialize_poplulation(numberOfParents):
    # One column vector per hyperparameter, one row per parent
    learningRate = np.empty([numberOfParents, 1])
    # n_estimators can reach 1485, which does not fit in uint8 (max 255), so use a wider integer type
    nEstimators = np.empty([numberOfParents, 1], dtype=np.int64)
    maxDepth = np.empty([numberOfParents, 1], dtype=np.uint8)
    minChildWeight = np.empty([numberOfParents, 1])
    gammaValue = np.empty([numberOfParents, 1])
    subSample = np.empty([numberOfParents, 1])
    colSampleByTree = np.empty([numberOfParents, 1])
    # Fill every parent with random values inside the chosen limits
    for i in range(numberOfParents):
        learningRate[i] = round(random.uniform(0.01, 1), 2)
        nEstimators[i] = random.randrange(10, 1500, step=25)
        maxDepth[i] = int(random.randrange(1, 10, step=1))
        minChildWeight[i] = round(random.uniform(0.01, 10.0), 2)
        gammaValue[i] = round(random.uniform(0.01, 10.0), 2)
        subSample[i] = round(random.uniform(0.01, 1.0), 2)
        colSampleByTree[i] = round(random.uniform(0.01, 1.0), 2)
    # Column order: learning_rate, n_estimators, max_depth, min_child_weight, gamma, subsample, colsample_bytree
    population = np.concatenate((learningRate, nEstimators, maxDepth, minChildWeight, gammaValue, subSample, colSampleByTree), axis=1)
    return population
Parent selection (survival of the fittest):
import xgboost as xgb
from sklearn.metrics import f1_score

def fitness_f1score(y_true, y_pred):
    fitness = round((f1_score(y_true, y_pred, average='weighted')), 4)
    return fitness

# Train one model per individual and compute its fitness score
def train_population(population, dMatrixTrain, dMatrixtest, y_test):
    fScore = []
    for i in range(population.shape[0]):
        param = {'objective': 'binary:logistic',
                 'learning_rate': population[i][0],
                 'max_depth': int(population[i][2]),
                 'min_child_weight': population[i][3],
                 'gamma': population[i][4],
                 'subsample': population[i][5],
                 'colsample_bytree': population[i][6],
                 'seed': 24}
        # xgb.train does not use an 'n_estimators' entry in param; the number of trees is set by
        # the number of boosting rounds, so the n_estimators gene is passed there
        # (the original listing always trained for a fixed 100 rounds)
        num_round = int(population[i][1])
        xgbT = xgb.train(param, dMatrixTrain, num_round)
        preds = xgbT.predict(dMatrixtest)
        preds = preds > 0.5  # turn predicted probabilities into class labels
        fScore.append(fitness_f1score(y_test, preds))
    return fScore

# Select parents for mating
def new_parents_selection(population, fitness, numParents):
    selectedParents = np.empty((numParents, population.shape[1]))  # array to store the fittest parents
    fitness = np.array(fitness, dtype=float)  # work on a copy so the caller's scores are not overwritten
    # Find the best-performing parents
    for parentId in range(numParents):
        bestFitnessId = np.argmax(fitness)
        selectedParents[parentId, :] = population[bestFitnessId, :]
        fitness[bestFitnessId] = -1  # F1 is never negative, so this parent is not selected again
    return selectedParents
Crossover:
def crossover_uniform(parents, childrenSize):
    numberOfGenes = int(childrenSize[1])
    # Randomly draw half as many gene indexes as there are genes
    # (randint can repeat an index, so fewer distinct genes may be drawn)
    crossoverPointIndex1 = np.random.randint(0, numberOfGenes, numberOfGenes // 2)
    # The leftover indexes are filled from the second parent
    crossoverPointIndex2 = np.array(sorted(set(range(numberOfGenes)) - set(crossoverPointIndex1.tolist())), dtype=int)
    children = np.empty(childrenSize)
    for i in range(childrenSize[0]):
        # find parent 1 index
        parent1_index = i % parents.shape[0]
        # find parent 2 index
        parent2_index = (i + 1) % parents.shape[0]
        # insert the randomly selected parameters from parent 1
        children[i, crossoverPointIndex1] = parents[parent1_index, crossoverPointIndex1]
        # insert the remaining parameters from parent 2
        children[i, crossoverPointIndex2] = parents[parent2_index, crossoverPointIndex2]
    return children
Mutation:
def mutation(crossover, numberOfParameters):
    # Minimum and maximum values allowed for each parameter
    minMaxValue = np.zeros((numberOfParameters, 2))
    minMaxValue[0, :] = [0.01, 1.0]  # min/max learning rate
    minMaxValue[1, :] = [10, 2000]  # min/max n_estimators
    minMaxValue[2, :] = [1, 15]  # min/max depth
    minMaxValue[3, :] = [0, 10.0]  # min/max child_weight
    minMaxValue[4, :] = [0.01, 10.0]  # min/max gamma
    minMaxValue[5, :] = [0.01, 1.0]  # min/max subsample
    minMaxValue[6, :] = [0.01, 1.0]  # min/max colsample_bytree
    # One gene is chosen at random; the same random shift is applied to that gene in every child
    mutationValue = 0
    parameterSelect = int(np.random.randint(0, 7))
    if parameterSelect == 0:  # learning_rate
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    elif parameterSelect == 1:  # n_estimators
        mutationValue = int(np.random.randint(-200, 200))
    elif parameterSelect == 2:  # max_depth
        mutationValue = int(np.random.randint(-5, 5))
    elif parameterSelect == 3:  # min_child_weight
        mutationValue = round(np.random.uniform(-5, 5), 2)  # was uniform(5, 5), which always returns 5
    elif parameterSelect == 4:  # gamma
        mutationValue = round(np.random.uniform(-2, 2), 2)
    elif parameterSelect == 5:  # subsample
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    elif parameterSelect == 6:  # colsample_bytree
        mutationValue = round(np.random.uniform(-0.5, 0.5), 2)
    # Introduce the mutation, and set the value to the max or min if it goes out of range
    for idx in range(crossover.shape[0]):
        crossover[idx, parameterSelect] = crossover[idx, parameterSelect] + mutationValue
        if crossover[idx, parameterSelect] > minMaxValue[parameterSelect, 1]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 1]
        if crossover[idx, parameterSelect] < minMaxValue[parameterSelect, 0]:
            crossover[idx, parameterSelect] = minMaxValue[parameterSelect, 0]
    return crossover
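Optimizing XGBoost with the genetic algorithm
We will now use the module discussed above to train a model on a dataset. The dataset comes from the UCI Machine Learning Repository. It covers 102 molecules, of which 39 were judged by human experts to have an odor usable in perfumery and the remaining 63 were not. The dataset contains 6,598 low-energy conformations of these molecules, described by 166 features. Because the goal of this tutorial is to understand the genetic algorithm, we do only minimal preprocessing.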
import random
import numpy as np
import pandas as pd
import geneticXGboost
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

np.random.seed(723)
random.seed(723)  # the initialization step uses Python's random module, so seed it as well

dataset = pd.read_csv('clean2.data', header=None)
X = dataset.iloc[:, 2:168].values  # discard the first two columns (molecule name and conformation name)
y = dataset.iloc[:, 168].values  # extract the last column as the class (1 => desired odor, 0 => undesired odor)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=97)

# Standardize the features using statistics from the training split only
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

xgDMatrix = xgb.DMatrix(X_train, y_train)  # create the training DMatrix
xgbDMatrixTest = xgb.DMatrix(X_test, y_test)
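We start with 8 parents and select the 4 fittest for mating. We will create 4 generations and monitor the fitness (F1 score). Half of the parents in each new generation are the fittest parents selected from the previous generation. This ensures that, even if the children score lower, the best fitness score is at least as high as in the previous generation.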
numberOfParents = 8  # number of parents to start with
numberOfParentsMating = 4  # number of parents that will mate
numberOfParameters = 7  # number of parameters that will be optimized
numberOfGenerations = 4  # number of generations that will be created

populationSize = (numberOfParents, numberOfParameters)
population = geneticXGboost.initilialize_poplulation(numberOfParents)
# One extra row/block holds the final population after the last generation
fitnessHistory = np.empty([numberOfGenerations+1, numberOfParents])
populationHistory = np.empty([(numberOfGenerations+1)*numberOfParents, numberOfParameters])
populationHistory[0:numberOfParents, :] = population

for generation in range(numberOfGenerations):
    print("This is number %s generation" % (generation))
    # Train one model per individual and record the F1 scores
    fitnessValue = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
    fitnessHistory[generation, :] = fitnessValue
    print('Best F1 score in this iteration = {}'.format(np.max(fitnessHistory[generation, :])))
    # Survival of the fittest, then crossover and mutation
    parents = geneticXGboost.new_parents_selection(population=population, fitness=fitnessValue, numParents=numberOfParentsMating)
    children = geneticXGboost.crossover_uniform(parents=parents, childrenSize=(populationSize[0] - parents.shape[0], numberOfParameters))
    children_mutated = geneticXGboost.mutation(children, numberOfParameters)
    # The next generation consists of the fittest parents followed by the mutated children
    population[0:parents.shape[0], :] = parents  # fittest parents
    population[parents.shape[0]:, :] = children_mutated  # children
    populationHistory[(generation+1)*numberOfParents : (generation+1)*numberOfParents + numberOfParents, :] = population
Finally, we obtain the best score and the associated parameters:
# Score the final population
fitness = geneticXGboost.train_population(population=population, dMatrixTrain=xgDMatrix, dMatrixtest=xgbDMatrixTest, y_test=y_test)
fitnessHistory[generation+1, :] = fitness
# Index of the best solution
bestFitnessIndex = int(np.argmax(fitness))
# Best fitness
print("Best fitness is =", fitness[bestFitnessIndex])
# Best parameters
print("Best parameters are:")
print('learning_rate', population[bestFitnessIndex][0])
print('n_estimators', population[bestFitnessIndex][1])
print('max_depth', int(population[bestFitnessIndex][2]))
print('min_child_weight', population[bestFitnessIndex][3])
print('gamma', population[bestFitnessIndex][4])
print('subsample', population[bestFitnessIndex][5])
print('colsample_bytree', population[bestFitnessIndex][6])
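Now let us visualize how the fitness of the population changes from generation to generation (figure below). Although two parents in the randomly generated initial population already started with a high F1 score (~0.98), we were able to improve on it further in the final generation. The lowest F1 score for a parent in the initial population was 0.9143, and the best score for a parent in the final generation was 0.9947. This suggests that a simple genetic algorithm implementation can improve XGBoost's performance metric.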
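Alongside a plot, the same trend can be read from a plain per-generation summary of the fitness history. The helper below is a hypothetical sketch (`summarize_fitness` is not part of the module), and the numbers passed to it are placeholders, not results from this article; in practice the rows of `fitnessHistory` would be passed in.

```python
def summarize_fitness(fitness_history):
    """Return (generation, worst, mean, best) for each row of fitness values."""
    summary = []
    for generation, scores in enumerate(fitness_history):
        scores = list(scores)
        summary.append((generation, min(scores), sum(scores) / len(scores), max(scores)))
    return summary


# Placeholder scores for two generations, NOT results from this article.
example_history = [[0.5, 0.7, 0.6], [0.7, 0.8, 0.75]]
for generation, worst, mean, best in summarize_fitness(example_history):
    print(f"generation {generation}: worst={worst:.4f} mean={mean:.4f} best={best:.4f}")
```

Because the fittest parents are carried into each new generation, the "best" column should never decrease; a drop there would point to a bug in the selection or bookkeeping code.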