In this article, we will apply machine learning to the Kaggle Pima Indians Diabetes dataset (https://www.kaggle.com/uciml/pima-indians-diabetes-database).
First, import the Python packages:
import pandas as pd
import numpy as np
import keras
Read in the dataset:
df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
Inspect the dataframe:
df.shape
The shape is (768, 9): there are 768 samples and 9 columns.
df.columns
The columns are: ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'].
df.describe()
The count for every column is 768, which indicates that there are no missing values. The mean of 'Outcome' is 0.35, which indicates that there are more samples with 'Outcome' = 0 than with 'Outcome' = 1 in this dataset.
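Although the counts already suggest the data is complete, this can be confirmed directly with a one-line pandas check:

df.isnull().sum()   # returns 0 for every column: no NaN values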
Convert the dataframe 'df' into the NumPy array 'dataset':
dataset = df.values
Split 'dataset' into the features X and the target y:
X = dataset[:, 0:8]
y = dataset[:, 8].astype('int')
Standardization
The column means differ greatly from one another, so we standardize the dataset to ensure that no feature is given undue weight.
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X)
X_standardized = scaler.transform(X)
Now let's look at the mean and standard deviation of 'X_standardized':
pd.DataFrame(X_standardized).describe()
The mean of every column is approximately 0 and the standard deviation of every column is approximately 1, so the data has been standardized.
Tuning the Hyperparameters: Batch Size and Epochs
from sklearn.model_selection import GridSearchCV, KFold
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasClassifier
from keras.optimizers import Adam
Define the network architecture and the optimization algorithm. The neural network consists of one input layer, two hidden layers with ReLU activation functions, and one output layer with a sigmoid activation function. Adam is chosen as the optimizer.
We run a grid search over two hyperparameters: 'batch_size' and 'epochs'. The cross-validation scheme is k-fold with KFold's default number of splits (k = 3 in the scikit-learn version used here; k = 5 from scikit-learn 0.22 onward). Accuracy scores are computed.
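The grid-search code below calls create_model, so we first define it following the architecture just described. This definition is a reconstruction that mirrors the parameterized versions used in the later sections:

# Defining the model (8 input features, ReLU hidden layers of 8 and 4 units, sigmoid output)
def create_model():
    model = Sequential()
    model.add(Dense(8, input_dim=8, kernel_initializer='normal', activation='relu'))
    model.add(Dense(4, input_dim=8, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    adam = Adam(lr=0.001)  # learning rate is tuned in a later section
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model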
# Create the model
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define the grid search parameters
batch_size = [10, 20, 40]
epochs = [10, 50, 100]

# Make a dictionary of the grid search parameters
param_grid = dict(batch_size=batch_size, epochs=epochs)

# Build and fit the GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grid, cv=KFold(), verbose=10)
grid_result = grid.fit(X_standardized, y)
Print the best accuracy score and the best hyperparameter values:
# Summarize the results
print('Best : {}, using {}'.format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('{},{} with: {}'.format(mean, stdev, param))
The best accuracy score, 0.7604, is obtained with 'batch_size' = 40 and 'epochs' = 10, so we fix these values while tuning the remaining hyperparameters.
Tuning the Hyperparameters: Learning Rate and Dropout Rate
The learning rate plays an important role in the optimization algorithm. If it is too large, the algorithm may never find a local optimum. If it is too small, the algorithm may need many iterations to converge, leading to high computational cost and long training times. We therefore want a learning rate small enough for the algorithm to converge, yet large enough to speed up convergence. The learning rate also interacts with early stopping, a regularization technique in which training continues only as long as accuracy on a held-out validation set keeps improving.
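Early stopping is not used in the grid searches below, but for reference, here is a minimal sketch of how it could be wired up (assuming `model` is a Keras model or KerasClassifier, and a Keras version that supports restore_best_weights; validation_split = 0.2 is an arbitrary choice):

from keras.callbacks import EarlyStopping

# Stop training once validation loss has not improved for 5 epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

model.fit(X_standardized, y,
          validation_split=0.2,      # hold out 20% of the data for validation
          epochs=100, batch_size=40,
          callbacks=[early_stop])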
Dropout is a regularization technique that lowers the model's complexity and thus prevents overfitting to the training data. The dropout rate takes a value between 0 and 1: 0 means no activation units are dropped, while 1 means all activation units are dropped.
from keras.layers import Dropout

# Defining the model
def create_model(learning_rate, dropout_rate):
    model = Sequential()
    model.add(Dense(8, input_dim=8, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(4, input_dim=8, kernel_initializer='normal', activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    adam = Adam(lr=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

# Create the model
model = KerasClassifier(build_fn=create_model, verbose=0, batch_size=40, epochs=10)

# Define the grid search parameters
learning_rate = [0.001, 0.01, 0.1]
dropout_rate = [0.0, 0.1, 0.2]

# Make a dictionary of the grid search parameters
param_grids = dict(learning_rate=learning_rate, dropout_rate=dropout_rate)

# Build and fit the GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grids, cv=KFold(), verbose=10)
grid_result = grid.fit(X_standardized, y)

# Summarize the results
print('Best : {}, using {}'.format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('{},{} with: {}'.format(mean, stdev, param))
The best accuracy score, 0.7695, is obtained with 'dropout_rate' = 0.1 and 'learning_rate' = 0.001, so we fix these values while tuning the remaining hyperparameters.
Tuning the Hyperparameters: Activation Function and Kernel Initializer
Activation functions introduce non-linearity into the neural network, enabling it to model complex non-linear mappings between inputs and outputs. If we did not apply an activation function, the output would simply be a linear function of the input.
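To make the non-linearity concrete, here is a small illustrative sketch (NumPy only; the sample array is made up) of how three of the activations searched below transform the same inputs:

import numpy as np

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = np.maximum(0, x)          # clips negatives to 0: [0. 0. 0. 0.5 2.]
tanh = np.tanh(x)                # squashes into (-1, 1): roughly [-0.96 -0.46 0. 0.46 0.96]
sigmoid = 1 / (1 + np.exp(-x))   # squashes into (0, 1): roughly [0.12 0.38 0.5 0.62 0.88]

The grid below also tries 'softmax', which normalizes a vector into a probability distribution, and 'linear', which leaves its input unchanged.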
A neural network has to start from some set of weights and then iteratively update them to better values. The kernel initializer determines the statistical distribution or function used to initialize those weights.
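For reference, the initializer strings used in the grid below ('uniform', 'normal', 'zero') are, in the standalone Keras version used here, shorthand for the following initializer classes (the keyword values shown are their defaults):

from keras import initializers

init_uniform = initializers.RandomUniform(minval=-0.05, maxval=0.05)
init_normal = initializers.RandomNormal(mean=0.0, stddev=0.05)
init_zero = initializers.Zeros()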
# Defining the model
def create_model(activation_function, init):
    model = Sequential()
    model.add(Dense(8, input_dim=8, kernel_initializer=init, activation=activation_function))
    model.add(Dropout(0.1))
    model.add(Dense(4, input_dim=8, kernel_initializer=init, activation=activation_function))
    model.add(Dropout(0.1))
    model.add(Dense(1, activation='sigmoid'))
    adam = Adam(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

# Create the model
model = KerasClassifier(build_fn=create_model, verbose=0, batch_size=40, epochs=10)

# Define the grid search parameters
activation_function = ['softmax', 'relu', 'tanh', 'linear']
init = ['uniform', 'normal', 'zero']

# Make a dictionary of the grid search parameters
param_grids = dict(activation_function=activation_function, init=init)

# Build and fit the GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grids, cv=KFold(), verbose=10)
grid_result = grid.fit(X_standardized, y)

# Summarize the results
print('Best : {}, using {}'.format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('{},{} with: {}'.format(mean, stdev, param))
The best accuracy score, 0.7591, is obtained with 'activation_function' = 'tanh' and 'init' = 'uniform', so we fix these values while tuning the remaining hyperparameters.
Tuning the Hyperparameters: Number of Neurons in the Hidden Layers
The complexity of the model must match the complexity of the data. The number of neurons in the hidden layers determines the model's complexity: the more neurons, the more complex the non-linear mapping the network can represent between inputs and outputs.
# Defining the model
def create_model(neuron1, neuron2):
    model = Sequential()
    model.add(Dense(neuron1, input_dim=8, kernel_initializer='uniform', activation='tanh'))
    model.add(Dropout(0.1))
    model.add(Dense(neuron2, input_dim=neuron1, kernel_initializer='uniform', activation='tanh'))
    model.add(Dropout(0.1))
    model.add(Dense(1, activation='sigmoid'))
    adam = Adam(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

# Create the model
model = KerasClassifier(build_fn=create_model, verbose=0, batch_size=40, epochs=10)

# Define the grid search parameters
neuron1 = [4, 8, 16]
neuron2 = [2, 4, 8]

# Make a dictionary of the grid search parameters
param_grids = dict(neuron1=neuron1, neuron2=neuron2)

# Build and fit the GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grids, cv=KFold(), verbose=10)
grid_result = grid.fit(X_standardized, y)

# Summarize the results
print('Best : {}, using {}'.format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('{},{} with: {}'.format(mean, stdev, param))
The best accuracy score, 0.7591, is obtained with 16 neurons in the first hidden layer and 4 neurons in the second.
The best hyperparameter values are:
Batch size = 40
Epochs = 10
Dropout rate = 0.1
Learning rate = 0.001
Activation function = tanh
Kernel Initializer = uniform
No. of neurons in layer 1 = 16
No. of neurons in layer 2 = 4
Training the Model with the Best Hyperparameter Values
Train the deep learning model with the best hyperparameter values found in the previous sections.
from sklearn.metrics import classification_report, accuracy_score

# Defining the model
def create_model():
    model = Sequential()
    model.add(Dense(16, input_dim=8, kernel_initializer='uniform', activation='tanh'))
    model.add(Dropout(0.1))
    model.add(Dense(4, input_dim=16, kernel_initializer='uniform', activation='tanh'))
    model.add(Dropout(0.1))
    model.add(Dense(1, activation='sigmoid'))
    adam = Adam(lr=0.001)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

# Create the model
model = KerasClassifier(build_fn=create_model, verbose=0, batch_size=40, epochs=10)

# Fitting the model
model.fit(X_standardized, y)

# Predicting using trained model
y_predict = model.predict(X_standardized)

# Printing the metrics
print(accuracy_score(y, y_predict))
print(classification_report(y, y_predict))
The accuracy is 77.6%, with F1-scores of 0.84 (class 0) and 0.65 (class 1).
Performance could be improved further by searching for the optimal values of all the hyperparameters at once with the Python snippet below. Note: this process is computationally very expensive.
# Defining the model
def create_model(learning_rate, dropout_rate, activation_function, init, neuron1, neuron2):
    model = Sequential()
    model.add(Dense(neuron1, input_dim=8, kernel_initializer=init, activation=activation_function))
    model.add(Dropout(dropout_rate))
    model.add(Dense(neuron2, input_dim=neuron1, kernel_initializer=init, activation=activation_function))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    adam = Adam(lr=learning_rate)
    model.compile(loss='binary_crossentropy', optimizer=adam, metrics=['accuracy'])
    return model

# Create the model
model = KerasClassifier(build_fn=create_model, verbose=0)

# Define the grid search parameters
batch_size = [10, 20, 40]
epochs = [10, 50, 100]
learning_rate = [0.001, 0.01, 0.1]
dropout_rate = [0.0, 0.1, 0.2]
activation_function = ['softmax', 'relu', 'tanh', 'linear']
init = ['uniform', 'normal', 'zero']
neuron1 = [4, 8, 16]
neuron2 = [2, 4, 8]

# Make a dictionary of the grid search parameters
param_grids = dict(batch_size=batch_size, epochs=epochs, learning_rate=learning_rate,
                   dropout_rate=dropout_rate, activation_function=activation_function,
                   init=init, neuron1=neuron1, neuron2=neuron2)

# Build and fit the GridSearchCV
grid = GridSearchCV(estimator=model, param_grid=param_grids, cv=KFold(), verbose=10)
grid_result = grid.fit(X_standardized, y)

# Summarize the results
print('Best : {}, using {}'.format(grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print('{},{} with: {}'.format(mean, stdev, param))
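The grid above covers 3 × 3 × 3 × 3 × 4 × 3 × 3 × 3 = 8748 parameter combinations, each trained k times under cross-validation. If that is too expensive, scikit-learn's RandomizedSearchCV samples a fixed number of combinations from the same search space; a minimal sketch (the budget n_iter = 50 and the random_state value are arbitrary choices):

from sklearn.model_selection import RandomizedSearchCV

# Sample 50 random combinations instead of exhaustively evaluating all 8748
random_search = RandomizedSearchCV(estimator=model, param_distributions=param_grids,
                                   n_iter=50, cv=KFold(), verbose=10, random_state=7)
random_search_result = random_search.fit(X_standardized, y)
print('Best : {}, using {}'.format(random_search_result.best_score_,
                                   random_search_result.best_params_))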