Improving Performance of ML Models

Performance Improvement with Ensembles

Ensembles can give us a boost in machine learning results by combining several models. Basically, an ensemble model consists of several individually trained supervised learning models whose results are merged in various ways to achieve better predictive performance than any single model. Ensemble methods can be divided into the following two groups −

Sequential ensemble methods

As the name implies, in this kind of ensemble method the base learners are generated sequentially. The motivation of such methods is to exploit the dependency among base learners.

Parallel ensemble methods

As the name implies, in this kind of ensemble method the base learners are generated in parallel. The motivation of such methods is to exploit the independence among base learners.

Ensemble Learning Methods

The following are the most popular ensemble learning methods, i.e. the methods for combining the predictions from different models −

Bagging

The term bagging is also known as bootstrap aggregation. In bagging methods, the ensemble model tries to improve prediction accuracy and decrease model variance by combining the predictions of individual models trained over randomly generated training samples. The final prediction of the ensemble model is given by averaging the predictions of all the individual estimators. One of the best examples of bagging methods is random forest.
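
To make the mechanics concrete, here is a minimal hand-rolled sketch of bootstrap aggregation; the toy dataset from make_classification and the choice of 25 trees are assumptions purely for illustration −

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import resample

X, y = make_classification(n_samples=500, random_state=7)   # toy data (assumption)
trees = []
for i in range(25):
   # draw a bootstrap sample, i.e. sample the rows with replacement
   Xb, yb = resample(X, y, random_state=i)
   trees.append(DecisionTreeClassifier().fit(Xb, yb))

# average the individual 0/1 predictions; rounding gives the majority vote
votes = np.mean([t.predict(X) for t in trees], axis=0)
y_pred = (votes >= 0.5).astype(int)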

Boosting

In the boosting method, the main principle of building the ensemble model is to build it incrementally by training each base model estimator sequentially. As the name suggests, it basically combines several weak base learners, trained sequentially over multiple iterations of the training data, to build a powerful ensemble. During the training of the weak base learners, higher weights are assigned to the training instances that were misclassified earlier, so that subsequent learners focus on them. An example of a boosting method is AdaBoost.
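
To make the reweighting idea concrete, here is a minimal sketch of one boosting round; the exponential update shown is the classic AdaBoost rule, and the toy data is an assumption for illustration only −

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=7)
y_signed = np.where(y == 1, 1, -1)        # AdaBoost works with labels in {-1, +1}
w = np.full(len(y), 1 / len(y))           # start from uniform sample weights

stump = DecisionTreeClassifier(max_depth=1).fit(X, y_signed, sample_weight=w)
pred = stump.predict(X)
err = np.sum(w[pred != y_signed])         # weighted error of this weak learner
alpha = 0.5 * np.log((1 - err) / err)     # this learner's say in the final vote
w = w * np.exp(-alpha * y_signed * pred)  # misclassified rows get larger weights
w = w / np.sum(w)                         # renormalize for the next round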

Voting

In this ensemble learning model, multiple models of different types are built and some simple statistics, like the mean or the median, are used to combine their predictions. If needed, this combined prediction can also serve as an additional input for training a further model that makes the final prediction (an approach known as stacking).

Bagging Ensemble Algorithms

The following are three bagging ensemble algorithms −


Bagged Decision Tree

As we know, bagging ensemble methods work well with algorithms that have high variance and, in this regard, the best one is the decision tree algorithm. In the following Python recipe, we are going to build a bagged decision tree ensemble model by using the BaggingClassifier class of sklearn with DecisionTreeClassifier (a classification and regression trees algorithm) on the Pima Indians diabetes dataset.

First, import the required packages as follows −



from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −



path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]   # the first 8 columns are the input features
Y = array[:,8]     # the last column is the class label

Next, set up 10-fold cross-validation as follows −


seed = 7
# random_state only takes effect when shuffle=True (newer scikit-learn enforces this)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
cart = DecisionTreeClassifier()

We need to provide the number of trees we are going to build. Here we are building 150 trees −



num_trees = 150

Next, build the model with the help of the following script −



# note: in scikit-learn >= 1.2 this argument is named estimator instead of base_estimator
model = BaggingClassifier(base_estimator=cart, n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows −



results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output


0.7733766233766234

The output above shows that we got around 77% accuracy with our bagged decision tree classifier model (the exact value depends on the scikit-learn version and how the folds are shuffled).
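
Note that cross_val_score only estimates accuracy; to actually use the ensemble you still fit it on the full data. A minimal sketch, where the sample record is made up purely for illustration −

model.fit(X, Y)
# predict the class for one new patient record (illustrative values only)
print(model.predict([[6, 148, 72, 35, 0, 33.6, 0.627, 50]]))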

Random Forest

It is an extension of bagged decision trees. For the individual classifiers, the samples of the training dataset are drawn with replacement, but the trees are constructed in a way that reduces the correlation between them. Specifically, a random subset of the features is considered for choosing each split point, rather than greedily searching all features for the best split while constructing each tree.

In the following Python recipe, we are going to build a bagged random forest ensemble model by using the RandomForestClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −



from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −


path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, set up 10-fold cross-validation as follows −


seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features −



num_trees = 150
max_features = 5

Next, build the model with the help of the following script −



model = RandomForestClassifier(n_estimators=num_trees, max_features=max_features)

Calculate and print the result as follows −



results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output


0.7629357484620642

The output above shows that we got around 76% accuracy with our bagged random forest classifier model.
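
A fitted random forest also reports how much each feature contributed to its splits, which is often worth inspecting; a minimal sketch using the column names loaded above −

model.fit(X, Y)
# print the relative importance of each input feature
for name, score in zip(headernames[:8], model.feature_importances_):
   print(name, round(score, 3))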

Extra Trees

It is another extension of the bagged decision tree ensemble method. In this method, randomized trees are constructed from samples of the training dataset; compared with a random forest, the split thresholds are chosen at random rather than fully optimized, which further reduces the correlation between the trees.

In the following Python recipe, we are going to build an extra trees ensemble model by using the ExtraTreesClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −



from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import ExtraTreesClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −


path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, set up 10-fold cross-validation as follows −


seed = 7
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 150 trees with split points chosen from 5 features −



num_trees = 150
max_features = 5

Next, build the model with the help of the following script −



model = ExtraTreesClassifier(n_estimators=num_trees, max_features=max_features)

Calculate and print the result as follows −



results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output


0.7551435406698566

The output above shows that we got around 75.5% accuracy with our extra trees classifier model.

Boosting Ensemble Algorithms

The following are the two most common boosting ensemble algorithms −

AdaBoost

It is one of the most successful boosting ensemble algorithms. The key to this algorithm is the way it assigns weights to the instances in the dataset: instances that were misclassified by earlier models receive higher weights, so subsequent models pay more attention to them, while the weights of correctly classified instances shrink.

In the following Python recipe, we are going to build an AdaBoost ensemble model for classification by using the AdaBoostClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −



from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import AdaBoostClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −


path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, set up 10-fold cross-validation as follows −


seed = 5
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 50 trees −


num_trees = 50

Next, build the model with the help of the following script −



model = AdaBoostClassifier(n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows −



results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output


0.7539473684210527

The output above shows that we got around 75% accuracy with our AdaBoost classifier ensemble model.
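
AdaBoost also exposes a learning_rate parameter that shrinks each learner's contribution; lowering it usually calls for more estimators, and the two are commonly tuned together. A hedged variation on the model above (the values shown are arbitrary starting points, not tuned ones) −

model = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())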

Stochastic Gradient Boosting

It is also called Gradient Boosting Machines. In the following Python recipe, we are going to build a Stochastic Gradient Boosting ensemble model for classification by using the GradientBoostingClassifier class of sklearn on the Pima Indians diabetes dataset.

First, import the required packages as follows −



from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −


path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, set up 10-fold cross-validation as follows −


seed = 5
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

We need to provide the number of trees we are going to build. Here we are building 50 trees −


num_trees = 50

Next, build the model with the help of the following script −



model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)

Calculate and print the result as follows −



results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

Output


0.7746582365003418

The output above shows that we got around 77.5% accuracy with our Gradient Boosting classifier ensemble model.
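
Strictly speaking, the "stochastic" part of Stochastic Gradient Boosting refers to fitting each tree on a random subsample of the rows; in scikit-learn this is controlled by the subsample parameter, whose default of 1.0 uses all rows. A hedged tweak of the model above (0.8 is an illustrative value, not a tuned one) −

# subsample < 1.0 makes the boosting stochastic (row subsampling per tree)
model = GradientBoostingClassifier(n_estimators=num_trees, subsample=0.8, random_state=seed)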

Voting Ensemble Algorithms

As discussed, voting first creates two or more standalone models from the training dataset, and then a voting classifier wraps these models and combines (for example, averages) their predictions whenever it is asked to predict on new data.

In the following Python recipe, we are going to build a voting ensemble model for classification by using the VotingClassifier class of sklearn on the Pima Indians diabetes dataset. We are combining the predictions of logistic regression, a decision tree classifier and an SVM for a classification problem as follows −

First, import the required packages as follows −



from pandas import read_csv
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import VotingClassifier

Now, we need to load the Pima diabetes dataset as we did in the previous examples −


path = r"C:\pima-indians-diabetes.csv"
headernames = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = read_csv(path, names=headernames)
array = data.values
X = array[:,0:8]
Y = array[:,8]

Next, set up 10-fold cross-validation as follows −


kfold = KFold(n_splits=10, shuffle=True, random_state=7)

Next, we need to create sub-models as follows −



estimators = []
# max_iter raised so the default lbfgs solver converges on this unscaled data
model1 = LogisticRegression(max_iter=1000)
estimators.append(('logistic', model1))
model2 = DecisionTreeClassifier()
estimators.append(('cart', model2))
model3 = SVC()
estimators.append(('svm', model3))

Now, create the voting ensemble model by combining the predictions of the sub-models created above −


ensemble = VotingClassifier(estimators)
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())

Output


0.7382262474367738

The output above shows that we got around 74% accuracy with our voting classifier ensemble model.
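
If the sub-models can produce class probabilities, averaging those probabilities ("soft" voting) instead of the hard class votes often helps; a minimal sketch, noting that SVC needs probability=True for this, which makes training slower −

model3 = SVC(probability=True)              # soft voting requires predict_proba
estimators[-1] = ('svm', model3)
ensemble = VotingClassifier(estimators, voting='soft')
results = cross_val_score(ensemble, X, Y, cv=kfold)
print(results.mean())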

Translated from: https://www.tutorialspoint.com/machine_learning_with_python/machine_learning_improving_performance_of_ml_models.htm
