Random Forest Classifier Tutorial: How to Use Tree-Based Algorithms for Machine Learning

Tree-based algorithms are popular machine learning methods used to solve supervised learning problems. These algorithms are flexible and can handle both classification and regression problems.

Tree-based algorithms make predictions by looking at the training samples in the region a new sample belongs to: they use the mean of those samples' targets for continuous targets and the mode for categorical targets. They also produce predictions with high accuracy and stability that are easy to interpret.
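
To make the idea concrete, here is a minimal sketch on made-up numbers: a depth-one regression tree assigns every sample in a region (leaf) the same prediction, namely the mean of the training targets in that region:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])

# A depth-1 tree splits the data into two regions (leaves)
tree = DecisionTreeRegressor(max_depth=1).fit(X, y)

# Any sample landing in the left region gets that region's mean target:
print(tree.predict([[1.5]]))  # -> [2.0], the mean of 1.0, 2.0 and 3.0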

Examples of Tree-based Algorithms

There are different tree-based algorithms that you can use, such as the following (see the comparison sketch after this list):

  • Decision Trees
  • Random Forest
  • Gradient Boosting
  • Bagging (Bootstrap Aggregation)
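
All four are available in scikit-learn behind the same fit/predict interface, which makes them easy to compare. A minimal sketch on synthetic data (the dataset and cross-validation settings here are illustrative):

# compare the four tree-based algorithms on the same synthetic data
from sklearn.datasets import make_classification
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier(),
    "Bagging": BaggingClassifier(),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f}")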

So every data scientist should learn these algorithms and use them in their machine learning projects.

In this article, you will learn more about the random forest algorithm. After completing this article, you should be proficient at using the random forest algorithm to build predictive models for classification problems with scikit-learn.

What is Random Forest?

Random forest is one of the most popular tree-based supervised learning algorithms. It is also one of the most flexible and easiest to use.

The algorithm can be used to solve both classification and regression problems. Random forest combines hundreds of decision trees, training each decision tree on a different bootstrap sample of the observations.

The final prediction of the random forest is made by aggregating the predictions of the individual trees: averaging them for regression, or taking a majority vote for classification.

The benefits of random forests are numerous. Individual decision trees tend to overfit the training data, but random forest mitigates that issue by averaging the prediction results from different trees. This gives random forests higher predictive accuracy than a single decision tree.

The random forest algorithm can also help you find the features that are important in your dataset. It forms the basis of the Boruta algorithm, which selects the important features in a dataset (a minimal sketch follows).
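
As an aside, here is a minimal sketch of Boruta on top of a random forest, using the third-party boruta package (pip install Boruta); the data is synthetic and the settings are illustrative:

# Boruta feature selection driven by a random forest
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=42)

rf = RandomForestClassifier(n_jobs=-1, max_depth=5)
selector = BorutaPy(rf, n_estimators="auto", random_state=42)
selector.fit(X, y)  # BorutaPy expects plain NumPy arrays

print(selector.support_)  # boolean mask of the selected features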

Random forest has been used in a variety of applications, for example to provide product recommendations to customers in e-commerce.

In medicine, a random forest algorithm can be used to identify a patient's disease by analyzing their medical records.

Also in the banking sector, it can be used to easily determine whether a customer's activity is fraudulent or legitimate.

How does the Random Forest algorithm work?

The random forest algorithm works by completing the following steps:

Step 1: The algorithm selects random samples from the dataset provided.

Step 2: The algorithm will create a decision tree for each sample selected. Then it will get a prediction result from each decision tree created.

Step 3: Voting will then be performed for every predicted result. For a classification problem it will use the mode, and for a regression problem it will use the mean.

Step 4: And finally, the algorithm will select the prediction result with the most votes as the final prediction (a sketch of all four steps follows).
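
Here is a minimal sketch of these four steps on synthetic data, using scikit-learn's DecisionTreeClassifier as the base learner (the ensemble size and dataset are illustrative):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=8, random_state=42)
rng = np.random.default_rng(42)

trees = []
for _ in range(25):
    # Step 1: draw a bootstrap sample (random rows, with replacement)
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree per sample
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# Step 3: collect one prediction per tree for every observation
all_preds = np.array([tree.predict(X) for tree in trees])

# Step 4: for a binary 0/1 target, averaging and thresholding at 0.5
# is the same as taking the majority vote (the mode)
final_pred = (all_preds.mean(axis=0) > 0.5).astype(int)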

Random Forest in Practice

Now that you know the ins and outs of the random forest algorithm, let's build a random forest classifier.

We will build a random forest classifier using the Pima Indians Diabetes dataset. The Pima Indians Diabetes dataset involves predicting the onset of diabetes within 5 years based on provided medical details. This is a binary classification problem.

Our task is to analyze the Pima Indians Diabetes dataset and create a model that predicts whether a particular patient is at risk of developing diabetes, given the other independent factors.

We will start by importing the important packages that we will use to load the dataset and create a random forest classifier. We will use the scikit-learn library to load and run the random forest algorithm.

# import important packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler

from matplotlib import rcParams
import warnings

warnings.filterwarnings("ignore")

# figure size in inches
rcParams["figure.figsize"] = 10, 6
np.random.seed(42)

Dataset

Then load the dataset from the data directory:

# Load dataset
data = pd.read_csv("../data/pima_indians_diabetes.csv")

Now we can look at a sample of the dataset.

# show sample of the dataset
data.sample(5)

As you can see, our dataset has a number of features with numerical values.

Let's go through the list of features in this dataset.

# show columns
data.columns

In this dataset, there are 8 input features and 1 output/target feature. Missing values are believed to be encoded as zero values. The meanings of the variable names are as follows, from the first feature to the last (a loading sketch with matching column names appears after the list):

  • Number of times pregnant.
  • Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
  • Diastolic blood pressure (mm Hg).
  • Triceps skinfold thickness (mm).
  • 2-hour serum insulin (mu U/ml).
  • Body mass index (weight in kg/(height in m)^2).
  • Diabetes pedigree function.
  • Age (years).
  • Class variable (0 or 1).
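
If your copy of the CSV ships without a header row (the raw UCI file does not include one), you can assign column names yourself when loading. In this sketch, only class and triceps_skinfold_thickness are required by the code later in the tutorial; the other names are illustrative:

# assign column names matching the feature list above (illustrative,
# except "class" and "triceps_skinfold_thickness", which are used later)
column_names = [
    "num_preg", "glucose_conc", "diastolic_bp",
    "triceps_skinfold_thickness", "insulin", "bmi",
    "diab_pred", "age", "class",
]
data = pd.read_csv("../data/pima_indians_diabetes.csv",
                   header=None, names=column_names)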

Then we split the dataset into independent features and the target feature. The target feature for this dataset is called class.

# split data into input and target variable(s)

X = data.drop("class", axis=1)
y = data["class"]

Preprocessing the Dataset

Before we create a model, we need to standardize our independent features by using the StandardScaler class from scikit-learn.

# standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
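
As a quick sanity check, standardization rescales each feature to z = (x - mean) / std, so every column of the transformed array should have roughly zero mean and unit variance:

# verify the effect of standardization
print(X_scaled.mean(axis=0).round(6))  # ~0 for every feature
print(X_scaled.std(axis=0).round(6))   # ~1 for every feature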

You can learn more about how and why to standardize your data in a separate article.

Splitting the Dataset into Training and Test Data

We now split our processed dataset into training and test data. The test data will be 10% of the entire processed dataset; passing stratify=y keeps the class proportions the same in both splits.

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, stratify=y, test_size=0.10, random_state=42
)

Building the Random Forest Classifier

Now it is time to create our random forest classifier and then train it on the training set. We will also set the number of trees in the forest (100) through the parameter called n_estimators.

# create the classifier
classifier = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
classifier.fit(X_train, y_train)

The output of the cell above shows the different parameter values of the random forest classifier used during training on the train data.

After training, we can perform prediction on the test data.

# prediction on the test set
y_pred = classifier.predict(X_test)

Then we check the accuracy using the actual and predicted values from the test data.

# Calculate Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.8051948051948052

Our accuracy is around 80.5%, which is good. But we can always make it better.

Identify Important Features

As I said before, we can also check the important features by using the feature_importances_ attribute of the fitted random forest classifier in scikit-learn.

# check Important features
feature_importances_df = pd.DataFrame(
    {"feature": list(X.columns), "importance": classifier.feature_importances_}
).sort_values("importance", ascending=False)

# Display
feature_importances_df

The table above shows the relative importance of each feature and its contribution to the model. We can also visualize these features and their scores using the seaborn and matplotlib libraries.

# visualize important features

# Creating a bar plot
sns.barplot(x=feature_importances_df.feature, y=feature_importances_df.importance)

# Add labels to the plot
plt.xlabel("Features")
plt.ylabel("Feature Importance Score")
plt.title("Visualizing Important Features")
plt.xticks(
    rotation=45, horizontalalignment="right", fontweight="light", fontsize="x-large"
)
plt.show()

From the figure above, you can see that the triceps_skinfold_thickness feature has low importance and does not contribute much to the prediction.

This means that we can remove this feature, train our random forest classifier again, and then see whether doing so improves performance on the test data.

# load data with selected features
X = data.drop(["class", "triceps_skinfold_thickness"], axis=1)
y = data["class"]

# standardize the dataset
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# split into train and test set
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, stratify=y, test_size=0.10, random_state=42
)

We will train the random forest algorithm with the selected processed features from our dataset, perform predictions, and then find the accuracy of the model.

# Create a Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100)

# Train the model using the training sets
clf.fit(X_train, y_train)

# prediction on test set
y_pred = clf.predict(X_test)

# Calculate Model Accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

Accuracy: 0.8181818181818182

Now the model accuracy has increased from 80.5% to 81.8% after we removed the least important feature, triceps_skinfold_thickness.

This suggests that it is very important to check feature importances and see whether you can remove the least important features to improve your model's performance.

Wrapping up

Tree-based algorithms are really important for every data scientist to learn. In this article, you've learned the basics of tree-based algorithms and how to create a classification model by using the random forest algorithm.

I also recommend you try other types of tree-based algorithms, such as the Extra-trees algorithm (a minimal example is sketched below).
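
Since scikit-learn's ExtraTreesClassifier shares the RandomForestClassifier interface, trying it is a one-line change. A minimal sketch reusing the variables defined earlier:

from sklearn.ensemble import ExtraTreesClassifier

# Extra-trees: like random forest, but split thresholds are chosen at
# random, which often reduces variance a little further
extra_clf = ExtraTreesClassifier(n_estimators=100, random_state=42)
extra_clf.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, extra_clf.predict(X_test)))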

You can download the dataset and notebook used in this article here: https://github.com/Davisy/Random-Forest-classification-Tutorial

Congratulations, you have made it to the end of this article!

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post! I can also be reached on Twitter @Davis_McDavid.

Originally published at: https://www.freecodecamp.org/news/how-to-use-the-tree-based-algorithm-for-machine-learning/
