Multi-Class classification with Sci-kit learn & XGBoost: A case study using Brainwave data


by Avishek Nag (Machine Learning expert)



A comparison of different classifiers’ accuracy & performance for high-dimensional data

In Machine learning, classification problems with high-dimensional data are really challenging. Sometimes, very simple problems become extremely complex due to this ‘curse of dimensionality’ problem.


In this article, we will see how accuracy and performance vary across different classifiers. We will also see how, when we don’t have the freedom to choose a classifier independently, we can do feature engineering to make a poor classifier perform well.


Understanding the ‘datasource’ & problem formulation

For this article, we will use the “EEG Brainwave Dataset” from Kaggle. This dataset contains electronic brainwave signals from an EEG headset and is in temporal format. At the time of writing this article, nobody has created any ‘Kernel’ on this dataset — that is, as of now, no solution has been given in Kaggle.


So, to start with, let’s first read the data to see what’s there.

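A minimal sketch of the loading step, assuming the Kaggle CSV has been saved locally as 'emotions.csv' (the file name is an assumption, not the author's original code):

import pandas as pd

# Load the EEG Brainwave dataset; the file name 'emotions.csv' is an assumption.
brainwave_df = pd.read_csv('emotions.csv')
print(brainwave_df.shape)   # expect 2549 columns, including the 'label' target
brainwave_df.head()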

There are 2549 columns in the dataset and ‘label’ is the target column for our classification problem. All other columns like ‘mean_d_1_a’, ‘mean_d2_a’ etc. describe features of the brainwave signal readings. Columns starting with the ‘fft’ prefix are most probably ‘Fast Fourier transforms’ of the original signals. Our target column ‘label’ describes the degree of emotional sentiment.


As per Kaggle, here is the challenge: “Can we predict emotional sentiment from brainwave readings?”


Let’s first understand class distributions from column ‘label’:

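A minimal sketch of how the class distribution can be inspected; the bar chart referenced below can be produced with seaborn (the exact plotting code is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

# Count how many samples fall into each sentiment class.
print(brainwave_df['label'].value_counts())

# Bar chart of the class distribution.
plt.figure(figsize=(8, 4))
sns.countplot(x='label', data=brainwave_df)
plt.title('Class distribution')
plt.show()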

So, there are three classes, ‘POSITIVE’, ‘NEGATIVE’ & ‘NEUTRAL’, for emotional sentiment. From the bar chart, it is clear that class distribution is not skewed and it is a ‘multi-class classification’ problem with target variable ‘label’. We will try with different classifiers and see the accuracy levels.


Before applying any classifier, the column ‘label’ should be separated out from other feature columns (‘mean_d_1_a’, ‘mean_d2_a’ etc are features).


label_df = brainwave_df['label']
brainwave_df.drop('label', axis=1, inplace=True)
brainwave_df.head()

As it is a ‘classification’ problem, we will follow the below conventions for each ‘classifier’ to be tried:


  1. We will use a ‘cross validation’ approach (in our case, 10-fold cross validation) over the dataset and take the average accuracy. This will give us a holistic view of the classifier’s accuracy.

  2. We will use a ‘Pipeline’ based approach to combine all pre-processing and the main classifier computation. An ML ‘Pipeline’ wraps all processing stages in a single unit and acts as a ‘classifier’ itself. By this, all stages become re-usable and can also be used to form other ‘pipelines’.

  3. We will track the total time for building & testing each approach. We will call this ‘time taken’. A minimal sketch of this evaluation convention appears just before the first classifier below.

For the above, we will primarily use the scikit-learn package from Python. As the number of features here is quite high, we will start with a classifier which works well on high-dimensional data.

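Below is a minimal sketch of that evaluation convention: 10-fold cross validation on a Pipeline (or bare classifier), reporting the mean accuracy and the wall-clock ‘time taken’. The helper name ‘evaluate’ and its exact form are assumptions rather than the author's original code; each classifier below then becomes a single call such as evaluate(pipeline, brainwave_df, label_df).

import time
from sklearn.model_selection import cross_val_score

def evaluate(pipeline, X, y, cv=10):
    # Run k-fold cross validation on the given pipeline/classifier and
    # report mean accuracy together with the wall-clock 'time taken'.
    start = time.time()
    scores = cross_val_score(pipeline, X, y, cv=cv)
    elapsed = time.time() - start
    print(f'Mean accuracy: {scores.mean():.4f} | time taken: {elapsed:.2f} s')
    return scores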

RandomForest Classifier

‘RandomForest’ is a tree & bagging approach-based ensemble classifier. It will automatically reduce the number of features by its probabilistic entropy calculation approach. Let’s see that:

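A minimal sketch of the RandomForest run, using the ‘evaluate’ helper sketched earlier; the hyper-parameters shown (n_estimators, random_state) are assumptions rather than the author's exact settings:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# No scaling stage is needed; the pipeline has a single estimator stage.
rf_pipeline = Pipeline([('rf', RandomForestClassifier(n_estimators=100, random_state=42))])
evaluate(rf_pipeline, brainwave_df, label_df)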

Accuracy is very good at 97.7% and ‘total time taken’ is quite short (3.29 seconds only).


For this classifier, no pre-processing stages like scaling or noise removal are required, as it is completely probability-based and not at all affected by scale factors.


Logistic Regression Classifier

‘Logistic Regression’ is a linear classifier and works in the same way as linear regression.

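A minimal sketch of the Logistic Regression pipeline, with the ‘StandardScaler’ preprocessing stage and the ‘saga’/L1/max_iter=200 settings mentioned later in this section; any other details are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale first, then fit a logistic regression on all feature columns.
lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(solver='saga', penalty='l1', max_iter=200)),
])
evaluate(lr_pipeline, brainwave_df, label_df)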

We can see accuracy (93.19%) is lower than ‘RandomForest’ and ‘time taken’ is higher (2 min 7s).


‘Logistic Regression’ is heavily affected by different value ranges across the input variables, and thus forces ‘feature scaling’. That’s why ‘StandardScaler’ from scikit-learn has been added as a preprocessing stage. It standardises each feature to zero mean and unit variance, so all variables end up on a comparable scale.


The reason for the high ‘time taken’ is the high dimensionality and the scaling time required. There are 2549 variables in the dataset and the coefficient of each one has to be optimised by the Logistic Regression process. There is also the question of multi-collinearity: linearly correlated variables should be grouped together instead of being considered separately.


The presence of multi-collinearity affects accuracy. So now the question becomes: “Can we reduce the number of variables, reduce multi-collinearity, and improve the ‘time taken’?”


Principal Component Analysis (PCA)

PCA transforms the original variables into a new space of uncorrelated ‘Principal Components’ and thus reduces the number of required variables. All co-linear variables get clubbed together. Let’s do a PCA of the data and see what the main PCs are:

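A minimal sketch of the PCA step; whether the original notebook scaled the data before PCA is not shown here, so the scaling stage below is an assumption:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardise the features, then project them onto 20 principal components.
scaled_features = StandardScaler().fit_transform(brainwave_df)
pca = PCA(n_components=20)
pca_vectors = pca.fit_transform(scaled_features)

# Fraction of the total variance explained by each component.
print(pca.explained_variance_ratio_)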

We mapped 2549 variables to 20 Principal Components. From the above result, it is clear that the first 10 PCs are the ones that matter. Their total explained variance ratio is around 0.737 (0.36 + 0.095 + .. + 0.012); in other words, the first 10 PCs explain about 73.7% of the variance of the entire dataset.


So, with this we are able to reduce 2549 variables to just 10. That’s a dramatic change, isn’t it? Principal Components are virtual variables generated by a mathematical mapping. From a business angle, it is not possible to tell which physical aspect of the data each of them covers; physically, the Principal Components don’t exist. But we can easily use these PCs as quantitative input variables to any ML algorithm and get very good results.


For visualisation, let’s take the first two PCs and see how we can distinguish the different classes of the data using a ‘scatterplot’.


plt.figure(figsize=(25, 8))
sns.scatterplot(x=pca_vectors[:, 0], y=pca_vectors[:, 1], hue=label_df)
plt.title('Principal Components vs Class distribution', fontsize=16)
plt.ylabel('Principal Component 2', fontsize=16)
plt.xlabel('Principal Component 1', fontsize=16)
plt.xticks(rotation='vertical');

In the above plot, the three classes are shown in different colours. So, if we use the same ‘Logistic Regression’ classifier with these two PCs, then from the above plot we can probably say that the first internal classifier will separate out ‘NEUTRAL’ cases from the other two, and the second will separate out ‘POSITIVE’ & ‘NEGATIVE’ cases (as there will be two internal logistic classifiers for a 3-class problem). Let’s try it and see the accuracy.

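A minimal sketch of the same Logistic Regression, now preceded by a 2-component PCA inside the pipeline (again reusing the hedged ‘evaluate’ helper from earlier):

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Scale, reduce to the first 2 principal components, then classify.
lr_pca2_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=2)),
    ('lr', LogisticRegression(solver='saga', penalty='l1', max_iter=200)),
])
evaluate(lr_pca2_pipeline, brainwave_df, label_df)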

Time taken (3.34 s) was reduced but accuracy (77%) decreased.


Now, let’s take all 10 PCs and run:

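The same sketch, reusing the imports from the previous block, with n_components raised to 10:

# Identical pipeline, but keeping the first 10 principal components.
lr_pca10_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10)),
    ('lr', LogisticRegression(solver='saga', penalty='l1', max_iter=200)),
])
evaluate(lr_pca10_pipeline, brainwave_df, label_df)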

We see an improvement in accuracy (86%) compared to the 2-PC case, with only a marginal increase in ‘time taken’.


So, in both cases we saw lower accuracy than plain Logistic Regression, but a significant improvement in ‘time taken’.


Accuracy can be further tested with different ‘solver’ & ‘max_iter’ parameters. We used ‘saga’ as the ‘solver’ with an L1 penalty and 200 as ‘max_iter’. These values can be changed to see their effect on accuracy.


Though ‘Logistic Regression’ gives lower accuracy here, there are situations where it may be needed, especially together with PCA. For datasets with a very large dimensional space, PCA becomes the obvious choice for ‘linear classifiers’.


In some cases, where a benchmark for ML applications is already defined and only limited choices of some ‘linear classifiers’ are available, this analysis would be helpful. It is very common to see such situations in large organisations where standards are already defined and it is not possible to go beyond them.


Artificial Neural Network Classifier (ANN)

An ANN classifier is non-linear with automatic feature engineering and dimensional reduction techniques. ‘MLPClassifier’ in scikit-learn works as an ANN. But here also, basic scaling is required for the data. Let’s see how it works:

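A minimal sketch of the ANN run, with scaling followed by scikit-learn's ‘MLPClassifier’ using the (1275, 637) hidden layer sizes discussed below; all other hyper-parameters are left at their defaults, which is an assumption:

from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two hidden layers: roughly 50% of the input size, then 50% of the previous layer.
mlp_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('mlp', MLPClassifier(hidden_layer_sizes=(1275, 637))),
])
evaluate(mlp_pipeline, brainwave_df, label_df)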

Accuracy (97.5%) is very good, though running time is high (5 min).


The reason for the high ‘time taken’ is the lengthy training required by neural networks, made worse here by the high number of dimensions.


It is a general convention to start with a hidden layer size of about 50% of the number of input features, with each subsequent layer at 50% of the previous one. In our case these sizes are (1275 ≈ 2549 / 2, 637 ≈ 1275 / 2). The number of hidden layers can be treated as a hyper-parameter and tuned for better accuracy; in our case it is 2.


Linear Support Vector Machines Classifier (SVM)

We will now apply ‘Linear SVM’ to the data and see how the accuracy comes out. Here, too, scaling is required as a preprocessing stage.

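A minimal sketch of the linear SVM run; ‘LinearSVC’ is used here as one reasonable choice, though the original notebook may have used SVC(kernel='linear') instead:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Scaling matters for SVMs, so it is kept as the first pipeline stage.
svm_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC()),
])
evaluate(svm_pipeline, brainwave_df, label_df)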

Accuracy comes in at 96.4%, which is a little less than ‘RandomForest’ or ‘ANN’. The ‘time taken’ is 55 s, which is far better than ‘ANN’.


Extreme Gradient Boosting Classifier (XGBoost)

XGBoost is a boosted-tree-based ensemble classifier. Like ‘RandomForest’, it will also automatically reduce the feature set. For this we have to use the separate ‘xgboost’ library, which does not come with scikit-learn. Let’s see how it works:

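A minimal sketch of the XGBoost run, using the scikit-learn compatible ‘XGBClassifier’ from the separate xgboost package. Recent xgboost versions require numeric class labels, so the string labels are encoded first; that encoding step is an assumption about the original setup:

from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Encode 'POSITIVE' / 'NEGATIVE' / 'NEUTRAL' as integers for XGBoost.
encoded_labels = LabelEncoder().fit_transform(label_df)

xgb_clf = XGBClassifier()
evaluate(xgb_clf, brainwave_df, encoded_labels)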

Accuracy (99.4%) is exceptionally good, but the ‘time taken’ (15 min) is quite high. Nowadays, for complicated problems, XGBoost is becoming a default choice for Data Scientists because of its accurate results. It has a high running time due to its internal ensemble model structure. However, XGBoost performs well on GPU machines.


Conclusion

Across all of the classifiers, it is clear that ‘XGBoost’ is the winner on accuracy. But if we weigh ‘time taken’ along with ‘accuracy’, then ‘RandomForest’ is the better choice. We also saw how a simple linear classifier like ‘Logistic Regression’, with proper feature engineering, can be made to give better accuracy; the other classifiers don’t need that much feature engineering effort.


Choosing the perfect ‘classifier’ depends on the requirements, the use case, and the data engineering environment available.


The entire project on Jupyter NoteBook can be found here.


References:

[1] XGBoost Documentation — https://xgboost.readthedocs.io/en/latest/


[2] RandomForest workings — http://dataaspirant.com/2017/05/22/random-forest-algorithm-machine-learing/


[3] Principal Component Analysis — https://towardsdatascience.com/a-one-stop-shop-for-principal-component-analysis-5582fb7e0a9c


[4] Logistic Regression — http://ufldl.stanford.edu/tutorial/supervised/LogisticRegression/


[5] Support Vector Machines — https://towardsdatascience.com/support-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47


Translated from: https://www.freecodecamp.org/news/multi-class-classification-with-sci-kit-learn-xgboost-a-case-study-using-brainwave-data-363d7fca5f69/
