Support Vector Machines: Thinking Like Vectors

Support vector machines work well in high-dimensional spaces with a clear margin of separation, and in that sense they think like vectors.

A Support Vector Machine (SVM) is a supervised, non-linear machine learning algorithm that can be used for both classification and regression problems. SVM generates one or more separating hyperplanes that divide the data space into segments, each containing only one class of data.

The SVM technique is useful for data whose distribution is unknown, i.e. non-regular data, such as that found in spam classification, handwriting recognition, text categorization, speaker identification, and so on. I listed applications of support vector machines along the way. :)

This post explains support vector machines with an example, demonstrates an SVM on a dataset, and explains the outputs generated by the demonstration.

What lies behind SVM, with an example?

[Image: picture exclusively created]

In support vector machines, we plot each data point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. Then we perform classification by finding the hyperplane that best differentiates the classes.

Example: Consider a dataset containing apples and oranges. To classify them, we use a support vector machine and labelled training data on a plane.

[Image: Photo by Sarah Gualtieri on Unsplash]

A support vector machine (SVM) takes these data points and outputs the hyperplane (which, in two dimensions, is simply a line with equation y = ax + b) that best separates the tags. This line is called the decision boundary: anything that falls on one side of it is classified as an apple, and anything on the other side as an orange.
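The post's demonstration later uses R; purely as an illustration of the idea above, here is a hedged sketch in Python with scikit-learn. The fruit measurements are made up for the example, and the feature names are assumptions, not data from the post.

```python
# Illustrative sketch only: a linear SVM separating two made-up fruit
# classes by [weight (g), diameter (cm)], using scikit-learn.
from sklearn.svm import SVC

X = [[150, 7.0], [160, 7.2], [140, 6.8],   # apples (label 0)
     [170, 8.5], [180, 8.8], [175, 8.6]]   # oranges (label 1)
y = [0, 0, 0, 1, 1, 1]

clf = SVC(kernel="linear")
clf.fit(X, y)

# The decision boundary is w . x + b = 0; points on either side of it
# get different labels.
w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print(clf.predict([[155, 7.1], [178, 8.7]]))
```

An apple-sized fruit falls on one side of the fitted boundary and an orange-sized one on the other, which is exactly the decision-boundary behaviour described above.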

The hyperplane (here, a two-dimensional line) is best when its distance to the nearest element of each tag is largest, i.e. when it is specified by the maximum margins.

[Image: picture exclusively created]

All points on the line ax + b = 0 satisfy the equation. We then draw two parallel lines, ax + b = -1 on one side and ax + b = 1 on the other, such that each passes through the data point or tag nearest to our line in its segment. The distance between these two lines is our margin.
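Written out explicitly (using the common w·x + b notation for the hyperplane in place of the post's scalar ax + b), the distance between the two parallel boundary lines is:

```latex
\text{margin}
  = \frac{(w \cdot x_{+} + b) - (w \cdot x_{-} + b)}{\lVert w \rVert}
  = \frac{1 - (-1)}{\lVert w \rVert}
  = \frac{2}{\lVert w \rVert}
```

so maximizing the margin is equivalent to minimizing the norm of w, which is the optimization problem the SVM solves.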

Demonstration with a dataset

The Iris dataset consists of 50 samples from each of 3 species of Iris (Iris setosa, Iris virginica and Iris versicolor). It is a multivariate dataset introduced by the British statistician and biologist Ronald Fisher in his 1936 paper, "The use of multiple measurements in taxonomic problems".

Four features were measured from each sample: the length and the width of the sepals and petals. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

# Loading data
data(iris)

# Structure
str(iris)

We apply the support vector machine algorithm to the dataset (150 observations of 5 variables) using the e1071 package. Refer to the package documentation for details.

# Installing packages
install.packages("e1071")
install.packages("caTools")
install.packages("caret")

# Loading packages
library(e1071)
library(caTools)
library(caret)

# Splitting data into train and test data
split <- sample.split(iris$Species, SplitRatio = 0.7)
train_sv <- subset(iris, split == TRUE)
test_sv <- subset(iris, split == FALSE)

# Feature scaling
train_scale <- scale(train_sv[, 1:4])
test_scale <- scale(test_sv[, 1:4])

# Fitting SVM model to training dataset
set.seed(120)  # Setting seed
classifier_svm <- svm(Species ~ ., data = train_sv, method = "class")
classifier_svm

# Summary of model
summary(classifier_svm)

# Prediction
test_sv$Species_Predic <- predict(classifier_svm, newdata = test_sv, type = "class")

# Confusion matrix
cm <- table(test_sv$Species, test_sv$Species_Predic)
cm

# Model evaluation
confusionMatrix(cm)
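For readers who prefer Python, the same kind of confusion-matrix check can be sketched with scikit-learn. This is an assumed translation, not the post's code; the split function and seed differ from the R version, so the exact counts will differ too.

```python
# Hedged Python sketch: SVM on iris with a 70/30 split, then a
# confusion matrix and accuracy, mirroring the R workflow above.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=120, stratify=y)

clf = SVC(kernel="rbf")  # radial kernel, as in the post's model
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Rows are the true classes, columns the predicted classes.
print(confusion_matrix(y_test, pred))
print("accuracy:", accuracy_score(y_test, pred))
```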

Outputs: That Defines Everything

1. Model classifier_svm:

[Image: screenshot of the model output, exclusively captured]

The trained model is a classification model with 40 support vectors and a radial kernel.

2. Summary of Model:

[Image: screenshot of the model summary, exclusively captured]

The trained model is a classification model with 40 support vectors (data points) and 3 classes with 3 levels: setosa, versicolor and virginica.

3. Confusion Matrix:

[Image: screenshot of the confusion matrix, exclusively captured]

So, 20 setosa are correctly classified as setosa, 20 versicolor are correctly classified as versicolor, and 20 virginica are correctly classified as virginica.

4. Model Evaluation:

[Image: screenshot of the model evaluation, exclusively captured]

The model achieved 100% accuracy, with a p-value less than 1. Its sensitivity, specificity and balanced accuracy indicate that the model is well built. To increase accuracy further, hyperparameters are tuned to minimize error; these include the kernel, gamma and cost parameters.
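The tuning step itself is not shown in the post. As a hedged sketch, here is how an equivalent grid search over the kernel, gamma and cost parameters (cost is called C in scikit-learn) might look in Python; the grid values are illustrative choices, not values from the post.

```python
# Hedged sketch: grid search over SVM hyperparameters on iris
# with 5-fold cross-validation, using scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

param_grid = {
    "kernel": ["linear", "rbf"],
    "C": [0.1, 1, 10],          # "cost" in e1071's svm()
    "gamma": ["scale", 0.1, 1],
}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy: %.3f" % search.best_score_)
```

The best parameter combination can then be used to refit the final model on the full training set.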

Continue Learning: Never Stop Learning

This was a tutorial on support vector machines, an important algorithm.

I will be writing more posts in the future too. Do provide feedback or criticism. Follow me on Medium. I can be reached on Twitter.

Translated from: https://medium.com/analytics-vidhya/support-vector-machines-thinking-like-vectors-ba75f184c471
