7 Key Points on 7 Machine Learning Algorithms

Photo by Waldemar Brandt on Unsplash

Thanks to the various libraries and frameworks, we can implement machine learning algorithms with just one line of code. Some go further and let you implement and compare multiple algorithms in no time.
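
For example, with scikit-learn (just one of many such libraries, used here as an illustrative sketch), fitting a classifier really is a single statement:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# toy data set bundled with scikit-learn
X, y = load_iris(return_X_y=True)

# the "one line": create and fit a classifier in a single statement
model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.score(X, y))  # accuracy on the training data
```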

The ease of use comes with some disadvantages. We may overlook the key concepts or ideas behind these algorithms that are essential to gaining a comprehensive understanding of them.

In this post, I will cover 7 key points about 7 machine learning algorithms. I want to point out that this is not a complete explanation of the algorithms, so it is better if you already have a basic understanding of them.

Let’s start.

1- Support Vector Machine (SVM)

Point: C parameter

SVM creates a decision boundary that makes the distinction between two or more classes.

A soft margin SVM tries to solve an optimization problem with the following goals:

  • Increase the distance of the decision boundary to the classes (or support vectors)
  • Maximize the number of points that are correctly classified in the training set

There is obviously a trade-off between these two goals. The decision boundary might have to be very close to one particular class to correctly label all data points. However, in this case, the accuracy on new observations might be lower because the decision boundary is too sensitive to noise and to small changes in the independent variables.

On the other hand, the decision boundary might be placed as far as possible from each class, at the expense of some misclassified exceptions. This trade-off is controlled by the C parameter.

The C parameter adds a penalty for each misclassified data point. If C is small, the penalty for misclassified points is low, so a decision boundary with a large margin is chosen at the expense of a greater number of misclassifications.

If C is large, SVM tries to minimize the number of misclassified examples due to the high penalty, which results in a decision boundary with a smaller margin. The penalty is not the same for all misclassified examples; it is proportional to the distance to the decision boundary.
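
As a rough illustration (a minimal sketch using scikit-learn's SVC; the data set and the values of C are arbitrary), the effect of C can be seen by cross-validating the same model with different values:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=5, random_state=42)

# Small C: low penalty for misclassified points, wider margin.
# Large C: high penalty, tighter fit to the training set, smaller margin.
for c in [0.01, 1, 100]:
    scores = cross_val_score(SVC(kernel="linear", C=c), X, y, cv=5)
    print(f"C={c}: mean CV accuracy = {scores.mean():.3f}")
```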

2- Decision Tree

Point: Information gain

When choosing a feature to split on, the decision tree algorithm tries to achieve:

  • More predictiveness
  • Less impurity
  • Lower entropy

Entropy is a measure of uncertainty or randomness. The more randomness a variable has, the higher its entropy. Variables with a uniform distribution have the highest entropy. For example, rolling a fair die has 6 possible outcomes with equal probabilities, so it has a uniform distribution and high entropy.

Figure: Entropy vs Randomness

Splits that result in purer nodes are chosen. All of these indicate “information gain”, which is basically the difference between the entropy before and after the split.
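
A small sketch of the idea (the parent node and the split below are made up purely for illustration): entropy is computed from the class proportions, and the information gain of a split is the parent's entropy minus the weighted entropy of the children.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Entropy before the split minus the weighted entropy after the split."""
    n = len(parent)
    child_entropy = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - child_entropy

# a split that produces purer child nodes yields a higher information gain
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])
print(information_gain(parent, left, right))  # ~0.19
```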

3- Random Forest

Point: Bootstrapping and feature randomness

A random forest is an ensemble of many decision trees. The success of a random forest highly depends on using uncorrelated decision trees. If we use the same or very similar trees, the overall result will not be much different from the result of a single decision tree. Random forests achieve uncorrelated decision trees through bootstrapping and feature randomness.

Bootstrapping is randomly selecting samples from the training data with replacement. These are called bootstrap samples.

Feature randomness is achieved by selecting a random subset of the features for each decision tree in a random forest (in scikit-learn, the random subset is drawn at each split). The number of features considered can be controlled with the max_features parameter.
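
A minimal sketch with scikit-learn's RandomForestClassifier (the data set and parameter values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# bootstrap=True (the default) draws a bootstrap sample for each tree;
# max_features limits how many features are considered at each split.
rf = RandomForestClassifier(
    n_estimators=200,
    max_features="sqrt",
    bootstrap=True,
    random_state=0,
).fit(X, y)
print(rf.score(X, y))
```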

Figure: Feature randomness

4- Gradient Boosted Decision Tree

Point: Learning rate and n_estimators

GBDT is an ensemble of decision trees combined with the boosting method, which means the decision trees are connected sequentially.

Learning rate and n_estimators are two critical hyperparameters for gradient boosted decision trees.

The learning rate simply means how fast the model learns. The advantage of a slower learning rate is that the model becomes more robust and generalizes better. However, learning slowly comes at a cost: it takes more time to train the model, which brings us to the other significant hyperparameter.

The n_estimators parameter is the number of trees used in the model. If the learning rate is low, we need more trees to train the model. However, we need to be very careful when selecting the number of trees: using too many trees creates a high risk of overfitting.
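
A minimal sketch with scikit-learn's GradientBoostingClassifier (values chosen only to illustrate the interplay of learning_rate and n_estimators):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# A lower learning_rate usually needs a larger n_estimators to reach the same
# training error; too many trees, in turn, increase the risk of overfitting.
gbdt = GradientBoostingClassifier(
    learning_rate=0.05,
    n_estimators=500,
    random_state=1,
).fit(X_train, y_train)
print(gbdt.score(X_test, y_test))
```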

5- Naive Bayes Classifier

Point: What is the advantage of being naive?

Naive Bayes is a supervised machine learning algorithm for classification, so the task is to find the class of an observation given the values of its features. The Naive Bayes classifier calculates the probability of a class given a set of feature values (i.e. p(yi | x1, x2, …, xn)).

Naive Bayes assumes that features are independent of each other and there is no correlation between features. However, this is not the case in real life. This naive assumption of features being uncorrelated is the reason why this algorithm is called “naive”.

The assumption that all features are independent makes Naive Bayes very fast compared to more complicated algorithms, because the joint likelihood factors into a product of one-dimensional terms, p(yi | x1, …, xn) ∝ p(yi) · p(x1 | yi) · … · p(xn | yi), so only simple per-feature distributions need to be estimated. In some cases, speed is preferred over higher accuracy.

It works well with high-dimensional data such as text classification and email spam detection.
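
A minimal text-classification sketch with scikit-learn (the tiny "spam" data set is invented purely for illustration). Because every word count is treated as conditionally independent given the class, fitting essentially reduces to counting word frequencies per class, which is why the model trains so quickly:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["win a free prize now", "meeting at noon tomorrow",
         "free prize click here", "lunch tomorrow?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam (toy labels)

# each word count is modeled as independent given the class
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["free prize tomorrow"]))
```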

6- K-Nearest Neighbors

Point: When to use and not to use

K-nearest neighbors (kNN) is a supervised machine learning algorithm that can be used to solve both classification and regression tasks. The main principle of kNN is that the value of a data point is determined by the data points around it.
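
A minimal sketch with scikit-learn's KNeighborsClassifier (using the built-in iris data set): "fitting" mostly means storing the training data, and each prediction is a vote among the n_neighbors closest training points.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# each prediction is decided by the 5 nearest training points
knn = KNeighborsClassifier(n_neighbors=5)
print(cross_val_score(knn, X, y, cv=5).mean())
```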

The kNN algorithm becomes very slow as the number of data points increases because the model needs to store all data points in order to calculate the distances between them. For the same reason, the algorithm is also not memory efficient.

Another disadvantage is that kNN is sensitive to outliers, because an outlier still affects the points closest to it (even if it is far away).

On the positive side:

  • Simple and easy to interpret
  • Does not make any assumptions, so it can be used for non-linear tasks
  • Works well on classification with multiple classes
  • Works on both classification and regression tasks

7- K-Means Clustering

Point: When to use and not to use

K-means clustering aims to partition data into k clusters such that data points in the same cluster are similar and data points in different clusters are farther apart.

The k-means algorithm is not able to guess how many clusters exist in the data. The number of clusters must be pre-determined, which might be a challenging task.
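
A minimal sketch with scikit-learn's KMeans (the blob data and the choice of three clusters are arbitrary); n_clusters has to be supplied up front:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

# n_clusters must be chosen in advance; init="k-means++" (the default)
# picks the initial centroids in a way that speeds up convergence.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)
```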

The algorithm slows down as the number of samples increases because, at each step, it accesses all data points and calculates distances.

K-means can only draw linear boundaries. If there is a non-linear structure separating groups in the data, k-means will not be a good choice.

On the positive side:

  • Easy to interpret
  • Relatively fast
  • Scalable to large data sets
  • Able to choose the positions of the initial centroids in a smart way that speeds up convergence
  • Guaranteed to converge

We have covered some key concepts about each algorithm. The given points and notes are definitely not the whole explanation of the algorithms. However, knowing them certainly makes a difference when implementing these algorithms.

Thank you for reading. Please let me know if you have any feedback.

Translated from: https://towardsdatascience.com/7-key-points-on-7-machine-learning-algorithms-945ebda7a79
