Deep Learning Algorithms and Machine Learning Algorithms: The Top 5 Machine Learning Algorithms

Opinion

Table of Contents

  1. Introduction
  2. Logistic Regression
  3. K-Means
  4. Decision Trees
  5. Random Forest
  6. XGBoost
  7. Summary

Introduction

There are several Machine Learning algorithms that can be beneficial to both Data Scientists and, of course, Machine Learning Engineers. I have worked at a couple of companies applying a variety of algorithms, and I have seen others in both educational and professional settings use similar ones. I am going to list what I think are the top Machine Learning algorithms, along with use cases, so that you can become aware of these algorithms or refresh your knowledge of them. I also want to highlight the business understanding rather than the technology side, as I believe that point is often not stressed enough in other articles.

Logistic Regression

Photo by Sifan Liu on Unsplash [2].

The Machine Learning algorithms I will be discussing roughly follow an order of difficulty. The simplest, yet still powerful, Machine Learning algorithm is therefore logistic regression. Although the name implies regression, it is actually a (supervised) classification algorithm. Most of the time, it is used for predicting binary classes via the logit function. There are also forms of logistic regression for multinomial and ordinal target variables. Here are some popular examples that you can expect to encounter in the real world, not just in academic settings. (I have really only used logistic regression for binary classes, so I will not expound upon the multiclass or ordinal cases; for those situations, I use different algorithms, like the ones I will describe next.)

Business use cases for logistic regression:

Customer churn or no churn

This example predicts whether a user of a product will or will not churn, meaning they unsubscribe and drop themselves from the product. Possible features could include low activity on the platform, a failed fee payment, or declining rates of specific activities.

Email spam or not spam

You can be creative and imagine a lot of situations as a 0 or 1, but it is ultimately your entire dataset, business use case, and expected impact that will determine whether this algorithm is right for you and your project. You could try to predict something as arbitrary as house or not house, using descriptive features to help classify your target variable, but depending on your business, you will learn to find more useful and applicable situations for logistic regression (e.g., finance: approved or not approved; healthcare: disease or no disease).

You can expect to encounter these evaluation metrics with logistic regression:

  1. Accuracy
  2. Precision
  3. Recall

Also important are the ROC (receiver operating characteristic) curve and the AUC (area under the curve), along with sensitivity and specificity.

Documentation that is easy to follow and detailed [3]:

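To make the churn example concrete, here is a minimal scikit-learn sketch; the feature names and the synthetic data (and the rule that generates the labels) are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Toy churn dataset: each row is a user described by
# [days_active_last_month, failed_payments]; label 1 = churned.
rng = np.random.default_rng(0)
n = 500
days_active = rng.integers(0, 30, n)
failed_payments = rng.integers(0, 3, n)
# Assumed rule: low-activity users with failed payments churn.
churn = ((days_active < 10) & (failed_payments > 0)).astype(int)

X = np.column_stack([days_active, failed_payments])
X_train, X_test, y_train, y_test = train_test_split(
    X, churn, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
preds = model.predict(X_test)

# The three evaluation metrics mentioned above.
print("accuracy: ", accuracy_score(y_test, preds))
print("precision:", precision_score(y_test, preds, zero_division=0))
print("recall:   ", recall_score(y_test, preds, zero_division=0))
```

Note that the logit decision boundary is linear, so an interaction like "low activity AND failed payment" will not be captured perfectly; on real churn data you would engineer features or reach for a tree-based model.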
K-Means

Photo by Paweł Czerwiński on Unsplash [4].

Whereas logistic regression is a supervised Machine Learning algorithm, k-means is the opposite: an unsupervised algorithm. I am specifically referring to the k-means clustering algorithm. In logistic regression, we had a yes or no, a 0 or 1, spam or not spam, etc. We can call those target variables labels, since we know what we are trying to predict. Unsupervised clustering is the opposite: we do not know the label, but we still want to associate different observations of data with one another in the form of groups, or really just new classes that have no name yet. The algorithm works by forming k clusters around their respective k centers. Intuitively, I like to think of it as a grouping problem. Say we have some data and want to create some groups or clusters. First, we identify our features, and then we run the algorithm, which essentially forms groups whose members closely resemble one another while the groups themselves differ from each other as much as possible.

Here are some examples of k-means clustering:

Customer profiling for targeted advertisements

Say you cluster your user base on activity and demographic features, and you notice that you now have three distinctly identifiable groups. The first group consists of people who stay up past midnight, are under 25 years old, and usually reside in big cities. The second group consists of people who go to bed around 11 pm, are between 25 and 50 years old, and reside in small cities. The last group goes to bed at 9 pm, tends to be aged 50+, and lives in rural areas. As you can see, each group is distinct from the others, while the members within each group resemble one another. This situation is ideal because the marketing campaign's impact can be attributed to each group's distinctiveness. Now suppose we work with a product manager and tell them that the groups do not have names yet; that is fine, since what we want first are the features associated with those groups. After running the model, now that we have our groups, we can still come up with labels. Let's label our groups and see the best way to market to them.

Group 1: young, and immersed in technology
  • Market to them on their phones with Instagram ads for a cheaper product
Group 2: older but still using technology frequently, though not as much on social media
  • Market to them by emailing an advertisement for your after-college product (a new home or apartment)
Group 3: oldest, and not on phones or computers nearly as much
  • Send them physical mail, such as a furniture magazine

I hope this example is more useful than just saying "grouping" and explaining k-means mathematically.

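The customer-profiling example can be sketched with scikit-learn's KMeans; the features (bedtime hour, age) and the synthetic data are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic customers: [bedtime_hour (24h scale, so 25 = 1 am), age]
rng = np.random.default_rng(42)
young  = np.column_stack([rng.normal(25, 0.5, 50), rng.normal(22, 2, 50)])
middle = np.column_stack([rng.normal(23, 0.5, 50), rng.normal(38, 5, 50)])
older  = np.column_stack([rng.normal(21, 0.5, 50), rng.normal(60, 5, 50)])
X = np.vstack([young, middle, older])

# We ask for k = 3 clusters; the algorithm has no labels to learn from.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # one [bedtime, age] center per group
print(kmeans.labels_[:5])        # cluster index assigned to each customer
```

On real data with mixed scales you would standardize the features first (e.g., with StandardScaler), since k-means is distance-based and a wide-ranging feature like age would otherwise dominate.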
When you are working for a company, you may be surprised to find that leadership does not stress the importance of the code or functions you developed and applied, but rather the impact of your Machine Learning algorithm once it is a saved and applied model.

Documentation that is easy to follow and detailed [5]:

Decision Trees

Photo by Jan Huber on Unsplash [6].

A slightly more complex algorithm (depending on the person) is the decision tree algorithm. A unique benefit of this algorithm is that it can work on not only classification but regression problems as well. It is also important to note that it is supervised as well as non-parametric (meaning it makes no assumptions about the probability distribution of the data). Decision trees are easier to interpret than most models, and easier to describe visually in terms of how they work behind the scenes (think of a tree and how its branches split). Another reason I like decision trees is that they can handle both categorical and numeric data, which is often what is needed in real-world applications of Machine Learning algorithms.

As the name implies, the logic of a decision tree can be imagined as making a decision, then more decisions.

A disadvantage of decision trees is that they can be prone to overfitting, or not generalizing well (we can use a random forest to avoid overfitting, or tune the decision tree model itself to avoid it).

Here is a business use case of decision trees for classification:

Classifying housing markets for realtors

In this example, a realtor would want to know the types of homes for organization and grouping on their realtor website. The classification labels would be:

  • Hip homes
  • Family homes
  • Singles' pads

The decision tree would decide its splits based on the information gain of various features of the homes and of the types of people who tend to buy them. For example, the list of decisions and the result could be:

  • Is this house 2 stories? (yes)
  • Does this house have 2 or more bedrooms? (yes)
  • Does this house have a big backyard? (yes)
  • Then this house is a family home
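The question-by-question logic above can be sketched as a tiny scikit-learn decision tree; the feature encoding and the miniature training set are invented for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical encoding per home: [stories, bedrooms, big_backyard (0/1)]
X = [
    [2, 3, 1],  # family home
    [2, 4, 1],  # family home
    [1, 1, 0],  # singles' pad
    [1, 1, 1],  # singles' pad
    [2, 2, 0],  # hip home
    [1, 2, 0],  # hip home
]
y = ["family", "family", "singles", "singles", "hip", "hip"]

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# A 2-story house with 3 bedrooms and a big backyard:
print(tree.predict([[2, 3, 1]])[0])

# The learned splits mirror the decision list above.
print(export_text(tree, feature_names=["stories", "bedrooms", "big_backyard"]))
```

export_text is a handy way to show stakeholders how the model decides, which plays to the interpretability advantage mentioned earlier.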

Documentation that is easy to follow and detailed [7]:

Random Forest

Photo by Marc Pell on Unsplash [8].

Now we are getting to the top two Machine Learning algorithms. These are some of the most popular and useful algorithms I have encountered, both personally and through others. The key concepts of decision trees carry over to random forests: supervised, classification, and regression. A random forest is what its name implies, a forest of randomness. But what is random about it? Each decision tree that composes the forest is trained on a random sample of the data. The ensemble of decision tree predictions ultimately makes the random forest prediction: whichever class gets the most votes wins. This kind of ensembling helps prevent the overfitting that a single decision tree would encounter.

Here are some business use cases for Random Forest:

Classifying several products for an e-commerce site

So now that we have a classification algorithm, a powerful supervised algorithm that works well with multiple classes, we can accurately classify product categories. You can have, say, 20 classes or 20 types of products; manually classifying them would take hours, and you could make some easy mistakes. A random forest can become quite accurate for this use case (depending, of course, on the dataset). It can also be much faster than a manual approach.

  • Cups, pants, toys, furniture, etc.

Imagine the decision tree example from above, but applied to each of the classes in this random forest problem. Hopefully, your training data and products are well defined and separated.

The model could have some trouble if your products are too similar or too broad; perhaps merging them into one class, or splitting them into more classes, is an easy fix.

  • An example of this problem would be boots and hiking boots
  • Fix: create distinct categories such as construction boots (if the features allow), hiking boots, and snow boots. Now we have three unique types of boots, rather than "boots" and "hiking boots" categories that would surely overlap.

Also useful is the predict_proba method, which assigns a score to each classification suggestion. So a product could be, for example, 90% likely to be construction boots according to our model.

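As a sketch of scoring product categories with predict_proba (the product features, categories, and synthetic data here are invented for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical product features: [weight_kg, price_usd, is_wearable (0/1)]
rng = np.random.default_rng(1)
cups  = np.column_stack([rng.normal(0.3, 0.05, 40), rng.normal(8, 2, 40),  np.zeros(40)])
pants = np.column_stack([rng.normal(0.5, 0.10, 40), rng.normal(40, 5, 40), np.ones(40)])
toys  = np.column_stack([rng.normal(1.0, 0.20, 40), rng.normal(20, 5, 40), np.zeros(40)])

X = np.vstack([cups, pants, toys])
y = ["cup"] * 40 + ["pants"] * 40 + ["toy"] * 40

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# predict_proba scores every category instead of returning only the winner.
new_item = [[0.45, 38.0, 1]]
for label, p in zip(forest.classes_, forest.predict_proba(new_item)[0]):
    print(f"{label}: {p:.0%}")
```

The probabilities are simply the fraction of trees voting for each class, so low-confidence items can be routed to a human reviewer instead of being auto-classified.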
Documentation that is easy to follow and detailed [9]:

XGBoost

Photo by SpaceX on Unsplash [10].

Some would argue this Machine Learning algorithm is the best, or at least better than Random Forest. I would agree, but also note that it does depend on the problem at hand; XGBoost can sometimes be hard to interpret, and ingesting data, transforming it, and predicting on new data can also be more difficult. However, it is extremely powerful and leads to accurate results. You can think of XGBoost as gradient boosting taken to the extreme: it utilizes parallel processing while avoiding overfitting through regularization.

A business case for XGBoost:

Classifying different types of email with several features, quickly (for live predictions)

You will notice in real-world applications of Machine Learning algorithms that some results need to be output once a day, while others are needed almost instantly, in a live manner. XGBoost is a beneficial algorithm for this type of problem. Because emails are sent and received quickly, you need to categorize them quickly. The classification could be spam or not spam, or it could distinguish different types of email, such as promotions, subscriptions, etc.

  • spam or not spam
  • type of email (promotion, social)

Documentation that is easy to follow and detailed [11]:

Summary

Your dataset, business problem, and target variable will decide which of these top five Machine Learning algorithms you should employ. Think about how fast you need your results, or whether it is a one-off task. Do you have a continuous target variable or a class label? Do you even have labels? These are the questions you will need to ask yourself as you pick one of these algorithms. You can also run several similar algorithms and compare them to ultimately lower your error metric or increase your accuracy metric.

The top five Machine Learning algorithms we discussed were:

  1. Logistic Regression
  2. K-Means (clustering)
  3. Decision Trees
  4. Random Forest
  5. XGBoost

Feel free to comment down below to discuss which algorithms you enjoy and employ, or whether there are others you use that are more beneficial.

I hope you enjoyed reading my article and learned something new. Thank you for reading!

Translated from: https://towardsdatascience.com/the-top-5-machine-learning-algorithms-53bc471a2e92
