Machine learning algorithm applications: How to choose the right machine learning algorithm for your application

羊牮

Original article: https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9

When I first started learning and practicing data science and machine learning, I would look up resources and tutorials to implement and use a specific machine learning algorithm.

The internet is full of materials that teach you how to use an algorithm, how it works, and how to apply it to your data.

However, when I started building my own projects, I would spend a long time trying to decide which algorithm to use.

What most articles on a specific algorithm leave out is when to use that algorithm and how to choose the best algorithm for your data.

In this article, I will go over the process I follow when choosing a machine learning algorithm for a specific project.

Before we start, let's first go through the types of machine learning algorithms.

Types of machine learning algorithms

Machine learning algorithms can be broadly divided into three main categories:

Supervised learning

In supervised learning, the algorithm builds a mathematical model from training data that contains both the inputs and the labeled outputs. Classification and regression algorithms are supervised learning.
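To make the idea of learning from labeled examples concrete, here is a minimal sketch of a 1-nearest-neighbor classifier in plain Python. The animal measurements and labels are made up for illustration, and the function name `predict_1nn` is an assumption, not something from the original article.

```python
import math

def predict_1nn(train_X, train_y, query):
    """Return the label of the training point closest to `query`."""
    distances = [math.dist(x, query) for x in train_X]
    nearest = distances.index(min(distances))
    return train_y[nearest]

# Hypothetical labeled data: (height_cm, weight_kg) -> label
train_X = [(20.0, 4.0), (22.0, 5.0), (60.0, 30.0), (65.0, 28.0)]
train_y = ["cat", "cat", "dog", "dog"]

# Each query is labeled with the class of its nearest labeled example
print(predict_1nn(train_X, train_y, (21.0, 4.5)))
print(predict_1nn(train_X, train_y, (62.0, 29.0)))
```

The key point is that every training input comes with a known output, and the model's predictions are derived from those pairs.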

Unsupervised learning

In unsupervised learning, the algorithm builds a model from data that has only input features and no output labels. The model is then trained to look for structure within the data. Clustering and segmentation are examples of unsupervised learning tasks.
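As a rough illustration of finding structure without labels, the sketch below runs a basic k-means loop with two clusters on one-dimensional data. The function name `two_means_1d`, the data values, and the min/max initialization are all assumptions chosen to keep the example short and deterministic.

```python
def two_means_1d(values, iterations=10):
    """Split unlabeled 1-D values into two clusters with a basic k-means loop."""
    centers = [min(values), max(values)]  # simple deterministic initialization
    clusters = [[], []]
    for _ in range(iterations):
        clusters = [[], []]
        for v in values:
            # Assign each value to the nearer of the two centers
            idx = 0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
            clusters[idx].append(v)
        # Move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical unlabeled measurements
values = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centers, clusters = two_means_1d(values)
print(centers)
print(clusters)
```

No labels are supplied anywhere; the two groups emerge only from how the values are spread.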

Reinforcement learning

In reinforcement learning, the model learns to perform a task by taking a series of actions and decisions on its own and then learning from the feedback those actions and decisions produce. Monte Carlo methods are one example of reinforcement learning algorithms.
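The act-then-learn-from-feedback loop can be sketched with a two-armed bandit, which is a deliberately simplified stand-in for a full reinforcement learning problem. The agent below estimates the value of each action by averaging sampled rewards (a Monte Carlo style estimate) and picks actions epsilon-greedily. The reward probabilities, `epsilon`, and the function name are illustrative assumptions.

```python
import random

def estimate_action_values(reward_probs, steps=5000, epsilon=0.1, seed=0):
    """Estimate each action's value by averaging the rewards sampled for it."""
    rng = random.Random(seed)
    counts = [0] * len(reward_probs)
    values = [0.0] * len(reward_probs)
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(len(reward_probs))  # explore a random action
        else:
            action = values.index(max(values))  # exploit the current best estimate
        # Feedback from the environment: reward 1 with the action's probability
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the sample average for the chosen action
        values[action] += (reward - values[action]) / counts[action]
    return values

# Hypothetical environment: action 1 pays off far more often than action 0
values = estimate_action_values([0.2, 0.8])
print(values)
```

The agent is never told which action is better; it discovers that from the rewards it collects.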

Choosing the right algorithm

So, you know the different types of algorithms, how they differ, and how to use them. The question now is: when should you use each of them?

To answer this question, we need to consider four aspects of the problem we are trying to solve:

№1: The Data

Knowing your data is the first and most important step in deciding on an algorithm. Before you start thinking about different algorithms, you need to familiarize yourself with your data. A simple way to do that is to visualize the data and look for patterns in it, observe its behavior and, most importantly, note its size.

Knowing these critical facts about your data will help you make an initial decision on an algorithm.

The size of data: Some algorithms cope with small datasets better than others. For small training datasets, high-bias/low-variance classifiers tend to work better than low-bias/high-variance classifiers, because the latter overfit more easily. So, for small training data, Naïve Bayes will often perform better than kNN.
The characteristics of data: This means how your data is structured. Is your data linear? Then a linear model may fit it best, such as linear or logistic regression, or an SVM (support vector machine) with a linear kernel. However, if your data is more complex, you need an algorithm like random forest.
The behavior of data: Are your features sequential or chained? If they are sequential, for example when you are trying to forecast the weather or the stock market, then it would be best to use an algorithm that matches that, such as Markov models and decision trees.
The type of data: You can categorize both your input and your output data. If your data is labeled, use a supervised learning algorithm; if not, it's probably an unsupervised learning problem. If your output is numeric, use regression; if it is one of a known set of categories, it's a classification problem; and if it is a set of groups to be discovered in the data, it's a clustering problem.
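The rules of thumb about the type of data can be written down as a small helper. This is only a sketch: the function name, the argument names, and the returned strings are assumptions, and treating labeled categorical output as classification is standard usage rather than something the list above spells out.

```python
def suggest_problem_type(has_labels, output_kind=None):
    """Map the data-type rules of thumb to a broad problem family.

    output_kind is "numeric" or "categorical"; it is ignored when
    has_labels is False.
    """
    if not has_labels:
        return "unsupervised (e.g. clustering)"
    if output_kind == "numeric":
        return "supervised regression"
    if output_kind == "categorical":
        return "supervised classification"
    raise ValueError("output_kind must be 'numeric' or 'categorical' for labeled data")

print(suggest_problem_type(has_labels=True, output_kind="numeric"))
print(suggest_problem_type(has_labels=True, output_kind="categorical"))
print(suggest_problem_type(has_labels=False))
```

A helper like this only narrows the family of algorithms; size, characteristics, and behavior of the data still decide the specific choice.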

№2: The Accuracy

Now that you have studied your data and analyzed its type, characteristics, and size, you need to ask yourself how much accuracy matters for the problem you're trying to solve.

The accuracy of a model refers to its ability to predict, for a given set of observations, an answer close to the correct response for those observations.

Sometimes our target application does not need a highly accurate answer. If an approximation is good enough, we can cut training and processing time significantly by choosing an approximate model. Approximate methods also tend to avoid overfitting the data; an example is linear regression on not-so-linear data.
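As a small worked example of an approximate model, the sketch below fits a straight line by ordinary least squares to data that actually follows a quadratic curve, then measures how far off the line is. The data and the helper names `fit_line` and `mean_abs_error` are assumptions made for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mean_abs_error(xs, ys, slope, intercept):
    """Average absolute gap between the line's predictions and the data."""
    return sum(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)) / len(xs)

xs = [0, 1, 2, 3, 4, 5]
ys = [x ** 2 for x in xs]  # not-so-linear data: y = x^2

slope, intercept = fit_line(xs, ys)
mae = mean_abs_error(xs, ys, slope, intercept)
print(slope, intercept, mae)
```

For this data the fitted line has a slope of 5 and an intercept of about -3.33, with a mean absolute error of roughly 2.2. Whether an error of that size is acceptable depends on the application, which is exactly the question this section asks.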

№3: The Speed

Accuracy and speed often pull in opposite directions, so you need to make trade-offs between the two when deciding on an algorithm. Higher accuracy typically means longer training and processing times.

Algorithms like Naïve Bayes and linear and logistic regression are easy to understand and implement, and they run fast. More complex algorithms like SVM, neural networks, and random forests usually need much longer to train and to process data.

So, which is of more value to your project: accuracy or time? If it's time, a simpler algorithm will be the better choice; if accuracy matters most, a more complex algorithm will likely serve your project better.
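One way to ground the time side of this trade-off is to measure it on your own data. The sketch below is a minimal timing helper; `train_simple` and `train_complex` are hypothetical stand-ins for a cheap and an expensive training routine, not real learning algorithms.

```python
import time

def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical stand-in for a cheap model: one pass over the data
def train_simple(data):
    return sum(data) / len(data)

# Hypothetical stand-in for an expensive model: all pairwise differences,
# so the work grows quadratically with the number of samples
def train_complex(data):
    return sum(abs(a - b) for a in data for b in data) / len(data) ** 2

data = list(range(500))
simple_result, simple_time = time_call(train_simple, data)
complex_result, complex_time = time_call(train_complex, data)
print(f"simple: {simple_time:.6f}s, complex: {complex_time:.6f}s")
```

Swapping the two stand-ins for real candidate models gives a quick measurement to weigh against the accuracy each one achieves.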

№4: Features and parameters

The parameters of your problem are the numbers that affect how the algorithm you choose behaves. Parameters are factors such as error tolerance or the number of iterations, or options that select between variants of how the algorithm behaves. The time needed to train and process your data is often related to how many parameters you have.

If you search over combinations of parameter values to find a good setting, the time required to train a model grows exponentially with the number of parameters. However, having many parameters typically indicates that an algorithm is more flexible.
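The exponential growth is easiest to see when every combination of parameter values is tried, as in an exhaustive grid search. The sketch below counts those combinations; the parameter names and candidate values are hypothetical and do not refer to any particular library.

```python
import itertools
import math

def grid_size(param_grid):
    """Number of combinations in an exhaustive search over param_grid."""
    return math.prod(len(values) for values in param_grid.values())

# Hypothetical parameters, each with three candidate values
param_grid = {
    "learning_rate": [0.01, 0.1, 1.0],
    "max_depth": [2, 4, 8],
    "n_estimators": [50, 100, 200],
}

# Enumerate every combination explicitly to confirm the count
combos = list(itertools.product(*param_grid.values()))
print(grid_size(param_grid), len(combos))
```

With v candidate values for each of k parameters, the grid holds v^k combinations, and each one needs its own training run. Here that is 3^3 = 27, and a fourth parameter with three values would raise it to 81.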

In machine learning, and in data science in general, a feature is a quantifiable variable of the problem you are trying to analyze.

Having a large number of features can slow down some algorithms, making training time quite long. If your problem has many features, then an algorithm such as SVM, which is well suited to applications with a high number of features, is a good way to go.

Final thoughts

Many factors go into choosing an algorithm. We can divide the decision criteria into two main groups: data-related aspects and problem-related aspects.

The size, behavior, characteristics, and type of your data give you an initial idea of which algorithm to use. Once you have this initial idea, the other aspects of your problem (accuracy, speed, parameters, and features) will help you reach a final decision.

Finally, always remember two things:

Better data leads to better results than complex algorithms; if you can achieve similar results using a much simpler algorithm, opt for simplicity.
You can improve the accuracy of an algorithm by spending more time on processing and training. Make the decision based on the priorities of your specific project.

Always listen to the story your data is trying to tell, while following the goals of your project.

Translated from: https://towardsdatascience.com/how-to-choose-the-right-machine-learning-algorithm-for-your-application-1e36c32400b9
