机器学习:算法视角pdf_何时使用不同的机器学习算法：简单指南

If you’ve been at machine learning long enough, you know that there is a “no free lunch” principle — there’s no one-size-fits-all algorithm that will help you solve every problem and tackle every dataset.

I work for Springboard — we’ve put a lot of research into machine learning training and resources. At Springboard, we offer the first online course with a machine learning job guarantee.

What helps a lot when confronted with a new problem is to have a primer for what algorithm might be the best fit for certain situations. Here, we talk about different problems and data types and discuss what might be the most effective algorithm to try for each one, along with a resource that can help you implement that particular model.

Remember: the proof is in the pudding: the best approach to your data is the model that empirically gives you the best results. This guide is meant to hone your first instincts and help you remember what models might be the most effective for each problem, and which would be impractical to use.

Let’s start by talking about the variables we need to consider.

监督学习与监督学习(Unsupervised learning vs supervised learning)

Unsupervised learning is where you allow the machine learning algorithm to start learning and outputting a result without any explicit human processing of the data beforehand.

Supervised learning involves some labeling and processing of the training data beforehand in order to structure it for processing.

The kind of learning you can perform will matter a lot when you start working with different machine learning algorithms.

S 速度和时间注意事项(Space and time considerations)

There are space and time considerations for each machine learning algorithm. While in practice you’ll likely work with optimized versions of each algorithm packaged in a framework, it is good to consider how the algorithms you choose can affect performance.

输出 (The output)

Third, and perhaps most important, is the output that you want to get. Are you trying to categorize data? Use it to predict future data points? What you’re looking to get as a result and what you want to do to your data will largely determine the algorithmic approaches you should take.

一些例子 (Some examples)

您正在寻找一个结构合理的数据集而又没有太多复杂性的简单预测模型。(You’re looking tobuild a simple predictive model with a well-structured dataset without too many complications.)

Your best bet here is probably linear regression, something that can take a whole host of factors and then give you a predictive result with a simple error rate explanation and a simple explanation for which factors contribute to the prediction. It doesn’t take much computational power to run a linear regression either.

Resource: Linear Regression — Detailed View

您正在寻找一种在有监督的环境中将已经被标记的数据分类为两种或两种以上截然不同的标签类型(例如，尝试根据孩子的体重和身高确定孩子是男性还是女性)。(You’re looking to classifydata that’s already been labeled into two or more sharply distinct types of labels (e.g., trying to determine if children are likely male or female based on their weight and height) in a supervised setting.)

The first instinct you should have when you see a situation like this is to apply the logistic regression model. After running the model, you’ll see that it forces every data point into two different categories, allowing you to easily output which point belongs to which category. The logistic regression model can also be easily generalized to working with multiple target and result classes if that’s what your problem demands.

Resource: Building a Logistic Regression

您正在寻求将未标记的连续数据放入不同的组中(例如，将具有某些已记录特征的客户放置，并尝试发现他们可以属于的类别/组)。(You’re looking to place unlabeled continuous data into different groups (e.g., putting customers with certain recorded traits and trying to discover categories/groups they can belong to).)

The first natural fit for this problem is the K-Means clustering algorithm, which will group and cluster data by measuring the distance between each point. Then there are a variety of clustering algorithms, such as Density-Based Spatial Clustering of Applications with Noise and Mean-Shift algorithms.

您要预测一串字符或一组特征属于一种数据类别还是另一种数据类别(监督文本分类)，例如，评论是肯定的还是否定的。(You’re looking to predict whether a string of characters or a grouping of traits falls into one category of data or another (supervised text classification) — e.g, whether a review is positive or negative.)

Your best bet here is probably Naive Bayes, which is a simple but powerful model that can be used for text classification. With some text pre-processing and cleaning (being especially careful to remove filler stop words such as “and” that might add noise to your dataset), you can get a remarkable set of results with a very simple model.

Another decent bet is logistic regression, which is a simple model to grasp and explain, and less hard to pick apart than Naive Bayes (which will often assign probabilities word by word rather than holistically labeling a text snippet as being part of one group or another).

Moving on to something more powerful, a Linear Support Vector Machine algorithm will likely help improve your performance. If you want to skip right ahead here, you can (though I suggest trying both models and comparing which one works best — Naive Bayes has an absurdly easy implementation on frameworks like scikit-learn and it isn’t very computationally expensive so you can afford to test both).

Lastly, bag-of-words analysis could also work — consider doing an ensemble of different methods and testing all of these methods against one another, depending on the dataset in question.

您正在寻找对大型图像或视频数据集(例如图像分类)进行非结构化学习的方法。(You’re looking to do unstructured learning on large-scale image or video datasets (e.g., image classification).)

The best algorithm to tackle going through different images is a convolutional neural network that is organized similarly to how animal visual cortexes are analyzed.

Measured by performance (reduced error rate) in the ImageNet competition, the SE-Resnet architecture comes out on top, though as the field is still developing, new advances come out almost every day.

You should be aware, however, that convolutional neural networks are dense and require a lot of computational power — so make sure that you have the hardware capability to run these models on large-scale datasets.

您正在寻找对来自定义明确的流程的结果点进行分类的方法(例如：预先建立的面试流程的录用人数，您可以在其中知道或可以计算得出每个事件的概率)。(You’re looking to classify result points that come out of a well-defined process (ex: number of hires from a pre-established interview process, wherein you know or can computationally infer the probabilities of each event).)

The best option for this is probably a decision tree algorithm that will clearly explain what the split points are between classifying something into one group or another.

Resource: Decision Trees in Machine Learning

您正在寻找使用定义明确，受监督的数据进行时间序列分析的方法(例如，根据从过去到现在的时间顺序排列的股票市场的历史模式来预测股票价格)。 (You’re looking to do time series analysis with well-defined, supervised data (e.g., predicting stock prices based on historical patterns in the stock market arranged on a chronological basis from the past to the present).)

A recurrent neural network is set up to do sequence analysis by containing an in-stream internal memory of data it processes, allowing it to take into account the relationship between data and the time horizon and order it’s deployed in.

Resource: Recurrent Neural Networks and LSTM

结语 (Wrapping up)

Take the recommendations and resources above, and apply them as a sort of first instinct for your modeling — it’ll help you jump into any work you do just a little bit faster. If you’re interested in being mentored by a machine learning expert in learning how to train your instincts further, check out Springboard’s AI/Machine Learning Career Track.

• 0
点赞
• 0
评论
• 1
收藏
• 一键三连
• 扫一扫，分享海报

08-26
08-25

04-24 254
05-19 3317
04-26 7768
10-28
12-14 7万+
08-24 4万+
08-06 2532
01-11 1714
09-28 46万+
10-22 261