Machine Learning Algorithms: Which One to Choose for Your Problem

[Repost] https://blog.statsbot.co/machine-learning-algorithms-183cc73197c


When I was starting out in data science, I often faced the problem of choosing the most appropriate algorithm for my specific problem. If you’re like me, when you open an article about machine learning algorithms, you see dozens of detailed descriptions. The paradox is that they don’t ease the choice.

In this article for Statsbot, I will try to explain basic concepts and give some intuition for using different kinds of machine learning algorithms on different tasks. At the end of the article, you’ll find a structured overview of the main features of the described algorithms.

First of all, you should distinguish four types of machine learning tasks:

  • Supervised learning
  • Unsupervised learning
  • Semi-supervised learning
  • Reinforcement learning
Supervised learning

Supervised learning is the task of inferring a function from labeled training data. By fitting to the labeled training set, we want to find the optimal model parameters to predict unknown labels on other objects (the test set). If the label is a real number, we call the task regression. If the label comes from a limited number of unordered values, it’s classification.

Unsupervised learning

In unsupervised learning we have less information about objects; in particular, the training set is unlabeled. What is our goal now? It’s possible to observe some similarities between groups of objects and group them into appropriate clusters. Some objects can differ hugely from all clusters, in which case we assume these objects to be anomalies.

Semi-supervised learning

Semi-supervised learning tasks include both problems we described earlier: they use labeled and unlabeled data. That is a great opportunity for those who can’t afford to label all of their data. The method allows us to significantly improve accuracy, because we can use unlabeled data in the training set together with a small amount of labeled data.

Reinforcement learning

Reinforcement learning is not like any of our previous tasks because we don’t have labeled or unlabeled datasets here. RL is an area of machine learning concerned with how software agents ought to take actions in some environment to maximize some notion of cumulative reward.


Imagine you’re a robot in some strange place: you can perform activities and get rewards from the environment for them. After each action your behavior gets more complex and clever, so you train yourself to behave in the most effective way at each step. In biology, this is called adaptation to the natural environment.

Commonly used Machine Learning algorithms

Now that we have some intuition about types of machine learning tasks, let’s explore the most popular algorithms with their applications in real life.

Linear Regression and Linear Classifier

These are probably the simplest algorithms in machine learning. You have features x1, …, xn of objects (matrix A) and labels (vector b). Your goal is to find the optimal weights w1, …, wn and bias for these features according to some loss function, for example, MSE or MAE for a regression problem. In the case of MSE there is a closed-form solution from the least squares method:

w = (AᵀA)⁻¹Aᵀb

In practice, it’s easier to optimize with gradient descent, which is much more computationally efficient. Despite the simplicity of this algorithm, it works pretty well when you have thousands of features, for example, bag of words or n-grams in text analysis. More complex algorithms tend to overfit when there are many features and the dataset is not huge, while linear regression provides decent quality.
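To make this concrete, here is a minimal sketch comparing the least-squares closed-form solution with gradient descent on the MSE loss. The data, the true weights, and the learning rate are all hypothetical toy choices:

```python
import numpy as np

# Toy data: 100 objects, 3 features, labels generated from known weights plus noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
b = A @ true_w + 0.01 * rng.normal(size=100)

# Closed-form least squares solution: w = (A^T A)^{-1} A^T b
w_closed = np.linalg.solve(A.T @ A, A.T @ b)

# Gradient descent on MSE: w <- w - lr * (2/n) * A^T (A w - b)
w = np.zeros(3)
lr = 0.1
for _ in range(1000):
    grad = 2 / len(b) * A.T @ (A @ w - b)
    w -= lr * grad

# Both estimates should land near true_w.
```

Both routes reach essentially the same weights; gradient descent just avoids forming and inverting AᵀA, which matters when the number of features is large.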


To prevent overfitting we often use regularization techniques like lasso and ridge. The idea is to add the sum of absolute values of the weights or the sum of squares of the weights, respectively, to our loss function. Read the great tutorials on these algorithms at the end of the article.

Logistic regression

Don’t let the word “regression” in the name confuse you: logistic regression is a classification algorithm. It performs binary classification, so the label outputs are binary. Let’s define P(y=1|x) as the conditional probability that the output y is 1 given the input feature vector x. The coefficients w are the weights the model wants to learn.

Since this algorithm calculates the probability of belonging to each class, you should take into account how much the probability differs from 0 or 1 and average it over all objects, as we did with linear regression. Such a loss function is the average of cross-entropies:

L = −(1/n) · Σ [y · log(y_pred) + (1 − y) · log(1 − y_pred)]

Don’t panic, I’ll make it easy for you. Let y be the right answer (0 or 1) and y_pred the predicted answer. If y equals 0, then the first addend under the sum equals 0, and by the properties of the logarithm the second addend is smaller the closer our prediction y_pred is to 0. The case when y equals 1 is similar.

What is great about logistic regression? It takes a linear combination of features and applies a non-linear function (the sigmoid) to it, so it’s a very, very small instance of a neural network!
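As a sketch of the idea, here is logistic regression trained by gradient descent on the averaged cross-entropy above. The synthetic data and the learning rate are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_pred):
    # Average of cross-entropies, as in the loss formula above.
    eps = 1e-12
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y * np.log(y_pred) + (1 - y) * np.log(1 - y_pred))

# Tiny synthetic binary task: the class depends on the sign of x1 + x2.
rng = np.random.default_rng(1)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)

loss_start = cross_entropy(y, sigmoid(x @ np.zeros(2)))

# Gradient descent: the gradient of the cross-entropy is x^T (y_pred - y) / n.
w = np.zeros(2)
for _ in range(500):
    y_pred = sigmoid(x @ w)
    w -= 0.5 * x.T @ (y_pred - y) / len(y)

loss_end = cross_entropy(y, sigmoid(x @ w))
acc = np.mean((sigmoid(x @ w) > 0.5) == (y == 1))
```

The loss drops from about log 2 (a coin flip) toward 0, and accuracy on this separable toy data approaches 1.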

Decision trees

Another popular and easy-to-understand algorithm is decision trees. Their tree diagrams help you see what the model is thinking, and building them encourages a systematic, documented thought process.

The idea of this algorithm is quite simple. In every node we choose the best split among all features and all possible split points. Each split is selected so as to maximize some functional. In classification trees we use cross-entropy or the Gini index. In regression trees we minimize the sum of squared errors between the target values of the points that fall into a region and the constant value we assign to that region.


We apply this procedure recursively to each node and finish when we meet a stopping criterion, which can vary from the minimum number of objects in a leaf to the tree height. Single trees are used very rarely, but in composition with many others they build very efficient algorithms such as Random Forest or Gradient Tree Boosting.
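The split selection described above can be sketched for a single feature. This toy example (with made-up data) scans all candidate thresholds and picks the one that minimizes the weighted Gini impurity of the two children:

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum_k p_k^2, zero for a pure node.
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    # Try each unique value of the feature as a threshold and keep the
    # one with the lowest weighted impurity of the left/right children.
    best_t, best_score = None, float("inf")
    for t in np.unique(x):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Two clearly separated groups of values: a perfect split exists.
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, score = best_split(x, y)
```

A real tree repeats this search over all features at every node and then recurses into the two children, exactly as the text describes.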

K-means

Sometimes you don’t know any labels and your goal is to assign labels according to the features of objects. This is called a clustering task.

Suppose you want to divide all data objects into k clusters. You select k random points from your data and call them cluster centers. Every other object is assigned to the closest cluster center. Then the centers are recomputed as the mean of each cluster’s points, and the process repeats until convergence.
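The procedure above can be sketched in a few lines of numpy; the two well-separated blobs here are hypothetical toy data:

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as initial cluster centers.
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to the closest center.
        dists = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster.
        new_centers = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break  # converged: the centers stopped moving
        centers = new_centers
    return centers, labels

# Two well-separated blobs around (0, 0) and (5, 5).
rng = np.random.default_rng(2)
pts = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centers, labels = kmeans(pts, k=2)
```

On data this clean the algorithm recovers the two blobs; the disadvantages discussed next (choosing k, sensitivity to initialization) show up on messier data.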

This is the clearest clustering technique, but it still has some disadvantages. First, you need to know the number of clusters in advance, which is often unknown. Second, the result depends on the points randomly chosen at the beginning, and the algorithm doesn’t guarantee that we’ll reach the global minimum of the objective.

There is a range of clustering methods with different advantages and disadvantages, which you can learn about in the recommended reading.

Principal component analysis (PCA)

Have you ever prepared for a difficult exam on the last night or during the last hours? You have no chance to remember all the information, but you want to maximize what you can remember in the time available, for example, by learning first the theorems that occur in many exam questions, and so on.

Principal component analysis is based on the same idea. This algorithm provides dimensionality reduction. Sometimes you have a wide range of features, probably highly correlated with each other, and models can easily overfit on such data. Then you can apply PCA.

The idea is to find a few directions (vectors) that preserve as much of the data’s variance as possible and project the features onto them. Surprisingly, these vectors are the eigenvectors of the correlation matrix of the dataset’s features.


The algorithm now is clear:

1. We calculate the correlation matrix of feature columns and find eigenvectors of this matrix.

2. We take these multidimensional vectors and calculate the projection of all features on them.

New features are the coordinates of this projection, and their number equals the number of eigenvectors you project onto.
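A minimal sketch of these two steps on hypothetical correlated data. Note one simplification: the covariance matrix of centered features is used here; for standardized features it coincides with the correlation matrix the text mentions:

```python
import numpy as np

# 200 samples of 3 features; the first two are strongly correlated.
rng = np.random.default_rng(3)
z = rng.normal(size=200)
X = np.column_stack([z, 2 * z + 0.1 * rng.normal(size=200), rng.normal(size=200)])

# Step 1: covariance matrix of centered features and its eigenvectors.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
order = eigvals.argsort()[::-1]          # reorder by descending explained variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 2: project all objects onto the top-2 eigenvectors -> new 2-D features.
X_reduced = Xc @ eigvecs[:, :2]

explained = eigvals[:2].sum() / eigvals.sum()
```

Because two of the three features are nearly redundant, the top two components retain almost all of the variance, which is exactly the point of PCA.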

Neural networks

I have already mentioned neural networks when we talked about logistic regression. There are a lot of different architectures that are valuable for very specific tasks. Most often, a network is a series of layers or components with linear connections between them, followed by nonlinearities.
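As a tiny illustration of “linear connections followed by nonlinearities,” here is a two-layer network with hand-picked (hypothetical) weights that computes XOR, something no single linear classifier can do:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights: hidden unit 1 acts like OR, hidden unit 2 like NAND,
# and the output unit ANDs them together, which yields XOR.
W1 = np.array([[20.0, 20.0], [-20.0, -20.0]])
b1 = np.array([-10.0, 30.0])
W2 = np.array([20.0, 20.0])
b2 = -30.0

def forward(x):
    h = sigmoid(x @ W1.T + b1)   # hidden layer: linear combination + nonlinearity
    return sigmoid(h @ W2 + b2)  # output layer: a small "logistic regression" on h

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
preds = forward(X)
```

In practice the weights are of course learned by backpropagation rather than set by hand, but the forward pass has exactly this layered structure.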

If you’re working with images, convolutional neural networks show great results. Such networks consist of convolutional and pooling layers with nonlinearities, capable of capturing the characteristic features of images.


For working with texts and sequences, you’d better choose recurrent neural networks. RNNs contain LSTM or GRU modules and can work with data whose length isn’t known in advance. Perhaps one of the best-known applications of RNNs is machine translation.

Conclusion

I hope I could give you a general understanding of the most used machine learning algorithms and some intuition on how to choose one for your specific problem. To make things easier for you, I’ve prepared a structured overview of their main features.

Linear regression and Linear classifier. Despite their apparent simplicity, they are very useful on a huge number of features, where more complex algorithms suffer from overfitting.

Logistic regression is the simplest non-linear classifier, with a linear combination of parameters and a nonlinear function (the sigmoid), used for binary classification.

Decision trees are often similar to people’s decision processes and are easy to interpret. But they are most often used in compositions such as Random Forest or Gradient Boosting.

K-means is a more primitive, but very easy-to-understand algorithm that can be perfect as a baseline in a variety of problems.

PCA is a great choice to reduce dimensionality of your feature space with minimum loss of information.

Neural Networks are a new era of machine learning algorithms and can be applied to many tasks, but training them requires huge computational resources.

Recommended sources

