An annotated path to start with Machine Learning

Machine Learning is becoming more and more widespread and, day after day, new computer scientists and engineers begin their long jump into this wonderful world. Unfortunately, the number of theories, algorithms, applications, papers, books, videos and so forth is so huge that it disorients anyone who doesn’t have a clear picture of what he/she wants or needs to learn to improve his/her skills. In this short post, I want to share my experience, suggesting a feasible path to quickly learn the essential concepts and be ready to go deeper into the most complex topics. Of course, this is only a personal proposal: every student can choose to dedicate more attention to the topics that are more interesting based on his/her experience.

Prerequisites

Machine Learning is based on Mathematics. It’s not an optional, theoretical approach: it’s a fundamental pillar that cannot be discarded. If you are a computer engineer, working daily with UML, ORM, Design Patterns and many other software engineering tools/techniques, close your eyes and, for one second, forget almost everything. This doesn’t mean that all those concepts aren’t important. They are! But Machine Learning needs a different approach. One of the reasons why Python has become more and more popular in this field is its “prototyping speed”. In Machine Learning, a language that allows you to model an algorithm with a few lines of code (without classes, interfaces and all other OO infrastructures) is absolutely a must.

Calculus, Probability Theory and Linear Algebra are necessary mathematical skills for almost any algorithm. If you already have a good mathematical background, you can skip this section; otherwise, it’s a good idea to refresh some important concepts. Considering the number of theories, I discourage starting with big manuals: even if they can be used later to look up particular concepts, at the beginning it’s better to focus on a simple subset of topics. There are many good online resources (like Coursera, Khan Academy or Udacity, just to name a few) which adopt a pragmatic approach suitable to any background. However, my suggestion is to use a brief compendium, where the most important concepts are explained, and to go on by searching and studying new elements whenever they are needed. This isn’t a very systematic approach, but the alternative has a dramatic drawback: the huge number of concepts can discourage and disorient anyone without a solid academic background.

An acceptable starting “recipe” can be:

  • Probability theory
    • Discrete and continuous random variables
    • Main distributions (Bernoulli, Categorical, Binomial, Normal, Exponential, Poisson, Beta, Gamma)
    • Moments
    • Bayes statistics
    • Correlation and Covariance
  • Linear algebra
    • Vectors and matrices
    • Determinant of a matrix
    • Eigenvectors and eigenvalues
    • Matrix factorization (like SVD)
  • Calculus
    • Real functions
    • Derivatives, Integrals
    • Main numerical methods

There are many free resources on the web, like:

Wikipedia is also a very good resource and many formulas, theories and theorems are explained in a clear and comprehensible way.

A Machine Learning path proposal (for very beginners)

Feature Engineering

The very first step to jump into Machine Learning is understanding how to measure and improve the quality of datasets. Managing categorical and missing features, normalization and dimensionality reduction (PCA, ICA, NMF) are fundamental techniques that can dramatically improve the performance of any algorithm. It’s also useful to study how to split a dataset into training and test sets and how to adopt Cross-Validation instead of a single train/test evaluation.
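To make these steps concrete, here is a minimal sketch, assuming Scikit-Learn is installed. The dataset (Iris), the choice of two principal components and the classifier are only illustrative assumptions, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Hold out a test set that is never touched during model selection.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Scaling and PCA live inside the pipeline, so every CV fold fits them
# on its own training part only (nothing leaks from the validation part).
model = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    LogisticRegression(max_iter=1000),
)

cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("CV accuracy: %.3f +/- %.3f" % (cv_scores.mean(), cv_scores.std()))

model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("Test accuracy: %.3f" % test_accuracy)
```

Keeping the preprocessing inside the pipeline is a design choice: it guarantees that the scaler and the PCA are always fitted on training data only.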

A good tutorial on Principal Component Analysis is:

Numpy: the king of Python mathematics!

When working with Python, Numpy is much more than a library. It’s the foundation of almost any machine learning implementation and it’s absolutely necessary to know how it works, focusing on the concepts of vectorization and broadcasting. Through these techniques, it’s possible to speed up the learning process of the majority of algorithms, exploiting the power of multithreading and of SIMD and MIMD architectures.
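A small sketch of both ideas, using only Numpy. The random data and the array names are hypothetical and serve only to show the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))   # 1000 samples, 3 features

# Broadcasting: the (3,) mean and std vectors are stretched across all rows.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Explicit Python loop computing the squared norm of each row...
norms_loop = np.array([sum(v * v for v in row) for row in X_std])

# ...and the vectorized equivalent, executed in compiled code.
norms_vec = np.sum(X_std ** 2, axis=1)

# Pairwise distances between 5 points via broadcasting:
# (5, 1, 3) - (1, 5, 3) -> (5, 5, 3)
P = X_std[:5]
diff = P[:, None, :] - P[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))
```

The loop and the vectorized expression return the same values; the difference is only where the iteration happens.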

The official documentation is complete, however, I suggest also these resources:

Data visualization

Even if it’s not a purely Machine Learning topic, it’s important to know how to visualize datasets. Matplotlib is probably the most widespread solution: it’s easy to use and allows plotting different kinds of charts. Very interesting alternatives are offered by Bokeh and Seaborn. It’s not necessary to have a complete knowledge of all packages, but it’s useful to know the strengths and weaknesses of each of them, so as to be able to pick the right package when needed.

A good resource to learn Matplotlib details is:

Linear Regression

Linear Regression is one of the simplest models and can be studied as an optimization problem that is solved by minimizing the mean squared error. This approach is effective but limits the possibilities that can be exploited. I also suggest studying it as a Bayesian problem, where the parameters are represented using prior probabilities (Gaussian-distributed, for example) and the optimization becomes a MAP (Maximum A Posteriori) estimation, which reduces to an MLE (Maximum Likelihood Estimation) when the prior is flat. Even if it can seem more complex, this approach offers a new vision that is shared by dozens of other more complex models.
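The two points of view can be compared with a few lines of Numpy. This is only a sketch on synthetic data: the true weights, the noise level and the prior strength alpha are assumptions chosen for illustration. Under Gaussian noise, least squares coincides with the maximum likelihood solution, while a zero-mean Gaussian prior on the weights adds an L2 penalty (Ridge regression).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Design matrix: a bias column plus one feature (synthetic data)
X = np.column_stack([np.ones(n), rng.uniform(-1, 1, size=n)])
true_w = np.array([0.5, 2.0])
y = X @ true_w + rng.normal(scale=0.1, size=n)

# Least squares: minimize ||y - Xw||^2
w_mle, *_ = np.linalg.lstsq(X, y, rcond=None)

# Gaussian prior on w: minimize ||y - Xw||^2 + alpha * ||w||^2,
# whose closed form is (X^T X + alpha I)^-1 X^T y.
# For simplicity the bias is penalized too; alpha is illustrative.
alpha = 10.0
w_map = np.linalg.solve(X.T @ X + alpha * np.eye(2), X.T @ y)

print("least squares:", w_mle)
print("with Gaussian prior:", w_map)
```

The prior pulls the weights toward zero, so the second solution always has a smaller norm than the first one.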

A very useful introduction to Bayesian Statistics is available on Coursera:

I also suggest these books:

Linear Classification

Logistic Regression is normally the best starting point. It’s also a good opportunity to study some Information Theory, to understand the power of concepts like entropy, cross-entropy, and mutual information. Categorical cross-entropy is the most stable and widespread cost function in deep learning classification and a simple logistic regression can show how it can speed up the learning process (compared to the mean squared error). Another important topic is regularization (in particular, Ridge, Lasso, and ElasticNet). Too often, it is considered an “esoteric” way to improve the accuracy of a model, but its real meaning is much more precise and should be understood with some concrete examples. I also suggest starting to consider a Logistic Regression as a simple neural network, visualizing (for 2D examples) how the weight vector moves during the learning process.
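As a sketch of these ideas, the following Numpy code trains a logistic regression by gradient descent on the binary cross-entropy with a Ridge (L2) penalty. The synthetic data, the learning rate and the penalty strength are assumptions for illustration; the list w_path stores the weight vector at each step, so its trajectory can be plotted afterwards.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian blobs in 2D (synthetic, illustrative data)
X = np.vstack([rng.normal(-1.0, 1.0, size=(100, 2)),
               rng.normal(+1.0, 1.0, size=(100, 2))])
y = np.concatenate([np.zeros(100), np.ones(100)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, y, eps=1e-12):
    # Mean binary cross-entropy; clipping avoids log(0)
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

w, b = np.zeros(2), 0.0
lr, l2 = 0.1, 0.01          # learning rate and Ridge (L2) strength
losses, w_path = [], []

for _ in range(200):
    p = sigmoid(X @ w + b)
    losses.append(cross_entropy(p, y))
    w_path.append(w.copy())
    # Gradient of the mean cross-entropy plus the penalty 0.5 * l2 * ||w||^2
    grad_w = X.T @ (p - y) / len(y) + l2 * w
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

accuracy = np.mean((sigmoid(X @ w + b) > 0.5) == y)
print("final loss: %.3f, accuracy: %.3f" % (losses[-1], accuracy))
```

With zero initial weights every prediction is 0.5, so the first loss value equals log 2; the following values show how quickly the cross-entropy drops.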

I also suggest including the Hyperparameter Grid Search methods in this section. Instead of trying different values blindly, Grid Search allows evaluating the performance of different sets of hyperparameters. The engineer can, therefore, focus his/her attention only on the combinations that yield the highest accuracy.
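A minimal sketch with Scikit-Learn’s GridSearchCV follows. The dataset and the grid of values for the inverse regularization strength C are illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=5000))

# Parameter names follow the "<step name>__<parameter>" convention.
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# Every candidate is evaluated with 5-fold cross-validation.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The attribute cv_results_ contains the scores of all candidates, which is useful to check how sensitive the model is to each hyperparameter.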

Support Vector Machines

Support Vector Machines offer a different approach to classification (both linear and nonlinear). The algorithm is very simple and can be learned by every student with a basic knowledge of geometry. However, it’s very useful to understand how kernel-SVMs work because their real power is shown in tasks where linear methods fail.
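The difference between a linear and a kernel SVM can be seen on a dataset that is not linearly separable. The following is a sketch under assumed settings (two concentric circles, default RBF parameters), not a tuned model.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric circles: no straight line can separate the classes.
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

linear_svm = SVC(kernel="linear").fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)

linear_acc = linear_svm.score(X_test, y_test)
rbf_acc = rbf_svm.score(X_test, y_test)
print("linear: %.3f, RBF kernel: %.3f" % (linear_acc, rbf_acc))
```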

Some useful free resources:

Decision Trees

Another approach to classification and regression is offered by Decision Trees. In general, they are not the first choice for very complex problems, but they offer a completely different approach, which can be easily understood even by non-technical people and can be visualized during meetings or official presentations.
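To show how readable a tree can be, here is a small sketch that trains a shallow tree with Scikit-Learn and prints its rules as text. The dataset and the maximum depth are illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree is less accurate but much easier to read and present.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Plain-text representation of the learned if/else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)

train_accuracy = tree.score(iris.data, iris.target)
```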

A good tutorial (easy) on Decision Trees is:

Classification metrics

Evaluating the performance of a classifier can be more difficult than expected. The overall accuracy is a good measure, but it is often necessary to evaluate the behavior with respect to false positives and false negatives. I suggest dedicating some time to studying Precision, Recall, F-Score and the ROC Curve. They can dramatically change the way a model is considered acceptable or not. Pay attention to Recall, which measures how many actual positives are detected, and therefore the impact of false negatives. Having a good precision but a bad recall means that your model is generating many false negatives (think about this in a medical environment). The F(beta)-Score is a good trade-off between precision and recall.
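The following plain-Python sketch computes these metrics from scratch. The helper name binary_metrics and the toy “medical” labels are hypothetical: 100 patients, 10 of them sick, and a model that finds only 4 of them while raising 1 false alarm.

```python
def binary_metrics(y_true, y_pred, beta=1.0):
    """Accuracy, precision, recall and F(beta) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "fbeta": fbeta}

# 10 sick patients (4 detected, 6 missed), 90 healthy (1 false alarm)
y_true = [1] * 10 + [0] * 90
y_pred = [1] * 4 + [0] * 6 + [1] * 1 + [0] * 89

m1 = binary_metrics(y_true, y_pred)             # F1
m2 = binary_metrics(y_true, y_pred, beta=2.0)   # F2 weighs recall more
print(m1)
print(m2)
```

Here the accuracy is 0.93 and the precision is 0.8, but the recall is only 0.4: six sick patients out of ten are missed. With beta greater than 1 the F-Score gives more weight to recall, so it penalizes this model more strongly.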

A quick glimpse into Ensemble Learning

After having understood the dynamics of a Decision Tree, it’s useful to study methods where sets (ensembles) of trees are trained together to improve the overall accuracy. Random Forests, Gradient Tree Boosting and AdaBoost are powerful algorithms whose complexity is reasonably low. It’s interesting to compare the learning process of a simple tree with the ones adopted by boosting and bagging methods. Scikit-Learn provides the most common implementations, but if you want to exploit the full power of these approaches, I suggest dedicating some time to studying XGBoost, which is a distributed framework that can work both with CPUs and GPUs, speeding up the training process even with very large datasets.
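A quick way to make this comparison is to cross-validate a single tree and a few ensembles on the same data. This is only a sketch with Scikit-Learn: the dataset and the number of estimators are illustrative assumptions, and the scores will change with other data.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

models = {
    "single tree": DecisionTreeClassifier(random_state=0),
    "random forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
    "adaboost": AdaBoostClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validation accuracy for each model
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in models.items()}

for name, score in scores.items():
    print("%-25s %.3f" % (name, score))
```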

Two valid tutorials on Ensemble Learning are:

Clustering

When starting with clustering methods, in my opinion, the best thing to do is to consider the Gaussian Mixture algorithm (based on EM, Expectation-Maximization). Even if K-Means is much simpler (and must be studied), Gaussian Mixture offers a probabilistic approach, which is useful for many other similar tasks. Other algorithms that must be studied include Hierarchical Clustering, Spectral Clustering, and DBSCAN. It’s also useful to understand the idea of instance-based learning, studying the k-Nearest Neighbors algorithm (that can be adopted for both supervised and unsupervised tasks).
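The difference between the two approaches is easy to see in code. In this sketch (Scikit-Learn, with synthetic blobs whose centers are an assumption chosen for illustration), K-Means returns one hard label per point, while the Gaussian Mixture returns a probability for each component.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

centers = [[-5, -5], [0, 5], [5, -5]]   # illustrative, well-separated
X, y_true = make_blobs(n_samples=300, centers=centers,
                       cluster_std=0.8, random_state=0)

# K-Means: hard assignments, one label per sample
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
hard_labels = kmeans.labels_

# Gaussian Mixture (fitted with EM): soft assignments
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
soft_labels = gmm.predict_proba(X)        # shape (300, 3), rows sum to 1
gmm_labels = soft_labels.argmax(axis=1)

print(hard_labels[:5])
print(np.round(soft_labels[:5], 3))
```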

A good introduction to K-Means is: Introduction to K-Means Clustering

A useful free resource on Spectral Clustering is:

Clustering metrics

Clustering metrics are a little bit more empirical because their semantics can change with the context. However, I suggest studying Silhouette plots and some ground-truth methods (like the adjusted Rand score). They can provide you with a complete insight into the structure of the clustering process, showing the situations where hyperparameter tuning is probably necessary.
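Both kinds of metric are available in Scikit-Learn. The sketch below uses synthetic, well-separated blobs (an assumption for illustration) to pick the number of clusters with the silhouette score and then checks the result against the known labels with the adjusted Rand score.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

centers = [[-5, -5], [0, 5], [5, -5]]   # illustrative, well-separated
X, y_true = make_blobs(n_samples=300, centers=centers,
                       cluster_std=0.6, random_state=0)

# Silhouette needs no ground truth: compare several values of k.
silhouettes = {}
for k in range(2, 7):
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels_k)

best_k = max(silhouettes, key=silhouettes.get)

# Adjusted Rand score: agreement with the true labels (1.0 = perfect)
labels = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
ari = adjusted_rand_score(y_true, labels)

print(silhouettes)
print("best k:", best_k, "adjusted Rand score: %.3f" % ari)
```

On real data the silhouette curve is rarely this clear-cut, which is exactly when inspecting the full Silhouette plot becomes useful.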

A very interesting resource on cluster stability is:

Neural Networks for beginners

Neural networks are the basis of Deep Learning and they should be studied in a separate course. However, I think it’s useful to understand the concepts of Perceptron, Multi-Layer Perceptron, and the Backpropagation algorithm. Scikit-Learn offers a very simple implementation of Neural Networks; however, it might be a good idea to start exploring Keras, which is a high-level framework based on TensorFlow, Theano or CNTK, that allows modeling and training neural networks with a minimum initial effort.
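As a first experiment, a Multi-Layer Perceptron can be trained with Scikit-Learn in a few lines. This is a sketch: the dataset (two interleaving half-moons), the two hidden layers of 16 units and the other settings are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# A 2D dataset that is not linearly separable
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two hidden layers; the weights are learned by backpropagation
mlp = MLPClassifier(hidden_layer_sizes=(16, 16), activation="relu",
                    max_iter=2000, random_state=0)
mlp.fit(X_train, y_train)

test_acc = mlp.score(X_test, y_test)
print("test accuracy: %.3f" % test_acc)
```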

Some good resources to start with Neural Networks are:

The best Deep Learning book (advanced) available on the market is probably:

  • Goodfellow I., Bengio Y., Courville A., Deep Learning, The MIT Press

Moreover, I suggest “playing” with the TensorFlow Playground, where it’s possible to simulate a complex neural network, tune its parameters and observe the resulting output.

Path image copyright by TheRitters.

See also:

Eight reasons to study Machine Learning
