Supervised Learning with Azure


weixin_26752765


Original article: https://medium.com/ml-course-microsoft-udacity/supervised-learning-with-azure-23204eae32d6


Machine learning sounds cool, doesn’t it? I’m a biology student who had no idea about this branch of computer science until the lockdown gave me the time and energy to explore it. For those who need a layman’s introduction to machine learning, here is an example. One day my dad asked me what I keep studying. I didn’t know how to explain it to him. The words going through my mind were normalization, overfitting, models, Azure, and so on. A minute later, he was dictating a text message to a friend using Google speech recognition on his phone. My next sentence was, “That’s what I’m studying, Dad!” The science behind this process is called machine learning. It is a subset of artificial intelligence that focuses on creating programs that can learn without being explicitly programmed.

This article covers one of the basic concepts of machine learning: supervised learning. Hope you all enjoy it!

1. Supervised Learning: Classification

The first type of supervised learning that we’ll look at is classification. Recall that the main distinguishing characteristic of classification is the type of output it produces:


In a classification problem, the outputs are categorical or discrete. Within this broad definition, there are several main approaches, which differ in how many classes or categories are used and in whether each output can belong to only one class or to multiple classes. Let’s have a look.

Some of the most common types of classification problems include:

· Classification on tabular data: The data is available in the form of rows and columns, potentially originating from a wide variety of data sources.

· Classification on image or sound data: The training data consists of images or sounds whose categories are already known.

· Classification on text data: The training data consists of texts whose categories are already known.

As we know, machine learning requires numerical data. This means that with images, sound, and text, several steps need to be performed during the preparation phase to transform the data into numerical vectors that can be accepted by the classification algorithms.

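To make the "transform the data into numerical vectors" step concrete, here is a minimal sketch of one common approach for text, a bag-of-words count vector. The function names and the three sample texts are made up for illustration, and real pipelines usually add steps such as punctuation handling and weighting.

```python
from collections import Counter


def build_vocabulary(texts):
    """Map every distinct lowercase word to a column index (alphabetical order)."""
    words = sorted({word for text in texts for word in text.lower().split()})
    return {word: index for index, word in enumerate(words)}


def vectorize(text, vocabulary):
    """Turn one text into a vector of word counts, one entry per vocabulary word."""
    counts = Counter(text.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical sample texts, only for illustration.
texts = ["good movie", "bad movie", "good good plot"]
vocabulary = build_vocabulary(texts)
vectors = [vectorize(text, vocabulary) for text in texts]
print(vocabulary)
print(vectors)
```

Each text becomes a list of numbers of the same length, which is the shape a classification algorithm can accept.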

[Image omitted. Source: Udacity course for ML in Azure]

The following images are just an introduction to the various algorithms with their major characteristics. No need to get overwhelmed! Learning about algorithms is a slow and steady process.


*One-vs-all method: A binary model is created for each of the output classes. Each binary model is assessed against its complement (all other classes in the model) as though it were a binary classification problem. Prediction is then performed by running all of these binary classifiers and choosing the prediction with the highest confidence score.

In essence, an ensemble of individual models is created and the results are then merged, to create a single model that predicts all classes. Thus, any binary classifier can be used as the basis for a one-vs-all model.

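The one-vs-all idea can be sketched in a few lines. This is only an illustration under my own assumptions: the binary classifier is a tiny logistic regression trained with stochastic gradient descent, and the data, function names, and default settings are invented for the example. Any other binary classifier that returns a confidence score could take its place.

```python
import math


def train_binary(points, labels, learning_rate=0.1, iterations=500):
    """Fit a tiny logistic regression (weights plus bias) with per-sample updates."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(iterations):
        for x, y in zip(points, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            error = 1.0 / (1.0 + math.exp(-score)) - y
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def confidence(model, x):
    """Probability that x belongs to the positive class of one binary model."""
    weights, bias = model
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def train_one_vs_all(points, labels):
    """Train one binary model per class: that class (1) against all others (0)."""
    models = {}
    for cls in sorted(set(labels)):
        binary_labels = [1 if label == cls else 0 for label in labels]
        models[cls] = train_binary(points, binary_labels)
    return models


def predict(models, x):
    """Run every binary model and keep the class with the highest confidence."""
    return max(models, key=lambda cls: confidence(models[cls], x))


# Hypothetical toy data: three well-separated classes.
points = [[0.0, 0.0], [0.2, 0.1], [5.0, 0.0], [5.1, 0.2], [0.0, 5.0], [0.1, 5.2]]
labels = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_all(points, labels)
print(predict(models, [4.8, 0.1]))
```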

*SMOTE (synthetic minority oversampling technique) is one of the most commonly used oversampling methods for the class imbalance problem. It aims to balance the class distribution by increasing the number of minority class examples. Rather than simply replicating existing examples, SMOTE synthesizes new minority instances by interpolating between existing minority instances.
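Here is a rough sketch of the interpolation step behind SMOTE, not a full implementation. The function name, the choice of `k`, the fixed seed, and the sample points are all assumptions made for the example.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the line segment between
    a random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared distance to the base point.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class examples with two features each.
minority = [[1.0, 1.0], [1.5, 1.2], [1.2, 1.8], [2.0, 1.5]]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every new point sits between two real minority points, the synthetic examples stay inside the region the minority class already occupies.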

2. Multi-Class Algorithms

a) Multi-class Logistic Regression

*Logistic Regression is a classification method used to predict the value of a categorical dependent variable from its relationship to one or more independent variables, modeling the probability of each outcome with a logistic function. If the dependent variable has only two possible values (for example, success/failure), then the logistic regression is binary. If the dependent variable has more than two possible values (for example, blood type given diagnostic test results), then the logistic regression is multinomial.

Two key parameters to configure this algorithm are:

-Optimization tolerance: Controls when to stop the iterations. If the improvement between iterations is less than the specified threshold, the algorithm stops and returns the current model.

-Regularization weight: Regularization is a method to prevent overfitting by penalizing models with extreme coefficient values. This factor determines how much to penalize the models at each iteration.
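To see where these two parameters show up in practice, here is a small multinomial (softmax) logistic regression trained with plain gradient descent. It is a sketch under my own assumptions, not the Azure implementation: the penalty is an L2 penalty (applied to the bias terms too, for brevity), and the parameter names `tolerance` and `regularization_weight`, the defaults, and the toy data are chosen only for illustration.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax(points, labels, n_classes, learning_rate=0.5,
                  regularization_weight=0.01, tolerance=1e-6, max_iterations=5000):
    n, width = len(points), len(points[0]) + 1  # one extra column for the bias
    weights = [[0.0] * width for _ in range(n_classes)]
    previous_loss = float("inf")
    for iteration in range(1, max_iterations + 1):
        gradients = [[0.0] * width for _ in range(n_classes)]
        loss = 0.0
        for x, y in zip(points, labels):
            features = x + [1.0]
            probs = softmax([sum(w * v for w, v in zip(row, features)) for row in weights])
            loss -= math.log(probs[y])
            for c in range(n_classes):
                error = probs[c] - (1.0 if c == y else 0.0)
                for j, v in enumerate(features):
                    gradients[c][j] += error * v
        # Regularization weight: penalize large coefficients.
        penalty = 0.5 * regularization_weight * sum(w * w for row in weights for w in row)
        loss = loss / n + penalty
        # Optimization tolerance: stop once the improvement is too small.
        if previous_loss - loss < tolerance:
            break
        previous_loss = loss
        for c in range(n_classes):
            for j in range(width):
                grad = gradients[c][j] / n + regularization_weight * weights[c][j]
                weights[c][j] -= learning_rate * grad
    return weights, iteration


def predict(weights, x):
    """Return the index of the class with the highest score."""
    features = x + [1.0]
    scores = [sum(w * v for w, v in zip(row, features)) for row in weights]
    return scores.index(max(scores))


# Hypothetical toy data: three classes, two features.
points = [[0.0, 0.1], [0.1, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1, 2, 2]
weights, iterations_used = train_softmax(points, labels, n_classes=3)
print(iterations_used, predict(weights, [0.95, 0.05]))
```

A looser tolerance stops training earlier, and a larger regularization weight pulls the coefficients closer to zero.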

b) Multi-class Neural Network

Includes an input layer, a hidden layer, and an output layer. The relationship between input and output is learned by training the neural network on input data.

Three key parameters include:

-The number of hidden nodes: Lets you customize the number of hidden nodes in the neural network.

-Learning rate: Controls the size of the step taken at each iteration before correction.

-The number of learning iterations: The maximum number of times the algorithm should process the training cases.

c) Multi-class Decision Forest

An ensemble of decision trees. It works by building multiple decision trees and then voting on the most popular output class.

Five key parameters include:

-Resampling method: Controls the method used to create the individual trees.

-The number of decision trees: Specifies the maximum number of decision trees that can be created in the ensemble.

-Maximum depth of the decision trees: Limits the maximum depth of any decision tree.

-The number of random splits per node: The number of splits to use when building each node of the tree.

-The minimum number of samples per leaf node: Controls the minimum number of cases that are required to create any terminal node in a tree.
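The "build many trees, then vote" idea behind a decision forest can be sketched as follows. To keep it short I made several simplifying assumptions: every tree is a one-level stump (maximum depth 1), the resampling method is bagging (sampling the training set with replacement), and each stump searches all thresholds instead of a random subset of splits. The names and the toy data are invented for the example.

```python
import random
from collections import Counter


def train_stump(points, labels):
    """Find the single feature/threshold split with the fewest training errors."""
    best = None
    for feature in range(len(points[0])):
        for threshold in sorted({p[feature] for p in points}):
            left = [l for p, l in zip(points, labels) if p[feature] <= threshold]
            right = [l for p, l in zip(points, labels) if p[feature] > threshold]
            if not left or not right:
                continue
            left_class = Counter(left).most_common(1)[0][0]
            right_class = Counter(right).most_common(1)[0][0]
            errors = (sum(l != left_class for l in left)
                      + sum(l != right_class for l in right))
            if best is None or errors < best[0]:
                best = (errors, feature, threshold, left_class, right_class)
    if best is None:  # no usable split: always predict the majority class
        majority = Counter(labels).most_common(1)[0][0]
        return (0, 0.0, majority, majority)
    return best[1:]


def predict_stump(stump, x):
    feature, threshold, left_class, right_class = stump
    return left_class if x[feature] <= threshold else right_class


def train_forest(points, labels, n_trees=25, seed=0):
    """Train each stump on its own bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        indices = [rng.randrange(len(points)) for _ in points]
        forest.append(train_stump([points[i] for i in indices],
                                  [labels[i] for i in indices]))
    return forest


def predict_forest(forest, x):
    """Every tree votes; the most popular class wins."""
    votes = Counter(predict_stump(stump, x) for stump in forest)
    return votes.most_common(1)[0][0]


# Hypothetical toy data: the first feature separates the two classes.
points = [[1.0, 5.0], [2.0, 4.0], [3.0, 6.0], [7.0, 5.0], [8.0, 4.0], [9.0, 6.0]]
labels = ["low", "low", "low", "high", "high", "high"]
forest = train_forest(points, labels)
print(predict_forest(forest, [2.0, 4.5]), predict_forest(forest, [8.0, 4.5]))
```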

3. Supervised Learning: Regression

In a regression problem, the output is numerical or continuous.

3.1 Introduction to Regression

Common types of regression problems include:

· Regression on tabular data: The data is available in the form of rows and columns, potentially originating from a wide variety of data sources.

· Regression on image or sound data: Training data consists of images/sounds whose numerical scores are already known. Several steps need to be performed during the preparation phase to transform images/sounds into numerical vectors accepted by the algorithms.

· Regression on text data: Training data consists of texts whose numerical scores are already known. Several steps need to be performed during the preparation phase to transform the text into numerical vectors accepted by the algorithms.

Examples: Housing prices, Customer churn, Customer Lifetime Value, Forecasting (time series), and Anomaly Detection.

3.2 Categories of Algorithms

Common machine learning algorithms for regression problems include:

· Linear Regression: fast training, linear model

· Decision Forest Regression: accurate, fast training times

· Neural Net Regression: accurate, long training times

In linear regression, the numerical outcome is the dependent variable.

*Ordinary least squares method: Calculates the error as the sum of the squared distances from the actual values to the predicted line, and fits the model by minimizing this squared error. This method assumes a strong linear relationship between the inputs and the dependent variable.

*Gradient Descent: Adjusts the model parameters step by step, reducing the amount of error at each step of the model training process.
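Both approaches can be compared on a tiny example with one input feature. This is a sketch with made-up data and my own choice of learning rate and iteration count: the first function uses the closed-form least squares solution, the second reaches nearly the same line by repeatedly stepping against the gradient of the mean squared error.

```python
def fit_ols(xs, ys):
    """Closed-form least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def fit_gradient_descent(xs, ys, learning_rate=0.05, iterations=5000):
    """Minimize the mean squared error with small repeated parameter updates."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        slope -= learning_rate * 2 / n * sum(e * x for e, x in zip(errors, xs))
        intercept -= learning_rate * 2 / n * sum(errors)
    return slope, intercept


# Hypothetical data that is roughly linear.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
ols_slope, ols_intercept = fit_ols(xs, ys)
gd_slope, gd_intercept = fit_gradient_descent(xs, ys)
print(ols_slope, ols_intercept)
print(gd_slope, gd_intercept)
```

On this data the two fits agree closely. Gradient descent becomes the practical choice when the dataset or the number of features is too large for the closed-form solution.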

Decision forest regression supports some of the same hyper-parameters discussed for the multi-class decision forest algorithm, such as the number of trees and the maximum depth.

Neural net regression is a supervised learning method, so it requires a tagged dataset that includes a label column, which must be a numerical data type. The algorithm also supports the same hyper-parameters as the multi-class neural network algorithm: the number of hidden nodes, the learning rate, and the number of iterations.

*Regularization is a technique that constrains or shrinks the coefficient estimates towards zero; its strength is typically set through a hyperparameter. It reduces the risk of overfitting by discouraging the learning of a more complex or flexible model.
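The shrinking effect of regularization is easy to see in the simplest possible case. The sketch below assumes an L2 (ridge) penalty and a one-feature model with no intercept, y ≈ slope · x; the text above does not name a specific penalty, so that choice, the function name, and the data are mine. Minimizing sum((y - slope·x)²) + penalty · slope² gives slope = Σxy / (Σx² + penalty).

```python
def ridge_slope(xs, ys, penalty):
    """Slope of y = slope * x that minimizes squared error plus penalty * slope**2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + penalty)


# Hypothetical data generated by y = 2 * x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
slopes = {penalty: ridge_slope(xs, ys, penalty) for penalty in (0.0, 14.0, 126.0)}
print(slopes)
```

With no penalty the fit recovers the true slope of 2; as the penalty grows, the estimate is pulled towards zero.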

4. Automate the Training of Regressors

Key challenges in successfully training a machine learning model include:

-Selecting features from the ones available in the datasets

-Choosing the right algorithm for the task

-Tuning the hyperparameters of the selected algorithm

-Selecting the right evaluation metrics to measure the performance of the trained model

The entire process is also highly iterative.

The idea behind Automated ML is to automate the exploration of the combinations needed to successfully produce a trained model. It intelligently tests multiple algorithms and hyper-parameters in parallel and returns the best one. The next steps include deploying the model into production and, if needed, further customization or refinement to improve performance.
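The core loop behind this idea can be imitated in plain Python. To be clear, this is not the Azure Automated ML API, only a toy sketch of "try several algorithm and hyper-parameter combinations, score each on held-out data, keep the best". The candidate models, names, and data are assumptions, and the candidates run one after another here rather than in parallel.

```python
def knn_regressor(k):
    """Candidate: average the targets of the k nearest training points."""
    def fit(train):
        def predict(x):
            nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
            return sum(y for _, y in nearest) / len(nearest)
        return predict
    return fit


def linear_regressor():
    """Candidate: least squares line through the training points."""
    def fit(train):
        n = len(train)
        mean_x = sum(x for x, _ in train) / n
        mean_y = sum(y for _, y in train) / n
        slope = (sum((x - mean_x) * (y - mean_y) for x, y in train)
                 / sum((x - mean_x) ** 2 for x, _ in train))
        return lambda x: mean_y + slope * (x - mean_x)
    return fit


def mean_squared_error(predict, data):
    """Evaluation metric: average squared difference on the given data."""
    return sum((predict(x) - y) ** 2 for x, y in data) / len(data)


def automated_search(candidates, train, validation):
    """Train every candidate and return (error, name, model) for the best one."""
    results = []
    for name, fit in candidates.items():
        model = fit(train)
        results.append((mean_squared_error(model, validation), name, model))
    return min(results, key=lambda result: result[0])


# Hypothetical data following y = 2x + 1, split into training and validation sets.
train = [(float(x), 2.0 * x + 1.0) for x in range(0, 10, 2)]
validation = [(float(x), 2.0 * x + 1.0) for x in range(1, 10, 2)]
candidates = {
    "linear": linear_regressor(),
    "knn_k1": knn_regressor(1),
    "knn_k3": knn_regressor(3),
}
best_error, best_name, best_model = automated_search(candidates, train, validation)
print(best_name, best_error)
```

The evaluation metric decides the winner, which is why choosing it carefully appears in the list of challenges above.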

Material Reference: Udacity Fundamental Course in Machine Learning for Microsoft Azure
https://docs.microsoft.com/en-us/azure/?product=featured
https://docs.microsoft.com/en-us/

Happy learning :)
