Machine Learning Algorithms

“Data is a powerful entity and machine learning is the art of extracting useful information from the data set”

To practice this art, machine learning offers a range of techniques and algorithms. A machine learning problem is defined by three characteristics: a task, a performance measure, and experience. A model is learning when its performance on the task improves with experience, that is, with previously seen data.

Machine learning is commonly divided into three major categories:

1) Supervised Learning

When a data set comes with a predefined set of labeled inputs/outputs, it is easier to train a model to find the relationship between inputs and outputs. Examples: house price prediction, email spam classification.

2) Unsupervised Learning

When a data set has no predefined set of labeled inputs/outputs, we can still train a model to group the data by characteristics and similarities. Examples: anomaly detection, for instance in cancer screening or in spotting fraudulent transactions.

3) Reinforcement Learning

When the necessary data is not given up front, a series of experiments is performed and the data is collected along the way. The collected data must be representative of the whole range of situations the model will face in order to give accurate results. Example: playing computer games, where accuracy improves through a system of rewards and penalties.

Data Pre-Processing:

Data preprocessing in machine learning refers to preparing (cleaning and organizing) the raw data so that it is suitable for building and training machine learning models.

  • Collecting a large amount of relevant data
  • Replacing null values with the column median and keeping a common data type within each column
  • Dropping unnecessary features
  • Representing characters/strings as numbers using LabelEncoder
  • Data normalization: This technique is required when the features are measured on different scales. For example, human-recognizable units such as a person's age = 25 and the speed of a car = 80 km/hr sit on very different scales, which can easily mislead a model. Data normalization solves this problem by bringing every feature into a common range; after pre-processing, the data will be in the range (-1 to +1).
  • Separating input and output from raw data
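The steps listed above can be sketched in a few lines of pandas. This is a minimal illustration under stated assumptions, not the article's own code: the data frame, its column names, and the `preprocess` helper are hypothetical; `astype("category").cat.codes` stands in for scikit-learn's `LabelEncoder` (both assign integer codes in sorted order); and mean normalization is assumed as the scaling formula because it is consistent with the stated (-1 to +1) range.

```python
import pandas as pd

# Hypothetical raw data; column names and values are made up for illustration.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],                      # carries no predictive signal
    "age": [25.0, None, 35.0, 45.0],         # contains a null value
    "speed_kmh": [80.0, 60.0, 100.0, 120.0],
    "city": ["Pune", "Delhi", "Pune", "Agra"],
    "price": [10.0, 12.0, 15.0, 20.0],       # output column
})

def preprocess(df, target="price", drop=("id",)):
    # Drop unnecessary features (returns a new frame, `raw` is untouched).
    df = df.drop(columns=list(drop))

    # Replace nulls in numeric columns with that column's median.
    numeric = df.select_dtypes(include="number").columns
    df[numeric] = df[numeric].fillna(df[numeric].median())

    # Encode string columns as integer codes (stand-in for LabelEncoder).
    for col in df.select_dtypes(exclude="number").columns:
        df[col] = df[col].astype("category").cat.codes

    # Separate input (X) and output (y).
    X = df.drop(columns=[target]).astype(float)
    y = df[target]

    # Mean normalization (assumed formula): (x - mean) / (max - min).
    # A constant column would divide by zero; this sketch does not handle that.
    X = (X - X.mean()) / (X.max() - X.min())
    return X, y

X, y = preprocess(raw)
print(X)
```

Normalizing only the inputs and leaving the output column untouched is a design choice of this sketch; in practice the scaling statistics would be computed on the training split and reused for new data.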

Frameworks:

  • Pandas: manipulating raw data.
  • NumPy: mathematical calculations.
  • Matplotlib (pyplot): plotting graphs.

Mathematics:

Why matrices: A list of values is represented as a vector, and multiple vectors together form a matrix. In this form, calculations are easier, especially when we have a large amount of data.

[Figure: Matrix Addition]
[Figure: Matrix Multiplication]
[Figure: Matrix Transpose]
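The three matrix operations named above can be tried directly in NumPy. This is a small sketch with made-up 2x2 matrices `A` and `B`, which are not taken from the article.

```python
import numpy as np

# Two small example matrices (values chosen only for illustration).
A = np.array([[1, 2],
              [3, 4]])
B = np.array([[5, 6],
              [7, 8]])

added = A + B        # matrix addition: element by element
product = A @ B      # matrix multiplication: rows of A times columns of B
transposed = A.T     # matrix transpose: rows become columns

print(added)
print(product)
print(transposed)
```

Note that `A * B` in NumPy is element-wise multiplication, not the matrix product; the `@` operator is the one that matches the matrix multiplication used in the model later on.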

Why differentiation: To measure how a small change in one value changes the result while building our model. The derivative applies to single-variable functions, and the partial derivative to multivariate functions. To calculate a partial derivative, we change the value of just one variable while keeping the others constant.
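The idea of changing one variable while holding the others constant can be checked numerically. The sketch below uses a central-difference approximation; the helper `partial_derivative` and the example function `f` are hypothetical names introduced only for this illustration.

```python
def partial_derivative(f, point, index, h=1e-6):
    """Estimate the partial derivative of f at `point` with respect to
    the variable at position `index`, keeping all other variables fixed."""
    up = list(point)
    down = list(point)
    up[index] += h      # nudge only the chosen variable upward
    down[index] -= h    # and downward
    return (f(*up) - f(*down)) / (2 * h)

def f(x, y):
    # Example multivariate function: f(x, y) = x^2 + 3xy
    return x ** 2 + 3 * x * y

# Analytically: df/dx = 2x + 3y and df/dy = 3x.
df_dx = partial_derivative(f, (2.0, 1.0), index=0)
df_dy = partial_derivative(f, (2.0, 1.0), index=1)
print(df_dx, df_dy)
```

At the point (2, 1) the two estimates should come out close to the analytic values 7 and 6, up to floating-point rounding.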

Supervised Learning can be classified into two types:

  • Regression: the output is a continuous range of values.
  • Classification: the output is discrete; the model predicts which one of the groups an input belongs to.

All supervised learning models work on data that should be divided into two parts: input (X) and output (Y). To understand the data and find a suitable model for it, use a scatter plot.

Regression Model:

Looking at the data points below, we tend to describe such data by a line.

[Figure: scatter plot of the data points]

$$y = mx + c$$
Equation of a line, with slope m and intercept c

In order to predict accurate values, we need a best-fit line representing the whole data set. The variables of the line are renamed as is common in the machine learning context:

$$y = \Theta_0 + \Theta_1 x$$

  • x: input parameter
  • Θ0: bias (the intercept of the line, the value of y when x = 0). In the broader sense, bias means taking sides: in ML, we choose one generalization over another from the set of possible generalizations.
  • Θ1: weight (the slope of the line), a model parameter tuned for a given predictive modeling problem. It is learned from the data, unlike a hyperparameter such as the learning rate.
  • y: output

Error Function/Cost:

Every ML model's objective is to bring the error as close to 0 as possible.

[Figure: Minimize the error]

To calculate the total error over all data points, we use a squared error function. Squaring keeps positive and negative errors from cancelling each other out.

Model: X*Θ^T
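To make the model X*Θ^T and the squared error concrete, here is a minimal NumPy sketch. Several things are assumptions rather than the article's own definitions: Θ is stored as a row vector `[[Θ0, Θ1]]`, a column of ones is added to X so that Θ0 acts as the bias, and the error is scaled by 1/(2m), a common convention. The function names `predict` and `squared_error` are hypothetical.

```python
import numpy as np

def predict(X, theta):
    # Model: X * Theta^T. X has shape (m, 2), theta has shape (1, 2),
    # so the result is one prediction per row, shape (m, 1).
    return X @ theta.T

def squared_error(X, y, theta):
    # Sum of squared differences between predictions and actual outputs,
    # scaled by 1/(2m) (assumed convention).
    m = len(y)
    residuals = predict(X, theta) - y
    return float(np.sum(residuals ** 2) / (2 * m))

# Toy data generated from y = 1 + 2x (made up for illustration).
x = np.array([1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])   # the column of ones pairs with Theta0
y = np.array([[3.0], [5.0], [7.0]])

perfect = np.array([[1.0, 2.0]])   # Theta0 = 1, Theta1 = 2
off = np.array([[0.0, 2.0]])       # bias is wrong by 1

print(squared_error(X, y, perfect))
print(squared_error(X, y, off))
```

With the perfect parameters every residual is zero, so the error is zero; with the bias off by 1, each of the three residuals is -1 and the scaled error is 3 / 6 = 0.5.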

Minimize Error: We can minimize the error by trying out different Θ1 values in our model.

Gradient Descent:

[Figure: squared error plotted against theta values]

The graph shown here plots the squared error value for each of multiple theta values.

No matter how high our error value is, we need to bring it down to a minimum. To do that, we can use the slope of the curve (∂y/∂x, the partial derivative of y with respect to x), where y is the error and x is theta, as shown below. Stepping along the negative gradient ensures we are going down the curve, and ‘C’ is the learning rate, which sets how far we move the value of theta at each step.

$$\Theta := \Theta - C \, \frac{\partial E}{\partial \Theta}$$
Gradient Descent formula (E is the squared error, C is the learning rate)

Choosing a ‘learning rate’ is important for a model: if it is too small, training is slow, and if it is too large, the updates can overshoot the minimum. In practice it is usually chosen by trial and error. The search for theta itself is automated by the gradient descent formula, which applies to a wide range of machine learning models.
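The gradient descent procedure described above can be sketched for the straight-line model as follows. This is an illustrative implementation under assumptions, not the article's code: the squared error is scaled by 1/(2m), Θ is a row vector starting at zero, and the learning rate `C = 0.1` and the step count were picked by trial and error for this toy data set.

```python
import numpy as np

def gradient_descent(X, y, C=0.1, steps=2000):
    """Repeatedly move theta against the gradient of the squared error."""
    m, n = X.shape
    theta = np.zeros((1, n))                 # start from Theta0 = Theta1 = 0
    for _ in range(steps):
        residuals = X @ theta.T - y          # prediction minus actual, shape (m, 1)
        gradient = (residuals.T @ X) / m     # partial derivatives of the error, shape (1, n)
        theta = theta - C * gradient         # negative gradient: step down the curve
    return theta

# Toy data generated from y = 1 + 2x (made up for illustration).
x = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x])    # column of ones pairs with Theta0
y = (1.0 + 2.0 * x).reshape(-1, 1)

theta = gradient_descent(X, y)
print(theta)
```

On this data the result should approach Θ0 = 1 and Θ1 = 2. Raising `C` well above this value makes the updates overshoot and diverge on this data, which is one way to see why the learning rate has to be chosen with care.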

Translated from: https://medium.com/analytics-vidhya/machine-learning-591adfe4f81a
