Machine Learning Algorithms

“Data is a powerful entity and machine learning is the art of extracting useful information from the data set”

To practice this art, machine learning offers various techniques and algorithms. A machine learning problem is defined by three characteristics: Task, Performance, and Experience. A program learns when its performance on the task improves with experience, that is, with previously seen data.

Machine learning is classified into three major types:

1) Supervised Learning

When a data set comes with predefined labeled input/output pairs, a model can be trained to learn the relationship between inputs and outputs. Examples: house price prediction, email spam classification.

2) Unsupervised Learning

When a data set has no predefined labeled inputs/outputs, we can train a model to group the data by characteristics and similarities. Examples: anomaly detection in medical (cancer) data, detecting fraudulent transactions.

3) Reinforcement Learning

When the necessary data is not given in advance, it is collected by performing a series of experiments. The collected data must be representative of the whole population to give accurate results. Example: playing computer games, where a reward and penalty system improves performance over time.

Data Pre-Processing:

Data preprocessing in machine learning is the technique of preparing (cleaning and organizing) raw data to make it suitable for building and training machine learning models.

Collecting a large amount of relevant data.
Replacing null values with the column median and keeping a common data type within each column.
Dropping unnecessary features.
Representing characters/strings as numbers using LabelEncoder.
Data normalization: required when features are on different scales of value, for example human-readable units such as a person's age = 25 or the speed of a car = 80 km/hr. Features on very different scales can skew a model, so normalization rescales them to a common range. After pre-processing, the data will be in the range -1 to +1.
Separating input and output from the raw data.
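
The steps above can be sketched in a few lines. This is only an illustration using the Python standard library, not the pandas/LabelEncoder code the text refers to. The column values, the helper names, and the choice of mean normalization, (x - mean) / (max - min), are assumptions; mean normalization is one common formula whose results lie within -1 to +1.

```python
from statistics import mean, median

def fill_nulls_with_median(column):
    """Replace None entries with the median of the known values."""
    known = [v for v in column if v is not None]
    mid = median(known)
    return [mid if v is None else v for v in column]

def label_encode(column):
    """Map each distinct string to an integer code, assigned in sorted
    order (similar in spirit to scikit-learn's LabelEncoder)."""
    codes = {label: i for i, label in enumerate(sorted(set(column)))}
    return [codes[v] for v in column]

def mean_normalize(column):
    """Rescale with (x - mean) / (max - min); results lie within [-1, 1]."""
    mu = mean(column)
    span = max(column) - min(column)
    if span == 0:
        return [0.0 for _ in column]  # constant column: nothing to scale
    return [(v - mu) / span for v in column]

# Hypothetical raw columns; None marks a null value.
ages = fill_nulls_with_median([25, None, 40, 31])
speeds = mean_normalize([80.0, 60.0, 100.0, 120.0])
fuel = label_encode(["petrol", "diesel", "petrol", "electric"])
prices = [5.0, 4.0, 7.5, 9.0]

# Separate input (X) from output (y).
X = list(zip(mean_normalize(ages), speeds, fuel))
y = prices
```

With real data, the same steps would typically be done column-wise on a pandas DataFrame.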

Frameworks:

Pandas: manipulating raw data.
NumPy: mathematical calculations.
Matplotlib (pyplot): plotting graphs.

Mathematics:

Why matrices: a list of values is represented as a vector, and multiple vectors together form a matrix. In this form calculations are easier, especially on large data sets.

Why differentiation: to measure how a small change in an input changes the output while building our model. The derivative applies to single-variable functions, and the partial derivative applies to multivariate functions. To calculate a partial derivative, we vary one variable while keeping the others constant.
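
To make "vary one variable while keeping the others constant" concrete, here is a small numerical sketch. The function `f` and the helper `partial_derivative` are made up for illustration, and the central-difference estimate is just one simple way to approximate a partial derivative.

```python
def partial_derivative(f, point, index, h=1e-6):
    """Central-difference estimate of the partial derivative of f
    with respect to the variable at position `index`."""
    up = list(point)
    down = list(point)
    up[index] += h      # nudge only one variable up...
    down[index] -= h    # ...and down, leaving the others unchanged
    return (f(*up) - f(*down)) / (2 * h)

def f(x, y):
    return x ** 2 + 3 * x * y

# Analytically: df/dx = 2x + 3y and df/dy = 3x.
df_dx = partial_derivative(f, (2.0, 1.0), 0)  # close to 7
df_dy = partial_derivative(f, (2.0, 1.0), 1)  # close to 6
```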

Supervised Learning can be classified into two types:

Regression: the output is a continuous range of values.
Classification: the output is discrete, and each prediction falls into one of a fixed set of groups.

Every supervised learning model works on data that is divided into two parts: input (X) and output (Y). To understand the data and find a suitable model for it, start with a scatter plot.

Regression Model:

When the data points in a scatter plot follow a roughly linear trend, we tend to describe such data with a line.

To predict accurate values, we need a best-fit line that represents the whole data set. Rewritten with the variable names commonly used in machine learning, the line is y = Θ0 + Θ1x, where:

x: input parameter
Θ0: bias (intercept), the value the model predicts when x = 0. In the broader ML sense, "bias" also means choosing one generalization over another from the set of possible generalizations.
Θ1: weight (slope), a model parameter learned from the data for a given predictive modeling problem. It is not a hyperparameter; hyperparameters, such as the learning rate, are set before training.
y: output
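
As a minimal sketch, the line model with these variables can be written as a one-line function. The name `predict` and the sample values are illustrative only.

```python
def predict(x, theta0, theta1):
    """Straight-line model: output = bias + weight * input."""
    return theta0 + theta1 * x

# With bias 1.0 and weight 2.0, an input of 3.0 gives 1.0 + 2.0 * 3.0.
y_hat = predict(3.0, theta0=1.0, theta1=2.0)  # 7.0
```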

Error Function/Cost:

The objective of every ML model is to make the error as small as possible, ideally close to 0.

To calculate the total error over all data points, we use a squared error function.
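
A squared error function for the line model might look like the sketch below. The 1/(2m) scaling is a common convention and an assumption here, since the text does not give the exact formula. The function name and data are illustrative.

```python
def squared_error_cost(xs, ys, theta0, theta1):
    """Sum of squared prediction errors, scaled by 1 / (2 * m)."""
    m = len(xs)
    total = sum((theta0 + theta1 * x - y) ** 2 for x, y in zip(xs, ys))
    return total / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x

perfect_fit = squared_error_cost(xs, ys, theta0=0.0, theta1=2.0)  # 0.0
poor_fit = squared_error_cost(xs, ys, theta0=0.0, theta1=1.0)     # (1 + 4 + 9) / 6
```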

Model: X*Θ^T, where X holds the inputs and Θ holds the parameters.
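
One way to read X*Θ^T: each row of X is one data point, and multiplying by the parameter vector gives one prediction per row. The sketch below assumes each row starts with a constant 1 so that Θ0 acts as the bias; that layout and the helper name `predict_all` are assumptions, not taken from the text.

```python
def predict_all(X, theta):
    """Multiply every row of X by the parameter vector theta."""
    return [sum(x_j * t_j for x_j, t_j in zip(row, theta)) for row in X]

# Each row is [1, x]; the leading 1 multiplies the bias theta[0].
X = [
    [1.0, 1.0],
    [1.0, 2.0],
    [1.0, 3.0],
]
theta = [0.5, 2.0]

predictions = predict_all(X, theta)  # [2.5, 4.5, 6.5]
```

With NumPy the same product is typically written as a single matrix multiplication.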

Minimize Error: We can minimize the error by trying out different Θ1 values in our model and keeping the one that gives the lowest error.
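
Trying out different Θ1 values can be sketched as a simple search over a handful of candidates. To keep the example small, Θ0 is fixed at 0; the candidate list and data are made up for illustration.

```python
def cost(theta1, xs, ys):
    """Squared error of the model y = theta1 * x (theta0 fixed at 0)."""
    m = len(xs)
    return sum((theta1 * x - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x

candidates = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
best_theta1 = min(candidates, key=lambda t: cost(t, xs, ys))  # 2.0
```

This brute-force search only works for a few candidates and one parameter, which is why a systematic method is needed.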

Gradient Descent:

The graph for this step plots the squared error value against multiple theta values.

No matter how high the error value is, we need to bring it down to a minimum. To do that, we can use the slope of the error curve, that is, the partial derivative of the error with respect to theta (written ∂y/∂x for a curve y over x). Stepping in the direction of the negative gradient ensures we move down the curve, and 'C' is the learning rate that controls how far each step changes the value of theta.
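
A minimal sketch of this idea for the line model follows. It assumes the squared error cost with 1/(2m) scaling and the usual update theta := theta - C * gradient; the function name, starting values, step count, and data are illustrative choices, not taken from the text.

```python
def gradient_descent(xs, ys, c=0.1, steps=2000):
    """Fit y = theta0 + theta1 * x by repeatedly stepping against
    the gradient of the squared error cost."""
    theta0, theta1 = 0.0, 0.0
    m = len(xs)
    for _ in range(steps):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m                             # d(cost)/d(theta0)
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m  # d(cost)/d(theta1)
        theta0 -= c * grad0  # the minus sign moves down the curve
        theta1 -= c * grad1
    return theta0, theta1

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated by y = 1 + 2x

theta0, theta1 = gradient_descent(xs, ys)  # approaches (1, 2)
```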

Choosing a learning rate is important for a model, and it is usually tuned by trial and error. Finding theta itself does not have to be manual: the gradient descent update formula automates that search, and the same formula is used across many machine learning models.
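
The effect of the learning rate can be seen by running the same update with different values of C. The three rates, the step count, and the data below are arbitrary choices for illustration, and only Θ1 is fitted to keep the sketch short.

```python
def run_gradient_descent(xs, ys, c, steps=20):
    """Fit y = theta1 * x with a fixed learning rate c."""
    theta1 = 0.0
    m = len(xs)
    for _ in range(steps):
        gradient = sum((theta1 * x - y) * x for x, y in zip(xs, ys)) / m
        theta1 -= c * gradient
    return theta1

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # generated by y = 2x, so the best theta1 is 2

results = {c: run_gradient_descent(xs, ys, c) for c in (0.01, 0.2, 0.5)}
# c = 0.01: too small, still well short of 2 after 20 steps
# c = 0.2:  converges to 2
# c = 0.5:  too large, each step overshoots and the value diverges
```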