AI with Python – Machine Learning

Learning means acquiring knowledge or skills through study or experience. Based on this, we can define machine learning (ML) as follows:

Machine learning is a field of computer science, and more specifically an application of artificial intelligence, that gives computer systems the ability to learn from data and improve with experience without being explicitly programmed.

The main focus of machine learning is to let computers learn automatically, without human intervention. How does such learning start? It starts with observations of data, which can be examples, instructions, or direct experience. The machine then looks for patterns in this input and uses them to make better decisions.

Types of Machine Learning (ML)

Machine learning algorithms help computer systems learn without being explicitly programmed. They are broadly categorized as supervised or unsupervised, with reinforcement learning as a third category. Let us look at each in turn:

Supervised machine learning algorithms

This is the most commonly used category of machine learning algorithms. It is called supervised because the way the algorithm learns from the training dataset resembles a teacher supervising the learning process. In this kind of ML algorithm, the possible outcomes are already known and the training data is labeled with the correct answers. It can be understood as follows:

Suppose we have input variables x and an output variable Y, and we apply an algorithm to learn the mapping function from the input to the output:


Y = f(x)

Now, the main goal is to approximate the mapping function so well that, when we have new input data (x), we can predict the output variable (Y) for that data.

Supervised learning problems can mainly be divided into the following two kinds:

Classification: A problem is called a classification problem when the output is a category, such as “black”, “teaching”, or “non-teaching”.
Regression: A problem is called a regression problem when the output is a real value, such as “distance” or “kilogram”.

Decision trees, random forests, KNN, and logistic regression are examples of supervised machine learning algorithms.
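As a rough illustration of the two problem types, the sketch below approximates Y = f(x) with a deliberately simple learner that copies the output of the closest training example. The data and the function name `nearest_example` are made up for this illustration and are not part of any library.

```python
# Toy labeled data: each pair holds an input x and its known output Y.
# Classification: Y is a category. Regression: Y is a real number.
classification_data = [(1.0, "non-teaching"), (2.0, "non-teaching"),
                       (8.0, "teaching"), (9.0, "teaching")]
regression_data = [(1.0, 10.5), (2.0, 19.8), (8.0, 81.2), (9.0, 90.1)]


def nearest_example(training_data, x_new):
    """Approximate Y = f(x) by returning the output of the closest known x."""
    _, closest_y = min(training_data, key=lambda pair: abs(pair[0] - x_new))
    return closest_y


predicted_label = nearest_example(classification_data, 7.5)  # a category
predicted_value = nearest_example(regression_data, 1.2)      # a real number
print(predicted_label, predicted_value)
```

The same learner serves both problems; only the type of the labeled output changes.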

Unsupervised machine learning algorithms

As the name suggests, these machine learning algorithms have no supervisor to provide guidance. That is why unsupervised machine learning algorithms are closely aligned with what some call true artificial intelligence. It can be understood as follows:

Suppose we have input variables x; unlike in supervised learning, there are no corresponding output variables.

In simple words, in unsupervised learning there is no correct answer and no teacher to provide guidance. The algorithms instead help to discover interesting patterns in the data.

Unsupervised learning problems can be divided into the following two kinds:

Clustering: In clustering problems, we need to discover the inherent groupings in the data, for example grouping customers by their purchasing behavior.
Association: In association problems, we need to discover rules that describe large portions of the data, for example that customers who buy x also tend to buy y.

K-means for clustering and the Apriori algorithm for association are examples of unsupervised machine learning algorithms.
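To make the association idea concrete, the sketch below counts how often items are bought together. It is not the Apriori algorithm itself, only the support and confidence measures that association-rule methods typically rank rules by. The baskets and helper names are hypothetical.

```python
# Hypothetical purchase baskets, one set of items per customer.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter"},
    {"bread", "milk", "eggs"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate P(consequent | antecedent) from the transactions."""
    joint = support(antecedent | consequent, transactions)
    return joint / support(antecedent, transactions)


both = support({"bread", "milk"}, baskets)        # share of customers buying both
rule = confidence({"bread"}, {"milk"}, baskets)   # strength of "bread -> milk"
print(both, rule)
```

Three of the five baskets contain both bread and milk, and every basket with bread also contains milk.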

Reinforcement machine learning algorithms

These machine learning algorithms are used less often than the other two kinds. They train a system to make specific decisions: the machine is exposed to an environment where it trains itself continually by trial and error. It learns from past experience and tries to capture the best possible knowledge to make accurate decisions. The Markov Decision Process is a common mathematical framework for modeling reinforcement learning problems.
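The trial-and-error loop can be sketched with a very small example. The code below is a simplification, assuming an environment with no state (a so-called multi-armed bandit) and an epsilon-greedy strategy; the reward probabilities and variable names are invented for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical environment: three actions, each paying a reward of 1
# with a hidden probability. The learner never reads these numbers directly.
reward_probability = [0.2, 0.5, 0.8]

value_estimate = [0.0, 0.0, 0.0]  # learned average reward per action
times_tried = [0, 0, 0]
epsilon = 0.1                     # fraction of steps spent exploring

for step in range(2000):
    if random.random() < epsilon:
        action = random.randrange(3)                        # explore: try anything
    else:
        action = value_estimate.index(max(value_estimate))  # exploit: best so far
    reward = 1.0 if random.random() < reward_probability[action] else 0.0
    times_tried[action] += 1
    # Incremental average: move the estimate toward the observed reward.
    value_estimate[action] += (reward - value_estimate[action]) / times_tried[action]

best_action = value_estimate.index(max(value_estimate))
print(best_action, [round(v, 2) for v in value_estimate])
```

Over many steps the learner's estimates approach the hidden reward rates, so it ends up preferring the action that pays most often.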

Most Common Machine Learning Algorithms

In this section, we will learn about the most common machine learning algorithms. They are described below:

Linear Regression

It is one of the most well-known algorithms in statistics and machine learning.

Basic concept: Linear regression is a linear model that assumes a linear relationship between the input variables (x) and a single output variable (y). In other words, y can be calculated from a linear combination of the input variables x. The relationship between the variables is established by fitting a best-fit line.

Types of Linear Regression

Linear regression is of the following two types:

Simple linear regression: A linear regression algorithm is called simple linear regression if it has only one independent variable.
Multiple linear regression: A linear regression algorithm is called multiple linear regression if it has more than one independent variable.

Linear regression is mainly used to estimate real values based on continuous variables. For example, the total sales of a shop in a day can be estimated from continuous inputs by linear regression.
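For simple linear regression, the best-fit line can be computed directly with the ordinary least-squares formulas. The following is a minimal sketch; the function name and the sales figures are hypothetical, and the data is constructed to lie exactly on y = 10x + 20 so the result is easy to check.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) of the least-squares line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: number of customers (x) and total daily sales (y).
customers = [10, 20, 30, 40, 50]
sales = [120.0, 220.0, 320.0, 420.0, 520.0]

slope, intercept = fit_simple_linear_regression(customers, sales)
predicted_sales = slope * 60 + intercept  # estimate for a day with 60 customers
print(slope, intercept, predicted_sales)
```

Multiple linear regression follows the same idea with one coefficient per input variable, which is usually solved with a linear algebra library rather than by hand.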

Logistic Regression

It is a classification algorithm, also known as logit regression.

Logistic regression is used to estimate discrete values such as 0 or 1, true or false, or yes or no, based on a given set of independent variables. It does so by predicting a probability, so its output lies between 0 and 1.
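The probability output comes from passing a linear combination of the inputs through the sigmoid function. The sketch below trains a one-variable model with plain gradient descent; the learning rate, epoch count, and study-hours data are assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(xs, labels, learning_rate=0.1, epochs=2000):
    """Fit weight and bias by gradient descent on the average log loss."""
    weight, bias = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, label in zip(xs, labels):
            error = sigmoid(weight * x + bias) - label  # predicted minus actual
            grad_w += error * x
            grad_b += error
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


# Hypothetical data: hours studied and whether the exam was passed (1) or not (0).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
passed = [0, 0, 0, 0, 1, 1, 1, 1]

weight, bias = train_logistic_regression(hours, passed)
prob_low = sigmoid(weight * 1.0 + bias)   # probability of passing after 1 hour
prob_high = sigmoid(weight * 4.0 + bias)  # probability of passing after 4 hours
print(round(prob_low, 3), round(prob_high, 3))
```

A discrete prediction is then obtained by thresholding the probability, commonly at 0.5.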

Decision Tree

A decision tree is a supervised learning algorithm that is mostly used for classification problems.

It is a classifier expressed as a recursive partition of the data based on the independent variables. The nodes of a decision tree form a rooted tree, that is, a directed tree with a node called the “root”. The root has no incoming edges, and every other node has exactly one incoming edge. Nodes with outgoing edges are called decision (internal) nodes, and nodes without outgoing edges are called leaves. For example, consider a decision tree that determines whether a person is fit or not.
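A fitness tree of this kind can be written out by hand as nested conditions. The questions and the age threshold below are hypothetical and chosen only to show the structure; in practice a learning algorithm picks the splits from training data.

```python
def is_fit(age, eats_lots_of_pizza, exercises_in_morning):
    """Walk a hand-written decision tree from the root down to a leaf."""
    if age < 30:                  # root: first partition, on age
        if eats_lots_of_pizza:    # decision node for the younger group
            return "unfit"        # leaf
        return "fit"              # leaf
    if exercises_in_morning:      # decision node for the older group
        return "fit"              # leaf
    return "unfit"                # leaf


print(is_fit(25, eats_lots_of_pizza=False, exercises_in_morning=False))
print(is_fit(45, eats_lots_of_pizza=False, exercises_in_morning=False))
```

Each call follows exactly one path from the root to a leaf, which is what makes a decision tree easy to read and explain.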

Support Vector Machine (SVM)

It is used for both classification and regression problems, but mainly for classification. The main concept of SVM is to plot each data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate. Here n is the number of features. A simple two-feature case illustrates the concept:

With two features, we first plot the data in two-dimensional space, where each point has two coordinates. A line then splits the data into two classified groups; this line is the classifier. The points closest to the line are called support vectors.
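The two-feature picture can be sketched in code. The example below does not train an SVM; it assumes a separating line has already been found and shows how that line classifies points and which points lie closest to it. The weights `w`, offset `b`, and the points are hypothetical.

```python
import math

# Hypothetical separating line w[0]*x1 + w[1]*x2 + b = 0, assumed to have
# been found already by training; these numbers are not from a real fit.
w = (1.0, 1.0)
b = -5.0

points = [(1.0, 1.0), (2.0, 1.0), (2.0, 2.0), (3.0, 3.0), (4.0, 3.0), (4.0, 4.0)]


def decision_value(point):
    """Signed score: positive on one side of the line, negative on the other."""
    return w[0] * point[0] + w[1] * point[1] + b


def classify(point):
    return 1 if decision_value(point) >= 0 else -1


def distance_to_line(point):
    """Perpendicular distance from the point to the separating line."""
    return abs(decision_value(point)) / math.hypot(w[0], w[1])


labels = [classify(p) for p in points]
closest = min(points, key=distance_to_line)  # a candidate support vector
print(labels, closest)
```

Training an SVM amounts to choosing the line that keeps these closest points as far from it as possible.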

Naïve Bayes

It is also a classification technique. The logic behind it is to use Bayes' theorem to build classifiers, under the assumption that the predictors are independent. In simple words, it assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. Below is the equation for Bayes' theorem:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$

The Naïve Bayes model is easy to build and particularly useful for large data sets.
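A small numeric sketch shows how Bayes' theorem and the independence assumption combine. All probabilities below are hypothetical, and for brevity the sketch only multiplies in the words that are present in a message.

```python
# Hypothetical prior probabilities P(A) for a two-class spam filter.
prior = {"spam": 0.4, "ham": 0.6}

# Hypothetical P(word appears | class), treated as independent given the class.
likelihood = {
    "spam": {"offer": 0.5, "meeting": 0.1},
    "ham": {"offer": 0.1, "meeting": 0.5},
}


def posterior(words):
    """Return P(class | words) for every class using Bayes' theorem."""
    joint = {}
    for label in prior:
        p = prior[label]                     # P(A)
        for word in words:
            p *= likelihood[label][word]     # naive step: multiply P(B_i | A)
        joint[label] = p                     # numerator P(B | A) * P(A)
    evidence = sum(joint.values())           # denominator P(B)
    return {label: joint[label] / evidence for label in joint}


result = posterior(["offer"])
print(result)
```

For a message containing "offer", the numerators are 0.4 × 0.5 = 0.2 for spam and 0.6 × 0.1 = 0.06 for ham, so the spam posterior is 0.2 / 0.26.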

K-Nearest Neighbors (KNN)

It is used for both classification and regression problems, and is most widely used for classification. The algorithm stores all the available cases and classifies a new case by a majority vote of its k nearest neighbors: the case is assigned to the class that is most common among its k nearest neighbors, as measured by a distance function. The distance function can be Euclidean, Minkowski, or Hamming distance. Consider the following points when using KNN:

KNN is computationally more expensive than many other algorithms used for classification problems.
Variables need to be normalized; otherwise variables with a larger range can bias the result.
KNN needs a pre-processing stage, such as noise removal.
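The voting procedure described above can be sketched in a few lines. This minimal version uses Euclidean distance and assumes the features are already on comparable scales; the function name and sample points are hypothetical.

```python
import math
from collections import Counter


def knn_classify(training_points, training_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    # Pair every stored case with its Euclidean distance to the query.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(training_points, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical, already-normalized training data with two classes.
points = [(1, 1), (1, 2), (2, 1), (6, 6), (7, 6), (6, 7)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_classify(points, labels, (2, 2)))
print(knn_classify(points, labels, (6, 5)))
```

Every prediction scans all stored points, which is why KNN becomes expensive as the training set grows.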

K-Means Clustering

As the name suggests, it is used to solve clustering problems. It is a type of unsupervised learning. The main logic of the K-means clustering algorithm is to partition the data set into a number of clusters. Follow these steps to form clusters with K-means:

K-means picks k points, one for each cluster, known as centroids.
Each data point joins the cluster of its closest centroid, which gives k clusters.
It then recomputes the centroid of each cluster based on the current cluster members.
The assignment and recomputation steps are repeated until convergence occurs.
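The steps above can be sketched as follows. This is a minimal version that takes the initial centroids as an argument instead of picking them at random, so the run is repeatable; the function name and data are hypothetical.

```python
import math


def k_means(points, initial_centroids, max_iterations=100):
    """Alternate assignment and centroid updates until the centroids stop moving."""
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster of its closest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: each centroid moves to the mean of its cluster members.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(coords) / len(cluster) for coords in zip(*cluster)))
            else:
                new_centroids.append(old)  # keep the centroid of an empty cluster
        if new_centroids == centroids:     # convergence: nothing changed
            break
        centroids = new_centroids
    return centroids, clusters


data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(data, initial_centroids=[data[0], data[3]])
print(centroids)
```

The result depends on the starting centroids, so practical implementations usually run the algorithm from several initializations and keep the best clustering.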

Random Forest

It is a supervised learning algorithm that can be used for both classification and regression problems. It is a collection of decision trees (a forest), or in other words an ensemble of decision trees. The basic concept of random forest is that each tree gives a classification and the forest chooses the classification with the most votes. The following are the advantages of the random forest algorithm:

Random forests can be used for both classification and regression tasks.
Some implementations can handle missing values.
Adding more trees to the forest generally does not make the model overfit.
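The voting idea can be sketched with a heavily simplified forest. The version below assumes one-split trees (stumps), each trained on a bootstrap sample and a single randomly chosen feature; real random forests grow deeper trees and sample features at every split. All names and data are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows, labels, feature):
    """Fit a one-split tree on one feature by trying every observed threshold."""
    best = None
    for threshold in sorted({row[feature] for row in rows}):
        left = [lab for row, lab in zip(rows, labels) if row[feature] <= threshold]
        right = [lab for row, lab in zip(rows, labels) if row[feature] > threshold]
        if not left or not right:
            continue
        left_label, right_label = majority(left), majority(right)
        errors = (sum(lab != left_label for lab in left)
                  + sum(lab != right_label for lab in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    if best is None:  # no usable split: always predict the majority label
        label = majority(labels)
        return (feature, float("inf"), label, label)
    return (feature,) + best[1:]


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def train_forest(rows, labels, n_trees=25, seed=0):
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        sample = [rng.randrange(len(rows)) for _ in rows]  # bootstrap sample
        feature = rng.randrange(len(rows[0]))              # random feature
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample], feature))
    return forest


def predict_forest(forest, row):
    """Each tree votes; the forest returns the most common vote."""
    return majority([predict_stump(stump, row) for stump in forest])


rows = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
labels = ["low"] * 4 + ["high"] * 4
forest = train_forest(rows, labels)
print(predict_forest(forest, (1, 1)), predict_forest(forest, (9, 9)))
```

Because each tree sees a different sample, individual trees can be wrong while the majority vote stays stable.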