A Machine Learning Program in Six Lines of Code

This series will consist of 10 episodes of machine learning for “dummies”.

If you aspire to learn machine learning, I’m happy to say that this series will help you get into it.

After taking a course from Udacity and watching a bunch of YouTube series on the subject, I’ve decided to share what I’ve learned with you. Let me first introduce myself.

I’m Irzelindo Salvador, a Computer and Telecommunications engineer and a mentor at Udacity for the Full Stack, Data Engineer, and SQL nanodegrees. For more info, check my LinkedIn and GitHub profiles.

Let’s begin

In this article, we’ll walk through a Hello World for Machine Learning…

We’ll work with two popular open-source libraries, scikit-learn and TensorFlow.

First, let us talk about what machine learning is and why it is important.

Machine learning is a sub-field of artificial intelligence (AI). It is the study of computer algorithms that improve automatically through experience. Machine learning algorithms build a mathematical model based on sample data, known as “training data”, to make predictions or decisions without being explicitly programmed to do so.

Put differently, it is the study of algorithms that learn from examples and experience instead of relying on hard-coded rules.

Here is a simple example of a problem that sounds easy but is nearly impossible to solve without machine learning:

Can you write code to tell the difference between an apple and an orange?

Imagine that you’re given the task of writing code that receives an input (a fruit image) and outputs the type (name) of the fruit.

How would you solve this?

You’d have to start by writing lots of manual rules. For example, you might count the pixels of each color in the image and infer the fruit from that. This may work for simple, clean images, but in the real world you’ll find your rules starting to break. How would you handle black-and-white images? Or images with no oranges or apples at all, e.g. peaches? It turns out that you’d have to write tons of rules just to tell the difference between oranges and apples. Now imagine covering every kind of fruit.
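To make the problem with manual rules concrete, here is a minimal sketch of such a rule-based approach. It is only an illustration: the function name, the color thresholds, and the idea of representing an image as a plain list of (red, green, blue) tuples are all assumptions made for this example.

```python
def classify_by_color(pixels):
    """Guess the fruit by counting orange-ish and red-ish pixels.

    `pixels` is a list of (red, green, blue) tuples with values 0-255.
    The thresholds below are hand-picked guesses, not measured values.
    """
    orange_count = sum(1 for r, g, b in pixels if r > 200 and 100 < g < 180 and b < 80)
    red_count = sum(1 for r, g, b in pixels if r > 150 and g < 80 and b < 80)
    if orange_count > red_count:
        return "orange"
    if red_count > orange_count:
        return "apple"
    return "unknown"


orange_image = [(255, 140, 0)] * 10      # an all-orange picture
red_apple_image = [(200, 30, 30)] * 10   # an all-red picture
grayscale_image = [(128, 128, 128)] * 10 # a black-and-white photo

print(classify_by_color(orange_image))     # orange
print(classify_by_color(red_apple_image))  # apple
print(classify_by_color(grayscale_image))  # unknown
```

The rules hold up on the two tidy images and give up on the grayscale one. Every new case (green apples, peaches, odd lighting) would need yet another hand-written rule.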

We need something better… An algorithm that can figure out the rules for us so that we don’t need to write them by hand.

For that, we’re going to train a classifier.

Before we jump into the classifier, there’s one thing we must talk about: the types of machine learning techniques. They are:

Supervised Learning, Unsupervised Learning, Reinforcement Learning

Supervised Learning

Supervised learning is the machine learning technique of learning a function that maps an input to an output based on example input-output pairs. It infers a function from labeled training data consisting of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and the desired output value (also called the supervisory signal). A supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples.

Unsupervised Learning

Unsupervised learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. In contrast to supervised learning, which usually makes use of human-labeled data, unsupervised learning (also known as self-organization) allows for modeling of probability densities over inputs. It forms one of the three main categories of machine learning, along with supervised and reinforcement learning. Semi-supervised learning, a related variant, makes use of both supervised and unsupervised techniques.

Reinforcement Learning

Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment to maximize the notion of cumulative reward. Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. Reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected. Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).

As this article is not focused on the types of machine learning, I won’t dive deeper into the matter; it is just something an enthusiast should at least know about. For this first article we’ll be working with supervised learning. If you would like me to dive deeper into the other types, add a comment in the section below so that I can consider publishing more detailed articles.

Going back to where we stopped… training a classifier.

A Classifier… What is that?

A classifier is an algorithm that sorts data into labeled classes or categories of information.

For now, let us think of a classifier as a function that takes some data as input and assigns a label to it as output. For example, given an email, we may want to classify it as spam or not spam.

The technique of creating a classifier automatically is called supervised learning; it begins with examples of the problem you’re planning to solve. To code this, we’ll work with scikit-learn… Finally, let’s get our hands dirty.

Code…

Before we start, we’ll need to have Python installed. For this exercise, we’re going to use version 3.8.2. Go to https://www.python.org/downloads/ and choose the version for your operating system. On Windows, the Python installer usually includes pip, Python’s package installer. On Linux, install pip with:

$ sudo apt update
$ sudo apt install python3-pip

On macOS:

$ curl https://bootstrap.pypa.io/get-pip.py -o get-pip.py
$ python3 get-pip.py
or
$ sudo easy_install pip
$ sudo pip install --upgrade pip

If you still have problems installing it, please refer to this link.

To keep the required dependencies separated, we’re going to use a Python virtual environment for this project. It’s one of the most important tools that most Python developers use. For more details on Python virtual environments, see the Python online documentation. The command we’re going to use in the terminal is:

python3 -m venv ML_Introduction_1_env

ML_Introduction_1_env is the name of the virtual environment: a self-contained directory tree that contains a Python installation for a particular version of Python, plus any additional packages. Note: this folder is created in the directory you’re currently in. Activate the environment before installing anything (on Linux and macOS, source ML_Introduction_1_env/bin/activate; on Windows, ML_Introduction_1_env\Scripts\activate). Now we’re going to install the scikit-learn library:

pip3 install -U scikit-learn

To make sure the package was installed, we can start a Python script and try to import sklearn, or run pip freeze or pip show scikit-learn in the terminal (with the virtual environment activated). If everything was set up correctly, the import raises no error, and we’ll be able to see the package name (pip freeze) or its details (pip show scikit-learn):

import sklearn
or
pip freeze
or 
pip show scikit-learn
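The same checks can also be done from inside Python. The sketch below uses only the standard library, so it runs even when scikit-learn is missing; the variable names are just illustrative choices.

```python
import importlib.util
import sys

# Inside an environment created by venv, sys.prefix points at the environment
# while sys.base_prefix still points at the base Python installation.
in_virtualenv = sys.prefix != sys.base_prefix

# find_spec returns None when the package cannot be found on the import path.
sklearn_installed = importlib.util.find_spec("sklearn") is not None

print("virtual environment active:", in_virtualenv)
print("scikit-learn importable:", sklearn_installed)
```

If the first line prints False, the environment is probably not activated in this terminal. If the second prints False, the pip3 install step likely ran against a different Python installation.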

To use supervised learning we’ll follow a recipe with a few standard steps.

1. Collect training data

Training data are examples of the problem we want to solve. In our case, we’re going to write a function that classifies a fruit: it takes a description of the fruit as input and predicts whether it’s an apple or an orange based on features like weight and texture.

Imagine we head out to an orchard, look at different apples and oranges, and write down measurements that describe them in a table (weight, texture, color, taste, …). In machine learning these measurements are called features. To keep things simple, we use just two here: the fruit’s weight in grams and its texture. A good feature makes it easy to discriminate between different types of fruit.

Each row of the table describes one piece of fruit. The last column is the label: it identifies what type of fruit is in each row (apple or orange). Think of these rows as the examples we want our classifier to learn from. In general, the more examples we train on, the better our classifier becomes at telling whether a fruit is an apple or an orange.

Now let’s write down our classifier in code.

import sklearn

# One inner list per fruit: [weight in grams, texture]
features = [[138, "smooth"], [130, "smooth"], [140, "bumpy"], [172, "bumpy"]]
labels = ["apple", "apple", "orange", "orange"]
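One way to picture the table behind these two variables is as a list of rows, one per fruit, where features and labels are simply column slices. The `rows` name and the tuple layout are assumptions made for this sketch; the values are the ones used in the code above.

```python
# Each row is one fruit: (weight in grams, texture, label).
rows = [
    (138, "smooth", "apple"),
    (130, "smooth", "apple"),
    (140, "bumpy", "orange"),
    (172, "bumpy", "orange"),
]

# The first two columns are the inputs, the last column is the desired output.
features = [[weight, texture] for weight, texture, _ in rows]
labels = [label for _, _, label in rows]

print(features)
print(labels)
```

Keeping each fruit in a single row makes it harder to accidentally pair a measurement with the wrong label when more examples are added.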

We used two variables, features and labels. features contains the first two columns of the table as a 2D list, and labels contains the last column. You can think of the features as the inputs to our classifier and the labels as the desired outputs. scikit-learn expects numeric features, so I’m going to change the texture of the fruits from strings to integers: I’ll use 0 for bumpy and 1 for smooth.

I’ll apply the same idea to our labels: 0 for apples and 1 for oranges.

import sklearn

# Texture: 0 = bumpy, 1 = smooth. Label: 0 = apple, 1 = orange.
features = [[138, 1], [130, 1], [140, 0], [172, 0]]
labels = [0, 0, 1, 1]
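The encoding above was done by hand. A minimal sketch of doing the same thing with lookup tables is shown below; the dictionary names are hypothetical, and this is plain Python rather than any scikit-learn encoder.

```python
TEXTURE_CODES = {"bumpy": 0, "smooth": 1}
LABEL_CODES = {"apple": 0, "orange": 1}

raw_features = [[138, "smooth"], [130, "smooth"], [140, "bumpy"], [172, "bumpy"]]
raw_labels = ["apple", "apple", "orange", "orange"]

# Replace each string with its integer code; the weights stay as they are.
features = [[weight, TEXTURE_CODES[texture]] for weight, texture in raw_features]
labels = [LABEL_CODES[name] for name in raw_labels]

# The reverse mapping is handy later for turning a predicted code back into a name.
LABEL_NAMES = {code: name for name, code in LABEL_CODES.items()}

print(features)     # [[138, 1], [130, 1], [140, 0], [172, 0]]
print(labels)       # [0, 0, 1, 1]
print(LABEL_NAMES)  # {0: 'apple', 1: 'orange'}
```

A lookup table keeps the meaning of each number in one place, which matters once the data set grows beyond four rows.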

2. Train a Classifier

The type of classifier we’ll start with is called a decision tree. We’ll dive deeper into how this classifier works in future episodes; for now, let us think of a classifier as a box of rules. That simplification helps because there are many different types of classifiers (Naive Bayes, SVM, …).

In our script, I’ll first import tree from sklearn and create a classifier. At this point it is just an empty box of rules, because it doesn’t know anything about apples and oranges yet.

from sklearn import tree

features = [[138, 1], [130, 1], [140, 0], [172, 0]]
labels = [0, 0, 1, 1]
# An untrained decision tree: the empty box of rules
classifier = tree.DecisionTreeClassifier()

To train it, we need a learning algorithm. If a classifier is a box of rules, then the learning algorithm is the procedure that creates those rules by finding patterns in the training data. For example, it may notice that oranges tend to weigh more, so it creates a rule saying that the heavier the fruit, the more likely it is to be an orange. In scikit-learn, the learning algorithm is included in the classifier object as a method called fit. You can think of fit as a synonym for “find patterns in data”. We’ll dive into what happens under the hood in a future episode.

from sklearn import tree

features = [[138, 1], [130, 1], [140, 0], [172, 0]]
labels = [0, 0, 1, 1]
classifier = tree.DecisionTreeClassifier()
# fit() learns the rules from the training examples
classifier = classifier.fit(features, labels)
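To give a feel for what “finding patterns in data” can mean, here is a deliberately simplified sketch of a one-rule learner (sometimes called a decision stump) in plain Python. It is not scikit-learn’s algorithm, and the function names are invented for this example. It only tries every “feature <= threshold” rule and keeps the one with the fewest mistakes on the training data.

```python
from collections import Counter

features = [[138, 1], [130, 1], [140, 0], [172, 0]]
labels = [0, 0, 1, 1]  # 0 = apple, 1 = orange


def majority(values):
    """Return the most common label in a non-empty list."""
    return Counter(values).most_common(1)[0][0]


def fit_stump(features, labels):
    """Find the single 'feature <= threshold' rule with the fewest training errors."""
    best = None  # (errors, feature_index, threshold, left_label, right_label)
    for index in range(len(features[0])):
        values = sorted({row[index] for row in features})
        # Candidate thresholds are the midpoints between neighbouring observed values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [lab for row, lab in zip(features, labels) if row[index] <= threshold]
            right = [lab for row, lab in zip(features, labels) if row[index] > threshold]
            left_label, right_label = majority(left), majority(right)
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, index, threshold, left_label, right_label)
    return best[1:]


def predict_stump(rule, row):
    """Apply the learned rule to one new example."""
    index, threshold, left_label, right_label = rule
    return left_label if row[index] <= threshold else right_label


rule = fit_stump(features, labels)
print(rule)                           # (0, 139.0, 0, 1)
print(predict_stump(rule, [150, 0]))  # 1
```

The learned rule reads “if the weight is at most 139 g, say apple, otherwise orange”. Nobody typed that threshold in; it came out of the four examples. A real decision tree repeats this kind of search to build several rules, using a more careful measure of how good a split is.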

At this point we have a trained classifier, so let’s take it for a spin and use it to classify a new fruit. The input is the list of features for a new example: let’s say the fruit we want to classify weighs 150 g and is bumpy (0). The output will be 0 if it’s an apple or 1 if it’s an orange.

3. Make Predictions

from sklearn import tree

features = [[138, 1], [130, 1], [140, 0], [172, 0]]
labels = [0, 0, 1, 1]
classifier = tree.DecisionTreeClassifier()
classifier = classifier.fit(features, labels)
# Classify a new fruit: 150 g and bumpy (0)
print(classifier.predict([[150, 0]]))

Before we look at the output, look back at the training data: based on the features we had, what would you guess this fruit is? Where does a 150 g, bumpy (0) fruit fit in our table?

0: for an apple
1: for an orange
If your answer was 1 (orange), you’re right, because the fruit is heavy and bumpy.

Well, if everything worked for you, congratulations: you just wrote your first machine learning program in just six lines. You can try building a new classifier for a new problem.
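As a small variation on the six lines, the sketch below classifies two fruits at once and turns the numeric predictions back into names. The `label_names` dictionary and the second test fruit are additions for illustration, and random_state=0 is set only to make repeated runs behave the same way.

```python
from sklearn import tree

features = [[138, 1], [130, 1], [140, 0], [172, 0]]
labels = [0, 0, 1, 1]
label_names = {0: "apple", 1: "orange"}

classifier = tree.DecisionTreeClassifier(random_state=0)
classifier.fit(features, labels)

# Two new fruits: 150 g and bumpy, then 120 g and smooth.
new_fruits = [[150, 0], [120, 1]]
predictions = classifier.predict(new_fruits)

# predict returns numeric codes; map them back to readable names.
names = [label_names[int(code)] for code in predictions]
print(names)  # ['orange', 'apple']
```

Both fruits are easy cases, since their weight and texture point to the same answer. A heavy but smooth fruit would be harder: with only four training examples, the tree cannot tell which of the two features really matters.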

If you’re asking why we didn’t use pictures instead of a table of features as our training data: that is possible, and we’ll cover it in future episodes. The way we did it here is more general.

Programming with machine learning isn’t hard, but to get it right, we need to understand a few important concepts.
