How to Build a Simple Image Recognition System with TensorFlow (Part 1)

by Wolfgang Beyer


This isn’t a general introduction to Artificial Intelligence, Machine Learning or Deep Learning. There are already lots of great articles covering these topics (for example here or here).


And this isn’t a discussion about whether AI will enslave humankind or merely steal all our jobs. You can find plenty of speculation and some premature fearmongering elsewhere.


Instead, this post is a detailed description of how to get started in Machine Learning by building a system that is (somewhat) able to recognize what it sees in an image.


I’m currently on a journey to learn about Artificial Intelligence and Machine Learning. And the way I learn best is by not only reading stuff, but by actually building things and getting some hands-on experience. And that’s what this post is about. I want to show you how you can build a system that performs a simple computer vision task: recognizing image content.


I don’t claim to be an expert myself. I’m still learning, and there is a lot to learn. I’m describing what I’ve been playing around with, and if it’s somewhat interesting or helpful to you, that’s great! If, on the other hand, you find mistakes or have suggestions for improvements, please let me know, so that I can learn from you.


You don’t need any prior experience with machine learning to be able to follow along. The example code is written in Python, so a basic knowledge of Python would be great, but knowledge of any other programming language is probably enough.


Why image recognition?

Image recognition is a great task for developing and testing machine learning approaches. Vision is debatably our most powerful sense and comes naturally to us humans. But how do we actually do it? How does the brain translate the image on our retina into a mental model of our surroundings? I don’t think anyone knows exactly.


The point is, it’s seemingly easy for us to do — so easy that we don’t even need to put any conscious effort into it — but difficult for computers to do (Actually, it might not be that easy for us either, maybe we’re just not aware of how much work it is. More than half of our brain seems to be directly or indirectly involved in vision).


How can we get computers to do visual tasks when we don’t even know how we are doing it ourselves? That’s where machine learning comes into play. Instead of trying to come up with detailed step by step instructions of how to interpret images and translating that into a computer program, we’re letting the computer figure it out itself.


The goal of machine learning is to give computers the ability to do something without being explicitly told how to do it. We just provide some kind of general structure and give the computer the opportunity to learn from experience, similar to how we humans learn from experience too.


But before we start thinking about a full blown solution to computer vision, let’s simplify the task somewhat and look at a specific sub-problem which is easier for us to handle.


Image classification and the CIFAR-10 dataset

We will try to solve a problem which is as simple and small as possible while still being difficult enough to teach us valuable lessons. All we want the computer to do is the following: when presented with an image (with specific image dimensions), our system should analyze it and assign a single label to it. It can choose from a fixed number of labels, each being a category describing the image’s content. Our goal is for our model to pick the correct category as often as possible. This task is called image classification.


We will use a standardized dataset called CIFAR-10. CIFAR-10 consists of 60,000 images. There are 10 different categories and 6,000 images per category. Each image has a size of only 32 by 32 pixels. The small size makes it sometimes difficult for us humans to recognize the correct category, but it simplifies things for our computer model and reduces the computational load required to analyze the images.


The way we input these images into our model is by feeding the model a whole bunch of numbers. Each pixel is described by three floating point numbers representing the red, green and blue values for this pixel. This results in 32 x 32 x 3 = 3,072 values for each image.


Apart from CIFAR-10, there are plenty of other image datasets which are commonly used in the computer vision community. Using standardized datasets serves two purposes. First, it is a lot of work to create such a dataset. You need to find the images, process them to fit your needs and label all of them individually. The second reason is that using the same dataset allows us to objectively compare different approaches with each other.


In addition, standardized image datasets have led to the creation of computer vision high score lists and competitions. The most famous competition is probably the Image-Net Competition, in which there are 1000 different categories to detect. 2012’s winner was an algorithm developed by Alex Krizhevsky, Ilya Sutskever and Geoffrey Hinton from the University of Toronto (technical paper), which dominated the competition and won by a huge margin. This was the first time the winning approach used a convolutional neural network, which had a great impact on the research community. Convolutional neural networks are artificial neural networks loosely modeled after the visual cortex found in animals. This technique had been around for a while, but at the time most people did not yet see its potential to be useful. This changed after the 2012 Image-Net competition. Suddenly there was a lot of interest in neural networks and deep learning (deep learning is just the term used for solving machine learning problems with multi-layer neural networks). That event played a big role in starting the deep learning boom of the last couple of years.

Supervised Learning

How can we use the image dataset to get the computer to learn on its own? Even though the computer does the learning part by itself, we still have to tell it what to learn and how to do it. The way we do this is by specifying a general process of how the computer should evaluate images.


We’re defining a general mathematical model of how to get from input image to output label. The model’s concrete output for a specific image then depends not only on the image itself, but also on the model’s internal parameters. These parameters are not provided by us, instead they are learned by the computer.


The whole thing turns out to be an optimization problem. We start by defining a model and supplying starting values for its parameters. Then we feed the image dataset with its known and correct labels to the model. That’s the training stage. During this phase the model repeatedly looks at training data and keeps changing the values of its parameters. The goal is to find parameter values that result in the model’s output being correct as often as possible. This kind of training, in which the correct solution is used together with the input data, is called supervised learning. There is also unsupervised learning, in which the goal is to learn from input data for which no labels are available, but that’s beyond the scope of this post.


After the training has finished, the model’s parameter values don’t change anymore and the model can be used for classifying images which were not part of its training dataset.


TensorFlow

TensorFlow is an open source software library for machine learning, which was released by Google in 2015 and has quickly become one of the most popular machine learning libraries used by researchers and practitioners all over the world. We use it to do the numerical heavy lifting for our image classification model.

Building the Model, a Softmax Classifier

The full code for this model is available on Github. In order to use it, you need to have Python, TensorFlow, and numpy installed.

Alright, now we’re finally ready to go. Let’s look at the main file of our experiment, softmax.py, and analyze it line by line:

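Here is a sketch of how the top of softmax.py might look (the exact code is in the repository; the layout below is mine):

```python
# Compatibility imports recommended by the TensorFlow style guide
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import time

import numpy as np
import tensorflow as tf

import data_helpers
```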

The __future__ statements should be present in all TensorFlow Python files to ensure compatibility with both Python 2 and 3, according to the TensorFlow style guide.

Then we are importing TensorFlow, numpy for numerical calculations, and the time module. data_helpers.py contains functions that help with loading and preparing the dataset.

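Next, the script starts a timer, defines a few parameters, and loads the dataset. A sketch (the values of batch_size and learning_rate here are assumptions of mine for illustration; max_steps matches the 1,000 training iterations discussed in the results below):

```python
beginTime = time.time()

# Parameter definitions (their roles are discussed where they are used)
batch_size = 100        # assumed value
learning_rate = 0.005   # assumed value
max_steps = 1000

# Load the CIFAR-10 images and labels as numpy arrays
data_sets = data_helpers.load_data()
```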

We start a timer to measure the runtime and define some parameters. I’ll talk about them later when we’re actually using them. Then we load the CIFAR-10 dataset. Since reading the data is not part of the core of what we’re doing, I put these functions into the separate data_helpers.py file, which basically just reads the files containing the dataset and puts the data in a data structure which is easy to handle for us.


One thing is important to mention though. load_data() is splitting the 60,000 images into two parts. The bigger part contains 50,000 images. This training set is what we use for training our model. The other 10,000 images are called the test set. Our model never gets to see those until the training is finished. Only then, when the model’s parameters can’t be changed anymore, we use the test set as input to our model and measure the model’s performance on the test set.

This separation of training and testing data is very important. We wouldn’t know how well our model is able to make generalizations if it was exposed to the same dataset for training and for testing. In the worst case, imagine a model which exactly memorizes all the training data it sees. If we were to use the same data for testing it, the model would perform perfectly by just looking up the correct solution in its memory. But it would have no idea what to do with inputs which it hasn’t seen before.


This concept of a model learning the specific features of the training data and possibly neglecting the general features, which we would have preferred it to learn, is called overfitting. Overfitting and how to avoid it is a big issue in machine learning. More information about overfitting and why it is generally advisable to split the data into not only 2 but 3 different datasets can be found in this video (youtube mirror) (the video is part of Andrew Ng’s great free machine learning course on Coursera).

To get back to our code, load_data() returns a dictionary containing


  • images_train: the training dataset as an array of 50,000 by 3,072 (= 32 pixels x 32 pixels x 3 color channels) values.


  • labels_train: 50,000 labels for the training set (each a number between 0 and 9 representing which of the 10 classes the training image belongs to)

  • images_test: test set (10,000 by 3,072)


  • labels_test: 10,000 labels for the test set


  • classes: 10 text labels for translating the numerical class value into a word (0 for ‘plane’, 1 for ‘car’, etc.)


Now we can start building our model. The actual numerical computations are being handled by TensorFlow, which uses a fast and efficient C++ backend to do this. TensorFlow wants to avoid repeatedly switching between Python and C++ because that would slow down our calculations.


The common workflow is therefore to first define all the calculations we want to perform by building a so-called TensorFlow graph. During this stage no calculations are actually being performed, we are merely setting the stage. Only afterwards we run the calculations by providing input data and recording the results.


So let’s start defining our graph. We first describe the way our input data for the TensorFlow graph looks like by creating placeholders. These placeholders do not contain any actual data, they just specify the input data’s type and shape.

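A sketch of these definitions, using the TensorFlow 1.x API that this post is based on (images_placeholder is my name for the image input; labels_placeholder, weights and biases appear again later in the text):

```python
# Placeholders specify only the type and shape of the input data
images_placeholder = tf.placeholder(tf.float32, shape=[None, 3072])
labels_placeholder = tf.placeholder(tf.int64, shape=[None])

# Variables: the parameters we want the training process to optimize
weights = tf.Variable(tf.zeros([3072, 10]))
biases = tf.Variable(tf.zeros([10]))
```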

For our model, we’re first defining a placeholder for the image data, which consists of floating point values (tf.float32). The shape argument defines the input dimensions. We will provide multiple images at the same time (we will talk about those batches later), but we want to stay flexible about how many images we actually provide. The first dimension of shape is therefore None, which means the dimension can be of any length. The second dimension is 3,072, the number of floating point values per image.


The placeholder for the class label information contains integer values (tf.int64), one value in the range from 0 to 9 per image. Since we’re not specifying how many images we’ll input, the shape argument is [None].


weights and biases are the variables we want to optimize. But let’s talk about our model first.

weightsbiases是我们要优化的变量。 但是,让我们先谈谈我们的模型。

Our input consists of 3,072 floating point numbers and the desired output is one of 10 different integer values. How do we get from 3,072 values to a single one? Let’s start at the back. Instead of a single integer value between 0 and 9, we could also look at 10 score values — one for each class — and then pick the class with the highest score. So our original question now turns into: How do we get from 3,072 values to 10?


The simple approach which we are taking is to look at each pixel individually. For each pixel (or more accurately each color channel for each pixel) and each possible class, we’re asking whether the pixel’s color increases or decreases the probability of that class.


Let’s say the first pixel is red. If images of cars often have a red first pixel, we want the score for car to increase. We achieve this by multiplying the pixel’s red color channel value with a positive number and adding that to the car-score. Accordingly, if horse images never or rarely have a red pixel at position 1, we want the horse-score to stay low or decrease. This means multiplying with a small or negative number and adding the result to the horse-score.


For each of the 10 classes we repeat this step for each pixel and sum up all 3,072 values to get a single overall score, a sum of our 3,072 pixel values weighted by the 3,072 parameter weights for that class. In the end we have 10 scores, one for each class. Then we just look at which score is the highest, and that’s our class label.


The notation for multiplying the pixel values with weight values and summing up the results can be drastically simplified by using matrix notation. Our image is represented by a 3,072-dimensional vector. If we multiply this vector with a 3,072 x 10 matrix of weights, the result is a 10-dimensional vector containing exactly the weighted sums we are interested in.


The actual values in the 3,072 x 10 matrix are our model parameters. If they are random/garbage our output will be random/garbage. That’s where the training data comes into play. By looking at the training data we want the model to figure out the parameter values by itself.


All we’re telling TensorFlow in the two lines of code shown above is that there is a 3,072 x 10 matrix of weight parameters, which are all set to 0 in the beginning. In addition, we’re defining a second parameter, a 10-dimensional vector containing the bias. The bias does not directly interact with the image data and is added to the weighted sums. The bias can be seen as a kind of starting point for our scores.


Think of an image which is totally black. All its pixel values would be 0, therefore all class scores would be 0 too, no matter how the weights matrix looks like. Having biases allows us to start with non-zero class scores.

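Computing the scores then takes a single line, sketched here:

```python
# Class scores (logits): one 10-dimensional vector per input image
logits = tf.matmul(images_placeholder, weights) + biases
```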

This is where the prediction takes place. We’ve arranged the dimensions of our vectors and matrices in such a way that we can evaluate multiple images in a single step. The result of this operation is a 10-dimensional vector for each input image.


The process of arriving at good values for the weights and bias parameters is called training and works as follows: First, we input training data and let the model make a prediction using its current parameter values. This prediction is then compared to the correct class labels. The numerical result of this comparison is called loss. The smaller the loss value, the closer the predicted labels are to the correct labels and vice versa.


We want the model to minimize the loss, so that its predictions are close to the true labels. But before we look at the loss minimization, let’s take a look at how the loss is calculated.

The scores calculated in the previous step, stored in the logits variable, contains arbitrary real numbers. We can transform these values into probabilities (real values between 0 and 1 which sum to 1) by applying the softmax function, which basically squeezes its input into an output with the desired attributes. The relative order of its inputs stays the same, so the class with the highest score stays the class with the highest probability. The softmax function’s output probability distribution is then compared to the true probability distribution, which has a probability of 1 for the correct class and 0 for all other classes.

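For reference, the softmax function maps a vector of scores $z$ to a probability distribution:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$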

We use a measure called cross-entropy to compare the two distributions (a more technical explanation can be found here). The smaller the cross-entropy, the smaller the difference between the predicted probability distribution and the correct probability distribution. This value represents the loss in our model.

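Concretely, if $p$ is the true (one-hot) distribution and $q$ is the predicted distribution, the cross-entropy is

$$H(p, q) = -\sum_i p_i \log q_i$$

Since $p$ is 1 only at the correct class, this reduces to $-\log q_{\text{correct}}$: the loss depends only on the probability the model assigns to the correct class.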

Luckily TensorFlow handles all the details for us by providing a function that does exactly what we want. We compare logits, the model’s predictions, with labels_placeholder, the correct class labels. The output of sparse_softmax_cross_entropy_with_logits() is the loss value for each input image. We then calculate the average loss value over the input images.

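A sketch of the corresponding loss definition:

```python
# Per-image cross-entropy between predicted and true distributions,
# averaged over all images in the batch
loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=labels_placeholder, logits=logits))
```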

But how can we change our parameter values to minimize the loss? This is where TensorFlow works its magic. Via a technique called auto-differentiation it can calculate the gradient of the loss with respect to the parameter values. This means that it knows each parameter’s influence on the overall loss and whether decreasing or increasing it by a small amount would reduce the loss. It then adjusts all parameter values accordingly, which should improve the model’s accuracy. After this parameter adjustment step the process restarts and the next group of images are fed to the model.


TensorFlow knows different optimization techniques to translate the gradient information into actual parameter updates. Here we use a simple option called gradient descent which only looks at the model’s current state when determining the parameter updates and does not take past parameter values into account.


Gradient descent only needs a single parameter, the learning rate, which is a scaling factor for the size of the parameter updates. The bigger the learning rate, the more the parameter values change after each step. If the learning rate is too big, the parameters might overshoot their correct values and the model might not converge. If it is too small, the model learns very slowly and takes too long to arrive at good parameter values.

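A sketch of the training operation (train_step is my name for the resulting op):

```python
# Plain gradient descent on the loss defined above
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(loss)
```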

The process of categorizing input images, comparing the predicted results to the true results, calculating the loss and adjusting the parameter values is repeated many times. For bigger, more complex models the computational costs can quickly escalate, but for our simple model we need neither a lot of patience nor specialized hardware to see results.

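The last part of the graph definition measures accuracy; a sketch:

```python
# Compare the predicted class (index of the highest logit) with the
# true label, then average the booleans (cast to 0/1) over the batch
correct_prediction = tf.equal(tf.argmax(logits, 1), labels_placeholder)
accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
```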

These two lines measure the model’s accuracy. argmax of logits along dimension 1 returns the indices of the class with the highest score, which are the predicted class labels. The labels are then compared to the correct class labels by tf.equal(), which returns a vector of boolean values. The booleans are cast into float values (each being either 0 or 1), whose average is the fraction of correctly predicted images.


We’re finally done defining the TensorFlow graph and are ready to start running it. The graph is launched in a session which we can access via the sess variable. The first thing we do after launching the session is initializing the variables we created earlier. In the variable definitions we specified initial values, which are now being assigned to the variables.

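A sketch (a with-block around the session would work just as well):

```python
sess = tf.Session()

# Assign the initial values defined earlier to all variables
sess.run(tf.global_variables_initializer())
```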

Then we start the iterative training process which is to be repeated max_steps times.

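A sketch of the loop and the batch sampling (the variable names are my own):

```python
for i in range(max_steps):
    # Randomly pick batch_size images and their labels
    indices = np.random.choice(data_sets['images_train'].shape[0],
                               batch_size)
    images_batch = data_sets['images_train'][indices]
    labels_batch = data_sets['labels_train'][indices]
```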

These lines randomly pick a certain number of images from the training data. The resulting chunks of images and labels from the training data are called batches. The batch size (number of images in a single batch) tells us how frequent the parameter update step is performed. We first average the loss over all images in a batch, and then update the parameters via gradient descent.


If instead of stopping after a batch, we first classified all images in the training set, we would be able to calculate the true average loss and the true gradient instead of the estimations when working with batches. But it would take a lot more calculations for each parameter update step. At the other extreme, we could set the batch size to 1 and perform a parameter update after every single image. This would result in more frequent updates, but the updates would be a lot more erratic and would quite often not be headed in the right direction.


Usually an approach somewhere in the middle between those two extremes delivers the fastest improvement of results. For bigger models memory considerations are very relevant too. It’s often best to pick a batch size that is as big as possible, while still being able to fit all variables and intermediate results into memory.


In the code above, the first line inside the loop picks batch_size random indices between 0 and the size of the training set. The batches are then built by picking the images and labels at these indices.

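A sketch of the periodic accuracy check (still inside the training loop):

```python
    # Every 100 iterations, print the accuracy on the current batch
    if i % 100 == 0:
        train_accuracy = sess.run(accuracy, feed_dict={
            images_placeholder: images_batch,
            labels_placeholder: labels_batch})
        print('Step {:3d}: training accuracy {:g}'.format(i, train_accuracy))
```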

Every 100 iterations we check the model’s current accuracy on the training data batch. To do this, we just need to call the accuracy-operation we defined earlier.

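The parameter update itself is a single run of the training op (sketch, still inside the loop):

```python
    sess.run(train_step, feed_dict={images_placeholder: images_batch,
                                    labels_placeholder: labels_batch})
```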

This is the most important line in the training loop. We tell the model to perform a single training step. We don’t need to restate what the model needs to do in order to be able to make a parameter update. All the info has been provided in the definition of the TensorFlow graph already. TensorFlow knows that the gradient descent update depends on knowing the loss, which depends on the logits which depend on weights, biases and the actual input batch.


We therefore only need to feed the batch of training data to the model. This is done by providing a feed dictionary in which the batch of training data is assigned to the placeholders we defined earlier.


After the training is completed, we evaluate the model on the test set. This is the first time the model ever sees the test set, so the images in the test set are completely new to the model. We’re evaluating how well the trained model can handle unknown data.

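A sketch of the final evaluation:

```python
# The test set is fed to the model exactly once, after training
test_accuracy = sess.run(accuracy, feed_dict={
    images_placeholder: data_sets['images_test'],
    labels_placeholder: data_sets['labels_test']})
print('Test accuracy {:g}'.format(test_accuracy))
```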

The final lines print out how long it took to train and run the model.

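For example (endTime mirrors the beginTime recorded at the top of the file):

```python
endTime = time.time()
print('Total time: {:5.2f}s'.format(endTime - beginTime))
```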

Results

Let’s run the model with the command “python softmax.py”. Here is what my output looks like:

Step   0: training accuracy 0.14
Step 100: training accuracy 0.32
Step 200: training accuracy 0.3
Step 300: training accuracy 0.23
Step 400: training accuracy 0.26
Step 500: training accuracy 0.31
Step 600: training accuracy 0.44
Step 700: training accuracy 0.33
Step 800: training accuracy 0.23
Step 900: training accuracy 0.31
Test accuracy 0.3066
Total time: 12.42s

What does this mean? The accuracy of evaluating the trained model on the test set is about 31%. If you run the code yourself, your result will probably be around 25–30%. So our model is able to pick the correct label for an image it has never seen before around 25–30% of the time. That’s not bad!


There are 10 different labels, so random guessing would result in an accuracy of 10%. Our very simple method is already way better than guessing randomly. If you think that 25% still sounds pretty low, don’t forget that the model is still pretty dumb. It has no notion of actual image features like lines or even shapes. It looks strictly at the color of each pixel individually, completely independent from other pixels. An image shifted by a single pixel would represent a completely different input to this model. Considering this, 25% doesn’t look too shabby anymore.


What would happen if we trained for more iterations? That would probably not improve the model’s accuracy. If you look at the results, you can see that the training accuracy is not steadily increasing, but instead fluctuating between 0.23 and 0.44. It seems to be the case that we have reached this model’s limit, and seeing more training data would not help. This model is not able to deliver better results. In fact, instead of training for 1000 iterations, we would have gotten a similar accuracy after significantly fewer iterations.

One last thing you probably noticed: the test accuracy is quite a lot lower than the training accuracy. If this gap is quite big, this is often a sign of overfitting. The model is then more finely tuned to the training data it has seen, and it is not able to generalize as well to previously unseen data.


This post has turned out to be quite long already. I’d like to thank you for reading it all (or for skipping right to the bottom)! I hope you found something of interest to you, whether it’s how a machine learning classifier works or how to build and run a simple graph with TensorFlow. Of course, there is still a lot of material that I would like to add. So far, we have only talked about the softmax classifier, which isn’t even using any neural nets.


My next blog post changes that: Find out how much using a small neural network model can improve the results! Read it here.


Thanks for reading. You can also check out other articles I’ve written on my blog.


Translated from: https://www.freecodecamp.org/news/how-to-build-a-simple-image-recognition-system-with-tensorflow-part-1-d6a775ef75d/
