Application example: Photo OCR - Getting lots of data: Artificial data synthesis

One of the most reliable ways to get a high-performance machine learning system is to take a low-bias learning algorithm and train it on a massive training set. But where can you get that much training data? It turns out that in machine learning there's a fascinating idea called artificial data synthesis. This doesn't apply to every problem, and it often takes some thought, innovation, and insight. But if this idea applies to your machine learning problem, it can sometimes be an easy way to get a huge training set for your learning algorithm. Artificial data synthesis comprises two main variations: the first is creating data from scratch; the second is amplifying an existing small training set into a much larger one. We'll go over both ideas in this class.

Figure-1

Let's use character recognition as the example: given an input image, recognize which character it contains.

Figure-2

If we go out and collect a large labeled data set, figure-2 is what it would look like. For this particular example, I've chosen a square aspect ratio. We'll take square image patches and the goal is to take an image patch and recognize the character in the middle of that image patch. For simplicity, I'll treat these images as grey scale images rather than color images. All these examples are real images. How can we come up with a much larger training set?

Figure-3

Modern computers often have huge font libraries. If you use word processing software, you might have all the fonts shown in figure-3, and you can also go to different websites and download hundreds or thousands of free fonts. So if you want more training examples, you can take characters from different fonts and paste them against different random backgrounds. For example, take the character 'c' and paste it against a random background; you now have a training example of an image of the character 'c'.
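
To make the pasting step concrete, here is a minimal sketch in NumPy. A toy binary mask stands in for a glyph rendered from a real font library, and the function name `synthesize_example` is my own, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def synthesize_example(glyph, background, fg_level=0.1):
    """Paste a glyph mask onto a random patch of a background image.

    glyph:      2-D float array in [0, 1], 1.0 where the character's ink is
    background: 2-D grayscale image, larger than the glyph
    Returns one synthetic grayscale training patch the size of the glyph.
    """
    h, w = glyph.shape
    # pick a random patch of the background the same size as the glyph
    y = rng.integers(0, background.shape[0] - h + 1)
    x = rng.integers(0, background.shape[1] - w + 1)
    patch = background[y:y + h, x:x + w].astype(float).copy()
    # blend: where the glyph has ink, darken the patch toward fg_level
    return (1.0 - glyph) * patch + glyph * fg_level

# toy demo: a 5x5 "glyph" (a vertical stroke) against a 20x20 background
glyph = np.zeros((5, 5))
glyph[:, 2] = 1.0                                   # ink down the middle
background = rng.uniform(0.5, 1.0, size=(20, 20))   # light random texture
patch = synthesize_example(glyph, background)
```

In practice you would render each font character to the glyph mask (for instance with an image library) and loop over many fonts and many backgrounds to build the training set.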

Figure-4

After some amount of work, you can get a synthetic training set like figure-4. Every image in figure-4 is actually a synthesized image. Besides pasting an image of one character against some random background image, you may also apply slight blurring operators or small affine distortions. By 'affine', I mean shearing, scaling, and small rotation operations. Using synthetic data, you'll have an essentially unlimited supply of training examples for a supervised learning algorithm.
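
The affine distortions mentioned above can be sketched as follows. This is a minimal NumPy implementation using inverse mapping with nearest-neighbour sampling; the function name and parameterization are my own, and a real pipeline would more likely use an image library's warp routines:

```python
import numpy as np

def affine_distort(img, angle=0.0, shear=0.0, scale=1.0):
    """Apply a small affine distortion (rotation, shear, scale) to a
    grayscale image, about its centre, via inverse mapping with
    nearest-neighbour sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    c, s = np.cos(angle), np.sin(angle)
    # forward transform: shear, then rotate, then scale
    A = np.array([[c, -s], [s, c]]) @ np.array([[1.0, shear], [0.0, 1.0]]) * scale
    Ainv = np.linalg.inv(A)
    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys - cy, xs - cx])       # 2 x h x w centred offsets
    src = np.tensordot(Ainv, coords, axes=1)    # map output pixels -> input
    sy = np.clip(np.rint(src[0] + cy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(src[1] + cx), 0, w - 1).astype(int)
    return img[sy, sx]
```

With the default parameters the function is the identity; small values such as `angle=0.05` or `shear=0.1` give the gentle distortions that keep the character recognizable.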

Figure-5

Another main approach to artificial data synthesis is to take real examples that you currently have and create additional data from them, so as to amplify your training set. For example, figure-5 is a real image of the character 'A'. I overlaid it with grid lines just for the purpose of illustration. We can take this image and introduce artificial warpings or artificial distortions to turn it into 16 new examples. For the specific example of character recognition, introducing these warpings seems like a natural choice. But for a different machine learning application, different distortions might make more sense.
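
One plausible way to implement such a warping is a smoothed random displacement field, so that neighbouring pixels move together. The sketch below is an assumption on my part (a simple elastic-style distortion, not necessarily the exact warping used to produce figure-5), again in plain NumPy:

```python
import numpy as np

rng = np.random.default_rng(1)

def elastic_warp(img, strength=1.5, smooth=5):
    """Warp a grayscale image with a random displacement field that is
    box-blurred so the distortion is locally coherent (a sketch of the
    'elastic distortion' idea)."""
    h, w = img.shape
    dy = rng.uniform(-1, 1, (h, w))
    dx = rng.uniform(-1, 1, (h, w))
    k = np.ones(smooth) / smooth
    blur = lambda r: np.convolve(r, k, 'same')
    for _ in range(2):  # separable box blur along both axes, applied twice
        dy = np.apply_along_axis(blur, 0, dy)
        dy = np.apply_along_axis(blur, 1, dy)
        dx = np.apply_along_axis(blur, 0, dx)
        dx = np.apply_along_axis(blur, 1, dx)
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.clip(np.rint(ys + strength * dy), 0, h - 1).astype(int)
    sx = np.clip(np.rint(xs + strength * dx), 0, w - 1).astype(int)
    return img[sy, sx]

# amplify one real example into 16 warped variants, as in figure-5
example = rng.uniform(0, 1, (16, 16))
variants = [elastic_warp(example) for _ in range(16)]
```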

Figure-6

Let's see another example from the totally different domain of speech recognition. Say you have audio clips, and you want to learn from each clip what words were spoken in it. Suppose you have one labeled training example of someone saying a few specific words: 0, 1, 2, 3, 4, 5, and you want to apply a learning algorithm to recognize the words said in that clip. How can we amplify the data set? One thing we can do is introduce additional audio distortions into the data set:

  • Add background sounds to simulate a bad cell phone connection
  • Noisy background with cars driving past, people walking in the background
  • Noisy background with machinery

We've now multiplied this one example into many more examples without much work, just by automatically adding these different background sounds to the clean audio clip.
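
A common way to do this mixing is to scale the background clip to a target signal-to-noise ratio before adding it. The sketch below is one reasonable implementation of that idea (the function name `add_background` and the 10 dB target are my own choices):

```python
import numpy as np

rng = np.random.default_rng(2)

def add_background(clean, noise, snr_db=10.0):
    """Mix a background-noise clip into a clean clip at a target
    signal-to-noise ratio in dB. Both inputs are 1-D float arrays
    of the same length and sample rate."""
    p_sig = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # scale the noise so that 10*log10(p_sig / p_scaled_noise) == snr_db
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return clean + gain * noise

# toy demo: a 1 kHz tone standing in for speech, plus white noise at 10 dB
t = np.arange(16000) / 16000.0
clean = 0.5 * np.sin(2 * np.pi * 1000 * t)
noise = rng.standard_normal(len(t))
noisy = add_background(clean, noise, snr_db=10.0)
```

The same clean clip mixed with several different background recordings (cars, crowds, machinery) at several SNRs yields many synthetic examples from one real one.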

Figure-7

Just one word of warning about synthesizing data by introducing distortions: the distortions introduced should be representative of the sorts of noise or distortions you might see in the test set. Usually it does not help to add purely random or meaningless noise to your data. What we've done in the bottom of figure-7 is take the image, and for each pixel of those 4 images, add some random Gaussian noise to each pixel to change the pixel brightness. That's just totally meaningless noise, right? The process of artificial data synthesis is a bit of an art, and sometimes you just have to try it and see if it works. But if you're trying to decide what sorts of distortions to add, do think about what meaningful distortions you could add that will generate additional training examples at least somewhat representative of the sorts of images you expect to see in your test set.
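
For contrast with the meaningful distortions above, the "meaningless" per-pixel noise from the bottom of figure-7 is trivial to generate, which is exactly why it's tempting and exactly why it rarely helps (a minimal sketch, assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(3)

def add_pixel_noise(img, sigma=0.1):
    """The 'meaningless' distortion from figure-7: independent Gaussian
    noise added to every pixel's brightness. Test-set images don't look
    like this, so it usually adds nothing useful to the training set."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

img = rng.uniform(0, 1, (16, 16))
noisy = add_pixel_noise(img)
```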

Figure-8

Finally, figure-8 summarizes some notes about getting more data in machine learning. Note that crowdsourcing is sometimes a good alternative worth considering for getting a lot of labeled data. Today, there are a few websites and services that allow you to hire people on the web to label large training sets for you fairly inexpensively. Crowdsourcing has its own academic literature and its own complications, such as labeler reliability. Amazon Mechanical Turk is probably the most popular crowdsourcing option right now.
