Computer Vision L7 -- Self-supervised Learning

How do we make a learning agent learn from the data itself, when it doesn't have access to the labels?

The main idea here is that we have a dataset with no labels anymore. So what an autoencoder says is: maybe I can just map the image to itself. I can learn a neural net that compresses the image and then reconstructs it.

The thing is, if you can learn to compress the images, you are forced to reduce their dimensionality, so some features should come out of the middle layer. Something like the following is expected: the middle layer extracts features of the image.

But this is just a hope. In reality the autoencoder doesn't behave this way; it just learns to store a downsampled version of the image, a low-resolution copy.
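To make this concrete, here is a minimal sketch of such an autoencoder in PyTorch. The architecture and sizes are illustrative assumptions, not something specified in the lecture:

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    """Compress an image to a small code, then reconstruct the image from it."""
    def __init__(self):
        super().__init__()
        # Encoder: 3x32x32 image -> 64-dim bottleneck code
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1),   # -> 16x16x16
            nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1),  # -> 32x8x8
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 8 * 8, 64),
        )
        # Decoder: 64-dim code -> reconstructed 3x32x32 image
        self.decoder = nn.Sequential(
            nn.Linear(64, 32 * 8 * 8),
            nn.ReLU(),
            nn.Unflatten(1, (32, 8, 8)),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1),  # -> 16x16x16
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1),   # -> 3x32x32
        )

    def forward(self, x):
        code = self.encoder(x)        # the features we *hope* are meaningful
        return self.decoder(code)

model = AutoEncoder()
x = torch.randn(8, 3, 32, 32)                 # a batch of unlabeled images
loss = nn.functional.mse_loss(model(x), x)    # reconstruct the input itself
```

Note the "label" is the input itself, so the reconstruction loss needs no annotation at all; but nothing in it forces the 64-dim code to be semantic rather than a blurry thumbnail.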


The idea here is very simple: break your data, break each example into two parts, and predict one part from the other. Do the extrapolation.

 

We saw an example of this in the last lecture with colorization. There we split the data in exactly this way: we take an image, partition it into intensity and color, and predict one from the other.
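A minimal sketch of that split, assuming the Lab color space (where the L channel is the intensity and the ab channels carry the color); the function name is mine:

```python
import torch
from skimage.color import rgb2lab  # RGB -> Lab color space

def split_intensity_color(rgb):
    """Break one image into (input, target) for the colorization task.
    rgb: HxWx3 float array with values in [0, 1]."""
    lab = rgb2lab(rgb)                            # L in [0,100], ab roughly [-128,127]
    L = torch.from_numpy(lab[..., :1]).float()    # intensity: the part we see
    ab = torch.from_numpy(lab[..., 1:]).float()   # color: the part to predict
    return L, ab
```

The network then takes L as input and is trained to predict ab, so every unlabeled photo becomes a training pair for free.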

Another example:

Can you figure out what the spatial arrangement between them is? It is easy for us, because we know what a cat looks like.

Every image has a spatial layout intrinsic to it. We can take images, break them into patches, and train a neural net to predict what the correct spatial layout is. The hypothesis is that the net should learn the prototype of what a cat looks like, and this does happen in reality.

So this is a very simple prediction problem: we take an image and partition it into two parts. One part is the pixels and the other part is the spatial layout of those pixels, and you have to predict one from the other.

It's learning to classify: given two patches (X), find what their arrangement is (Y). There are only eight possible categories. Now we're constructing the labels from the data itself.
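Here is a sketch of how those (X, Y) pairs can be constructed from a single unlabeled image; the patch size, the gap between patches, and the 0–7 numbering of the eight arrangements are illustrative choices:

```python
import random

# The 8 possible positions of the neighbor patch relative to the center patch,
# as (row, col) offsets in patch units; the list index is the class label Y.
OFFSETS = [(-1, -1), (-1, 0), (-1, 1),
           ( 0, -1),          ( 0, 1),
           ( 1, -1), ( 1, 0), ( 1, 1)]

def make_pair(img, patch=64, gap=8):
    """img: CxHxW array/tensor, large enough to fit a 3x3 grid of patches.
    Returns (center_patch, neighbor_patch, label) with label in 0..7."""
    _, H, W = img.shape
    step = patch + gap
    # Pick a center location that leaves room for every possible neighbor.
    r = random.randint(step, H - step - patch)
    c = random.randint(step, W - step - patch)
    label = random.randrange(8)      # Y, constructed from the data itself
    dr, dc = OFFSETS[label]
    center = img[:, r:r + patch, c:c + patch]
    neighbor = img[:, r + dr * step:r + dr * step + patch,
                      c + dc * step:c + dc * step + patch]
    return center, neighbor, label
```

A two-branch network then takes (center, neighbor) and is trained with an ordinary 8-way cross-entropy loss on the label.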


We have the image x, and we break it up into two different views. Then we run a neural net on each view, and we want to maximize the agreement between them. In practice, what happens is that z ends up discarding information and stores only the minimal sufficient information to solve the problem. So what people do is use the layer before, because that layer hasn't discarded as much information yet. That's why the diagram is drawn with that partition between f() and g(): you end up using the layer before z as the representation. (f() is the neural net, g() is the linear transformation, and our model learns both f and g.)
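In code, that partition might look like the sketch below; the ResNet-18 backbone and the 128-dim projection are my assumptions, but the split matches the diagram: f() produces the representation h you keep, g() produces the embedding z that the loss consumes:

```python
import torch.nn as nn
from torchvision.models import resnet18

class ContrastiveNet(nn.Module):
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet18()
        backbone.fc = nn.Identity()        # drop the classification head
        self.f = backbone                  # encoder: x -> h (512-dim), kept for transfer
        self.g = nn.Linear(512, proj_dim)  # projection: h -> z, used only by the loss

    def forward(self, x):
        h = self.f(x)   # the layer you actually reuse downstream
        z = self.g(h)   # the embedding that may discard information
        return h, z
```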

The loss function we want to minimize says: we compute the similarity of z_i and z_j, which are embeddings of two different views of the same image, and we want this numerator term to be as large as possible. In the denominator, z_k is a view of a different, random image, so we want z_i and z_k to be as far apart as possible.
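The lecture states this in words; written out, it matches the standard NT-Xent (InfoNCE) loss used by SimCLR, where sim(·,·) is cosine similarity and τ is a temperature:

$$
\ell_{i,j} = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \ne i} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}
$$

Minimizing the negative log pushes the numerator (agreement between views of the same image) up and every cross-image term in the denominator down.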

Different views of the original image can be things like the following:
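Concretely, a typical view-generation pipeline looks something like this; the specific transforms and parameters are illustrative, in the spirit of SimCLR's augmentations:

```python
from torchvision import transforms

# Two independent draws from this pipeline give two "views" of one image.
make_view = transforms.Compose([
    transforms.RandomResizedCrop(224),           # random crop, then rescale
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(0.4, 0.4, 0.4, 0.1),  # color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.GaussianBlur(kernel_size=23),
    transforms.ToTensor(),
])

# view_i, view_j = make_view(img), make_view(img)   # img: a PIL image
```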


After you train this whole representation, you have some way of creating these different views and you have learned an embedding space. How do you actually use that embedding space for a task?

In other words, you train a neural net for one job, but how do you transfer it to a new task? This is where fine-tuning comes in:

We have a neural net trained on the first task; it could be object classification or a self-supervised objective. We do backprop and gradient descent to get the parameters of the network.

Fine-tuning means that when you have a new task to solve, you just continue training the neural net. You initialize it where you left off on the previous task, change the loss function, and keep training. It's a very simple way to do transfer learning.
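A minimal sketch of that recipe, reusing the ContrastiveNet encoder from the sketch above; the 10-class head and new_task_loader are hypothetical stand-ins for your new task:

```python
import torch
import torch.nn as nn

# 1. Start from the parameters learned on the previous (pretext) task.
#    In practice you would load saved pretrained weights here.
encoder = ContrastiveNet().f
model = nn.Sequential(encoder, nn.Linear(512, 10))   # new head for the new task

# 2. Change the loss function and just keep training.
criterion = nn.CrossEntropyLoss()                         # the new task's loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # small lr: stay near the init

for images, labels in new_task_loader:               # labeled data for the new task
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()                                  # same backprop, new objective
    optimizer.step()
```

A common variant is to freeze the encoder and train only the new head (a linear probe), which is a cheap way to measure how good the learned representation is.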


Backprop and neural nets are really good at finding patterns. If there is any pattern in the data, they will eventually find it. And sometimes that pattern is not the one you want.

Why is color so important for solving the patch-arrangement problem? There is information hidden in the color channels that is very hard for our eyes to see:

We can see a kind of purple fringe along the outline of the tree in the photo. This is called chromatic aberration.

The problem is that, fundamentally, it's really difficult to build a lens that focuses all wavelengths of light, all colors, on the very same spot. So when you see this purple fringing, what's going on is that the green wavelengths are focused well in the image while the purple wavelengths are not: they are blurred, and they smear past high-contrast edges as a fringe.

Most of the time it's not a problem. But a tree photographed against a bright sky is full of high-contrast edges, which is why chromatic aberration shows up so often in pictures of trees.

There is another lens phenomenon called vignetting: the center of the image is brighter than the edges.

So the neural net was actually using the chromatic aberration and vignetting in the images to find its pattern: both cues reveal where a patch sits relative to the image center. As a result, in the earlier experiment, if we remove the color, the neural net loses the clues from chromatic aberration and vignetting, and it can no longer find the position of each patch.

 

 

We can actually use these clues to detect fake news: every natural photo has some chromatic aberration in it, so we can use it to check whether a photo has been cropped. Fake news can happen when someone crops a photo in a clever way so that readers can't see the full context.

By analyzing the photo, the pattern of chromatic aberration may suggest that the optical center of the image is not at the geometric center of the frame. From that you can conclude that the photo was cropped.
