So the network first picks random weights for each neuron, and its output says that Kitchen is the most probable class, while the ground truth is that we, as humans, are 100% sure it is actually Bedroom. Now that we have the predictions (blue chart) and the ground truth (green chart), we can compute the loss between them.
The fewer bits you need, the closer the two distributions are.
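To make the "number of bits" idea concrete, here is a minimal sketch that assumes the loss is the cross-entropy between the ground truth and the prediction, measured in bits (log base 2). The class names other than Bedroom and Kitchen, the probability values, and the helper `cross_entropy_bits` are made up for illustration and are not taken from the charts.

```python
import math

def cross_entropy_bits(truth, pred, eps=1e-12):
    """Cross-entropy H(truth, pred) in bits; terms with truth == 0 contribute nothing."""
    return -sum(t * math.log2(max(p, eps)) for t, p in zip(truth, pred) if t > 0)

# Hypothetical class order: Bedroom, Kitchen, Office (values are illustrative).
truth = [1.0, 0.0, 0.0]        # humans are 100% sure it is Bedroom
pred_random = [0.2, 0.5, 0.3]  # untrained network favours Kitchen
pred_better = [0.8, 0.1, 0.1]  # a better prediction puts most mass on Bedroom

loss_random = cross_entropy_bits(truth, pred_random)
loss_better = cross_entropy_bits(truth, pred_better)

print(f"random weights: {loss_random:.3f} bits")
print(f"better weights: {loss_better:.3f} bits")
```

With a one-hot ground truth the loss reduces to minus the log of the probability assigned to the correct class, so the prediction that puts more mass on Bedroom costs fewer bits.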
We might wonder what is in the first convolution layer. It contains different kinds of filters.
The filters are on the left and the Fourier magnitude of each filter is on the right. We can see that the layer is effectively doing edge detection.
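As a rough illustration of what the Fourier magnitude of an edge filter looks like, the sketch below uses a hand-written 3x3 Sobel kernel as a stand-in for a learned filter (the real first-layer filters are learned, larger, and not reproduced here). The helper `dft2_magnitude` and the padding size of 8 are assumptions for this example; in practice a library routine such as `numpy.fft.fft2` would normally be used instead of the explicit loops.

```python
import cmath

def dft2_magnitude(kernel, size=8):
    """Zero-pad a small kernel to size x size and return the magnitude of its 2D DFT."""
    padded = [[0.0] * size for _ in range(size)]
    for y, row in enumerate(kernel):
        for x, value in enumerate(row):
            padded[y][x] = value

    magnitude = []
    for u in range(size):          # vertical frequency index
        mag_row = []
        for v in range(size):      # horizontal frequency index
            total = 0j
            for y in range(size):
                for x in range(size):
                    angle = -2 * cmath.pi * (u * y + v * x) / size
                    total += padded[y][x] * cmath.exp(1j * angle)
            mag_row.append(abs(total))
        magnitude.append(mag_row)
    return magnitude

# Sobel kernel that responds to vertical edges (intensity changes along x).
sobel_x = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

magnitude = dft2_magnitude(sobel_x)
for row in magnitude:
    print(" ".join(f"{m:5.2f}" for m in row))
```

The magnitude is zero wherever the horizontal frequency is zero, including the DC term, so flat regions produce no response, and the energy sits at nonzero horizontal frequencies. That pattern is the kind of evidence the right-hand panel gives for edge detection.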
This is basically the problem of object recognition in computer vision.
How do you build systems that can recognize objects regardless of viewpoint, given all of these different types of variation?