Heroes of Deep Learning (Optional)
Geoffrey Hinton Interview (Andrew Ng interviews Geoffrey Hinton)
As part of this course by deeplearning.ai, I hope to not just teach you the technical ideas in deep learning, but also introduce you to some of the people, some of the heroes in deep learning, who invented so many of the ideas that you'll learn about in this course or in this specialization. In these videos, I also hope to ask these leaders of deep learning to give you career advice: how you can break into deep learning, how you can do research or find a job in deep learning. As the first in this interview series, I am delighted to present to you an interview with Geoffrey Hinton.
[Andrew] Welcome Geoff, and thank you for doing this interview with deeplearning.ai.
[Hinton] Thank you for inviting me.
[Andrew] I think that at this point you, more than anyone else on this planet, have invented so many of the ideas behind deep learning. And a lot of people have been calling you the godfather of deep learning, although it wasn't until we were chatting a few minutes ago that I realized you think I'm the first one to call you that, which I'm quite happy to have done. But what I want to ask is this: many people know you as a legend, and I want to ask about your personal story behind the legend. So going way back, how did you get involved in AI and machine learning and neural networks?
[Hinton] So when I was at high school, I had a classmate who was always better than me at everything; he was a brilliant mathematician. And he came into school one day and said, did you know the brain uses holograms? I guess that was about 1966, and I said, sort of, what's a hologram? And he explained that in a hologram you can chop off half of it and you still get the whole picture, and that memories in the brain might be distributed over the whole brain. I guess he'd read about Lashley's experiments, where you chop out bits of a rat's brain and discover that it's very hard to find one bit where it stores one particular memory. So that's what first got me interested in how the brain stores memories. Then when I went to university, I started off studying physiology and physics. I think when I was at Cambridge, I was the only undergraduate doing physiology and physics. Then I gave up on that and tried to do philosophy, because I thought that might give me more insight. But that seemed to me to be lacking in ways of distinguishing when someone said something false. So I switched to psychology, and in psychology they had very, very simple theories, which seemed to me hopelessly inadequate for explaining what the brain was doing. So then I took some time off and became a carpenter. And then I decided that I'd try AI, and went to Edinburgh to study AI with Longuet-Higgins. He had done very nice work on neural networks, but he'd just given up on neural networks and been very impressed by Winograd's thesis. So when I arrived he thought I was doing this old-fashioned stuff and I ought to start on symbolic AI. And we had a lot of fights about that, but I just kept on doing what I believed in.
[Andrew] And then what?
[Hinton] I eventually got a PhD in AI, and then I couldn't get a job in Britain. But I saw this very nice advertisement for Sloan Fellowships in California, and I managed to get one of those. And I went to California, and everything was different there. In Britain, neural nets were regarded as kind of silly, and in California, Don Norman and David Rumelhart were very open to ideas about neural nets. It was the first time I'd been somewhere where thinking about how the brain works, and thinking about how that might relate to psychology, was seen as a very positive thing. And it was a lot of fun there; in particular, collaborating with David Rumelhart was great.
[Andrew] I see, great. So this was when you were at UCSD, and you and Rumelhart around what, 1982, wound up writing the seminal backprop paper, right?
[Hinton] Actually, it was more complicated than that.
[Andrew] What happened?
[Hinton] In, I think, early 1982, David Rumelhart and I, and Ron Williams, between us developed the backprop algorithm; it was mainly David Rumelhart's idea. We discovered later that many other people had invented it. David Parker had invented it, probably after us but before we'd published. Paul Werbos had published it already quite a few years earlier, but nobody paid it much attention. And there were other people who'd developed very similar algorithms; it's not clear exactly what's meant by backprop. But using the chain rule to get derivatives was not a novel idea.
[Andrew] I see. Why do you think it was your paper that helped the community latch on to backprop so much? It feels like your paper marked an inflection point in the community's acceptance of this algorithm.
[Hinton] So we managed to get a paper into Nature in 1986. And I did quite a lot of political work to get the paper accepted. I figured out that one of the referees was probably going to be Stuart Sutherland, who was a well known psychologist in Britain. And I went to talk to him for a long time, and explained to him exactly what was going on. And he was very impressed by the fact that we showed that backprop could learn representations for words. And you could look at those representations, which are little vectors, and you could understand the meaning of the individual features. So we actually trained it on little triples of words about family trees, like Mary has mother Victoria. And you'd give it the first two words, and it would have to predict the last word. And after you trained it, you could see all sorts of features in the representations of the individual words. Like the nationality of the person there, what generation they were, which branch of the family tree they were in, and so on. That was what made Stuart Sutherland really impressed with it, and I think that's why the paper got accepted.
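To give a flavor of the family-tree experiment described above, here is a minimal sketch in Python. This is not Hinton and Rumelhart's original 1986 network; the names, triples, layer sizes, and learning rate are all illustrative assumptions. A tiny network learns an embedding vector for each word by predicting the third word of each (person, relation, person) triple with plain backprop.

```python
# A minimal sketch (not the original 1986 architecture or data): learn small
# embedding vectors by predicting the third word of (person, relation, person)
# triples with plain backprop. All names, triples, and sizes are illustrative.
import numpy as np

rng = np.random.default_rng(0)

people = ["mary", "victoria", "james", "arthur"]
relations = ["mother", "father", "son"]
vocab = people + relations
idx = {w: i for i, w in enumerate(vocab)}

# Toy triples, e.g. "Mary has mother Victoria".
triples = [("mary", "mother", "victoria"),
           ("james", "father", "arthur"),
           ("arthur", "son", "james")]

dim = 6                                          # embedding size
E = rng.normal(0, 0.1, (len(vocab), dim))        # one embedding vector per word
W = rng.normal(0, 0.1, (2 * dim, len(people)))   # predicts the object (a person)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for epoch in range(500):
    for subj, rel, obj in triples:
        s, r, o = idx[subj], idx[rel], people.index(obj)
        x = np.concatenate([E[s], E[r]])   # input: subject and relation embeddings
        p = softmax(x @ W)                 # distribution over possible objects

        # Cross-entropy gradient at the logits: p - onehot(o)
        d_logits = p.copy()
        d_logits[o] -= 1.0
        dW = np.outer(x, d_logits)
        dx = W @ d_logits

        W -= lr * dW
        E[s] -= lr * dx[:dim]              # backprop into the embeddings themselves
        E[r] -= lr * dx[dim:]

# After training, the rows of E are learned feature vectors for each word.
for w in vocab:
    print(w, np.round(E[idx[w]], 2))
```

The point of the sketch is only that the embeddings are learned by the same gradient signal that trains the predictor, so whatever regularities help prediction (generation, branch of the family, and so on in the real experiment) end up encoded in the feature vectors.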
[Andrew] Very early word embeddings, and you're already seeing learned features of semantic meanings emerge from the training algorithm.
[Hinton] Yes. So from a psychologist's point of view, what was interesting was that it unified two completely different strands of ideas about what knowledge was like. There was the old psychologist's view that a concept is just a big bundle of features, and there's lots of evidence for that. And then there was the AI view of the time, which was a formal, structuralist view: a concept is defined by how it relates to other concepts, and to capture a concept you'd have to use something like a graph structure, or maybe a semantic net. What this backpropagation example showed was that you could give it the information that would go into a graph structure, in this case a family tree, and it could convert that information into features in such a way that it could then use the features to derive new, consistent information, i.e. generalize. But the crucial thing was this back and forth between the graphical or tree-structured representation of the family tree and the representation of the people as big feature vectors: from the graph-like representation you could get feature vectors, and from the feature vectors you could get more of the graph-like representation.
[Andrew] So this is 1986?
[Hinton] In the early 90s, Bengio showed that you could actually take real data, English text, apply the same techniques there, and get embeddings for real words, and that impressed people a lot.
[Andrew] I guess recently we've been talking a lot about how fast computers, things like GPUs and supercomputers, are driving deep learning. I didn't realize that back between 1986 and the early 90s, between you and Bengio, there were already the beginnings of this trend.
[Hinton] Yes, it was a huge advance. In 1986, I was using a Lisp machine which was less than a tenth of a megaflop. And by about 1993 or thereabouts, people were seeing ten megaflops.
[Andrew] I see.
[Hinton] So there was a factor of 100, and that's the point at which it was easy to use, because computers were just getting faster.
[Andrew] Over the past several decades, you've invented so many pieces of neural networks and deep learning. I'm actually curious, of all of the things you've invented, which ones are you still most excited about today?
[Hinton] So I think the most beautiful one is the work I did with Terry Sejnowski on Boltzmann machines. We discovered there was this really, really simple learning algorithm that applied to great big densely connected nets where you could only see a few of the nodes, so it would learn hidden representations, and it was a very simple algorithm. And it looked like the kind of thing you should be able to get in a brain, because each synapse only needed to know about the behavior of the two neurons it was directly connected to, and the information being propagated was the same. There were two different phases, which we called wake and sleep, but in the two phases you're propagating information in just the same way. Whereas in something like backpropagation, there's a forward pass and a backward pass, and they work differently; they're sending different kinds of signals. So I think that's the most beautiful thing. For many years it looked just like a curiosity, because it seemed much too slow. But then later on, I got rid of a little bit of the beauty by not letting the network settle down, just using one iteration, in a somewhat simpler net. And that gave restricted Boltzmann machines, which actually worked effectively in practice. So in the Netflix competition, for example, restricted Boltzmann machines were one of the ingredients of the winning entry.
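As a rough illustration of the "one iteration instead of settling to equilibrium" idea, here is a minimal sketch of a binary restricted Boltzmann machine trained with one-step contrastive divergence (CD-1). The data, layer sizes, and hyperparameters are made-up assumptions for the sake of a runnable example, not anything from the interview.

```python
# A minimal sketch of a binary RBM trained with CD-1: one reconstruction step
# instead of running the Markov chain to equilibrium. Data and sizes are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, (n_visible, n_hidden))
b_v = np.zeros(n_visible)   # visible biases
b_h = np.zeros(n_hidden)    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

# Toy binary data: two repeating patterns.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

lr = 0.1
for epoch in range(200):
    for v0 in data:
        # Positive phase: hidden probabilities given the data.
        ph0 = sigmoid(v0 @ W + b_h)
        h0 = sample(ph0)
        # Negative phase: a single reconstruction step (the "one iteration").
        pv1 = sigmoid(h0 @ W.T + b_v)
        v1 = sample(pv1)
        ph1 = sigmoid(v1 @ W + b_h)

        # CD-1 update: difference of pairwise statistics in the two phases.
        W += lr * (np.outer(v0, ph0) - np.outer(v1, ph1))
        b_v += lr * (v0 - v1)
        b_h += lr * (ph0 - ph1)

# Hidden probabilities now act as learned features of the visible patterns.
print(np.round(sigmoid(data[:2] @ W + b_h), 2))
```

Each weight update only uses the activities of the two units a connection joins, measured in the data-driven phase and the reconstruction phase, which is the locality property Hinton highlights.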
[Andrew] And in fact, a lot of the recent resurgence of neural nets and deep learning, starting about 2007, was the restricted Boltzmann machine and deep belief net work that you and your lab did.
[Hinton] Yes, so that's another of the pieces of work I'm very happy with: the idea that you could train a restricted Boltzmann machine, which just had one layer of hidden features, and learn one layer of features. And then you could treat those features as data and do it again, and then you could treat the new features you learned as data and do it again, as many times as you liked. So that was nice, and it worked in practice. And then Yee Whye Teh realized that the whole thing could be treated as a single model, but it was a weird kind of model. It was a model where at the top you had a restricted Boltzmann machine, but below that you had a sigmoid belief net, which was something that had been invented many years earlier. So it was a directed model, and what we'd managed to come up with by training these restricted Boltzmann machines was an efficient way of doing inference in sigmoid belief nets. Around that time, there were people doing neural nets who would use densely connected nets but didn't have any good ways of doing probabilistic inference in them. And you had people doing graphical models, unlike my children, who could
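The greedy stacking Hinton describes, train one RBM, treat its hidden features as data, train another RBM on top, can be sketched as follows. Again the data, layer sizes, and hyperparameters are illustrative assumptions, and the RBM trainer is the same CD-1 sketch as above, written as a reusable function.

```python
# A minimal sketch of greedy layer-by-layer stacking of RBMs: learn features
# of the data, then features of those features. Sizes and data are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, lr=0.1, epochs=200):
    """Train a binary RBM with CD-1; return (weights, hidden biases)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.1, (n_visible, n_hidden))
    b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        for v0 in data:
            ph0 = sigmoid(v0 @ W + b_h)
            h0 = (rng.random(n_hidden) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + b_v)            # one reconstruction step
            ph1 = sigmoid(pv1 @ W + b_h)
            W += lr * (np.outer(v0, ph0) - np.outer(pv1, ph1))
            b_v += lr * (v0 - pv1)
            b_h += lr * (ph0 - ph1)
    return W, b_h

data = np.array([[1, 1, 1, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1]] * 10, dtype=float)

# Layer 1: learn features of the raw data.
W1, b1 = train_rbm(data, n_hidden=4)
features1 = sigmoid(data @ W1 + b1)

# Layer 2: treat those features as data and learn features of the features.
W2, b2 = train_rbm(features1, n_hidden=2)
features2 = sigmoid(features1 @ W2 + b2)

print(np.round(features2[:2], 2))
```

Each layer is trained on its own, with the layers below frozen, which is the "do it again as many times as you liked" procedure; the resulting stack corresponds to the deep belief net picture Hinton describes, with an RBM at the top and directed, sigmoid-belief-net connections below.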