Notes on Coursera's Andrew Ng Machine Learning Specialization, Course 01: Supervised Machine Learning: Regression and Classification, Week 01

Specialization Certificate

Week 01 of Supervised Machine Learning: Regression and Classification

Course Certificate
by DeepLearning.AI & Stanford University

Course link: https://www.coursera.org/learn/machine-learning

These notes include the lecture transcripts, quiz answers, and assignment code, and are for personal study only; if anything here infringes, please contact me for removal. The notes were completed in May or June 2022.

About this Course

In the first course of the Machine Learning Specialization, you will:

• Build machine learning models in Python using popular machine learning libraries NumPy and scikit-learn.

• Build and train supervised machine learning models for prediction and binary classification tasks, including linear regression and logistic regression

The Machine Learning Specialization is a foundational online program created in collaboration between DeepLearning.AI and Stanford Online. In this beginner-friendly program, you will learn the fundamentals of machine learning and how to use these techniques to build real-world AI applications.

This Specialization is taught by Andrew Ng, an AI visionary who has led critical research at Stanford University and groundbreaking work at Google Brain, Baidu, and Landing.AI to advance the AI field.

This 3-course Specialization is an updated and expanded version of Andrew’s pioneering Machine Learning course, rated 4.9 out of 5 and taken by over 4.8 million learners since it launched in 2012.

It provides a broad introduction to modern machine learning, including supervised learning (multiple linear regression, logistic regression, neural networks, and decision trees), unsupervised learning (clustering, dimensionality reduction, recommender systems), and some of the best practices used in Silicon Valley for artificial intelligence and machine learning innovation (evaluating and tuning models, taking a data-centric approach to improving performance, and more.)

By the end of this Specialization, you will have mastered key concepts and gained the practical know-how to quickly and powerfully apply machine learning to challenging real-world problems. If you’re looking to break into AI or build a career in machine learning, the new Machine Learning Specialization is the best place to start.

Syllabus

Week 1: Introduction to Machine Learning

Welcome to the Machine Learning Specialization! You’re joining millions of others who have taken either this or the original course, which led to the founding of Coursera and has introduced millions of learners like you to the exciting world of machine learning!

20 videos

*Graded:* Practice quiz: Supervised vs unsupervised learning

*Graded:* Practice quiz: Regression

*Graded:* Practice quiz: Train the model with gradient descent

Week 2: Regression with multiple input variables

This week, you’ll extend linear regression to handle multiple input features. You’ll also learn some methods for improving your model’s training and performance, such as vectorization, feature scaling, feature engineering and polynomial regression. At the end of the week, you’ll get to practice implementing linear regression in code.

10 videos

*Graded:* Practice quiz: Multiple linear regression

*Graded:* Practice quiz: Gradient descent in practice

*Graded:* Week 2 practice lab: Linear regression

Week 3: Classification

This week, you’ll learn the other type of supervised learning, classification. You’ll learn how to predict categories using the logistic regression model. You’ll learn about the problem of overfitting, and how to handle this problem with a method called regularization. You’ll get to practice implementing logistic regression with regularization at the end of this week!

11 videos, 1 reading

*Graded:* Practice quiz: Classification with logistic regression

*Graded:* Practice quiz: Cost function for logistic regression

*Graded:* Practice quiz: Gradient descent for logistic regression

*Graded:* Practice quiz: The problem of overfitting

*Graded:* Week 3 practice lab: logistic regression

Week 01

[1] Overview of Machine Learning

Welcome to machine learning!

[MUSIC] Welcome to Machine learning.

What is machine learning? You probably use it many times
a day without even knowing it.

Consumer applications of Machine Learning

Anytime you want to find out something
like how do I make a sushi roll? You can do a web search on Google,
Bing or Baidu to find out. And that works so well because their
machine learning software has figured out how to rank web pages.

Or when you upload pictures
to Instagram or Snapchat and think to yourself, I want to tag my
friends so they can see their pictures. Well these apps can recognize your friends
in your pictures and label them as well. That’s also machine learning.

Or if you’ve just finished watching
a Star Wars movie on the video streaming service and you think what
other similar movies can I watch? Well the streaming service will likely use
machine learning to recommend something that you might like.

Each time you use voice to text on
your phone to write a text message. >> Hey Andrew, how’s it going? >> Or tell your phone. Hey Siri play a song by Rihanna, or ask your other phone okay Google
show me Indian restaurants near me. That’s also machine learning.

Each time you receive an email titled,
Congratulations! You’ve won a million dollars. Well maybe you’re rich, congratulations. Or more likely your email service
will probably flag it as spam. That too is an application
of machine learning.

Industrial applications of AI

Beyond consumer applications that
you might use, AI is also rapidly making its way into big companies and
into industrial applications.

For example, I’m deeply concerned about climate change, and I’m glad to see that machine learning is already helping to optimize wind turbine power generation.

Or in healthcare, AI is starting to make its way into hospitals to help doctors make accurate diagnoses.

Or recently, at Landing AI, we have been doing a lot of work putting computer vision into factories to help inspect whether something coming off the assembly line has any defects. That’s machine learning: it’s the science of getting computers to learn without being explicitly programmed.

In this course, you’ll learn about machine learning and get to implement machine learning in code yourself.

Millions of others have taken the earlier version of this course, which is the course that led to the founding of Coursera. And many learners ended up building exciting machine learning systems or even pursuing very successful careers in AI. I’m excited that you’re on this journey with me. Welcome, and let’s get started.

Applications of machine learning

In this class, you learn about the state of the art and also practice implementing machine learning
algorithms yourself.

You learn about the most important machine
learning algorithms, some of which are exactly what’s being used in large AI or large tech companies
today and you get a sense of what is the
state of the art in AI.

Beyond learning the algorithms
though, in this class, you also learn all the
important practical tips and tricks for making
them perform well. You get to implement
them and see how they work for yourself.

Why is machine learning
so widely used today?

Machine learning grew up as a sub-field of AI, or artificial intelligence. We wanted to build intelligent machines.

It turns out that there are a few basic things that we could program a machine to do, such as how to find the shortest path from A to B, like in your GPS. But for the most part, we just did not know how to write an explicit program to do many of the more interesting things, such as perform web search, recognize human speech, diagnose diseases from X-rays, or build a self-driving car.

The only way we knew
how to do these things was to have a machine
learn to do it by itself.

For me, when I founded and was leading the Google Brain team, I worked on problems like speech recognition, computer vision for Google Maps Street View images, and advertising. Or when leading AI at Baidu, I worked on everything from AI for augmented reality to combating payment fraud to leading a self-driving car team. Most recently, at Landing AI, AI Fund, and Stanford University, I’ve been working on applications in factories, large-scale agriculture, healthcare, e-commerce, and other problems.

Today, there are hundreds of thousands, perhaps millions, of people working on machine learning applications who could tell you similar stories about their work with machine learning.

When you’ve learned these skills, I hope that you too will find it great fun to dabble in exciting different applications, and maybe even different industries.

Future of AI

In fact, I find it
hard to think of any industry that
machine learning is unlikely to touch in a significant way now
or in the near future.

Looking even further
into the future, many people, including me, are excited about
the AI dream of someday building machines as
intelligent as you or me. This is sometimes called Artificial General
Intelligence or AGI.

I think AGI has been overhyped, and we’re still a long way away from that goal. I don’t know whether it’ll take 50 years or 500 years or longer to get there. But most AI researchers believe that the best way to get closer to that goal is by using learning algorithms, maybe ones that take some inspiration from how the human brain works. You’ll also hear a little more about this quest for AGI later in this course.

According to a study by McKinsey, AI and machine learning is estimated to create an additional 13 trillion US dollars of value annually by the year 2030.

Even though machine learning
is already creating tremendous amounts of value
in the software industry, I think there could be even vastly greater
value that has yet to be created outside the
software industry in sectors such as retail, travel, transportation, automotive, materials
manufacturing, and so on. Because of the massive
untapped opportunities across so many
different sectors, today there is a vast
unfulfilled demand for this skill set.

That’s why this is
such a great time to be learning about
machine learning.

If you find machine learning
applications exciting, I hope you stick with
me through this class. I can almost
guarantee that you’ll find mastering these
skills worthwhile.

In the next video, we’ll look at a more formal definition of
what is machine learning. And we’ll begin to talk about the main types of machine learning
problems and algorithms.

You pick up some of the main machine learning
terminology and start to get a sense of what are the
different algorithms and when each one
might be appropriate. So let’s go on to
the next video.

[2] Supervised vs. Unsupervised Machine Learning

What is machine learning?

Definition of Machine Learning

Give computers the ability to learn without being explicitly programmed.

So what is machine learning?

In this video, you’ll learn
the definition of what it is and also get a sense of when
you might want to apply it.

Let’s take a look together.

Here’s a definition of what is
machine learning that is attributed to Arthur Samuel.

He defined machine learning as the field
of study that gives computers the ability to learn without being
explicitly programmed.

Samuel: checkers playing program

Samuel’s claim to fame was
that back in the 1950s, he wrote the checkers playing program.

And the amazing thing about this
program was that Arthur Samuel himself wasn’t a very good checkers player.

What he did was he had programmed
the computer to play maybe tens of thousands of games against itself
and by watching what sorts of board positions tended to lead to wins and
what positions tended to lead to losses.

The checkers-playing program learned over time what are good or bad board positions. By trying to get to good positions and avoid bad ones, his program learned to get better and better at playing checkers.

Because the computer had the patience to
play tens of thousands of games against itself, it was able to get so
much checkers playing experience that eventually it became a better checkers
player than Arthur Samuel himself.

A quiz

Now, throughout these videos,
besides me trying to talk about stuff, I occasionally ask you a question to help
make sure you understand the content.

Here’s one about what happens if the computer had played far fewer games. Please take a look and pick whichever you think is the better answer. Thanks for looking at the quiz. And so if you selected the answer “would have made it worse,” then you got it right. In general, the more opportunities you give a learning algorithm to learn, the better it will perform. If you didn’t select the correct answer the first time, that’s totally okay too.

The point of these quiz questions isn’t
to see if you can get them all correct on the first try, these questions are here just to help you
practice the concept you’re learning.

Machine learning algorithms

Supervised learning and Unsupervised learning

Arthur Samuel’s definition
was a rather informal one, but in the next two videos,
we’ll dive deeper together into one of the major types of
machine learning algorithms.

In this class, you learn about many
different learning algorithms. The two main types of machine
learning are supervised learning and unsupervised learning. We’ll define what these terms mean
more in the next couple of videos.

Of these two, supervised learning
is the type of machine learning that is used most in many
real world applications, and that has seen the most rapid
advancement and innovation.

In this specialization, which has three courses in total, the first and second courses will focus on supervised learning, and the third will focus on unsupervised learning.

You might have also heard of reinforcement learning. This is another type of machine learning algorithm that we’ll touch on only briefly, but by far the two most used types of learning algorithms today are supervised learning and unsupervised learning. That’s why we’ll spend most of this class talking about them.

Practical advice for applying learning algorithms

The other thing we’re going to spend a lot
of time on in this specialization is practical advice for
applying learning algorithms. This is something I feel
pretty strongly about.

Teaching about learning algorithms is like giving someone a set of tools. And equally important, or even more important than making sure you have great tools, is making sure you know how to apply them. Because what good is it if someone were to give you a state-of-the-art hammer or a state-of-the-art hand drill and say, good luck, now you have all the tools you need to build a three-story house? It doesn’t really work like that.

Tools

How to apply the tools of machine learning

And so too in machine learning: making sure you have the tools is really important, and so is making sure that you know how to apply the tools of machine learning effectively.

So that’s what you get in this class: the tools, as well as the skill of applying them effectively.

I regularly visit with friends and
teams in some of the top tech companies. And even today, I see experienced
machine learning teams apply machine learning algorithms to some problems and sometimes they’ve been going at it for
six months without much success. And when I look at what they’re doing, I
sometimes feel like I could have told them six months ago that the current approach
won’t work and there’s a different way of using these tools that will give
them a much better chance of success.

So in this class, one of the relatively
unique things you learn is you learn a lot about the best practices for
how to actually develop a practical, valuable machine learning system.

This way, you’re less likely to end up in
one of those teams that end up losing six months going in the wrong direction.

Gain a sense of how the most skilled machine learning engineers build systems

In this class, you gain a sense of how the
most skilled machine learning engineers build systems, and I hope you finish this
class as one of those very rare people in today’s world that know how to design and
build serious machine learning systems. So that’s machine learning.

In the next video, let’s look more deeply
at what is supervised learning and also what is unsupervised learning.

In addition, you learn when you might
want to use each of them, supervised and unsupervised learning. I’ll see you in the next video.

Supervised learning part 1

Learns from being given “right answers” (output labels)

Machine learning is creating tremendous economic value today. I think 99 percent of
the economic value created by machine
learning today is through one type
of machine learning, which is called
supervised learning. Let’s take a look
at what that means. Supervised machine learning or more commonly,
supervised learning, refers to algorithms
that learn x to y or input to output mappings. The key characteristic of
supervised learning is that you give your learning algorithm
examples to learn from. That includes the right answers, and by right answer I mean the correct label y for a given input x. It’s by seeing correct pairs of input x and desired output label y that the learning algorithm eventually learns to take just the input alone, without the output label, and give a reasonably accurate prediction or guess of the output.

Examples

Let’s look at some examples.

If the input x is an email and the output y is whether this email is spam or not spam, this gives you your spam filter.

Or if the input is
an audio clip and the algorithm’s job is
output the text transcript, then this is speech recognition.

Or if you want to input English and have it output the corresponding Spanish, Arabic, Hindi, Chinese, Japanese, or other translation, then that’s machine translation.

Or the most lucrative form
of supervised learning today is probably used
in online advertising. Nearly all the large
online ad platforms have a learning algorithm that
inputs some information about an ad and some
information about you and then tries to figure out if you will click
on that ad or not. Because by showing you ads that you’re just slightly more likely to click on, and because for these large online ad platforms every click is revenue, this actually drives a lot of revenue for these companies.

This is something I once did a lot of work on; maybe not the most inspiring application, but it certainly has a significant economic impact in some countries today.

Or if you want to build
a self-driving car, the learning algorithm
would take as input an image and some
information from other sensors such as a radar or other things and then try
to output the position of, say, other cars so that your self-driving car can safely drive around
the other cars.

Or take manufacturing. I’ve actually done a lot of work in this sector at Landing AI. You can have a learning algorithm take as input a picture of a manufactured product, say a cell phone that just rolled off the production line, and have the learning algorithm output whether or not there is a scratch, dent, or other defect in the product. This is called visual inspection, and it’s helping manufacturers reduce or prevent defects in their products.

These are all supervised learning, using output labels to train our models.

In all of these applications, you will first train your model with examples of inputs x and the right answers, that is, the labels y. After the model has learned from these input-output, or x and y, pairs, it can then take a brand new input x, something it has never seen before, and try to produce the appropriate corresponding output y.

Example: predict housing prices

Let’s dive more deeply
into one specific example.

Say you want to predict housing prices based on
the size of the house.

You’ve collected
some data and say you plot the data and
it looks like this. Here on the horizontal axis is the size of the
house in square feet. Yes, I live in the United States where we still use square feet. I know most of the world
uses square meters. Here on the vertical axis is
the price of the house in, say, thousands of dollars.

Fit a straight line

With this data, let’s say a
friend wants to know what’s the price for their
750 square foot house.

How can the learning
algorithm help you?

One thing a learning algorithm might be able to do is fit a straight line to the data, and reading off the straight line, it looks like your friend’s house could be sold for maybe about, I don’t know, $150,000.

But fitting a
straight line isn’t the only learning
algorithm you can use.

Fit a curve

There are others that could work better for this application. For example, rather than fitting a straight line, you might decide that it’s better to fit a curve, a function that’s slightly more complicated or more complex than a straight line.

If you do that and make
a prediction here, then it looks like, well, your friend’s house could be
sold for closer to $200,000.

One of the things you
see later in this class is how you can decide whether
to fit a straight line, a curve, or another function that is even
more complex to the data.

We will learn how to get an algorithm to systematically choose the most appropriate line or curve to fit to the data.

Now, it doesn’t seem
appropriate to pick the one that gives your
friend the best price, but one thing you
see is how to get an algorithm to systematically choose the most
appropriate line or curve or other thing
to fit to this data.

This is a particular type of supervised learning called regression.

By regression, we are trying to predict a number from infinitely many possible numbers.

What you’ve seen
in this slide is an example of
supervised learning.

Because we gave the
algorithm a dataset in which the so-called
right answer, that is the label or the correct price y is given
for every house on the plot.

The task of the learning
algorithm is to produce more of
these right answers, specifically predicting what is the likely price for other houses like
your friend’s house. That’s why this is
supervised learning.

To define a little
bit more terminology, this housing price prediction is the particular type of
supervised learning called regression.

By regression, I
mean we’re trying to predict a number from infinitely many possible numbers such as the house
prices in our example, which could be 150,000 or 70,000 or 183,000 or any
other number in between.

Regression

That’s supervised learning: learning input-output, or x to y, mappings. You saw in this video an example of regression, where the task is to predict a number.

But there’s also a
second major type of supervised learning problem
called classification. Let’s take a look at what
that means in the next video.

Supervised learning part 2

Classification algorithm

So supervised learning algorithms learn mappings from input to output, or X to Y. And in the last video you saw that regression algorithms, which are a type of supervised learning algorithm, learn to predict numbers out of infinitely many possible numbers. There’s a second major type of supervised learning algorithm called a classification algorithm.

Example: breast cancer detection

Let’s take a look at what this means. Take breast cancer detection as
an example of a classification problem.

Say you’re building
a machine learning system so that doctors can have a diagnostic
tool to detect breast cancer. This is important because early detection
could potentially save a patient’s life.

Using a patient’s medical records, your machine learning system tries to figure out whether a tumor, that is, a lump, is malignant, meaning cancerous or dangerous, or whether that tumor, that lump, is benign, meaning that it’s just a lump that isn’t cancerous and isn’t that dangerous.

Dataset: tumors sizes

Label: benign (circle symbol) malignant (cross symbol)

Some of my friends have actually been
working on this specific problem. So maybe your dataset has
tumors of various sizes. And these tumors are labeled
as either benign, which I will designate in
this example with a 0 or malignant, which will designate
in this example with a 1.

You can then plot your data
on a graph like this where the horizontal axis represents
the size of the tumor and the vertical axis takes
on only two values 0 or 1 depending on whether the tumor
is benign, 0 or malignant 1. One reason that this is different from
regression is that we’re trying to predict only a small number of possible outputs or
categories. In this case two possible outputs 0 or 1, benign or malignant.

This is different from regression
which tries to predict any number, all of the infinitely many
number of possible numbers. And so the fact that there
are only two possible outputs is what makes this classification. Because there are only
two possible outputs or two possible categories in this example, you can also plot this data
set on a line like this.

Right now, I’m going to use two different symbols to denote the category, using a circle, an O, to denote the benign examples and a cross to denote the malignant examples.

And if a new patient walks in for a diagnosis and they have a lump that is this size, then the question is, will your system classify this tumor as benign or malignant?

It turns out that in classification problems you can also have more than two possible output categories. Maybe your learning algorithm can output multiple types of cancer diagnosis if it turns out to be malignant. So let’s call two different types of cancer type 1 and type 2. In this case, the algorithm would have three possible output categories it could predict.

Classification algorithms predict categories

And by the way in classification,
the terms output classes and output categories are often
used interchangeably.

So what I say class or
category when referring to the output, it means the same thing.

So to summarize, classification algorithms predict categories. Categories don’t have to be numbers; they can be non-numeric. For example, an algorithm can predict whether a picture is that of a cat or a dog, and it can predict if a tumor is benign or malignant. Categories can also be numbers like 0 and 1, or 0, 1, and 2.

Regression vs. Classification

infinite number / small finite limited set of possible output categories

But what makes classification
different from regression when you’re interpreting the numbers
is that classification predicts a small finite limited set of possible
output categories such as 0, 1 and 2 but not all possible numbers
in between like 0.5 or 1.7.
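To make that contrast concrete, here is a minimal sketch of a classifier whose output is always a category, never a number in between. This is not the course's logistic regression (that comes in Week 3); it just learns a decision threshold halfway between the two class means, on invented tumor sizes:

```python
import numpy as np

# Toy labeled data: tumor sizes (cm) with labels 0 = benign, 1 = malignant.
# Values are invented for illustration.
sizes = np.array([0.5, 1.0, 1.2, 1.8, 2.5, 3.0, 3.5, 4.0])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# A minimal classifier: learn a threshold halfway between the mean
# benign size and the mean malignant size.
threshold = (sizes[labels == 0].mean() + sizes[labels == 1].mean()) / 2

def predict(size):
    """Return a category (0 or 1), never a value like 0.5 or 1.7."""
    return int(size > threshold)

print(predict(1.1))  # small lump -> 0 (benign)
print(predict(3.8))  # large lump -> 1 (malignant)
```

However large or small the input, the output is one of a small finite set of categories, which is exactly what distinguishes classification from regression.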

Two or more inputs for classification problems

In the example of supervised
learning that we’ve been looking at, we had only one input value
the size of the tumor. But you can also use more than one
input value to predict an output.

Input: tumor size, patient’s age

Here’s an example,
instead of just knowing the tumor size, say you also have each
patient’s age in years.

Your new dataset now has two inputs, age and tumor size. In this new dataset, we’re going to use circles to show patients whose tumors are benign and crosses to show patients whose tumors were malignant.

So when a new patient comes in, the doctor
can measure the patient’s tumor size and also record the patient’s age. And so given this, how can we predict if this patient’s
tumor is benign or malignant?

Find boundary line

Well, given a dataset like this, what the learning algorithm might do is find some boundary that separates out the malignant tumors from the benign ones. So the learning algorithm has to decide how to fit a boundary line through this data.

The boundary line found by the learning
algorithm would help the doctor with the diagnosis.

In this case, the tumor is more likely to be benign. From this example, we have seen how two inputs, the patient’s age and tumor size, can be used.

In other machine learning problems often
many more input values are required. My friends who worked on breast cancer
detection use many additional inputs, like the thickness of the tumor clump,
uniformity of the cell size, uniformity of the cell shape and so on.
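As a sketch of how such a boundary line might be learned from two inputs, here is a tiny logistic regression trained by gradient descent, a method the course covers in detail later. The tumor sizes and patient ages below are invented for illustration:

```python
import numpy as np

# Toy data: each row is (tumor size in cm, patient age); invented numbers.
X = np.array([[1.0, 30], [1.5, 45], [2.0, 35], [1.2, 60],   # benign
              [3.0, 50], [3.5, 65], [2.8, 70], [4.0, 55]])  # malignant
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Scale features so gradient descent behaves well (age >> size otherwise).
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Logistic regression by gradient descent: find w, b so that the line
# w @ x + b = 0 is the boundary separating the two classes.
w, b = np.zeros(2), 0.0
for _ in range(5000):
    z = Xs @ w + b
    p = 1 / (1 + np.exp(-z))          # predicted probability of malignant
    grad_w = Xs.T @ (p - y) / len(y)  # gradient of the log loss w.r.t. w
    grad_b = (p - y).mean()
    w -= 0.1 * grad_w
    b -= 0.1 * grad_b

def predict(size, age):
    x = (np.array([size, age]) - X.mean(axis=0)) / X.std(axis=0)
    return int(x @ w + b > 0)         # which side of the boundary line?

print(predict(1.3, 40))  # -> 0, on the benign side of the boundary
print(predict(3.6, 60))  # -> 1, on the malignant side
```

The learned line `w @ x + b = 0` plays the role of the boundary in the lecture's plot: a new patient's (size, age) point is classified by which side of that line it falls on.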

To recap, what is supervised learning?

So to recap supervised learning
maps input x to output y, where the learning algorithm learns
from the quote right answers.

The two major types of supervised learning are regression and classification.

In a regression application like
predicting prices of houses, the learning algorithm has to predict numbers from
infinitely many possible output numbers.

Whereas in classification, the learning algorithm has to make a prediction of a category out of a small set of possible outputs.

So you now know what is
supervised learning, including both regression and
classification. I hope you’re having fun.

Next, there’s a second major type of machine learning called unsupervised learning. Let’s go on to the next video to see what that is.

Quiz

Unsupervised learning part 1

Unsupervised learning: unlabeled data

After supervised learning, the most widely used form of machine
learning is unsupervised learning.

Let’s take a look at what that means. We’ve talked about supervised learning, and this video is about unsupervised learning.

But don’t let the name fool you: unsupervised learning is, I think, just as super as supervised learning.

When we were looking at supervised learning in the last video, recall it looked something like this in the case of a classification problem. Each example was associated with an output label y, such as benign or malignant, designated by the circles and crosses.

In unsupervised learning, we’re given data that isn’t associated with any output labels y. Say you’re given data on patients, their tumor size, and the patient’s age, but not whether the tumor was benign or malignant, so the dataset looks like this on the right. We’re not asked to diagnose whether the tumor is benign or malignant, because we’re not given any labels.

Unsupervised learning: Find something interesting in unlabeled data

Given this dataset, our job instead is to find some structure or some pattern, or just find something interesting in the data.

This is unsupervised learning. We call it unsupervised because we’re not trying to supervise the algorithm.

We don’t give the algorithm some quote “right answer” for every input; instead, we ask it to figure out all by itself what’s interesting, or what patterns or structures might be in this particular dataset.

An unsupervised learning algorithm can decide that the data can be assigned to different groups or clusters.

An unsupervised learning algorithm might decide that the data can be assigned to two different groups, or two different clusters. And so it might decide that there’s one cluster or group over here, and there’s another cluster or group over there.

This is a particular type of unsupervised
learning, called a clustering algorithm.

Because it places the unlabeled data into different clusters, and this turns out to be used in many applications.

Clustering

Cluster applications

Cluster: Used in Google News

Clustering: Group panda, twin, zoo

For example, clustering is used in Google News. What Google News does is, every day, go and look at hundreds of thousands of news articles on the internet and group related stories together.

For example, here is a sample from Google News, where the headline of the top article is “Giant panda gives birth to rare twin cubs at Japan’s oldest zoo.” This article actually caught my eye, because my daughter loves pandas, and so there are a lot of stuffed panda toys and a lot of watching of panda videos in my house. Looking at this, you might notice that below this are other related articles.

Maybe from the headlines alone, you can start to guess what clustering might be doing. Notice that the word panda appears here, here, here, here, and here, and notice that the word twin also appears in all five articles.

And the word zoo also appears
in all of these articles. So the clustering algorithm
is looking at all the hundreds of thousands of
news articles on the internet that day, finding the articles that mention similar
words, and grouping them into clusters.

The algorithm has to figure this out on its own, without supervision

Now, what’s cool is that this clustering
algorithm figures out on its own which words suggest that certain
articles are in the same group. What I mean is, there isn’t an employee at
Google News telling the algorithm to find articles with the words panda and twins and
zoo and put them into the same cluster. The news topics change every day, and there are so many news stories that
it just isn’t feasible to have people doing this every single day for
all the topics the news covers.

Instead, the algorithm has to figure
out on its own, without supervision, what the clusters
of news articles are today. So that’s why this clustering algorithm is a type of unsupervised
learning algorithm.
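The intuition that articles mentioning similar words belong together can be made concrete with a bag-of-words count. A minimal sketch, using made-up headlines (a real clustering algorithm would then group the most-overlapping count vectors):

```python
# Toy headlines, invented for illustration: two are about the panda story,
# one is unrelated.
headlines = [
    "giant panda gives birth to rare twin cubs at zoo",
    "panda twin cubs thriving at zoo says keeper",
    "stock markets rally after central bank decision",
]

# Vocabulary = every word seen anywhere; one count vector per headline
vocab = sorted({w for h in headlines for w in h.split()})
vectors = [[h.split().count(w) for w in vocab] for h in headlines]

def shared_words(a, b):
    """Number of word occurrences two count vectors have in common."""
    return sum(min(x, y) for x, y in zip(a, b))

print(shared_words(vectors[0], vectors[1]))  # panda articles overlap a lot
print(shared_words(vectors[0], vectors[2]))  # unrelated article overlaps little
```

No one tells the code which words matter; the overlap in the counts (panda, twin, cubs, at, zoo) is what pulls the related articles together.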

Cluster: Gene, DNA microarray

Let’s look at the second example
of unsupervised learning applied to clustering genetic or DNA data.

在这里插入图片描述

This image shows a picture
of DNA microarray data; these look like tiny
grids of a spreadsheet. Each tiny column represents
the genetic or DNA activity of one person. So for example, this entire column
here is from one person’s DNA, and this other column
is from another person. Each row represents a particular gene.

So just as an example, perhaps this
row here might represent a gene that affects eye color, or this row here is
a gene that affects how tall someone is. Researchers have even found a genetic
link to whether someone dislikes certain vegetables, such as broccoli,
brussels sprouts, or asparagus. So next time someone asks you why
didn’t you finish your salad, you can tell them,
maybe it’s genetic. These are DNA microarrays.

We don’t tell the algorithm in advance which group a person belongs to.

The idea is to measure how much
certain genes, are expressed for each individual person.

So these colors red, green, gray,
and so on, show the degree to which different individuals do, or
do not have a specific gene active.

And what you can then do is run
a clustering algorithm to group individuals into different categories, or different types of people: maybe
these individuals group together, and let’s just call this type one; these people
are grouped into type two; and these people are grouped as type three.

This is unsupervised learning,
because we’re not telling the algorithm in advance that there is a type one
person with certain characteristics, or a type two person with
certain characteristics. Instead, what we’re saying
is: here’s a bunch of data. I don’t know what the different
types of people are, but can you automatically
find structure in the data, and automatically figure out what
the major types of individuals are, since we’re not giving the algorithm the
right answer for the examples in advance.

Clustering: Grouping customers

Used in DeepLearning.AI company

在这里插入图片描述

This is unsupervised learning. Here’s the third example: many companies have huge databases of
customer information. Given this data, can you automatically
group your customers into different market segments so that you
can more efficiently serve your customers?

Concretely, the DeepLearning.AI team
did some research to better understand the DeepLearning.AI community, and why different individuals
take these classes, subscribe to The Batch weekly newsletter,
or attend our AI events.

Let’s visualize the DeepLearning.AI community as this collection of
people. Running clustering, that is, market segmentation, found
a few distinct groups of individuals.

One group’s primary motivation is
seeking knowledge to grow their skills; perhaps this is you, and so that’s great. A second group’s primary motivation is
looking for a way to develop their career; maybe you want to get a promotion or
a new job, or make some career progression. If this
describes you, that’s great too.

And yet another group wants to stay
updated on how AI impacts their field of work; perhaps this is you, and
that’s great too. This is a clustering that our team used
to try to better serve our community, as we tried to figure out what the major categories of learners
in the DeepLearning.AI community are. So if any of these is your top motivation for
learning, that’s great.

And I hope I’ll be able to help you on
your journey. Or, in case this is you, and you want something totally different
from the other three categories, that’s fine too, and I want you to know
I love you all the same. So to summarize, a clustering algorithm, which is a type of unsupervised
learning algorithm, takes data without labels and tries to
automatically group them into clusters.

And so maybe the next time you see or
think of a panda, maybe you think of clustering as well. And besides clustering, there are other
types of unsupervised learning as well. Let’s go on to the next video, to take a look at some other types
of unsupervised learning algorithms.

在这里插入图片描述

Unsupervised learning part 2

在这里插入图片描述

In the last video, you saw what is unsupervised learning, and one type of unsupervised
learning called clustering.

Formal definition of unsupervised learning

Let’s give a slightly
more formal definition of unsupervised learning and take a quick look at some other types of unsupervised learning
other than clustering.

The input data comes only with inputs x but not output labels y

The algorithm has to find some structure or pattern in the data.

Whereas in supervised learning, the data comes with
both inputs x and input labels y, in
unsupervised learning, the data comes only with inputs x but not output labels y, and the algorithm has to find some structure or some pattern or something interesting
in the data.

Clustering

Anomaly detection

Dimensionality reduction

We’re seeing just one example of unsupervised learning called
a clustering algorithm, which groups similar
data points together.

In this specialization,
you’ll learn about clustering as well as two other types of
unsupervised learning.

One is called anomaly detection, which is used to
detect unusual events. This turns out to be
really important for fraud detection in
the financial system, where unusual events,
unusual transactions could be signs of fraud and for
many other applications.
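As a small preview, unusual-event detection can be sketched with a simple statistical rule; this is a made-up example with an arbitrary threshold, not the method the course will teach later (which models the data’s distribution more carefully):

```python
import numpy as np

# Toy transaction amounts (invented); most are routine, one is unusual
amounts = np.array([20.0, 35.0, 18.0, 42.0, 25.0, 30.0, 5000.0])

# Flag points far from the mean, measured in standard deviations (z-score).
# The threshold of 2 is an arbitrary choice for this sketch.
z = np.abs(amounts - amounts.mean()) / amounts.std()
unusual = amounts[z > 2]
print(unusual)  # the 5000.0 transaction stands out
```

Again no labels are used: the notion of “unusual” comes entirely from how the data itself is distributed.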

You also learn about
dimensionality reduction. This lets you take a big data-set and almost
magically compress it to a much smaller data-set while losing as little
information as possible. In case anomaly detection and dimensionality
reduction don’t seem to make too much
sense to you yet. Don’t worry about
it. We’ll get to this later in the
specialization.
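Dimensionality reduction can also be previewed in a few lines. A minimal sketch of PCA via the SVD on made-up 2D points that mostly vary along one direction, compressing each point from two numbers to one (the course will cover this properly later):

```python
import numpy as np

# Toy data (invented): 2D points that lie close to a single line
X = np.array([[1.0, 1.1], [2.0, 2.1], [3.0, 2.9], [4.0, 4.2], [5.0, 4.8]])

# Project centered data onto its top principal component:
# keep 1 number per point instead of 2, losing little information.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
X_1d = Xc @ Vt[0]                                 # compressed representation
X_back = np.outer(X_1d, Vt[0]) + X.mean(axis=0)   # approximate reconstruction

print(X_1d.shape)                # (5,) -- half the numbers per example
print(np.abs(X_back - X).max())  # small reconstruction error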

A question to check my understanding.

Now, I’d like to ask you another question to help you
check your understanding, and no pressure; if
you don’t get it right on the first
try, that’s totally fine.

Please select any
of the following that you think are examples
of unsupervised learning. Two are unsupervised
examples and two are supervised learning
examples. Please take a look.

Quiz

My try

在这里插入图片描述

Maybe you remember the
spam filtering problem. If you have labeled data, that is, emails labeled as spam or
non-spam, you can treat this as a
supervised learning problem.

The second example, the
news story example, is exactly the
Google News panda example that you
saw in the last video. You can approach that using a clustering algorithm to
group news articles together; that would use
unsupervised learning.

The market segmentation example that I talked about a
little bit earlier. You can do that as an unsupervised learning
problem as well because you can give your algorithm
some data and ask it to discover market
segments automatically.

The final example on
diagnosing diabetes. Well, actually that’s a lot like our breast cancer example from the supervised
learning videos. Only instead of benign
or malignant tumors, we instead have diabetes
or not diabetes. You can approach this as a
supervised learning problem, just like we did for the breast tumor
classification problem.

Even though in the last video, we’ve talked mainly about
clustering, in later videos, in this specialization, we’ll
dive much more deeply into anomaly detection and
dimensionality reduction as well. That’s unsupervised learning.

Before we wrap up this section, I want to share
with you something that I find really exciting, and useful, which is the use of Jupyter Notebooks in
machine learning. Let’s take a look at
that in the next video.

Jupyter Notebooks

Introduction to Jupyter Notebooks

在这里插入图片描述

Lab: Python and Jupyter Notebooks

# print statements
variable = "right in the strings!"
print(f"f strings allow you to embed variables {variable}")

name = "Shizheng Li"
print(f"Coursera is a great platform for ML learner, this is {name} speaking")

[3] Practice Quiz: Supervised vs unsupervised learning

Latest Submission Grade 100%

在这里插入图片描述

在这里插入图片描述

[4] Regression Model

Linear regression model part 1

在这里插入图片描述

Supervised Learning

Regression model predicts numbers, Infinitely many possible outputs

Classification model predicts categories, Small number of possible outputs

在这里插入图片描述

In this video,
we’ll look at what the overall process of
supervised learning is like.

Specifically, you’ll see
the first model of this course, the linear
regression model. That just means fitting a
straight line to your data.

It’s probably the most widely used learning
algorithm in the world today. As you get familiar
with linear regression, many of the concepts you
see here will also apply to other machine
learning models that you’ll see later in
this specialization.

Linear regression Example: Predict the price of a house

Let’s start with a
problem that you can address using linear regression. Say you want to
predict the price of a house based on the
size of the house. This is the example we’ve
seen earlier this week. We’re going to use a dataset on house sizes and
prices from Portland, a city in the United States.

Here we have a graph where the horizontal axis is the size of the house
in square feet, and the vertical axis is the price of a house in
thousands of dollars. Let’s go ahead and plot the data points for various
houses in the dataset.

Here each data point, each of these little
crosses is a house with the size and the price that it most recently was sold for. Now, let’s say you’re
a real estate agent in Portland and you’re helping
a client to sell her house.

She is asking you, how much do you think I can
get for this house?

This dataset might help you estimate the price
she could get for it.

You start by measuring
the size of the house, and it turns out that the
house is 1250 square feet. How much do you think this
house could sell for?

One thing you could do is build a linear regression model
from this dataset. Your model will fit a
straight line to the data, which might look like this. Based on this straight
line fit to the data, you can see that if the house
is 1250 square feet, it will intersect the
best fit line over here, and if you trace that to the
vertical axis on the left, you can see the price
is maybe around here, say about $220,000.
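This read-off-the-line step can be sketched numerically. The house sizes and prices below are made up to stand in for the Portland dataset (the video’s actual data is not reproduced here); `np.polyfit` with degree 1 fits the straight line price = w * size + b:

```python
import numpy as np

# Invented (size in sqft, price in $1000s) pairs standing in for the dataset
sizes  = np.array([1000, 1500, 2000, 2500, 3000])
prices = np.array([180,  260,  330,  410,  490])

# Fit a straight line price = w * size + b (degree-1 polynomial)
w, b = np.polyfit(sizes, prices, 1)

# Read the fitted line at 1250 square feet
estimate = w * 1250 + b
print(f"estimated price: about ${estimate:.0f} thousand")
```

With this toy data the estimate lands near $220 thousand, mirroring the video’s example; a different dataset would of course give a different line.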

This is an example of what’s called a supervised
learning model. We call this supervised learning because you first
train a model by giving it data that has the right
answers, because you give the model examples of houses with both the
size of the house and the price that the model should
predict for each house. That is, the prices, the right answers, are given for every house in the dataset.

This linear regression model is a particular type of
supervised learning model. It’s called regression model
because it predicts numbers as the output like
prices in dollars. Any supervised learning
model that predicts a number such as 220,000 or 1.5 or negative 33.2 is addressing what’s called
a regression problem.

Linear regression is one
example of a regression model. But there are other models for addressing regression
problems too. We’ll see some of those later in Course 2 of this specialization. Just to remind you, in contrast with the
regression model, the other most common type of supervised learning model is called a classification model.

A classification model predicts categories or
discrete categories, such as predicting if
a picture is of a cat, meow, or a dog, woof, or, if given a
medical record, predicting whether a patient
has a particular disease. You’ll see more about classification models later
in this course as well.

As a reminder about the difference between
classification and regression, in classification, there are only a small number
of possible outputs. If your model is recognizing
cats versus dogs, that’s two possible outputs. Or maybe you’re trying
to recognize any of 10 possible medical
conditions in a patient, so there’s a discrete, finite set of possible outputs.

We call that a classification problem, whereas in regression, there are infinitely
many possible numbers that the model could output.

在这里插入图片描述

In addition to visualizing this data as a plot
here on the left, there’s one other
way of looking at the data that would be useful, and that’s a data table
here on the right.

The data comprises
a set of inputs. This would be the
size of the house, which is this column here. It also has outputs. You’re trying to
predict the price, which is this column here.

Notice that the horizontal
and vertical axes correspond to these two columns, the size and the price. If you have, say, 47
rows in this data table, then there are 47 of these little crosses on
the plot of the left, each cross corresponding
to one row of the table.

For example, the first row of the table is a
house with size, 2,104 square feet, so that’s around here, and this house is sold for
$400,000 which is around here. This first row of
the table is plotted as this data point over here.

Notation

Terminology

在这里插入图片描述

Now, let’s look at some notation for
describing the data. This is notation that you’ll find useful throughout your
journey in machine learning. As you get increasingly familiar with machine
learning terminology, this will be
terminology you can use to talk about machine
learning concepts with others as well,
since a lot of this is quite
standard across AI. You’ll be seeing this notation multiple times in
this specialization, so it’s okay if you don’t remember everything
the first time through; it will naturally become
more familiar over time.

Training set

The dataset that you
just saw and that is used to train the model
is called a training set.

Note that your client’s
house is not in this dataset because
it’s not yet sold, so no one knows
what the price is.

To predict the price of
your client’s house, you first train your
model to learn from the training set and
that model can then predict your client’s
houses price.

Input variable = feature

Output variable = target

In machine learning, the
standard notation to denote the input here is lowercase x, and we call this
the input variable; it’s also called a feature
or an input feature.

For example, for the first
house in your training set, x is the size of the house, so x equals 2,104. The standard notation to denote the output variable which
you’re trying to predict, which is also sometimes
called the target variable, is lowercase y. Here, y is the
price of the house, and for the first
training example, this is equal to 400, so y equals 400.

M: total number of training examples

The dataset has one row for each house, and in
this training set, there are 47 rows, with each row representing a
different training example. We’re going to use
lowercase m to refer to the total number
of training examples, and so here m is equal to 47.

x superscript in parenthesis i: $x^{(i)}, y^{(i)}$

To indicate a single
training example, we’re going to use the
notation parentheses (x, y).

For the first training
example, x, y, this pair of numbers
is 2,104, 400. Now we have a lot of
different training examples. We have 47 of them in fact. To refer to a specific
training example, this will correspond to a specific row in this
table on the left, I’m going to use the notation x superscript in parenthesis, i, y superscript
in parentheses i.

The superscript tells us that this is the ith
training example, such as the first, second, or third up to the
47th training example. I here, refers to a
specific row in the table.

For instance, here is
the first example, when i equals 1 in
the training set, and so x superscript 1 is equal to 2,104 and y superscript 1 is equal to 400 and let’s add this
superscript 1 here as well.

Superscript i in parentheses is not exponentiation

Just to note, this superscript i in parentheses is
not exponentiation.

When I write this, this is not x squared. This is not x to the power 2. It just refers to the
second training example.

This i, is just an index into the training set and refers
to row i in the table. In this video, you saw what
a training set is like, as well as a standard notation for describing
this training set.

In the next video, let’s look at how
to take this training set that
you just saw and feed it to a learning algorithm so that the algorithm can
learn from this data. Let’s see that in
the next video.

Linear regression model part 2

在这里插入图片描述

Let’s look in this video at the process of how
supervised learning works. Supervised learning algorithm
will input a dataset and then what exactly does it
do and what does it output? Let’s find out in this video.

Recall that a training set in supervised learning includes
both the input features, such as the size
of the house, and also the output targets, such as the price of the house. The output targets are the right answers that the
model will learn from.

To train the model, you feed the training set, both the input features and the output targets to
your learning algorithm.

Learning algorithm: function = hypothesis = f

Then your supervised
learning algorithm will produce some function. We’ll write this
function as lowercase f, where f stands for function. Historically, this
function used to be called a hypothesis, but I’m just going to call it
a function f in this class.

The job of f is
to take a new input x and output an estimate
or a prediction, which I’m going to call y-hat, and it’s written like the variable y with this
little hat symbol on top.

y hat = prediction = estimated value

y = target = actual true value in the training set

In machine learning,
the convention is that y-hat is the estimate or
the prediction for y.

The function f is
called the model.

x is called the input
or the input feature, and the output of the model
is the prediction, y-hat.

The model’s prediction is
the estimated value of y.

When the symbol is
just the letter y, then that refers to the target, which is the actual true
value in the training set.

In contrast, y-hat
is an estimate. It may or may not be
the actual true value.

Well, if you’re
helping your client to sell the house, well, the true price of the house is unknown until they sell it.

Your model f, given the size, outputs the price, which
is the estimate, that is, the prediction of
what the true price will be.

How to represent the function f?

Now, when we design a
learning algorithm, a key question is, how are we going to
represent the function f?

Or in other words, what is the math formula we’re
going to use to compute f?

For now, let’s stick with
f being a straight line. Your function can
be written as f_w, b of x equals, I’m going to use w times x plus b.

I’ll define w and b soon. But for now, just know
that w and b are numbers, and the values chosen for
w and b will determine the prediction y-hat based
on the input feature x.

This f_w b of x means f is a function
that takes x as input, and depending on the
values of w and b, f will output some value
of a prediction y-hat.

As an alternative
to writing this f_w, b of x, I’ll sometimes just
write f of x without explicitly including w
and b in the subscript. It’s just a simpler
notation that means exactly the same
thing as f_w, b of x.

Plot the training set on the graph

Let’s plot the training set on the graph where the
input feature x is on the horizontal axis and the output target y is
on the vertical axis.

Remember, the algorithm
learns from this data and generates the best-fit line
like maybe this one here. This straight line is
the linear function f_w b of x equals
w times x plus b. Or more simply, we can
drop w and b and just write f of x equals wx plus b.

Just choose a linear function as f here

Here’s what this
function is doing: it’s making predictions
for the value of y using a straight line
function of x.

You may ask, why are we
choosing a linear function, where linear function is
just a fancy term for a straight line instead of some non-linear function
like a curve or a parabola?

Well, sometimes you want to fit more complex non-linear
functions as well, like a curve like this. But since this
linear function is relatively simple and
easy to work with, let’s use a line as a foundation that
will eventually help you to get to more complex
models that are non-linear.

Univariate linear regression = linear regression with one variable

This particular
model has a name, it’s called linear regression.

More specifically, this is linear regression
with one variable, where the phrase one
variable means that there’s a single input
variable or feature x, namely the size of the house.

Another name for a
linear model with one input variable is
univariate linear regression, where uni means one in Latin, and where variate
means variable.

Univariate is just a fancy
way of saying one variable.

In a later video, you’ll also see a variation
of regression where you’ll want to make a
prediction based not just on the size of a house, but on a bunch of other
things that you may know about the house
such as number of bedrooms and other features.

By the way, when you’re
done with this video, there is another optional lab. You don’t need to
write any code. Just review it, run the
code and see what it does. That will show you
how to define in Python a straight line function.

The lab will let you
choose the values of w and b to try to fit
the training data. You don’t have to do the
lab if you don’t want to, but I hope you play
with it when you’re done watching this video.

That’s linear regression. In order for you
to make this work, one of the most important
things you have to do is construct a cost function. The idea of a cost
function is one of the most universal
and important ideas in machine learning, and is used in both
linear regression and in training many of the most advanced AI models in the world. Let’s go on to the next
video and take a look at how you can construct
a cost function.

Quiz

在这里插入图片描述

Optional lab: Model representation

In this ungraded lab, you can see how a linear regression model is defined in code, and you can see plots that show how well a model fits some data given choices of w and b. You can also try different values of w and b to see if it improves the fit to the data.

在这里插入图片描述

# x_train is the input variable (size in 1000 square feet)
# y_train is the target (price in 1000s of dollars)
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(f"x_train = {x_train}")
print(f"y_train = {y_train}")

Number of training examples m

You will use m to denote the number of training examples. NumPy arrays have a .shape attribute. x_train.shape returns a Python tuple with an entry for each dimension. x_train.shape[0] is the length of the array, i.e. the number of examples, as shown below.

# m is the number of training examples
print(f"x_train.shape: {x_train.shape}")
m = x_train.shape[0]
print(f"Number of training examples is: {m}")

Output

x_train.shape: (2,)
Number of training examples is: 2

One can also use the Python len() function as shown below.

# m is the number of training examples
m = len(x_train)
print(f"Number of training examples is: {m}")

Output

Number of training examples is: 2

Training example x_i, y_i

You will use $(x^{(i)}, y^{(i)})$ to denote the $i^{th}$ training example. Since Python is zero-indexed, $(x^{(0)}, y^{(0)})$ is (1.0, 300.0) and $(x^{(1)}, y^{(1)})$ is (2.0, 500.0).

To access a value in a NumPy array, one indexes the array with the desired offset. For example, the syntax to access location zero of x_train is x_train[0].
Run the next code block below to get the $i^{th}$ training example.

i = 0 # Change this to 1 to see (x^1, y^1)

x_i = x_train[i]
y_i = y_train[i]
print(f"(x^({i}), y^({i})) = ({x_i}, {y_i})")

Plotting data

You can plot these two points using the scatter() function in the matplotlib library, as shown in the cell below.

  • The function arguments marker and c show the points as red crosses (the default is blue dots).

You can use other functions in the matplotlib library to set the title and labels to display

# Plot the data points
plt.scatter(x_train, y_train, marker='x', c='r')
# Set the title
plt.title("Housing Prices")
# Set the y-axis label
plt.ylabel('Price (in 1000s of dollars)')
# Set the x-axis label
plt.xlabel('Size (1000 sqft)')
plt.show()

在这里插入图片描述

w = 100
b = 100
print(f"w: {w}")
print(f"b: {b}")

Now, let’s compute the value of $f_{w,b}(x^{(i)})$ for your two data points. You can explicitly write this out for each data point as -

for $x^{(0)}$, f_wb = w * x[0] + b

for $x^{(1)}$, f_wb = w * x[1] + b

For a large number of data points, this can get unwieldy and repetitive. So instead, you can calculate the function output in a for loop as shown in the compute_model_output function below.

Note: The argument description (ndarray (m,)) describes a NumPy n-dimensional array of shape (m,). (scalar) describes an argument without dimensions, just a magnitude.
Note: np.zeros(n) will return a one-dimensional NumPy array with $n$ entries

The Python function compute_model_output

def compute_model_output(x, w, b):
    """
    Computes the prediction of a linear model
    Args:
      x (ndarray (m,)): Data, m examples 
      w,b (scalar)    : model parameters  
    Returns:
      f_wb (ndarray (m,)): model predictions
    """
    m = x.shape[0]
    f_wb = np.zeros(m)
    for i in range(m):
        f_wb[i] = w * x[i] + b
        
    return f_wb
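As an aside (not part of the lab itself): because NumPy arithmetic is elementwise, the same model output can be computed without an explicit loop. A minimal sketch; `compute_model_output_vectorized` is a name invented here, not from the lab:

```python
import numpy as np

def compute_model_output_vectorized(x, w, b):
    """Same result as the loop version: w * x + b applied to every entry of x."""
    return w * x + b

x_train = np.array([1.0, 2.0])
print(compute_model_output_vectorized(x_train, 200, 100))  # [300. 500.]
```

The loop version in the lab makes the per-example computation explicit, which is useful pedagogically; the vectorized form is what you would typically write in practice.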

Now let’s call the compute_model_output function and plot the output…

tmp_f_wb = compute_model_output(x_train, w, b)

# Plot our model prediction
plt.plot(x_train, tmp_f_wb, c='b',label='Our Prediction')

# Plot the data points
plt.scatter(x_train, y_train, marker='x', c='r',label='Actual Values')

# Set the title
plt.title("Housing Prices")
# Set the y-axis label
plt.ylabel('Price (in 1000s of dollars)')
# Set the x-axis label
plt.xlabel('Size (1000 sqft)')
plt.legend()
plt.show()

Output

在这里插入图片描述

As you can see, setting $w = 100$ and $b = 100$ does not result in a line that fits our data.

Challenge

Try experimenting with different values of $w$ and $b$. What should the values be for a line that fits our data?

Tip:

You can use your mouse to click on the triangle to the left of the green “Hints” below to reveal some hints for choosing b and w.

We try w = 200, b = 100 to retrain our model

在这里插入图片描述

Prediction

Now that we have a model, we can use it to make our original prediction. Let’s predict the price of a house with 1200 sqft. Since the units of $x$ are in 1000’s of sqft, $x$ is 1.2.

w = 200                         
b = 100    
x_i = 1.2
cost_1200sqft = w * x_i + b    

print(f"${cost_1200sqft:.0f} thousand dollars")

Output

$340 thousand dollars

Cost function formula

Cost function: will tell us how well the model is doing

In order to implement
linear regression, the first key step is to define something called
a cost function. This is something we’ll
build in this video, and the cost function
will tell us how well the model is doing so that we can try to get
it to do better.

Parameters of model: variables you can adjust during training in order to improve the model

Let’s look at what this means. Recall that you have a
training set that contains input features x and
output targets y.

The model you’re going to
use to fit this training set is this linear function f_w, b of x equals to
w times x plus b.

To introduce a little bit
more terminology the w and b are called the
parameters of the model. In machine learning
parameters of the model are the variables you
can adjust during training in order to
improve the model.

Sometimes you also hear
the parameters w and b referred to as coefficients
or as weights.

在这里插入图片描述

Linear regression coefficients: w, b

在这里插入图片描述

Take a look at some plots of f of x on a chart

Now let’s take a look at what these parameters w and b do. Depending on the values
you’ve chosen for w and b you get a different
function f of x, which generates a different
line on the graph.

Remember that we
can write f of x as a shorthand for f_w, b of x. We’re going to take
a look at some plots of f of x on a chart. Maybe you’re already familiar with drawing lines on charts, but even if this is
a review for you, I hope this will help
you build intuition on how the parameters w and b determine f.
Examples

When w is equal
to 0 and b is equal to 1.5, then f looks like
this horizontal line. In this case, the function
f of x is 0 times x plus 1.5, so f is always
a constant value. It always predicts 1.5 for the estimated value of y.
y-hat is always equal to b, and here b is also called the y-intercept because
that’s where it crosses the vertical axis, or
the y-axis, on this graph.

As a second example, if w is 0.5 and b is equal to 0, then f of x is 0.5 times x. When x is 0, the prediction is also 0, and when x is 2, then the prediction is
0.5 times 2, which is 1. You get a line that looks
like this, and notice that the slope is 0.5 divided by 1. The value of w
gives you the slope of the line, which is 0.5.

Finally, if w equals
0.5 and b equals 1, then f of x is 0.5 times
x plus 1 and when x is 0, then f of x equals b, which is 1 so the
line intersects the vertical axis at
b, the y intercept. Also when x is 2, then f of x is 2, so the line looks like this. Again, this slope
is 0.5 divided by 1 so the value of w gives
you the slope which is 0.5.
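The three cases above can be checked numerically; a quick sketch of f_w,b(x) = w * x + b evaluated with the parameter choices from the slides:

```python
def f_wb(x, w, b):
    """Linear model f_{w,b}(x) = w * x + b."""
    return w * x + b

print(f_wb(3.0, w=0,   b=1.5))  # 1.5 -> horizontal line, always predicts b
print(f_wb(2.0, w=0.5, b=0))    # 1.0 -> slope 0.5, line through the origin
print(f_wb(2.0, w=0.5, b=1))    # 2.0 -> slope 0.5, y-intercept 1
```

Note how b shifts the whole line up and down (the y-intercept), while w controls how steeply the prediction grows with x (the slope).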

Cost function

在这里插入图片描述

Choose values for the parameters so the function can fit the data well.

Recall that you have a training set like
the one shown here. With linear regression, what you want to do is to choose values for the parameters w and b so that the straight
line you get from the function f somehow
fits the data well.

Like maybe this line shown here. When I say that the line
fits the data visually, you can take this
to mean that the line defined by f is roughly passing through, or somewhere close
to, the training examples, as compared to other
possible lines that are not as close
to these points.

Just to remind you
of some notation, a training example
like this point here is defined by
x superscript i, y superscript i where
y is the target.

For a given input x^i, the function f also makes
a prediction for the value of y, and the value that
it predicts for y is y hat i, shown here.

For our choice of model, f
of x^i is w times x^i plus b. Stated differently,
the prediction y hat i is f_w, b of x^i, where for the model we’re using, f of x^i is equal to w x^i plus b.

How do you find values for w and b so that the prediction y hat i is close to the true target?

Now the question is how
do you find values for w and b so that the
prediction y hat i is close to the true target y^i for many or maybe all training
examples x^i, y^i.

Let’s take a look at how to measure how well a line fits the training data

Construct a cost function

To answer that question, let’s first take
a look at how to measure how well a line
fits the training data.

To do that, we’re going to
construct a cost function.

Error: measuring how far off the prediction is from the target.

The cost function takes the
prediction y hat and compares it to the target y by
taking y hat minus y.

This difference is called the error; we're measuring how far off the prediction is from the target.

Sum up the squared errors

Next, let's compute the square of this error.

Also, we’re going to want
to compute this term for different training examples
i in the training set.

When measuring the error, for example i, we’ll compute this
squared error term.

Finally, we want to measure the error across the
entire training set. In particular, let’s sum up
the squared errors like this. We’ll sum from i equals 1,2, 3 all the way up to m and remember that m is the
number of training examples, which is 47 for this dataset.

Notice that if we have more
training examples m is larger and your cost function will calculate a bigger number. This is summing
over more examples.

Compute the average squared error

To build a cost function that doesn’t automatically get bigger as the training set size
gets larger by convention, we will compute the average
squared error instead of the total squared
error and we do that by dividing by m like this.

We’re nearly there.
Just one last thing. By convention, the cost function that
machine learning people use actually divides by 2 times m.

The extra division by 2 is just meant
to make some of our later calculations
look neater, but the cost function
still works whether you include this division
by 2 or not.

Cost function = Squared error cost function

This expression right here is the cost function and
we’re going to write J of wb to refer to
the cost function. This is also called the
squared error cost function, and it’s called this
because you’re taking the square of these error terms.

In machine learning
different people will use different
cost functions for different applications, but the squared error cost
function is by far the most commonly used one for linear regression
and for that matter, for all regression
problems where it seems to give good results
for many applications.

Just as a reminder,
the prediction y hat is equal to the outputs
of the model f at x. We can rewrite the
cost function J of wb as 1 over 2m times the sum from i
equals 1 to m of f of x^i minus y^i the
quantity squared.
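The squared error cost just described can be written compactly with NumPy. This is a vectorized sketch (the optional lab later in these notes implements the same formula with an explicit loop); `compute_cost_vectorized` and the toy data are illustrative names, not part of the course code.

```python
import numpy as np

def compute_cost_vectorized(x, y, w, b):
    """Squared error cost J(w,b) = (1/2m) * sum((w*x + b - y)^2)."""
    m = x.shape[0]
    errors = (w * x + b) - y          # prediction minus target, for every example at once
    return np.sum(errors ** 2) / (2 * m)

# Toy check: a line that fits perfectly has zero cost.
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])        # exactly y = 2x
print(compute_cost_vectorized(x, y, w=2.0, b=0.0))   # 0.0
```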

Eventually we are going to want to make the cost function small

Eventually we’re going to
want to find values of w and b that make the
cost function small.

Build intuition about what it means if cost function J is large versus if J is small.

Let’s do it in the next video.

But before going there, let’s first gain more
intuition about what J of wb is really computing. At this point you might
be thinking we’ve done a whole lot of math to
define the cost function.

But what exactly is it doing? Let’s go on to the next
video where we’ll step through one example of
what the cost function is really computing
that I hope will help you build
intuition about what it means if J of wb is large
versus if the cost j is small. Let’s go on to the next video.

Quiz

在这里插入图片描述

Cost function intuition

Walk through examples to see how the cost function can be used to find the best parameters for the model.

We've seen the mathematical definition of the cost function.

Now, let’s build some intuition about what the cost
function is really doing. In this video, we’ll walk
through one example to see how the cost function can be used to find the best parameters
for your model. I know this video is a little bit longer than the others, but bear with me, I think it'll be worth it.

To recap, here’s what we’ve seen about the cost
function so far.

You want to fit a straight
line to the training data, so you have this model, fw, b of x is w times x, plus b.

Here, the model’s
parameters are w, and b.

Now, depending on the values
chosen for these parameters, you get different
straight lines like this.

You want to find values for w, and b, so that the straight line fits
the training data well.

To measure how well
a choice of w, and b fits the training data, you have a cost function J.

What the cost
function J does is, it measures the difference between the model’s predictions, and the actual
true values for y.

在这里插入图片描述

What you'll see later is that linear regression will try to find values for w and b that make J of w, b as small as possible. In math, we write it like this: we want to minimize J as a function of w and b. Now, in order for us to better visualize the cost function J, let's work with a simplified version of the linear regression model.

We’re going to use
the model fw of x, is w times x. You can think of this as taking the original model on the left, and getting rid of
the parameter b, or setting the
parameter b equal to 0. It just goes away
from the equation, so f is now just w times x. You now have just
one parameter w, and your cost function J, looks similar to
what it was before.

Taking the difference, and squaring it, except now, f is equal to w times x^i, and J is now a function of just w. The goal becomes a little bit different as well, because you have just one parameter, w, not w and b.

Simplified model: one parameter w

With this simplified model, the goal is to find
the value for w, that minimizes J of w.

To see this visually, what this means is
that if b is set to 0, then f defines a line
that looks like this. You see that the line passes
through the origin here, because when x is 0, f of x is 0 too. Now, using this
simplified model, let’s see how the cost
function changes as you choose different values for the
parameter w.

In particular, let’s look at graphs
of the model f of x, and the cost function J.

Plot the model and cost function side by side

I’m going to plot
these side-by-side, and you’ll be able to see
how the two are related.

First, notice that for f subscript w, when the parameter w is fixed, that is, always a constant value, then fw is only a function of x, which means that the estimated value of y depends on the value of the input x.

In contrast, looking
to the right, the cost function J, is a function of w, where w controls the slope
of the line defined by f w.

The cost defined by J, depends on a parameter, in this case, the parameter
w. Let’s go ahead, and plot these functions, fw of x, and J of w side-by-side so you can
see how they are related.

We’ll start with the model, that is the function
fw of x on the left. Here, the input feature x is on the horizontal axis, and the output value y is on the vertical axis. Here's the plot of three points representing the training set at positions (1, 1), (2, 2), and (3, 3). Let's pick a value for w. Say w is 1. For this choice of w, the function fw defines this straight line with a slope of 1.

Now, what you can do
next is calculate the cost J when w equals 1. You may recall that
the cost function is defined as follows; it's the squared error cost function. If you substitute fw(X^i)
with w times X^i, the cost function
looks like this. Where this expression is
now w times X^i minus Y^i. For this value of w, it turns out that the error term inside the cost function, this w times X^i minus Y^i is equal to 0 for each
of the three data points. Because for this data-set, when x is 1, then y is 1. When w is also 1, then f(x) equals 1, so f(x) equals y for this
first training example, and the difference is 0. Plugging this into
the cost function J, you get 0 squared.

Similarly, when x is 2, then y is 2, and f(x) is also 2. Again, f(x) equals y, for the second training example. In the cost function, the squared error for the second example
is also 0 squared.

Finally, when x is 3, then y is 3 and f(3) is also 3. In a cost function the third squared error
term is also 0 squared.

For all three examples
in this training set, f(X^i) equals Y^i for
each training example i, so f(X^i) minus Y^i is 0. For this particular data-set, when w is 1, then the cost J is equal to 0.

Now, what you can do on the right is plot
the cost function J. Notice that because
the cost function is a function of
the parameter w, the horizontal axis is
now labeled w and not x, and the vertical axis
is now J and not y. You have J(1) equals to 0. In other words, when w equals 1, J(w) is 0, so let me go ahead
and plot that.

Now, let’s look at how
F and J change for different values of w. W can
take on a range of values, so w can take on
negative values, w can be 0, and it can take
on positive values too.

w = 1

在这里插入图片描述

w = 0.5

What if w is equal
to 0.5 instead of 1, what would these
graphs look like then?

Let's go ahead and plot that. Let's set w to be equal to 0.5, and in this case, the function f(x) now looks like this; it's a line with a slope equal to 0.5. Let's also compute the cost J when w is 0.5.

Recall that the cost function is measuring the squared error or difference between the estimated value, that is y hat i, which is f(X^i), and the true value, that is Y^i, for each example i.

Visually you can see that the error or difference
is equal to the height of this vertical line here
when x is equal to 1. Because this lower line is the gap between the actual value of y and the value that
the function f predicted, which is a bit
further down here.

For this first example, when x is 1, f(x) is 0.5. The squared error on
the first example is 0.5 minus 1 squared. Remember the cost function, we’ll sum over all the training examples in
the training set.

Let’s go on to the
second training example. When x is 2, the model is predicting f(x) is 1 and the actual
value of y is 2. The error for the second
example is equal to the height of this little
line segment here, and the squared error is the square of the length
of this line segment, so you get 1 minus 2 squared. Let’s do the third example.

Repeating this process,
the error here, also shown by this line segment, is 1.5 minus 3 squared. Next, we sum up all
of these terms, which turns out to
be equal to 3.5. Then we multiply this
term by 1 over 2m, where m is the number
of training examples. Since there are three
training examples m equals 3, so this is equal to
1 over 2 times 3, where this m here is 3. If we work out the math, this turns out to be
3.5 divided by 6. The cost J is about 0.58.

Let’s go ahead and plot that
over there on the right.

在这里插入图片描述

Consider the simplest form of linear regression, where y is a function of w only, ignoring b.

Different choices of w yield different values of the cost function J.

在这里插入图片描述

Now, let’s try one
more value for w. How about if w equals 0? What do the graphs for f and J look like when w is equal to 0? It turns out that
if w is equal to 0, then f of x is just this horizontal line that
is exactly on the x-axis.

The error for each example is a line that goes
from each point down to the horizontal line that
represents f of x equals 0. The cost J when w equals 0 is 1 over 2m
times the quantity, 1^2 plus 2^2 plus 3^2, and that’s equal to
1 over 6 times 14, which is about 2.33.

Let’s plot this point
where w is 0 and J of 0 is 2.33 over here. You can keep doing this
for other values of w. Since w can be any number, it can also be a negative value. If w is negative 0.5, then the line f is a
downward-sloping line like this.

It turns out that
when w is negative 0.5 then you end up with
an even higher cost, around 5.25, which is
this point up here.

You can continue computing
the cost function for different values of w and
so on and plot these. It turns out that by
computing a range of values, you can slowly trace out
what the cost function J looks like and that’s what J is.
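The trace just described can be reproduced numerically. A minimal sketch using the lecture's three training points (1, 1), (2, 2), (3, 3) and the simplified model f(x) = w·x; the specific cost values match the ones worked out above.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
m = len(x)

def J(w):
    """Cost for the simplified model f(x) = w * x (b fixed at 0)."""
    return np.sum((w * x - y) ** 2) / (2 * m)

# The specific values worked out in the lecture:
print(round(J(1.0), 2))    # 0.0  -> perfect fit
print(round(J(0.5), 2))    # 0.58
print(round(J(0.0), 2))    # 2.33
print(round(J(-0.5), 2))   # 5.25

# Tracing out the curve: evaluate J over a range of w values
# and find the w at which the cost is smallest.
ws = np.linspace(-0.5, 2.5, 301)
costs = np.array([J(w) for w in ws])
best_w = ws[np.argmin(costs)]      # close to 1, the minimizing value
```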

To recap, each value of parameter w corresponds to
different straight line fit, f of x, on the
graph to the left.

For the given training set, that choice for a value
of w corresponds to a single point on the graph on the right
because for each value of w, you can calculate the
cost J of w.

For example, when w equals 1, this corresponds to this
straight line fit through the data and it also corresponds to this
point on the graph of J, where w equals 1 and the
cost J of 1 equals 0. Whereas when w equals 0.5, this gives you this line
which has a smaller slope.

This line in combination with the training set corresponds to this point on the cost function
graph at w equals 0.5. For each value of
w you wind up with a different line and its
corresponding costs, J of w, and you can use these points to trace out this
plot on the right.

Given this, how can you
choose the value of w that results in
the function f, fitting the data well? Well, as you can imagine, choosing a value of
w that causes J of w to be as small as possible
seems like a good bet.

Transitioning to finding the parameters that minimize the cost function.

在这里插入图片描述

Minimize the cost function

J is the cost function that measures how big the
squared errors are, so choosing w that minimizes
these squared errors, makes them as small as possible, will give us a good model. In this example, if you were to choose the value
of w that results in the smallest
possible value of J of w you’d end up
picking w equals 1.

As you can see, that’s
actually a pretty good choice. This results in
the line that fits the training data very well. That’s how in linear regression
you use the cost function to find the value of
w that minimizes J.

In the more general
case where we had parameters w and b
rather than just w, you find the values of w
and b that minimize J.

To summarize, you
saw plots of both f and J and worked through
how the two are related. As you vary w, or vary w and b, you end up with different straight lines, and when that straight line passes close to the data, the cost J is small.

The goal of linear
regression is to find the parameters w or w and b that results in the smallest possible value
for the cost function J. Now in this video, we worked through
our example with a simplified problem using
only w.

About the next video

In the next video, let’s visualize what the
cost function looks like for the full version of linear
regression using both w and b. You'll see some cool 3D plots. Let's go to the next video.

Quiz

在这里插入图片描述

Visualizing the cost function

In the last video, you
saw one visualization of the cost function J
of w or J of w, b.

Let’s look at some further richer visualizations
so that you can get an even better intuition about what the cost
function is doing.

Here is what we’ve seen so far. There’s the model, the
model’s parameters w and b, the cost function J of w and b, as well as the goal
of linear regression, which is to minimize the
cost function J of w and b over parameters w and b. In the last video, we had temporarily set b to zero in order to simplify
the visualizations.

Get a visual understanding of the model function

Now, let’s go back to
the original model with both parameters w and b without setting b
to be equal to 0.

Same as last time, we want to get a
visual understanding of the model function, f of x, shown here on the left, and how it relates to the
cost function J of w, b, shown here on the right.

Here’s a training set of
house sizes and prices.

Let’s say you pick
one possible function of x, like this one. Here, I’ve set w
to 0.06 and b to 50. f of x is 0.06
times x plus 50. Note that this is not a particularly good model for this training set; it's actually a pretty bad model.

It seems to consistently
underestimate housing prices. Given these values for w
and b let’s look at what the cost function J of
w and b may look like. Recall what we saw last time
was when you had only w, because we temporarily set b
to zero to simplify things, but then we had come up with
a plot of the cost function that look like this as
a function of w only. When we had only
one parameter, w, the cost function had
this U-shaped curve, shaped a bit like a soup bowl. That sounds delicious.

Now, in this housing
price example that we have on this slide, we have two parameters, w and b.

在这里插入图片描述

在这里插入图片描述

The cost function has a similar shape like a soup bowl.

3D surface plot

The plot becomes a little more complex. It turns out that
the cost function also has a similar
shape like a soup bowl, except in three dimensions
instead of two.

In fact, depending on
your training set, the cost function will
look something like this. To me, this looks
like a soup bowl, maybe because I’m a
little bit hungry, or maybe to you it looks like a curved dinner
plate or a hammock. Actually that sounds
relaxing too, and there’s your coconut drink.

Maybe when you're done with this course, you should treat yourself to a vacation and relax in a hammock like this.

What you see here is
a 3D-surface plot where the axes are
labeled w and b.

As you vary w and b, which are the two
parameters of the model, you get different values for the cost function J of w, and b. This is a lot like
the U-shaped curve you saw in the last video, except instead of having one parameter w as input to J, you now have two parameters, w and b, as inputs into this soup bowl or this hammock-shaped function J.

I just want to point out
that any single point on this surface represents
some particular choice of w and b. For example, if w was minus 10 and b was minus 15, then the height of the surface above this point is the value of J when w is minus 10 and b is minus 15.
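One way to see this surface numerically: evaluate J over a grid of (w, b) pairs and read off the height at each point. A sketch using the two-point lab dataset that appears later in these notes (x = [1, 2], y = [300, 500]), whose perfect-fit parameters are w = 200, b = 100; the grid ranges are chosen here for illustration.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])
m = len(x)

def J(w, b):
    """Squared error cost for the two-parameter model f(x) = w*x + b."""
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Evaluate the cost over a grid of (w, b) values.
ws = np.arange(100.0, 301.0, 1.0)     # 100, 101, ..., 300
bs = np.arange(0.0, 201.0, 1.0)       # 0, 1, ..., 200
costs = np.array([[J(w, b) for b in bs] for w in ws])

# The height of the surface above any (w, b) point is J(w, b);
# the bottom of the bowl is where that height is smallest.
i, j = np.unravel_index(np.argmin(costs), costs.shape)
print(ws[i], bs[j], costs[i, j])      # 200.0 100.0 0.0
```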

在这里插入图片描述

Contour map

在这里插入图片描述

Contour plot

Now, in order to look even more closely at specific points, there’s another way of plotting the cost function J that would be useful
for visualization, which is, rather than using
these 3D-surface plots, I like to take this
exact same function J.

I’m not changing the function J at all and plot it using something
called a contour plot.

If you’ve ever seen
a topographical map showing how high
different mountains are, the contours in a topographical
map are basically horizontal slices of the
landscape of say, a mountain.

This image is of
Mount Fuji in Japan. I still remember
my family visiting Mount Fuji when I
was a teenager. It's a beautiful sight. If you fly directly
above the mountain, that’s what this
contour map looks like.

It shows all the points that are at the same height, for different heights.

At the bottom of this slide
is a 3D-surface plot of the cost function J. I know it doesn’t look
very bowl-shaped, but it is actually a bowl
just very stretched out, which is why it looks like that. In an optional lab, that is shortly to follow, you will be able to see
this in 3D and spin around the surface yourself
and it’ll look more obviously
bowl-shaped there.

Next, here on the upper
right is a contour plot of this exact same cost function as that shown at the bottom. The two axes on this
contour plots are b, on the vertical axis, and w on the horizontal axis.

What each of these ovals, also called ellipses, shows is a set of points on the 3D surface which are at the exact same height.

Ovals or ellipses: A set of points which have the same value for the cost function J

In other words, the set
of points which have the same value for
the cost function J.

To get the contour plots, you take the 3D surface
at the bottom and you use a knife to slice
it horizontally. You take horizontal slices of that 3D surface and get all the points that are at the same height. Therefore, each horizontal
slice ends up being shown as one of these ellipses
or one of these ovals.
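The "horizontal slice" idea can be checked numerically: pick a height c, and any (w, b) pairs whose cost equals c lie on the same contour. A sketch, again using the two-point lab dataset x = [1, 2], y = [300, 500]; the three (w, b) pairs below were hand-picked for this illustration so that their costs coincide, mirroring the three same-height points discussed above.

```python
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])
m = len(x)

def J(w, b):
    """Squared error cost for f(x) = w*x + b."""
    return np.sum((w * x + b - y) ** 2) / (2 * m)

# Three different (w, b) pairs that sit on the same horizontal slice:
# their parameters differ, but the cost J is identical,
# so they land on the same ellipse in the contour plot.
p1 = (200.0, 120.0)
p2 = (200.0,  80.0)
p3 = (240.0,  40.0)
print(J(*p1), J(*p2), J(*p3))   # 200.0 200.0 200.0
```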

Concretely, if you
take that point, and that point, and that point, all of these three points have the same value for
the cost function J, even though they have
different values for w and b.

In the figure on the upper left, you see also that
these three points correspond to
different functions, f, all three of which
are actually pretty bad for predicting housing
prices in this case.

Now, the bottom of the bowl, where the cost function
J is at a minimum, is this point right here, at the center of this
concentric ovals. If you haven’t seen
contour plots much before, I’d like you to
imagine, if you will, that you are flying
high up above the bowl in an airplane
or in a rocket ship, and you’re looking
straight down at it.

That is as if you set your computer monitor
flat on your desk facing up and the bowl shape is coming directly
out of your screen, rising above your desk.

Imagine that the bowl
shape grows out of your computer screen
lying flat like that, so that each of these ovals have the same height above
your screen and the minimum of the bowl is right down there in the center
of the smallest oval.

It turns out that the
contour plots are a convenient way to visualize
the 3D cost function J, but in a way that's plotted in just 2D. In this video, you saw how the 3D bowl-shaped surface plot can also be visualized
as a contour plot.

Using this visualization too, in the next video, let’s visualize some
specific choices of w and b in the
linear regression model so that you can see how
these different choices affect the straight line
you’re fitting to the data. Let’s go on to the next video.

在这里插入图片描述

Visualization examples

Let’s look at some
more visualizations of w and b.

Here's one example. Over here, you have a particular point on the graph of J. For this point, w equals about negative 0.15 and b equals about 800. This point corresponds to one pair of values for w and b that yield a particular cost J.

In fact, this particular pair of values for w and b corresponds to this
function f of x, which is this line you
can see on the left.

This line intersects the
vertical axis at 800 because b equals 800 and the slope of
the line is negative 0.15, because w equals negative 0.15.

Now, if you look at the data
points in the training set, you may notice that this line is not a good fit to the data. For this function f of x, with these values of w and b, many of the predictions for
the value of y are quite far from the actual target value of y that is in the training data. Because this line
is not a good fit, if you look at the graph of j, the cost of this
line is out here, which is pretty far
from the minimum.

There’s a pretty high cost
because this choice of w and b is just not that good
a fit to the training set.

在这里插入图片描述

Another example with a different choice of w and b

Now, let’s look at another example with a
different choice of w and b.

Now, here’s another
function that is still not a great
fit for the data, but maybe slightly less bad.

This point here represents the cost for this particular pair of w and b that
creates that line.

The value of w is equal to 0 and the value b is about 360. This pair of parameters
corresponds to this function, which is a flat line, because f of x equals
0 times x plus 360. I hope that makes sense.

Let’s look at yet another example.

Here’s one more
choice for w and b, and with these values, you end up with
this line f of x. Again, not a great fit to the data; it's actually further away from the minimum compared to the previous example.

The minimum is at the center of that smallest ellipse.

Remember that the minimum is at the center of that
smallest ellipse.

Last example, if you look
at f of x on the left, this looks like a pretty good
fit to the training set.

You can see on the right, this point representing
the cost is very close to the center of
the smaller ellipse, it’s not quite
exactly the minimum, but it’s pretty close.

For this value of w and b, you get to this line, f of x. You can see that if you measure the vertical distances between the data points and the predicted values
on the straight line, you’d get the error
for each data point.

The sum of squared
errors for all of these data points
is pretty close to the minimum possible sum of squared errors among all
possible straight line fits.

By looking at these examples, you can get a better sense of how different choices of the parameters affect the line.

I hope that by looking
at these figures, you can get a better sense
of how different choices of the parameters
affect the line f of x and how this corresponds to different
values for the cost j, and hopefully you can see how the better fit lines correspond
to points on the graph of j that are closer to the
minimum possible cost for this cost function
j of w and b.

Optional Lab

In the optional lab that
follows this video, you'll get to run some code, and remember, all the code is given, so you just need to hit Shift Enter to run it
and take a look at it and the lab will show you how the cost function is
implemented in code.

Given a small training set and different choices
for the parameters, you’ll be able to see
how the cost varies depending on how well
the model fits the data.

In the optional lab, you also can play with an interactive contour plot. Check this out. You can use your
mouse cursor to click anywhere on the contour
plot and you will see the straight line defined by the values you chose for
the parameters w and b.

You’ll see a dot up here also on the 3D surface plot
showing the cost.

Finally, the optional
lab also has a 3D surface plot
that you can manually rotate and spin around using your mouse cursor to take a better look at what the
cost function looks like. I hope you’ll enjoy playing
with the optional lab.

Next video is about gradient descent

Now in linear regression, rather than having to
manually try to read a contour plot for the
best value for w and b, which isn’t really a good
procedure and also won’t work once we get to more complex
machine learning models.

What you really want is an efficient algorithm that
you can write in code for automatically finding the
values of parameters w and b that give you the best fit line.

That minimizes the
cost function j. There is an algorithm for doing this called gradient descent. This algorithm is one of the most important algorithms
in machine learning.

Gradient descent and variations on gradient descent
are used to train, not just linear regression, but some of the biggest and most complex models in all of AI.

Let’s go to the next
video to dive into this really important algorithm
called gradient descent.

Optional lab: Cost function

This optional lab will show you how the cost function is implemented in code.

And given a small training set and different choices for the parameters you’ll be able to see how the cost varies depending on how well the model fits the data.

In the optional lab, you’ll also get to play with an interactive contour plot. You can use your mouse cursor to click anywhere on the contour plot, and you see the straight line defined by the values you chose, for parameters w and b.

Finally the optional lab also has a 3d surface plot that you can manually rotate and spin around, using your mouse cursor, to take a better look at what the cost function looks like.

import numpy as np
%matplotlib widget
import matplotlib.pyplot as plt
from lab_utils_uni import plt_intuition, plt_stationary, plt_update_onclick, soup_bowl
plt.style.use('./deeplearning.mplstyle')

The problem: housing price prediction

x_train = np.array([1.0, 2.0])           #(size in 1000 square feet)
y_train = np.array([300.0, 500.0])           #(price in 1000s of dollars)

Computing Cost

The term 'cost' in this assignment might be a little confusing since the data is housing cost. Here, cost is a measure of how well our model is predicting the target price of the house. The term 'price' is used for housing data.

The equation for cost with one variable is:

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{1}$$

where

$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{2}$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w,b$.
  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the target value and the prediction.
  • These differences are summed over all the $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.

Note, in lecture summation ranges are typically from 1 to m, while code will be from 0 to m-1.

在这里插入图片描述

The code below calculates cost by looping over each example. In each loop:

  • f_wb, a prediction is calculated
  • the difference between the target and the prediction is calculated and squared.
  • this is added to the total cost.

The function that computes the cost

def compute_cost(x, y, w, b): 
    """
    Computes the cost function for linear regression.
    
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    
    Returns
        total_cost (float): The cost of using w,b as the parameters for linear regression
               to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0] 
    
    cost_sum = 0 
    for i in range(m): 
        f_wb = w * x[i] + b   
        cost = (f_wb - y[i]) ** 2  
        cost_sum = cost_sum + cost  
    total_cost = (1 / (2 * m)) * cost_sum  

    return total_cost
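A quick usage check of the `compute_cost` function above, on the lab's two-point training set. Since w = 200, b = 100 is the perfect fit for these two points, the cost there should be exactly zero, and any other choice should cost more. (The function body is repeated here only so this snippet is self-contained.)

```python
import numpy as np

# compute_cost as defined in the lab, repeated so this snippet runs on its own.
def compute_cost(x, y, w, b):
    m = x.shape[0]
    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b                 # prediction for example i
        cost_sum += (f_wb - y[i]) ** 2      # squared error for example i
    return (1 / (2 * m)) * cost_sum

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

print(compute_cost(x_train, y_train, w=200, b=100))   # 0.0  -> perfect fit
print(compute_cost(x_train, y_train, w=150, b=100))   # 3125.0 -> a worse fit costs more
```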

Cost Function Intuition

在这里插入图片描述

A helper function provided in the lab

plt_intuition(x_train,y_train)

It can plot the figure directly.

The optimal parameter

在这里插入图片描述

Other parameter values tried

在这里插入图片描述

The plot contains a few points that are worth mentioning.

  • cost is minimized when $w = 200$, which matches results from the previous lab
  • Because the difference between the target and prediction is squared in the cost equation, the cost increases rapidly when $w$ is either too large or too small.
  • Using the w and b selected by minimizing cost results in a line which is a perfect fit to the data.

Cost Function Visualization- 3D

You can see how cost varies with respect to both w and b by plotting in 3D or using a contour plot.
It is worth noting that some of the plotting in this course can become quite involved. The plotting routines are provided and while it can be instructive to read through the code to become familiar with the methods, it is not needed to complete the course successfully. The routines are in lab_utils_uni.py in the local directory.

Larger Data Set

It is instructive to view a scenario with a few more data points. This data set includes points that do not fall on the same line. What does that mean for the cost equation? Can we find $w$ and $b$ that will give us a cost of 0?

x_train = np.array([1.0, 1.7, 2.0, 2.5, 3.0, 3.2])
y_train = np.array([250, 300, 480, 430, 630, 730])

In the contour plot, click on a point to select w and b to achieve the lowest cost. Use the contours to guide your selections. Note, it can take a few seconds to update the graph.

plt.close('all') 
fig, ax, dyn_items = plt_stationary(x_train, y_train)
updater = plt_update_onclick(fig, ax, x_train, y_train, dyn_items)

The first plot

在这里插入图片描述

Above, note the dashed lines in the left plot. These represent the portion of the cost contributed by each example in your training set. In this case, values of approximately $w=209$ and $b=2.4$ provide low cost. Note that, because our training examples are not on a line, the minimum cost is not zero.
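The claim that the minimum cost is not zero can be checked directly: with the larger dataset, even the low-cost parameters quoted above (approximately w = 209, b = 2.4) leave a positive cost, because the points do not lie on one line. A sketch (the cost function here is a vectorized restatement of `compute_cost`, written out so the snippet is self-contained):

```python
import numpy as np

def compute_cost(x, y, w, b):
    """Squared error cost, vectorized form of the lab's loop."""
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

x_train = np.array([1.0, 1.7, 2.0, 2.5, 3.0, 3.2])
y_train = np.array([250.0, 300.0, 480.0, 430.0, 630.0, 730.0])

cost = compute_cost(x_train, y_train, w=209, b=2.4)
print(cost > 0)      # True: low, but not zero, since the points aren't collinear
```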

Selecting a second point

在这里插入图片描述

Testing the optimal point

在这里插入图片描述

Convex Cost surface

The fact that the cost function squares the loss ensures that the 'error surface' is convex like a soup bowl. It will always have a minimum that can be reached by following the gradient in all dimensions. In the previous plot, because the $w$ and $b$ dimensions scale differently, this is not easy to recognize. The following plot, where $w$ and $b$ are symmetric, was shown in lecture:

soup_bowl()

The plotted soup bowl

在这里插入图片描述

Congratulations!

You have learned the following:

  • The cost equation provides a measure of how well your predictions match your training data.
  • Minimizing the cost can provide optimal values of w, b.

[5] Practice Quiz: Regression Model

Practice quiz: Regression

Latest Submission Grade 100%

在这里插入图片描述

The x, the input features, are fed into the model to generate a prediction

在这里插入图片描述

When the cost is small, this means that the model fits the training set well, not poorly.

[6] Train the model with gradient descent

Gradient descent

Welcome back. In the last video, we saw visualizations of the cost function j
and how you can try different choices of
the parameters w and b and see what cost
value they get you.

It would be nice if we had a more systematic way to
find the values of w and b, that results in the
smallest possible cost, j of w, b.

It turns out there’s
an algorithm called gradient descent that
you can use to do that.

Gradient descent is used all over the place
in machine learning, not just for linear regression, but for training
for example some of the most advanced
neural network models, also called deep
learning models.

Deep learning models are something you'll learn about in the second course. Learning this technique of gradient descent will set you up with one of the most important building blocks in machine learning.

在这里插入图片描述

Overview of gradient descent

Here’s an overview of what we’ll do with gradient descent.

You have the cost
function j of w, b right here that you
want to minimize.

In the example
we’ve seen so far, this is a cost function
for linear regression, but it turns out that
gradient descent is an algorithm that you can use to try to
minimize any function, not just a cost function
for linear regression.

Just to make this discussion on gradient descent more general, it turns out that
gradient descent applies to more
general functions, including other cost
functions that work with models that have
more than two parameters.

For instance, if you
have a cost function J as a function of w_1, w_2 up to w_n and b, your objective is to minimize j over the parameters
w_1 to w_n and b.

In other words, you
want to pick values for w_1 through w_n and b, that gives you the smallest
possible value of j.

It turns out that
gradient descent is an algorithm
that you can apply to try to minimize this
cost function j as well.

J: Cost function

Gradient Descent

在这里插入图片描述

Start off with some initial guesses for parameters.

What you’re going to do
is just to start off with some initial
guesses for w and b.

In linear regression, it won't matter too much what the initial values are, so a common choice is to set them both to 0.

For example, you can set w to 0 and b to 0 as the initial guess. With the gradient
descent algorithm, what you’re going to do is, you’ll keep on changing the
parameters w and b a bit every time to try to
reduce the cost j of w, b until hopefully j settles
at or near a minimum.

J may not be a bowl shape or a hammock shape; it might be other shapes.

One thing I should note is that for some functions J that may not be a bowl shape or a hammock shape, it is possible for there to be more than one possible minimum.

Let's take a look at an example of a more complex surface plot J to see what gradient descent is doing.

This function is not a squared error cost function. For linear regression with the squared error cost function, you always end up with a bowl shape or a hammock shape.

But this is a type
of cost function you might get if you’re training
a neural network model.

Notice the axes, that is w
and b on the bottom axis. For different values of w and b, you get different
points on this surface, j of w, b, where the height
of the surface at some point is the value
of the cost function.

The high points are hills and the low points are valleys.

My goal: standing at a point and get to the bottom of one of these valleys as efficiently as possible.

Now, let’s imagine that this surface plot is
actually a view of a slightly hilly outdoor park or a golf course where
the high points are hills and the low points
are valleys like so.

I’d like you to
imagine if you will, that you are physically standing at this point on the hill.

If it helps you to relax, imagine that there's lots of really nice green grass and butterflies and flowers; it's a really nice hill.

Your goal is to start
up here and get to the bottom of one of these valleys as
efficiently as possible.

What direction do I choose to take a baby step?

What the gradient descent algorithm does is, you're going to spin around 360 degrees, look around, and ask yourself: if I were to take a tiny little baby step in one direction, and I want to go downhill as quickly as possible toward one of these valleys,

What direction do I choose
to take that baby step?

Well, if you want to walk down this hill as efficiently
as possible, it turns out that
if you’re standing at this point in the hill
and you look around, you will notice that the
best direction to take your next step downhill is
roughly that direction.

Choose the direction of steepest descent.

Mathematically, this is the direction of
steepest descent.

It means that when you take
a tiny baby little step, this takes you
downhill faster than a tiny little baby
step you could have taken in any
other direction.

After taking this first step, you’re now at this point
on the hill over here. Now let’s repeat the process. Standing at this new point, you’re going to again spin around 360 degrees
and ask yourself, in what direction will I take the next little baby step
in order to move downhill?

If you do that and
take another step, you end up moving a bit in that direction and
you can keep going. From this new point, you can again look
around and decide what direction would take
you downhill most quickly.

Take another step,
another step, and so on, until you find yourself at
the bottom of this valley, at this local
minimum, right here.

Local Minima

在这里插入图片描述

Multiple local minima

Go through multiple steps of gradient descent

What you just did was go through multiple steps of
gradient descent.

It turns out, gradient descent has an interesting property. Remember that you can
choose a starting point at the surface by choosing starting values for the
parameters w and b.

When you perform gradient
descent a moment ago, you had started at
this point over here. Now, imagine if you try
gradient descent again, but this time you choose a different starting
point by choosing parameters that place
your starting point just a couple of steps
to the right over here.

If you then repeat the gradient descent process, which means you look around, take a little step in the direction of steepest descent so you end up here. Then you again look around, take another step, and so on. If you were to run gradient descent this second time, starting just a couple of steps to the right of where we did it the first time, then you end up in a totally different valley.

This different minimum
over here on the right. The bottoms of
both the first and the second valleys are
called local minima.

Because if you start going
down the first valley, gradient descent won’t lead
you to the second valley, and the same is true if you started going down
the second valley, you stay in that
second minimum and not find your way into
the first local minimum.

This is an interesting property of the gradient
descent algorithm, and you see more
about this later.

In this video, you saw how gradient descent helps
you go downhill.

In the next video, let’s look at the mathematical
expressions that you can implement to make
gradient descent work. Let’s go on to the next video.

Implementing gradient descent

在这里插入图片描述

Let’s take a look at
how you can actually implement the gradient
descent algorithm.

Let me write down the
gradient descent algorithm.

Here it is. On each step, w, the parameter, is updated to the old value of w minus Alpha times this term: d/dw of the cost function J of w,b.

What this expression is saying is: update your parameter w by taking the current value of w and adjusting it by a small amount, which is this expression on the right, minus Alpha times this term over here.

=: assignment operator

If you feel like there’s a lot going on in this equation, it’s okay, don’t worry about it. We’ll unpack it together.

First, this equals notation. Since we're assigning w a value using this equal sign, in this context the equal sign is the assignment operator.

Specifically, in this context, if you write code
that says a equals c, it means take the value c and
store it in your computer, in the variable a. Or if you write a
equals a plus 1, it means set the value of
a to be equal to a plus 1, or increments the
value of a by one.

==: Truth assertion

The assignment operator in coding is different from truth assertions in mathematics, where if I write a equals c, I'm asserting, that is, I'm claiming, that the values of a and c are equal to each other.

Hopefully, I will never write
a truth assertion a equals a plus 1 because that just
can’t possibly be true.

In Python and in other
programming languages, truth assertions are sometimes
written as equals equals, so you may see oh, that says a equals
equals c if you’re testing whether a is equal
to c.

But in math notation, as we conventionally use it, like in these videos, the equal sign can be used for either assignments or
for truth assertion. I try to make sure I was clear when I write an equal sign, whether we’re assigning
a value to a variable, or whether we’re
asserting the truth of the equality of two values.

Learning rate: alpha

Control how big a step you take downhill.

Now, let's dive more deeply into what the symbols in this equation mean. The symbol here is the Greek letter Alpha. In this equation, Alpha is also called the learning rate.

The learning rate is usually
a small positive number between 0 and 1 and it
might be say, 0.01.

What Alpha does is, it basically controls how big of a step you take downhill.

If Alpha is very large, then that corresponds to a very aggressive gradient
descent procedure where you’re trying to
take huge steps downhill.

If Alpha is very small, then you’d be taking small
baby steps downhill.

We’ll come back later to
dive more deeply into how to choose a good
learning rate Alpha.

Derivative term of the cost function J

Finally, this term here, that’s the derivative term
of the cost function J. Let’s not worry
about the details of this derivative right now.

But later on, you’ll get to see more about the
derivative term. But for now, you can think of this derivative term that I drew a magenta box around
as telling you in which direction you want
to take your baby step.

In combination with the
learning rate Alpha, it also determines the size of the steps you want
to take downhill. Now, I do want to mention that derivatives come from calculus. Even if you aren’t familiar with calculus, don’t worry about it. Even without knowing
any calculus, you’d be able to figure
out all you need to know about this
derivative term in this video and the
next.

One more thing. Remember your model has two parameters, not just w, but also b. You also have an assignment operation to update the parameter b that looks very similar.

b is assigned the
old value of b minus the learning rate Alpha times this slightly different
derivative term, d/db of J of wb.

Remember in the graph of the surface plot where you're taking baby steps until you get to the bottom of the valley; well, for the gradient descent algorithm, you're going to repeat these two update steps until the algorithm converges.

Converge: reach the point at a local minimum

By converges, I mean that you reach the point at a
local minimum where the parameters w and b no longer change much with each
additional step that you take.

Simultaneously update w and b

Now, there's one more subtle detail about how to correctly implement gradient descent. You're going to update two parameters, w and b, and this update takes place for both parameters.

One important detail is
that for gradient descent, you want to simultaneously
update w and b, meaning you want to update both parameters
at the same time.

What I mean by that is that in this expression, you're going to update w from the old w to a new w, and you're also updating b from its old value to a new value of b.

The way to implement this is
to compute the right side, computing this
thing for w and b, and simultaneously
at the same time, update w and b to
the new values.

Let’s take a look
at what this means.

Here's the correct way to implement gradient descent, which does a simultaneous update. This sets a variable temp_w equal to that expression, which is w minus that term here. You also set another variable, temp_b, to b minus that term.

You compute both right-hand sides, both updates, and store them into variables temp_w and temp_b. Then you copy the value of temp_w into w, and you also copy the value of temp_b into b. Now, one thing you may notice is that this value of w is from before w gets updated. Notice that the pre-update w is what goes into the derivative term over here.
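The simultaneous update can be sketched in Python. The helper names `gradient_step`, `dj_dw`, and `dj_db` here are hypothetical placeholders, and the toy gradients used below are not the course's cost function; the point is only the update pattern:

```python
# One gradient descent step with a *simultaneous* update of w and b:
# both right-hand sides are computed from the OLD (w, b) before either
# parameter takes its new value.

def gradient_step(w, b, dj_dw, dj_db, alpha):
    tmp_w = w - alpha * dj_dw(w, b)   # uses the old w and b
    tmp_b = b - alpha * dj_db(w, b)   # also uses the old w and b
    return tmp_w, tmp_b               # only now do w and b change

# toy gradients for J(w, b) = w**2 + b**2 (illustration only)
w, b = gradient_step(w=4.0, b=2.0,
                     dj_dw=lambda w, b: 2 * w,
                     dj_db=lambda w, b: 2 * b,
                     alpha=0.1)
print(w, b)   # w moves from 4.0 toward 0, b from 2.0 toward 0
```

Returning both temporaries at once (`return tmp_w, tmp_b`) is what guarantees neither derivative ever sees a half-updated parameter.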

Incorrect way to update parameters

In contrast, here is an incorrect implementation of gradient descent that does
not do a simultaneous update.

In this incorrect implementation, we compute temp_w, same as before; so far that's okay. Now here's where things start to differ: we then update w with the value in temp_w before calculating the new value for the other parameter, b.

Next, we calculate temp_b
as b minus that term here, and finally, we update b
with the value in temp_b.

The difference between
the right-hand side and the left-hand side
implementations is that if you look over here, this w has already been
updated to this new value, and this is updated
w that actually goes into the cost
function j of w, b.

It means that this term here
on the right is not the same as this term over here
that you see on the left. That also means
this temp_b term on the right is not quite the same as the temp b
term on the left, and thus this updated value for b on the right is not the same as this updated value for
variable b on the left.
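To see concretely that the two implementations produce different numbers, here is a small sketch on a toy cost (chosen for illustration, not taken from the course) whose derivative in b depends on w, so the order of the updates matters:

```python
# Toy cost J(w, b) = (w + b - 2)**2; its partial derivative with respect
# to b depends on w, so updating w first changes what b's update sees.
alpha = 0.1

def dj_dw(w, b):
    return 2 * (w + b - 2.0)

def dj_db(w, b):
    return 2 * (w + b - 2.0)

w0, b0 = 0.0, 0.0

# correct: both derivatives evaluated at the old (w0, b0)
w_sim = w0 - alpha * dj_dw(w0, b0)
b_sim = b0 - alpha * dj_db(w0, b0)

# incorrect: b's derivative sees the already-updated w
w_seq = w0 - alpha * dj_dw(w0, b0)
b_seq = b0 - alpha * dj_db(w_seq, b0)

print(b_sim, b_seq)   # the two b values disagree
```

Both versions usually still head downhill, which is why the non-simultaneous variant "more or less works"; it just isn't the algorithm called gradient descent.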

Gradient descent: update simultaneously

The way that gradient descent
is implemented in code, it actually turns out to be
more natural to implement it the correct way with
simultaneous updates. When you hear someone talk
about gradient descent, they always mean the
gradient descents where you perform a simultaneous
update of the parameters.

If however, you were to implement
non-simultaneous update, it turns out it will probably
work more or less anyway.

But doing it this way isn't really the correct way to implement it; it is actually some other algorithm with different properties. I would advise you to just stick to the correct simultaneous update and not use the incorrect version on the right. That's gradient descent.

In the next video, we’ll go into details of the derivative term which
you saw in this video, but that we didn’t really
talk about in detail.

Derivatives are
part of calculus, and again, if you’re
not familiar with calculus, don’t worry about it. You don’t need to know
calculus at all in order to complete this course or
this specialization, and you have all the
information you need in order to implement
gradient descent.

Coming up in the next video, we’ll go over
derivatives together, and you come away with the intuition and
knowledge you need to be able to implement and apply
gradient descent yourself. I think that’ll be
an exciting thing for you to know
how to implement. Let’s go on to the next
video to see how to do that.

Quiz

update w: updates parameter by a small amount, in order to reduce the cost J.

在这里插入图片描述

Gradient descent intuition

Now let's dive more deeply into gradient descent to gain better intuition about what it's doing and why it might make sense.

Here’s the gradient
descent algorithm that you saw in the
previous video. As a reminder, this variable, this Greek symbol Alpha,
is the learning rate. The learning rate controls
how big of a step you take when updating the model’s
parameters, w and b. This term here, this d over dw, this is a derivative term. By convention in math, this d is written with
this funny font here.

In case anyone watching this has a PhD in math or is an expert in multivariate calculus, they may be wondering, that's not the derivative, that's the partial derivative. Yes, they'd be right.

But for the purposes of implementing a machine
learning algorithm, I’m just going to
call it derivative. Don’t worry about these
little distinctions.

在这里插入图片描述

What we're going to focus on now is getting more intuition about what this learning rate and this derivative are doing, and why, when multiplied together like this, they result in updates to the parameters w and b that make sense.

In order to do this, let's use a slightly simpler example where we work on minimizing just one parameter. Let's say that you have a cost function J of just one parameter w, where w is a number. This means gradient descent now looks like this: w is updated to w minus the learning rate Alpha times d over dw of J of w.

Adjust parameters to minimize the cost

You’re trying to minimize the
cost by adjusting the parameter w. This is like our previous example where
we had temporarily set b equal to 0 with one
parameter w instead of two, you can look at
two-dimensional graphs of the cost function j, instead of three
dimensional graphs.

Let’s look at what gradient descent does
on just function J of w. Here on the horizontal
axis is parameter w, and on the vertical axis is
the cost j of w.

Now let's initialize gradient descent with some starting value for w; let's initialize it at this location. Imagine that you start off at this point right here on the function J. What gradient descent will do is update w to be w minus the learning rate Alpha times d over dw of J of w.

Let’s look at what this
derivative term here means.

A way to think about the derivative at this point on the line is to draw a tangent line, which is a straight line that touches the curve at that point. The slope of this line is the derivative of the function J at this point. To get the slope, you can draw a little triangle like this. If you compute the height divided by the width of this triangle, that is the slope. For example, this slope might be 2 over 1, and when the tangent line is pointing up and to the right, the slope is positive, which means that this derivative is a positive number, so it's greater than 0.
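The slope intuition can also be checked numerically. This sketch uses a central-difference approximation on J(w) = w² (an illustration, not part of the course labs): to the right of the minimum the derivative is positive, so the update moves w to the left:

```python
# Numerical check of the slope intuition on J(w) = w**2:
# a positive derivative means the update w - alpha * dJ/dw decreases w.

def J(w):
    return w ** 2

def num_derivative(J, w, eps=1e-6):
    """Central-difference estimate of dJ/dw at w."""
    return (J(w + eps) - J(w - eps)) / (2 * eps)

w = 3.0                       # right of the minimum: tangent slopes up
slope = num_derivative(J, w)  # roughly 2*w = 6, a positive number
w_new = w - 0.1 * slope       # the update decreases w, toward the minimum at 0
print(slope, w_new)
```

Starting from a negative w instead gives a negative slope, and the same update formula then increases w, matching the second example below.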

The learning rate is always a positive number.

The updated w is going to be w minus the learning rate
times some positive number. The learning rate is
always a positive number.

If you take w minus
a positive number, you end up with a new value
for w, that’s smaller. On the graph, you’re
moving to the left, you’re decreasing the
value of w.

You may notice that this is the right
thing to do if your goal is to decrease the cost J, because when we move towards
the left on this curve, the cost j decreases, and you’re getting
closer to the minimum for J, which is over here.

So far, gradient descent, seems to be doing
the right thing.

在这里插入图片描述

Now, let’s look at
another example.

Let’s take the same
function j of w as above, and now let’s say
that you initialized gradient descent at a
different location. Say by choosing a
starting value for w that’s over here on the left. That’s this point
of the function j.

Now, the derivative term, remember, is d over dw of J of w, and when we look at the tangent line at this point over here, the slope of this line is the derivative of J at this point. But this tangent line is sloping down and to the right, and a line sloping down and to the right has a negative slope. In other words, the derivative of J at this point is a negative number. For instance, if you draw a triangle, the height is negative 2 and the width is 1; the slope is negative 2 divided by 1, which is negative 2, a negative number.

When you update w, you get w minus the
learning rate times a negative number. This means you subtract
from w, a negative number.

But subtracting a negative number means adding a positive number, so you end up increasing w, because subtracting a negative number is the same as adding a positive number to w. This step of gradient descent causes w to increase, which means you're moving to the right of the graph and your cost J has decreased down to here.

Update gets you closer to the minimum

Again, it looks like gradient descent is doing something reasonable; it's getting you closer to the minimum.

Hopefully, these last two examples show some of the intuition behind what the derivative term is doing and why this helps gradient descent change w to get you closer to the minimum.

I hope this video gave
you some sense for why the derivative term in
gradient descent makes sense.

One other key quantity in the gradient descent algorithm is the learning rate Alpha. How do you choose Alpha? What happens if it’s too small or what happens
when it’s too big?

In the next video, let’s take a deeper look at the parameter Alpha to help build intuitions
about what it does, as well as how to make a
good choice for a good value of Alpha for your implementation
of gradient descent.

Quiz

Gradient descent is an algorithm for finding values of parameters w and b that minimize the cost function J.

在这里插入图片描述

Learning rate

The choice of the learning rate,
alpha will have a huge impact on the efficiency of your
implementation of gradient descent.

And if alpha, the learning rate, is chosen poorly, gradient descent may not even work at all. In this video, let's take a deeper look at the learning rate. This will also help you choose better learning rates for your implementations of gradient descent. So here again is the gradient descent rule.

W is updated to be W minus the learning rate alpha times the derivative term. Now let's learn more about what the learning rate alpha is doing.

Let’s see what could happen if the
learning rate alpha is either too small or if it is too large. For the case where the learning
rate is too small. Here’s a graph where the horizontal axis
is W and the vertical axis is the cost J. And here’s the graph of
the function J of W.

在这里插入图片描述

What if learning rate alpha is too small?

The steps are so tiny, they are little baby steps.

Need a lot of steps to get to the minimum.

Let's start gradient descent at this point here. If the learning rate is too small, then what happens is that you multiply your derivative term by some really, really small number: a number alpha that's really small, like 0.0000001. And so you end up taking a very small baby step like that.

Then from this point you’re going to
take another tiny tiny little baby step. But because the learning rate is so small,
the second step is also just minuscule. The outcome of this process is that you
do end up decreasing the cost J but incredibly slowly.

So, here’s another step and another step, another tiny step until you
finally approach the minimum. But as you may notice you’re going to need
a lot of steps to get to the minimum.

So to summarize if the learning
rate is too small, then gradient descents will work,
but it will be slow.

It will take a very long time because it’s
going to take these tiny tiny baby steps. And it’s going to need
a lot of steps before it gets anywhere close to the minimum.

What happened if the learning rate is too large?

Now, let’s look at a different case. What happens if the learning
rate is too large?

Here's another graph of the cost function. And let's say we start gradient descent with W at this value here, so it's actually already pretty close to the minimum. So the derivative points to the right.

But if the learning rate
is too large then you update W very giant step to
be all the way over here. And that’s this point
here on the function J. So you move from this point on the left,
all the way to this point on the right.

And now the cost has actually gotten worse. It has increased, because it started out at this value here, and after one step, it actually increased to this value here. Now the derivative at this new point says to decrease W, but when the learning rate is too big, you may take a huge step, going from here all the way out here. So now you've gotten to this point here, and again, the learning rate is too big.

Then you take another huge step and way overshoot the minimum again. So now you're at this point on the right, and one more time you do another update and end up all the way over here; so you're now at this point here. As you may notice, you're actually getting further and further away from the minimum.

If the learning rate is too large, then it may overshoot and never reach the minimum.

So if the learning rate is too large, then gradient descent may overshoot and may never reach the minimum. Another way to say that is that gradient descent may fail to converge, and may even diverge.
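Both cases can be seen on the simple cost J(w) = w², whose derivative is 2w, so the update is w ← w·(1 − 2α). This toy experiment is an illustration, not course code:

```python
# On J(w) = w**2 the update w <- w - alpha * 2 * w multiplies w by
# (1 - 2*alpha) each step: a tiny alpha shrinks w toward the minimum at 0
# very slowly, while alpha > 1 makes |w| grow every step (divergence).

def run(alpha, w=1.0, steps=10):
    for _ in range(steps):
        w = w - alpha * 2 * w
    return w

w_small = run(alpha=0.01)   # slowly creeping toward the minimum at 0
w_large = run(alpha=1.1)    # overshoots further each step: diverges
print(abs(w_small), abs(w_large))
```

After ten steps the small-alpha run has barely moved, while the large-alpha run is further from the minimum than where it started.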

在这里插入图片描述

Cost function J is already at a local minimum.

So here's another question you may be wondering about: what if your parameter W is already at this point here, so that your cost J is already at a local minimum? What do you think one step of gradient descent will do if you've already reached a minimum? This is a tricky one.

When I was first learning this stuff, it actually took me a long
time to figure it out.

Two local minima

But let's work through this together. Suppose you have some cost function J, and the one you see here isn't a squared error cost function; this cost function has two local minima, corresponding to the two valleys that you see here.

Now let’s suppose that after some
number of steps of gradient descent, your parameter W is over here,
say equal to five. And so this is the current value of W. This means that you’re at this
point on the cost function J.

And that happens to be a local minimum. It turns out that if you draw a tangent line to the function at this point, the slope of this line is zero, and thus the derivative term is equal to zero for the current value of W. So your gradient descent update becomes: W is updated to W minus the learning rate times zero.

And that's because the derivative term is equal to zero. This is the same as saying: set W to be equal to W. So this means that if you're already at a local minimum, gradient descent leaves W unchanged.

Because it just updates the new value of W to be the exact same old value of W. So concretely, let's say the current value of W is five and alpha is 0.1; after one iteration, you update W as W minus alpha times zero, and it is still equal to five.

The gradient descent can’t update the parameters when your parameters have already brought you to a local minimum.

So if your parameters have already brought you to a local minimum, then further gradient descent steps do absolutely nothing.

It doesn’t change the parameters which
is what you want because it keeps the solution at that local minimum.

This also explains why gradient
descent can reach a local minimum, even with a fixed learning rate alpha.

在这里插入图片描述

Here’s what I mean, to illustrate this,
let’s look at another example.

Here’s the cost function J of
W that we want to minimize. Let’s initialize gradient
descent up here at this point. If we take one update step,
maybe it will take us to that point.

And because this derivative is pretty large, gradient descent takes a relatively big step.

Now, we’re at this second point
where we take another step. And you may notice that the slope is not
as steep as it was at the first point. So the derivative isn’t as large.

And so the next update step will not be as large as that first step. Now we're at this third point here, and the derivative is smaller than it was at the previous step, so we'll take an even smaller step as we approach the minimum. The derivative gets closer and closer to zero.

So as we run gradient descent,
eventually we’re taking very small steps until you finally
reach a local minimum.

Can reach minimum without decreasing learning rate alpha

So just to recap,
as we get nearer a local minimum gradient descent will automatically
take smaller steps.

And that’s because as we
approach the local minimum, the derivative automatically gets smaller. And that means the update steps
also automatically gets smaller. Even if the learning rate alpha
is kept at some fixed value.
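A small sketch (again on the toy cost J(w) = w², not course code) shows that with alpha held fixed, the steps shrink on their own because the derivative shrinks, and a zero derivative leaves w unchanged:

```python
# With a fixed alpha, the step size alpha * dJ/dw shrinks automatically
# as w approaches the minimum of J(w) = w**2, since dJ/dw = 2*w shrinks.
alpha = 0.1
w = 4.0
steps = []
for _ in range(5):
    step = alpha * 2 * w       # learning rate times the derivative term
    w = w - step
    steps.append(step)
print(steps)                    # each step is smaller than the one before

# At the minimum itself the derivative is 0, so the update changes nothing.
w_min = 0.0
assert w_min - alpha * 2 * w_min == w_min
```

This is why gradient descent can settle into a minimum without ever decreasing the learning rate.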

So that's the gradient descent algorithm. You can use it to try to minimize any cost function J, not just the mean squared error cost function that we're using for linear regression.

In the next video, we're going to take the function J and set that back to be exactly the linear regression model's cost function: the mean squared error cost function that we came up with earlier. Putting together gradient descent with this cost function will give you your first learning algorithm, the linear regression algorithm.

Gradient descent for linear regression

Previously, you took a look at the linear regression model
and then the cost function, and then the gradient
descent algorithm.

In this video, we're going to put it all together and use the squared error cost function for the linear regression model with gradient descent. This will allow us to train the linear regression model to fit a straight line to the training data.

Let’s get to it. Here’s the linear
regression model. To the right is the squared
error cost function. Below is the gradient
descent algorithm. It turns out if you
calculate these derivatives, these are the terms
you would get. The derivative with respect
to W is this 1 over m, sum of i equals 1 through
m.

Then the error term, that is the difference
between the predicted and the actual values times
the input feature xi. The derivative with respect to b is this formula over here, which looks the same
as the equation above, except that it doesn’t have
that xi term at the end. If you use these
formulas to compute these two derivatives and implements gradient descent
this way, it will work.
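These two derivative formulas translate directly into NumPy. The sketch below is written in the spirit of the lab's `compute_gradient`, but the exact lab code may differ:

```python
import numpy as np

def compute_gradient(x, y, w, b):
    """Partial derivatives of the squared-error cost for f(x) = w*x + b:
       dJ/dw = (1/m) * sum((f(x_i) - y_i) * x_i)
       dJ/db = (1/m) * sum( f(x_i) - y_i)"""
    m = x.shape[0]
    err = w * x + b - y        # prediction error for every example
    dj_dw = np.dot(err, x) / m # error term weighted by the input feature
    dj_db = np.sum(err) / m    # same formula, without the x_i factor
    return dj_dw, dj_db
```

As a sanity check, if (w, b) already fits the data perfectly, the errors are all zero and both derivatives come out to exactly zero, so gradient descent would leave the parameters unchanged.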

在这里插入图片描述

Now, you may be wondering, where did I get
these formulas from? They’re derived using calculus. If you want to see
the full derivation, I’ll quickly run through the derivation on
the next slide. But if you don’t
remember or aren’t interested in the calculus,
don’t worry about it. You can skip the materials on the next slide entirely
and still be able to implement gradient
descent and finish this class and everything
will work just fine.

In this slide, which is one of the most mathematical slides of the entire specialization, and which again is completely optional, we'll show you how to calculate the derivative terms.

Let's start with the first term, the derivative of the cost function J with respect to w. We'll start by plugging in the definition of the cost function J. J of w,b is this: 1 over 2m times the sum of the squared error terms. Now remember also that f of w,b of x^i is equal to this term over here, which is w x^i plus b.

Compute the partial derivative with respect to w

What we would like to do
is compute the derivative, also called the partial
derivative with respect to w of this equation right
here on the right.

If you've taken a calculus class before, and again it's totally fine if you haven't, you may know that by the rules of calculus, the derivative is equal to this term over here, which is why the two here and the two here cancel out, leaving us with this equation that you saw on the previous slide.

1/2m makes the partial derivative neater.

This is why we defined the cost function with the 1/2 earlier this week: it makes the partial derivative neater. It cancels out the two that appears from computing the derivative.

For the other derivative, with respect to b, this is quite similar. I can write it out like this, and once again, plugging in the definition of f of x^i gives this equation. By the rules of calculus, this is equal to this, where there's no x^i anymore at the end. The 2's cancel once more, and you end up with this expression for the derivative with respect to b.

Now you have these two
expressions for the derivatives. You can plug them into the
gradient descent algorithm.
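Before plugging them in, the two derived formulas can be sanity-checked numerically. The sketch below (my own illustration with made-up data, not course code) compares the analytic derivatives against a centered finite-difference estimate of the cost:

```python
# Sanity check (hypothetical data): the analytic gradients of the squared-error
# cost should match a centered finite-difference approximation.
import numpy as np

def cost(x, y, w, b):
    # J(w,b) = (1/2m) * sum of squared errors
    return np.mean((w * x + b - y) ** 2) / 2

def analytic_gradients(x, y, w, b):
    # dJ/dw = (1/m) * sum(err * x),  dJ/db = (1/m) * sum(err)
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 2.5, 3.5])
w, b, eps = 0.7, 0.4, 1e-6

dj_dw, dj_db = analytic_gradients(x, y, w, b)
num_dw = (cost(x, y, w + eps, b) - cost(x, y, w - eps, b)) / (2 * eps)
num_db = (cost(x, y, w, b + eps) - cost(x, y, w, b - eps)) / (2 * eps)
print(abs(dj_dw - num_dw), abs(dj_db - num_db))  # both should be tiny
```

Both differences come out vanishingly small, which is a quick empirical confirmation of the calculus on the slide.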

在这里插入图片描述

Repeatedly carry out these updates to w and b until convergence.

Here’s the gradient
descent algorithm for linear regression.

You repeatedly carry
out these updates to w and b until convergence. Remember that this f of x is
a linear regression model, so it's equal to w times x plus b. This expression here is the derivative of the cost
function with respect to w. This expression is the derivative of the cost
function with respect to b.

Just as a reminder, you want to update w and b
simultaneously on each step.
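The simultaneous-update requirement can be made concrete with a tiny sketch (my own illustration, not course code; the cost J(w,b) = w*b + b^2 and its partials are made up for demonstration):

```python
# Why the update must be simultaneous: compute BOTH gradients from the old
# (w, b) before changing either parameter.
alpha = 0.1

def dJ_dw(w, b):      # partial of the hypothetical J = w*b + b**2 w.r.t. w
    return b

def dJ_db(w, b):      # partial of the hypothetical J w.r.t. b
    return w + 2 * b

w, b = 1.0, 2.0
# Correct: both gradients evaluated at the OLD point
tmp_w = w - alpha * dJ_dw(w, b)
tmp_b = b - alpha * dJ_db(w, b)
w_sim, b_sim = tmp_w, tmp_b

w, b = 1.0, 2.0
# Incorrect: b's gradient sees the already-updated w, giving a different b
w = w - alpha * dJ_dw(w, b)
b = b - alpha * dJ_db(w, b)
print(w_sim, b_sim, w, b)
```

With these numbers the simultaneous update gives b = 1.5 while the sequential one gives b = 1.52: a small difference here, but the sequential version is a subtly different algorithm.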

在这里插入图片描述

Now, let’s get familiar with
how gradient descent works.

One issue we saw with
gradient descent is that it can lead to a local minimum
instead of a global minimum.

Here the global minimum
means the point that has the lowest possible value
of the cost function J of all possible points. You may recall this surface
plot that looks like an outdoor park with
a few hills, with the grass and the birds,
as a relaxing spot.

This function has more
than one local minimum. Remember, depending on where you initialize the
parameters w and b, you can end up at
different local minima. You can end up here, or you can end up here.

在这里插入图片描述

Convex function

But it turns out
when you’re using a squared error cost function
with linear regression, the cost function
does not and will never have multiple
local minima. It has a single global minimum because of this bowl-shape.

The technical term
for this is that this cost function is
a convex function.

Informally, a convex function is a bowl-shaped function, and it cannot have any local minima other than the single
global minimum.

When you implement gradient
descent on a convex function, one nice property is that so long as your learning rate
is chosen appropriately, it will always converge
to the global minimum.
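As a quick illustration (my own, with made-up data; not course code), the defining inequality of convexity can be spot-checked numerically for the squared-error cost: the cost at any point between two parameter settings never exceeds the corresponding weighted average of the costs at those settings.

```python
# Numerical spot-check of convexity for the squared-error cost:
# J(lam*p1 + (1-lam)*p2) <= lam*J(p1) + (1-lam)*J(p2) for all sampled points.
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.5, 3.0])

def J(w, b):
    return np.mean((w * x + b - y) ** 2) / 2

rng = np.random.default_rng(0)
ok = True
for _ in range(1000):
    w1, b1, w2, b2 = rng.uniform(-10, 10, size=4)
    lam = rng.uniform()
    lhs = J(lam * w1 + (1 - lam) * w2, lam * b1 + (1 - lam) * b2)
    rhs = lam * J(w1, b1) + (1 - lam) * J(w2, b2)
    ok &= bool(lhs <= rhs + 1e-9)   # small tolerance for rounding
print(ok)
```

A failure at even one sampled point would disprove convexity; passing many random checks is consistent with the bowl shape (the squared-error cost is in fact a convex quadratic in w and b).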

在这里插入图片描述

Congratulations,
you now know how to implement gradient descent
for linear regression. We have just one last
video for this week. In that video, we'll see
this algorithm in action. Let's go to that last video.

Running gradient descent

Let’s see what happens when you run
gradient descent for linear regression.

Let’s go see the algorithm in action.

Here’s a plot of the model and
data on the upper left and a contour plot of the cost
function on the upper right and at the bottom is the surface
plot of the same cost function.

Often w and
b will both be initialized to 0, but for this demonstration,
let's initialize w = -0.1 and b = 900, so this corresponds to f(x) = -0.1x + 900. Now, if we take one step
using gradient descent, we end up going from this
point on the cost function out here to this point just down and
to the right, and notice that the straight-line
fit has also changed a bit.

在这里插入图片描述

Let's take another step. The cost has now
moved to this third point, and again the function f(x)
has also changed a bit.

As you take more of these steps,
the cost is decreasing at each update. So the parameters w and
b are following this trajectory.

And if you look on the left,
you get this corresponding straight-line fit that fits the data better and better
until we've reached the global minimum. The global minimum corresponds
to this straight-line fit, which is a relatively
good fit to the data. I mean, isn't that cool? And so that's gradient descent, and we're going to use this to fit
a model to the housing data.

And you can now use this f(x)
model to predict the price of your clients house or
anyone else’s house. For instance, if your friend’s
house size is 1250 square feet, you can now read off the value and
predict that maybe they could get, I don’t know, $250,000 for the house.

在这里插入图片描述

Batch gradient descent.

To be more precise, this gradient descent
process is called batch gradient descent.

The term batch gradient descent refers
to the fact that on every step of gradient descent, we're looking
at all of the training examples, instead of just a subset
of the training data.

So in computing gradient descent,
when computing derivatives, we're computing the sum from i = 1 to m. And batch gradient descent is
looking at the entire batch of training examples at each update. I know that batch gradient descent may
not be the most intuitive name, but this is what people in the machine
learning community call it.

If you've heard of
the newsletter The Batch, that's published by DeepLearning.AI: the newsletter The Batch was also named
for this concept in machine learning.

And then it turns out that there are other
versions of gradient descent that do not look at the entire training set, but instead look at smaller subsets of
the training data at each update step. But we'll use batch gradient descent for
linear regression. So that's it for linear regression.
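The distinction can be sketched in a few lines (my own illustration with made-up data, not course code): a mini-batch step estimates the same gradient from only a subset of the m examples, while a batch step uses all of them.

```python
# Batch vs. mini-batch gradients on the same (made-up) data and parameters.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.2])

def gradient(xs, ys, w, b):
    # (dJ/dw, dJ/db) averaged over whatever examples are passed in
    err = w * xs + b - ys
    return np.mean(err * xs), np.mean(err)

w, b = 0.0, 0.0
# Batch gradient descent: every step uses ALL m examples
dj_dw_batch, dj_db_batch = gradient(x, y, w, b)

# Mini-batch variant: each step uses only a subset (here the first 2 examples)
dj_dw_mini, dj_db_mini = gradient(x[:2], y[:2], w, b)

print(dj_dw_batch, dj_dw_mini)  # two different estimates of the gradient
```

The mini-batch value is a noisier estimate of the batch gradient; as the transcript notes, this course sticks with the batch version for linear regression.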

在这里插入图片描述

Congratulations on getting through
your first machine learning model. I hope you go and celebrate or I don’t
know maybe take a nap in your hammock.

In the optional lab that
follows this video, you'll see a review of the gradient
descent algorithm as well as how to implement it in code.

You’ll also see a plot that shows how
the cost decreases as you continue training more iterations.

And you’ll also see a contour plot, seeing how the cost gets closer
to the global minimum as gradient descent finds better and
better values for the parameters w and b.

So remember, to do the optional lab you just need to read and run the code. You won't need to write
any code yourself, and I hope you take a few moments to do that.

And also become familiar with the gradient
descent code because this will help you to implement this and
similar algorithms in the future yourself.

On the way to becoming a machine learning person.

Thanks for sticking with me through
the end of this last video for the first week and congratulations for
making it all the way here. You’re on your way to becoming
a machine learning person.

In addition to the optional labs,
if you haven't done so yet, I hope you also check out the practice
quizzes, which are a nice way to double-check your own
understanding of the concepts. It's also totally fine if you don't
get them all right the first time. And you can also take the quizzes multiple
times until you get the score that you want. You now know how to implement linear
regression with one variable, and that brings us to the close of this week.

About week 02

Next week, we'll learn to make linear
regression much more powerful. Instead of one feature like the size of a house, you'll learn how to get it to
work with lots of features. You'll also learn how to get
it to fit nonlinear curves. These improvements will make the algorithm
much more useful and valuable. Lastly, we'll also go over some
practical tips that will really help for getting linear regression to
work on practical applications. I'm really happy to have you
here with me in this class, and I look forward to seeing you next week.

Optional lab: Gradient descent

In the optional lab, you’ll see a review of the gradient descent algorithm, as well as how to implement it in code.

You will also see a plot that shows how the cost decreases as you continue training more iterations. And you’ll also see a contour plot, seeing how the cost gets closer to the global minimum as gradient descent finds better and better values for the parameters w and b.

This lab section was added on July 5, 2022 at 14:30.

在这里插入图片描述

Goals

In this lab, you will:

  • automate the process of optimizing $w$ and $b$ using gradient descent.

Tools

In this lab, we will make use of:

  • NumPy, a popular library for scientific computing
  • Matplotlib, a popular library for plotting data
  • plotting routines in the lab_utils.py file in the local directory
import math, copy
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('./deeplearning.mplstyle')
from lab_utils_uni import plt_house_x, plt_contour_wgrad, plt_divergence, plt_gradients

Problem Statement

Let’s use the same two data points as before - a house with 1000 square feet sold for $300,000 and a house with 2000 square feet sold for $500,000.

| Size (1000 sqft) | Price (1000s of dollars) |
| --- | --- |
| 1 | 300 |
| 2 | 500 |
# Load our data set
x_train = np.array([1.0, 2.0])   #features
y_train = np.array([300.0, 500.0])   #target value

Compute_Cost

This was developed in the last lab. We’ll need it again here.

#Function to calculate the cost
def compute_cost(x, y, w, b):
   
    m = x.shape[0] 
    cost = 0
    
    for i in range(m):
        f_wb = w * x[i] + b
        cost = cost + (f_wb - y[i])**2
    total_cost = 1 / (2 * m) * cost

    return total_cost

Gradient descent summary

So far in this course, you have developed a linear model that predicts $f_{w,b}(x^{(i)})$:

$$f_{w,b}(x^{(i)}) = wx^{(i)} + b \tag{1}$$

In linear regression, you utilize input training data to fit the parameters $w$, $b$ by minimizing a measure of the error between our predictions $f_{w,b}(x^{(i)})$ and the actual data $y^{(i)}$. The measure is called the $cost$, $J(w,b)$. In training you measure the cost over all of our training samples $x^{(i)}, y^{(i)}$:

$$J(w,b) = \frac{1}{2m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2 \tag{2}$$

在这里插入图片描述

In lecture, gradient descent was described as:

在这里插入图片描述

LaTeX source

\begin{align*} \text{repeat}&\text{ until convergence:} \; \lbrace \newline
\;  w &= w -  \alpha \frac{\partial J(w,b)}{\partial w} \tag{3}  \; \newline 
 b &= b -  \alpha \frac{\partial J(w,b)}{\partial b}  \newline \rbrace
\end{align*}

where parameters $w$, $b$ are updated simultaneously.
The gradient is defined as:

在这里插入图片描述

LaTeX source

\begin{align}
\frac{\partial J(w,b)}{\partial w}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})x^{(i)} \tag{4}\\
  \frac{\partial J(w,b)}{\partial b}  &= \frac{1}{m} \sum\limits_{i = 0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)}) \tag{5}\\
\end{align}

Here simultaneously means that you calculate the partial derivatives for all the parameters before updating any of the parameters.

Implement Gradient Descent

You will implement the gradient descent algorithm for one feature. You will need three functions.

  • compute_gradient implementing equations (4) and (5) above
  • compute_cost implementing equation (2) above (code from the previous lab)
  • gradient_descent, utilizing compute_gradient and compute_cost

Conventions:

  • The naming of Python variables containing partial derivatives follows this pattern: $\frac{\partial J(w,b)}{\partial b}$ will be dj_db.
  • w.r.t. is With Respect To, as in the partial derivative of $J(w,b)$ With Respect To $b$.

compute_gradient

compute_gradient implements (4) and (5) above and returns $\frac{\partial J(w,b)}{\partial w}$, $\frac{\partial J(w,b)}{\partial b}$. The embedded comments describe the operations.

def compute_gradient(x, y, w, b): 
    """
    Computes the gradient for linear regression 
    Args:
      x (ndarray (m,)): Data, m examples 
      y (ndarray (m,)): target values
      w,b (scalar)    : model parameters  
    Returns
      dj_dw (scalar): The gradient of the cost w.r.t. the parameters w
      dj_db (scalar): The gradient of the cost w.r.t. the parameter b     
     """
    
    # Number of training examples
    m = x.shape[0]    
    dj_dw = 0
    dj_db = 0
    
    for i in range(m):  
        f_wb = w * x[i] + b 
        dj_dw_i = (f_wb - y[i]) * x[i] 
        dj_db_i = f_wb - y[i] 
        dj_db += dj_db_i
        dj_dw += dj_dw_i 
    dj_dw = dj_dw / m 
    dj_db = dj_db / m 
        
    return dj_dw, dj_db
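The loop-based compute_cost and compute_gradient above can equivalently be written with vectorized NumPy operations. This is my own sketch, not part of the lab:

```python
# Vectorized equivalents of the lab's loop-based cost and gradient functions.
import numpy as np

def compute_cost_vec(x, y, w, b):
    err = w * x + b - y                  # all m errors at once
    return np.sum(err ** 2) / (2 * x.shape[0])

def compute_gradient_vec(x, y, w, b):
    err = w * x + b - y
    return np.sum(err * x) / x.shape[0], np.sum(err) / x.shape[0]

x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])
print(compute_gradient_vec(x_train, y_train, 0, 0))
```

On the lab's two training points with w = b = 0, this reproduces the iteration-0 gradients printed later in the lab (dj_dw = -650, dj_db = -400); for one feature the loop is just as clear, but the vectorized form scales naturally to the multi-feature case next week.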

Where do the formulas for these partial derivatives come from? They were derived in the lecture.

在这里插入图片描述

The lectures described how gradient descent utilizes the partial derivative of the cost with respect to a parameter at a point to update that parameter.
Let's use our compute_gradient function to find and plot some partial derivatives of our cost function relative to one of the parameters, $w_0$.

plt_gradients(x_train,y_train, compute_cost, compute_gradient)
plt.show()

Output

在这里插入图片描述

Above, the left plot shows $\frac{\partial J(w,b)}{\partial w}$, or the slope of the cost curve relative to $w$, at three points. On the right side of the plot, the derivative is positive, while on the left it is negative. Due to the 'bowl shape', the derivatives will always lead gradient descent toward the bottom where the gradient is zero.

The left plot has fixed $b=100$. Gradient descent will utilize both $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ to update parameters. The 'quiver plot' on the right provides a means of viewing the gradient of both parameters. The arrow sizes reflect the magnitude of the gradient at that point. The direction and slope of each arrow reflects the ratio of $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$ at that point.
Note that the gradient points away from the minimum. Review equation (3) above: the scaled gradient is subtracted from the current value of $w$ or $b$, which moves the parameter in a direction that will reduce cost.

Gradient Descent

Now that gradients can be computed, gradient descent, described in equation (3) above, can be implemented below in gradient_descent. The details of the implementation are described in the comments. Below, you will utilize this function to find optimal values of $w$ and $b$ on the training data.

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, cost_function, gradient_function): 
    """
    Performs gradient descent to fit w,b. Updates w,b by taking 
    num_iters gradient steps with learning rate alpha
    
    Args:
      x (ndarray (m,))  : Data, m examples 
      y (ndarray (m,))  : target values
      w_in,b_in (scalar): initial values of model parameters  
      alpha (float):     Learning rate
      num_iters (int):   number of iterations to run gradient descent
      cost_function:     function to call to produce cost
      gradient_function: function to call to produce gradient
      
    Returns:
      w (scalar): Updated value of parameter after running gradient descent
      b (scalar): Updated value of parameter after running gradient descent
      J_history (List): History of cost values
      p_history (list): History of parameters [w,b] 
      """
    
    w = copy.deepcopy(w_in) # avoid modifying global w_in
    # An array to store cost J and w's at each iteration primarily for graphing later
    J_history = []
    p_history = []
    b = b_in
    w = w_in
    
    for i in range(num_iters):
        # Calculate the gradient and update the parameters using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w , b)     

        # Update Parameters using equation (3) above
        b = b - alpha * dj_db                            
        w = w - alpha * dj_dw                            

        # Save cost J at each iteration
        if i<100000:      # prevent resource exhaustion 
            J_history.append( cost_function(x, y, w , b))
            p_history.append([w,b])
        # Print cost at intervals, 10 times over the run (or every iteration if num_iters < 10)
        if i% math.ceil(num_iters/10) == 0:
            print(f"Iteration {i:4}: Cost {J_history[-1]:0.2e} ",
                  f"dj_dw: {dj_dw: 0.3e}, dj_db: {dj_db: 0.3e}  ",
                  f"w: {w: 0.3e}, b:{b: 0.5e}")
 
    return w, b, J_history, p_history #return w and J,w history for graphing

Use the gradient descent

# initialize parameters
w_init = 0
b_init = 0
# some gradient descent settings
iterations = 10000
tmp_alpha = 1.0e-2
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)
print(f"(w,b) found by gradient descent: ({w_final:8.4f},{b_final:8.4f})")

Output

Iteration    0: Cost 7.93e+04  dj_dw: -6.500e+02, dj_db: -4.000e+02   w:  6.500e+00, b: 4.00000e+00
Iteration 1000: Cost 3.41e+00  dj_dw: -3.712e-01, dj_db:  6.007e-01   w:  1.949e+02, b: 1.08228e+02
Iteration 2000: Cost 7.93e-01  dj_dw: -1.789e-01, dj_db:  2.895e-01   w:  1.975e+02, b: 1.03966e+02
Iteration 3000: Cost 1.84e-01  dj_dw: -8.625e-02, dj_db:  1.396e-01   w:  1.988e+02, b: 1.01912e+02
Iteration 4000: Cost 4.28e-02  dj_dw: -4.158e-02, dj_db:  6.727e-02   w:  1.994e+02, b: 1.00922e+02
Iteration 5000: Cost 9.95e-03  dj_dw: -2.004e-02, dj_db:  3.243e-02   w:  1.997e+02, b: 1.00444e+02
Iteration 6000: Cost 2.31e-03  dj_dw: -9.660e-03, dj_db:  1.563e-02   w:  1.999e+02, b: 1.00214e+02
Iteration 7000: Cost 5.37e-04  dj_dw: -4.657e-03, dj_db:  7.535e-03   w:  1.999e+02, b: 1.00103e+02
Iteration 8000: Cost 1.25e-04  dj_dw: -2.245e-03, dj_db:  3.632e-03   w:  2.000e+02, b: 1.00050e+02
Iteration 9000: Cost 2.90e-05  dj_dw: -1.082e-03, dj_db:  1.751e-03   w:  2.000e+02, b: 1.00024e+02
(w,b) found by gradient descent: (199.9929,100.0116)

Take a moment and note some characteristics of the gradient descent process printed above.

  • The cost starts large and rapidly declines as described in the slide from the lecture.
  • The partial derivatives, dj_dw, and dj_db also get smaller, rapidly at first and then more slowly. As shown in the diagram from the lecture, as the process nears the ‘bottom of the bowl’ progress is slower due to the smaller value of the derivative at that point.
  • Progress slows even though the learning rate, alpha, remains fixed.

Observe how the two partial derivatives change above: dj_dw is negative, so "getting smaller" here means its absolute value shrinks; both derivatives approach zero from either side.

Cost versus iterations of gradient descent

A plot of cost versus iterations is a useful measure of progress in gradient descent. Cost should always decrease in successful runs. The change in cost is so rapid initially that it is useful to plot the initial descent on a different scale than the final descent. In the plots below, note the scale of cost on the axes and the iteration step.

# plot cost versus iteration  
fig, (ax1, ax2) = plt.subplots(1, 2, constrained_layout=True, figsize=(12,4))
ax1.plot(J_hist[:100])
ax2.plot(1000 + np.arange(len(J_hist[1000:])), J_hist[1000:])
ax1.set_title("Cost vs. iteration(start)");  ax2.set_title("Cost vs. iteration (end)")
ax1.set_ylabel('Cost')            ;  ax2.set_ylabel('Cost') 
ax1.set_xlabel('iteration step')  ;  ax2.set_xlabel('iteration step') 
plt.show()

Output

在这里插入图片描述

Predictions

Now that you have discovered the optimal values for the parameters $w$ and $b$, you can use the model to predict housing values based on the learned parameters. As expected, the predicted values are nearly the same as the training values for the same houses. Further, the prediction for a size not in the training set is in line with the expected value.

print(f"1000 sqft house prediction {w_final*1.0 + b_final:0.1f} Thousand dollars")
print(f"1200 sqft house prediction {w_final*1.2 + b_final:0.1f} Thousand dollars")
print(f"2000 sqft house prediction {w_final*2.0 + b_final:0.1f} Thousand dollars")

Output

1000 sqft house prediction 300.0 Thousand dollars
1200 sqft house prediction 340.0 Thousand dollars
2000 sqft house prediction 500.0 Thousand dollars

Plotting

You can show the progress of gradient descent during its execution by plotting the cost over iterations on a contour plot of the cost(w,b).

fig, ax = plt.subplots(1,1, figsize=(12, 6))
plt_contour_wgrad(x_train, y_train, p_hist, ax)

Output

在这里插入图片描述

Above, the contour plot shows $cost(w,b)$ over a range of $w$ and $b$. Cost levels are represented by the rings. Overlaid, using red arrows, is the path of gradient descent. Here are some things to note:

  • The path makes steady (monotonic) progress toward its goal.
  • Initial steps are much larger than the steps near the goal.

Zooming in, we can see the final steps of gradient descent. Note that the distance between steps shrinks as the gradient approaches zero.

fig, ax = plt.subplots(1,1, figsize=(12, 4))
plt_contour_wgrad(x_train, y_train, p_hist, ax, w_range=[180, 220, 0.5], b_range=[80, 120, 0.5],
            contours=[1,5,10,20],resolution=0.5)

Output

在这里插入图片描述

Increased Learning Rate

In the lecture, there was a discussion related to the proper value of the learning rate, $\alpha$, in equation (3). The larger $\alpha$ is, the faster gradient descent will converge to a solution. But if it is too large, gradient descent will diverge. Above you have an example of a solution which converges nicely.

Let's try increasing the value of $\alpha$ and see what happens:

# initialize parameters
w_init = 0
b_init = 0
# set alpha to a large value
iterations = 10
tmp_alpha = 8.0e-1 # 0.8
# run gradient descent
w_final, b_final, J_hist, p_hist = gradient_descent(x_train ,y_train, w_init, b_init, tmp_alpha, 
                                                    iterations, compute_cost, compute_gradient)

Output

Iteration    0: Cost 2.58e+05  dj_dw: -6.500e+02, dj_db: -4.000e+02   w:  5.200e+02, b: 3.20000e+02
Iteration    1: Cost 7.82e+05  dj_dw:  1.130e+03, dj_db:  7.000e+02   w: -3.840e+02, b:-2.40000e+02
Iteration    2: Cost 2.37e+06  dj_dw: -1.970e+03, dj_db: -1.216e+03   w:  1.192e+03, b: 7.32800e+02
Iteration    3: Cost 7.19e+06  dj_dw:  3.429e+03, dj_db:  2.121e+03   w: -1.551e+03, b:-9.63840e+02
Iteration    4: Cost 2.18e+07  dj_dw: -5.974e+03, dj_db: -3.691e+03   w:  3.228e+03, b: 1.98886e+03
Iteration    5: Cost 6.62e+07  dj_dw:  1.040e+04, dj_db:  6.431e+03   w: -5.095e+03, b:-3.15579e+03
Iteration    6: Cost 2.01e+08  dj_dw: -1.812e+04, dj_db: -1.120e+04   w:  9.402e+03, b: 5.80237e+03
Iteration    7: Cost 6.09e+08  dj_dw:  3.156e+04, dj_db:  1.950e+04   w: -1.584e+04, b:-9.80139e+03
Iteration    8: Cost 1.85e+09  dj_dw: -5.496e+04, dj_db: -3.397e+04   w:  2.813e+04, b: 1.73730e+04
Iteration    9: Cost 5.60e+09  dj_dw:  9.572e+04, dj_db:  5.916e+04   w: -4.845e+04, b:-2.99567e+04

Above, $w$ and $b$ are bouncing back and forth between positive and negative, with the absolute value increasing with each iteration. Further, each iteration $\frac{\partial J(w,b)}{\partial w}$ changes sign and cost is increasing rather than decreasing. This is a clear sign that the learning rate is too large and the solution is diverging.
Let’s visualize this with a plot.

plt_divergence(p_hist, J_hist,x_train, y_train)
plt.show()

Output

在这里插入图片描述

Above, the left graph shows $w$'s progression over the first few steps of gradient descent. $w$ oscillates from positive to negative and the cost grows rapidly. Gradient descent is operating on both $w$ and $b$ simultaneously, so one needs the 3-D plot on the right for the complete picture.
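The runaway behavior above suggests a simple practical guard. The sketch below (my own, not course code) reimplements the descent loop on the same two-point dataset and stops early as soon as the cost increases between iterations:

```python
# Divergence guard: stop gradient descent as soon as the cost goes up,
# a practical sign that the learning rate alpha is too large.
import numpy as np

x = np.array([1.0, 2.0])
y = np.array([300.0, 500.0])

def cost(w, b):
    return np.mean((w * x + b - y) ** 2) / 2

def grads(w, b):
    err = w * x + b - y
    return np.mean(err * x), np.mean(err)

def descend(alpha, steps=10):
    w = b = 0.0
    prev = cost(w, b)
    for i in range(steps):
        dw, db = grads(w, b)
        w, b = w - alpha * dw, b - alpha * db
        c = cost(w, b)
        if c > prev:          # cost increased: diverging, bail out
            return i, False
        prev = c
    return steps, True

print(descend(0.01))  # small alpha: completes all steps
print(descend(0.8))   # the diverging alpha from above: stops early
```

In a real training loop one would typically log the event and retry with a smaller learning rate rather than silently continue.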

Congratulations!

In this lab you:

  • delved into the details of gradient descent for a single variable.
  • developed a routine to compute the gradient
  • visualized what the gradient is
  • completed a gradient descent routine
  • utilized gradient descent to find parameters
  • examined the impact of sizing the learning rate

[7] Practice quiz: Train the model with gradient descent

Latest Submission Grade: 100%

在这里插入图片描述

在这里插入图片描述

Other notes

Hyperparameter Tuning Regularization and Optimization

English pronunciation

derivative term: 导数项, dəˈrivədiv

intersect: 相交

hammock: 吊床 ˈhamək

square of xxx: xxx的平方

parabola: 抛物线, pəˈrabələ

$f_{w,b}(x) = wx + b$

How to say this equation in English:

Your function can be written as f_w, b of x equals, I'm going to use, w times x plus b.

$f_{w,b}(x)$: f subscript w, b of x

$wx + b$: w times x plus b

$x^{(i)}$: x superscript i in parentheses

anomaly:异常的 əˈnäməlē

malignant: məˈliɡnənt 恶性的,有害的

backslash: 反斜线 \
