C1 Week 2: (1) Linear Regression with Multiple Variables

Preface: These notes were taken while studying the newly released Machine Learning course by Andrew Ng's team on Coursera. The new version is taught in Python and contains minor changes compared with the older version.
Official site: https://www.coursera.org/specialization/machine-learning-introduction
Course video link: (Highly recommended | dual subtitles) 2022 Andrew Ng Machine Learning Deeplearning.ai course - bilibili

Course 1 Week 2 - Regression with Multiple Input Variables covers the following:
(1)Linear Regression with Multiple Variables
Week 2: (1) Linear Regression with Multiple Variables - CSDN blog
(2)Practical Tips for Linear Regression
Week 2: (2) Practical Tips for Linear Regression - CSDN blog


Catalog

1.Multiple Features

2.Vectorization Part

2.1 Vectorization Part 1

2.2 Vectorization Part 2

3.Gradient Descent for Multiple Linear Regression


1.Multiple Features

->Let's start by looking at the version of linear regression that looks at not just one feature, but a lot of different features. In the original version of linear regression, you had a single feature x, the size of the house, and you were able to predict y, the price of the house. The model was f_wb of x equals wx plus b.

->But now, what if you did not only have the size of the house as a feature with which to try to predict the price, but you also knew the number of bedrooms, the number of floors, and the age of the home in years? It seems like this would give you a lot more information with which to predict the price. To introduce a little bit of new notation, we're going to use the variables X_1, X_2, X_3, and X_4 to denote the four features. For simplicity, let's introduce a little bit more notation. We'll write X subscript j, or sometimes I'll just say for short X sub j, to represent the list of features. Here, j will go from one to four, because we have four features. I'm going to use lowercase n to denote the total number of features, so in this example n is equal to 4. As before, we'll use X superscript i to denote the i-th training example. So here X superscript i is actually going to be a list of four numbers, or sometimes we'll call this a vector that includes all the features of the i-th training example.
->As a concrete example, X superscript 2 in parentheses will be a vector of the features of the second training example, so it will be equal to [1416, 3, 2, 40]. Technically, I'm writing these numbers in a row, so sometimes this is called a row vector rather than a column vector. To refer to a specific feature in the i-th training example, I will write X superscript i, subscript j. So for example, X superscript 2 subscript 3 will be the value of the third feature, that is, the number of floors in the second training example, and so that's going to be equal to 2. Sometimes, in order to emphasize that this X^2 is not a number but is actually a list of numbers, that is, a vector, we'll draw an arrow on top of it just to visually show that it is a vector, and over here as well, but you don't have to draw this arrow in your notation. You can think of the arrow as an optional signifier. It's sometimes used just to emphasize that this is a vector and not a number.
->Now that we have multiple features, let's take a look at what a model would look like. Previously, this is how we defined the model, where X was a single feature, so a single number. But now with multiple features, we're going to define it differently. Instead, the model will be f_wb of X equals w1x1 plus w2x2 plus w3x3 plus w4x4 plus b. Concretely, for housing price prediction, one possible model may be that we estimate the price of the house as 0.1 times X_1, the size of the house, plus 4 times X_2, the number of bedrooms, plus 10 times X_3, the number of floors, minus 2 times X_4, the age of the house in years, plus 80. Let's think a bit about how you might interpret these parameters. If the model is trying to predict the price of the house in thousands of dollars, you can think of this b equals 80 as saying that the base price of a house starts off at maybe $80,000, assuming it has no size, no bedrooms, no floors, and no age. You can think of this 0.1 as saying that maybe for every additional square foot, the price will increase by 0.1 times $1,000, or by $100, because we're saying that for each square foot, the price increases by 0.1 times $1,000, which is $100. Maybe for each additional bedroom, the price increases by $4,000; for each additional floor, the price may increase by $10,000; and for each additional year of the house's age, the price may decrease by $2,000, because the parameter is negative 2. In general, if you have n features, then the model will look like this.
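As a quick check of the arithmetic, here is a minimal Python sketch of this example model. The parameter values (0.1, 4, 10, -2, 80) and the feature values [1416, 3, 2, 40] are the ones used in the lecture; the function name `predict_price` is just for illustration.

```python
def predict_price(size_sqft, bedrooms, floors, age_years):
    """Estimate the house price (in $1000s) with the example parameters from the lecture."""
    w1, w2, w3, w4, b = 0.1, 4, 10, -2, 80
    return w1 * size_sqft + w2 * bedrooms + w3 * floors + w4 * age_years + b

# The second training example from the lecture: [1416, 3, 2, 40]
print(predict_price(1416, 3, 2, 40))  # 141.6 + 12 + 20 - 80 + 80 = 173.6, i.e. $173,600
```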

->Here again is the definition of the model with n features. What we're going to do next is introduce a little bit of notation to rewrite this expression in a simpler but equivalent way. Let's define W as a list of numbers that lists the parameters W_1, W_2, W_3, all the way through W_n. In mathematics, this is called a vector, and sometimes, to designate that this is a vector, which just means a list of numbers, I'm going to draw a little arrow on top. You can think of this little arrow as just an optional signifier to remind us that this is a vector. If you've taken a linear algebra class before, you might recognize that this is a row vector as opposed to a column vector. Next, same as before, b is a single number and not a vector, and so this vector W together with this number b are the parameters of the model. Let me also write X as a list, or a vector, again a row vector, that lists all of the features X_1, X_2, X_3, up to X_n. This is again a vector, so I'm going to add a little arrow on top to signify it. In the notation up on top, we can also add little arrows here and here to signify that W and X are actually these lists of numbers, that they're actually these vectors. With this notation, the model can now be rewritten more succinctly as f of x equals the vector w dot, and this dot refers to a dot product from linear algebra, the vector x, plus the number b.
->What is this dot product thing? Well, the dot product of two vectors, of two lists of numbers W and X, is computed by taking the corresponding pairs of numbers, W_1 and X_1, multiplying them, W_2 and X_2, multiplying them, W_3 and X_3, multiplying them, all the way up to W_n and X_n, multiplying them, and then summing up all of these products. Writing that out, this means that the dot product is equal to W_1 X_1 plus W_2 X_2 plus W_3 X_3, all the way up to W_n X_n. Then finally we add back in the b on top. You notice that this gives us exactly the same expression as we had on top. The dot product notation lets you write the model in a more compact form with fewer characters. The name for this type of linear regression model with multiple input features is multiple linear regression. This is in contrast to univariate regression, which has just one feature. By the way, you might think this algorithm is called multivariate regression, but that term actually refers to something else that we won't be using here, so I'm going to refer to this model as multiple linear regression. That's it for linear regression with multiple features, which is also called multiple linear regression. In order to implement this, there's a really neat trick called vectorization, which will make it much simpler to implement this and many other learning algorithms.
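Written out in standard notation, this is just the formula from the paragraph above, rendered in LaTeX for readability:

```latex
f_{\vec{w},b}(\vec{x}) \;=\; \vec{w}\cdot\vec{x} + b \;=\; \sum_{j=1}^{n} w_j x_j + b \;=\; w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b
```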

2.Vectorization Part

2.1 Vectorization Part 1

->When you're implementing a learning algorithm, using vectorization will both make your code shorter and also make it run much more efficiently. Learning how to write vectorized code will allow you to also take advantage of modern numerical linear algebra libraries, as well as maybe even GPU hardware; GPU stands for graphics processing unit. This is hardware originally designed to speed up computer graphics in your computer, but it turns out that, when you write vectorized code, it can also help you execute your code much more quickly.
->Let's look at a concrete example of what vectorization means. Here's an example with parameters w and b, where w is a vector with three numbers, and you also have a vector of features x, also with three numbers. Here n is equal to 3. Notice that in linear algebra, the index or the counting starts from 1, and so the first value is subscripted w1 and x1. In Python code, you can define these variables w, b, and x using arrays like this. Here, I'm actually using a numerical linear algebra library in Python called NumPy, which is by far the most widely used numerical linear algebra library in Python and in machine learning. Because in Python the indexing of arrays, or counting in arrays, starts from 0, you would access the first value of w using w square bracket 0, the second value using w square bracket 1, and the third value using w square bracket 2. The indexing here goes from 0, 1 to 2 rather than 1, 2 to 3. Similarly, to access individual features of x, you would use x[0], x[1], and x[2]. Many programming languages, including Python, start counting from 0 rather than 1.
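A minimal NumPy sketch of the setup described here (the specific numbers are placeholders, not values from the lecture):

```python
import numpy as np

w = np.array([1.0, 2.5, -3.3])  # parameters w1, w2, w3 (placeholder values)
b = 4.0
x = np.array([10, 20, 30])      # features x1, x2, x3 (placeholder values)

# NumPy indexing starts at 0, so w1 is w[0], w2 is w[1], and w3 is w[2].
print(w[0], w[1], w[2])   # 1.0 2.5 -3.3
print(x[0], x[1], x[2])   # 10 20 30
```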
【example-computing the model's prediction】

->Now, let's look at an implementation without vectorization for computing the model's prediction. In code, it will look like this. You take each parameter w and multiply it by its associated feature. Now, you could write your code like this, but what if n isn't three? If instead n is 100 or 100,000, it is both inefficient for you to write the code and inefficient for your computer to compute. Here's another way, still without using vectorization but using a for loop. In math, you can use a summation operator to add all the products of w_j and x_j for j equals 1 through n. Then outside the summation, you add b at the end. The summation goes from j equals 1 up to and including n. For n equals 3, j therefore goes from 1, 2 to 3. In code, you can initialize f to 0. Then for j in range from 0 to n, which actually makes j go from 0 to n minus 1, that is from 0, 1 to 2, you can add to f the product of w_j times x_j. Finally, outside the for loop, you add b. Notice that in Python, the range 0 to n means that j goes from 0 all the way to n minus 1 and does not include n itself. This is usually written range(n) in Python, but in this video I added a 0 here just to emphasize that it starts from 0. While this implementation is a bit better than the first one, it still doesn't use vectorization, and isn't that efficient.
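A sketch of the for-loop version described above (not the lab's exact code; w, x, and b are assumed to be NumPy variables like those defined earlier):

```python
def predict_loop(x, w, b):
    """Compute f_wb(x) = w1*x1 + ... + wn*xn + b without vectorization."""
    n = x.shape[0]
    f = 0
    for j in range(0, n):       # j runs from 0 to n-1
        f = f + w[j] * x[j]     # add each product w_j * x_j
    f = f + b                   # add b outside the loop
    return f
```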
->Now, let's look at how you can do this using vectorization. This is the math expression of the function f, which is the dot product of w and x plus b, and now you can implement this with a single line of code by computing f equals np dot dot. I said dot dot because the first dot is the period and the second dot is the function, or the method, called dot. That is, f equals np.dot(w, x), and this implements the mathematical dot product between the vectors w and x. Then finally, you can add b to it at the end. This NumPy dot function is a vectorized implementation of the dot product operation between two vectors, and especially when n is large, this will run much faster than the two previous code examples.
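As a sketch, the vectorized version is just the single np.dot line:

```python
import numpy as np

def predict_dot(x, w, b):
    """Vectorized prediction using NumPy's dot product."""
    return np.dot(w, x) + b     # computes w1*x1 + ... + wn*xn, then adds b
```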
->I want to emphasize that vectorization actually has two distinct benefits. First, it makes code shorter; it is now just one line of code. Isn't that cool? Second, it also results in your code running much faster than either of the two previous implementations that did not use vectorization. The reason the vectorized implementation is much faster is that, behind the scenes, the NumPy dot function is able to use parallel hardware in your computer, and this is true whether you're running this on a normal computer, that is on a normal computer CPU, or if you are using a GPU, a graphics processing unit, that's often used to accelerate machine learning jobs. The ability of the NumPy dot function to use parallel hardware makes it much more efficient than the for loop or the sequential calculation that we saw previously. By comparison, the for-loop version is much more practical than the first one when n is large, because you are not typing w0 times x0 plus w1 times x1 plus lots of additional terms like you would have had for the first version; but while that saves a lot on the typing, it is still not that computationally efficient, because it still doesn't use vectorization. To recap, vectorization makes your code shorter, so hopefully easier to write and easier for you or others to read, and it also makes it run much faster. Let's take a look at what your computer is actually doing behind the scenes to make vectorized code run so much faster.
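If you want a rough sense of the speed difference on your own machine (the optional lab does something similar), here is a sketch; the exact timings depend on your hardware, and n = 1,000,000 is just an arbitrarily large size chosen for illustration:

```python
import time
import numpy as np

n = 1_000_000                          # arbitrary large number of "features"
rng = np.random.default_rng(0)
w = rng.random(n)
x = rng.random(n)
b = 0.5

start = time.perf_counter()
f_loop = 0.0
for j in range(n):                     # sequential: one multiply-add per step
    f_loop += w[j] * x[j]
f_loop += b
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
f_vec = np.dot(w, x) + b               # vectorized: NumPy uses optimized routines
vec_seconds = time.perf_counter() - start

print(f"for loop: {loop_seconds:.4f} s, np.dot: {vec_seconds:.6f} s")
```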

2.2 Vectorization Part 2

->Let's take a deeper look at how a vectorized implementation may work on your computer behind the scenes. Let's look at this for loop. A for loop like this runs without vectorization. If j ranges from 0 to say 15, this piece of code performs operations one after another. At the first time-step, which I'm going to write as t0, it first operates on the values at index 0. At the next time-step, it calculates the values corresponding to index 1, and so on until the 15th step, where it computes the values at index 15. In other words, it carries out these computations one step at a time, one step after another. In contrast, this function in NumPy is implemented in the computer hardware with vectorization. The computer can get all the values of the vectors w and x, and in a single step it multiplies each pair of w and x with each other all at the same time in parallel. Then after that, the computer takes these 16 numbers and uses specialized hardware to add them all together very efficiently, rather than needing to carry out distinct additions one after another to add up these 16 numbers.
->This means that code with vectorization can perform calculations in much less time than code without vectorization. This matters more when you're running algorithms on large data sets or trying to train large models, which is often the case with machine learning. That's why being able to write vectorized implementations of learning algorithms has been a key step to getting learning algorithms to run efficiently, and therefore scale well to the large datasets that many modern machine learning algorithms now have to operate on.

【without vectorization】
->Now, let's take a look at a concrete example of how this helps with implementing multiple linear regression, that is, linear regression with multiple input features. Say you have a problem with 16 features and 16 parameters, w1 through w16, in addition to the parameter b. You calculate 16 derivative terms for these 16 weights, and in code, maybe you store the values of w and d in two np.arrays, with d storing the values of the derivatives. For this example, I'm just going to ignore the parameter b. Now, you want to compute an update for each of these 16 parameters. W_j is updated to w_j minus the learning rate, say 0.1, times d_j, for j from 1 through 16. In code without vectorization, you would be doing something like this: update w1 to be w1 minus the learning rate 0.1 times d1; next, update w2 similarly; and so on through w16, updated as w16 minus 0.1 times d16. In code without vectorization, you can also use a for loop like this: for j in range(0, 16), which again goes from 0 to 15, set w_j equals w_j minus 0.1 times d_j.
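A sketch of this non-vectorized update for the 16-parameter example (the values in w and d are placeholders; in practice d would hold the derivatives you computed):

```python
import numpy as np

w = np.random.rand(16)      # placeholder parameter values w1..w16
d = np.random.rand(16)      # placeholder derivative values d1..d16
alpha = 0.1                 # the learning rate used in the example

for j in range(0, 16):      # j runs from 0 to 15
    w[j] = w[j] - alpha * d[j]
```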
【with vectorization】
->In contrast, with vectorization, you can imagine the computer's parallel processing hardware like this. It takes all 16 values in the vector w and subtracts, in parallel, 0.1 times all 16 values in the vector d, and assigns all 16 calculations back to w all at the same time and all in one step. In code, you can implement this as follows: w is assigned to w minus 0.1 times d. Behind the scenes, the computer takes these NumPy arrays, w and d, and uses parallel processing hardware to carry out all 16 computations efficiently. Using a vectorized implementation, you should get a much more efficient implementation of linear regression.
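The vectorized version collapses the loop above into one NumPy expression (assuming the same w and d arrays as in the previous sketch):

```python
w = w - 0.1 * d             # all 16 updates happen in one vectorized operation
```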
->Maybe the speed difference won't be huge if you have 16 features, but if you have thousands of features and perhaps very large training sets, this type of vectorized implementation will make a huge difference in the running time of your learning algorithm. It could be the difference between code finishing in one or two minutes, versus taking many hours to do the same thing.
【About optional lab】
->In the optional lab that follows this video, you'll see an introduction to one of the most used Python libraries in machine learning, which we've already touched on in this video, called NumPy. You'll see how to create vectors in code; these vectors, or lists of numbers, are called NumPy arrays. You'll also see how to take the dot product of two vectors using a NumPy function called dot, and how vectorized code, such as code using the dot function, can run much faster than a for loop. In fact, you'll get to time this code yourself, and hopefully see it run much faster. This optional lab introduces a fair amount of new NumPy syntax, so when you read through the optional lab, please don't feel like you have to understand all the code right away; you can save this notebook and use it as a reference to look at when you're working with data stored in NumPy arrays.

3.Gradient Descent for Multiple Linear Regression

->You've learned about gradient descent, about multiple linear regression, and also vectorization. Let's put it all together to implement gradient descent for multiple linear regression with vectorization. Let's quickly review what multiple linear regression looks like. We have parameters w_1 to w_n as well as b. But instead of thinking of w_1 to w_n as separate numbers, that is separate parameters, let's start to collect all of the w's into a vector w, so that now w is a vector of length n. We're just going to think of the parameters of this model as a vector w, as well as b, where b is still a number, same as before. Whereas before we defined multiple linear regression like this, now using vector notation we can write the model as f_w,b of x equals the vector w dot product with the vector x, plus b. Remember that this dot here means the dot product. Our cost function can be defined as J of w_1 through w_n, b. But instead of thinking of J as a function of these n different parameters w_j as well as b, we're going to write J as a function of the parameter vector w and the number b. That is, w_1 through w_n is replaced by the vector w, and J now takes as input a vector w and a number b and returns a number. Here's what gradient descent looks like. We're going to repeatedly update each parameter w_j to be w_j minus Alpha times the derivative of the cost J, where J has parameters w_1 through w_n and b. Once again, we just write this as J of vector w and number b.
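In the course's notation, the model and the cost function can be written as:

```latex
f_{\vec{w},b}(\vec{x}) = \vec{w}\cdot\vec{x} + b, \qquad
J(\vec{w},b) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}\bigr)^2
```

and the gradient descent updates, repeated until convergence, are:

```latex
w_j := w_j - \alpha \frac{\partial}{\partial w_j} J(\vec{w},b) \quad\text{for } j = 1,\dots,n, \qquad
b := b - \alpha \frac{\partial}{\partial b} J(\vec{w},b)
```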

->Let's see what this looks like when you implement gradient descent, and in particular, let's take a look at the derivative term. We'll see that gradient descent becomes just a little bit different with multiple features compared to just one feature. Here's what we had when we had gradient descent with one feature. We had an update rule for w and a separate update rule for b. Hopefully, these look familiar to you. This term here is the derivative of the cost function J with respect to the parameter w. Similarly, we have an update rule for the parameter b. With univariate regression, we had only one feature; we called that feature x^i without any subscript. Now, here's the new notation for when we have n features, where n is two or more. We get this update rule for gradient descent: update w_1 to be w_1 minus Alpha times this expression here, and this formula is actually the derivative of the cost J with respect to w_1. The formula for the derivative of J with respect to w_1 on the right looks very similar to the case of one feature on the left. The error term still takes the prediction f of x minus the target y. One difference is that w and x are now vectors, and just as w on the left has now become w_1 here on the right, x^i here on the left is now instead x^i_1 here on the right, and this is just for j equals 1. For multiple linear regression, we have j ranging from 1 through n, and so we'll update the parameters w_1, w_2, all the way up to w_n, and then as before, we'll update b.
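Putting the pieces together, here is a minimal sketch of vectorized gradient descent for multiple linear regression. The function names, alpha, and num_iters are my own choices for illustration; the optional lab has its own version of this code.

```python
import numpy as np

def compute_gradient(X, y, w, b):
    """Gradients of the squared-error cost.
    X: (m, n) matrix of training examples, y: (m,) targets,
    w: (n,) parameter vector, b: scalar."""
    m = X.shape[0]
    err = X @ w + b - y          # f_wb(x^(i)) - y^(i) for all m examples at once
    dj_dw = (X.T @ err) / m      # dJ/dw_j = (1/m) * sum_i err_i * x_j^(i)
    dj_db = np.sum(err) / m      # dJ/db   = (1/m) * sum_i err_i
    return dj_dw, dj_db

def gradient_descent(X, y, w_init, b_init, alpha=0.01, num_iters=1000):
    """Repeatedly apply w_j := w_j - alpha * dJ/dw_j and b := b - alpha * dJ/db."""
    w, b = w_init.copy(), b_init
    for _ in range(num_iters):
        dj_dw, dj_db = compute_gradient(X, y, w, b)
        w = w - alpha * dj_dw    # update all w_j simultaneously (vectorized)
        b = b - alpha * dj_db
    return w, b
```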

【normal equation】

->I want to make a quick aside, or a quick side note, on an alternative way for finding w and b for linear regression. This method is called the normal equation. Whereas gradient descent is a great method for minimizing the cost function J to find w and b, there is one other algorithm for solving for w and b that works only for linear regression, and for pretty much none of the other algorithms you'll see in this specialization, and this other method does not need an iterative gradient descent algorithm. Called the normal equation method, it turns out to be possible to use an advanced linear algebra library to just solve for w and b all in one go, without iterations. Some disadvantages of the normal equation method are: first, unlike gradient descent, it does not generalize to other learning algorithms, such as the logistic regression algorithm that you'll learn about next week, or the neural networks and other algorithms you'll see later in this specialization. The normal equation method is also quite slow if the number of features n is large.
->Almost no machine learning practitioners should implement the normal equation method themselves, but if you're using a mature machine learning library and call linear regression, there is a chance that on the back end it'll be using this to solve for w and b. If you're ever in a job interview and hear the term normal equation, that's what it refers to. Don't worry about the details of how the normal equation works. Just be aware that some machine learning libraries may use this complicated method in the back end to solve for w and b. But for most learning algorithms, including how you implement linear regression yourself, gradient descent offers a better way to get the job done.
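For completeness, here is a sketch of what "solving in one go" means. This is not the course's code, just the standard closed-form least-squares solution implemented with NumPy; libraries may use different (and more robust) solvers internally.

```python
import numpy as np

def normal_equation(X, y):
    """Solve for w and b in one step, with no iterations.
    X: (m, n) feature matrix, y: (m,) targets."""
    m = X.shape[0]
    Xb = np.c_[X, np.ones(m)]                      # append a column of 1s for the intercept b
    theta = np.linalg.pinv(Xb.T @ Xb) @ Xb.T @ y   # normal equation: (X^T X)^-1 X^T y
    return theta[:-1], theta[-1]                   # w vector and scalar b
```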
【About optional lab】
->In the optional lab that follows this video, you'll see how to define a multiple regression model in code and also how to calculate the prediction f of x. You'll also see how to calculate the cost and implement gradient descent for a multiple linear regression model. This will be using Python's NumPy library. If any of the code looks very new, that's okay, but you should feel free to take a look at the previous optional lab that introduces NumPy and vectorization, for a refresher on NumPy functions and how to implement those in code. That's it. You now know multiple linear regression. This is probably the single most widely used learning algorithm in the world today. But there's more. With just a few tricks, such as picking and scaling features appropriately and also choosing the learning rate Alpha appropriately, you can really make this work much better.
