

Linear Regression

Linear regression is a statistical method that models the relationship between two numerical variables. It is a linear model: it assumes that the relationship between the variables is linear.

It works with an input variable (x) and an output variable (y). The model assumes that y can be calculated from a linear combination of the input variables (x).

Linear Regression Model Representation

Linear regression can be expressed in terms of an equation as:



y = B0 + B1 * x

Where x is the input variable. ‘B’ stands for the Greek letter beta and denotes the coefficients: B1 is the scale factor assigned to the input variable, and B0 is an additional coefficient that provides the intercept, or bias.
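
As a minimal sketch of this equation in Python, the function below evaluates y for a given x. The function name and the coefficient values are illustrative assumptions, not fitted results.

```python
def predict(x, b0, b1):
    """Evaluate the simple linear model y = B0 + B1 * x."""
    return b0 + b1 * x

# Hypothetical coefficients, chosen only for illustration.
b0, b1 = 2.0, 0.5
y = predict(10.0, b0, b1)  # 2.0 + 0.5 * 10.0
```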

Types of Linear Regression

Simple Linear Regression: It uses a single input variable (x) to predict the output variable (y).


Example: Predicting the price of a house from its square footage. Here, the square footage of the house is the input variable and the price of the house is the output variable.
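
One way to fit such a model is the closed-form least-squares solution, sketched below in plain Python. The helper name `fit_simple` and the square-footage and price figures are made up for illustration.

```python
def fit_simple(xs, ys):
    """Return (b0, b1) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx                # slope
    b0 = mean_y - b1 * mean_x     # intercept
    return b0, b1

# Made-up data: square footage and price (in thousands).
sqft = [1000.0, 1500.0, 2000.0, 2500.0]
price = [200.0, 275.0, 350.0, 425.0]

b0, b1 = fit_simple(sqft, price)
predicted = b0 + b1 * 1800.0      # predicted price for an 1800 sq ft house
```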

Multiple Regression: It uses more than one input variable to predict the output variable (y).


Example: Predicting the house price from the area of the house, the number of rooms, and HouseStyle. Here, the area, the number of rooms, and HouseStyle are the input variables, and the house price is the output variable.
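
With several inputs, the coefficients can be estimated with a least-squares solver. The sketch below uses NumPy's `np.linalg.lstsq`; the data are made up, and encoding HouseStyle as 0 or 1 is an assumption made only to keep the example numeric.

```python
import numpy as np

# Made-up data: columns are area (sq ft), number of rooms, and a
# numeric code for HouseStyle (the 0/1 encoding is an assumption).
X = np.array([
    [1000.0, 2.0, 0.0],
    [1500.0, 3.0, 1.0],
    [2000.0, 3.0, 0.0],
    [2500.0, 4.0, 1.0],
    [1800.0, 2.0, 1.0],
])
true_coef = np.array([0.1, 20.0, 15.0])
y = 30.0 + X @ true_coef   # prices generated from a known linear rule

# Prepend a column of ones so that the first coefficient is B0.
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
b0, b = coef[0], coef[1:]
```

Because the prices here were generated from a known rule without noise, the solver should recover that rule's intercept and coefficients up to rounding.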

Regularization

Regularization is a technique that adds information to the regression equation, in the form of a penalty on the coefficients, which shrinks them toward zero (in some cases exactly to zero). It is used to avoid overfitting and to keep the model from becoming too complex. It is especially useful when there is collinearity among the input variables.

Types of Regularization-Based Regression

Lasso Regression: It is also known as L1 Regularization. It is a procedure where Ordinary Least Squares is modified to also minimize the sum of the absolute values of the coefficients.

Example: When there are 10,000 features available to predict a variable, the Lasso model keeps only a few coefficients and sets the rest to zero.
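
The way Lasso sets coefficients to exactly zero can be sketched with coordinate descent and soft-thresholding. This is one common way to solve the Lasso problem, not necessarily what any particular library does. The function names, the synthetic data, and the choice of `alpha` are assumptions, and the sketch assumes centered data so that no intercept is needed.

```python
import numpy as np

def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within alpha become 0."""
    return np.sign(rho) * max(abs(rho) - alpha, 0.0)

def lasso_coordinate_descent(X, y, alpha, n_iters=100):
    """Minimize (1/(2n)) * ||y - X b||^2 + alpha * sum(|b_j|).

    Assumes X and y are centered, so no intercept is fitted.
    """
    n, p = X.shape
    b = np.zeros(p)
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution removed.
            r = y - X @ b + X[:, j] * b[j]
            rho = X[:, j] @ r / n
            z = X[:, j] @ X[:, j] / n
            b[j] = soft_threshold(rho, alpha) / z
    return b

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
y = 3.0 * X[:, 0]                 # only the first feature matters

b = lasso_coordinate_descent(X, y, alpha=0.1)
```

In this synthetic setup the first coefficient is shrunk slightly below 3, while the two irrelevant coefficients end up at zero.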


Ridge Regression: It is also known as L2 Regularization. It is a procedure where Ordinary Least Squares is modified to also minimize the sum of the squared coefficients. When the coefficients used in the regression are unbalanced, we introduce an alpha value, which controls the strength of the penalty, to improve the model.

Example: When we are trying to predict the sales of outlets and the type of outlet has a much higher weight than the items sold there, we introduce alpha, which reduces the size of the coefficients.
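
The effect of alpha can be sketched with the closed-form ridge solution. The function name `ridge_fit`, the synthetic data, and the alpha values are assumptions for illustration, and the data are centered so that no intercept is fitted.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha * I) b = X^T y for the ridge coefficients.

    Assumes X and y are centered, so no intercept is fitted.
    """
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X -= X.mean(axis=0)
y = X @ np.array([5.0, 0.5, -0.2])   # one coefficient dominates the others

b_ols = ridge_fit(X, y, alpha=0.0)     # alpha = 0 is plain least squares
b_ridge = ridge_fit(X, y, alpha=10.0)  # a larger alpha shrinks the coefficients
```

Comparing `np.sum(b_ridge ** 2)` with `np.sum(b_ols ** 2)` shows the shrinkage: the ridge coefficients have the smaller sum of squares.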

Gradient Descent

Gradient descent is a process that optimizes the coefficients by iteratively reducing the error of the model on the training data. At each iteration, the coefficients are updated in the direction that reduces the error, with the size of the update scaled by a learning rate. The process repeats until the sum of squared errors reaches a minimum or no further improvement is possible.

The learning rate is the size of the improvement step taken at each iteration of the procedure and should be chosen carefully: too large a value can overshoot the minimum, and too small a value makes convergence slow.
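
A minimal sketch of this procedure for the simple model y = B0 + B1 * x is shown below. The function name, the learning rate of 0.05, the iteration count, and the data are all assumptions chosen so that the example converges.

```python
def gradient_descent(xs, ys, learning_rate=0.05, n_iters=5000):
    """Fit y = b0 + b1 * x by gradient descent on the mean squared error."""
    n = len(xs)
    b0, b1 = 0.0, 0.0
    for _ in range(n_iters):
        errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
        grad_b0 = 2.0 / n * sum(errors)
        grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        # Step against the gradient, scaled by the learning rate.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]    # generated from y = 1 + 2 * x

b0, b1 = gradient_descent(xs, ys)
```

On this noise-free data the coefficients should approach B0 = 1 and B1 = 2. With a much larger learning rate the same loop would diverge instead.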


Types of Gradient Descent

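Stochastic Gradient Descent: This method iterates through the training set and, for each training example it comes across, updates the parameters according to the error gradient of that single example only.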


Example: If the training data has 200 samples, the parameters are updated 200 times in one pass over the data, once for every individual sample.

Batch Gradient Descent: This method looks at every example in the entire training set on every step and updates the parameters from the error gradient computed over all of them.


Example: If the training set has 100 samples, the parameters of the model are updated only once per pass, based on all 100 examples.
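
The difference in update counts from the two examples can be sketched as follows. The function names and data are hypothetical, and the stochastic version visits the samples in their stored order for simplicity, whereas practical implementations usually shuffle them.

```python
def sgd_epoch(xs, ys, b0, b1, learning_rate):
    """One pass of stochastic gradient descent: one update per sample."""
    updates = 0
    for x, y in zip(xs, ys):
        error = (b0 + b1 * x) - y
        b0 -= learning_rate * 2.0 * error
        b1 -= learning_rate * 2.0 * error * x
        updates += 1
    return b0, b1, updates

def batch_epoch(xs, ys, b0, b1, learning_rate):
    """One pass of batch gradient descent: one update from all samples."""
    n = len(xs)
    errors = [(b0 + b1 * x) - y for x, y in zip(xs, ys)]
    grad_b0 = 2.0 / n * sum(errors)
    grad_b1 = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    return b0 - learning_rate * grad_b0, b1 - learning_rate * grad_b1, 1

xs = [i / 200.0 for i in range(200)]      # 200 made-up samples
ys = [1.0 + 2.0 * x for x in xs]

_, _, sgd_updates = sgd_epoch(xs, ys, 0.0, 0.0, 0.01)
_, _, batch_updates = batch_epoch(xs, ys, 0.0, 0.0, 0.01)
```

Here `sgd_updates` is 200 and `batch_updates` is 1 for a single pass over the same 200 samples.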


Regression Line Properties

Considering regression coefficients as B0 and B1, the line has the following properties:


  • The line minimizes the sum of squared differences between the actual values and predicted values.

  • The regression line passes through the point defined by the mean of the X values and the mean of the Y values.

  • B0 is the y-intercept of the regression line.

  • B1 is the average change in Y for 1-unit change in X. It is also known as the slope of the regression line.


The least-squares regression line is the only straight line that has all of these properties.
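
The first two properties can be checked numerically. The sketch below fits a line with the closed-form least-squares formulas and then tests it. The helper names and the data are made up for illustration.

```python
def fit_simple(xs, ys):
    """Return (b0, b1) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b1 * mean_x, b1

def sse(xs, ys, b0, b1):
    """Sum of squared differences between actual and predicted values."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]
b0, b1 = fit_simple(xs, ys)

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
value_at_mean_x = b0 + b1 * mean_x     # should equal mean_y

best = sse(xs, ys, b0, b1)
shifted = sse(xs, ys, b0 + 0.1, b1)    # any other line has a larger error
tilted = sse(xs, ys, b0, b1 - 0.1)
```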


Defining the Relationship Between Input and Output Variable

When B1 > 0, the x and y variables have a positive relationship: as x increases, y increases.

When B1 < 0, the x and y variables have a negative relationship: as x increases, y decreases.

For example, when we are trying to predict the house price, the house type and the number of rooms used to define the model are the input variables, and the house price is the output variable.
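
A small sketch of how the sign of B1 follows the direction of the data, using the least-squares slope formula. The function name and both data series are made up.

```python
def slope(xs, ys):
    """Least-squares estimate of B1 for the line y = B0 + B1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))

xs = [1.0, 2.0, 3.0, 4.0]
rising = [10.0, 12.0, 15.0, 19.0]    # y grows as x grows
falling = [19.0, 15.0, 12.0, 10.0]   # y shrinks as x grows

b1_rising = slope(xs, rising)        # positive relationship
b1_falling = slope(xs, falling)      # negative relationship
```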


How To Check Model Performance?

We plot the actual values and the predicted values on a graph. The main idea is to find the line that best fits the data. The best line is the one for which the total prediction error is the smallest. The error is the distance between a data point and the regression line.


Error is squared so that positive and negative differences do not cancel each other.
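
A tiny illustration of the cancellation, with made-up actual and predicted values:

```python
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [4.0, 4.0, 8.0, 8.0]

errors = [a - p for a, p in zip(actual, predicted)]   # -1, 1, -1, 1
raw_sum = sum(errors)                        # the errors cancel out
squared_sum = sum(e ** 2 for e in errors)    # the squared errors do not
```

The raw sum is zero even though every prediction is off by one, while the sum of squares is four.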


R-Squared Value

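The R-squared value ranges from 0 to 1, where 0 means that the predictor x explains none of the variation in y and 1 means that the predictor explains all of the variation in y. For a least-squares fit it is computed from the following sums of squares as R-squared = SSR / SSTO = 1 - SSE / SSTO.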


1. Regression sum of squares (SSR): It tells how far the predicted values on the regression line are from the mean of the actual output values.



2. Sum of squared errors (SSE): It tells how much the actual y values differ from the predicted values.


3. Total sum of squares (SSTO): It tells how much the data points vary around their mean.

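
The three sums of squares and the resulting R-squared value can be sketched as follows for a simple least-squares line. The function name and the data are made up for illustration.

```python
def sums_of_squares(xs, ys):
    """Fit a least-squares line and return (SSR, SSE, SSTO)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    b0 = mean_y - b1 * mean_x
    preds = [b0 + b1 * x for x in xs]
    ssr = sum((p - mean_y) ** 2 for p in preds)           # predictions vs mean
    sse = sum((y - p) ** 2 for y, p in zip(ys, preds))    # actual vs predictions
    ssto = sum((y - mean_y) ** 2 for y in ys)             # actual vs mean
    return ssr, sse, ssto

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8]

ssr, sse, ssto = sums_of_squares(xs, ys)
r_squared = ssr / ssto
```

For a least-squares line, SSR + SSE equals SSTO up to rounding, so `1 - sse / ssto` gives the same R-squared value.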

Conclusion

We covered the basics of linear regression in this article. We learned about its model representation, the types of regression, and how they can be used in data science to predict values. We went through how to predict based on one or more independent variables, and how to check the model's performance to see how much the predictions vary from the actual values.

