An ASAP Guide to Linear Regression

Linear Regression is famously known for being a simple algorithm and a good baseline to compare more complex models to. In this article, we explore the algorithm, understand the math, run the code, and learn linear regression As Soon As Possible.

Section 1: The Basics

Linear Regression — or LR — is a regression algorithm.

It’s used for making predictions where the output is a continuous value, such as the number 7.

It shows the relationship between one or more independent variables (the input/data/features) and the dependent variable (the outcome/prediction).

An easy way to remember this is that the output is DEPENDENT on the input. To understand the difference even better, click here.

Section 2: The Math

Section 2.1: Simple Linear Regression (SLR)

Simple linear regression is the most barebones and easiest-to-understand version of LR. It contains only one independent variable, which predicts one dependent variable.

It’s essential to understand this before moving on to more advanced algorithms and to LR itself. It’s also the foundation of multiple and polynomial linear regression. If you still don't understand SLR after this section, watch this on 2x speed (4 minutes).

The equation behind LR is derived from a simple equation we all learn in secondary school:

The equation for SLR is a direct derivation of the slope-intercept form equation. With both equations written one above the other, it’s easy to see how the variables overlap.

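Written out, using the notation defined in the list just below, the two equations are:

y = mx + b (slope-intercept form)

ŷ = β0 + β1X (simple linear regression)
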
  • β0 is the y-intercept

  • β1 is the coefficient/slope

  • X is the independent variable

  • ŷ is the dependent variable

Although LR can show the relationship between two or more variables, SLR only shows the relationship between TWO variables. That’s why it has only two variables, ŷ and X. This makes it easy to visualize:

[Figure: a simple linear regression line fitted through a scatter of data points]

Here’s a more outlined version:

[Figure: the same plot with the fitted line labelled ŷ]

Section 2.2: Multiple Linear Regression (MLR)

NOTE: Going forward, "features" and "independent variables" will be used interchangeably, as they represent the same concept.

While SLR works when there is a single predictor and a single predicted value, MLR is used in almost all other situations, where multiple factors determine the outcome, such as this dataset. It follows a nearly identical equation to SLR but has multiple "βX" terms, i.e. combinations of independent variables and coefficients.

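Written out, the MLR equation simply extends the SLR equation with one term per feature:

ŷ = β0 + β1X1 + β2X2 + … + βnXn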

MLR assigns a unique coefficient to each feature and stores the coefficients in a single vector, so they can easily be modified and used in operations (multiplications) as a matrix function:

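In matrix notation (assuming the feature matrix X carries a leading column of ones for the intercept), the whole prediction collapses into a single multiplication:

ŷ = Xβ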

Dimensionality

With two variables it’s very easy to visualize our data. Even with three variables, we can visualize it in 3D space (see Fig 5). But as the number of variables increases, we reach the problem of dimensionality, which must be understood before moving on.

One important thing to note is that only the number of independent variables increases. That is because the dependent variable is the value we are predicting, so it’s always a single value. More complex AI models utilizing neural networks can have multiple outputs, but in most machine learning applications, and for the sake of this article, the outcome is only one value.

Let’s say we have two features. That means, in theory, we have two x-axes and one y-axis. To map this, we have to move up a dimension and project the data like this:

[Fig 5: data points in 3-dimensional space with a 2-dimensional regression hyperplane fitted through them]

In this example, we are in a 3-dimensional space and the hyperplane is the yellow subspace with one less dimension (which is why it’s a 2-dimensional plane).

Two features are something we can still visualize, but as we move up even more dimensions, our minds fail to comprehend it. Luckily, our computers can, which is why MLR works. If you’d like a deeper level of understanding, check out this article.

There is also something called dimensionality reduction (we won't cover it here for the sake of time), which can be done through different techniques (principal component analysis, linear discriminant analysis, kernel PCA, etc.) to reduce the number of dimensions. If you want to look into this further, check out this chapter.
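Just as a quick taste (it isn't used in the model we build below), here is a minimal sketch of one of those techniques, PCA, in scikit-learn; the feature matrix here is made up purely for illustration:

import numpy as np
from sklearn.decomposition import PCA

X_demo = np.random.rand(100, 10)       # hypothetical data: 100 samples, 10 features
pca = PCA(n_components=2)              # keep the 2 directions that explain the most variance
X_reduced = pca.fit_transform(X_demo)  # projected data, now 100 samples x 2 features
print(X_reduced.shape)                 # (100, 2)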

You might be wondering how all these values are even calculated. How do we know the intercept or the coefficients are correct? How do we even know if our model is correct?

Obviously, we can see with our own eyes how well the line aligns with our data (see Fig 2.1), but our computers use a more granular and accurate way to do this.

Section 2.3: Algorithm Techniques

Before we get to this, let’s clear something up: the difference between an algorithm and a model.

I cover this in depth in this article, but tl;dr: an algorithm is run on our data to find patterns and rules, which are stored in and used to create a model that can then be used to make predictions.

So the techniques we are going to cover are used on the algorithm with our data to create an accurate model.

Ordinary Least Squares (OLS)

To find the unknown parameters of our model, we use a linear least-squares method. The most popular of these methods is a technique called ordinary least squares (OLS).

OLS aims to minimize the sum of the squared residuals of each point. It uses the sum of the squared residuals because points underneath the line have a negative residual, and if we simply summed the raw residuals, the negative ones would cancel out the positive ones. Squaring each residual keeps every term positive and avoids the canceling.

A residual is the distance between a point and its predicted value on the line.
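Putting those two paragraphs together: with n data points, OLS chooses β0 and β1 to minimize

Σ (yi - ŷi)² = Σ (yi - (β0 + β1Xi))², summed over i = 1, …, n.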

Here’s a neat visualization:

[Figure: the residuals, drawn as vertical distances between each data point and the fitted line]

R Squared (R²)

The most common way to measure the accuracy of our model is through a metric called R², also known as the coefficient of determination.

It tells us how well the regression line predicts the actual values by measuring the percentage of variation in Y that is accounted for by its regression on X (which is why it’s a value between 0 and 1).

The R² basically compares the sum of squared residuals of each point from the regression line to the sum of squared residuals of each point from the mean.

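In its most common form, that comparison is written as:

R² = 1 - (SSres / SStot)

where SSres is the sum of squared residuals from the regression line and SStot (often called SST) is the sum of squared deviations from the mean.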

This equation might look daunting, but it’s simply comparing the error from the mean with the error from the regression. You may see slight variations of the equation, since SST is defined so that it avoids negative values (otherwise, depending on whether a point sits above or below the line, we would have to add or subtract).

Here’s a good video that covers the fundamentals of it. Highly recommend watching it :-)

Section 3: Before We Build

Section 3.1: Assumptions of LR

Before getting into building, there are a few assumptions of LR that we need to consider. If these assumptions are not met, our data can collapse on itself and mess up our model.

  1. Linearity - the data must be linear (see Fig 2.1).

  2. Homoscedasticity - the variance of the errors should be constant (see Fig 7.1 below).

  3. Multivariate normality - the residuals must be normally distributed (only a few outliers, and the remaining points should be close to the line).

  4. Independence of errors - the residuals should not be correlated with each other (see Fig 7.2 below).

  5. Lack of multicollinearity - the features/independent variables are not highly correlated with each other (one feature doesn’t directly predict another, like length and volume).

Section 3.2: Methods of Building an MLR Model

There’s a reason I said "MLR Model" and not "LR Model". For an SLR model, we only have two variables to implement, so there’s no nuance. But when dealing with MLR, we can have anywhere from three to three hundred variables. That is why we must employ methods to reduce the number of variables.

There are two main reasons why doing this is a good idea: (1) the more variables we have, the harder it is for our computer to make and run an effective model, and (2) the more variables we have, the harder it is for us to understand the impact of the variables and draw valuable insights from our data.

We use a p-value (a statistic that measures the significance of a variable among the others) compared against a threshold, usually set to 0.05 or 5%, to help us find the variables among our features that actually have an influence on the outcome. This threshold is also known as the significance level.

There are 5 main methods for doing this:

  1. All-in - This method uses all the variables we have, and is something we only do if we have prior domain knowledge that all of our variables are significant.

  2. Backward Elimination - This method fits all the variables and removes the variable with the highest p-value. It then refits the model and repeats this process until all remaining variables are below the significance level.

  3. Forward Selection - This method creates an SLR model for each feature and finds the one with the lowest p-value. It then keeps that feature and fits all possible models with that feature plus one extra predictor, keeping the best. It repeats this process until no remaining feature has a p-value below the significance level.

  4. Bidirectional Elimination - This method uses the first step of forward selection (creating an SLR model for each feature and finding the one with the lowest p-value) and then uses backward elimination to take out all the variables that are above the significance level, and repeats. It’s also known as stepwise regression.

  5. Score Comparison - This method uses a specific criterion of goodness of fit (such as the Akaike criterion), constructs all possible models from the n variables (2^n - 1 of them), and chooses the best one. It becomes extremely unsustainable as the number of variables increases.

Lucky for us, most machine learning libraries, including scikit-learn, find the best method and apply it automatically when we’re coding.

Speaking of coding…

Section 4: Building the Model

For building this model, and a majority of machine learning models, we will use the Python library scikit-learn. We’re using scikit-learn because it’s one of the most extensive and user-friendly machine learning libraries. Plus, Python rocks.

Let’s start!

For learning purposes, this is a pretty straightforward MLR model. The dataset is retrieved from the UCI Machine Learning Repository and contains 2014 Facebook post metrics from over 500 posts by an international cosmetics company.

It has multiple independent and dependent variables that we can use. In this case, we’ll use the metrics to predict Lifetime Engaged Users as a baseline (although any other predicted variable can be used).

If you want to follow the real code, go to my GitHub.

First, we’re gonna import three basic libraries that, like scikit-learn, are applicable to almost all ML work.
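For reference, the three staples in a typical scikit-learn workflow are NumPy, Matplotlib, and pandas, which is presumably what is meant here:

import numpy as np                 # arrays and numerical operations
import matplotlib.pyplot as plt    # plotting and visualization
import pandas as pd                # loading and manipulating the dataset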

We can import the dataset into a pandas DataFrame and use iloc to assign the variables. Remember that the name of the dataset has to be updated for different use cases, AND it must be in the same folder as your .py file or uploaded to Jupyter Notebooks or Google Colab.
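A minimal sketch of that step; the file name and column positions below are assumptions, so check them against your own copy of the data:

import pandas as pd

dataset = pd.read_csv('dataset_Facebook.csv')   # hypothetical file name; update it for your copy
X = dataset.iloc[:, :-1].values                 # independent variables: every column except the last
y = dataset.iloc[:, -1].values                  # dependent variable, e.g. Lifetime Engaged Users (assumed position)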

For any data science problem, one of the most important tasks is preprocessing the data. Because our computers are very precise and punctual machines, our datasets have to be perfect.

They have to be checked for missing values or NaN (Not a Number) values, and they have to follow the LR assumptions. The assumptions can be taken care of just by looking at our data or by using a separate function to check for them; however, we have to take care of missing and incorrectly formatted values ourselves.

There are multiple ways to do this, but the most efficient way is to use the preprocessing tools provided by scikit-learn. This varies for different situations, but we’ll cover a few here.

After assigning our dataset, we can run print(dataset.info()) to quickly summarize all of our columns and see where we need to fix our data. It’s better to actually look at our dataset or use conditional formatting on the CSV file, but this works too. We get something like this:

[Fig 9. Null values == bad.]

For this dataset, we have a total of 3 columns with missing data. We can use scikit-learn’s SimpleImputer class to help us impute our missing data.

For indices 8 and 9 (the upper bound is not included, hence ‘8:10’), we can impute with the ‘median’ to compensate for outliers, and for index 6 we use ‘most_frequent’ because it’s a binary data point.
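Here’s a sketch of that imputation, assuming X is the feature array from the earlier snippets and that the column indices match the description above:

import numpy as np
from sklearn.impute import SimpleImputer

# Median imputation for the numeric columns at indices 8 and 9 (robust to outliers)
median_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X[:, 8:10] = median_imputer.fit_transform(X[:, 8:10])

# Most-frequent imputation for the binary column at index 6
mode_imputer = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
X[:, 6:7] = mode_imputer.fit_transform(X[:, 6:7])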

Index 1 has categorical data that has to be converted using one-hot encoding (scikit-learn’s OneHotEncoder). This must be done AFTER imputing the missing values, since one-hot encoding automatically moves the encoded columns to the front, which displaces all the rest.
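One common way to do this in scikit-learn is a ColumnTransformer wrapping a OneHotEncoder; a minimal sketch, assuming the categorical column sits at index 1 as described:

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode column 1 and pass every other column through untouched;
# the encoded columns are placed at the front of the output array.
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [1])], remainder='passthrough')
X = np.array(ct.fit_transform(X))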

Because we have a large dataset and many variables for a simple algorithm, we can use a 90/10 train/test split. The random state is set to 3 for consistency’s sake.
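With scikit-learn’s train_test_split, that looks roughly like this:

from sklearn.model_selection import train_test_split

# 90% of the rows for training, 10% held out for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=3)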

Now, training the model is surprisingly easy. After making an instance of the LinearRegression object, we can use it to fit the training data and train the model. The parentheses of the LinearRegression() object are empty because we aren’t adjusting the parameters of our model (something we would assign within the parentheses), since they aren’t as necessary/significant as they are for more complex models.
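In code, that is just a couple of lines:

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()     # empty parentheses: default parameters
regressor.fit(X_train, y_train)    # learn the intercept and coefficients from the training data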

After training our model, we can use it to make new predictions. Here, we can assign a variable, y_pred, the predicted values for the test set. Then, by using the concatenate function, we can display the predicted values and actual values in a side-by-side 2D array, reshaping each with (len(y_pred), 1), for easy viewing.
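A sketch of that step; reshaping each vector into a single column lets concatenate line the two up side by side:

import numpy as np

y_pred = regressor.predict(X_test)   # predictions for the held-out test set

# Predicted values in the left column, actual values in the right column
comparison = np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), axis=1)
print(comparison)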

Finally, to determine our model’s performance, we can calculate an R² value. In this model, we achieved an R² of nearly 0.80, which means roughly 80% of the variation in the outcome is accounted for by our model. The print function turns the R² value into a string so it can be easily printed and read.
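Using scikit-learn’s r2_score, that last step might look like this:

from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)       # compare actual vs. predicted test values
print('R squared: ' + str(r2))      # str() so the float can be concatenated and printed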

Section 5: Conclusion

That’s about it. See, not as hard as it seems!

Now that we have this rudimentary understanding, we can work towards understanding ANY machine learning algorithm — ins and outs.

I distilled all the stuff you need to know in this article but you can always explore different perspectives to learn more. If you want to understand these concepts in-depth, I recommend checking out all the sources of my images. They are all amazing websites, articles, or videos with tons of useful information.

Cheers.

Summary

  • Linear regression is very easy to understand and build, and it predicts a continuous value.

  • Simple linear regression has one independent variable, while multiple linear regression has two or more.

  • There are different mathematical procedures that create/train our model and measure its accuracy.

  • Data needs to be preprocessed for missing points and checked for outliers before regressor.fit(X_train, y_train)-ing our model.

Hope you enjoyed that read! :)

Before you leave, please allow me to introduce myself :)

I’m a curious 17-year-old who is super passionate about machine learning and data science, and their intersection with brain-computer interfaces. I love learning new stuff and meeting new people.

Connect with me on Linkedin, Medium (oh look! you’re already here), or Twitter!

Gain Access to Expert View — Subscribe to DDI Intel

Translated from: https://medium.com/datadriveninvestor/asap-guide-to-linear-regression-fda841656fbd
