Linear regression: alternative approaches to modeling the relationship

weixin_26752765

Original article: https://medium.com/swlh/isnt-linear-regression-for-machine-learning-d31543f49181

Linear regression is one of the best-known and simplest tools in statistics and machine learning.

In this article, you can explore the linear regression algorithm, how it operates, and how you can use it better.

Linear regression (LR) is a simple yet powerful supervised learning technique that is applied in a large number of situations.

LR determines how the input variables, termed explanatory variables, affect the output variable, termed the response variable. It fits the straight line with the smallest sum of squared residuals, known as the regression line or the least-squares line. A model with only one independent variable is called simple linear regression, while multiple linear regression has more than one explanatory variable.
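
To make the least-squares idea concrete, here is a minimal sketch in Python with NumPy. The function name `fit_simple_lr` and the experience/salary numbers are made up for illustration; they are not from the original article.

```python
import numpy as np

def fit_simple_lr(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Closed-form solution that minimizes the sum of squared residuals.
    slope = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Hypothetical data: years of experience vs. salary (in thousands).
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
salary = np.array([35.0, 41.0, 44.0, 51.0, 54.0])

b0, b1 = fit_simple_lr(experience, salary)
residuals = salary - (b0 + b1 * experience)
sum_sq_residuals = np.sum(residuals ** 2)
print(b0, b1, sum_sq_residuals)
```

Any other straight line through these points would give a larger sum of squared residuals than the one returned here.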

LR handles the study of continuous variables. It is useful to companies for forecasts such as future market trends or the relationship between salary and experience. LR is used in forecasting, time series, and the study of cause-and-effect relationships, for example the association between reckless driving and road injuries.

The relationship estimated by LR can be either positive or negative. A positive relationship between two variables means that an increase in the value of one variable is associated with an increase in the value of the other. A negative relationship, on the other hand, means that an increase in the value of one variable is associated with a decrease in the value of the other.

Assumptions of linear regression

· The relationship between the dependent variable y and the independent variable x is linear. The model must be linear in its coefficients, and the coefficients must not be functions of each other.

· The independent variables must be non-random in non-financial applications. In financial scenarios, a random independent variable is acceptable as long as the error term and the independent variable are not correlated.

· Multicollinearity occurs when independent variables are correlated with one another. It can be checked with the correlation matrix, in which the correlation coefficient between every pair of independent variables must be less than 1 in absolute value. Tolerance is another measure of multicollinearity. It is defined as T = 1 - R², where R² comes from regressing one independent variable on the others; T < 0.1 suggests possible multicollinearity and T < 0.01 indicates multicollinearity. For the variance inflation factor (VIF), which is the reciprocal of the tolerance, VIF > 10 indicates multicollinearity among the variables.

· The error term is normally distributed. This is checked with a histogram or a Q-Q plot of the residuals. The histogram should be symmetrical and bell-shaped, and the points of the Q-Q plot should lie on the 45-degree line.

· The variance of the error term is constant. This is called the homoscedasticity assumption, or constant error variance. It is evaluated using a scatter plot of the residuals. The Breusch-Pagan test can be used to test for homoscedasticity; it performs an auxiliary regression of the squared residuals on the independent variables.

• Autocorrelation happens when the residuals are not independent of each other. The Durbin-Watson (DW) test checks the null hypothesis that the residuals are not autocorrelated. The DW statistic lies between 0 and 4; a value near 2 indicates no autocorrelation, and a value well below 2 signals that neighboring residuals are positively correlated with one another.

• For LR to make reliable predictions, the variables are expected to follow a Gaussian distribution. This is the multivariate normality assumption, under which all variables are expected to be multivariate normal. It is checked using a histogram or Q-Q plot, and normality can be further verified with a goodness-of-fit test such as the Kolmogorov-Smirnov test. When the data is not normally distributed, a transformation such as the log transformation may help.
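
Two of the checks in the list above, tolerance/VIF and the Durbin-Watson statistic, are easy to compute by hand. The sketch below uses only NumPy; the helper names and the simulated data (including the deliberately near-duplicate column `x3`) are assumptions made for illustration, not part of the original article.

```python
import numpy as np

def ols_residuals(X, y):
    """Residuals from regressing y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return y - A @ coef

def r_squared(X, y):
    resid = ols_residuals(X, y)
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def tolerance_and_vif(X):
    """Tolerance T_j = 1 - R_j^2 and VIF_j = 1 / T_j for each column of X,
    where R_j^2 comes from regressing column j on all the other columns."""
    n_cols = X.shape[1]
    tol = np.empty(n_cols)
    for j in range(n_cols):
        others = np.delete(X, j, axis=1)
        tol[j] = 1.0 - r_squared(others, X[:, j])
    return tol, 1.0 / tol

def durbin_watson(resid):
    """Sum of squared successive differences over the sum of squared residuals."""
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Simulated data (assumption): x3 is nearly a copy of x1, so both should be flagged.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

tol, vif = tolerance_and_vif(X)
dw = durbin_watson(ols_residuals(X, y))
print("tolerance:", tol)
print("VIF:", vif)
print("Durbin-Watson:", dw)
```

With these simulated inputs, `x1` and `x3` should show VIF values far above 10 while `x2` stays close to 1. Because the errors are drawn independently, the Durbin-Watson statistic should land near 2.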

Level of accuracy of the prediction

· The size of the residuals gives a clear indication of how effective the regression line is at estimating Y values from X values. This measure is referred to as the standard error of the estimate, and it is essentially the standard deviation of the residuals. The smaller the number, the more precise the forecasts tend to be.

· The fit of the model is assessed using R², which in simple linear regression is the square of the correlation between x and y. The higher the R², the better the fit. It always lies between 0 and 1: the stronger the linear relationship, the closer R² is to 1.

· Adjusted R² is an additional measure that adjusts R² for the number of explanatory variables in the equation. It is used to judge whether extra explanatory variables belong in the equation. When several explanatory variables are present, adjusted R² is generally the better estimate of the strength of the relationship. Adjusted R² can be negative, although it usually is not.

In an over-fitted model, a high R² is obtained at the cost of decreased predictive ability. That is not the case with adjusted R². Each variable added to the model increases R² or leaves it unchanged; it never decreases it. Adjusted R², by contrast, rises only if the new predictor improves the LR model enough to justify the extra variable.
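
The three measures in this section can be written out as a short sketch, assuming the usual definitions with n observations and k explanatory variables: R² = 1 - SS_res / SS_tot, adjusted R² = 1 - (1 - R²)(n - 1)/(n - k - 1), and standard error of the estimate = sqrt(SS_res / (n - k - 1)). The function name `fit_stats` and the simulated data are illustrative assumptions.

```python
import numpy as np

def fit_stats(X, y):
    """Return (R^2, adjusted R^2, standard error of the estimate) for an OLS fit
    of y on the columns of X plus an intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)
    std_err = np.sqrt(ss_res / (n - k - 1))
    return r2, adj_r2, std_err

# Simulated data (assumption): y depends on x only; `noise_col` is irrelevant.
rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
noise_col = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

r2_base, adj_base, se_base = fit_stats(x.reshape(-1, 1), y)
r2_more, adj_more, se_more = fit_stats(np.column_stack([x, noise_col]), y)

print("one predictor :", r2_base, adj_base, se_base)
print("plus noise col:", r2_more, adj_more, se_more)
```

Adding the irrelevant column cannot lower R², which is why R² alone is a poor guide to whether a variable belongs in the model. Adjusted R² applies a penalty for each extra variable.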

Alternative approaches to modeling the relationship

· Many explanatory factors are categorical and cannot be measured on a quantitative scale. The trick is to use dummy variables. A dummy variable takes only the values 0 and 1. Examples are gender and quarter of the year.

· You may include an interaction variable, which is the product of two explanatory variables. Include an interaction variable in a regression equation if you believe that the influence of one explanatory variable on y depends on the value of another explanatory variable.

· Nonlinear transformations of variables are used in response to curvature found in scatterplots. You can transform the dependent variable y, any of the explanatory variables x, or all of them. Common transformations include the natural logarithm, the square root, the reciprocal, and the square.
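
The three devices above can be sketched in a few lines of NumPy. All variable names, the coefficient values (30, 2, 5, 1.5) and the data are hypothetical, chosen only so that the fitted coefficients are easy to check.

```python
import numpy as np

# Dummy variables for a categorical factor with four levels (quarter of the year).
# One level (quarter 1) is left out as the baseline so that the dummy columns
# are not perfectly collinear with the intercept.
quarters = np.array([1, 2, 3, 4, 1, 2, 3, 4])
quarter_dummies = np.column_stack(
    [(quarters == q).astype(float) for q in (2, 3, 4)]
)

# A 0/1 dummy and an interaction term (product of two explanatory variables).
experience = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
is_female = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0])
interaction = experience * is_female

# Hypothetical noise-free response: the slope on experience is 2 for one group
# and 2 + 1.5 for the other, which is exactly what an interaction term models.
y = 30.0 + 2.0 * experience + 5.0 * is_female + 1.5 * interaction

A = np.column_stack([np.ones(len(y)), experience, is_female, interaction])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("intercept, experience, dummy, interaction:", coef)

# Common nonlinear transformations of a positive variable.
transformed = {
    "log": np.log(experience),
    "sqrt": np.sqrt(experience),
    "reciprocal": 1.0 / experience,
    "square": experience ** 2,
}
```

Because the response is built without noise, the fit should return the four coefficients used to generate it. The interaction coefficient is the difference in slope between the two groups.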

Why log your variables in a regression?

• The variable is right-skewed, and taking the log makes the distribution of the transformed variable more symmetrical. On its own this is not a sufficient reason to log the variable, because no regression rule requires the independent or dependent variables to be normal. However, if you have outliers in your dependent or independent variables, a log transformation can reduce their effect.

• The variance of your regression residuals increases with your regression predictions. Taking the log of your dependent or independent variables may reduce the heteroscedasticity.

• Your regression residuals are not normal. This may or may not be a problem for you. If it is, you can log the dependent or independent variables and check whether the residuals are normal after the log transformation.

• The relationship between the dependent and independent variables is not linear. For example, consider income and food consumption. A rise in income raises food consumption up to a certain level, after which consumption flattens or even decreases.
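
As an illustration of the points above, the sketch below simulates a response whose noise is multiplicative, so its spread grows with the prediction, and then fits a straight line after taking logs of both variables. The data-generating model y = 2 · x^0.5 · exp(noise) is an assumption picked for the example, not something taken from the original article.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
x = rng.uniform(1.0, 100.0, size=n)

# Assumed model: multiplicative noise, so the spread of y grows with its level
# and the relationship between x and y is curved on the raw scale.
y = 2.0 * x ** 0.5 * np.exp(0.1 * rng.normal(size=n))

# After logging both sides the model is linear with additive, constant-variance noise:
#   log(y) = log(2) + 0.5 * log(x) + noise
slope, intercept = np.polyfit(np.log(x), np.log(y), 1)
log_residuals = np.log(y) - (intercept + slope * np.log(x))

print("estimated exponent:", slope)
print("estimated multiplier:", np.exp(intercept))
print("residual std on the log scale:", log_residuals.std())
```

The fitted slope estimates the exponent and `exp(intercept)` estimates the multiplier, so both should come out close to the values used in the simulation.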

The relevance of the independent variable

The underlying idea is parsimony: explain the most with the least. It favors a model with fewer explanatory variables. The techniques below can be used to assess the significance of explanatory variables in the linear regression equation.

The correlation coefficient describes the strength and direction of the linear relationship between x and y. A hypothesis test helps determine whether the population correlation coefficient is zero or different from zero.

When the test concludes that the correlation coefficient is different from zero, the correlation is significant. If the test does not show that it differs from zero, we treat the correlation as not significant. Significance can be tested in two ways: with the p-value or with the t statistic.

The t-values of the regression coefficients are used to include or exclude explanatory variables in the regression equation. A variable is considered significant, and is kept in the regression equation, if its p-value < 0.05 (95% confidence level), which corresponds roughly to a t statistic greater than 2 in absolute value. If the t statistic is less than 1 in absolute value, it is a mathematical fact that the standard error of the estimate will decrease and adjusted R² will increase if this variable is excluded from the regression equation.

The F-test determines whether the explained variation is large relative to the unexplained variation. The F-test of overall significance is the hypothesis test for the linear relationship as a whole, and it has an associated p-value. If the F-value in the ANOVA table is large and the corresponding p-value is small, reject the null hypothesis and conclude that the explanatory variables have some explanatory value.
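
The t and F statistics described above can be computed directly from an ordinary least-squares fit. The sketch below uses only NumPy and stops at the statistics themselves; turning them into p-values needs the t and F distributions, which NumPy does not provide. The function name and the simulated data are illustrative assumptions.

```python
import numpy as np

def t_and_f_statistics(X, y):
    """Return (coefficients, t statistics, F statistic) for an OLS fit of y on
    the columns of X plus an intercept. Index 0 refers to the intercept."""
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    ss_res = np.sum(resid ** 2)               # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
    dof = n - k - 1
    sigma2 = ss_res / dof                     # estimated error variance
    coef_cov = sigma2 * np.linalg.inv(A.T @ A)
    t_stats = coef / np.sqrt(np.diag(coef_cov))
    # Explained variation per variable over unexplained variation per degree of freedom.
    f_stat = ((ss_tot - ss_res) / k) / (ss_res / dof)
    return coef, t_stats, f_stat

# Simulated data (assumption): y depends on x_useful but not on x_noise.
rng = np.random.default_rng(3)
n = 200
x_useful = rng.normal(size=n)
x_noise = rng.normal(size=n)
y = 1.0 + 3.0 * x_useful + rng.normal(size=n)

coef, t_stats, f_stat = t_and_f_statistics(np.column_stack([x_useful, x_noise]), y)
print("coefficients:", coef)
print("t statistics:", t_stats)
print("F statistic:", f_stat)
```

In this simulation the t statistic for `x_useful` should be far above 2 and the F statistic large. The t statistic for `x_noise` varies from sample to sample and is the kind of value the |t| < 1 rule is meant to catch.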

Conclusion

Regression analysis is a term used in a broad sense. At its core, however, it focuses on quantifying changes in the dependent variable associated with changes in the independent variables. This is because all regression models, linear or non-linear, link the dependent variable to the independent variables.

Now, share your thoughts on Twitter and LinkedIn! Agree or disagree with Saurav Singla's ideas and examples? Want to tell us your story? Tweet @SauravSingla_08 and comment Saurav_Singla right now!