Linear Regression with Python
by Tirthajyoti Sarkar
In this article, we discuss 8 ways to perform simple linear regression using Python code and packages. We briefly review their pros and cons and show a measure of their relative computational complexity.
For many data scientists, linear regression is the starting point of many statistical modeling and predictive analysis projects. The importance of fitting a linear model to a large data set accurately and quickly cannot be overstated. As pointed out in this article, the term "linear" in a linear regression model refers to the coefficients, not to the degree of the features.
Features (or independent variables) can be of any degree, or even transcendental functions such as exponentials, logarithms, and sinusoids. Thus, a large body of natural phenomena can be modeled (approximately) using these transformations and a linear model, even if the functional relationship between the output and the features is highly nonlinear.
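To make the point about "linear in the coefficients" concrete, here is a minimal sketch using NumPy's least-squares solver. The synthetic data, the choice of log and sine features, and the coefficient values are all assumptions made up for illustration; they do not come from the article.

```python
import numpy as np

# Hypothetical, noise-free data: y depends nonlinearly on x,
# but linearly on the coefficients (1.5, 2.0, -0.7).
x = np.linspace(1.0, 10.0, 50)
y = 1.5 + 2.0 * np.log(x) - 0.7 * np.sin(x)

# Design matrix: one column per transformed feature.
# The column of ones carries the intercept term.
A = np.column_stack([np.ones_like(x), np.log(x), np.sin(x)])

# Ordinary least squares on the transformed features.
coef, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)

print(coef)  # estimated intercept, log coefficient, sine coefficient
```

Because the model is linear in its coefficients, the ordinary least-squares machinery applies unchanged even though the relationship between x and y is far from a straight line.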
Python, meanwhile, is fast emerging as the de facto programming language of choice for data scientists. It is therefore critical for a data scientist to be aware of the various ways to quickly fit a linear model to a fairly large data set and assess the relative importance of each feature in the outcome.
But is there only one way to perform linear regression analysis in Python? And when several options are available, how do you choose the most effective one?
Because of the wide popularity of the machine learning library scikit-learn, a common approach is to call the linear model class from that library and fit the data. While this brings the added advantage of access to other machine learning pipeline features (e.g. data normalization, model coefficient regularization, feeding the linear model to another downstream model), it is often not the fastest or cleanest method when a data analyst just needs a quick and easy way to determine the regression coefficients (and some basic associated statistics).
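For reference, the scikit-learn approach described above might look like the following minimal sketch. It assumes the class in question is `LinearRegression` from `sklearn.linear_model`, and it uses made-up, noise-free data (slope 3, intercept 1) purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data lying exactly on the line y = 3x + 1.
x = np.arange(10, dtype=float)
y = 3.0 * x + 1.0

# scikit-learn expects a 2-D feature matrix: (n_samples, n_features).
X = x.reshape(-1, 1)

model = LinearRegression()
model.fit(X, y)

slope = model.coef_[0]        # one coefficient per feature column
intercept = model.intercept_  # fitted separately from the coefficients

print(slope, intercept)
```

Note that the fitted object exposes the coefficients and intercept, but not the associated statistics (standard errors, p-values and so on) that an analyst may also want.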