Regression Basics: Code Walk-Through

This article guides you through the basics of regression by showing code and thorough explanations of a full data project using a Kaggle used car dataset. The project utilizes Linear Regression, Ridge CV, Lasso CV, and Elastic Net CV models to predict sale price. Full code is available on Github.


Getting Set Up

The used car dataset is available for download from Kaggle as a CSV file. The information is provided by cardekho, an Indian used car website. Once you have downloaded the CSV file, you can utilize the Pandas library to view and analyze the data.


The file path inside the quotes will be different depending on where you save the file. On a Mac you can find the file path by right-clicking on the file and holding the Option key. An option to ‘Copy ‘file.csv’ as Pathname’ should appear, and then you can paste it between the quotes as I have below.


import pandas as pd

# upload data from csv
car = pd.read_csv('/Users/Jewel/Desktop/Car-Price-Prediction/car details.csv')

Now the CSV is saved as a Pandas DataFrame. To see what kind of data we are working with, we can use the code below to view the first 5 rows of the data frame that we just created. If we want to see a specific number of rows, we can put a number in the parentheses (car.head(10) would show the first 10 rows).


car.head()
[Image: the first five rows of the car data frame]

This allows us to see what the data frame contains: the columns, and some examples of the used car details included. Each row is a unique car, and the columns represent the different features of the car. In this project we will use those features to predict the ‘selling_price’ of each car. To do this, we will use regression, but first we should explore the data and clean it as needed.


Exploratory Data Analysis & Cleaning

1. Check for Missing Values

car.isnull().sum()
[Image: output of car.isnull().sum(), showing zero missing values in every column]

There are no missing values! Yay. Missing values make it difficult to model the information correctly. Usually there are lots of missing values in datasets, and in order to model the data we either delete the rows with missing values or replace each one with another value (the mean, the previous value, and so on); a sketch of both approaches is below.

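This dataset has none, but for reference, here is a minimal sketch of both approaches on a hypothetically gappy column (these lines are illustrative only and not part of the original project):

# option 1: drop any row that contains at least one missing value
car = car.dropna()

# option 2: fill gaps in a column with that column's mean...
car['km_driven'] = car['km_driven'].fillna(car['km_driven'].mean())
# ...or carry the previous row's value forward
car['km_driven'] = car['km_driven'].ffill()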

2. Check Data Types

The next step in exploring the data is to see what types of data are stored in the columns. This is helpful for noticing if a column that appears to be filled with numbers is incorrectly coded as an ‘object’. If this is the case you can easily change the data type so that the computer correctly understands the information you are providing.


car.dtypes
[Image: output of car.dtypes, listing the data type of each column]

float = numbers with decimals (1.678)
int = integer or whole number without decimals (1, 2, 3)
obj = object, string, or words (‘hello’)

The 64 after these data types refers to how many bits of storage the value occupies. You will often see 32 or 64.


The data types all look correct based on the first few rows of the data frame we have seen so far.

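If one of the numeric columns had been read in as an ‘object’, a hypothetical fix might look like this (neither line is needed for this dataset):

# cast a column that pandas mis-read as strings back to integers
car['year'] = car['year'].astype('int64')
# pd.to_numeric is more forgiving when some entries fail to parse
car['km_driven'] = pd.to_numeric(car['km_driven'], errors='coerce')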

3. Data Overview

car.describe()
[Image: output of car.describe(), summary statistics for the numerical columns]

This table gives us an overview of statistical information regarding the numerical data columns. Since only three of the columns are integers or floats, we only see these three listed on the chart. If the numerical data had incorrectly been coded as ‘objects’ we would not be able to view the statistical information on the columns.


Using this table, we can see different values such as the mean, min, max and standard deviation. This table is useful for giving a quick overview of the data in the columns, and may allow us to identify outliers. For instance, when I was first looking at this table I thought that an average price of $500,000 for a used car was incredibly high. But then I looked at the data source and realized that the data is for an Indian company, and 500,000 rupees is about $6,600. Much more reasonable!

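The conversion itself is simple arithmetic; the exchange rate below is my assumption (roughly the rate around the time the article was written), not a figure from the dataset:

price_inr = 500_000
inr_per_usd = 76.0              # assumed exchange rate
price_usd = price_inr / inr_per_usd  # roughly the $6,600 quoted above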

Another thing that stood out from this table is that the minimum value for km_driven was 1 kilometer. That seemed very low for a used car, so I wanted to investigate this car further. To do this, we can sort the data frame by lowest to highest number of kilometers driven.


car.sort_values(by = 'km_driven')
[Image: the data frame sorted by km_driven, with the 1 km car in the first row]

The car at the top is the one with 1km on it, and since the car has had at least two owners and is from 2014, it does not seem realistic that it would only have 1 kilometer on it. This may be an outlier, or the data could have been entered incorrectly, so to be safe I will delete this row.


car.drop([1312], inplace=True)
# 1312 is the index of the row (shown at the far left of the first row)

4. Visualize Correlations

Using the Seaborn library, we can also visualize the correlations between different features. The correlations will only register for the columns with numerical data, but it is still helpful to investigate the relationships between these features.


import seaborn as sns

# numeric_only=True limits the correlation to the numerical columns (required in newer pandas)
sns.heatmap(car.corr(numeric_only=True), annot=True);
# annot=True shows the correlation values in the squares

[Image: correlation heatmap of the numerical columns]
A positive correlation approaching 1 tells us that the two features have a positive linear relationship (as one goes up, so does the other). A negative correlation approaching -1 indicates the two have a negative linear relationship (as one increases, the other decreases).

The information we gain from this correlation chart makes sense intuitively, and may confirm some of our beliefs — which is always good to see. Year and selling price are relatively positively correlated, so as year increases so will the sale price. Year and kilometers driven are negatively correlated, so as the car gets older there will be more kilometers driven on it.

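To read these relationships off as numbers instead of colors, one line pulls the correlations with selling price directly from the same matrix:

car.corr(numeric_only=True)['selling_price'].sort_values(ascending=False)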

Feature Engineering

Now that we have taken a closer look at the data and the statistics that surround it, we can prepare our data for modeling.


The ‘name’ column in the data frame is very specific, and when we are modeling on a relatively limited data set it is sometimes better to be more general. This allows the model to make predictions based on more past samples. To investigate the variety of car names in the data, we can use the code below to count how many of each car name is included.


car['name'].value_counts()
[Image: output of car['name'].value_counts()]

There are over 1400 different car names included, and many have fewer than 30 examples. Our model will have a hard time predicting sale price if there are only a couple of examples of each car. In order to make this a more general feature, we can include just the brand of the car. Luckily the brand name is the first word in each of the rows, so we can create a new feature with just the car brand names.


# make an empty list to append new names to
brand_name = []
for x in car.name:
    y = x.split(' ')
    brand_name.append(y[0])  # append only the first word to the list

# we can drop the previous column that had the full name
car = car.drop(['name'], axis=1)

# and add the new column that just has the brand name
car['brand_name'] = brand_name

# now let's check how many of each brand is in the column
car.brand_name.value_counts()

[Image: output of car.brand_name.value_counts()]

Since this column now just has the brand name, there are far fewer than 1400 different values, and many brands have over 30 different cars. At this point you could still choose to delete the rows for car brands that only have 1 or 2 cars (Force, Isuzu, Kian, Daewoo…), but I will leave these in for now; a sketch of that filter is below for reference. We have already limited the variety dramatically, so our model should be stronger now.

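If you did want to drop the rarest brands, a minimal sketch (the threshold of 3 cars is an arbitrary choice):

counts = car['brand_name'].value_counts()
rare_brands = counts[counts < 3].index  # brands with only 1 or 2 cars
car = car[~car['brand_name'].isin(rare_brands)]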

Binarize the Features

The last step before we can model on the features is to create binary columns for the features that are not numerical. It is hard for the computer to understand the difference in meaning between all the car names, so binarizing simply tells the computer “yes this is a Volvo,” or “no, this is not a Volvo.”


Binarizing creates a specific column for each car brand, and then each row will have a 0 (not that car brand) or a 1 (it is that car brand). This process of creating binary columns for each feature is also called ‘dummifying’ the variable, and can be done easily with code in Pandas.


car_dummies = pd.get_dummies(car, drop_first=True)
# set this equal to a new variable since it will be a different data set
# dropping the first dummy column removes the redundancy of having all the columns there

car_dummies.head()

[Image: the dummified data frame, three numerical columns plus 0/1 brand columns]

With the dummy/binary columns, the data frame now looks like this — filled with columns of 0s and 1s, except for the three numerical columns. Although there are many more columns, this process makes it possible for the model to understand the information you are providing it with.


Modeling

Now (finally) we can model our data to predict the sale price of these used cars.


Split Target & Predictor Variables

First, we need to define our X and y variables: y is what we are predicting (sale price), and X is everything we are using to help us make this prediction.


X = car_dummies.copy()
y = X.pop('selling_price')
# .pop() removes the column from X and returns it, so it can be saved to the new variable

Train & Test Split

Next we need to create a training and test group for our model. We can use the code below to randomly select 70% of the data as the train group for our model to train on, and 30% will remain as our test group to test the quality of our model.


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Standardize X Values

Finally, we will need to standardize all the X values so that they are on a consistent scale. This will not change the proportionate relationship between the numbers.


from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
# fit_transform on the train data only, because the model should not be influenced
# in any way by the test data; the test set acts as unseen, brand new data

X_test = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
# only transform the test data, so it conforms to the mean and scale learned from the training data

Linear Regression

The first model we will try is a simple Linear Regression. It is important to assess the cross validation score, training score, and test score to reflect on how the model is performing.


from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# create a model instance
lr_model = LinearRegression()

# fit the model on the training data
lr_model.fit(X_train, y_train)

# get cross-validated scores
scores = cross_val_score(lr_model, X_train, y_train, cv=5)
print("Cross-validated training scores:", scores)
print("Mean cross-validated training score:", scores.mean())

# training score
print("Training Score:", lr_model.score(X_train, y_train))

# evaluate the data on the test set
print("Test Score:", lr_model.score(X_test, y_test))
[Image: Linear Regression scores]

Since the training score was much better than the test score, the model was overfitting to the training data. This means that the model has a very hard time predicting on unseen data, which is not good! Also, the mean cross validation score is a very large negative number. In theory we want all three of these scores to be as close to 1.0 as possible, but since the scores are so bad for this model, we can try some other regression models that also regularize the data.


Ridge CV

Ridge regression is one way to regularize the variables, and is often helpful when dealing with collinearity. Ridge is typically useful when there are a large number of features and many or all of them impact the target variable with similar strength. When doing any regression problem it is good practice to try all of these models and see what performs the best. With the RidgeCV model we can also set a range of alpha values to try, and the model will choose the best one.


from sklearn.linear_model import RidgeCV
import numpy as np

# create a RidgeCV model instance
ridge_model = RidgeCV(alphas=np.logspace(-10, 10, 30), cv=5)

# fit the model
ridge_model.fit(X_train, y_train)

# mean cv score on training data
scores = cross_val_score(ridge_model, X_train, y_train, cv=5)
print("Cross-validated training scores:", scores)
print("Mean cross-validated training score:", scores.mean())

# training score
print("Training Score:", ridge_model.score(X_train, y_train))

# evaluate the data on the test set
print("Test Score:", ridge_model.score(X_test, y_test))
[Image: Ridge CV scores]

Now all three scores are roughly the same, within a few decimal places. Since these scores are much better than the Linear Regression scores, we can assume that the regularization was helpful for modeling.

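As a side note, you can see which regularization strength RidgeCV settled on by checking the fitted model's alpha_ attribute:

print("Best alpha:", ridge_model.alpha_)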

Lasso CV

Another way we can regularize our features is by using Lasso. This regularization is often helpful for reducing collinearity when there are many features that have almost no impact on the target variable. Lasso will shrink these coefficients to zero, and only keep the features that strongly impact the predictions. Again, it is always best to try all the models and see which one ends up working best for your data.


from sklearn.linear_model import LassoCV

# create a LassoCV model instance
# (the eps argument from the original snippet is dropped: sklearn ignores it when an explicit alphas grid is given)
lasso_model = LassoCV(alphas=np.logspace(-8, 8, 20), max_iter=1000000, cv=5)

# fit the model
lasso_model.fit(X_train, y_train)

# mean cv score on training data
scores = cross_val_score(lasso_model, X_train, y_train, cv=5)
print("Cross-validated training scores:", scores)
print("Mean cross-validated training score:", scores.mean())

# training score
print("Training Score:", lasso_model.score(X_train, y_train))

# evaluate the data on the test set
print("Test Score:", lasso_model.score(X_test, y_test))
[Image: Lasso CV scores]

The Lasso scores are very similar to Ridge, but the mean CV score is slightly higher. A good metric for comparing models is either the mean CV or the test score.


Elastic Net CV

The last model we will test is Elastic Net CV, which combines the Lasso and Ridge regularizations.


# Elastic Net model with scores
from sklearn.linear_model import ElasticNetCV

enet_model = ElasticNetCV(alphas=np.logspace(-4, 4, 10),
                          l1_ratio=np.array([.1, .5, .7, .9, .95, .99, 1]),
                          max_iter=100000,
                          cv=5)

# fit the model
enet_model.fit(X_train, y_train)

# mean cv score on training data
scores = cross_val_score(enet_model, X_train, y_train, cv=5)
print("Cross-validated training scores:", scores)
print("Mean cross-validated training score:", scores.mean())
print()

# training score
print("Training Score:", enet_model.score(X_train, y_train))

# evaluate the data on the test set
print("Test Score:", enet_model.score(X_test, y_test))
[Image: Elastic Net CV scores]

All three of the regularized models scored very similarly, but by a small margin (.001), Lasso CV performed the best when comparing mean cross validation scores; a small sketch for tabulating that comparison is below. Let's look a bit closer at the Lasso model and how it made its predictions.

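As a quick side-by-side check, this sketch tabulates the mean CV and test scores for the four fitted models (it assumes all of the model objects from above are still in scope):

models = {'Linear': lr_model, 'Ridge CV': ridge_model,
          'Lasso CV': lasso_model, 'Elastic Net CV': enet_model}
comparison = pd.DataFrame({
    'mean_cv': {name: cross_val_score(m, X_train, y_train, cv=5).mean()
                for name, m in models.items()},
    'test': {name: m.score(X_test, y_test) for name, m in models.items()},
})
print(comparison.sort_values('mean_cv', ascending=False))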

Looking at the Best Model

Feature Importance

Viewing the coefficients can show us which features had the most (or least) impact on how the model made its predictions. The chart below shows the features that had the largest positive impact on the sale price of a car. For instance, being a BMW, Mercedes, or Audi would cause the predicted sale price to increase, as would being a newer car. By changing ascending to ‘False’ (or, equivalently, plotting the first rows of the ascending sort), we could also view the features that negatively impact price; a short sketch of that variation follows the chart.


import matplotlib.pyplot as plt

fi = pd.DataFrame({
    'feature': X_train.columns,
    'importance': lasso_model.coef_
})
fi.sort_values('importance', ascending=True, inplace=True)

sns.set(font_scale=2)
fig, ax = plt.subplots()
fig.set_size_inches(16, 12)  # the size of A4 paper
sns.barplot(x='importance', y='feature', data=fi[-15:], orient='h', palette='rocket', saturation=0.7)
ax.set_title("Feature Importance", fontsize=40, y=1.01)
ax.set_xlabel('Importance', fontsize=30)
ax.set_ylabel('Feature', fontsize=30)
[Image: Lasso positive feature importance bar chart]
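To chart the features that push the predicted price down, we can plot the other end of the same sorted frame; since fi is already sorted ascending, the first 15 rows hold the most negative coefficients:

fig, ax = plt.subplots()
fig.set_size_inches(16, 12)
sns.barplot(x='importance', y='feature', data=fi[:15], orient='h', palette='rocket', saturation=0.7)
ax.set_title("Negative Feature Importance", fontsize=40, y=1.01)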

Predictions & Residuals

Another way we could evaluate our model is by looking at what the model predicted compared to what the actual value was. This shows us where our model is making mistakes, and how wrong its predictions were. We can view this in a data frame, or we can transfer this information to a graph that contrasts the actual and predicted values.


predictions = lasso_model.predict(X_test)
residuals_df = pd.DataFrame(predictions, y_test)
residuals_df.reset_index(inplace = True)
residuals_df.rename({'selling_price': 'actual', 0: 'predictions'}, axis = 1, inplace = True)
residuals_df['residuals'] = residuals_df.actual - residuals_df.predictions
residuals_df
[Image: DataFrame of actual values, predicted values, and residuals (actual minus predicted)]
# predicted y values
predictions = lasso_model.predict(X_test)

# residuals (the error between predictions and actual values)
residuals = y_test - predictions

sns.set(font_scale=2)
fig, ax = plt.subplots()
fig.set_size_inches(16, 12)
ax = sns.regplot(x="predictions", y="actual", data=residuals_df,
                 scatter_kws={'color': 'lightsalmon'},
                 line_kws={'color': 'darksalmon'})
ax.set_xlabel('Predicted', fontsize=30)
ax.set_ylabel('Actual', fontsize=30)
[Image: predicted vs. actual selling prices with a fitted regression line]

From these values we can see that our model does a decent job of creating accurate predictions, but there are many outliers that don’t seem to conform to our model.

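One quick way to eyeball those outliers is a histogram of the residuals; most errors should cluster near zero, with the outliers sitting in the tails:

fig, ax = plt.subplots()
fig.set_size_inches(16, 12)
sns.histplot(residuals_df['residuals'], bins=50)
ax.set_xlabel('Residual (actual minus predicted)', fontsize=30)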

Root Mean Squared Error

One last way we can evaluate our model's performance is by calculating the root MSE. This is the square root of the mean of the squared residuals, and it tells us how far off our predictions are on average.


from sklearn.metrics import mean_squared_error
(mean_squared_error(y_test, predictions))**0.5

The root MSE for this model was 327,518.584 rupees, which would be the equivalent of about $4,370. While we always want to minimize this value, being able to predict a used car's price to within about $4,000 is not a bad accomplishment. Using this model, the car company could reasonably price their cars based only on details such as the brand, kilometers driven, fuel type, and year.


Conclusion

Hopefully this was a helpful data project walk-through that guided you through some of the fundamentals of EDA, regression, and modeling in Python. Check out the full code on Github for more details.


If you’re ready for the next step, check out a guided walk-through of the iris data set to learn the basics of classification.


Translated from: https://towardsdatascience.com/regression-basics-code-walk-through-c2eac24da2e9
