This project is a regression problem: predicting the used price of a phone (used_price). Below is a summary of the code and methods used.
Data preprocessing:
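All snippets below assume the usual imports; a minimal setup inferred from the calls used in this summary:
# imports assumed throughout these notes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error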
# View top 5 rows of the data
data.head()
# View a random sample of the data for better observation
data.sample(n=5, random_state=1)
# check number of rows and columns
data.shape
# checking column datatypes and number of non-null values
data.info()
# let's create a copy of the data to avoid any changes to original data
df = data.copy()
# Converting object data types to category
category_col = df.select_dtypes(exclude=np.number).columns.tolist()  # exclude all numeric columns
# select_dtypes returns a subset of the DataFrame's columns based on dtype.
# Notes: to select all numeric types, use np.number or 'number';
# to select strings, you must use the object dtype, but note this returns all object-dtype columns;
# to select datetimes, use np.datetime64, 'datetime', or 'datetime64';
# to select timedeltas, use np.timedelta64, 'timedelta', or 'timedelta64';
# to select pandas categorical dtypes, use 'category'.
df[category_col] = df[category_col].astype("category")  # .tolist() converts the column Index to a list
# checking to see that all columns are of the right data types
df.dtypes
# checking for missing values
df.isnull().sum()
Next, Exploratory Data Analysis (EDA):
# Let's look at the statistical summary of the data
df.describe(include="all").T  # transpose for easier reading
# visual analysis of how the ram varies per brand
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=df, y="ram", x="brand_name")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=df, y="ram", x="brand_name")
plt.xticks(rotation=90)
plt.show()
----------------------------------------------------------------------
Correlation plots
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# correlation heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
df[numeric_columns].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
)
plt.show()
DATA PREPROCESSING
# fill missing values in these columns with the column median
col_to_fill = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
df[col_to_fill] = df[col_to_fill].apply(lambda x: x.fillna(x.median()), axis=0)
# Plot of "new_price", "used_price" and "weight" before log transformation
cols_to_log = ["new_price", "used_price", "weight"]
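The comment above refers to a plot that is not reproduced in these notes; a minimal histogram sketch of the three columns before transformation:
# distributions of the columns before log transformation
plt.figure(figsize=(15, 4))
for i, col in enumerate(cols_to_log):
    plt.subplot(1, 3, i + 1)
    sns.histplot(df[col], kde=True)
    plt.title(col)
plt.tight_layout()
plt.show()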
# Log transformation
for colname in cols_to_log:
    df[colname + "_log"] = np.log(df[colname])
df.drop(cols_to_log, axis=1, inplace=True)  # drop the original (un-logged) columns
# Create dummy variables for the categorical variables
df1 = pd.get_dummies(df, columns=["brand_name", "os", "4g", "5g"], drop_first=True)
# let's plot the boxplots of all numerical columns to check for outliers
plt.figure(figsize=(20, 30))
for i, var in enumerate(df.select_dtypes(include=np.number).columns.tolist()):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[var], whis=1.5)
    plt.tight_layout()
    plt.title(var)
plt.show()
Build the linear regression model
# defining X and y variables
X = df1.drop(["used_price_log"], axis=1)
y = df1["used_price_log"]
# splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=42
)
linearregression = LinearRegression()  # LinearRegression has been imported above
linearregression.fit(X_train, y_train)  # fit the model on the training data
# dataframe to show the model coefficients and intercept
coef_df = pd.DataFrame(
np.append(linearregression.coef_, linearregression.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
Model performance
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf
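For example, the helper can be applied to both splits of the fitted model above (a quick usage sketch):
# model performance on training and test data
print(model_performance_regression(linearregression, X_train, y_train))
print(model_performance_regression(linearregression, X_test, y_test))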
# unlike sklearn, statsmodels does not add a constant to the data on its own
# we have to add the constant manually
X_train1 = sm.add_constant(X_train)
# adding constant to the test data
X_test1 = sm.add_constant(X_test)
olsmod0 = sm.OLS(y_train, X_train1).fit()
print(olsmod0.summary())
Checking Linear Regression Assumptions
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
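The residual diagnostics below use a frame df_pred holding the model's fitted values and residuals. A minimal sketch of how it could be built; olsmod0 from above is used here as a stand-in, since the notebook's X_train4 and olsmod2, referenced later, come from a model refined after the multicollinearity checks:
# hypothetical construction of df_pred from the fitted OLS model
df_pred = pd.DataFrame(
    {
        "Fitted Values": olsmod0.fittedvalues,
        "Residuals": olsmod0.resid,
    }
)
df_pred.head()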
TEST FOR MULTICOLLINEARITY
Variance Inflation Factor (VIF):
General Rule of thumb:
If VIF is between 1 and 5, there is low multicollinearity.
If VIF is between 5 and 10, there is moderate multicollinearity.
If VIF exceeds 10, it shows signs of high multicollinearity.
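A minimal sketch of how VIF can be computed with statsmodels, assuming X_train1 (the constant-added training matrix from above) as input:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the constant-added training matrix
# (the 'const' column's VIF is typically ignored)
vif_series = pd.Series(
    [
        variance_inflation_factor(X_train1.astype(float).values, i)
        for i in range(X_train1.shape[1])
    ],
    index=X_train1.columns,
).sort_values(ascending=False)
print(vif_series)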
# Shapiro-Wilk test for normality of the residuals
stats.shapiro(df_pred["Residuals"])
# Since the p-value < 0.05, the residuals fail the Shapiro-Wilk test for normality. Strictly speaking, they are not normal; however, the histogram below shows a distribution close to normal, so as an approximation we treat the assumption as satisfied.
# histogram plot of the residual
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
# goldfeldquandt test for homoscedasticity
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], X_train4)
lzip(name, test)
#Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.
Summary
# Let us write the equation of linear regression
Equation = "Used Phone Price ="
print(Equation, end=" ")
for i in range(len(X_train4.columns)):
    if i == 0:
        # the first column is the constant added by sm.add_constant, i.e. the intercept
        print(np.round(olsmod2.params[i], 4), "+", end=" ")
    elif i != len(X_train4.columns) - 1:
        print(
            "(",
            np.round(olsmod2.params[i], 4),
            ")*(",
            X_train4.columns[i],
            ")",
            "+",
            end=" ",
        )
    else:
        print("(", np.round(olsmod2.params[i], 4), ")*(", X_train4.columns[i], ")")