This project is a regression problem: predicting the used price of a phone (used_price). Below is a summary of the code and methods used.
Data preprocessing:
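All snippets below assume the usual imports; a minimal setup inferred from the calls used in this summary:
# imports assumed throughout these notes
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error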
# View top 5 rows of the data
data.head()
# View a random sample of the data for better observation
data.sample(n=5, random_state=1)
# check number of rows and columns
data.shape
# checking column datatypes and number of non-null values
data.info()
# let's create a copy of the data to avoid any changes to original data
df = data.copy()
# Converting object data types to category
category_col = df.select_dtypes(exclude=np.number).columns.tolist()  # exclude all numeric columns
# select_dtypes returns a subset of the DataFrame's columns based on dtype.
# Notes: to select all numeric types, use np.number or 'number';
# to select strings, you must use the object dtype, but note this returns all object-dtype columns;
# to select datetimes, use np.datetime64, 'datetime', or 'datetime64';
# to select timedeltas, use np.timedelta64, 'timedelta', or 'timedelta64';
# to select pandas categorical dtypes, use 'category'.
df[category_col] = df[category_col].astype("category")  # .tolist() converts the column Index to a list
# checking to see that all columns are of the right data types
df.dtypes
# checking for missing values
df.isnull().sum()
Next, Exploratory Data Analysis (EDA):
# Let's look at the statistical summary of the data
df.describe(include="all").T  # transpose for easier reading
# visual analysis of how the ram varies per brand
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
sns.barplot(data=df, y="ram", x="brand_name")
plt.xticks(rotation=90)
plt.subplot(1, 2, 2)
sns.boxplot(data=df, y="ram", x="brand_name")
plt.xticks(rotation=90)
plt.show()
----------------------------------------------------------------------
Correlation plots
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
# correlation heatmap
plt.figure(figsize=(15, 7))
sns.heatmap(
df[numeric_columns].corr(),
annot=True,
vmin=-1,
vmax=1,
fmt=".2f",
cmap="Spectral",
)
plt.show()
DATA PREPROCESSING
# fill missing values in these columns with the column median
col_to_fill = [
"main_camera_mp",
"selfie_camera_mp",
"int_memory",
"ram",
"battery",
"weight",
]
df[col_to_fill] = df[col_to_fill].apply(lambda x: x.fillna(x.median()), axis=0)
# Plot of "new_price", "used_price" and "weight" before log transformation
cols_to_log = ["new_price", "used_price", "weight"]
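The comment above refers to a plot that is not reproduced in these notes; a minimal histogram sketch of the three columns before transformation:
# distributions of the columns before log transformation
plt.figure(figsize=(15, 4))
for i, col in enumerate(cols_to_log):
    plt.subplot(1, 3, i + 1)
    sns.histplot(df[col], kde=True)
    plt.title(col)
plt.tight_layout()
plt.show()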
# Log transformation
for colname in cols_to_log:
    df[colname + "_log"] = np.log(df[colname])
df.drop(cols_to_log, axis=1, inplace=True)  # drop the original (un-logged) columns
# Create dummy variables for the categorical variables
df1 = pd.get_dummies(df, columns=["brand_name", "os", "4g", "5g"], drop_first=True)
# let's plot the boxplots of all numerical columns to check for outliers
plt.figure(figsize=(20, 30))
for i, var in enumerate(df.select_dtypes(include=np.number).columns.tolist()):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[var], whis=1.5)
    plt.tight_layout()
    plt.title(var)
plt.show()
Build the linear regression model
# defining X and y variables
X = df1.drop(["used_price_log"], axis=1)
y = df1["used_price_log"]
# splitting the data in 70:30 ratio for train to test data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=42
)
linearregression = LinearRegression()  # LinearRegression has been imported above
linearregression.fit(X_train, y_train)  # fit the model on the training data
# dataframe to show the model coefficients and intercept
coef_df = pd.DataFrame(
np.append(linearregression.coef_, linearregression.intercept_),
index=X_train.columns.tolist() + ["Intercept"],
columns=["Coefficients"],
)
coef_df
Model performance
# function to compute adjusted R-squared
def adj_r2_score(predictors, targets, predictions):
    r2 = r2_score(targets, predictions)
    n = predictors.shape[0]  # number of observations
    k = predictors.shape[1]  # number of predictors
    return 1 - ((1 - r2) * (n - 1) / (n - k - 1))
# function to compute MAPE
def mape_score(targets, predictions):
    return np.mean(np.abs(targets - predictions) / targets) * 100
# function to compute different metrics to check performance of a regression model
def model_performance_regression(model, predictors, target):
    """
    Function to compute different metrics to check regression model performance

    model: regressor
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)

    r2 = r2_score(target, pred)  # to compute R-squared
    adjr2 = adj_r2_score(predictors, target, pred)  # to compute adjusted R-squared
    rmse = np.sqrt(mean_squared_error(target, pred))  # to compute RMSE
    mae = mean_absolute_error(target, pred)  # to compute MAE
    mape = mape_score(target, pred)  # to compute MAPE

    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "RMSE": rmse,
            "MAE": mae,
            "R-squared": r2,
            "Adj. R-squared": adjr2,
            "MAPE": mape,
        },
        index=[0],
    )
    return df_perf
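For example, the helper can be applied to both splits of the fitted model above (a quick usage sketch):
# model performance on training and test data
print(model_performance_regression(linearregression, X_train, y_train))
print(model_performance_regression(linearregression, X_test, y_test))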
# unlike sklearn, statsmodels does not add a constant to the data on its own
# we have to add the constant manually
X_train1 = sm.add_constant(X_train)
# adding constant to the test data
X_test1 = sm.add_constant(X_test)
olsmod0 = sm.OLS(y_train, X_train1).fit()
print(olsmod0.summary())
Checking Linear Regression Assumptions
We will be checking the following Linear Regression assumptions:
No Multicollinearity
Linearity of variables
Independence of error terms
Normality of error terms
No Heteroscedasticity
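The residual diagnostics below use a frame df_pred holding the model's fitted values and residuals. A minimal sketch of how it could be built; olsmod0 from above is used here as a stand-in, since the notebook's X_train4 and olsmod2, referenced later, come from a model refined after the multicollinearity checks:
# hypothetical construction of df_pred from the fitted OLS model
df_pred = pd.DataFrame(
    {
        "Fitted Values": olsmod0.fittedvalues,
        "Residuals": olsmod0.resid,
    }
)
df_pred.head()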
TEST FOR MULTICOLLINEARITY
Variance Inflation Factor (VIF):
General Rule of thumb:
If VIF is between 1 and 5, there is low multicollinearity.
If VIF is between 5 and 10, there is moderate multicollinearity.
If VIF exceeds 10, it shows signs of high multicollinearity.
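A minimal sketch of how VIF can be computed with statsmodels, assuming X_train1 (the constant-added training matrix from above) as input:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each column of the constant-added training matrix
# (the 'const' column's VIF is typically ignored)
vif_series = pd.Series(
    [
        variance_inflation_factor(X_train1.astype(float).values, i)
        for i in range(X_train1.shape[1])
    ],
    index=X_train1.columns,
).sort_values(ascending=False)
print(vif_series)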
# Shapiro-Wilk test for normality of the residuals
stats.shapiro(df_pred["Residuals"])
# Since the p-value < 0.05, the residuals fail the Shapiro-Wilk test for normality. Strictly speaking, they are not normal; however, the histogram below shows a distribution close to normal, so as an approximation we treat the assumption as satisfied.
# histogram plot of the residual
sns.histplot(data=df_pred, x="Residuals", kde=True)
plt.title("Normality of residuals")
plt.show()
# let's plot the fitted values vs residuals
sns.residplot(
data=df_pred, x="Fitted Values", y="Residuals", color="purple", lowess=True
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Fitted vs Residual plot")
plt.show()
# goldfeldquandt test for homoscedasticity
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(df_pred["Residuals"], X_train4)
lzip(name, test)
#Since p-value > 0.05, we can say that the residuals are homoscedastic. So, this assumption is satisfied.
Summary
# Let us write the equation of linear regression
Equation = "Used Phone Price ="
print(Equation, end=" ")
for i in range(len(X_train4.columns)):
    if i == 0:
        # the first column is the constant added by sm.add_constant, i.e. the intercept
        print(np.round(olsmod2.params[i], 4), "+", end=" ")
    elif i != len(X_train4.columns) - 1:
        print(
            "(",
            np.round(olsmod2.params[i], 4),
            ")*(",
            X_train4.columns[i],
            ")",
            "+",
            end=" ",
        )
    else:
        print("(", np.round(olsmod2.params[i], 4), ")*(", X_train4.columns[i], ")")