Problem Overview
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.
Loading the Data
import pandas as pd
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')
Exploring the Data
The shape of the data: 1460 rows and 81 columns.
print(train_data.shape)
(1460, 81)
The first 5 rows. SalePrice is the sale price (the prediction target), Id is just a row identifier, and the remaining columns are features.
print(train_data.head())
Id MSSubClass MSZoning ... SaleType SaleCondition SalePrice
0 1 60 RL ... WD Normal 208500
1 2 20 RL ... WD Normal 181500
2 3 60 RL ... WD Normal 223500
3 4 70 RL ... WD Abnorml 140000
4 5 60 RL ... WD Normal 250000
info() gives a quick description of the dataset: the total number of rows, each column's dtype, and the count of non-null values.
print(train_data.info())
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
Missing-value check: the number of missing entries per column. For example, the LotFrontage column has 1460 rows in total, 259 of them missing.
print(train_data.isna().sum().where(lambda x: x > 0).dropna())
LotFrontage 259.0
Alley 1369.0
MasVnrType 8.0
MasVnrArea 8.0
BsmtQual 37.0
BsmtCond 37.0
BsmtExposure 38.0
BsmtFinType1 37.0
BsmtFinType2 38.0
Electrical 1.0
FireplaceQu 690.0
GarageType 81.0
GarageYrBlt 81.0
GarageFinish 81.0
GarageQual 81.0
GarageCond 81.0
PoolQC 1453.0
Fence 1179.0
MiscFeature 1406.0
describe() shows summary statistics for the numeric columns.
mean is the average; std is the standard deviation, which measures dispersion: a large standard deviation means most values lie far from the mean, while a small one means the values cluster close to it.
print(train_data.describe())
Id MSSubClass LotFrontage LotArea OverallQual
count 1460.000000 1460.000000 1201.000000 1460.000000 1460.000000
mean 730.500000 56.897260 70.049958 10516.828082 6.099315
std 421.610009 42.300571 24.284752 9981.264932 1.382997
min 1.000000 20.000000 21.000000 1300.000000 1.000000
25% 365.750000 20.000000 59.000000 7553.500000 5.000000
50% 730.500000 50.000000 69.000000 9478.500000 6.000000
75% 1095.250000 70.000000 80.000000 11601.500000 7.000000
max 1460.000000 190.000000 313.000000 215245.000000 10.000000
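As a small illustration of what the standard deviation captures (toy numbers, unrelated to this dataset), two samples can share the same mean yet differ greatly in spread:

```python
import numpy as np

# Two samples with the same mean (50) but very different spread
tight = np.array([48, 49, 50, 51, 52])
wide = np.array([10, 30, 50, 70, 90])

print(tight.mean(), wide.mean())            # both 50.0
print(tight.std(ddof=1), wide.std(ddof=1))  # ~1.58 vs ~31.62
```

Note that pandas describe() reports the sample standard deviation (ddof=1), which is why ddof=1 is used here.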
Distribution of the data. bins is the number of bars in each histogram; each bar's height is the count of samples whose value falls within that bar's interval.
import matplotlib.pyplot as plt
train_data_x = train_data.iloc[:, 1:-1]
train_data_x.hist(bins=50, figsize=(20, 15))
plt.show()
Correlation analysis: the smaller a coefficient's absolute value, the weaker the linear relationship with SalePrice. (numeric_only=True restricts the computation to numeric columns; recent pandas versions require it when text columns are present.)
corr_matrix = train_data.corr(numeric_only=True)
print(corr_matrix["SalePrice"].sort_values(ascending=False))
SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
GarageYrBlt 0.486362
MasVnrArea 0.477493
Fireplaces 0.466929
BsmtFinSF1 0.386420
Data Preprocessing
Handling missing values
# Fill missing values in text columns with the placeholder "nan" and encode the
# text with LabelEncoder; fill missing values in numeric columns with the column mean.
from sklearn.preprocessing import LabelEncoder

def preprocessing(data):
    for i in data.columns:
        if data[i].dtype == "object":
            # Missing text values become their own category before encoding
            data[i] = LabelEncoder().fit_transform(data[i].fillna("nan"))
        else:
            data[i] = data[i].fillna(data[i].mean())
    return data
train_data_encoded = preprocessing(train_data)
print(train_data_encoded)
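A quick sanity check of this preprocessing logic on a tiny made-up DataFrame (hypothetical column names, same transformation) confirms that no missing values remain and every column ends up numeric:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

def preprocessing(data):
    for i in data.columns:
        if data[i].dtype == "object":
            data[i] = LabelEncoder().fit_transform(data[i].fillna("nan"))
        else:
            data[i] = data[i].fillna(data[i].mean())
    return data

toy = pd.DataFrame({
    "MSZoning": ["RL", None, "RM"],       # text column with a missing value
    "LotFrontage": [60.0, np.nan, 80.0],  # numeric column with a missing value
})
encoded = preprocessing(toy)
print(encoded.isna().sum().sum())   # 0 -- no missing values remain
print(encoded["LotFrontage"].tolist())  # [60.0, 70.0, 80.0] -- NaN replaced by the mean
```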
Train/Test Split
Split features and target, excluding the first column (the Id attribute):
from sklearn.model_selection import train_test_split

x = train_data_encoded.iloc[:, 1:-1]
y = train_data_encoded.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=123)
Model Training
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
forest_reg.fit(x_train, y_train)
# Compute the root mean squared error on the training set
housing_predictions = forest_reg.predict(x_train)
forest_mse = mean_squared_error(y_train, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print(forest_rmse)
11052.088424558873
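The RMSE above is measured on the same data the model was fit on, so it is an optimistic estimate. A cross-validated RMSE gives a more honest figure; the sketch below runs on synthetic data from make_regression (an assumption so the snippet is self-contained) — swap in x_train/y_train for the real run:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for x_train / y_train
X, y = make_regression(n_samples=200, n_features=10, noise=10.0, random_state=42)

forest_reg = RandomForestRegressor(n_estimators=100, random_state=42)
# sklearn scorers follow "higher is better", so MSE is negated; flip the sign back
scores = cross_val_score(forest_reg, X, y, scoring="neg_mean_squared_error", cv=5)
rmse_scores = np.sqrt(-scores)
print(rmse_scores.mean(), rmse_scores.std())
```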
Validation on the held-out test set; the test score is on the low side.
score = forest_reg.score(x_test, y_test)
print(score)
0.8753273576332103
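For a regressor, score() returns the coefficient of determination R²: 1 means perfect prediction, 0 means no better than always predicting the mean. A small self-contained check of the formula against sklearn (toy numbers):

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 9.0])

# R^2 = 1 - SS_res / SS_tot
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
manual = 1 - ss_res / ss_tot
print(manual, r2_score(y_true, y_pred))  # both 0.975
```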
Summary:
The test score is on the low side; next time, study model optimization and hyperparameter tuning.