Kaggle House Price Prediction (Li Mu's Approach Explained, Plus the Pitfalls), Part 2

狄哥博客

Last modified: 2024-04-01 15:59:28


Tags: python

First published: 2024-04-01 14:55:07

Article link: https://blog.csdn.net/XXxia1XX/article/details/137232189


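Summary (auto-generated by CSDN): This post walks through a hands-on Kaggle house price prediction project: preprocessing the data, defining the loss function, building the model, training with the Adam optimizer, and evaluating with K-fold cross-validation. The author shares code snippets and the key steps, which should help beginners understand and practice a deep learning project.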

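After a week, I have finally finished this hands-on project. The write-up comes in two parts: Part 1 covers Li Mu's approach and the pitfalls I ran into; Part 2 (this post) is the code with comments. The two are best read side by side. (Because my fundamentals are weak, many lines in this part are split into several simpler lines purely to make them easier to follow.)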

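Li Mu's Kaggle video:
15 Hands-on: Kaggle House Price Prediction + Course Competition: California 2020 House Price Prediction [Dive into Deep Learning v2] (bilibili)

Li Mu's Kaggle code:
4.10. Predicting House Prices on Kaggle, Dive into Deep Learning 2.0.0 documentation (d2l.ai)

Kaggle competition:
House Prices - Advanced Regression Techniques | Kaggle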

What follows is mostly code with comments.

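	0. Import the libraries
	1. Load and preprocess the data
		1.0 How the dataset is split in this competition
			train.csv is split into training and validation folds, and the model is trained with K-fold cross-validation
			test.csv is used for the final predictions, which are uploaded to the Kaggle competition
		1.1 How to load the data (in the code)
		1.2 How to handle string and non-string columns (in the code)
		1.3 How to standardize the non-string columns (in the code)
		1.4 DataFrame, NumPy array and Tensor: differences and when to use each (try it yourself or look it up)
	2. Define the loss function, the model, and the optimizer
		2.0 How to define the loss function (i.e. why a log-based relative error is used) (in the code)
		2.1 How to define the model to train (in the code)
		2.2 How to define the optimizer (look it up)
	3. Train the model and tune the hyperparameters
		3.1 What K-fold cross-validation is, what it is for, and how it is used here (related to 1.0) (look it up)
	4. Prediction function
		4.1 How to generate the prediction file (related to 1.4) (in the code)
		4.2 What the Kaggle competition workflow looks like (look it up)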

0. Import the libraries

# For the util module used here, see my earlier post
import pandas as pd
import numpy as np
import torch
from torch import nn
import util as d2l
import matplotlib.pyplot as plt

1. Load and preprocess the data

1.1 Load the training and test data


# Load the data. Note that train is a DataFrame here
train = pd.read_csv('/root/house-price-overview/train.csv')

# Note: the training labels are extracted here --- guards against pitfall 2
train_label = train['SalePrice']

# Convert the labels to an (n, 1) float tensor (the original referenced train_labels before it was defined)
train_labels = torch.tensor(train_label.values.reshape(-1, 1), dtype=torch.float32)

# Load the test set
test = pd.read_csv('/root/house-price-overview/test.csv')
# I keep modifying test, but the final output needs columns from the original test, so I load a backup copy
test_data = pd.read_csv('/root/house-price-overview/test.csv')

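1.2 Preprocess and standardize the string and non-string columns

# Note: this concatenates row-wise, so first drop the columns that do not match (Id in both frames, SalePrice in train) --- avoids pitfall 1
all_features = pd.concat((train.iloc[:, 1:-1], test.iloc[:, 1:]))


# If the test data were unavailable, the mean and standard deviation could be computed from the training data alone
numeric_features = all_features.dtypes[all_features.dtypes != 'object'].index
all_features[numeric_features] = all_features[numeric_features].apply(
    lambda x: (x - x.mean()) / (x.std()))

# After standardization every numeric feature has mean 0, so missing values can be set to 0
all_features[numeric_features] = all_features[numeric_features].fillna(0)

# dummy_na=True treats "na" (a missing value) as a valid category and creates an indicator feature for it
all_features = pd.get_dummies(all_features, dummy_na=True)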


n_train = train.shape[0]
all_features = all_features.astype(float) 
train_features = torch.tensor(all_features[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features[n_train:].values, dtype=torch.float32)
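
The preprocessing above is easier to follow on a tiny table. The sketch below is only an illustration: the two columns and their values are made up, and `select_dtypes(include='number')` is used as a stand-in for the `dtypes != 'object'` comparison in the post. It shows each step in order: standardize, fill missing values with 0, one-hot encode with a missing-value indicator, and cast to float so the values can be handed to `torch.tensor`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the concatenated train/test features (hypothetical columns and values).
toy = pd.DataFrame({
    'LotArea': [8000.0, 9600.0, np.nan, 11250.0],
    'Street': ['Pave', 'Grvl', None, 'Pave'],
})

# 1) Standardize the numeric columns; mean() and std() skip NaN by default.
numeric = toy.select_dtypes(include='number').columns
toy[numeric] = toy[numeric].apply(lambda x: (x - x.mean()) / x.std())

# 2) A standardized column has mean 0, so filling NaN with 0 is mean imputation.
toy[numeric] = toy[numeric].fillna(0)

# 3) One-hot encode the string column; dummy_na=True adds a Street_nan indicator.
toy = pd.get_dummies(toy, dummy_na=True)

# 4) Cast the indicator columns to float so .values is one numeric array.
toy = toy.astype(float)

print(toy.columns.tolist())
print(toy.values.dtype)
```

The row with the missing `Street` ends up with a 1 in `Street_nan`, so the model can still tell "missing" apart from the real categories.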

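# Below is my step-by-step version of the preprocessing, pasted here only to help explain the code above
# Note: it overwrites train, n_train and train_features, and it works on train alone (the Id column is kept),
# so run it separately from the pipeline above
#print(train.dtypes)
# Get the dtype of every column, then select columns by type
data_types = train.dtypes

#### Handle the non-string columns
# Boolean mask: True for non-object columns, False for object columns
is_numeric = data_types != 'object'
#print(is_numeric)
# Names of the non-object columns, as an Index
numeric_columns = data_types[is_numeric].index
numeric_features = numeric_columns
# Rescale every non-object column to mean 0 and variance 1, then replace NA with 0
# (the commented prints compare the values before and after filling NaN)
train[numeric_features] = train[numeric_features].apply(lambda x: (x - x.mean()) / x.std())
#print(train[numeric_features].head(20))
train[numeric_features] = train[numeric_features].fillna(0)
#print(train[numeric_features].head(20))


#### Handle the object columns
train = pd.get_dummies(train, dummy_na=True)
print(train)


# Convert the True/False booleans in train to the numbers 1 and 0
train = train.astype(float)
train_features = train.drop('SalePrice', axis=1)


#### Convert the DataFrame to NumPy and then to a tensor. Object columns cannot be converted directly, hence the steps above
n_train = train.shape[0]


train_features = torch.tensor(train_features.values, dtype=torch.float32)

print(train_features.shape)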

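2. Define the loss function, the model, and the optimizer

# Loss function. House prices span a wide range, so the error is measured on the logarithm of the price, which behaves like a relative error. The reasoning:

# log_rmse (root mean squared logarithmic error) is not a relative error in the strict sense. Relative error is the difference between the prediction and the true value, divided by the true value. log_rmse instead compares log(prediction) with log(true value), and since log(pred) - log(true) = log(pred / true), it depends only on the ratio of the two:
# - the same percentage error counts equally on a cheap house and on an expensive one, so expensive houses do not dominate the total error;
# - when the prediction and the true value are close, the log difference is approximately the relative error.
# This makes it well suited to targets that span a wide range or several orders of magnitude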

loss = nn.MSELoss()
in_features = train_features.shape[1]
print(in_features)
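# Build a linear network here (two Linear layers with no activation in between, so the model as a whole is still linear)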
def get_net():
    net = nn.Sequential(
    nn.Linear(in_features, 200),
    nn.Linear(200, 1)
    )
    return net
#net = nn.Sequential(nn.Linear(in_features,1))



def log_rmse(net, features, labels):
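    # net(features): the network's prediction for each row
    # clamp limits the predictions to [1, +inf), because log is undefined for values <= 0
    # and clamping at 1 keeps the log non-negative

    clipped_preds = torch.clamp(net(features), 1, float('inf'))
    
    # torch.log takes the log of the predictions and of the true values
    # loss computes the mean squared error between the two logs
    # sqrt takes the square root of that mean squared error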

    rmse = torch.sqrt(loss(torch.log(clipped_preds),
                           torch.log(labels)))
    return rmse.item()
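
To see why the log form behaves like a relative error, here is a small worked example using only the standard library. The prices are made up and `log_error` is a hypothetical helper, not part of the post's pipeline; it computes the per-house quantity that `log_rmse` squares and averages.

```python
import math

def log_error(pred, actual):
    """Absolute difference of logs, i.e. |log(pred / actual)|."""
    return abs(math.log(pred) - math.log(actual))

# The same 10% overestimate on a cheap and on an expensive house (made-up prices).
cheap = log_error(110_000, 100_000)
expensive = log_error(1_100_000, 1_000_000)

# The absolute errors differ by a factor of 10, but the log errors are equal.
print(cheap, expensive)

# For small errors the log error is close to the plain relative error.
relative = (110_000 - 100_000) / 100_000
print(relative)
```

Both log errors equal log(1.1), roughly 0.095, which is close to the 0.10 relative error. With a plain squared error, the expensive house would contribute 100 times more to the loss than the cheap one.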

# Training function. It records the training error and the validation error after every epoch
def trainer(net, train_features, train_labels, test_features, test_labels,
          num_epochs, learning_rate, weight_decay, batch_size):
    train_ls, test_ls = [], []
    train_iter = d2l.load_array((train_features, train_labels), batch_size)
    # The Adam optimization algorithm is used here
    optimizer = torch.optim.Adam(net.parameters(),
                                 lr = learning_rate,
                                 weight_decay = weight_decay)
    for epoch in range(num_epochs):
        for X, y in train_iter:
            optimizer.zero_grad()
            l = loss(net(X), y)
            l.backward()
            optimizer.step()
        train_ls.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_ls.append(log_rmse(net, test_features, test_labels))

    
    return train_ls, test_ls

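3. Train the model and tune the hyperparameters

# K-fold data helper: for fold i of K-fold cross-validation, it splits the data into a training set and a validation set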
def get_k_fold_data(k, i, X, y):
    assert k > 1
    fold_size = X.shape[0] // k
    X_train, y_train = None, None
    for j in range(k):
        idx = slice(j * fold_size, (j + 1) * fold_size)
        X_part, y_part = X[idx, :], y[idx]
        if j == i:
            X_valid, y_valid = X_part, y_part
        elif X_train is None:
            X_train, y_train = X_part, y_part
        else:
            X_train = torch.cat([X_train, X_part], 0)
            y_train = torch.cat([y_train, y_part], 0)
    return X_train, y_train, X_valid, y_valid
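
The slicing logic in `get_k_fold_data` can be checked without any tensors. The sketch below mirrors it with plain row indices; `k_fold_indices` is a hypothetical helper written for this illustration, not part of the post's code.

```python
def k_fold_indices(n, k, i):
    """Return (train_idx, valid_idx) for fold i, using contiguous blocks of rows."""
    assert k > 1 and 0 <= i < k
    fold_size = n // k  # rows at positions >= k * fold_size are never used
    train_idx, valid_idx = [], []
    for j in range(k):
        block = list(range(j * fold_size, (j + 1) * fold_size))
        if j == i:
            valid_idx = block          # fold i is held out for validation
        else:
            train_idx.extend(block)    # every other fold goes to training
    return train_idx, valid_idx

train_idx, valid_idx = k_fold_indices(n=10, k=3, i=1)
print(valid_idx)  # [3, 4, 5]
print(train_idx)  # [0, 1, 2, 6, 7, 8]; row 9 is dropped because 10 // 3 == 3
```

Two properties are worth noticing: the folds are contiguous blocks taken in file order (nothing is shuffled), and any remainder rows are silently left out when the row count is not divisible by `k`.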

# K-fold cross-validation
def k_fold(k, X_train, y_train, num_epochs, learning_rate, weight_decay,
           batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, X_train, y_train)
        net = get_net()
        train_ls, valid_ls = trainer(net, *data, num_epochs, learning_rate,
                                   weight_decay, batch_size)
        train_l_sum += train_ls[-1]
        valid_l_sum += valid_ls[-1]
        if i == 0:
            d2l.plot(list(range(1, num_epochs + 1)), [train_ls, valid_ls],
                     xlabel='epoch', ylabel='rmse', xlim=[1, num_epochs],
                     legend=['train', 'valid'], yscale='log')
        print(f'fold {i + 1}, train log rmse {float(train_ls[-1]):f}, '
              f'valid log rmse {float(valid_ls[-1]):f}')
    return train_l_sum / k, valid_l_sum / k

4. Prediction function

# Choose the hyperparameters and check them with K-fold cross-validation before predicting
k, num_epochs, lr, weight_decay, batch_size = 5, 100, 0.01, 0, 64
train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr,
                          weight_decay, batch_size)
print(f'{k}-fold validation: avg train log rmse: {float(train_l):f}, '
      f'avg valid log rmse: {float(valid_l):f}')
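
The code above stops at cross-validation, so here is a minimal sketch of step 4.1, writing the prediction file. It assumes the submission needs an `Id` column and a `SalePrice` column, which is why a backup of the original test frame was kept earlier. `make_submission` is a hypothetical helper, and the Ids and prices below are made up; in the real pipeline the predictions would come from a network retrained on all of the training data, for example `net(test_features).detach().numpy()`.

```python
import io

import numpy as np
import pandas as pd

def make_submission(test_ids, preds):
    """Pair each test Id with its predicted SalePrice."""
    preds = np.asarray(preds, dtype=float).reshape(-1)  # flatten (n, 1) to (n,)
    return pd.DataFrame({'Id': list(test_ids), 'SalePrice': preds})

# Made-up values standing in for test_data['Id'] and the network's output.
submission = make_submission([1461, 1462, 1463],
                             [[120000.0], [155000.5], [98000.0]])

# An in-memory buffer keeps the sketch self-contained;
# in practice pass a file path such as 'submission.csv'.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` matters: without it pandas writes an extra unnamed index column in front of `Id`.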
