Kaggle-House Prices

This article walks through a complete pass at the Kaggle House Prices competition: data preprocessing, feature engineering, and model building and tuning. Numeric features are standardized, categorical features are one-hot encoded, and missing values are filled; Ridge Regression and Random Forest models are built, and ensemble methods such as bagging and boosting (e.g. AdaBoost + Ridge and XGBoost) are used to improve prediction accuracy, illustrating how adjusting model parameters optimizes performance.

Some time ago I tried the House Prices competition on Kaggle. It is a regression problem: analyze the given training set to predict the sale prices of the houses in the test set. The data consists of 79 house features, such as the number of bedrooms or whether the property fronts a street. During preprocessing, the numeric features are standardized, while the string-valued features are handled with pd.get_dummies(), which replaces each category with 0/1 indicator columns, so that all features can be processed uniformly. The code follows:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from mxnet import autograd, gluon, init, nd
from mxnet.gluon import loss as gloss, data as gdata, nn
import gluonbook as gb

train_data = pd.read_csv(r'D:\data\train.csv')
test_data = pd.read_csv(r'D:\data\test.csv')

# Data preprocessing: stack training and test features so they can be handled together
all_features = pd.concat([train_data.iloc[:, 1:-1], test_data.iloc[:, 1:]], axis=0, ignore_index=True)
numeric_index = all_features.dtypes[all_features.dtypes != 'object']  # numeric_index is a pandas Series
all_features[numeric_index.index] = all_features[numeric_index.index].apply(lambda x: (x - x.mean()) / x.std())
all_features = all_features.fillna(all_features.mean())
# Create indicator features
all_features = pd.get_dummies(all_features, dummy_na=True)  # encodes the string-valued features as 0/1 columns, growing the feature count from 79 to 331


# Split back into training and test data
n_train = train_data.shape[0]
train_features = nd.array(all_features[:n_train].values)
test_features = nd.array(all_features[n_train:].values)
train_labels = nd.array(train_data.iloc[:, -1]).reshape((-1, 1))  # shape (n, 1)

# Train the model
loss = gloss.L2Loss()
def get_net():
    net = nn.Sequential()
    net.add(nn.Dense(256, activation='relu'), nn.Dense(1))
    net.initialize()
    return net
# The log RMSE metric required by the competition: sqrt(mean((log(y_hat) - log(y))^2))
def log_rmse(net, train_features, train_labels):
    clipped_pred = nd.clip(net(train_features), 1, float('inf'))  # clamp predictions to at least 1 so the log is stable
    rmse = nd.sqrt(2 * loss(clipped_pred.log(), train_labels.log()).mean())  # the factor 2 cancels the 1/2 inside gluon's L2Loss
    return rmse.asscalar()
# Training function
# A learning-rate decay schedule and dropout layers could also be added here
def train(net, train_features, train_labels, test_features, test_labels, num_epochs, learning_rate, weight_decay, batch_size):
    train_l, test_l = [], []
    train_iter = gdata.DataLoader(gdata.ArrayDataset(train_features, train_labels), batch_size, shuffle=True)  # read the data in mini-batches
    # Use the Adam optimizer
    trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': learning_rate, 'wd': weight_decay})
    for epoch in range(num_epochs):
        for X, y in train_iter:
            with autograd.record():
                l = loss(net(X), y)
            l.backward()
            trainer.step(batch_size)
        train_l.append(log_rmse(net, train_features, train_labels))
        if test_labels is not None:
            test_l.append(log_rmse(net, test_features, test_labels))
    return train_l, test_l

# K-fold cross-validation (build the training and validation folds)

def get_k_fold_data(k, i, x, y):
    assert k > 1
    fold_size = x.shape[0] // k
    x_train, y_train = None, None
    for j in range(k):
        ix = slice(j * fold_size, (j + 1) * fold_size)  # a handy use of slice()
        x_part, y_part = x[ix, :], y[ix]
        if j == i:
            x_valid, y_valid = x_part, y_part
        elif x_train is None:
            x_train, y_train = x_part, y_part  # first assignment of x_train, y_train
        else:
            x_train = nd.concat(x_train, x_part, dim=0)  # merge the remaining (k-1) folds into the training set
            y_train = nd.concat(y_train, y_part, dim=0)
    return x_train, y_train, x_valid, y_valid


# Return the average training and validation error over the k folds
def k_fold(k, x_train, y_train, num_epochs, learning_rate, weight_decay, batch_size):
    train_l_sum, valid_l_sum = 0, 0
    for i in range(k):
        data = get_k_fold_data(k, i, x_train, y_train)
        net = get_net()
        train_ls, valid_ls = train(net, *data, num_epochs, learning_rate, weight_decay, batch_size)
        train_l_sum += train_ls[-1]  # only the last epoch's loss is needed: it is the smallest loss after the parameters w, b have been optimized
        valid_l_sum += valid_ls[-1]
        if i == 0:
            # see the "model selection, underfitting and overfitting" chapter for the arguments of gb.semilogy
            gb.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse', range(1, num_epochs + 1), valid_ls, ['train', 'valid'])
        print('fold %d, train rmse: %f, valid rmse: %f' % (i, train_ls[-1], valid_ls[-1]))
    return train_l_sum / k, valid_l_sum / k
k, num_epochs, lr, weight_decay, batch_size = 3, 30, 0.1, 20, 64

train_l, valid_l = k_fold(k, train_features, train_labels, num_epochs, lr, weight_decay, batch_size)
print('%d-fold validation: avg train rmse: %f, avg valid rmse: %f' % (k, train_l, valid_l))


   

After running k-fold cross-validation, we can watch how the training and validation errors evolve across the epochs and tune the hyperparameters accordingly:
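For example, a minimal tuning sketch using the k_fold function above (the candidate values are illustrative assumptions, not tuned results):

best_rmse, best_params = float('inf'), None
for lr_cand in [0.05, 0.1, 0.5]:      # candidate learning rates
    for wd_cand in [0, 20, 100]:      # candidate weight decays
        _, valid_rmse = k_fold(k, train_features, train_labels, num_epochs, lr_cand, wd_cand, batch_size)
        if valid_rmse < best_rmse:    # keep the settings with the lowest average validation rmse
            best_rmse, best_params = valid_rmse, (lr_cand, wd_cand)
print('best valid rmse: %f with (lr, wd) = %s' % (best_rmse, best_params))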

Next, retrain on the whole training set, predict on the test set, and generate the .csv file required for submission:

# Prediction function (no k-fold here: the model is retrained on the complete training set, and the predictions are saved in csv format)
def train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size):
    net = get_net()
    train_ls, _ = train(net, train_features, train_labels, None, None, num_epochs, lr, weight_decay, batch_size)
    gb.semilogy(range(1, num_epochs + 1), train_ls, 'epochs', 'rmse')
    print('train rmse %f' % train_ls[-1])
    preds = net(test_features).asnumpy()
    test_data['SalePrice'] = pd.Series(preds.reshape((-1, 1))[:, 0])
    submission = pd.concat([test_data['Id'], test_data['SalePrice']], axis=1)
    submission.to_csv(r'D:\data\submission.csv', index=False)

train_and_pred(train_features, test_features, train_labels, test_data, num_epochs, lr, weight_decay, batch_size)

This completes the basic framework; what remains is parameter tuning. One can add layers or hidden units to the network, lower the learning rate, increase weight_decay, or insert dropout layers (a sketch follows below). This entry-level competition is meant to provide hands-on modeling experience and a feel for the overall process of data preprocessing, model building, and parameter tuning.
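For instance, a hedged sketch of a deeper get_net() variant with dropout (the layer widths and drop rate are assumptions, not tuned values):

def get_net_with_dropout(drop_prob=0.5):
    net = nn.Sequential()
    # two hidden layers, each followed by a dropout layer
    net.add(nn.Dense(256, activation='relu'), nn.Dropout(drop_prob),
            nn.Dense(128, activation='relu'), nn.Dropout(drop_prob),
            nn.Dense(1))
    net.initialize()
    return net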

 

Notes:

1) np.clip():

numpy.clip(a, a_min, a_max, out=None); for details see https://blog.csdn.net/HHTNAN/article/details/79799612 (the code above uses its MXNet counterpart, nd.clip).
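A quick illustration (the array values are arbitrary):

import numpy as np

a = np.array([-2.0, 0.5, 3.0, 10.0])
print(np.clip(a, 1, 5))  # [1. 1. 3. 5.] -- values below 1 are raised to 1, values above 5 lowered to 5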

2) The * in a function call, for example:

train_ls, valid_ls = train(net, *data, num_epochs, learning_rate, weight_decay, batch_size)

Here *data unpacks the tuple returned by get_k_fold_data into separate positional arguments; ** likewise unpacks a dictionary into keyword arguments.
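A tiny example of both (the function and names are hypothetical):

def f(a, b, c):
    return a + b + c

args = (1, 2, 3)
kwargs = {'a': 1, 'b': 2, 'c': 3}
print(f(*args))     # 6: the tuple is unpacked into positional arguments
print(f(**kwargs))  # 6: the dict is unpacked into keyword arguments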

 

Besides this approach, one can also train Ridge Regression and Random Forest models jointly and average their predictions to obtain the final values (a sketch of the averaging step appears after the data is loaded below). The rough process is as follows:

House price prediction case study

Step 1: Inspect the source data

In [5]:

import numpy as np
import pandas as pd

Read in the data

  • The index column of the source data usually carries no information of its own, so we can use it as the index of our pandas DataFrame, which also makes lookups easier later.

In [6]:

train_df = pd.read_csv('../input/train.csv', index_col=0)
test_df = pd.read_csv('../input/test.csv', index_col=0)

Inspect the data

In [7]:

train_df.head()

Out[7]:

   MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape LandContour Utilities LotConfig ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
Id
1          60       RL         65.0     8450   Pave   NaN      Reg         Lvl    AllPub    Inside ...        0    NaN   NaN         NaN       0      2   2008       WD        Normal    208500
2          20       RL         80.0     9600   Pave   NaN      Reg         Lvl    AllPub       FR2 ...        0    NaN   NaN         NaN       0      5   2007       WD        Normal    181500
3          60       RL         68.0    11250   Pave   NaN      IR1         Lvl    AllPub    Inside ...        0    NaN   NaN         NaN       0      9   2008       WD        Normal    223500
4          70       RL         60.0     9550   Pave   NaN      IR1         Lvl    AllPub    Corner ...        0    NaN   NaN         NaN       0      2   2006       WD       Abnorml    140000
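The original walkthrough breaks off here. As a rough sketch of the averaging step described above, assuming scikit-learn and reusing train_df/test_df from the cells above (alpha, n_estimators, and max_features are illustrative values, not tuned results):

import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor

# Same preprocessing idea as before: log-transform the target, one-hot encode, fill missing values
y_train = np.log1p(train_df.pop('SalePrice'))
all_df = pd.get_dummies(pd.concat((train_df, test_df), axis=0))
all_df = all_df.fillna(all_df.mean())
X_train = all_df.iloc[:len(y_train)].values
X_test = all_df.iloc[len(y_train):].values

# Fit both models on the full training set
ridge = Ridge(alpha=15).fit(X_train, y_train)
rf = RandomForestRegressor(n_estimators=500, max_features=0.3).fit(X_train, y_train)

# Average the two predictions and undo the log transform
y_pred = np.expm1((ridge.predict(X_test) + rf.predict(X_test)) / 2)
submission = pd.DataFrame({'Id': test_df.index, 'SalePrice': y_pred})
submission.to_csv('submission.csv', index=False)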