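iFLYTEK: Hotel Price Prediction Challenge!

Datawhale Practical Guide

Topic: data mining, iFLYTEK competition

  • Competition name: Hotel Accommodation Price Prediction Challenge

  • Competition type: data mining

  • Competition link 👇: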

https://challenge.xfyun.cn/topic/info?type=accommodation-price&ch=vWxQGFU

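Background

With the domestic epidemic now under good control, tourism markets at home and abroad are gradually recovering and the market structure is returning to normal. The recovery of tourism will in turn boost the hotel industry and steer the market in a positive direction.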


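To broaden the possibilities of travel, hotel accommodation platforms keep offering more distinctive, personalized experiences. Differences between regions in transport, holidays, quality, and location all affect accommodation prices, so predicting those prices from known data is a considerable challenge.

Task

The task is to train a model on the training data, using the hotel-related information provided, and to predict the price of each hotel room in the test set.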

Dataset

The competition data consists of a training set and a test set with 15 fields in total; the target field is the value to predict.

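| Field | Description |
| --- | --- |
| id | Sample ID |
| host_id | Hotel ID |
| neighbourhood_group | Neighbourhood group |
| neighbourhood | Neighbourhood |
| room_type | Room type |
| minimum_nights | Minimum number of nights |
| number_of_reviews | Number of reviews |
| last_review | Date of the most recent review |
| reviews_per_month | Number of reviews per month |
| calculated_host_listings_count | Number of orders for the hotel |
| availability | Number of days the room can be booked in the next 365 days |
| region_1_id | Hotel region ID 1 |
| region_2_id | Hotel region ID 2 |
| region_3_id | Hotel region ID 3 |
| target | Hotel accommodation price (desensitized) |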

Evaluation Metric

The competition is scored with MAE (mean absolute error): the lower the MAE, the better. Reference evaluation code:

from sklearn.metrics import mean_absolute_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mean_absolute_error(y_true, y_pred)
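To make the metric concrete, here is a minimal sketch that recomputes the same example by hand with NumPy and compares it with scikit-learn. The variable names `mae_manual` and `mae_sklearn` are only illustrative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

# MAE is the mean of the absolute residuals: (0.5 + 0.5 + 0 + 1) / 4
mae_manual = np.mean(np.abs(y_true - y_pred))
mae_sklearn = mean_absolute_error(y_true, y_pred)

print(mae_manual, mae_sklearn)  # both are 0.5
```

Because every error counts linearly, a few badly mispriced rooms hurt the score less than they would under a squared-error metric.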

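Approach

This is a typical regression task. It is worth scaling the label before modeling and adding some feature engineering:

# Load the datasets and log-transform the label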
import pandas as pd
import numpy as np
train_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/train.csv')
test_df = pd.read_csv('./酒店住宿价格预测挑战赛公开数据/test.csv')
train_df['target'] = np.log1p(train_df['target'])
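The label scaling above uses `np.log1p`, which has to be undone with its inverse before submitting predictions. The sketch below shows the round trip on a handful of made-up prices. The values are hypothetical, since the real target is desensitized, and the sketch assumes a non-negative, right-skewed target.

```python
import numpy as np

# Hypothetical prices, chosen only for illustration (not taken from the competition data)
prices = np.array([0.0, 50.0, 120.0, 900.0, 10000.0])

log_prices = np.log1p(prices)    # log(1 + x): defined at 0 and compresses the long right tail
restored = np.expm1(log_prices)  # exp(x) - 1: the inverse of log1p

print(np.round(log_prices, 3))
print(np.allclose(restored, prices))
```

`np.expm1` is numerically safer than writing `np.exp(x) - 1` by hand when `x` is close to zero.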

from sklearn.model_selection import cross_val_predict, cross_validate
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Feature engineering on the training set
train_df['last_review_isnull'] = train_df['last_review'].isnull()
train_df['last_review_year'] = pd.to_datetime(train_df['last_review']).dt.year

train_df['neighbourhood_group_mean'] = train_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
train_df['neighbourhood_group_counts'] = train_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())

train_df['room_type_mean'] = train_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
train_df['room_type_counts'] = train_df['room_type'].map(train_df['room_type'].value_counts())

train_df['region_1_id_mean'] = train_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
train_df['region_1_counts'] = train_df['region_1_id'].map(train_df['region_1_id'].value_counts())

train_df['region_2_id_mean'] = train_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
train_df['region_2_counts'] = train_df['region_2_id'].map(train_df['region_2_id'].value_counts())

train_df['region_3_id_mean'] = train_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
train_df['region_3_counts'] = train_df['region_3_id'].map(train_df['region_3_id'].value_counts())

train_df['availability_month'] = train_df['availability'] // 30
train_df['availability_week'] = train_df['availability'] // 7

train_df['reviews_per_month_count'] = train_df['reviews_per_month'] * train_df['calculated_host_listings_count'] 
train_df['room_type_calculated_host_listings_count'] = train_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())

# Same features for the test set (group statistics are taken from the training set)
test_df['last_review_isnull'] = test_df['last_review'].isnull()
test_df['last_review_year'] = pd.to_datetime(test_df['last_review']).dt.year

test_df['neighbourhood_group_mean'] = test_df['neighbourhood_group'].map(train_df.groupby(['neighbourhood_group'])['target'].mean())
test_df['neighbourhood_group_counts'] = test_df['neighbourhood_group'].map(train_df['neighbourhood_group'].value_counts())

test_df['room_type_mean'] = test_df['room_type'].map(train_df.groupby(['room_type'])['target'].mean())
test_df['room_type_counts'] = test_df['room_type'].map(train_df['room_type'].value_counts())

test_df['region_1_id_mean'] = test_df['region_1_id'].map(train_df.groupby(['region_1_id'])['target'].mean())
test_df['region_1_counts'] = test_df['region_1_id'].map(train_df['region_1_id'].value_counts())

test_df['region_2_id_mean'] = test_df['region_2_id'].map(train_df.groupby(['region_2_id'])['target'].mean())
test_df['region_2_counts'] = test_df['region_2_id'].map(train_df['region_2_id'].value_counts())

test_df['region_3_id_mean'] = test_df['region_3_id'].map(train_df.groupby(['region_3_id'])['target'].mean())
test_df['region_3_counts'] = test_df['region_3_id'].map(train_df['region_3_id'].value_counts())

test_df['availability_month'] = test_df['availability'] // 30
test_df['availability_week'] = test_df['availability'] // 7

test_df['reviews_per_month_count'] = test_df['reviews_per_month'] * test_df['calculated_host_listings_count'] 
test_df['room_type_calculated_host_listings_count'] = test_df['room_type'].map(train_df.groupby(['room_type'])['calculated_host_listings_count'].sum())

# Train the models with cross-validation (5 folds by default)
cat_val = cross_validate(
    CatBoostRegressor(verbose=0,n_estimators=1000),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

lgb_val = cross_validate(
    LGBMRegressor(verbose=0, force_row_wise=True),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

xgb_val = cross_validate(
    XGBRegressor(),
    train_df.drop(['id', 'target', 'last_review'], axis=1),
    train_df['target'],
    return_estimator=True
)

# Predict on the test set: average the fold models, then invert the log1p transform
estimators = cat_val['estimator']
# estimators = cat_val['estimator'] + lgb_val['estimator'] + xgb_val['estimator']
pred = np.zeros(len(test_df))
for clf in estimators:
    pred += clf.predict(test_df.drop(['id', 'last_review'], axis=1))
pred /= len(estimators)
pred = np.expm1(pred)
pd.DataFrame({'id': range(30000, 40000), 'target': pred}).to_csv('a.csv', index=None)
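The baseline imports `cross_val_predict` but never reports a validation MAE, and `cross_validate` without a `scoring` argument returns the regressor's default score (R²) rather than MAE. A minimal sketch of measuring out-of-fold MAE on the original price scale follows. It runs on synthetic data, and `GradientBoostingRegressor` is a stand-in chosen only to keep the example dependency-free; neither is part of the original baseline.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in data (assumption): 300 rows, 4 numeric features, a skewed positive target
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = np.exp(1.0 + 0.8 * X[:, 0] + 0.1 * rng.normal(size=300))

# Train on log1p(y), as the baseline does, and collect out-of-fold predictions
oof_log = cross_val_predict(GradientBoostingRegressor(random_state=0), X, np.log1p(y), cv=5)

# Invert the transform before scoring so the MAE is in the same units as the target
oof_mae = mean_absolute_error(y, np.expm1(oof_log))
print(f"out-of-fold MAE: {oof_mae:.4f}")
```

With the real data, the same pattern would take the engineered training features in place of `X` and the untransformed `target` column in place of `y`, which gives a local estimate to compare feature sets before submitting.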

Full code 👇:

https://github.com/datawhalechina/competition-baseline/tree/master/competition/%E7%A7%91%E5%A4%A7%E8%AE%AF%E9%A3%9EAI%E5%BC%80%E5%8F%91%E8%80%85%E5%A4%A7%E8%B5%9B2023
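One caveat about the mean-target features in the baseline: they are computed on the same training rows they are attached to, so each row's encoding includes its own label, which can make validation scores look better than they are. A common mitigation is out-of-fold encoding. The sketch below is not part of the original baseline; the helper name `oof_target_mean` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_mean(df, col, target, n_splits=5, seed=0):
    """Encode `col` with the target mean computed on the other folds only."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Group means come from rows outside the fold being encoded
        means = df.iloc[fit_idx].groupby(col)[target].mean()
        encoded.iloc[enc_idx] = df.iloc[enc_idx][col].map(means).to_numpy()
    # Categories unseen in the fitting folds fall back to the global mean
    return encoded.fillna(df[target].mean())

# Toy frame (assumption): column names mirror the competition schema, values are made up
toy = pd.DataFrame({
    "room_type": [0, 0, 1, 1, 2, 2, 0, 1, 2, 0],
    "target":    [4.1, 4.3, 5.0, 5.2, 3.0, 3.1, 4.0, 5.1, 2.9, 4.2],
})
toy["room_type_mean_oof"] = oof_target_mean(toy, "room_type", "target")
print(toy)
```

For the test set, mapping the means computed on the full training set, as the baseline already does, remains appropriate because test rows have no labels to leak.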

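Join the Study Program

The AI Summer Camp, jointly launched by Datawhale, iFLYTEK, and Tianchi, is built around the latest iFLYTEK and Tianchi competitions. Registration for the third session closes on August 15; scan the QR code to apply.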

