A Classic Analysis of Hotel Operations Data from Portugal

Latest recommended article published on 2023-04-26 11:03:10

Fuelliott

Tags: python, data analysis

Article link: https://blog.csdn.net/Fuelliott/article/details/125263557

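Contents

Preface
I. Data import, quick preview, and data cleaning
II. Booking data analysis
Summary of the data analysis
III. Cancellation rate prediction
Specific effects of the features that carry the most weight in the prediction model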

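Preface

Data source: https://www.sciencedirect.com/science/article/pii/S2352340918315191
The dataset covers two hotels: a resort hotel and a city hotel. The former sits in the heart of a resort area, the latter in downtown Lisbon, the capital. It contains the booking records of both hotels from July 1, 2015 to August 31, 2017.

Using these years of data, we extract the operating patterns of the two hotels, refine the customer profile, and build a model to predict booking cancellations.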

I. Data import, quick preview, and data cleaning

1. Import libraries

The code is as follows:

# General-purpose libraries
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import plotly.express as px
import folium
import matplotlib
import matplotlib.font_manager  # import the submodule explicitly before using it
# Font used to render Chinese labels in plots (the .otf file must be available locally)
zhfont1 = matplotlib.font_manager.FontProperties(fname="SourceHanSansSC-Bold.otf")
# Machine learning
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier  # random forest
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
import eli5

2. Load the data

The code is as follows:

data = pd.read_csv(r'C:\xlwings\hotel_bookings.csv')  # load the data
print('Preview the data')
# print(data.head(3).T)    # preview the first three records (transposed)


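3. Take a quick look at the data

The code is as follows:

print('Inspect column types')  # data type of each column
# print(data.info())
print('Inspect missing values')  # columns with missing values
# print(data.isnull().sum()[data.isnull().sum().values != 0])  # children, country, agent and company contain missing values

print('Inspect data size')  # size of the data
# print(data.size)
# print(data.shape)
print('Inspect summary statistics and distribution')  # summary statistics and distribution
# print(data.describe().T)

The data span 2015 to 2017 and contain nearly 120,000 records (119,390 rows) with 32 columns, 12 of which are non-numeric.
Four columns contain missing values: children, country, agent, and company.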
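The missing-value check above can be tried on a small frame before running it on the full file. The following is a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the booking data (values are invented for illustration)
toy = pd.DataFrame({
    'adults': [2, 1, 2],
    'children': [0.0, np.nan, 1.0],
    'country': ['PRT', None, 'GBR'],
    'agent': [9.0, np.nan, np.nan],
})

missing = toy.isnull().sum()      # number of missing values per column
missing = missing[missing != 0]   # keep only the columns that have gaps
print(missing)
```

Computing `isnull().sum()` once and filtering the result avoids evaluating it twice, as the one-line version in the article does.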

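4. Data preprocessing

'Fill missing values'

'1. agent: set missing values to 0'
'2. company: set missing values to 0'
'3. children: set missing values to 0'
'4. country: set missing values to Unknown'

The code is as follows:

dict_to_full = {
   'agent': 0, 'company': 0, 'children': 0, 'country': 'Unknown'}
data_full = data.fillna(dict_to_full)  # fill the missing values column by column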
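Passing a dictionary to `fillna` applies a different fill value to each named column and leaves all other columns alone. A minimal sketch on invented data (the `toy` frame and the name `fill_values` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in the same four columns (values are invented)
toy = pd.DataFrame({
    'agent': [9.0, np.nan],
    'company': [np.nan, 40.0],
    'children': [np.nan, 1.0],
    'country': ['PRT', None],
})

# One fill value per column, keyed by column name
fill_values = {'agent': 0, 'company': 0, 'children': 0, 'country': 'Unknown'}
toy_full = toy.fillna(fill_values)

print(toy_full)
print(toy_full.isnull().sum().sum())  # total number of remaining gaps
```

Note that `fillna` returns a new DataFrame here, so the original `toy` is unchanged.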

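'Replace non-standard values'

# print(data_full['meal'].value_counts())
# 'Undefined' means the same as 'SC' (self-catering), so replace the former with the latter
data_full['meal'] = data_full['meal'].replace('Undefined', 'SC')  # assign back instead of using inplace on a column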
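To see the effect of merging the two labels, the replacement can be checked on a small Series first. This is a sketch on invented values; `meals` is a hypothetical stand-in for the `meal` column.

```python
import pandas as pd

# Hypothetical stand-in for the 'meal' column (values are invented)
meals = pd.Series(['BB', 'Undefined', 'SC', 'HB', 'Undefined'], name='meal')

# Map 'Undefined' onto 'SC'; every other label is left untouched
meals_clean = meals.replace('Undefined', 'SC')

print(meals_clean.value_counts())
```

Comparing `value_counts()` before and after is a quick way to confirm that the 'Undefined' rows were folded into 'SC' and that no other category changed.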

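'Remove redundant records based on business logic'
'Records with zero guests in total (adults, babies and children all equal to 0) carry no statistical meaning, so these rows are dropped'

index = (data_full['adults'] == 0) & (data_full['babies'] == 0) & (data_full['children'] == 0)
# print(data_full.shape[0])
data_full = data_full[~index]
# print(data_full.shape[0])  # row count after dropping the redundant records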
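The zero-guest filter can also be written over a list of columns, which is easier to extend. A minimal sketch on invented data, assuming the same three column names; `toy`, `guest_cols` and `no_guests` are names introduced here for illustration:

```python
import pandas as pd

# Toy bookings: the second row has no guests at all (values are invented)
toy = pd.DataFrame({
    'adults': [2, 0, 0],
    'children': [0, 0, 1],
    'babies': [0, 0, 0],
})

guest_cols = ['adults', 'children', 'babies']
# True only where every guest column equals 0
no_guests = toy[guest_cols].eq(0).all(axis=1)
toy_clean = toy[~no_guests]

print(len(toy), '->', len(toy_clean))
```

The mask is equivalent to chaining three `== 0` comparisons with `&`; the row with one child and no adults is kept because not all three counts are zero.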

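Once preprocessing is done, split the data by hotel type and export the cleaned data.

"""Split the data in two: completed (not canceled) bookings of the resort hotel and of the city hotel"""
resort_h = data_full.loc[(data_full['hotel'] == 'Resort Hotel') & (data_full['is_canceled'] == 0)]
city_h = data_full.loc[(data_full['hotel'] == 'City Hotel') & (data_full['is_canceled'] == 0)]

# Export the data (index=False keeps the row index from being written as an extra column)
data_full.to_csv(r'C:\xlwings\hotel_bookings_data_full.csv', index=False)  # full cleaned data
resort_h.to_csv(r'C:\xlwings\hotel_bookings_resort_h.csv', index=False)  # completed bookings, resort hotel
city_h.to_csv(r'C:\xlwings\hotel_bookings_city_h.csv', index=False)  # completed bookings, city hotel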
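The split-and-export step can be sketched end to end on a toy frame. To stay self-contained, this example writes to an in-memory buffer instead of a file on disk; the data and the names `toy`, `resort_toy`, `city_toy` and `city_back` are assumptions for illustration.

```python
import io
import pandas as pd

# Toy bookings (values are invented)
toy = pd.DataFrame({
    'hotel': ['Resort Hotel', 'City Hotel', 'City Hotel'],
    'is_canceled': [0, 1, 0],
})

# Keep only bookings that were not canceled, then split by hotel type
completed = toy[toy['is_canceled'] == 0]
resort_toy = completed[completed['hotel'] == 'Resort Hotel']
city_toy = completed[completed['hotel'] == 'City Hotel']

# Round trip through CSV; index=False means the row index is not written,
# so reading the file back yields the original columns only
buf = io.StringIO()
city_toy.to_csv(buf, index=False)
buf.seek(0)
city_back = pd.read_csv(buf)

print(city_back.columns.tolist())
```

Writing a CSV with the default `index=True` and reading it back with `pd.read_csv` produces an additional `Unnamed: 0` column, which is worth keeping in mind when the exported files are reloaded later.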

II. Booking data analysis

Load the data

resort_h = pd.read_csv(r'C:\xlwings\hotel_bookings_resort_h.csv')
city_h = pd.read_csv(r'C:\xlwings\hotel_bookings_city_h.csv')
data_full = pd.
