Contents
- Exploratory data analysis
- 2. Relationship between country of origin (country) and cancellations
- 2.1 Country codes and names of the main countries
- 2.2 Main analysis
- 2.2.1 Data processing
- Data processing 1: from the canceled bookings, select the top `len1` countries. `cancael_index_temp`: names of the main countries; `cancael_value_temp`: booking counts of the main countries
- Data processing 2: from the non-canceled bookings, select the top `len2` countries. `uncancael_index_temp`: names of the main countries; `uncancael_value_temp`: booking counts of the main countries
- 2.3 Plotting
- 2.4 Analysis
- 3. Relationship between lead time (lead_time) and cancellations
- 4. Hotel price per person and number of arriving guests
- 5. Cancellations and meal choice
- 6. Cancellations by hotel type
- 7. Predicting whether a customer will cancel
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.ticker as ticker
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle,Circle
from matplotlib.collections import PatchCollection
%matplotlib inline
plt.rcParams['font.sans-serif']=["SimHei"] # render Chinese labels correctly
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly
def figure_polaris(x,y):
    sns.set(style="darkgrid")
    fig,ax=plt.subplots(figsize=(x,y))
    ax.xaxis.grid(False)
    ax.yaxis.grid(True, which='major') # draw y-axis grid lines at the major ticks
    for item in ['top', 'right', 'left']:
        ax.spines[item].set_visible(False) # hide the frame
Load the data and inspect basic information
data_origin = pd.read_csv('hotel_bookings.csv')
data_origin.head(5).T
field | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
hotel | Resort Hotel | Resort Hotel | Resort Hotel | Resort Hotel | Resort Hotel |
is_canceled | 0 | 0 | 0 | 0 | 0 |
lead_time | 342 | 737 | 7 | 13 | 14 |
arrival_date_year | 2015 | 2015 | 2015 | 2015 | 2015 |
arrival_date_month | July | July | July | July | July |
arrival_date_week_number | 27 | 27 | 27 | 27 | 27 |
arrival_date_day_of_month | 1 | 1 | 1 | 1 | 1 |
stays_in_weekend_nights | 0 | 0 | 0 | 0 | 0 |
stays_in_week_nights | 0 | 0 | 1 | 1 | 2 |
adults | 2 | 2 | 1 | 1 | 2 |
children | 0 | 0 | 0 | 0 | 0 |
babies | 0 | 0 | 0 | 0 | 0 |
meal | BB | BB | BB | BB | BB |
country | PRT | PRT | GBR | GBR | GBR |
market_segment | Direct | Direct | Direct | Corporate | Online TA |
distribution_channel | Direct | Direct | Direct | Corporate | TA/TO |
is_repeated_guest | 0 | 0 | 0 | 0 | 0 |
previous_cancellations | 0 | 0 | 0 | 0 | 0 |
previous_bookings_not_canceled | 0 | 0 | 0 | 0 | 0 |
reserved_room_type | C | C | A | A | A |
assigned_room_type | C | C | C | A | A |
booking_changes | 3 | 4 | 0 | 0 | 0 |
deposit_type | No Deposit | No Deposit | No Deposit | No Deposit | No Deposit |
agent | NaN | NaN | NaN | 304 | 240 |
company | NaN | NaN | NaN | NaN | NaN |
days_in_waiting_list | 0 | 0 | 0 | 0 | 0 |
customer_type | Transient | Transient | Transient | Transient | Transient |
adr | 0 | 0 | 75 | 75 | 98 |
required_car_parking_spaces | 0 | 0 | 0 | 0 | 0 |
total_of_special_requests | 0 | 0 | 0 | 0 | 1 |
reservation_status | Check-Out | Check-Out | Check-Out | Check-Out | Check-Out |
reservation_status_date | 2015-07-01 | 2015-07-01 | 2015-07-02 | 2015-07-02 | 2015-07-03 |
print('Shape of dataset:',data_origin.shape)
print('Size of dataset: ',data_origin.size)
data_origin.info()
Shape of dataset: (119390, 32)
Size of dataset: 3820480
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel 119390 non-null object
is_canceled 119390 non-null int64
lead_time 119390 non-null int64
arrival_date_year 119390 non-null int64
arrival_date_month 119390 non-null object
arrival_date_week_number 119390 non-null int64
arrival_date_day_of_month 119390 non-null int64
stays_in_weekend_nights 119390 non-null int64
stays_in_week_nights 119390 non-null int64
adults 119390 non-null int64
children 119386 non-null float64
babies 119390 non-null int64
meal 119390 non-null object
country 118902 non-null object
market_segment 119390 non-null object
distribution_channel 119390 non-null object
is_repeated_guest 119390 non-null int64
previous_cancellations 119390 non-null int64
previous_bookings_not_canceled 119390 non-null int64
reserved_room_type 119390 non-null object
assigned_room_type 119390 non-null object
booking_changes 119390 non-null int64
deposit_type 119390 non-null object
agent 103050 non-null float64
company 6797 non-null float64
days_in_waiting_list 119390 non-null int64
customer_type 119390 non-null object
adr 119390 non-null float64
required_car_parking_spaces 119390 non-null int64
total_of_special_requests 119390 non-null int64
reservation_status 119390 non-null object
reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
data_origin.describe().T
field | count | mean | std | min | 25% | 50% | 75% | max |
---|---|---|---|---|---|---|---|---|
is_canceled | 119390.0 | 0.370416 | 0.482918 | 0.00 | 0.00 | 0.000 | 1.0 | 1.0 |
lead_time | 119390.0 | 104.011416 | 106.863097 | 0.00 | 18.00 | 69.000 | 160.0 | 737.0 |
arrival_date_year | 119390.0 | 2016.156554 | 0.707476 | 2015.00 | 2016.00 | 2016.000 | 2017.0 | 2017.0 |
arrival_date_week_number | 119390.0 | 27.165173 | 13.605138 | 1.00 | 16.00 | 28.000 | 38.0 | 53.0 |
arrival_date_day_of_month | 119390.0 | 15.798241 | 8.780829 | 1.00 | 8.00 | 16.000 | 23.0 | 31.0 |
stays_in_weekend_nights | 119390.0 | 0.927599 | 0.998613 | 0.00 | 0.00 | 1.000 | 2.0 | 19.0 |
stays_in_week_nights | 119390.0 | 2.500302 | 1.908286 | 0.00 | 1.00 | 2.000 | 3.0 | 50.0 |
adults | 119390.0 | 1.856403 | 0.579261 | 0.00 | 2.00 | 2.000 | 2.0 | 55.0 |
children | 119386.0 | 0.103890 | 0.398561 | 0.00 | 0.00 | 0.000 | 0.0 | 10.0 |
babies | 119390.0 | 0.007949 | 0.097436 | 0.00 | 0.00 | 0.000 | 0.0 | 10.0 |
is_repeated_guest | 119390.0 | 0.031912 | 0.175767 | 0.00 | 0.00 | 0.000 | 0.0 | 1.0 |
previous_cancellations | 119390.0 | 0.087118 | 0.844336 | 0.00 | 0.00 | 0.000 | 0.0 | 26.0 |
previous_bookings_not_canceled | 119390.0 | 0.137097 | 1.497437 | 0.00 | 0.00 | 0.000 | 0.0 | 72.0 |
booking_changes | 119390.0 | 0.221124 | 0.652306 | 0.00 | 0.00 | 0.000 | 0.0 | 21.0 |
agent | 103050.0 | 86.693382 | 110.774548 | 1.00 | 9.00 | 14.000 | 229.0 | 535.0 |
company | 6797.0 | 189.266735 | 131.655015 | 6.00 | 62.00 | 179.000 | 270.0 | 543.0 |
days_in_waiting_list | 119390.0 | 2.321149 | 17.594721 | 0.00 | 0.00 | 0.000 | 0.0 | 391.0 |
adr | 119390.0 | 101.831122 | 50.535790 | -6.38 | 69.29 | 94.575 | 126.0 | 5400.0 |
required_car_parking_spaces | 119390.0 | 0.062518 | 0.245291 | 0.00 | 0.00 | 0.000 | 0.0 | 8.0 |
total_of_special_requests | 119390.0 | 0.571363 | 0.792798 | 0.00 | 0.00 | 0.000 | 1.0 | 5.0 |
1. Data preprocessing
data_origin.isnull().sum() # count missing values per column
hotel 0
is_canceled 0
lead_time 0
arrival_date_year 0
arrival_date_month 0
arrival_date_week_number 0
arrival_date_day_of_month 0
stays_in_weekend_nights 0
stays_in_week_nights 0
adults 0
children 4
babies 0
meal 0
country 488
market_segment 0
distribution_channel 0
is_repeated_guest 0
previous_cancellations 0
previous_bookings_not_canceled 0
reserved_room_type 0
assigned_room_type 0
booking_changes 0
deposit_type 0
agent 16340
company 112593
days_in_waiting_list 0
customer_type 0
adr 0
required_car_parking_spaces 0
total_of_special_requests 0
reservation_status 0
reservation_status_date 0
dtype: int64
`agent` and `company` have many missing values (16340 and 112593 of 119390 rows, about 13.7% and 94.3%), so we drop these two columns.
data_origin = data_origin.drop(['agent','company'],axis = 1) ## drop the two columns with many missing values
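The choice of columns to drop above is made by eye. A minimal sketch of the same decision as a threshold rule follows; the helper name `drop_sparse_columns`, the 10% threshold and the toy values are assumptions for illustration, not part of the original notebook. On the real data, a 10% threshold would select exactly `agent` and `company`, since `country` and `children` are missing in well under 1% of rows.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, max_missing_ratio=0.1):
    """Drop columns whose share of missing values exceeds max_missing_ratio.

    Hypothetical helper; returns the reduced frame and the dropped column names.
    """
    missing_ratio = df.isnull().mean()  # fraction of NaN per column
    sparse_cols = missing_ratio[missing_ratio > max_missing_ratio].index.tolist()
    return df.drop(columns=sparse_cols), sparse_cols

# Toy frame standing in for hotel_bookings.csv (values are made up).
toy = pd.DataFrame({
    "lead_time": [342, 737, 7, 13, 14],
    "agent": [np.nan, np.nan, np.nan, 304, 240],
    "company": [np.nan] * 5,
})

cleaned, dropped = drop_sparse_columns(toy, max_missing_ratio=0.1)
print("dropped:", dropped)
print("remaining:", list(cleaned.columns))
```

A rule like this makes the cut-off explicit, so it can be revisited if a partly missing column such as `agent` turns out to be informative.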
Exploratory data analysis
Correlation plot
data= data_origin
# correlation matrix, restricted to numeric columns so the call also works on recent pandas versions
data_corr = data.select_dtypes(include="number").corr()
width, height = data_corr.shape
labels = data_corr.columns
patches,colors=[],[]
# one circle per cell: the radius grows with |r|, the color encodes the signed value
for x in range(width):
    for y in range(height):
        d = np.abs(data_corr.iloc[x, y])
        datum = data_corr.iloc[x, y]
        patch = Circle((x, y), radius=d/4+0.2)
        colors.append(datum)
        patches.append(patch)
fig,ax=plt.subplots(figsize=(13,10))
cmap = sns.diverging_palette(10, 220, as_cmap=True)
coll = PatchCollection(patches,array=np.array(colors),cmap=cmap)
ax.add_collection(coll)
# set the axis limits
ax.set_xlim(-0.5, width-.5)
ax.set_ylim(-0.5, height-.5)
# draw the separator lines between cells
for i in range(0, width):
    plt.axvline(i+.5, color="gray",linewidth=0.5)
    plt.axhline(i+.5, color="gray",linewidth=0.5)
# add the axis tick labels
for i in range(0, width):
    plt.text(i, -.6 ,str(i) ,fontsize=15,horizontalalignment="center")
    plt.text(-.6, i ,labels[i]+ '---'+str(i),fontsize=15,horizontalalignment="right")
# draw the outer frame
plt.axvline(-.5, ymin=0, ymax=height-.5, color="grey",lw=2)
plt.axvline(width-.5, ymin=0, ymax=height-.5, color="grey", lw=2)
plt.axhline(height-.5, xmin=0, xmax=width-.5, color="grey",lw=2)
plt.axhline(-.5, xmin=0, xmax=width-.5, color="grey",lw=2)
# add the color bar
cbar=plt.colorbar(coll)
ax.invert_yaxis()
plt.axis("off")
# plt.savefig('corr1.png', dpi=1000, transparent=False)
plt.show()
data_origin['country']=data_origin['country'].replace(np.nan,'unknown')
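2. Relationship between country of origin (country) and cancellations
2.1 Country codes and names of the main countries
2.2 Main analysis
`country` has missing values, so we handle them first. The plot below shows that `country` has a very large number of categories with a long-tailed distribution, so the missing values can simply be replaced with `unknown`. We then select the main part (about $90\%$ of bookings) and plot it.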
figure_polaris(15,7)
# sns.set(style="darkgrid")
sns.countplot(x=data_origin['country'])
<matplotlib.axes._subplots.AxesSubplot at 0x2cc23a15088>
data_new = data_origin
list1 = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
.count().sort_values(ascending=False))
list2 = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
.count().sort_values(ascending=False))
len1 = 10
len2 = 15
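print(
    "Share of canceled bookings covered by the selected countries: %.2f%% " % (sum(list1[:len1])*100/sum(list1)),
    "\nShare of non-canceled bookings covered by the selected countries: %.2f%% " % (sum(list2[:len2])*100/sum(list2)),
    "\nCanceled bookings from countries not selected:",sum(list1)-sum(list1[:len1]),
    "\nNon-canceled bookings from countries not selected:",sum(list2)-sum(list2[:len2]),
)
Share of canceled bookings covered by the selected countries: 88.80%
Share of non-canceled bookings covered by the selected countries: 89.66%
Canceled bookings from countries not selected: 4953
Non-canceled bookings from countries not selected: 7770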
2.2.1 Data processing
Data processing 1: from the canceled bookings, select the top `len1` countries:
`cancael_index_temp`: names of the main countries
`cancael_value_temp`: booking counts of the main countries
cancael_index_temp = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
.count().sort_values(ascending=False).index)[:len1]+["other country"]
cancael_value_temp = list(data_new[data_new["is_canceled"]==1].groupby(["country"])["country"]
.count().sort_values(ascending=False))[:len1] + [(sum(list1)-sum(list1[:len1]))]
Data processing 2: from the non-canceled bookings, select the top `len2` countries:
`uncancael_index_temp`: names of the main countries
`uncancael_value_temp`: booking counts of the main countries
uncancael_index_temp = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
.count().sort_values(ascending=False).index)[:len2]+["other country"]
uncancael_value_temp = list(data_new[data_new["is_canceled"]==0].groupby(["country"])["country"]
.count().sort_values(ascending=False))[:len2]+ [(sum(list2)-sum(list2[:len2]))]
## merge into single lists so they can be passed straight to the plotting call
country_iscanceled_index = (uncancael_index_temp + cancael_index_temp)
country_iscanceled_value = (uncancael_value_temp + cancael_value_temp)
## share of non-canceled and canceled bookings
temp = data_new.groupby(["is_canceled"])["is_canceled"].count().values
country_iscanceled_inner = list(temp / sum(temp))
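The two processing steps above repeat the same pattern: count bookings per country, keep the top N, and fold the rest into one "other country" bucket. A minimal sketch of that pattern as a reusable helper follows; the function name `top_n_with_other` and the toy data are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd

def top_n_with_other(values, n, other_label="other country"):
    """Return (labels, sizes) for the n most frequent values plus one 'other' bucket.

    Hypothetical helper mirroring the list-based steps above.
    """
    counts = values.value_counts()  # sorted in descending order by default
    top = counts.iloc[:n]
    labels = list(top.index) + [other_label]
    sizes = [int(v) for v in top] + [int(counts.iloc[n:].sum())]
    return labels, sizes

# Toy country column (made-up values) standing in for one cancellation group.
toy = pd.Series(["PRT"] * 5 + ["GBR"] * 3 + ["FRA"] * 2 + ["ESP", "DEU"])

labels, sizes = top_n_with_other(toy, n=2)
print(labels, sizes)
```

With the real data, calling the helper once on the canceled rows with `n=len1` and once on the non-canceled rows with `n=len2` should give the same lists as the code above, while computing each `groupby` only once.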
2.3 Plotting
import pyecharts.options as opts
from pyecharts.charts import Pie
from pyecharts.commons.utils import JsCode
# [data_Uncanceled_percent,data_Canceled_percent]
inner_x_data = ["Not canceled", "Canceled"]
inner_y_data = country_iscanceled_inner
inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]
outer_x_data = country_iscanceled_index
outer_y_data = country_iscanceled_value
outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]
(
    Pie(init_opts=opts.InitOpts())
    .add(
        series_name="Booking status:",
        data_pair=inner_data_pair,
        radius=[0, "30%"],
        center=["55%","50%"],
        label_opts=opts.LabelOpts(position="inner",formatter="{b} \n\n {d}%"),
    )
    .add(
        series_name="Country of origin:",
        data_pair=outer_data_pair,
        radius=["31%","50%"],
        center=["55%","50%"],
        label_opts=opts.LabelOpts(position="outer"),
    )
    .set_colors(['#44a0d6',"#fc7716","#74c476","#9e9ac8","#4c72b0","#ee854a","#6acc64",
                 "#d65f5f","#8c613c","#dc7ec0","#797979","#d5bb67","#82c6e2","#faceb6",
                 "#fae9b6","#e3fab6","#b6faf6","#d6b6fa"])
    .set_global_opts(
        tooltip_opts =opts.TooltipOpts(formatter=" {a} </br> {b} {d}%",axis_pointer_type = "cross",),
        legend_opts =opts.LegendOpts(type_='scroll',orient='vertical',pos_left="5%",pos_top= 'middle'),
        title_opts=opts.TitleOpts(title="Cancellation status & guests' country of origin",pos_left="center")
    )
    .render_notebook()
)
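2.4 Analysis
The chart shows that these Portuguese hotels mainly serve domestic guests (unsurprisingly), followed by guests from the United Kingdom. Among guests from Portugal (PRT), canceled bookings make up a large share, whereas guests from the United Kingdom cancel comparatively rarely.
3. Relationship between lead time (lead_time) and cancellations
3.1 Violin plot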
# `arrival_date_year` vs `lead_time` vs `is_canceled` exploration with violin plot
data = data_origin
figure_polaris(15,10)
# plt.figure(figsize=(15,10))
sns.violinplot(x='arrival_date_year', y ='lead_time', hue="is_canceled", data=data, palette="Set3", bw=.2,
               cut=2, linewidth=2, inner='box', split = True)
sns.despine(left=True)
plt.title('Arrival Year VS Lead Time vs Canceled Situation', weight='bold', fontsize=20)
plt.xlabel('Year', fontsize=16)
plt.ylabel('Lead Time', fontsize=16)
Text(0, 0.5, 'Lead Time')
Regression plot
# examine how lead time (days from booking to arrival) affects the cancellation rate
import seaborn as sns
# group data for lead_time:
lead_cancel_data = data.groupby("lead_time")["is_canceled"].describe()
# use only lead_times with at least 10 bookings for the graph:
lead_cancel_data_10 = lead_cancel_data.loc[lead_cancel_data["count"] >= 10]
# show figure:
plt.figure(figsize=(15, 7))
x,y = pd.Series(lead_cancel_data_10.index, name="x_var"), pd.Series(lead_cancel_data_10["mean"].values * 100, name="y_var")
sns.regplot(x=x, y=y)
plt.title("Effect of lead time on cancelation", fontsize=16)
plt.xlabel("Lead time", fontsize=16)
plt.ylabel("Cancelations [%]", fontsize=16)
plt.show()
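Analysis: violin plot
The plot shows the relationship between arrival year (arrival_date_year), lead time (lead_time) and cancellation status (is_canceled).
Lead time is the number of days between the date the booking was entered and the arrival date.
Among guests who did not cancel, the lead-time distribution is fairly stable; among guests who canceled, lead times tend to be longer.
Conclusion
Guests who book far in advance are more likely to cancel.
Analysis: regression plot
Bookings made only a few days before arrival are rarely canceled. The longer the lead time, the higher the cancellation rate, and bookings made about a year in advance are canceled most often, which matches common sense.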
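The regression plot works on one point per distinct lead time. A coarser view, sketched below, groups lead times into buckets and reports the cancellation rate per bucket. The bucket edges, the labels and the toy rows are assumptions chosen for illustration; they are not taken from the original notebook.

```python
import pandas as pd

# Toy bookings (made-up values); the real frame has the same two columns.
toy = pd.DataFrame({
    "lead_time":   [0, 3, 5, 40, 60, 90, 200, 250, 300, 400],
    "is_canceled": [0, 0, 0, 0,  1,  0,  1,   1,   0,   1],
})

# Assumed bucket edges in days; include_lowest keeps lead_time == 0 in the first bucket.
bins = [0, 7, 30, 90, 180, 365, float("inf")]
labels = ["0-7", "8-30", "31-90", "91-180", "181-365", ">365"]
toy["lead_bucket"] = pd.cut(toy["lead_time"], bins=bins, labels=labels, include_lowest=True)

# Mean of a 0/1 column is the cancellation rate; observed=True skips empty buckets.
rate = toy.groupby("lead_bucket", observed=True)["is_canceled"].mean() * 100
rate_by_bucket = rate.to_dict()
for bucket, pct in rate_by_bucket.items():
    print(f"{bucket:>8} days: {pct:5.1f}% canceled")
```

On the real data, a rising rate from the first bucket to the last would be the tabular counterpart of the upward slope in the regression plot.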
4. Hotel price per person and number of arriving guests
adr: Average Daily Rate, the average daily room rate of a booking, defined as the sum of all lodging transactions divided by the total number of staying nights.
data = data_origin
# inspect the distribution of adr
plt.subplots(figsize=(15,5))
plt.scatter(data['adr'].index,data['adr'].values)
<matplotlib.collections.PathCollection at 0x2cc2b19fc48>
The plot shows a single outlier in the data; print its value to inspect it.
# idxmax returns the row label of the largest adr; .loc then reads that row's adr
data.loc[data['adr'].idxmax(), 'adr']
5400.0
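This is a City Hotel booking whose adr is 5400, far above every other value. With a base of roughly 119,000 bookings, a single outlier barely moves the mean, but for the sake of rigor we remove it in later steps.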
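No cell shown in this section appears to perform that removal, so here is a minimal sketch of one way to do it. The cut-off `ADR_UPPER_LIMIT` is an assumed value chosen only to separate 5400 from the rest of the distribution, and the toy frame is made up.

```python
import pandas as pd

# Toy frame (made-up values) with one extreme adr, standing in for the real data.
toy = pd.DataFrame({
    "hotel": ["Resort Hotel", "Resort Hotel", "City Hotel", "City Hotel", "City Hotel"],
    "adr":   [0.0, 75.0, 98.0, 5400.0, 126.0],
})

ADR_UPPER_LIMIT = 1000  # assumed threshold, not taken from the original notebook

# Keep only rows below the threshold; .copy() avoids chained-assignment warnings later.
filtered = toy[toy["adr"] < ADR_UPPER_LIMIT].copy()

print("rows before:", len(toy), "rows after:", len(filtered))
print("max adr after filtering:", filtered["adr"].max())
```

Filtering by a threshold rather than by a hard-coded row index keeps the step valid if the data file is refreshed.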
## map month names to month numbers so the plots sort chronologically
import calendar
month_to_num = {name: num for num, name in enumerate(calendar.month_name) if name}
data['arrival_date_month']=data['arrival_date_month'].replace(month_to_num)
data_new0 = data
# .copy() so that adding columns below does not trigger SettingWithCopyWarning
data_new = data_new0[data_new0["is_canceled"] == 0].copy()
data_new["people"] = data_new['adults'] + data_new["children"]
data_new["people"] = data_new["people"].replace(0.0,1) ## treat bookings with 0 guests as 1 guest so the per-person price below does not become inf
data_new["adr_per"] = data_new['adr'] / (data_new["people"]*(data_new["stays_in_weekend_nights"]+data_new["stays_in_week_nights"]))
# data_new["adr_per"] = data_new["adr_per"].replace(float('inf'),0)
# select the column before aggregating so non-numeric columns are never averaged
data_aver_price_city_temp = data_new[data_new["hotel"]=="City Hotel"].groupby("arrival_date_month")['adr_per'].mean()
data_aver_price_resort_temp = data_new[data_new["hotel"]=="Resort Hotel"].groupby("arrival_date_month")['adr_per'].mean()
data_count_price_city_temp = data_new[data_new["hotel"]=="City Hotel"].groupby("arrival_date_month")['people'].count()
data_count_price_resort_temp = data_new[data_new["hotel"]=="Resort Hotel"].groupby("arrival_date_month")['people'].count()
## assemble the data for plotting
data_aver_city = pd.DataFrame({"arrival_date_month": list(data_aver_price_city_temp.index),
                               "hotel": "City Hotel",
                               "average_cost": list(data_aver_price_city_temp.values),
                               "count":data_count_price_city_temp })
data_aver_resort = pd.DataFrame({"arrival_date_month": list(data_aver_price_resort_temp.index),
                                 "hotel": "Resort Hotel",
                                 "average_cost": list(data_aver_price_resort_temp.values),
                                 "count":data_count_price_resort_temp })
data_average_hotel = pd.concat([data_aver_city, data_aver_resort], ignore_index=True)
# data_average_hotel
# sns.set_style("whitegrid")
sns.set(style="darkgrid")
## set up the figure; bar colors come from the default seaborn palette
fig, ax = plt.subplots(2,1,figsize=(15, 9))
sns.barplot(x="arrival_date_month", y="average_cost", hue="hotel", palette=sns.color_palette(), ax=ax[0],data=data_average_hotel)
ax[0].set_xlabel(' ', fontsize=15)
ax[0].set_ylabel('Cost', fontsize=15)
ax[0].set_title('Average Cost of Every Month', fontsize=18)
sns.barplot(x="arrival_date_month", y="count", hue="hotel", palette=sns.color_palette(),ax=ax[1],data=data_average_hotel)
# plt.xlabel("Month", fontsize=16)
ax[1].set_xlabel('Month', fontsize=15)
ax[1].set_ylabel('Count', fontsize=15)
ax[1].set_title('Average Order Count of Every Month', fontsize=18)
Text(0.5, 1.0, 'Average Order Count of Every Month')
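Analysis
A quick lookup shows that Portugal's peak tourist season runs from June to September each year; in Lisbon, the capital, the beaches are busiest in July and August. The Resort Hotel's guest numbers (bookings × guests per booking) are highest in July and August, which matches this. The "Average Cost of Every Month" chart also shows that July and August have the highest average price per person of the year, which suggests that many tourists come here on holiday in these two months.
5. Cancellations and meal choice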
# data['country']=data['country'].replace(np.nan,'PRT')
data["meal"] = data["meal"].replace(np.nan,"SC")
# data.drop(data_new.index[zero_guests], inplace=True)
meal_data = data[["hotel", "is_canceled", "meal"]]
# meal_data
plt.figure(figsize=(15, 10))
plt.subplot(1,2,1)
plt.pie(meal_data.loc[meal_data["is_canceled"]==0, "meal"].value_counts(),
labels=meal_data.loc[meal_data["is_canceled"]==0, "meal"].value_counts().index,
autopct="%.2f%%")
plt.title("Meal Choice of Uncanceled People", fontsize=16)
plt.legend(loc="upper right")
plt.subplot(1,2,2)
plt.pie(meal_data.loc[meal_data["is_canceled"]==1, "meal"].value_counts(),
labels=meal_data.loc[meal_data["is_canceled"]==1, "meal"].value_counts().index,
autopct="%.2f%%")
plt.title("Meal Choice of Canceled People", fontsize=16)
plt.legend(loc="upper right")
<matplotlib.legend.Legend at 0x2cc295b51c8>
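Analysis
The two charts are almost identical, which indicates that guests who cancel and guests who do not cancel make essentially the same meal choices.
6. Cancellations by hotel type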
## data processing
data_new = data_origin
## count non-canceled and canceled bookings per hotel type
uncancel_hotel_count = data_new[data_new["is_canceled"]==0].groupby(["hotel"])["hotel"].count()
cancel_hotel_count = data_new[data_new["is_canceled"]==1].groupby(["hotel"])["hotel"].count()
x1 = uncancel_hotel_count['City Hotel']
x2 = uncancel_hotel_count['Resort Hotel']
y1 = cancel_hotel_count['City Hotel']
y2 = cancel_hotel_count['Resort Hotel']
data_Uncanceled_percent = (x1+x2)/(x1+x2+y1+y2) *100
data_Canceled_percent = (y1+y2)/(x1+x2+y1+y2) *100
data_City_percent = (x1+y1)/(x1+x2+y1+y2) *100
data_Resort_percent = (x2+y2)/(x1+x2+y1+y2) *100
data_Uncanceled_City = x1/ (x1+x2+y1+y2)*100
data_Uncanceled_Resort = x2/ (x1+x2+y1+y2)*100
data_Canceled_City = y1/ (x1+x2+y1+y2)*100
data_Canceled_Resort = y2/ (x1+x2+y1+y2)*100
## plotting
import pyecharts.options as opts
from pyecharts.charts import Pie
from pyecharts.commons.utils import JsCode
inner_x_data = ["Not canceled", "Canceled"]
inner_y_data = [data_Uncanceled_percent, data_Canceled_percent]
inner_data_pair = [list(z) for z in zip(inner_x_data, inner_y_data)]
outer_x_data = ["City Hotel", "Resort Hotel", "City Hotel", "Resort Hotel"]
outer_y_data = [data_Uncanceled_City,data_Uncanceled_Resort, data_Canceled_City,data_Canceled_Resort]
outer_data_pair = [list(z) for z in zip(outer_x_data, outer_y_data)]
inner2_x_data = ["City Hotel", "Resort Hotel"]
inner2_y_data = [data_City_percent, data_Resort_percent]
inner2_data_pair = [list(z) for z in zip(inner2_x_data, inner2_y_data)]
outer2_x_data = [ "Not canceled", "Canceled", "Not canceled", "Canceled"]
outer2_y_data = [data_Uncanceled_City,data_Canceled_City, data_Uncanceled_Resort,data_Canceled_Resort]
outer2_data_pair = [list(z) for z in zip(outer2_x_data, outer2_y_data)]
(
Pie(init_opts=opts.InitOpts())
.add(
series_name="1-is_canceled:",
data_pair=inner_data_pair,
radius=[0, "35%"],
center=["30%", "50%"],
)
.add(
series_name="1-hotel:",
data_pair=outer_data_pair,
center=["30%", "50%"],
radius=["36%","65%"],
label_opts=opts.LabelOpts(position="inner"),
)
##----------------- pie2 -------------------------------##
.add(
series_name="2-hotel:",
data_pair=inner2_data_pair,
radius=[0, "35%"],
center=["70%", "50%"],
)
.add(
series_name="2-is_canceled:",
data_pair=outer2_data_pair,
center=["70%", "50%"],
radius=["36%","65%"],
label_opts=opts.LabelOpts(position="inner"),
)
.set_colors(['#6baed6',"#fd8d3c", "#74c476", "#9e9ac8"])
.set_global_opts(tooltip_opts=opts.TooltipOpts(formatter=" {a} </br> {b} {d}%",
# precision
axis_pointer_type = "cross",
), )
.set_series_opts(label_opts=opts.LabelOpts(position="inner",formatter="{b} \n\n {d}%"),)
.render_notebook()
)
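Analysis:
Overall, by cancellation status, about $63\%$ of bookings were not canceled (the mean of `is_canceled` in the `describe()` output is 0.3704, i.e. about $37\%$ canceled);
by hotel type, more than $66\%$ of bookings are for the City Hotel, and the Resort Hotel accounts for a smaller share.
Among canceled bookings, the City Hotel has about $3$ times as many as the Resort Hotel, i.e. roughly $3/4$ $(74.8\%)$ of all cancellations. This is largely because the City Hotel accounts for about $66\%$ of all bookings. Its $74.8\%$ share of cancellations is nevertheless higher than its $66\%$ share of bookings, which indicates that `the City Hotel has a somewhat higher cancellation rate`.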
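The comparison of shares above can be turned into per-hotel cancellation rates with a short calculation. The sketch below uses only the rounded percentages quoted in this article (37.04% canceled overall, about 66% City Hotel bookings, 74.8% of cancellations at the City Hotel), so the results are approximate; the variable names are chosen here for illustration.

```python
# Inputs taken from the rounded figures quoted in the text (approximate).
p_cancel = 0.3704             # overall share of canceled bookings (mean of is_canceled)
p_city = 0.66                 # share of all bookings made at the City Hotel
p_city_given_cancel = 0.748   # share of canceled bookings that belong to the City Hotel

# P(cancel | hotel) = P(hotel | cancel) * P(cancel) / P(hotel)
cancel_rate_city = p_city_given_cancel * p_cancel / p_city
cancel_rate_resort = (1 - p_city_given_cancel) * p_cancel / (1 - p_city)

print(f"City Hotel cancellation rate:   {cancel_rate_city:.1%}")
print(f"Resort Hotel cancellation rate: {cancel_rate_resort:.1%}")
```

Under these rounded inputs the City Hotel rate comes out at roughly 42% against roughly 27% for the Resort Hotel. The exact rates can be read directly from the data with `data_new.groupby("hotel")["is_canceled"].mean()`.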
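7. Predicting whether a customer will cancel
Step 1: compute the correlation of each feature with "is_canceled". Categorical variables cannot take part in this calculation, so only numeric columns are used.
data_new = data_origin
cancel_corr = data_new.select_dtypes(include="number").corr()["is_canceled"]
cancel_corr.abs().sort_values(ascending=False)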
is_canceled 1.000000
lead_time 0.293123
total_of_special_requests 0.234658
required_car_parking_spaces 0.195498
booking_changes 0.144381
previous_cancellations 0.110133
is_repeated_guest 0.084793
adults 0.060017
previous_bookings_not_canceled 0.057358
days_in_waiting_list 0.054186
adr 0.047557
babies 0.032491
stays_in_week_nights 0.024765
arrival_date_year 0.016660
arrival_date_month 0.011022
arrival_date_week_number 0.008148
arrival_date_day_of_month 0.006130
children 0.005048
stays_in_weekend_nights 0.001791
Name: is_canceled, dtype: float64
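Step 2: train models on the features
Build base models with a decision tree, a random forest, logistic regression and an XGBoost classifier, and see which one gives the best result.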
# for ML:
from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier # random forest
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import eli5 # Feature importance evaluation
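# manually choose the columns to include
# to keep the model general and prevent leakage, these columns are excluded:
# booking_changes, days_in_waiting_list, arrival_date_year, assigned_room_type, reservation_status, country
# including country would raise accuracy, but it could also make the model less general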
num_features = ["lead_time","total_of_special_requests","required_car_parking_spaces",
"previous_cancellations","is_repeated_guest","adults","previous_bookings_not_canceled",
"adr","babies","stays_in_weekend_nights","arrival_date_week_number","arrival_date_day_of_month",
"children","stays_in_week_nights"]
cat_features = ["hotel","arrival_date_month","meal","market_segment",
"distribution_channel","reserved_room_type","deposit_type","customer_type"]
# separate the features from the prediction target
features = num_features + cat_features
X = data_new.drop(["is_canceled"], axis=1)[features]
y = data_new["is_canceled"]
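# preprocess numerical features:
# for most numerical columns, except the dates, 0 is the most logical fill value
# (0 is SimpleImputer's default fill_value for numeric data); no dates are missing here
num_transformer = SimpleImputer(strategy="constant")
# preprocess categorical features: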
cat_transformer = Pipeline(steps=[("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
("onehot", OneHotEncoder(handle_unknown='ignore'))])
# bundle the preprocessing for numerical and categorical features:
preprocessor = ColumnTransformer(transformers=[("num", num_transformer, num_features),
("cat", cat_transformer, cat_features)])
# define the models to test:
base_models = [("DT_model", DecisionTreeClassifier(random_state=42)),
("RF_model", RandomForestClassifier(random_state=42,n_jobs=-1)),
("LR_model", LogisticRegression(random_state=42,n_jobs=-1,solver='liblinear')),
("XGB_model", XGBClassifier(random_state=42, n_jobs=-1))]
# split the data into 'kfolds' parts for cross-validation,
# using shuffle to ensure a random distribution of the data:
kfolds = 4 # 4 = 75% train, 25% validation
split = KFold(n_splits=kfolds, shuffle=True, random_state=42)
# preprocess, fit, predict and score each model:
for name, model in base_models:
    # bundle the preprocessing and the model into one pipeline:
    model_steps = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', model)])
    # cross-validation scores for this model:
    cv_results = cross_val_score(model_steps,
                                 X, y,
                                 cv=split,
                                 scoring="accuracy",
                                 n_jobs=-1)
    # output:
    min_score = round(min(cv_results), 4)
    max_score = round(max(cv_results), 4)
    mean_score = round(np.mean(cv_results), 4)
    std_dev = round(np.std(cv_results), 4)
    print(f"{name} cross validation accuracy score: {mean_score} +/- {std_dev} (std) min: {min_score}, max: {max_score}")
DT_model cross validation accuracy score: 0.8196 +/- 0.0024 (std) min: 0.8157, max: 0.822
RF_model cross validation accuracy score: 0.8521 +/- 0.0018 (std) min: 0.8494, max: 0.8542
LR_model cross validation accuracy score: 0.8085 +/- 0.0018 (std) min: 0.806, max: 0.8108
XGB_model cross validation accuracy score: 0.8403 +/- 0.0007 (std) min: 0.8394, max: 0.8413
Result: the random forest (RF) model reaches the highest mean accuracy (0.8521), slightly ahead of XGBoost (0.8403).
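Accuracy is easier to judge against a baseline. The sketch below compares the cross-validation scores printed above with a majority-class baseline, that is, a model that always predicts "not canceled". It uses only numbers already shown in this article (the mean of `is_canceled` from `describe()` and the four mean CV scores); the variable names are chosen here for illustration.

```python
# Share of canceled bookings, from the describe() output earlier in the article.
cancel_share = 0.370416

# A classifier that always predicts the majority class is right this often.
baseline_accuracy = max(cancel_share, 1 - cancel_share)

# Mean cross-validation accuracies as printed above.
cv_accuracy = {
    "DT_model": 0.8196,
    "RF_model": 0.8521,
    "LR_model": 0.8085,
    "XGB_model": 0.8403,
}

print(f"majority-class baseline: {baseline_accuracy:.4f}")
for name, acc in sorted(cv_accuracy.items(), key=lambda item: item[1], reverse=True):
    gain = (acc - baseline_accuracy) * 100
    print(f"{name}: {acc:.4f} ({gain:+.1f} points over baseline)")

best_model = max(cv_accuracy, key=cv_accuracy.get)
print("best model:", best_model)
```

All four models clear the baseline of about 0.63 by a wide margin. Because the classes are only moderately imbalanced, accuracy is a reasonable first metric here, though precision and recall for the canceled class would give a fuller picture.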
※ Parts of the code and ideas draw on the following articles:
CSDN: "kaggle: Hotel booking demand 酒店预订需求" by 牛牛liunian
Kaggle: house-booking by swapnilwagh061993
Kaggle: Hotel bookings ML project - kernel688ef04346 by somepro