EDA and Predictive Analysis of Hotel Booking Demand Datasets 2.0
数据背景:该数据集包含酒店预订相关信息,数据信息范围包括:酒店类型、订单是否取消、预订时间、入住时长、入住人数、用户国籍等。
数据分析目的:1、分析用户的特征分布;2、分析酒店业务经营情况 ;3、预测酒店订单是否会被取消,找出重要影响因素
数据来源链接: https://www.kaggle.com/jessemostipak/hotel-booking-demand
以下通过Python对酒店预订数据进行探索性数据分析(Exploratory Data Analysis)和预测分析(Predictive Analysis):
一、数据准备
import os
import zipfile
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
# 忽略警告
warnings.filterwarnings('ignore')
# 正常显示中文和负号
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# 可视化显示在页面
%matplotlib inline
# 设定绘图风格
plt.style.use('ggplot')
# 声明变量
dataset_path = './' # 数据集路径
zip_filename = 'archive_4.zip' # zip文件名
zip_filepath = os.path.join(dataset_path, zip_filename) # zip文件路径
# 解压数据集
with zipfile.ZipFile(zip_filepath) as zf:
dataset_filename = zf.namelist()[0] # 数据集文件名(在zip中)
dataset_filepath = os.path.join(dataset_path, dataset_filename) # 数据集文件路径
print ("解压zip...",)
zf.extractall(path = dataset_path)
print ("完成。")
解压zip...
完成。
# 导入数据集
df_data = pd.read_csv(dataset_filepath)
# 查看加载的数据基本信息
print ('数据集基本信息:')
df_data.info()
数据集基本信息:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 119390 non-null object
1 is_canceled 119390 non-null int64
2 lead_time 119390 non-null int64
3 arrival_date_year 119390 non-null int64
4 arrival_date_month 119390 non-null object
5 arrival_date_week_number 119390 non-null int64
6 arrival_date_day_of_month 119390 non-null int64
7 stays_in_weekend_nights 119390 non-null int64
8 stays_in_week_nights 119390 non-null int64
9 adults 119390 non-null int64
10 children 119386 non-null float64
11 babies 119390 non-null int64
12 meal 119390 non-null object
13 country 118902 non-null object
14 market_segment 119390 non-null object
15 distribution_channel 119390 non-null object
16 is_repeated_guest 119390 non-null int64
17 previous_cancellations 119390 non-null int64
18 previous_bookings_not_canceled 119390 non-null int64
19 reserved_room_type 119390 non-null object
20 assigned_room_type 119390 non-null object
21 booking_changes 119390 non-null int64
22 deposit_type 119390 non-null object
23 agent 103050 non-null float64
24 company 6797 non-null float64
25 days_in_waiting_list 119390 non-null int64
26 customer_type 119390 non-null object
27 adr 119390 non-null float64
28 required_car_parking_spaces 119390 non-null int64
29 total_of_special_requests 119390 non-null int64
30 reservation_status 119390 non-null object
31 reservation_status_date 119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
总共有32个列标签,119390条数据记录,存在缺失数据。
二、数据清洗
# 逐条预览每个字段的数据计数,例:
col_data=df_data['meal'].value_counts()
plt.barh(y=col_data.index,width=col_data.values,height=0.5)
<BarContainer object of 5 artists>
分析字段数据的含义并对32个列标签初步分类,拟定数据清洗方法,如下表:
(一)缺失数据处理
#复制源数据
df_copy=df_data.copy()
# 查看有缺失数据的字段缺失个数占比
print ('缺失个数占比:')
df_copy.isnull().sum()[df_copy.isnull().sum()!=0]/df_copy.shape[0]
缺失个数占比:
children 0.000034
country 0.004087
agent 0.136862
company 0.943069
dtype: float64
# 'children'列缺失值较少,直接删除缺失的行
df_copy.dropna(subset=['children'],inplace=True)
# 'country'列缺失值较少,直接删除缺失的行
df_copy.dropna(subset=['country'],inplace=True)
#‘agent'列缺失值占比近14%,使用0替换列中的缺失值
df_copy['agent'].fillna(value=0, inplace = True)
# 'company'列缺失值过多,占比约94%,直接删除列
df_copy.drop(['company'],axis =1,inplace=True)
# 查验是否还有缺失数据
print ('含缺失数据的列的个数:')
df_copy.isnull().sum()[df_copy.isnull().sum()!=0].count()
含缺失数据的列的个数:
0
(二)删除重复记录
数据集无主键,暂不处理重复记录
(三)异常值处理
# 剔除入住晚数为0的记录
df_copy.drop(df_copy[df_copy['stays_in_weekend_nights']+df_copy['stays_in_week_nights']==0].index,inplace=True)
# 剔除入住人数为0的记录
df_copy.drop(df_copy[df_copy['adults']+df_copy['children']+df_copy['babies']==0].index,inplace=True)
# 将'children'字段数据类型修改为整型
df_copy.children = df_copy.children.astype(int)
# 将'meal'字段中的Undefined 修改为 SC
df_copy.meal.replace("Undefined", "SC", inplace=True)
# 将'agent'字段数据类型修改为字符串型
df_copy.agent = df_copy.agent.astype(int)
df_copy['agent'] = df_copy['agent'].apply(str)
#查看'adr'字段数据分布
plt.boxplot(df_copy['adr'])
{'whiskers': [<matplotlib.lines.Line2D at 0x2359613ed60>,
<matplotlib.lines.Line2D at 0x23596154070>],
'caps': [<matplotlib.lines.Line2D at 0x23596154370>,
<matplotlib.lines.Line2D at 0x23596154580>],
'boxes': [<matplotlib.lines.Line2D at 0x2359613ea90>],
'medians': [<matplotlib.lines.Line2D at 0x23596154850>],
'fliers': [<matplotlib.lines.Line2D at 0x23596154b20>],
'means': []}
#删除'adr'字段离群点
df_copy.drop(df_copy[df_copy['adr']>5000].index,inplace=True)
# 重置索引
df_copy.reset_index(drop=True,inplace=True)
# 查验加载的数据基本信息
print ('数据集基本信息:')
df_copy.info()
数据集基本信息:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118087 entries, 0 to 118086
Data columns (total 31 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 hotel 118087 non-null object
1 is_canceled 118087 non-null int64
2 lead_time 118087 non-null int64
3 arrival_date_year 118087 non-null int64
4 arrival_date_month 118087 non-null object
5 arrival_date_week_number 118087 non-null int64
6 arrival_date_day_of_month 118087 non-null int64
7 stays_in_weekend_nights 118087 non-null int64
8 stays_in_week_nights 118087 non-null int64
9 adults 118087 non-null int64
10 children 118087 non-null int32
11 babies 118087 non-null int64
12 meal 118087 non-null object
13 country 118087 non-null object
14 market_segment 118087 non-null object
15 distribution_channel 118087 non-null object
16 is_repeated_guest 118087 non-null int64
17 previous_cancellations 118087 non-null int64
18 previous_bookings_not_canceled 118087 non-null int64
19 reserved_room_type 118087 non-null object
20 assigned_room_type 118087 non-null object
21 booking_changes 118087 non-null int64
22 deposit_type 118087 non-null object
23 agent 118087 non-null object
24 days_in_waiting_list 118087 non-null int64
25 customer_type 118087 non-null object
26 adr 118087 non-null float64
27 required_car_parking_spaces 118087 non-null int64
28 total_of_special_requests 118087 non-null int64
29 reservation_status 118087 non-null object
30 reservation_status_date 118087 non-null object
dtypes: float64(1), int32(1), int64(16), object(13)
memory usage: 27.5+ MB
三、数据分析及可视化
(一)用户的特征分布
从用户属性、用户行为分析用户的特征分布,判断其对取消订单的相关性:
1、用户的地域分布
#引入世界地图信息表
import geopandas
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
#按'country'字段分组对酒店的订单量计数
df_country=df_copy['country'].value_counts().reset_index().rename (columns= {
'index': 'code',