Kaggle项目:酒店预订需求数据的探索与预测

EDA and Predictive Analysis of Hotel Booking Demand Datasets 2.0

数据背景:该数据集包含酒店预订相关信息,数据信息范围包括:酒店类型、订单是否取消、预订时间、入住时长、入住人数、用户国籍等。

数据分析目的:1、分析用户的特征分布;2、分析酒店业务经营情况 ;3、预测酒店订单是否会被取消,找出重要影响因素

数据来源链接: https://www.kaggle.com/jessemostipak/hotel-booking-demand

以下通过Python对酒店预订数据进行探索性数据分析(Exploratory Data Analysis)和预测分析(Predictive Analysis):

一、数据准备

import os
import zipfile
import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns 
import warnings
# 忽略警告
warnings.filterwarnings('ignore')
# 正常显示中文和负号
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
# 可视化显示在页面 
%matplotlib inline
# 设定绘图风格
plt.style.use('ggplot')
# 声明变量
dataset_path = './'    # 数据集路径
zip_filename = 'archive_4.zip'     # zip文件名
zip_filepath = os.path.join(dataset_path, zip_filename)    # zip文件路径

# 解压数据集
with zipfile.ZipFile(zip_filepath) as zf:
    dataset_filename = zf.namelist()[0]      # 数据集文件名(在zip中)
    dataset_filepath = os.path.join(dataset_path, dataset_filename)  # 数据集文件路径
    print ("解压zip...",)
    zf.extractall(path = dataset_path)
    print ("完成。")
解压zip...
完成。
# 导入数据集
df_data = pd.read_csv(dataset_filepath)
# 查看加载的数据基本信息
print ('数据集基本信息:')
df_data.info()
数据集基本信息:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           119390 non-null  object 
 1   is_canceled                     119390 non-null  int64  
 2   lead_time                       119390 non-null  int64  
 3   arrival_date_year               119390 non-null  int64  
 4   arrival_date_month              119390 non-null  object 
 5   arrival_date_week_number        119390 non-null  int64  
 6   arrival_date_day_of_month       119390 non-null  int64  
 7   stays_in_weekend_nights         119390 non-null  int64  
 8   stays_in_week_nights            119390 non-null  int64  
 9   adults                          119390 non-null  int64  
 10  children                        119386 non-null  float64
 11  babies                          119390 non-null  int64  
 12  meal                            119390 non-null  object 
 13  country                         118902 non-null  object 
 14  market_segment                  119390 non-null  object 
 15  distribution_channel            119390 non-null  object 
 16  is_repeated_guest               119390 non-null  int64  
 17  previous_cancellations          119390 non-null  int64  
 18  previous_bookings_not_canceled  119390 non-null  int64  
 19  reserved_room_type              119390 non-null  object 
 20  assigned_room_type              119390 non-null  object 
 21  booking_changes                 119390 non-null  int64  
 22  deposit_type                    119390 non-null  object 
 23  agent                           103050 non-null  float64
 24  company                         6797 non-null    float64
 25  days_in_waiting_list            119390 non-null  int64  
 26  customer_type                   119390 non-null  object 
 27  adr                             119390 non-null  float64
 28  required_car_parking_spaces     119390 non-null  int64  
 29  total_of_special_requests       119390 non-null  int64  
 30  reservation_status              119390 non-null  object 
 31  reservation_status_date         119390 non-null  object 
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB

总共有32个列标签,119390条数据记录,存在缺失数据。

二、数据清洗

# 逐条预览每个字段的数据计数,例:
col_data=df_data['meal'].value_counts()
plt.barh(y=col_data.index,width=col_data.values,height=0.5)
<BarContainer object of 5 artists>

在这里插入图片描述

分析字段数据的含义并对32个列标签初步分类,拟定数据清洗方法,如下表:
在这里插入图片描述

(一)缺失数据处理

#复制源数据
df_copy=df_data.copy()
# 查看有缺失数据的字段缺失个数占比
print ('缺失个数占比:')
df_copy.isnull().sum()[df_copy.isnull().sum()!=0]/df_copy.shape[0]
缺失个数占比:
children    0.000034
country     0.004087
agent       0.136862
company     0.943069
dtype: float64
# 'children'列缺失值较少,直接删除缺失的行
df_copy.dropna(subset=['children'],inplace=True)
# 'country'列缺失值较少,直接删除缺失的行
df_copy.dropna(subset=['country'],inplace=True)
#‘agent'列缺失值占比近14%,使用0替换列中的缺失值
df_copy['agent'].fillna(value=0, inplace = True)
# 'company'列缺失值过多,占比约94%,直接删除列
df_copy.drop(['company'],axis =1,inplace=True)
# 查验是否还有缺失数据
print ('含缺失数据的列的个数:')
df_copy.isnull().sum()[df_copy.isnull().sum()!=0].count()
含缺失数据的列的个数:
0

(二)删除重复记录

数据集无主键,暂不处理重复记录

(三)异常值处理

# 剔除入住晚数为0的记录
df_copy.drop(df_copy[df_copy['stays_in_weekend_nights']+df_copy['stays_in_week_nights']==0].index,inplace=True)
# 剔除入住人数为0的记录
df_copy.drop(df_copy[df_copy['adults']+df_copy['children']+df_copy['babies']==0].index,inplace=True)
# 将'children'字段数据类型修改为整型
df_copy.children = df_copy.children.astype(int)
# 将'meal'字段中的Undefined 修改为 SC 
df_copy.meal.replace("Undefined", "SC", inplace=True)
# 将'agent'字段数据类型修改为字符串型
df_copy.agent = df_copy.agent.astype(int)
df_copy['agent'] = df_copy['agent'].apply(str)
#查看'adr'字段数据分布
plt.boxplot(df_copy['adr'])
{'whiskers': [<matplotlib.lines.Line2D at 0x2359613ed60>,
  <matplotlib.lines.Line2D at 0x23596154070>],
 'caps': [<matplotlib.lines.Line2D at 0x23596154370>,
  <matplotlib.lines.Line2D at 0x23596154580>],
 'boxes': [<matplotlib.lines.Line2D at 0x2359613ea90>],
 'medians': [<matplotlib.lines.Line2D at 0x23596154850>],
 'fliers': [<matplotlib.lines.Line2D at 0x23596154b20>],
 'means': []}

在这里插入图片描述

#删除'adr'字段离群点
df_copy.drop(df_copy[df_copy['adr']>5000].index,inplace=True)
# 重置索引
df_copy.reset_index(drop=True,inplace=True)
# 查验加载的数据基本信息
print ('数据集基本信息:')
df_copy.info()
数据集基本信息:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 118087 entries, 0 to 118086
Data columns (total 31 columns):
 #   Column                          Non-Null Count   Dtype  
---  ------                          --------------   -----  
 0   hotel                           118087 non-null  object 
 1   is_canceled                     118087 non-null  int64  
 2   lead_time                       118087 non-null  int64  
 3   arrival_date_year               118087 non-null  int64  
 4   arrival_date_month              118087 non-null  object 
 5   arrival_date_week_number        118087 non-null  int64  
 6   arrival_date_day_of_month       118087 non-null  int64  
 7   stays_in_weekend_nights         118087 non-null  int64  
 8   stays_in_week_nights            118087 non-null  int64  
 9   adults                          118087 non-null  int64  
 10  children                        118087 non-null  int32  
 11  babies                          118087 non-null  int64  
 12  meal                            118087 non-null  object 
 13  country                         118087 non-null  object 
 14  market_segment                  118087 non-null  object 
 15  distribution_channel            118087 non-null  object 
 16  is_repeated_guest               118087 non-null  int64  
 17  previous_cancellations          118087 non-null  int64  
 18  previous_bookings_not_canceled  118087 non-null  int64  
 19  reserved_room_type              118087 non-null  object 
 20  assigned_room_type              118087 non-null  object 
 21  booking_changes                 118087 non-null  int64  
 22  deposit_type                    118087 non-null  object 
 23  agent                           118087 non-null  object 
 24  days_in_waiting_list            118087 non-null  int64  
 25  customer_type                   118087 non-null  object 
 26  adr                             118087 non-null  float64
 27  required_car_parking_spaces     118087 non-null  int64  
 28  total_of_special_requests       118087 non-null  int64  
 29  reservation_status              118087 non-null  object 
 30  reservation_status_date         118087 non-null  object 
dtypes: float64(1), int32(1), int64(16), object(13)
memory usage: 27.5+ MB

三、数据分析及可视化

(一)用户的特征分布

从用户属性、用户行为分析用户的特征分布,判断其对取消订单的相关性:

1、用户的地域分布
#引入世界地图信息表
import geopandas
world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
#按'country'字段分组对酒店的订单量计数
df_country=df_copy['country'].value_counts().reset_index().rename (columns= {
   'index': 'code',
1. 功能需求分析 用户管理:添加用户,修改用户密码。 客户管理:添加客户,查询,修改,删除客户信息。 客房管理:添加客房,查询,修改,删除客房信息。 客房类型管理:添加客房类型,修改客房类型。 订房:预订客房,取消预订房间。 客房登记信息管理:查看客房登记信息。 2. 概念设计 用户实体ER图 客户信息实体ER图 客房信息实体ER图 客房类型ER图 登记记录ER图 总ER图 3. 逻辑结构设计 1. 客人信息表:tbclient "字段名 "数据类型 "空/非空 "约束条件 "其他说明 " "clientId "int "not null "IDENTITY(1"客户ID " " " " ",1) " " " " " "PRIMARY " " " " " "KEY " " "name "varchar(20)"not null " "客户姓名 " "sex "varchar(2) "not null " "性别 " "identityCar"varchar(30)"not null " "证件号 " "d " " " " " "phone "varchar(20)"not null " "联系电话 " 2. 登录信息表:tbemployee "字段名 "数据类型 "空/非空 "约束条件 "其他说明 " "employeeId "int "not null "IDENTITY(1"用户编号 " " " " ",1) " " " " " "PRIMARY " " " " " "KEY " " "userName "varchar(20)"not null " "用户名 " "password "varchar(20)"not null " "密码 " "per "int "not null " "权限 " 3. 房间类型表:tbtype "字段名 "数据类型 "空/非空 "约束条件 "其他说明 " "typeId "int "not null "IDENTITY(1"类型编号 " " " " ",1) " " " " " "PRIMARY " " " " " "KEY " " "typeName "varchar(20)"not null " "类型名 " "price "int "not null " "价格 " 4. 房间信息表:tbroom "字段名 "数据类型 "空/非空 "约束条件 "其他说明 " "roomId "int "not null "IDENTITY(1"房间ID " " " " ",1) " " " " " "PRIMARY " " " " " "KEY " " "roomNum "int "not null " "房间号 " "typeId "int "not null "foreign "房间类型I" " " " "key "D " " " " "REFERENCES" " " " " "tbtype(typ" " " " " "eId) " " "status "varhar(10) "not null " "房间状态 " 5. 客户住房登记信息表:tbcheckin "字段名 "数据类型 "空/非空 "约束条件 "其他说明" "checkId "int "not null"IDENTITY(1,1) "登记ID " " " " "PRIMARY KEY " " "roomNum "int "not null"foreign key "房间号 " " " " "REFERENCES " " " " " "tbroom(roomNum" " " " " ") " " "clientId "int "not null"foreign key "客户ID " " " " "REFERENCES " " " " " "tbclient(clent" " " " " "Id) " " "startDate "date "not null" "预订入住" " " " " "日期 " "lastDate "date "not null" "退房日期" "spe "varchar(50" " "描述 " " ") " " " " ----------------------- 酒店订房系统数据库设计全文共4页,当前为第1页。 酒店订房系统数据库设计全文共4页,当前为第2页。 酒店订房系统数据库设计全文共4页,当前为第3页。 酒店订房系统数据库设计全文共4页,当前为第4页。
评论 5
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值