Kaggle项目 - Hotel Booking Demand

本文通过对Kaggle上的Hotel Booking Demand数据集进行探索性数据分析(EDA),涉及数据清洗、异常值处理和关键指标分析。研究了不同酒店的取消情况,发现城市酒店的取消比例高于度假酒店,并揭示了新客与老客的取消行为差异。进一步分析显示,团队预订和线上旅行团的取消比率较高,不同月份的取消率和订单量存在显著季节性,而城市酒店在夏季价格波动明显,度假酒店则相对稳定。
摘要由CSDN通过智能技术生成
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.rc("font",family="SimHei",size="12")  #用于解决中文显示不了的问题
sns.set_style("whitegrid") 

Exploratory data analysis (EDA) 探索性数据分析

一、Data Clean 数据清洗

1.1对空值、NA进行处理

data = pd.read_csv('hotel_booking_demand.csv')
data.head()
hotel is_canceled lead_time arrival_date_year arrival_date_month arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults ... deposit_type agent company days_in_waiting_list customer_type adr required_car_parking_spaces total_of_special_requests reservation_status reservation_status_date
0 Resort Hotel 0 342 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
1 Resort Hotel 0 737 2015 July 27 1 0 0 2 ... No Deposit NaN NaN 0 Transient 0.0 0 0 Check-Out 2015-07-01
2 Resort Hotel 0 7 2015 July 27 1 0 1 1 ... No Deposit NaN NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
3 Resort Hotel 0 13 2015 July 27 1 0 1 1 ... No Deposit 304.0 NaN 0 Transient 75.0 0 0 Check-Out 2015-07-02
4 Resort Hotel 0 14 2015 July 27 1 0 2 2 ... No Deposit 240.0 NaN 0 Transient 98.0 0 1 Check-Out 2015-07-03

5 rows × 32 columns

data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119390 entries, 0 to 119389
Data columns (total 32 columns):
hotel                             119390 non-null object
is_canceled                       119390 non-null int64
lead_time                         119390 non-null int64
arrival_date_year                 119390 non-null int64
arrival_date_month                119390 non-null object
arrival_date_week_number          119390 non-null int64
arrival_date_day_of_month         119390 non-null int64
stays_in_weekend_nights           119390 non-null int64
stays_in_week_nights              119390 non-null int64
adults                            119390 non-null int64
children                          119386 non-null float64
babies                            119390 non-null int64
meal                              119390 non-null object
country                           118902 non-null object
market_segment                    119390 non-null object
distribution_channel              119390 non-null object
is_repeated_guest                 119390 non-null int64
previous_cancellations            119390 non-null int64
previous_bookings_not_canceled    119390 non-null int64
reserved_room_type                119390 non-null object
assigned_room_type                119390 non-null object
booking_changes                   119390 non-null int64
deposit_type                      119390 non-null object
agent                             103050 non-null float64
company                           6797 non-null float64
days_in_waiting_list              119390 non-null int64
customer_type                     119390 non-null object
adr                               119390 non-null float64
required_car_parking_spaces       119390 non-null int64
total_of_special_requests         119390 non-null int64
reservation_status                119390 non-null object
reservation_status_date           119390 non-null object
dtypes: float64(4), int64(16), object(12)
memory usage: 29.1+ MB
data.isnull().sum()  #空值计数
hotel                                  0
is_canceled                            0
lead_time                              0
arrival_date_year                      0
arrival_date_month                     0
arrival_date_week_number               0
arrival_date_day_of_month              0
stays_in_weekend_nights                0
stays_in_week_nights                   0
adults                                 0
children                               4
babies                                 0
meal                                   0
country                              488
market_segment                         0
distribution_channel                   0
is_repeated_guest                      0
previous_cancellations                 0
previous_bookings_not_canceled         0
reserved_room_type                     0
assigned_room_type                     0
booking_changes                        0
deposit_type                           0
agent                              16340
company                           112593
days_in_waiting_list                   0
customer_type                          0
adr                                    0
required_car_parking_spaces            0
total_of_special_requests              0
reservation_status                     0
reservation_status_date                0
dtype: int64
  • 缺失值处理方法:可以删除,可以零填充,均值填充,众数填充等
  • 从上述结果可以看到children、country、agent、company都有缺失值,company甚至达到112593,总行才119390
  • 对四种缺失值进行判断处理:
    • comapny缺失太多,删除
    • children很少,用众数填充
    • country国家众数填充
    • agent旅游机构考虑0填充,代表不属于任何机构
df 
  • 0
    点赞
  • 26
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值