了解样本基本信息
import pandas as pd
path = 'C:/Users/Administrator/Desktop/酒店需求案例分析/hotel_bookings.csv'
data = pd.read_csv(path)
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 500)
print(data.describe())
is_canceled lead_time arrival_date_year arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights stays_in_week_nights adults children babies is_repeated_guest previous_cancellations previous_bookings_not_canceled booking_changes agent company days_in_waiting_list adr required_car_parking_spaces total_of_special_requests
count 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 119386.000000 119390.000000 119390.000000 119390.000000 119390.000000 119390.000000 103050.000000 6797.000000 119390.000000 119390.000000 119390.000000 119390.000000
mean 0.370416 104.011416 2016.156554 27.165173 15.798241 0.927599 2.500302 1.856403 0.103890 0.007949 0.031912 0.087118 0.137097 0.221124 86.693382 189.266735 2.321149 101.831122 0.062518 0.571363
std 0.482918 106.863097 0.707476 13.605138 8.780829 0.998613 1.908286 0.579261 0.398561 0.097436 0.175767 0.844336 1.497437 0.652306 110.774548 131.655015 17.594721 50.535790 0.245291 0.792798
min 0.000000 0.000000 2015.000000 1.000000 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 1.000000 6.000000 0.000000 -6.380000 0.000000 0.000000
25% 0.000000 18.000000 2016.000000 16.000000 8.000000 0.000000 1.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 9.000000 62.000000 0.000000 69.290000 0.000000 0.000000
50% 0.000000 69.000000 2016.000000 28.000000 16.000000 1.000000 2.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 14.000000 179.000000 0.000000 94.575000 0.000000 0.000000
75% 1.000000 160.000000 2017.000000 38.000000 23.000000 2.000000 3.000000 2.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 229.000000 270.000000 0.000000 126.000000 0.000000 1.000000
max 1.000000 737.000000 2017.000000 53.000000 31.000000 19.000000 50.000000 55.000000 10.000000 10.000000 1.000000 26.000000 72.000000 21.000000 535.000000 543.000000 391.000000 5400.000000 8.000000 5.000000
异常值处理
统计出现缺失值的各列所包含的缺失值数目
import pandas as pd
path = 'C:/Users/Administrator/Desktop/酒店需求案例分析/hotel_bookings.csv'
data = pd.read_csv(path)
missing = data.isnull().sum()[data.isnull().sum()>0]
print(missing)
children 4
country 488
agent 16340
company 112593
dtype: int64
填充缺失值
# children和country列由于缺失数目较小,采用对应列众数进行填充
data['children'].fillna(data['children'].mode()[0],inplace=True)
data['country'].fillna(data['country'].mode()[0],inplace=True)
# agent列将缺失值单独作为新类别,标记为0
data['agent'].fillna(0,inplace=True)
# company缺失值过多 直接删除此列
data.drop('company',axis=1,inplace=True)
脏数据处理
#将meal中undefined和SC都表示未预定,合并为一类
data['meal'].replace('Undefined','SC',inplace=True)
#删除入住人数为0以及入住天数为0,删除细分市场不确定的,删除客房日均价小于0的列
zero_guest = data[data[['adults', 'children', 'babies']].sum(axis = 1)==0]
data.drop(zero_guest.index,inplace=True)
zero_day = data[data[['stays_in_weekend_nights','stays_in_week_nights']].sum(axis = 1)==0]
data.drop(zero_day.index,inplace=True)
uadr = data[data['adr']<0]
data.drop(uadr.index,inplace=True)
umarket_segment = data[data['market_segment']=='Undefined']
data.drop(umarket_segment.index,inplace=True)
酒店信息分析
城市酒店与度假酒店对比分析
不同年份两家酒店总营业额比较
data['total_adr'] = (data['stays_in_weekend_nights']+data['stays_in_week_nights'])*data['adr']
pd.pivot_table(data,values='total_adr',index=['arrival_date_year'],columns=['hotel'],aggfunc='sum').plot.bar()
plt.show()
2015年,城市酒店与度假酒店年总营业额持平。其余年份,城市酒店营业额都遥遥领先于度假酒店。
营业额精确到月份
data['total_adr'] = (data['stays_in_weekend_nights']+data['stays_in_week_nights'])*data['adr']
dict = {
'July':7,'August':8,'September':9,'October':10,'November':11,'December':12,'January':1,'February':2,'March':3,'April':4,'May':5,'June':6,}
data['arrival_date_month'] = data['arrival_date_month'].map(dict)
pd.pivot_table(data,values='total_adr',index=['arrival_date_month'],columns=['hotel'],aggfunc='sum').plot.bar()
plt.show()
从图可以发现,7,8月份度假酒店月营业额高于城市酒店,其余月份的营业额均是城市酒店高于度假酒店,且两家酒店月营业额走势一致。
两家酒店月均收费比较
pd.pivot_table(data,values='total_adr',index=['arrival_date_month'],columns=['hotel'],aggfunc='mean').plot()
plt.show()
可以看出,度假酒店的月均收费变化较大,七八月份的月均收费较高,这可能是其此时月营业额高于城市酒店的原因。其余月份,城市酒店的日均收费高于度假酒店。
酒店类型与预定率取消率
a = data[data['is_canceled'] == 0].groupby(['hotel']).is_canceled.count()
b = data[data[