Airbin_0720

项目背景

数据是从kaggle下载的,Airbnb是一个让大众出租住宿民宿的网站,提供短期出租房屋或房间的服务,并且以其独特性的居住体验发展迅速,这里我们拿到的数据是2019年纽约的民宿数据。

  • 这里需要注意一点,有的压缩包解压之后直接使用csv文件会有问题,正确的方式是,1)右击csv文件,打开方式选择txt,然后保存,Encloding选ANSI保存,2)再使用Excel打开csv文件,然后另存为,保存类型选择CSV UTF-8(逗号分隔),这样文件可以解决基本的中文乱码/列错位等等问题;
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
df = pd.read_csv('AB_NYC_2019v.csv',encoding='utf_8_sig')
df.head()
idnamehost_idhost_nameneighbourhood_groupneighbourhoodlatitudelongituderoom_typepriceminimum_nightsnumber_of_reviewslast_reviewreviews_per_monthcalculated_host_listings_countavailability_365
034177117Sonder | The Biltmore | 1BR219517861Sonder (NYC)ManhattanTheater District40.76118-73.98635Entire home/apt699290NaNNaN327365
134202946Sonder | 21 Chelsea | Stunning 1BR + Rooftop219517861Sonder (NYC)ManhattanChelsea40.74285-73.99595Entire home/apt277290NaNNaN327365
235005862Sonder | 21 Chelsea | Classic 1BR + Rooftop219517861Sonder (NYC)ManhattanChelsea40.74288-73.99438Entire home/apt275290NaNNaN327365
333528090Sonder | 21 Chelsea | Vibrant 1BR + Rooftop219517861Sonder (NYC)ManhattanChelsea40.74250-73.99443Entire home/apt260290NaNNaN327365
434570839Sonder | 21 Chelsea | Quaint 1BR + Rooftop219517861Sonder (NYC)ManhattanChelsea40.74305-73.99415Entire home/apt254290NaNNaN327365
df.isnull().sum()
id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64
df.describe()
idhost_idlatitudelongitudepriceminimum_nightsnumber_of_reviewsreviews_per_monthcalculated_host_listings_countavailability_365
count4.889500e+044.889500e+0448895.00000048895.00000048895.00000048895.00000048895.00000038843.00000048895.00000048895.000000
mean1.901714e+076.762001e+0740.728949-73.952170152.7206877.02996223.2744661.3732217.143982112.781327
std1.098311e+077.861097e+070.0545300.046157240.15417020.51055044.5505821.68044232.952519131.622289
min2.539000e+032.438000e+0340.499790-74.2444200.0000001.0000000.0000000.0100001.0000000.000000
25%9.471945e+067.822033e+0640.690100-73.98307069.0000001.0000001.0000000.1900001.0000000.000000
50%1.967728e+073.079382e+0740.723070-73.955680106.0000003.0000005.0000000.7200001.00000045.000000
75%2.915218e+071.074344e+0840.763115-73.936275175.0000005.00000024.0000002.0200002.000000227.000000
max3.648724e+072.743213e+0840.913060-73.71299010000.0000001250.000000629.00000058.500000327.000000365.000000

描述性统计分析:

  • 一共有约48000+条数据
  • 价格[price]的均值为152美元,中位数106美元,数据向右/高偏移;存在极高/低异常值;
  • 最少预订天/夜晚[minimum_nights]存在极高异常值;(如果有特殊情况,结合实际处理,这里以常规情况为例)
  • 每位房东的房源数量[calculated_host_listings_count]均值是7,中位数是1,说明大部分房东都只有1个房源,存在极高异常值;
  • 最近一次浏览日期[last_review]和每月浏览数量均值[reviews_per_month]都有10000+个缺失值,0填充;

1.数据处理

df['last_review'].fillna(0,inplace=True)
df['reviews_per_month'].fillna(0,inplace=True)
df['unit_price'] = (df.price / df.minimum_nights)  #增加一列字段 均价
df.unit_price.plot()  #查看单价的大概分布
<matplotlib.axes._subplots.AxesSubplot at 0x19f1beffa20>

在这里插入图片描述

df.unit_price.describe()
count    48895.000000
mean        70.174247
std        157.620388
min          0.000000
25%         20.000000
50%         44.500000
75%         81.500000
max       8000.000000
Name: unit_price, dtype: float64
df[df['unit_price'] > 500].count()
id                                316
name                              316
host_id                           316
host_name                         316
neighbourhood_group               316
neighbourhood                     316
latitude                          316
longitude                         316
room_type                         316
price                             316
minimum_nights                    316
number_of_reviews                 316
last_review                       316
reviews_per_month                 316
calculated_host_listings_count    316
availability_365                  316
unit_price                        316
dtype: int64
  • 有316条房间单价unit_price大于500,占总样本数的0.64%,根据统计学,这部分数据我们做删除处理;
df = df[df.unit_price < 500]
df = df[df.unit_price > 1]
df.describe()
idhost_idlatitudelongitudepriceminimum_nightsnumber_of_reviewsreviews_per_monthcalculated_host_listings_countavailability_365unit_price
count4.822100e+044.822100e+0448221.00000048221.00000048221.00000048221.00000048221.00000048221.00000048221.00000048221.00000048221.000000
mean1.903368e+076.764915e+0740.728922-73.952029143.4734666.17347223.4705211.1000407.190871111.84989961.690299
std1.097072e+077.860904e+070.0545500.046198176.82356810.94288944.6880661.60207133.161159131.09985863.639782
min2.539000e+032.438000e+0340.499790-74.24442010.0000001.0000000.0000000.0000001.0000000.0000001.016667
25%9.488762e+067.833913e+0640.690040-73.98291069.0000001.0000001.0000000.0400001.0000000.00000020.000000
50%1.969877e+073.086886e+0740.722960-73.955570105.0000003.0000005.0000000.3800001.00000044.00000044.000000
75%2.914645e+071.074344e+0840.763170-73.936120175.0000005.00000024.0000001.6100002.000000224.00000080.000000
max3.648724e+072.743213e+0840.913060-73.71299010000.000000365.000000629.00000058.500000327.000000365.000000499.500000
df.unit_price.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x19f178b8208>

在这里插入图片描述

2.民宿区域分析

  • 从neighbourhood_group/neighbourhood两个维度查看不同区域民宿数量分布
df_pie = df.neighbourhood_group.value_counts().reset_index()
df_pie
indexneighbourhood_group
0Manhattan21287
1Brooklyn19880
2Queens5614
3Bronx1070
4Staten Island370
plt.figure(figsize=(15,9.3))
explode = (0.1,0,0,0,0) #为了使最大占比区域分离
plt.pie(x='neighbourhood_group',labels='index',explode=explode,autopct='%1.1f%%',data=df_pie)
plt.title("The Rate of different Landmark Point")
plt.show()

在这里插入图片描述

pivot = df.pivot_table(index='neighbourhood',values='id',aggfunc='count')
df_pie_2 = pivot.sort_values(by='id',ascending=False).reset_index().head(20) #这里只取值前20,否则显示太多
df_pie_2.head()
neighbourhoodid
0Williamsburg3888
1Bedford-Stuyvesant3666
2Harlem2638
3Bushwick2445
4Upper West Side1936
plt.figure(figsize=(15,9.3))
plt.pie(x='id',labels='neighbourhood',autopct='%1.1f%%',data=df_pie_2)
plt.show()

在这里插入图片描述

plt.figure(figsize=(10,8))
pivot[pivot.id > 2500].sort_values(by='id',ascending=False).plot.bar()
plt.xticks(rotation=30)
(array([0, 1, 2]), <a list of 3 Text xticklabel objects>)




<Figure size 720x576 with 0 Axes>

在这里插入图片描述

  • 房间最多的区域前3个区域是Williamsburg 3888,Bedford-Stuyvesant 3666,Harlem 2638

3.对比Top3区域的均价

df1 = df[['neighbourhood','unit_price']].groupby(by=['neighbourhood','unit_price']).count().reset_index()
df1.head()
neighbourhoodunit_price
0Allerton5.000000
1Allerton12.142857
2Allerton15.750000
3Allerton16.333333
4Allerton17.000000
top3 = ['Williamsburg','Bedford-Stuyvesant','Harlem']
for i in top3:
    plt.hist(df1[df1.neighbourhood == i].unit_price,bins=20)
    plt.xlabel(i)
    plt.show()

在这里插入图片描述

在这里插入图片描述

在这里插入图片描述

  • 前两个价格区间类似,都集中在20美元以下,最后一个集中在20美元左右,之后都呈递减趋势;
  • 值得注意的是,前3个区域的民宿价格根据数量递减,最高价格也呈递减趋势;
  • 第一多的Williamsburg最高房价高达500,而第二多的Bedford-Stuyvesant价格最高达400多(小于500),而第三的Harlem则仅需350;

4.最受欢迎的民宿

df.sort_values(by=['number_of_reviews','reviews_per_month'],ascending=False).iloc[:10,:] #注意by后面参数的顺序影响结果
idnamehost_idhost_nameneighbourhood_groupneighbourhoodlatitudelongituderoom_typepriceminimum_nightsnumber_of_reviewslast_reviewreviews_per_monthcalculated_host_listings_countavailability_365unit_price
53219145202Room near JFK Queen Bed47621202DonaQueensJamaica40.66730-73.76831Private room4716292019/7/514.58233347.0
8515903972Great Bedroom in Manhattan4734398JjManhattanHarlem40.82085-73.94025Private room4916072019/6/217.75329349.0
4259903947Beautiful Bedroom in Manhattan4734398JjManhattanHarlem40.82124-73.93838Private room4915972019/6/237.72334249.0
4709891117Private Bedroom in Manhattan4734398JjManhattanHarlem40.82264-73.94041Private room4915942019/6/157.57333949.0
1533010101135Room Near JFK Twin Beds47621202DonaQueensJamaica40.66939-73.76975Private room4715762019/6/2713.40217347.0
159238168619Steps away from Laguardia airport37312959MayaQueensEast Elmhurst40.77006-73.87683Private room4615432019/7/111.59516346.0
14803834190Manhattan Lux Loft.Like.Love.Lots.Look !2369681CarolManhattanLower East Side40.71921-73.99116Private room9925402019/7/66.95117949.5
442716276632Cozy Room Family Home LGA Airport NO CLEANING FEE26432133DanielleQueensEast Elmhurst40.76335-73.87007Private room4815102019/7/616.22534148.0
100013474320Private brownstone studio Brooklyn12949460AsaBrooklynPark Slope40.67926-73.97711Entire home/apt16014882019/7/18.141269160.0
44566166172LG Private Room/Family Friendly792159WandaBrooklynBushwick40.70283-73.92131Private room6034802019/7/76.701020.0
  • 前10个最受欢迎的民宿,其中4家在Manhattan,4家在Queens,2家在Brooklyn
  • 其中前8个民宿均价都在50上下浮动,最末有1个极高值160,有1个极低值20,再看房型,160的房型是整屋出租,其他都是单间
  • 前10家民宿的浏览量在480~630之间稳定浮动
  • 前10家民宿中,排前5的房子由2个人经营,即一个人开多家民宿,猜测是有专业的运营团队,后5个目前不确定(因为这里只显示高人气房子)

下面我们看下排行后面5个的房东是否只经营1家民宿,从后面的全部数据中筛选

df[df['host_name'] == 'Maya'].count()
id                                46
name                              46
host_id                           46
host_name                         46
neighbourhood_group               46
neighbourhood                     46
latitude                          46
longitude                         46
room_type                         46
price                             46
minimum_nights                    46
number_of_reviews                 46
last_review                       46
reviews_per_month                 46
calculated_host_listings_count    46
availability_365                  46
unit_price                        46
dtype: int64
df['host_name'].isin(['Carol']).value_counts()
False    48176
True        45
Name: host_name, dtype: int64
df['host_name'].isin(['Danielle']).value_counts()
False    48145
True        76
Name: host_name, dtype: int64
df['host_name'].isin(['Asa']).value_counts()
False    48220
True         1
Name: host_name, dtype: int64
df['host_name'].isin(['Wanda']).value_counts()
False    48217
True         4
Name: host_name, dtype: int64
  • 这里通过两种方法可以求得后5个人气民宿的房东所拥有的民宿数量,排行前10的民宿中,有9家都是团队运营,只有一家房东Asa是个人运营;
  • 可能Airbin更趋于一种商业模式,而非最初的出发点,把自己闲置的房间share给旅行者;
  • 有点类似中国的二手房东,只是目标客户不再是本地人,而是旅行者,租赁方式也由长租改为短租;
  • 不过Airbin经过这么久的发展,已经是成熟的市场,团队应该比个人更专业些,排行靠前也就不足为奇了;

下面我们算下个人和团体的比例。计算方式:房东名下只有1套房的的视为个体,1次以上的视为团体

df2 = df['host_name'].value_counts()
df2.head()
Michael         400
David           390
Sonder (NYC)    327
John            291
Alex            272
Name: host_name, dtype: int64
df2.describe()
count    11355.000000
mean         4.244826
std         14.076734
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max        400.000000
Name: host_name, dtype: float64
(df2 > 1).sum() #团体经营
4518
(df2 == 1).sum() #个体经营
6837
  • 个体经营者约6837人,团体经营者约4518人,个人运营者更多,不过注意前者只有1套房子,后者有多套房子
  • 一共有48895条数据,减去6837条个人运营的1个房子,剩下的都是团体运营的房子,团体拥有总房子数量为个人的6倍左右;

5. 受欢迎的房子特点

keyword = df['name'].to_frame().to_csv('hotword.txt', index=False, sep=',', header=None) #方法to_frame()将Series转DataFrame
from wordcloud import WordCloud
text = open('hotword.txt',encoding = 'utf_8_sig').read()
wordcloud = WordCloud(background_color="white",max_font_size=50).generate(text)
plt.figure(figsize=(10,6.18))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis('off')
plt.show()

在这里插入图片描述

通过词云呈现出高热点搜索词:

  • 搜索词最多的是Private/Apartment/Bedroom等房型,所以一定要标注出房价类型
  • 其次出现heart of/Central Park/In Brooklyn/East Village等描述地理位置的词汇
  • 同样出现的有形容词类如Beautiful/Cozy/Bright/Charming/Moden/Lovely/Quiet等
  • 房东在添加标题的时候可以考虑加入高频搜索词汇,以吸引更多的人浏览

后记:这里只是粗略的进行分析,如果有面积,实际上更深入的还可以利用统计学/机器学习领域的最小二乘法去分析下定价,或者天气等不同因素的影响,挖掘更深度的东西;

已经开始看机器学习了,不确定是否用得上,不过一口气听了10多集,觉得还行,另外,在学习machine learning之前,建议先打好基础,包括统计学,很多人推荐很多经典书,这里推荐一本深入浅出统计学,我刷了2遍,竟然发现machine learning很多用到统计学的。

最近看的一本书里提到一句话,大概意思是:数据分析只能在现有的格局里找到最优解,但是人类的思想却可以突破当下的困局,从而带来全面的革新。所以数据分析有优势,但也确实有其弊端。

继续学习吧~

  • 0
    点赞
  • 0
    收藏
    觉得还不错? 一键收藏
  • 0
    评论
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值