Airbnb Short-Term Rentals

qq_41553076

Tags: data analysis, python, big data, machine learning, data mining

Article link: https://blog.csdn.net/qq_41553076/article/details/107894589

Raw data

The figure below shows the datasets used. They are fairly rich and can be explored along several dimensions.
[Figure: list of the dataset files]

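Understanding the data

Loading the data

The largest file is calendar_detail, with about ten million rows recording the status of every listing on every day. Next come listings_detail and listings. The reviews files mostly contain user reviews.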

import pandas as pd
import numpy as np
path1='/home/jhon/Desktop/DATA/renting/calendar_detail.csv'
path2='/home/jhon/Desktop/DATA/renting/listings_detail.csv'
path3='/home/jhon/Desktop/DATA/renting/listings.csv'
path4='/home/jhon/Desktop/DATA/renting/reviews.csv'
path5='/home/jhon/Desktop/DATA/renting/reviews_detail.csv'

calendar=pd.read_csv(path1)
listings=pd.read_csv(path3)
reviews=pd.read_csv(path5)
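With a file of roughly ten million rows, it can help to read only the columns that are needed and to process the file in chunks. The following is a minimal sketch, not the loading code used in this post: `csv_text` is a made-up stand-in for calendar_detail.csv, and the chunk size of 2 is only for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for calendar_detail.csv (values are made up)
csv_text = """listing_id,date,available,price
1,2019-04-17,f,$511.00
1,2019-04-18,t,$511.00
2,2019-04-17,t,"$1,250.00"
"""

# Read only the needed columns, parse the date column, two rows per chunk
chunks = pd.read_csv(io.StringIO(csv_text),
                     usecols=["listing_id", "date", "price"],
                     parse_dates=["date"],
                     chunksize=2)

# Each chunk is an ordinary DataFrame; here they are simply concatenated
sample = pd.concat(chunks, ignore_index=True)
print(sample.dtypes)
```

On the real file, a path would replace `io.StringIO(csv_text)` and a much larger `chunksize` would be appropriate.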

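Basic information

The fields in calendar_detail are shown below. Note that prices are in US dollars and are stored as strings with a leading "$", so they must be cleaned before use.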

calendar.head(2)
# Output
	listing_id 	date 	available 	price 	adjusted_price 	minimum_nights 	maximum_nights
0 	1165040 	2019-04-17 	f 	$511.00 	$511.00 	1.0 	1125.0
1 	1165040 	2019-04-18 	t 	$511.00 	$511.00 	1.0 	1125.0

Fields in listings

listings.columns
# Output
Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

Fields in reviews

reviews.columns
# Output
Index(['listing_id', 'id', 'date', 'reviewer_id', 'reviewer_name', 'comments'], dtype='object')

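Knowing these fields up front is important, because the later merges and groupings depend on them.

Data cleaning

Check whether the data contains missing values. The bluntest fix is to drop the affected rows, and that is what is done here: the sample has about ten million rows, so dropping them has no meaningful effect.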

calendar.isnull().sum()

listing_id          0
date                0
available           0
price               0
adjusted_price      0
minimum_nights    358
maximum_nights    358
dtype: int64
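The output above shows 358 missing values per column out of roughly ten million rows, which is well under 0.01%. As a rough check before dropping rows, the affected share can be computed directly. This sketch uses a small made-up frame (`toy` is hypothetical, not the real calendar data).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for calendar; values are made up
toy = pd.DataFrame({
    "listing_id": [1, 1, 2, 2, 3],
    "minimum_nights": [1.0, np.nan, 2.0, 2.0, 3.0],
    "maximum_nights": [30.0, np.nan, 60.0, 60.0, 90.0],
})

# Rows that contain at least one missing value
rows_with_na = toy.isnull().any(axis=1)
share_dropped = rows_with_na.mean()  # fraction of rows that dropna() would remove

cleaned = toy.dropna()
print(f"{share_dropped:.1%} of rows dropped, {len(cleaned)} rows kept")
```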

The date column is converted to a datetime type at the same time.

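calendar1=calendar.copy()
calendar1.dropna(inplace=True)
# pd.to_datetime works across pandas versions; astype('datetime64') without a unit is rejected by pandas 2.x
calendar1['date']=pd.to_datetime(calendar1['date'])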

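This step strips the dollar sign and converts the values to numbers. After the dollar sign is removed, the thousands separators remain and must be handled as well.
price and adjusted_price are different fields: adjusted_price is the discounted price, but in practice the two barely differ. For simplicity, the price field is used.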


calendar1['price']=calendar1['price'].str.split('$',expand=True)[1]
calendar1['adjusted_price']=calendar1['adjusted_price'].str.split('$',expand=True)[1]

calendar1['adjusted_price']=calendar1['adjusted_price'].str.replace(',','').astype('float')
calendar1['price']=calendar1['price'].str.replace(',','').astype('float')
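As an alternative sketch, the dollar sign and the thousands separator can be removed in a single regular-expression pass. The sample strings below are made up, but follow the same format as the price column.

```python
import pandas as pd

# Hypothetical sample strings in the same format as the price column
raw = pd.Series(["$511.00", "$1,250.00", "$89.00"])

# Remove "$" and "," in one pass, then convert to float
clean = raw.str.replace(r"[$,]", "", regex=True).astype(float)
print(clean.tolist())  # [511.0, 1250.0, 89.0]
```

Inside a character class, `$` is a literal character, so no escaping is needed.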

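Listing dimension

Data preparation

There are many possible directions to explore. The analysis starts with the listing as the unit of analysis.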

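# Average the numeric columns per listing; numeric_only avoids errors on text columns such as 'available'
calendar_g=calendar1.groupby(by='listing_id').mean(numeric_only=True)
a=pd.merge(calendar_g,listings,left_on='listing_id',right_on='id')
# Count missing values (isnull().count() would return the number of rows instead)
a['neighbourhood_group'].isnull().sum()

# Take a copy so that adding columns later does not modify a view of `a`
house=a[['id','room_type','minimum_nights_x','maximum_nights',
         'longitude','latitude','price_x','price_y',
         'reviews_per_month','availability_365']].copy()
# rate = mean calendar price (price_x) / listed price (price_y)
house.eval('rate=price_x/price_y',inplace=True)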

Group the listings into price tiers.

house['level']=pd.cut(house['price_x'],
                      bins=[0,200,400,600,800,1500,3000,8000,50000],
                      labels=[1,2,3,4,5,6,7,8])
house['level'].value_counts()
# Output
2    9288
3    6970
1    4707
4    3199
5    2489
6    1117
7     563
8     104
Name: level, dtype: int64
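One detail of `pd.cut` is worth knowing when reading these tiers: by default the bins are closed on the right and open on the left, so a price of exactly 200 falls in tier 1, while a price of 0 or anything above 50000 gets no tier at all. A small sketch with made-up prices:

```python
import pandas as pd

# Hypothetical prices chosen to sit on or near the bin edges
prices = pd.Series([0, 150, 200, 200.01, 50000, 60000])
bins = [0, 200, 400, 600, 800, 1500, 3000, 8000, 50000]

# Default right=True gives intervals (0, 200], (200, 400], ..., (8000, 50000]
level = pd.cut(prices, bins=bins, labels=[1, 2, 3, 4, 5, 6, 7, 8])
print(level.tolist())  # [nan, 1, 1, 2, 8, nan]
```

Passing `include_lowest=True` would place a price of exactly 0 in the first tier.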

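Data visualization

Geographic distribution

I used Tableau for the visualizations here. You could also use plotly, pychart, and so on. See the figure below.
[Figure: map of listing locations]

Zooming in on the distribution of listings in the urban area:

[Figure: map of listing locations in the urban area]

The analysis reveals the following patterns:

Listings are concentrated mainly around the city center;
The higher the price tier, the more dispersed the listings;
Even within urban Beijing, listings are highly clustered.

Concentration near the city center is to be expected: population density is higher there, and so is demand for accommodation. Higher-tier listings are more dispersed, possibly because guests with more spending power prefer to relax and recuperate away from the city.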

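Counts and prices by room type

Listings are divided into 8 tiers by price, and there are three room types.
[Figure: number of listings by price tier and room type]

The figure shows that tiers 1, 2, 3, and 4 account for the vast majority of listings.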

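Profit ratio

[Figure: profit ratio by price tier and room type]

The ratio compares two prices. The calendar price comes from calendar_detail (the per-listing mean of the daily price, price_x), and the listed price comes from the listings table (price_y). The calendar price fluctuates daily: it may rise on holidays and fall slightly in the off season. The listed price is taken to be an average price or a cost price, and for now it is simply used as the baseline.
The figure clearly shows that the higher the tier, the higher the profit ratio. By room type, Shared room has the highest ratio. Explaining why would require analysis against the actual business.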
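The figure was produced in Tableau, but the same aggregation could be sketched in pandas. The frame below is a made-up stand-in shaped like the `house` table. `toy_house` and all of its values are hypothetical, so the numbers illustrate only the mechanics, not the findings.

```python
import pandas as pd

# Hypothetical rows shaped like the `house` table; all values are made up
toy_house = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Shared room"],
    "level": [2, 1, 1, 1],
    "price_x": [330.0, 180.0, 120.0, 90.0],  # mean daily price from the calendar
    "price_y": [300.0, 150.0, 80.0, 75.0],   # listed price
})
toy_house["rate"] = toy_house["price_x"] / toy_house["price_y"]

# Mean ratio per room type, and per (level, room_type) pair
by_type = toy_house.groupby("room_type")["rate"].mean()
by_level_type = toy_house.groupby(["level", "room_type"])["rate"].mean()
print(by_type)
```

On real data, rows with a listed price of 0 would produce infinite ratios and would need to be filtered out before averaging.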

Time dimension

The data is analyzed along the time dimension below. It is first regrouped and then aggregated.

# Take a copy so that adding columns does not trigger SettingWithCopyWarning
calendar2=calendar1[['date','listing_id','price']].copy()
calendar2['level']=pd.cut(calendar2['price'],
                      bins=[0,200,400,600,800,1500,3000,8000,50000],
                      labels=[1,2,3,4,5,6,7,8])

# Split the date into year, month, and day for grouping
calendar2['year']=calendar2['date'].dt.year
calendar2['month']=calendar2['date'].dt.month
calendar2['day']=calendar2['date'].dt.day

b=calendar2.groupby(by=['level','year','month']).agg({'date':'count','price':'mean'})
b=b.dropna()
b=b.reset_index()

b
# Output
	level 	year 	month 	date 	price
0 	1 	2019 	4 	75776 	146.795780
1 	1 	2019 	5 	163579 	146.405504
2 	1 	2019 	6 	159471 	146.166908
3 	1 	2019 	7 	163156 	146.144101
4 	1 	2019 	8 	163746 	146.820570
... 	... 	... 	... 	... 	...
99 	8 	2019 	12 	3278 	13714.726052
100 	8 	2020 	1 	3158 	13337.024066
101 	8 	2020 	2 	2954 	13336.299932
102 	8 	2020 	3 	3148 	13337.912961
103 	8 	2020 	4 	1513 	13361.962327

104 rows × 5 columns

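One caveat for the time dimension: the row counts for April 2019 and April 2020 are much lower than for other months because only about half a month of data was collected for each.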
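Partial months like these can be detected programmatically by comparing the number of distinct days observed in each month with the length of that month. This is a minimal sketch on a made-up date range. The start date mirrors the first row shown earlier (2019-04-17), and the end date is arbitrary.

```python
import pandas as pd

# Hypothetical date range; the end date is chosen only for illustration
dates = pd.DataFrame({"date": pd.date_range("2019-04-17", "2019-06-30", freq="D")})
dates["year"] = dates["date"].dt.year
dates["month"] = dates["date"].dt.month

grouped = dates.groupby(["year", "month"])["date"]
days_seen = grouped.nunique()                  # distinct days observed per month
days_in_month = grouped.max().dt.days_in_month # calendar length of each month
coverage = days_seen / days_in_month

# Months with less than full coverage should be compared with care
partial = coverage[coverage < 1].index.tolist()
print(partial)  # [(2019, 4)]
```

Months flagged this way could be excluded from month-over-month comparisons, or their counts could be scaled by coverage.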

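Summary

This dataset has too few features to investigate what drives short-term rental prices. The only usable candidates here are probably latitude, longitude, and price, and their correlation is fairly low, so I will not explore further.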
