Explore Data Secrets in Airbnb

Background

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. As part of the Airbnb Inside initiative, this blog digs into a dataset of listing activity for homestays in Seattle, WA (https://www.kaggle.com/airbnb/seattle/data). It explores three questions.

First Question: How do prices differ across room types?

There are three room types: entire home/apt, private room and shared room. We first look at the price distribution of each type:

 

These plots show roughly that entire homes are priced higher than private rooms, and private rooms higher than shared rooms. The following box plot makes this clearer:

This means that the more space you occupy, the more you pay.
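The comparison above can be sketched in a few lines of pandas. This is only a minimal illustration, not the code behind the plots: the column names `room_type` and `price` are assumed to match the listings file, and the six rows of data are invented.

```python
import pandas as pd

# Toy listings, invented for illustration; the real rows would come from the dataset.
listings = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Private room", "Shared room", "Shared room"],
    "price": [150.0, 210.0, 70.0, 90.0, 35.0, 45.0],
})

# Median nightly price per room type, highest first.
median_price = (
    listings.groupby("room_type")["price"]
    .median()
    .sort_values(ascending=False)
)
print(median_price)
```

The median is used here because it is the line drawn inside each box of a box plot, so the ordering it gives matches what the box plot shows.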

Second Question: How do review scores differ between listings with low and high availability?

We use the availability over the next 30 days as the measure of availability and the review scores rating as the review score. We plot the distribution of availability first:

There are two peaks, one at each end. So we treat availability lower than 5 days as low availability and higher than 25 days as high availability, and then compare the review score distributions of the two groups.

There is no clear difference between the review score distributions of the low and high availability groups. This may be because most review scores cluster around 95-100. We then compare the distributions using summary statistics:

         High Availability    Low Availability
count    1152                 855
mean     93.73                94.59
std      7.23                 7.25
min      20.00                40.00
25%      91.00                92.00
50%      96.00                97.00
75%      98.00                100.00
max      100.00               100.00

The table shows that review scores for low-availability listings are slightly higher. One possible explanation is that more bookings mean the room or house is more popular and so receives higher review scores. At the same time, most guests are willing to give high scores, and more reviews increase the chance of some low scores, so the difference is quite narrow.
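The split into low and high availability and the summary table can be reproduced along these lines. This is a sketch under assumptions: the column names `availability_30` and `review_scores_rating` are taken to match the listings file, `split_by_availability` is a hypothetical helper, and the eight rows are invented.

```python
import pandas as pd

# Toy data, invented for illustration only.
listings = pd.DataFrame({
    "availability_30": [0, 2, 4, 10, 26, 28, 30, 30],
    "review_scores_rating": [98.0, 95.0, None, 90.0, 92.0, 96.0, 100.0, None],
})

def split_by_availability(df, low=5, high=25):
    """Return review scores of listings below `low` and above `high` available days."""
    scores = df.dropna(subset=["review_scores_rating"])  # unrated listings are skipped
    low_scores = scores.loc[scores["availability_30"] < low, "review_scores_rating"]
    high_scores = scores.loc[scores["availability_30"] > high, "review_scores_rating"]
    return low_scores, high_scores

low_scores, high_scores = split_by_availability(listings)

# describe() yields count, mean, std, min, quartiles and max for each group.
summary = pd.DataFrame({
    "High Availability": high_scores.describe(),
    "Low Availability": low_scores.describe(),
})
print(summary.round(2))
```

Listings without a rating are dropped before the split, which would explain why the counts in a table like this can be smaller than the number of listings in each availability group.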

Third Question: How well can we predict the price?

We use most of the continuous features, plus one-hot encodings of some categorical features, to predict the price with a linear regression model. We run a test over a cutoff hyperparameter (the number of missing values allowed in the columns used) and use the r2 score to measure performance:
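A rough sketch of such a cutoff test follows. It is not the code used for this post: it fits ordinary least squares with NumPy rather than a library model, fills missing numeric values with the column mean (an assumption), and scores the fit on the same rows it was trained on to stay short. The helper names and the toy data are invented.

```python
import numpy as np
import pandas as pd

def select_columns(df, target, cutoff):
    """Keep the feature columns that have at most `cutoff` missing values."""
    features = df.drop(columns=[target])
    keep = features.columns[features.isna().sum() <= cutoff]
    return features[keep]

def fit_and_score(df, target, cutoff):
    """One-hot encode, fit least squares, return (in-sample r2, number of features)."""
    X = select_columns(df, target, cutoff)
    X = X.fillna(X.mean(numeric_only=True))              # mean-fill numeric gaps
    X = pd.get_dummies(X, drop_first=True, dtype=float)  # one-hot encode categoricals
    y = df[target].to_numpy(dtype=float)
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])  # add intercept
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residual = y - A @ coef
    r2 = 1.0 - residual @ residual / ((y - y.mean()) @ (y - y.mean()))
    return r2, X.shape[1]

# Toy data, invented for illustration; "bathrooms" has 3 missing values.
toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room",
                  "Entire home/apt", "Private room", "Shared room"],
    "accommodates": [4, 2, 1, 6, 2, 1],
    "bathrooms": [1.0, None, None, 2.0, 1.0, None],
    "price": [150.0, 70.0, 35.0, 240.0, 80.0, 40.0],
})

results = {cutoff: fit_and_score(toy, "price", cutoff) for cutoff in (0, 3)}
for cutoff, (r2, n_features) in results.items():
    print(f"cutoff={cutoff}: {n_features} features, r2={r2:.3f}")
```

Raising the cutoff admits more columns, so the in-sample r2 can only go up. To find a best cutoff as described here, the score would need to be measured on held-out data instead.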

 

From the plot above, we can see that the model works best with 30 features and achieves an r2 score of about 0.57. We then look into the coefficients of the best model and list the top 20 (ranked by the absolute value of the coefficient):

From the table above, we can see that room type matters most. The shared room feature has the largest negative coefficient, and the private room feature has the second largest. This is consistent with the conclusion of the first question: entire homes have the highest prices, followed by private rooms.
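Ranking coefficients by absolute value, as done for the table above, could look like the following. The helper `top_coefficients` is hypothetical, and the feature names and coefficient values are invented rather than taken from the fitted model.

```python
import pandas as pd

def top_coefficients(feature_names, coefficients, n=20):
    """Return the n features with the largest absolute coefficients."""
    table = pd.DataFrame({"feature": feature_names, "coef": coefficients})
    table["abs_coef"] = table["coef"].abs()
    return (
        table.sort_values("abs_coef", ascending=False)
        .head(n)
        .reset_index(drop=True)
    )

# Invented coefficients, for illustration only.
names = ["room_type_Shared room", "room_type_Private room", "accommodates", "bathrooms"]
coefs = [-75.0, -40.0, 12.5, 20.0]

top = top_coefficients(names, coefs, n=3)
print(top)
```

Coefficients are only comparable in size when the features share a scale. One-hot columns are all 0 or 1, so comparing the room type coefficients with each other is reasonable, while comparing them with a coefficient on a count or an area needs more care.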

*The detailed code can be found on GitHub.

Summary

Entire homes are priced higher than private rooms, and private rooms higher than shared rooms.

Review scores for low-availability listings are slightly higher, but the difference is quite narrow.

The best linear regression model achieves an r2 score of 0.56. Among the coefficients of the best model, room type matters most. The shared room feature has the largest negative coefficient, and the private room feature has the second largest.
