Explore Data Secrets in Airbnb

weixinyu2012

Last modified: 2023-01-05 14:13:37

Category: Data Science. Tags: python, pandas

First published: 2023-01-05 10:26:22

Article link: https://blog.csdn.net/weixinyu2012/article/details/128558760

Background

Since 2008, guests and hosts have used Airbnb to travel in a more unique, personalized way. This blog digs into the dataset of listing activity of homestays in Seattle, WA, released as part of the Airbnb Inside initiative (https://www.kaggle.com/airbnb/seattle/data). It explores three questions.

First Question: How do prices differ across room types?

There are three room types: entire home/apt, private room, and shared room. We first look at the price distribution of each type:

These plots roughly show that an entire home costs more than a private room, and a private room costs more than a shared room. The following box plot makes this clearer:

In other words, the more space you occupy, the more you pay.
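The per-type comparison above could be reproduced roughly as follows. This is a minimal sketch, not the author's code: it assumes the listings table has a `room_type` column and a `price` column stored as text such as "$1,000.00", and the sample rows are made up for illustration.

```python
import pandas as pd


def price_by_room_type(listings: pd.DataFrame) -> pd.DataFrame:
    """Return summary statistics of price for each room type."""
    # Assumption: price is text like "$1,000.00"; strip "$" and "," first.
    cleaned = listings["price"].astype(str).str.replace(r"[$,]", "", regex=True)
    df = listings.assign(price=pd.to_numeric(cleaned, errors="coerce"))
    # describe() gives count, mean, std, min, quartiles and max per group.
    return df.groupby("room_type")["price"].describe()


# Made-up sample rows, not taken from the real dataset.
sample = pd.DataFrame({
    "room_type": ["Entire home/apt", "Entire home/apt",
                  "Private room", "Private room", "Shared room"],
    "price": ["$150.00", "$1,000.00", "$75.00", "$65.00", "$40.00"],
})
summary = price_by_room_type(sample)
print(summary[["count", "mean", "50%"]])
```

With matplotlib installed, a box plot of the same grouping can be drawn by calling `boxplot(column="price", by="room_type")` on the cleaned frame.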

Second Question: How do review scores compare for listings with more availability?

We use availability over the next 30 days as the measure of availability and the review scores rating as the review score. First, we plot the distribution of availability:

The distribution has two peaks, one at each end. We therefore treat availability lower than 5 as low availability and higher than 25 as high availability, and compare the review score distributions of the two groups.

There is no clear difference between the review score distributions of the low-availability and high-availability groups, possibly because most review scores cluster around 95-100. We therefore compare the two groups using summary statistics:

	High Availability	Low Availability
count	1152	855
mean	93.73	94.59
std	7.23	7.25
min	20.00	40.00
25%	91.00	92.00
50%	96.00	97.00
75%	98.00	100.00
max	100.00	100.00

The table shows that review scores for low-availability listings are slightly higher (mean 94.59 versus 93.73, median 97 versus 96). One possible explanation is that low availability reflects more bookings, so these listings are more popular and earn higher review scores. At the same time, most guests tend to give high scores, and more reviews increase the chance of an occasional low score, so the difference is quite narrow.
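The split and the statistics table above can be sketched as below. It assumes the columns are named `availability_30` and `review_scores_rating`, uses the thresholds from the text (lower than 5, higher than 25), and runs on a handful of invented rows rather than the real data.

```python
import pandas as pd


def compare_scores_by_availability(listings: pd.DataFrame,
                                   low: int = 5,
                                   high: int = 25) -> pd.DataFrame:
    """Describe review scores for the high- and low-availability groups."""
    avail = listings["availability_30"]
    scores = listings["review_scores_rating"]
    groups = {
        "High Availability": scores[avail > high],
        "Low Availability": scores[avail < low],
    }
    # describe() ignores missing scores when computing each statistic.
    return pd.DataFrame({name: s.describe() for name, s in groups.items()})


# Invented rows for illustration only.
sample = pd.DataFrame({
    "availability_30": [0, 2, 4, 15, 26, 28, 30],
    "review_scores_rating": [100, 96, None, 90, 95, 80, 98],
})
table = compare_scores_by_availability(sample)
print(table)
```

Listings with availability between the two thresholds fall into neither group, which matches the comparison of the two peaks only.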

Third Question: How well can we predict the price?

We predict the price with a linear regression model, using most of the continuous features and one-hot encoding some of the categorical features. We sweep a hyperparameter, the cutoff (the number of missing values allowed in a column for it to be used), and measure performance with the r2 score:

The plot above shows that the model works best with 30 features, achieving an r2 score of about 0.57. We then look at the coefficients of the best model and list the top 20 by absolute value:

From the table above, we can see that room type matters most. The shared room has the largest negative coefficient, and the private room has the second largest. This is consistent with the conclusion of the first question: an entire home has the highest price, followed by a private room.
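The cutoff sweep and the coefficient ranking described above might look like the sketch below. Several things here are assumptions rather than the author's method: the data is synthetic, the column names (`accommodates`, `bedrooms`, `square_feet`, `room_type`) are illustrative, remaining numeric gaps are filled with the column mean, and ordinary least squares is fitted with `numpy.linalg.lstsq` instead of a modelling library to keep the example self-contained.

```python
import numpy as np
import pandas as pd


def prepare_features(df, target, max_missing):
    """Drop sparse columns, mean-fill numeric gaps, one-hot encode the rest."""
    X = df.drop(columns=[target])
    X = X.loc[:, X.isna().sum() <= max_missing]   # the cutoff filter
    X = X.fillna(X.mean(numeric_only=True))       # assumption: mean imputation
    return pd.get_dummies(X, drop_first=True, dtype=float)


def fit_linear(df, target, max_missing, test_frac=0.3, seed=42):
    """Fit least squares on a random train split; return test r2 and coefficients."""
    df = df.dropna(subset=[target])
    X = prepare_features(df, target, max_missing)
    A = np.column_stack([np.ones(len(X)), X.to_numpy(dtype=float)])  # intercept
    y = df[target].to_numpy(dtype=float)

    order = np.random.default_rng(seed).permutation(len(df))
    n_test = int(len(df) * test_frac)
    test, train = order[:n_test], order[n_test:]

    beta, *_ = np.linalg.lstsq(A[train], y[train], rcond=None)
    resid = y[test] - A[test] @ beta
    r2 = 1 - (resid ** 2).sum() / ((y[test] - y[test].mean()) ** 2).sum()
    return r2, pd.Series(beta[1:], index=X.columns)


# Synthetic listings: price depends on size and room type plus noise.
rng = np.random.default_rng(0)
n = 300
room = rng.choice(["Entire home/apt", "Private room", "Shared room"], size=n)
accommodates = rng.integers(1, 7, size=n)
bedrooms = rng.integers(1, 5, size=n).astype(float)
price = (50 + 8 * accommodates + 20 * bedrooms
         - 40 * (room == "Private room") - 70 * (room == "Shared room")
         + rng.normal(0, 5, size=n))
bedrooms[:10] = np.nan                            # a few missing values
square_feet = np.where(rng.random(n) < 0.9, np.nan,
                       rng.uniform(300, 2000, size=n))  # mostly missing
listings = pd.DataFrame({"room_type": room, "accommodates": accommodates,
                         "bedrooms": bedrooms, "square_feet": square_feet,
                         "price": price})

# Sweep the cutoff and keep the model with the best test r2.
cutoffs = [0, 50, n]
results = {c: fit_linear(listings, "price", c) for c in cutoffs}
best_cutoff = max(results, key=lambda c: results[c][0])
best_r2, coefs = results[best_cutoff]

# Rank coefficients by absolute value, largest first.
top = coefs.loc[coefs.abs().sort_values(ascending=False).index].head(20)
print(best_cutoff, round(best_r2, 2))
print(top)
```

Because `drop_first=True` drops the first room type alphabetically, the entire home becomes the baseline, so the private and shared room coefficients read as price differences relative to it. Coefficient size also depends on each feature's scale, so a ranking by absolute value is most meaningful among features on comparable scales, such as the one-hot columns.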

*The detailed code can be found on GitHub.

Summary

The price of Entire home is higher than that of Private room, and the price of Private room is higher than that of Shared room.

Review scores for low-availability listings are slightly higher, but the difference is quite narrow.

The best linear regression model achieves an r2 score of about 0.57. Among the coefficients of the best model, room type matters most. The shared room has the largest negative coefficient, and the private room has the second largest.