Analyzing Airbnb data for Boston and Seattle

Analyzing Airbnb data for Boston and Seattle

In this article, I will talk about some thing about Airbnb data for Boston and Seattle in 2018。And I will answer 3 questions by analyzing these data.

1. Business understanding

I collect one month data in 2018 from Airbnb in Boston and Seattle, and I will put them together; then I will choose some features to build a model; at last, I will answer 3 questions:

  1. What is top positive and negative factors that can affect the price of a house;
  2. Try to build a model to predict the score;
  3. What should a host do to improve the score(review_scores_rating) of his house from this model;

2. Data understanding

# import the basic libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline

2.1 Load the data

# I chooose data from Boston and Seattle AirBNB

boston_cal = pd.read_csv('./data/Boston/calendar.csv', low_memory=False)
boston_list = pd.read_csv('./data/Boston/listings.csv', low_memory=False)
boston_rev = pd.read_csv('./data/Boston/reviews.csv', low_memory=False)

seattle_cal = pd.read_csv('./data/Seattle/calendar.csv', low_memory=False)
seattle_list = pd.read_csv('./data/Seattle/listings.csv', low_memory=False)
seattle_rev = pd.read_csv('./data/Seattle/reviews.csv', low_memory=False)

2.2 Preview the data

print('boston calender size: ', boston_cal.shape)
print('boston listings size: ', boston_list.shape)
print('boston reviews size: ', boston_rev.shape)

boston calender size: (2286360, 7)
boston listings size: (6264, 106)
boston reviews size: (225880, 6)

boston_cal.head()

在这里插入图片描述

boston_list.head()

boston_list.head()

boston_rev.head()

在这里插入图片描述

  1. We can see that ‘calender’ data can join ‘listings’ data.
  2. Calender data show the listing id, data, available, price, adjusted_price, minimum_nights and maximum_nights.
  3. Listings data show some detailed information. However, obviously, listings data contains too many unnecessary messages, it need to be Extract-Transform-Load later.
  4. We don’t need reviews data, because the most important information in it is comments from guests. And we can hardly use them.

3. Data preperation

3.1 Merge listings and calender

# Obviouly, 'listing_id' in calender data refers to 'id' in listings data
seattle_cal = seattle_cal.rename(columns={'listing_id': 'id'})
boston_cal = boston_cal.rename(columns={'listing_id': 'id'})
seattle_df = pd.merge(seattle_list,seattle_cal,on='id',how='outer')
boston_df = pd.merge(boston_list,boston_cal,on='id',how='outer')
boston_df.head()

3.2 Put two dataframes in one dataframe(concat)

boston_df['city'] = 'boston'
seattle_df['city'] = 'seattle'
df = pd.concat([boston_df,seattle_df],ignore_index=True)
df.shape

(5595815, 112)

3.3 Working with missing data

3.3.1 Checking columns that contains more than 80% miss data
df_80 = df.columns[df.count()/df.shape[0]<=0.80]
df_80

Index([‘space’, ‘neighborhood_overview’, ‘notes’, ‘transit’, ‘access’,
‘interaction’, ‘house_rules’, ‘thumbnail_url’, ‘medium_url’,
‘xl_picture_url’, ‘host_about’, ‘host_acceptance_rate’,
‘neighbourhood_group_cleansed’, ‘square_feet’, ‘weekly_price’,
‘monthly_price’, ‘security_deposit’, ‘license’],
dtype=‘object’)

3.3.2 Checking each columns in df_80
df['space'].value_counts()

At Zeus, we focus exclusively on 30 day+ stays to meet the longer-term needs of our valued guests. When you stay in a Zeus property you will be met with top-notch amenities including premium mattresses and bedding, organic bath products, bath linens, a fully equipped kitchen, and super-fast wifi. Our staff is on call should anything come up that needs our attention or that can make your stay more enjoyable. 56210
The apartment is in a high-end building, conveniently located near South Lake Union’s technology center. With 24/7 guest support and concierge services, we combine a fully-furnished premium apartment with the consistent, high-quality experience and amenities that you expect from leading hotels. Expertly designed with customized furniture, the apartment’s furnishings are stylish, contemporary and specifically tailored for this apartment. We equip our apartments with everything you’ll need for your home away from home. Need more towels or anything additional during your stay? Our 24/7 guest support and concierge services are here for you. Kitchen: - Fully-stocked with cookware, glasses, mugs, plates, and everything you’ll need to host a dinner at your home away from home. - Glass top, Energy Efficient Range - Full-sized, Energy Efficient Refridgerator and Freezer - 1100 Watt Microwave - Paper Products and Trash Bags - Premium Dishwasher - Dish Soap, Dish Pods, Hand Soap - Coffee Maker w 25915

This column talks about the description of the house which given by the house owner. Obviously, ‘space’ column contains a lot of text information and we can’t group them easily. We can just drop it.

df['neighborhood_overview'].value_counts()

Located in central South Lake Union these apartments are a short walk to some well known technology businesses nearby (Amazon, Impinj, Northeastern University, Path). You can also enjoy the large park across the street. 22265
These downtown apartments are within a mile of Pike Place Market, the Space Needle, the Seattle Aquarium and a multitude of upscale restaurants and bars. You’re also just steps away from an assortment of venues where you can enjoy theater, comedy and every conceivable kind of music. Here, returning home for the afternoon certainly doesn’t mean an end to an exciting day. 17155
Fenway highlights - Walkability score - 95 - 2 minute walk to Sweet Cheeks Q (favorite family-style BBQ) - 10 minute walk to The Hawthorne (upscale bar serving craft cocktails) 17155

This column talks about the surrouding of the house. Similarly, we can hardly find some useful imformation from it. We can drop it.

We check all the columns in df_80, and put all columns that need to be deleted in a new list:
Now, we need to drop ‘space’, ‘neighborhood_overview’, ‘notes’, ‘transit’, ‘access’, ‘interaction’, ‘house_rules’, ‘thumbnail_url’, ‘medium_url’, ‘xl_picture_url’, ‘host_about’, ‘host_acceptance_rate’, ‘neighbourhood_group_cleansed’, ‘square_feet’, ‘license’.
At the same time we also need to fill up ‘weekly_price’,‘monthly_price’, ‘security_deposit’.

list_drop = ['space', 'neighborhood_overview', 'notes', 'transit', 'access',
       'interaction', 'house_rules', 'thumbnail_url', 'medium_url',
       'xl_picture_url', 'host_about', 'host_acceptance_rate',
       'neighbourhood_group_cleansed', 'square_feet', 'license']
def drop_columns(df, cols):
	'''
	Usage: Delete colums
	Input: 
	df - input dataframe
    cols - the feature to be process
    Output:
    df - dataframe which have been process
	'''
	df.drop(columns=list_drop, inplace=True)
	return df

Now we need to fill up ‘weekly_price’, ‘monthly_price’, ‘security_deposit’.

list_fill_zero = ['weekly_price', 'monthly_price', 'security_deposit']
def fill_colums(df, cols):
	'''
	Usage: Fill up colums
	Input: 
	df - input dataframe
    cols - the feature to be process
    Output:
    df - dataframe which have been process
	'''
	for col in cols:
		df[col].fillna(value=0, inplace=True)
	return df

Before we process these columns, we need to complete our list_drop and list_fill_zero.

3.3.3 Get into each features
df.select_dtypes(include=['float', 'int']).columns

Index([‘id’, ‘scrape_id’, ‘host_id’, ‘host_listings_count’,
‘host_total_listings_count’, ‘latitude’, ‘longitude’, ‘accommodates’,
‘bathrooms’, ‘bedrooms’, ‘beds’, ‘guests_included’, ‘minimum_nights_x’,
‘maximum_nights_x’, ‘minimum_minimum_nights’, ‘maximum_minimum_nights’,
‘minimum_maximum_nights’, ‘maximum_maximum_nights’,
‘minimum_nights_avg_ntm’, ‘maximum_nights_avg_ntm’, ‘availability_30’,
‘availability_60’, ‘availability_90’, ‘availability_365’,
‘number_of_reviews’, ‘number_of_reviews_ltm’, ‘review_scores_rating’,
‘review_scores_accuracy’, ‘review_scores_cleanliness’,
‘review_scores_checkin’, ‘review_scores_communication’,
‘review_scores_location’, ‘review_scores_value’,
‘calculated_host_listings_count’,
‘calculated_host_listings_count_entire_homes’,
‘calculated_host_listings_count_private_rooms’,
‘calculated_host_listings_count_shared_rooms’, ‘reviews_per_month’,
‘minimum_nights_y’, ‘maximum_nights_y’],
dtype=‘object’)

df.select_dtypes(include=['object']).columns

Index([‘listing_url’, ‘last_scraped’, ‘name’, ‘summary’, ‘description’,
‘experiences_offered’, ‘picture_url’, ‘host_url’, ‘host_name’,
‘host_since’, ‘host_location’, ‘host_response_time’,
‘host_response_rate’, ‘host_is_superhost’, ‘host_thumbnail_url’,
‘host_picture_url’, ‘host_neighbourhood’, ‘host_verifications’,
‘host_has_profile_pic’, ‘host_identity_verified’, ‘street’,
‘neighbourhood’, ‘neighbourhood_cleansed’, ‘city’, ‘state’, ‘zipcode’,
‘market’, ‘smart_location’, ‘country_code’, ‘country’,
‘is_location_exact’, ‘property_type’, ‘room_type’, ‘bed_type’,
‘amenities’, ‘price_x’, ‘weekly_price’, ‘monthly_price’,
‘security_deposit’, ‘cleaning_fee’, ‘extra_people’, ‘calendar_updated’,
‘has_availability’, ‘calendar_last_scraped’, ‘first_review’,
‘last_review’, ‘requires_license’, ‘jurisdiction_names’,
‘instant_bookable’, ‘is_business_travel_ready’, ‘cancellation_policy’,
‘require_guest_profile_picture’, ‘require_guest_phone_verification’,
‘date’, ‘available’, ‘price_y’, ‘adjusted_price’],
dtype=‘object’)

Obviouly, columns named ‘description’, ‘country’, ‘_url’, ‘state’, ‘street’… are useless.

list_drop = list_drop + ['country', 'country_code', 'description', 'host_name', 'host_picture_url', 
           'host_thumbnail_url', 'host_url', 'last_scraped', 'smart_location', 
            'host_verifications', 'listing_url', 'name', 'picture_url', 'last_scraped', 'market',
            'state', 'street', 'summary', 'zipcode', 'requires_license', 
            'host_neighbourhood', 'experiences_offered', 'calendar_last_scraped',
            'jurisdiction_names', 'is_business_travel_ready', 'price_y']

We can see that some object columns need to convert into int or float

list_to_flout = ['cleaning_fee', 'extra_people', 'monthly_price', 'weekly_price', 'adjusted_price', 'price_x', 'security_deposit']
def convert_to_flout(df, cols):
	'''
	Usage: Convert colums to float
	Input: 
	df - input dataframe
    cols - the feature to be process
    Output:
    df - dataframe which have been process
	'''
	for col in cols:
	df[col]=df[col].replace('[$,]','',regex=True).astype(float)
	return df
df['calendar_updated'].value_counts()

today 1939610
2 weeks ago 325945
yesterday 310980
a week ago 248565
3 weeks ago 217905
2 months ago 216445
3 days ago 185785
3 months ago 172280
4 weeks ago 143080
4 days ago 137605
5 weeks ago 129940
5 days ago 118625
4 months ago 112420
2 days ago 108770

def convert_calendar_int(x):
    if('a week' in x):
        return 1
    elif('week' in x):
        return str.split(x)[0]
    elif ('month' in x):
        return int(str.split(x)[0])*4
    elif ('never' in x):
        return int(0)
    else:
        return int(1)
df['calendar_updated']=df['calendar_updated'].apply(lambda x:convert_calendar_int(x))

3.4 Result DataFrame

list_numerical_drop = ['id', 'scrape_id', 'host_id', 'minimum_nights_x', 'maximum_nights_x', 
                       'minimum_nights_y', 'maximum_nights_y']
def drop_row(df):
	'''
	Usage: Drop row that contains missing values
	Input: 
	df - input dataframe
    Output:
    df - dataframe which have been process
	'''
	df_miss = df.columns[(df.count()/len(df)) < 1]
	df.dropna(subset=df_miss, inplace=True, axis=0)
	return df

Check the result Dataframe

df.select_dtypes(include=['object']).columns

Index([‘host_since’, ‘host_is_superhost’,
‘host_has_profile_pic’, ‘host_identity_verified’,
‘neighbourhood_cleansed’, ‘city’, ‘is_location_exact’, ‘property_type’,
‘room_type’, ‘bed_type’, ‘has_availability’,
‘instant_bookable’, ‘cancellation_policy’,
‘require_guest_profile_picture’, ‘require_guest_phone_verification’,
‘available’],
dtype=‘object’)

def get_dump(df):
	list_cat = df.select_dtypes(include=['object']).columns
	df =pd.get_dummies(df, columns=list_cat, drop_first=True)
	return df

4. Modeling

4.1 Find the factors that affect price

corr_matrix = df.corr()
corr_matrix['price'].sort_values(ascending=False)

Q1 The top factors that affect price of house:
price 1.000000
host_total_listings_count 0.497184
accommodates 0.452723
beds 0.395042
bedrooms 0.392926
cleaning_fee 0.367075
guests_included 0.318110
bathrooms 0.306648

host_is_superhost_True -0.125351
calculated_host_listings_count_private_rooms -0.138105
number_of_reviews -0.149646
number_of_reviews_ltm -0.160282
cancellation_policy_moderate -0.164931
reviews_per_month -0.199273
room_type_Private room -0.326181
Name: price, Length: 185, dtype: float64

4.2 Heatmap

But we get 178 features, I don’t think this picture would help us.

# Generate a mask for the upper triangle
mask = np.zeros_like(corr_matrix, dtype=np.bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr_matrix, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

在这里插入图片描述

4.3 Prediction

corr_matrix['score'].sort_values(ascending=False)

score 1.000000
host_is_superhost_True 0.368763
amenities 0.183053
number_of_reviews_ltm 0.162046
reviews_per_month 0.151192
cancellation_policy_moderate 0.146941
number_of_reviews 0.145018
city_seattle 0.144506
property_type_Guest suite 0.103642
host_identity_verified_True 0.094759

price -0.080093
neighbourhood_cleansed_Mission Hill -0.081475
neighbourhood_cleansed_Back Bay -0.087674
availability_90 -0.097441
calculated_host_listings_count_shared_rooms -0.103292
require_guest_profile_picture_True -0.112825
instant_bookable_True -0.117803
require_guest_phone_verification_True -0.119340
neighbourhood_cleansed_Downtown -0.121227
cancellation_policy_strict_14_with_grace_period -0.138403
availability_60 -0.144002
cancellation_policy_strict -0.156851
availability_30 -0.181728
calculated_host_listings_count_private_rooms -0.195279
host_total_listings_count -0.225087
calculated_host_listings_count_entire_homes -0.230170
calculated_host_listings_count -0.255190
Name: score, Length: 185, dtype: float64

4.3.1 linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
X = df.drop(columns=['score','city_seattle'])
Y = df['score']
X_train, X_test, y_train, y_test = train_test_split(X, Y , test_size=0.2, random_state=42)
linear_model = LinearRegression(normalize=True)
linear_model.fit(X_train, y_train)
y_preds = linear_model.predict(X_test)
r2_scores = r2_score(y_test,y_preds)
r2_scores

0.24278702995958334
r2_score is really near zero, so LinearRegression is not best choice for us to predict the ‘score’.
This time I will use RamdonForest model, and chose 10 features I have selected above.

4.3.2 RandomForest
list_features = ['host_is_superhost_True',
                 'amenities',
                 'number_of_reviews_ltm',
                 'reviews_per_month',
                 'cancellation_policy_moderate',
                 'host_identity_verified_True',
                 'is_location_exact_True',
                 'host_response_rate',
                 'extra_people',
                 'security_deposit',
                 'calculated_host_listings_count',
                 'cancellation_policy_strict',
                 'neighbourhood_cleansed_Downtown',
                 'require_guest_phone_verification_True',
                 'require_guest_profile_picture_True',
                 'instant_bookable_True',
                 'require_guest_profile_picture_True',
                 'price',
                 'available_True',
                 'neighbourhood_cleansed_Chinatown']
X_rfg = df[list_features]
Y_rfg = df['score']
X_train, X_test, y_train, y_test = train_test_split(X_rfg, Y_rfg, test_size=0.3, random_state=42)
rfg_model = RandomForestRegressor(max_depth=8, random_state=42, n_estimators=100,
                                 min_samples_leaf = 2, min_samples_split = 10)
rfg_model.fit(X_train, y_train)
y_rfg_preds = rfg_model.predict(X_test)
r2_rfg_scores = r2_score(y_test, y_rfg_preds)
r2_rfg_scores

0.38408818776902665
It’s better than lineargression, but still not good.

4.3.3 Modeling Process
from sklearn.model_selection import GridSearchCV
def find_best_rgf(df, cols):
	'''
	Usage: find best RandomRorestRegression model
	Input: 
	df - input dataframe
	col - features
    Output:
    best_clf - best model
    
	'''
	X_rfg = df[list_features2]
	Y_rfg = df['score']
	X_train, X_test, y_train, y_test = train_test_split(X_rfg, Y_rfg, test_size=0.3, random_state=42)
	param_grid = {'max_depth':[8, 16, 20],
              'min_samples_split':[8, 16, 20],
              'min_samples_leaf':[2, 4]}
	grid_search = GridSearchCV(rfg_model2, param_grid, cv=3)
	grid_search.fit(X_train,y_train)
	best_clf = grid_search.best_estimator_
	return best_clf
	

5. Summary

Summary: RandomForest is a better model for predicting the score(scored by guests), both traing sets and testing set get an acceptable or a good re_score and MSE.

Q2: Find a model to predict the score.

def show_importances(model, df):

    importances = model.feature_importances_ 
    feat_names = df.columns
    tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
    tree_result.sort_values(by='importance',ascending=True)[-10:].plot(x='feature', y='importance', kind='barh')

show_importances(best_clf,X_train)

在这里插入图片描述

Q2: The best model to predict the score: RandomForestRegressor( bootstrap=True, criterion=‘mse’, max_depth=20, max_features=‘auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=2, min_samples_split=16, min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=-1, oob_score=False, random_state=42, verbose=0, warm_start=False)

Q3: Acording to this image, we should conclude that all guests are happy to live a great house with a low price.
1, The host should decrease the price of his house properly. If his house get a higher score, more guests would like to rent his house.
2, Price, amenities, wether the host is a superhost, the number of host listings, response rate, security deposit would be top 6 factors that would help the host and his house to get a higher score. And the other 4 factors can only help the hosts a little.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值