Analysis Of AirBNB
![Image](https://i-blog.csdnimg.cn/blog_migrate/472da6df67698a2247af2eb9b141c0a6.png)
1 Business understanding
Problem I want to solve:
I split the Seattle houses into two groups by price. A high price house has a price above the median price (119). A low price house has a price at or below the median price (119).
Then I want to find out:
Question1. What's the difference between high price houses and low price houses?
Question2. If you are a low/high price house host, what should you do to improve the review score value?
Question3. If we are house hosts and want to be superhosts, what should we do as a high price house host or a low price house host?
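The median split described above can be sketched on toy data. The prices here are made-up values, and treating a price equal to the median as "low" is an assumption of this sketch.

```python
import pandas as pd

# Toy prices standing in for the Seattle listings (hypothetical values)
listings = pd.DataFrame({"price": [60.0, 95.0, 119.0, 150.0, 300.0]})

median_price = listings["price"].median()

# Listings at or below the median are "low", the rest are "high"
listings["price_flag"] = [
    "low" if p <= median_price else "high" for p in listings["price"]
]
print(median_price)
print(listings)
```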
import random
import numbers
from datetime import datetime

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error

# Show every column and row when displaying DataFrames
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
%matplotlib inline
2. Data understanding
2.1 Load the data
path = 'D:/Code/Udacity/02_DataScientist/Write_A_Data_Science_Blog_Post/My_Analysis_Of_ArBNB_new/data/Seattle_AirBNB_Data/'
df_Seattle_listings = pd.read_csv(path + 'listings.csv')
df_Seattle_listings.head(3)
2.2 Preview the data
The data are mainly divided into the following aspects:
- Host information. host_response_time,host_response_rate,host_is_superhost,host_listings_count,host_total_listings_count
- House hardware information. neighbourhood_group_cleansed,zipcode,property_type,room_type,accommodates,bathrooms,bedrooms
- House other information. price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,availability_365,instant_bookable,cancellation_policy
- House score information. review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
def Value_counts(das, nhead = 5):
    # One-row summary of a column: its nhead most frequent values, their counts,
    # the null count and the count of all remaining values.
    counts = das.value_counts().head(nhead)
    # Pad with NaN when the column has fewer than nhead distinct values
    pad = [np.nan] * (nhead - len(counts))
    row = {'value {}'.format(i): v for i, v in enumerate(list(counts.index) + pad)}
    row.update({'freq {}'.format(i): v for i, v in enumerate(list(counts.values) + pad)})
    nnull = das.isnull().sum()
    row['freqNull'] = nnull
    row['freqOther'] = das.shape[0] - nnull - counts.sum()
    op = pd.DataFrame([row], index = [das.name])
    return(op)
def Summary(da):
    # Per-column overview: dtype, non-null count, describe() statistics and the top-value summary
    op = pd.concat([pd.DataFrame({"type": da.dtypes, "n": da.notnull().sum(axis = 0)}), da.describe().T.iloc[:,1:],
                    pd.concat(map(lambda i: Value_counts(da.loc[:,i]), da.columns))], axis = 1).loc[da.columns]
    op.index.name = "Columns"
    return(op)
def MissingCategorial(df,x):
    # Share of missing values in a categorical column (NaN is the only value not equal to itself)
    missing_vals = df[x].map(lambda x: int(x!=x))
    return sum(missing_vals)*1.0/df.shape[0]
def MissingContinuous(df,x):
    # Share of missing values in a numeric column
    missing_vals = df[x].map(lambda x: int(np.isnan(x)))
    return sum(missing_vals) * 1.0 / df.shape[0]
df_Seattle_listings_summary = Summary(df_Seattle_listings).reset_index()
df_Seattle_listings_summary.to_csv(path+'df_Seattle_listings_summary.csv')
df_Seattle_listings_summary
The summary table shows:
n. The number of non-null values in the column.
type. The data type of the column.
mean,std,min. The mean, standard deviation and minimum of the column. These are null for object-type columns.
25%,50%,75%. The quartiles of the column.
value 0,value 1,value 2,value 3,value 4. The five most frequent values of the column.
freq 0,freq 1,freq 2,freq 3,freq 4. The counts of those five most frequent values.
freqNull,freqOther. The count of null values and the count of all other values in the column.
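To make the value/freq columns concrete, here is a minimal sketch on a toy column. The column name and its contents are made up for illustration, and only the top two values are kept instead of five.

```python
import pandas as pd

# Toy column with one missing entry (hypothetical data)
col = pd.Series(["a", "b", "a", None, "c", "a", "b"], name="room_type")

counts = col.value_counts()   # non-null values, most frequent first
top = counts.head(2)          # plays the role of value 0/1 and freq 0/1

freq_null = int(col.isnull().sum())
# Everything that is neither null nor one of the top values
freq_other = int(len(col) - freq_null - top.sum())

print(list(top.index), list(top.values), freq_null, freq_other)
```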
Discussion:
From the table above, we can see that several features have only a single value, a high missing rate, or one dominant value. Those features are of little use for our analysis, so we process them first.
3 Data preparation: data cleaning
3.1 First process.
1 Single value process. If a feature has only one unique value, it has no value for our analysis. This step deletes scrape_id, experiences_offered, and so on.
2 Null value process. If a feature has a miss rate above 0.85, it has no value for our analysis. This step deletes thumbnail_url, xl_picture_url, and so on.
3 Big proportion process. If a single value accounts for more than 0.9 of a feature's rows, the feature has little value for our analysis. This step deletes host_has_profile_pic, street, and so on.
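The three rules above can be sketched as a single pass over the columns. This is only an illustration: the helper name `columns_to_drop` and the toy frame are hypothetical, and the thresholds follow the values stated in the text.

```python
import numpy as np
import pandas as pd

def columns_to_drop(df, miss_threshold=0.85, dominant_threshold=0.9):
    """Return {column: reason} for columns that carry little information."""
    reasons = {}
    n_rows = len(df)
    for col in df.columns:
        counts = df[col].value_counts(dropna=True)
        miss_rate = df[col].isnull().mean()
        if df[col].nunique(dropna=False) == 1:
            reasons[col] = "single value"
        elif miss_rate > miss_threshold:
            reasons[col] = "miss rate is too high"
        elif len(counts) > 0 and counts.iloc[0] / n_rows > dominant_threshold:
            reasons[col] = "high proportion"
    return reasons

# Toy frame: one constant column, one mostly-missing column,
# one column dominated by a single value, and one informative column
toy = pd.DataFrame({
    "scrape_id": [1] * 20,
    "license": [np.nan] * 19 + ["x"],
    "has_pic": ["t"] * 19 + ["f"],
    "price": list(range(20)),
})
reasons = columns_to_drop(toy)
print(reasons)
```

Collecting the reasons first and dropping afterwards avoids modifying a list while looping over it.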
1 Single value process.
If a feature has only one unique value, it has no value for our analysis. This step deletes scrape_id, experiences_offered, and so on.
def delete_singe_value_features(df,col,all_features,remove_features):
    '''
    Usage: delete the feature if it has a single value
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - all features in the df
    remove_features - list recording the deleted features
    Output:
    df - the processed dataframe
    all_features - all features still being considered
    remove_features - features removed from all features
    '''
    # nunique(dropna=False) counts NaN as one value, so an all-NaN column is also a constant
    if df[col].nunique(dropna=False) == 1:
        print('delete {} from the dataset because it is a constant'.format(col))
        del df[col]
        all_features.remove(col)
        remove_features.append({col:'singe_value'})
    return df,all_features,remove_features
all_features = list(df_Seattle_listings.columns)
select_features = all_features
remove_features = []
threshold_rate = 0.85
# Iterate over a copy because the function removes items from select_features
for col in list(select_features):
    df_Seattle_listings,select_features,remove_features = delete_singe_value_features(df_Seattle_listings,col,select_features,remove_features)
remove_features
2 Null value process.
If a feature has a miss rate above 0.85, it has no value for our analysis. This step deletes thumbnail_url, xl_picture_url, and so on.
def process_null_value(df,col,all_features,remove_features,threshold_rate):
    '''
    Usage: drop the column if its miss rate is bigger than threshold_rate
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - all features still being considered
    remove_features - list recording the deleted features
    threshold_rate - threshold rate
    Output:
    df - the processed dataframe
    all_features - all features still being considered
    remove_features - features removed from all features
    '''
    miss_rate = df[col].isnull().sum()/df.shape[0]
    if miss_rate > threshold_rate:
        print('{} has a miss rate {} and be removed'.format(col,miss_rate))
        df = df.drop([col],axis = 1)
        remove_features.append({col:'miss rate is too high'})
        all_features.remove(col)
    return df,all_features,remove_features
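# Drop the columns with a high miss rate
threshold_rate = 0.85
# Iterate over a copy because the function removes items from select_features
for col in list(select_features):
    df_Seattle_listings,select_features,remove_features = process_null_value(df_Seattle_listings,col,select_features,remove_features,threshold_rate)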
remove_features
3 Big proportion process.
If a single value accounts for more than 0.9 of a feature's rows, the feature has little value for our analysis. This step deletes host_has_profile_pic, street, and so on.
def delete_high_proportion_features(df,col,all_features,remove_features,threshold_rate = 0.9):
    '''
    Usage: drop the column if its most frequent value accounts for more than threshold_rate of the rows
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - all features still being considered
    remove_features - list recording the deleted features
    threshold_rate - threshold rate
    Output:
    df - the processed dataframe
    all_features - all features still being considered
    remove_features - features removed from all features
    '''
    # value_counts() sorts by count in descending order, so the first entry is the most frequent value
    counts = df[col].value_counts()
    most_proportion = counts.iloc[0]/df.shape[0] if len(counts) > 0 else 0
    if most_proportion > threshold_rate:
        df = df.drop([col],axis = 1)
        all_features.remove(col)
        remove_features.append({col:'high proportion'})
        print('{} has a most proportion ={} ,and be removed'.format(col,most_proportion))
    return df,all_features,remove_features
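# Drop the columns where a single value accounts for more than 0.9 of the rows
threshold_rate = 0.9
# Iterate over a copy because the function removes items from select_features
for col in list(select_features):
    df_Seattle_listings,select_features,remove_features = delete_high_proportion_features(df_Seattle_listings,col,select_features,remove_features,threshold_rate)
remove_features
# Inspect the remaining variables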
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'].isin(all_features)]
3.2 Choose variables to continue observing
The goal is to find, within each price range, the factors that most influence repeat bookings.
After the first processing step, we select features to examine from the following groups:
- Host information. host_response_time,host_response_rate,host_is_superhost,host_listings_count,host_total_listings_count
- House hardware information. neighbourhood_group_cleansed,zipcode,property_type,room_type,accommodates,bathrooms,bedrooms
- House other information. price,security_deposit,cleaning_fee,minimum_nights,maximum_nights,availability_365,instant_bookable,cancellation_policy
- House score information. review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
select_features = ['host_response_time','host_response_rate','host_is_superhost','host_total_listings_count','neighbourhood_group_cleansed'\
,'zipcode','property_type','room_type','accommodates','bathrooms','bedrooms','beds','price','security_deposit','cleaning_fee','minimum_nights','maximum_nights'\
,'availability_365','number_of_reviews','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin'\
,'review_scores_communication','review_scores_location','review_scores_value','instant_bookable','cancellation_policy']
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'].isin(select_features)].reset_index().drop(['index'],axis = 1)
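# Take an explicit copy so the later column assignments do not write into a slice
df_Seattle_listings = df_Seattle_listings[select_features].copy()
# Back up the data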
df_Seattle_listings_bak = df_Seattle_listings.copy()
df_Seattle_listings.columns
3.3 Variable transformation (targeted processing)
- host_response_time. A faster host response time suggests better service, so I convert this feature to an ordinal variable. The bigger the value, the better the service.
- host_response_rate. The host_response_rate should be a numerical value, so I trim the "%" from the value.
- **price,security_deposit,cleaning_fee.** These three columns are money values, so I trim the "$" and the thousands separator from them.
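As an illustration of these three transformations, here is a minimal vectorized sketch on toy values. The sample strings are made up, and using `pd.to_numeric` with the pandas string methods is one possible approach, not necessarily the one used in this post.

```python
import numpy as np
import pandas as pd

# Toy raw values in the same textual form as the listings file (hypothetical rows)
raw = pd.DataFrame({
    "host_response_time": ["within an hour", "within a day", np.nan],
    "host_response_rate": ["96%", "100%", np.nan],
    "price": ["$85.00", "$1,250.00", np.nan],
})

# Ordinal scale: a bigger number means a faster response
order = {"a few days or more": 1, "within a day": 2,
         "within a few hours": 3, "within an hour": 4}
response_time = raw["host_response_time"].map(order)

# Strip the trailing "%" and convert to a number; missing values stay NaN
rate = pd.to_numeric(raw["host_response_rate"].str.rstrip("%"))

# Remove "$" and "," before converting the money column
price = pd.to_numeric(raw["price"].str.replace("[$,]", "", regex=True))

print(response_time.tolist(), rate.tolist(), price.tolist())
```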
# Variable cleaning
# host_response_time: a faster response is taken to mean better service, so map it to an ordinal scale
host_response_time_mapping = {'a few days or more':1,'within a day':2,'within a few hours':3,'within an hour':4}
df_Seattle_listings['host_response_time'] = df_Seattle_listings['host_response_time'].replace(host_response_time_mapping)
# host_response_rate
# Strip the % sign
df_Seattle_listings['host_response_rate'] = df_Seattle_listings['host_response_rate'].apply(lambda x:int(str(x).split('%')[0]) if x == x else x)
# Process the money variables: price, security_deposit, cleaning_fee
money_features = ['price','security_deposit','cleaning_fee']
for col in money_features:
    df_Seattle_listings[col] = df_Seattle_listings[col].apply(lambda x: float(str(x).replace('$','').replace(',','')) if x == x else x)
3.4 Numerical variable processing
We select num_features to process:
- If a variable has any missing values, add a column to indicate whether the value was present (1) or missing (0).
- If the miss rate is more than 0.8, delete the variable.
- If the miss rate is 0.8 or less, fill each missing value with a value drawn at random from the non-missing values.
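The flag-and-fill idea can be sketched on a toy series as follows. The helper name `impute_by_sampling` is hypothetical, and drawing the replacements with replacement (so the fill also works when more than half the values are missing) is an assumption of this sketch.

```python
import numpy as np
import pandas as pd

def impute_by_sampling(series, seed=0):
    """Return (filled series, flag series); flag is 1 where the value was observed."""
    rng = np.random.default_rng(seed)
    flag = series.notnull().astype(int)
    filled = series.copy()
    missing = series.isnull()
    # Draw replacements, with replacement, from the observed values
    filled[missing] = rng.choice(series[~missing].to_numpy(), size=int(missing.sum()))
    return filled, flag

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
filled, flag = impute_by_sampling(s)
print(filled.tolist(), flag.tolist())
```

Random filling keeps the observed distribution of the column roughly intact, while the flag column preserves the information that a value was originally missing.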
# accommodates,bathrooms,beds,minimum_nights,maximum_nights,availability_365,number_of_reviews,review_scores_rating,review_scores_accuracy
# review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
process_num_features = ['price','security_deposit','cleaning_fee','host_total_listings_count','accommodates','bathrooms','bedrooms','beds','minimum_nights','maximum_nights','availability_365','number_of_reviews',\
'review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin',\
'review_scores_communication','review_scores_location','review_scores_value','host_response_time','host_response_rate']
def fill_numeric_null_value(df,col,process_num_features,remove_features,threshold_rate = 0.8):
    '''
    Usage: fill the null values of a numeric column
    Input:
    df - input dataframe
    col - the feature to be processed
    process_num_features - numeric features still being considered
    remove_features - list recording the deleted features
    threshold_rate - miss rate above which the column is dropped
    Output: the dataframe with the column filled (or dropped) and a <col>_flag column, plus the two updated lists
    '''
    allFeatures = list(df.columns)
    if col in allFeatures:
        miss_rate = df[col].isnull().sum()/df.shape[0]
        print('{} miss rate is {}'.format(col,miss_rate))
        if miss_rate > 0:
            # Flag column: 1 if the value was present, 0 if it was missing
            col_flag = str(col)+'_flag'
            df[col_flag] = df[col].map(lambda x: 0 if x != x else 1)
            if miss_rate > threshold_rate:
                df = df.drop([col],axis = 1)
                process_num_features.remove(col)
                remove_features.append({col:'miss rate is too high'})
            else:
                # Get the non-missing values
                not_missing = df.loc[df[col] == df[col],col]
                # Get the positions of the missing values
                missing_index = df.loc[df[col] != df[col],col].index
                # Draw the replacement values at random, with replacement,
                # so this also works when more values are missing than present
                miss_makeup = random.choices(list(not_missing),k = len(missing_index))
                # Fill the missing values
                df.loc[missing_index,col] = miss_makeup
    return df,process_num_features,remove_features
threshold_rate = 0.8
# Iterate over a copy because the function removes items from process_num_features
for col in list(process_num_features):
    df_Seattle_listings,process_num_features,remove_features = fill_numeric_null_value(df_Seattle_listings,col,process_num_features,remove_features,threshold_rate)
remove_features
3.5 Categorical variable processing
We process the categorical variables in the following ways:
- Null value process. If the miss rate is more than 0.8, delete the variable. Otherwise fill the missing values with a special value (-1 for numeric categories, the string 'NAN' for string categories).
- One-hot encoding.
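The two steps can be sketched on a toy column as follows. The placeholder string "MISSING" and the sample values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy categorical column with one missing entry (hypothetical data)
toy = pd.DataFrame({"room_type": ["Private room", np.nan, "Entire home/apt"]})

# Step 1: replace the missing value with an explicit placeholder category
toy["room_type"] = toy["room_type"].fillna("MISSING")

# Step 2: one-hot encode; each category becomes its own indicator column
encoded = pd.get_dummies(toy, columns=["room_type"])
print(encoded)
```

Filling before encoding means the missing entries get their own indicator column instead of an all-zero row.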
def fill_categorical_null_value(df,col,process_cat_features,remove_features,threshold_rate = 0.8):
    '''
    Usage: fill the null values of a categorical column
    Input:
    df - input dataframe
    col - the feature to be processed
    process_cat_features - categorical features still being considered
    remove_features - list recording the deleted features
    threshold_rate - miss rate above which the column is dropped
    Output: the dataframe with the column filled (or dropped), plus the two updated lists
    '''
    allFeatures = list(df.columns)
    if col in allFeatures:
        missingRate = MissingCategorial(df,col)
        print('{0} has missing rate as {1}'.format(col,missingRate))
        if missingRate > threshold_rate:
            process_cat_features.remove(col)
            remove_features.append({col:'miss rate is too high'})
            del df[col]
        elif missingRate > 0:
            uniq_valid_vals = [i for i in df[col] if i == i]
            uniq_valid_vals = list(set(uniq_valid_vals))
            if isinstance(uniq_valid_vals[0], numbers.Real):
                # Numeric categories: fill the missing values with -1
                missing_position = df.loc[df[col] != df[col]][col].index
                not_missing_sample = [-1]*len(missing_position)
                df.loc[missing_position, col] = not_missing_sample
            else:
                # String categories: replace NaN with the string 'NAN' and keep the other
                # values in their original case, so the one-hot column names stay unchanged
                df[col] = df[col].map(lambda x: 'NAN' if x != x else str(x))
    return df,process_cat_features,remove_features
# One-hot encode the categorical variables
process_cat_features = ['host_is_superhost','neighbourhood_group_cleansed','zipcode','property_type','room_type','instant_bookable','cancellation_policy']
threshold_rate = 0.8
# Iterate over a copy because the function removes items from process_cat_features
for col in list(process_cat_features):
    df_Seattle_listings,process_cat_features,remove_features = fill_categorical_null_value(df_Seattle_listings,col,process_cat_features,remove_features,threshold_rate)
df_Seattle_listings = pd.get_dummies(data = df_Seattle_listings,columns = process_cat_features)
df_listings_clean = df_Seattle_listings.copy()
df_listings_clean_summary = Summary(df_listings_clean).reset_index()
df_listings_clean_summary.to_csv(path +'df_listings_clean_summary.csv')
df_listings_clean_summary
3.6 Review again (second pass)
After processing the features in all the ways above, we run the single value and big proportion checks again.
- 1) Single value processing.
- 2) Big proportion processing.
clean_all_features = list(df_listings_clean.columns)
# Iterate over a copy because the function removes items from clean_all_features
for col in list(clean_all_features):
    df_listings_clean,clean_all_features,remove_features = delete_singe_value_features(df_listings_clean,col,clean_all_features,remove_features)
# Drop the columns where a single value accounts for more than 0.85 of the rows
threshold_rate = 0.85
for col in list(clean_all_features):
    df_listings_clean,clean_all_features,remove_features = delete_high_proportion_features(df_listings_clean,col,clean_all_features,remove_features,threshold_rate)
remove_features
Because host_is_superhost_f and host_is_superhost_t are strongly correlated, we keep only one of them.
We then do the same for instant_bookable_f and instant_bookable_t.
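A small check shows why one column of such a pair is redundant: for a two-valued variable, the two indicator columns are perfectly negatively correlated. The toy flags below are made up for illustration.

```python
import pandas as pd

# Toy superhost flags (hypothetical values)
flags = pd.Series(["t", "f", "f", "t", "f"], name="host_is_superhost")

dummies = pd.get_dummies(flags, prefix="host_is_superhost").astype(int)

# The two indicators always sum to 1, so their correlation is -1
corr = dummies["host_is_superhost_f"].corr(dummies["host_is_superhost_t"])

# Keeping only the "_t" column loses no information
kept = dummies.drop(columns=["host_is_superhost_f"])
print(corr)
print(kept)
```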
df_listings_clean = df_listings_clean.drop(['host_is_superhost_f','instant_bookable_f'],axis = 1)
remove_features.append({'host_is_superhost_f':'Binary redundant variable'})
remove_features.append({'instant_bookable_f':'Binary redundant variable'})
df_listings_clean.to_csv(path+'df_listings_clean.csv')
4. EDA (data exploration)
I want to find out:
- What's the difference between high price houses and low price houses?
- If we are hosts of a high/low price house, what should we do to improve the review score?
- For superhosts, what's the difference between high and low price houses?
obs_cols = ['accommodates','bathrooms','bedrooms','beds','security_deposit','cleaning_fee','minimum_nights','maximum_nights',
'review_scores_cleanliness','review_scores_location',
'host_response_time_flag','host_response_rate_flag','security_deposit_flag',
'host_is_superhost_t','property_type_Apartment','property_type_House','room_type_Entire home/apt','room_type_Private room',
'instant_bookable_t','cancellation_policy_flexible','cancellation_policy_moderate',
'cancellation_policy_strict_14_with_grace_period','review_scores_value']
## 2 Split the listings by price into a low and a high group at the median, to compare the factors that matter in each price range.
price_mid = df_listings_clean['price'].quantile(0.5)
print('price_mid = {}'.format(price_mid))
df_listings_clean['price_flag'] = df_listings_clean['price'].apply(lambda x : 'low' if x <= price_mid else 'high' )
df_low_price = df_listings_clean[df_listings_clean['price_flag'] == 'low']
df_high_price = df_listings_clean[df_listings_clean['price_flag'] == 'high']
def get_x_y(df,col):
    # Read the top-five values and their counts for `col` from a Summary() table
    indexNum = df[df['Columns'] == col].index.tolist()[0]
    x = df.loc[df['Columns'] == col][['value 0','value 1','value 2','value 3','value 4']]
    xList = list(x.T[indexNum])
    xList.append('Null Value')
    xList.append('Other')
    y = df.loc[df['Columns'] == col][['freq 0','freq 1','freq 2','freq 3','freq 4','freqNull'
                                      ,'freqOther']]
    yList = list(y.T[indexNum])
    return xList,yList
def campare_plot(df_high,df_low,col):
    # Side-by-side bar charts of the top values of `col` for high and low price houses
    x1,y1 = get_x_y(df_high,col)
    x2,y2 = get_x_y(df_low,col)
    font2 = {'weight':'normal','size': 20}
    colors = ['lightcoral','gold','g','c','m','crimson','brown']
    plt.figure(figsize=(14, 6))
    plt.subplot(1,2,1)
    plt.title('high price house',font2)
    plt.xlabel(col,font2)
    plt.ylabel('count',font2)
    plt.xticks(np.arange(len(x1)), x1)
    plt.bar(np.arange(len(x1)),y1,color = colors)
    plt.subplot(1,2,2)
    plt.xlabel(col,font2)
    plt.title('low price house',font2)
    plt.ylabel('count',font2)
    plt.xticks(np.arange(len(x2)), x2)
    plt.bar(np.arange(len(x2)),y2,color = colors)
df_high_price_summary = Summary(df_high_price).reset_index()
df_low_price_summary = Summary(df_low_price).reset_index()
df_high_price_summary.to_csv(path+'df_high_price_summary.csv')
df_low_price_summary.to_csv(path+'df_low_price_summary.csv')
Question1 What's the difference between high price houses and low price houses?
Figure explanation:
In the following figures, I take the five most frequent values of each feature, plus the Null Value and Other counts, to compare high price houses with low price houses.
accommodates
The five most frequent accommodates values for high price houses are (4,2,6,3,5), while for low price houses they are (2,4,3,1,5).
So the high price houses accommodate more guests than the low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[0])
bathrooms
In general, most high price houses and low price houses have only one bathroom, but on average the high price houses have more bathrooms than the low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[1])
bedrooms
This is similar to bathrooms. Most houses, whether high price or low price, have only one bedroom, but on average the high price houses have more bedrooms than the low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[2])
beds
In general, most high price houses and low price houses have only one bed, but on average the high price houses have more beds than the low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[3])
security deposit
In general, the security deposit of high price houses is much higher than that of low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[4])
cleaning fee
In general, the cleaning fee of high price houses is much higher than that of low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[5])
minimum_nights
In general, the minimum nights of high price houses is a little higher than that of low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[6])
review scores cleanliness
In general, the cleanliness review score of high price houses is a little lower than that of low price houses.
campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[8])
host_is_superhost_t
In general, the low price houses have more superhosts than the high price houses.
campare_plot(df_high_price_summary,df_low_price_summary,'host_is_superhost_t')
cancellation_policy_flexible
In general, more low price houses than high price houses have a flexible cancellation policy.
campare_plot(df_high_price_summary,df_low_price_summary,'cancellation_policy_flexible')
review_scores_value
In general, the review scores value of high price houses is a little higher than that of low price houses, but not by much.
campare_plot(df_high_price_summary,df_low_price_summary,'review_scores_value')
Question1 What's the difference between high price houses and low price houses?
Conclusion
- Household appliances. The high price houses provide more facilities than the low price houses, as seen in accommodates, bedrooms, bathrooms and beds.
- House service. The low price houses perform better than the high price houses. For example, low price houses charge a lower cleaning fee than high price houses, and a larger proportion of low price house hosts are superhosts.
- Review score value. The price doesn't influence the review scores value very much.
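The "on average" comparisons behind these conclusions can be checked numerically with a group-wise mean. The sketch below uses a made-up toy frame with the same column names as the cleaned data; the values are not taken from the Seattle listings.

```python
import pandas as pd

# Toy listings with a price_flag column (hypothetical values)
toy = pd.DataFrame({
    "price_flag": ["low", "low", "high", "high"],
    "bedrooms": [1, 1, 2, 3],
    "cleaning_fee": [20.0, 30.0, 80.0, 100.0],
})

# Mean of each feature within the high and low price groups
means = toy.groupby("price_flag")[["bedrooms", "cleaning_fee"]].mean()
print(means)
```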
5. Build model
Question2 If you are a low/high price house host, what should you do to improve the review score value?
def ROC_AUC(df, score, target, plot=True):
    # Compute the ROC curve and its AUC by using every distinct score as a threshold
    df2 = df.copy()
    s = sorted(set(df2[score]))
    tpr_list = [0]
    fpr_list = [0]
    for k in s:
        df2['label_temp'] = df2[score].map(lambda x: int(x >= k))
        TP = df2[(df2.label_temp == 1) & (df2[target] == 1)].shape[0]
        FP = df2[(df2.label_temp == 1) & (df2[target] == 0)].shape[0]
        FN = df2[(df2.label_temp == 0) & (df2[target] == 1)].shape[0]
        TN = df2[(df2.label_temp == 0) & (df2[target] == 0)].shape[0]
        # Guard against a class being absent from the sample
        TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
        FPR = FP / (FP + TN) if (FP + TN) > 0 else 0
        tpr_list.append(TPR)
        fpr_list.append(FPR)
    tpr_list.append(1)
    fpr_list.append(1)
    ROC_df = pd.DataFrame({'tpr': tpr_list, 'fpr': fpr_list})
    ROC_df = ROC_df.drop_duplicates().sort_values(by=['fpr', 'tpr'])
    # Trapezoidal rule: integrate TPR over FPR
    fpr = ROC_df['fpr'].values
    tpr = ROC_df['tpr'].values
    auc = 0
    for i in range(1, len(fpr)):
        auc = auc + (tpr[i] + tpr[i - 1]) * (fpr[i] - fpr[i - 1]) * 0.5
    if plot:
        plt.plot(ROC_df['fpr'], ROC_df['tpr'])
        plt.plot([0, 1], [0, 1])
        plt.title("AUC={}%".format(int(auc * 100)))
    return auc
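A hand-written AUC is easy to get subtly wrong, so it may be worth comparing a trapezoidal computation with `sklearn.metrics.roc_auc_score` on a small example. The labels and scores below are made-up toy values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Toy labels and predicted scores (hypothetical values)
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr, _ = roc_curve(y_true, y_score)

# Trapezoidal rule: integrate TPR over FPR
manual_auc = sum(
    (tpr[i] + tpr[i - 1]) * (fpr[i] - fpr[i - 1]) * 0.5
    for i in range(1, len(fpr))
)
reference_auc = roc_auc_score(y_true, y_score)
print(manual_auc, reference_auc)
```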
def GridSearch(X_train, X_test, y_train, y_test, criterion = ['squared_error'],tree_Flag = 'regression',n_estimators = [300, 600],
               method = 'RF', learning_rate = 0.5, validate = False, cv = 5,
               max_features = [1.0], max_depth = [10, 20, 40], min_samples_leaf = [2,4],min_samples_split = [10,20,40], n_jobs = -1):
    '''
    Usage: use grid search to find good parameters for a random forest (RF) regressor or classifier.
    Input: training and testing sets from X and y variables; tree_Flag is 'regression' or 'classifier'
    Output: the best estimator, or None if tree_Flag is not recognised
    Note: 'squared_error' and max_features = 1.0 are the current scikit-learn names for the older 'mse' and 'auto'.
    '''
    best_clf = None
    # Regression model
    if tree_Flag == 'regression':
        parameters = {'criterion': criterion,
                      'n_estimators': n_estimators,
                      'max_depth': max_depth,
                      'min_samples_leaf':min_samples_leaf,
                      'max_features':max_features,
                      'min_samples_split':min_samples_split
                      }
        clf = RandomForestRegressor(random_state=42, n_jobs = n_jobs)
        # Use grid search to find the best model parameters
        grid_obj = GridSearchCV(clf, parameters, cv = cv)
        grid_fit = grid_obj.fit(X_train, y_train)
        # Take the best model and fit it to the training set
        best_clf = grid_fit.best_estimator_
        best_clf.fit(X_train, y_train)
        # Make predictions using the new model
        best_train_predictions = best_clf.predict(X_train)
        print('The training MSE Score is', mean_squared_error(y_train, best_train_predictions))
        print('The training R2 Score is', r2_score(y_train, best_train_predictions))
        if validate:
            best_test_predictions = best_clf.predict(X_test)
            print('The testing MSE Score is', mean_squared_error(y_test, best_test_predictions))
            print('The testing R2 Score is', r2_score(y_test, best_test_predictions))
    # Classification model: tune the parameters in three stages
    elif tree_Flag == 'classifier':
        # Stage 1: number of trees
        param_test1 = {'n_estimators':n_estimators}
        gsearch1 = GridSearchCV(estimator = RandomForestClassifier(),param_grid = param_test1, scoring='roc_auc',cv=cv)
        gsearch1.fit(X_train, y_train)
        best_n_estimators = gsearch1.best_params_['n_estimators']
        # Stage 2: tree depth and split/leaf sizes
        param_test2 = {'max_depth':max_depth, 'min_samples_split':min_samples_split, 'min_samples_leaf':min_samples_leaf}
        gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= best_n_estimators),param_grid = param_test2, scoring='roc_auc',cv=cv)
        gsearch2.fit(X_train, y_train)
        best_max_depth, best_min_samples_split, best_min_samples_leaf = gsearch2.best_params_['max_depth'],gsearch2.best_params_['min_samples_split'],gsearch2.best_params_['min_samples_leaf']
        # Stage 3: number of features considered at each split
        param_test3 ={'max_features':['sqrt','log2']}
        gsearch3 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= best_n_estimators,
                                                                   max_depth = best_max_depth,
                                                                   min_samples_split = best_min_samples_split,
                                                                   min_samples_leaf = best_min_samples_leaf),
                                param_grid = param_test3, scoring='roc_auc',cv=cv)
        gsearch3.fit(X_train,y_train)
        best_max_features = gsearch3.best_params_['max_features']
        best_clf = RandomForestClassifier(oob_score=True, n_estimators= best_n_estimators,
                                          max_depth = best_max_depth,min_samples_split = best_min_samples_split,
                                          min_samples_leaf = best_min_samples_leaf,max_features = best_max_features)
        best_clf.fit(X_train,y_train)
        # AUC on the training set
        y_predprob = best_clf.predict_proba(X_train)[:,1]
        result = pd.DataFrame({'real':y_train,'pred':y_predprob})
        auc = ROC_AUC(result, 'pred', 'real',False)
        print('The training Auc is', auc )
        if validate:
            # AUC on the testing set
            y_predprob = best_clf.predict_proba(X_test)[:,1]
            result = pd.DataFrame({'real':y_test,'pred':y_predprob})
            auc = ROC_AUC(result, 'pred', 'real',False)
            print('The testing Auc is', auc )
    return best_clf
# Feature importance for low price houses
df_low_price_X = df_low_price[obs_cols].drop(['review_scores_value'],axis = 1)
df_low_price_y = df_low_price[['review_scores_value']].iloc[:,0]
df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test = train_test_split(df_low_price_X, df_low_price_y, test_size = 0.3, random_state = 42)
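# Feature importance for low price houses
# 'mse' was renamed to 'squared_error' in scikit-learn 1.0
criterion = ['squared_error']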
method = 'RF'
n_estimators = [200,400]
max_features = [10,15,22]
max_depth = [10, 20, 40]
min_samples_leaf = [2,4]
learning_rate = 0.001
tree_Flag = ''
best_clf = GridSearch(df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test,tree_Flag = 'regression', method = method, learning_rate = learning_rate, \
criterion = criterion, n_estimators = n_estimators, \
max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf)
def show_importances(best_clf,df):
    # Horizontal bar chart of the ten most important features
    importances = best_clf.feature_importances_
    feat_names = df.columns
    tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
    tree_result.sort_values(by='importance',ascending=True)[-10:].plot(x='feature', y='importance', kind='barh')
show_importances(best_clf,df_low_price_X_train)
# Feature importance for high price houses
df_high_price_X = df_high_price[obs_cols].drop(['review_scores_value'],axis = 1)
df_high_price_y = df_high_price[['review_scores_value']].iloc[:,0]
df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test = train_test_split(df_high_price_X, df_high_price_y, test_size = 0.3, random_state = 42)
# 'mse' was renamed to 'squared_error' in scikit-learn 1.0
criterion = ['squared_error']
method = 'RF'
n_estimators = [200,400]
max_features = [10,15,22]
max_depth = [10, 20, 40]
min_samples_leaf = [2,4]
learning_rate = 0.001
best_clf = GridSearch(df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test, tree_Flag = 'regression',method = method, learning_rate = learning_rate, \
criterion = criterion, n_estimators = n_estimators, \
max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf)
show_importances(best_clf,df_high_price_X_train)
Question2 If you are a low/high price house host, what should you do to improve the review score value?
Conclusion
- From the figures above, we can see that users of both high price houses and low price houses care about review_scores_cleanliness, cleaning_fee, security_deposit, maximum_nights, minimum_nights and accommodates.
- If you are a low price house host, you should first try to become a superhost, and then you may want to avoid a strict cancellation policy with a grace period.
- If you are a high price house host, pay more attention to beds and bedrooms, and to whether the house is an apartment.
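When reading importance charts like the ones above, it helps to remember that a random forest's `feature_importances_` sum to 1 and rank how much the model relied on each feature, not the direction of the effect. A minimal sketch on synthetic data (the feature names `signal` and `noise` are made up for this example):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends on "signal" only, "noise" is irrelevant
rng = np.random.default_rng(42)
X = pd.DataFrame({"signal": rng.normal(size=200), "noise": rng.normal(size=200)})
y = 3 * X["signal"] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Rank the features by importance, most important first
ranking = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(ranking)
```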
Question3 What features influence whether a host becomes a superhost, for high price and low price houses?
# Feature importance for low price houses
df_low_price_X = df_low_price[obs_cols].drop(['host_is_superhost_t'],axis = 1)
df_low_price_y = df_low_price[['host_is_superhost_t']].iloc[:,0]
df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test = train_test_split(df_low_price_X, df_low_price_y, test_size = 0.3, random_state = 42)
criterion = ['gini']
method = 'RF'
n_estimators = [200,400]
max_features = [10,15,22]
max_depth = [10, 20, 40]
min_samples_split = [10,20,40]
min_samples_leaf = [2,4]
learning_rate = 0.001
best_clf = GridSearch(df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test,tree_Flag = 'classifier', method = method, learning_rate = learning_rate, \
criterion = criterion, n_estimators = n_estimators, \
max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf,min_samples_split = min_samples_split)
show_importances(best_clf,df_low_price_X_train)
# Feature importance for high price houses
df_high_price_X = df_high_price[obs_cols].drop(['host_is_superhost_t'],axis = 1)
df_high_price_y = df_high_price[['host_is_superhost_t']].iloc[:,0]
df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test = train_test_split(df_high_price_X, df_high_price_y, test_size = 0.3, random_state = 42)
criterion = ['gini']
method = 'RF'
n_estimators = [200,400]
max_features = [10,15,22]
max_depth = [10, 20, 40]
min_samples_split = [10,20,40]
min_samples_leaf = [2,4]
learning_rate = 0.001
best_clf = GridSearch(df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test,tree_Flag = 'classifier', method = method, learning_rate = learning_rate, \
criterion = criterion, n_estimators = n_estimators, \
max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf,min_samples_split = min_samples_split)
show_importances(best_clf,df_high_price_X_train)
Question3 If we are house hosts and want to be superhosts, what should we do as a high price house host or a low price house host?
Conclusion
From the figures above, we can see that superhost status for both low and high price houses is influenced by cleaning_fee, maximum_nights, review_scores_value, security_deposit, review_scores_cleanliness, host_response_rate_flag and host_response_time_flag.
So if we want to be superhosts, there is not much difference between low price houses and high price houses.