What's the difference between high-price and low-price Airbnb houses?



1 Business Understanding

Problem I want to solve:
I split the Seattle listings into two groups by price: high-price houses cost more than the median price ($119), and low-price houses cost at most the median.
Then I want to find out:
Question 1. What's the difference between high-price houses and low-price houses?
Question 2. If you are a low/high-price house host, what should you do to improve the review score value?
Question 3. If we want to become a superhost, what should we do as a high-price or low-price house host?
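The split itself is implemented in Section 4; as a minimal sketch of the idea (it assumes the listings have been loaded as df_Seattle_listings and that price has already been cleaned to a numeric column, which Section 3.3 does):

# minimal sketch of the high/low split used throughout this post
price_median = df_Seattle_listings['price'].median()   # 119 for the Seattle data
df_Seattle_listings['price_flag'] = df_Seattle_listings['price'].apply(
    lambda p: 'low' if p <= price_median else 'high')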

import random
import numbers
from datetime import datetime

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler

pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", None)
%matplotlib inline

2. Data Understanding

2.1 Load the data

path = 'D:/Code/Udacity/02_DataScientist/Write_A_Data_Science_Blog_Post/My_Analysis_Of_ArBNB_new/data/Seattle_AirBNB_Data/'
df_Seattle_listings = pd.read_csv(path + 'listings.csv')

df_Seattle_listings.head(3)

Data preview

2.2 Preview the data

The data mainly cover the following aspects:
Host information:
host_response_time, host_response_rate, host_is_superhost, host_listings_count, host_total_listings_count
House hardware information:
neighbourhood_group_cleansed, zipcode, property_type, room_type, accommodates, bathrooms, bedrooms
Other house information:
price, security_deposit, cleaning_fee, minimum_nights, maximum_nights, availability_365, instant_bookable, cancellation_policy
House score information: review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value

def Value_counts(das, nhead = 5):
    '''Return one row per Series: its nhead most frequent values, their counts, and Null/Other counts.'''
    tmp = das.value_counts().reset_index()
    tmp.columns = [das.name, 'freq']  # values in column 0, counts in column 1

    # one row labelled 'value 0' ... 'value {nhead-1}' holding the top values
    value = pd.DataFrame(['value {}'.format(i) for i in range(nhead)], index = range(nhead)).join(tmp.iloc[:,0], how = 'left').set_index(0).T

    # one row labelled 'freq 0' ... 'freq {nhead-1}' holding their counts
    freq = pd.DataFrame(['freq {}'.format(i) for i in range(nhead)], index = range(nhead)).join(tmp.iloc[:,1], how = 'left').set_index(0).T

    nnull = das.isnull().sum()

    # counts of missing values and of values outside the top nhead
    freqother = pd.DataFrame([nnull, das.shape[0] - nnull - freq.sum(axis = 1).sum()], index = ['freqNull','freqOther']).T

    # align all three pieces on the column name so Summary can concat them
    value.index = freq.index = freqother.index = [das.name]
    return pd.concat([value, freq, freqother], axis = 1)

def Summary(da):
    '''One row per column: dtype, non-null count, describe() stats, and the Value_counts output.'''
    op = pd.concat([pd.DataFrame({"type": da.dtypes, "n": da.notnull().sum(axis = 0)}), da.describe().T.iloc[:,1:],
                    pd.concat(map(lambda i: Value_counts(da.loc[:,i]), da.columns))], axis = 1).loc[da.columns]
    op.index.name = "Columns"
    return op


def MissingCategorial(df,x):
    # NaN != NaN, so (x != x) flags missing values
    missing_vals = df[x].map(lambda x: int(x!=x))
    return sum(missing_vals)*1.0/df.shape[0]

def MissingContinuous(df,x):
    missing_vals = df[x].map(lambda x: int(np.isnan(x)))
    return sum(missing_vals) * 1.0 / df.shape[0]



df_Seattle_listings_summary = Summary(df_Seattle_listings).reset_index()

df_Seattle_listings_summary.to_csv(path+'df_Seattle_listings_summary.csv')

df_Seattle_listings_summary

The table below shows, for each column:

n. The number of non-null values.

type. The dtype of the column.

mean, std, min. The mean, standard deviation, and minimum of the column; null for object-type columns.

25%, 50%, 75%. The quantiles of the column.

value 0 ... value 4. The five most frequent values of the column.

freq 0 ... freq 4. The counts of those five values.

freqNull, freqOther. The counts of null values and of all remaining values.
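For example, the summary row of a single column (here price, as a hypothetical lookup) can be pulled out like this:

# inspect the summary row for one column, e.g. price
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'] == 'price']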

Data preview

Discussion:
From the table above, we can see that several features have only a single value, a high missing rate, or one dominant value. Such features add little to the analysis, so we process them first.

3 Data Preparation (Data Cleaning)

3.1 First pass.

1. Single-value processing. If a feature has only one unique value, it adds nothing to the analysis. We end up deleting scrape_id, experiences_offered, and so on.

2. Null-value processing. If a feature has a missing rate above 0.85, it adds nothing to the analysis. We end up deleting thumbnail_url, xl_picture_url, and so on.

3. High-proportion processing. If a single value accounts for more than 0.9 of a feature, it adds little to the analysis. We end up deleting host_has_profile_pic, street, and so on.

1. Single-value processing.

If a feature has only one unique value, it adds nothing to the analysis. We end up deleting scrape_id, experiences_offered, and so on.

def delete_single_value_features(df,col,all_features,remove_features):
    '''
    Usage: delete a feature that has only a single value
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - list of features still being tracked
    remove_features - list recording the deleted features
    Output:
    df - the processed dataframe
    all_features - the features still being tracked
    remove_features - the features removed so far
    '''

    if len(set(df[col])) == 1:
        print('delete {} from the dataset because it is a constant'.format(col))
        del df[col]
        all_features.remove(col)
        remove_features.append({col:'single_value'})

    return df,all_features,remove_features


all_features = list(df_Seattle_listings.columns)
select_features = list(all_features)
remove_features = []

# iterate over a copy, since the helper removes items from select_features
for col in list(select_features):
    df_Seattle_listings,select_features,remove_features = delete_single_value_features(df_Seattle_listings,col,select_features,remove_features)

remove_features


2. Null-value processing.

If a feature has a missing rate above 0.85, it adds nothing to the analysis. We end up deleting thumbnail_url, xl_picture_url, and so on.

def process_null_value(df,col,all_features,remove_features,threshold_rate):
    '''
    Usage: drop the column if its missing rate exceeds threshold_rate
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - list of features still being tracked
    remove_features - list recording the deleted features
    threshold_rate - missing-rate threshold
    Output:
    df, all_features, remove_features - the updated versions of the inputs
    '''
    miss_rate = df[col].isnull().sum()/df.shape[0]
    if miss_rate > threshold_rate:
        print('{} has a miss rate of {} and is removed'.format(col,miss_rate))
        df = df.drop([col],axis = 1)
        remove_features.append({col:'miss rate is too high'})
        all_features.remove(col)
    return df,all_features,remove_features


# drop columns with too many missing values
threshold_rate = 0.85

# iterate over a copy, since the helper removes items from select_features
for col in list(select_features):
    df_Seattle_listings,select_features,remove_features = process_null_value(df_Seattle_listings,col,select_features,remove_features,threshold_rate)

remove_features


3. High-proportion processing.

If a single value accounts for more than 0.9 of a feature, it adds little to the analysis. We end up deleting host_has_profile_pic, street, and so on.

def delete_high_proportion_features(df,col,all_features,remove_features,threshold_rate = 0.9):
    '''
    Usage: drop the column if its most frequent value's proportion exceeds threshold_rate
    Input:
    df - input dataframe
    col - the feature to be processed
    all_features - list of features still being tracked
    remove_features - list recording the deleted features
    threshold_rate - proportion threshold
    Output:
    df, all_features, remove_features - the updated versions of the inputs
    '''
    # value_counts() is sorted descending, so the first entry is the modal count
    most_proportion = df[col].value_counts().iloc[0]/df.shape[0]
    if most_proportion > threshold_rate:
        df = df.drop([col],axis = 1)
        all_features.remove(col)
        remove_features.append({col:'high proportion'})
        print('{} has a most proportion of {} and is removed'.format(col,most_proportion))
    return df,all_features,remove_features

# drop columns where a single value accounts for more than 0.9 of the rows
threshold_rate = 0.9

for col in list(select_features):
    df_Seattle_listings,select_features,remove_features = delete_high_proportion_features(df_Seattle_listings,col,select_features,remove_features,threshold_rate)

remove_features


# inspect the remaining variables
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'].isin(select_features)]


3.2 Choose variables for further observation

Within each price band, we want to find the factors that most influence guests' bookings.

After the first processing pass, we select features to examine from the following aspects:

  1. Host information: host_response_time, host_response_rate, host_is_superhost, host_listings_count, host_total_listings_count
  2. House hardware information: neighbourhood_group_cleansed, zipcode, property_type, room_type, accommodates, bathrooms, bedrooms
  3. Other house information: price, security_deposit, cleaning_fee, minimum_nights, maximum_nights, availability_365, instant_bookable, cancellation_policy
  4. House score information: review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value

select_features = ['host_response_time','host_response_rate','host_is_superhost','host_total_listings_count','neighbourhood_group_cleansed'\
                ,'zipcode','property_type','room_type','accommodates','bathrooms','bedrooms','beds','price','security_deposit','cleaning_fee','minimum_nights','maximum_nights'\
                 ,'availability_365','number_of_reviews','review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin'\
                   ,'review_scores_communication','review_scores_location','review_scores_value','instant_bookable','cancellation_policy']
df_Seattle_listings_summary[df_Seattle_listings_summary['Columns'].isin(select_features)].reset_index().drop(['index'],axis = 1)

Preview the data again

df_Seattle_listings = df_Seattle_listings[select_features]

# back up the data
df_Seattle_listings_bak = df_Seattle_listings.copy()

df_Seattle_listings.columns

Check the current columns

3.3 Variable transformation (targeted processing)

  1. host_response_time. A faster response suggests better service, so I encode it as an ordinal variable: the larger the value, the better the service.
  2. host_response_rate. This should be numeric, so I strip the "%" from the values.
  3. price, security_deposit, cleaning_fee. These three columns are money values, so I strip the "$" (and thousands separators) from them.
# variable cleaning
# host_response_time: a faster response implies better service, so map it to an ordinal scale
host_response_time_mapping = {'a few days or more':1,'within a day':2,'within a few hours':3,'within an hour':4}

df_Seattle_listings['host_response_time'] = df_Seattle_listings['host_response_time'].replace(host_response_time_mapping)

# host_response_rate: strip the % sign
# (x == x is False for NaN, so missing values are left untouched)
df_Seattle_listings['host_response_rate'] = df_Seattle_listings['host_response_rate'].apply(lambda x:int(str(x).split('%')[0]) if x == x else x)

# clean the money columns: strip '$' and thousands separators
money_features = ['price','security_deposit','cleaning_fee']
for col in money_features:
    df_Seattle_listings[col] = df_Seattle_listings[col].apply(lambda x: float(str(x).replace('$','').replace(',','')) if x == x else x)
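An equivalent, more vectorized pandas idiom (an alternative sketch, meant to replace the loop above rather than run after it; NaN entries pass through the .str accessor unchanged):

for col in money_features:
    # strip '$' and thousands separators, then cast to float; NaN stays NaN
    df_Seattle_listings[col] = df_Seattle_listings[col].str.replace('[$,]', '', regex=True).astype(float)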

3.4 Numerical variable processing

We process the numeric features as follows (a toy sketch of the random-draw fill follows this list):

  1. Whenever a feature has missing values, add a flag column indicating whether each value was null.
  2. If the missing rate is above 0.8, delete the variable; otherwise fill the missing values with random draws from the non-missing values.
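Step 2's random-draw imputation works like this toy example (hypothetical data, for illustration only; the real helper follows below):

# toy illustration of random-draw imputation
s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
not_missing = list(s.dropna())              # donor values
missing_index = s[s.isnull()].index         # positions to fill
# random.sample draws without replacement, so it needs at least as many donors as gaps
s.loc[missing_index] = random.sample(not_missing, len(missing_index))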
# accommodates,bathrooms,beds,minimum_nights,maximum_nights,availability_365,number_of_reviews,review_scores_rating,review_scores_accuracy
# review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value
process_num_features = ['price','security_deposit','cleaning_fee','host_total_listings_count','accommodates','bathrooms','bedrooms','beds','minimum_nights','maximum_nights','availability_365','number_of_reviews',\
                    'review_scores_rating','review_scores_accuracy','review_scores_cleanliness','review_scores_checkin',\
                    'review_scores_communication','review_scores_location','review_scores_value','host_response_time','host_response_rate']

def fill_numeric_null_value(df,col,process_num_features,remove_features,threshold_rate = 0.8):
    '''
    Usage: fill the null values of a numeric column
    Input:
    df - input dataframe
    col - the feature to be processed
    process_num_features - list of numeric features still being tracked
    remove_features - list recording the deleted features
    threshold_rate - missing-rate threshold above which the column is dropped
    Output:
    df, process_num_features, remove_features - the updated versions of the inputs;
    whenever col has missing values, a '<col>_flag' column marks which rows were observed
    '''
    allFeatures = list(df.columns)
    if col in allFeatures:
        miss_rate = df[col].isnull().sum()/df.shape[0]
        print('{} miss rate is {}'.format(col,miss_rate))
        if miss_rate > 0:
            col_flag = str(col)+'_flag'
            # 0 marks a missing (later filled) value, 1 an observed one; NaN != NaN
            df[col_flag] = df[col].map(lambda x: 0 if x != x else 1)
            if miss_rate > threshold_rate:
                df = df.drop([col],axis = 1)
                process_num_features.remove(col)
                remove_features.append({col:'miss rate is too high'})
            else:
                # collect the non-missing values as donors
                not_missing = df.loc[df[col] ==  df[col],col]
                # locate the missing entries
                missing_index = df.loc[df[col] !=  df[col],col].index
                # draw replacements without replacement; this assumes more donors
                # than gaps (i.e. a missing rate below 0.5)
                miss_makeup = random.sample(list(not_missing),len(missing_index))
                # fill the gaps
                df.loc[missing_index,col] = miss_makeup

    return df,process_num_features,remove_features

threshold_rate = 0.8

# iterate over a copy, since the helper removes items from process_num_features
for col in list(process_num_features):
    df_Seattle_listings,process_num_features,remove_features = fill_numeric_null_value(df_Seattle_listings,col,process_num_features,remove_features,threshold_rate)

remove_features


3.5 Categorical variable processing

We process the categorical variables as follows:

  1. If the missing rate is above 0.8, delete the variable; otherwise fill the missing values with a sentinel (-1 for numeric categoricals; string values are uppercased, which turns NaN into the string 'NAN').
  2. One-hot encoding.

def fill_categorical_null_value(df,col,process_cat_features,remove_features,threshold_rate = 0.8):
    '''
    Usage: fill the null values of a categorical column
    Input:
    df - input dataframe
    col - the feature to be processed
    process_cat_features - list of categorical features still being tracked
    remove_features - list recording the deleted features
    threshold_rate - missing-rate threshold above which the column is dropped
    Output:
    df, process_cat_features, remove_features - the updated versions of the inputs
    '''
    allFeatures = list(df.columns)
    if col in allFeatures:
        missingRate = MissingCategorial(df,col)
        print('{0} has missing rate as {1}'.format(col,missingRate))
        if missingRate > threshold_rate:
            process_cat_features.remove(col)
            remove_features.append({col:'miss rate is too high'})
            del df[col]
        if 0 < missingRate < threshold_rate:
            uniq_valid_vals = [i for i in df[col] if i == i]
            uniq_valid_vals = list(set(uniq_valid_vals))
            if isinstance(uniq_valid_vals[0], numbers.Real):
                # numeric categoricals: fill missing entries with the sentinel -1
                missing_position = df.loc[df[col] != df[col]][col].index
                df.loc[missing_position, col] = [-1]*len(missing_position)
            else:
                # string categoricals: uppercasing converts NaN to the string 'NAN'
                df[col] = df[col].map(lambda x: str(x).upper())

    return df,process_cat_features,remove_features


# 对分类变量进行one-hot处理
process_cat_features = ['host_is_superhost','neighbourhood_group_cleansed','zipcode','property_type','room_type','instant_bookable','cancellation_policy']

threshold_rate = 0.8


for col in process_cat_features:
    df_Seattle_listings,process_cat_features,remove_features = fill_categorical_null_value(df_Seattle_listings,col,process_cat_features,remove_features,threshold_rate)


df_Seattle_listings = pd.get_dummies(data = df_Seattle_listings,columns = process_cat_features)

df_listings_clean = df_Seattle_listings.copy()

df_listings_clean_summary = Summary(df_listings_clean).reset_index()
df_listings_clean_summary.to_csv(path +'df_listings_clean_summary.csv')
df_listings_clean_summary


3.6 Review again

After processing the features as above, we check once more for single-value and high-proportion columns.
clean_all_features = list(df_listings_clean.columns)

for col in list(clean_all_features):
    df_listings_clean,clean_all_features,remove_features = delete_single_value_features(df_listings_clean,col,clean_all_features,remove_features)


# drop columns where a single value accounts for more than 0.85 of the rows
threshold_rate = 0.85

for col in list(clean_all_features):
    df_listings_clean,clean_all_features,remove_features = delete_high_proportion_features(df_listings_clean,col,clean_all_features,remove_features,threshold_rate)

remove_features

Because host_is_superhost_f and host_is_superhost_t are complementary one-hot columns (perfectly anti-correlated), we keep only one of them.
We do the same for instant_bookable_f and instant_bookable_t.

df_listings_clean = df_listings_clean.drop(['host_is_superhost_f','instant_bookable_f'],axis = 1)

remove_features.append({'host_is_superhost_f':'Binary redundant variable'})

remove_features.append({'instant_bookable_f':'Binary redundant variable'})

df_listings_clean.to_csv(path+'df_listings_clean.csv')
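A shortcut with a similar effect, as a design note rather than the notebook's approach: pd.get_dummies accepts drop_first=True, which drops the first level of every encoded column (not just these two binary ones) at encoding time:

# alternative: drop one redundant level per categorical while encoding
# df_Seattle_listings = pd.get_dummies(data=df_Seattle_listings, columns=process_cat_features, drop_first=True)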

4. EDA (Exploratory Data Analysis)

I want to find out:

  1. What's the difference between high-price houses and low-price houses?
  2. If we are hosts of high/low-price houses, what should we do to improve the review score?
  3. For superhosts, what's the difference between high- and low-price houses?
obs_cols = ['accommodates','bathrooms','bedrooms','beds','security_deposit','cleaning_fee','minimum_nights','maximum_nights',
            'review_scores_cleanliness','review_scores_location',
            'host_response_time_flag','host_response_rate_flag','security_deposit_flag',
           'host_is_superhost_t','property_type_Apartment','property_type_House','room_type_Entire home/apt','room_type_Private room',
            'instant_bookable_t','cancellation_policy_flexible','cancellation_policy_moderate',
            'cancellation_policy_strict_14_with_grace_period','review_scores_value']

## split the price range: label each listing 'low' or 'high' relative to the median price,
## then compare what drives bookings in the two groups
price_mid = df_listings_clean['price'].quantile(0.5)
print('price_mid = {}'.format(price_mid))
df_listings_clean['price_flag'] = df_listings_clean['price'].apply(lambda x : 'low' if x <= price_mid else 'high')

df_low_price = df_listings_clean[df_listings_clean['price_flag'] == 'low']

df_high_price = df_listings_clean[df_listings_clean['price_flag'] == 'high']
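A quick sanity check (not in the original notebook) that the median split yields two groups of similar size:

# the two groups should each hold roughly half of the listings
print(df_listings_clean['price_flag'].value_counts())
print(df_low_price.shape, df_high_price.shape)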


def get_x_y(df,col):
    '''Extract the top-5 values (plus Null/Other) and their counts from a Summary row.'''
    indexNum = df[df['Columns'] == col].index.tolist()[0]

    x = df.loc[df['Columns'] == col][['value 0','value 1','value 2','value 3','value 4']]

    xList = list(x.T[indexNum])
    xList.append('Null Value')
    xList.append('Other')

    y = df.loc[df['Columns'] == col][['freq 0','freq 1','freq 2','freq 3','freq 4','freqNull','freqOther']]
    yList = list(y.T[indexNum])

    return xList,yList

def campare_plot(df_high,df_low,col):
    '''Side-by-side bar charts of a column's value counts for high- vs low-price houses.'''
    x1,y1 = get_x_y(df_high,col)
    x2,y2 = get_x_y(df_low,col)

    font2 = {'weight':'normal','size': 20}
    colors = ['lightcoral','gold','g','c','m','crimson','brown']

    plt.figure(figsize=(14, 6))

    plt.subplot(1,2,1)
    plt.title('high price house',font2)
    plt.xlabel(col,font2)
    plt.ylabel('count',font2)
    plt.xticks(np.arange(len(x1)), x1)
    plt.bar(np.arange(len(x1)),y1,color = colors)

    plt.subplot(1,2,2)
    plt.title('low price house',font2)
    plt.xlabel(col,font2)
    plt.ylabel('count',font2)
    plt.xticks(np.arange(len(x2)), x2)
    plt.bar(np.arange(len(x2)),y2,color = colors)


df_high_price_summary = Summary(df_high_price).reset_index()
df_low_price_summary = Summary(df_low_price).reset_index()

df_high_price_summary.to_csv(path+'df_high_price_summary.csv')
df_low_price_summary.to_csv(path+'df_low_price_summary.csv')

Question1 What’s the differece between high price houses and low price houses?

Figture explain:
In the follow figtures, I will choose the most 5 proportion value and Null Value and the Other Value ,
to check their differece between high price houses and low price houses

accommodates
The five most frequent accommodates values of high-price houses are (4, 2, 6, 3, 5), versus (2, 4, 3, 1, 5) for low-price houses.
So high-price houses accommodate more guests than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[0])

accommodates compare

bathrooms
In general, most high-price and low-price houses have only one bathroom, but on average high-price houses have more bathrooms than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[1])

bathrooms compare

bedrooms
Similar to bathrooms: most houses, whether high- or low-price, have only one bedroom, but on average high-price houses have more bedrooms than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[2])

bedrooms compare

beds
In general, most high-price and low-price houses have only one bed, but on average high-price houses have more beds than low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[3])

beds compare

security deposit
In general, the security deposit of high-price houses is much higher than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[4])

security compare
cleaning fee
In general, the cleaning fee of high-price houses is much higher than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[5])

cleaning fee compare
minimum_nights
In general, the minimum nights of high-price houses is slightly higher than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[6])

minimum nights compare

review scores cleanliness
In general, the cleanliness score of high-price houses is slightly lower than that of low-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,obs_cols[8])

review scores cleanliness compare

host_is_superhost_t
In general, low-price houses have a higher proportion of superhosts than high-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,'host_is_superhost_t')

host superhost compare

cancellation_policy_flexible
In general, low-price houses more often have a flexible cancellation policy than high-price houses.

campare_plot(df_high_price_summary,df_low_price_summary,'cancellation_policy_flexible')

cancellation_policy_flexible

review_scores_value
In general, the review score value of high-price houses is slightly higher than that of low-price houses, but not by much.

campare_plot(df_high_price_summary,df_low_price_summary,'review_scores_value')

review_scores_value

Question1 What’s the differece between high price houses and low price houses?
Conclusion

  1. Household Appliances. The high price houses provide more facility than low price houses,like accommodates,bedrooms,bathrooms and beds.
  2. House sevice. The low price houses performance better than the high price houses,for example,low price houses needs less cleaning fee than
    high price houses,and more proportion of low price houses’ hosts are superhost.
  3. review score value. The price dosen’t influence review scores value very much.

5. Build Models

Question 2: If you are a low/high-price house host, what should you do to improve the review score value?


def ROC_AUC(df, score, target, plot=True):
    '''Compute AUC by sweeping a decision threshold over the score column.'''
    df2 = df.copy()
    s = sorted(set(df2[score]))
    tpr_list = [0]
    fpr_list = [0]
    for k in s:
        df2['label_temp'] = df[score].map(lambda x: int(x >= k))
        # confusion-matrix cells at threshold k
        TP = df2[(df2.label_temp == 1) & (df2[target] == 1)].shape[0]
        FP = df2[(df2.label_temp == 1) & (df2[target] == 0)].shape[0]
        FN = df2[(df2.label_temp == 0) & (df2[target] == 1)].shape[0]
        TN = df2[(df2.label_temp == 0) & (df2[target] == 0)].shape[0]
        TPR = TP / (TP + FN) if (TP + FN) > 0 else 0
        FPR = FP / (FP + TN) if (FP + TN) > 0 else 0
        tpr_list.append(TPR)
        fpr_list.append(FPR)
    tpr_list.append(1)
    fpr_list.append(1)
    ROC_df = pd.DataFrame({'fpr': fpr_list, 'tpr': tpr_list})
    ROC_df = ROC_df.sort_values(by='fpr')
    ROC_df = ROC_df.drop_duplicates()
    # trapezoidal rule: AUC is the integral of TPR over FPR
    auc = 0
    ROC_mat = np.asarray(ROC_df)
    for i in range(1, ROC_mat.shape[0]):
        auc = auc + (ROC_mat[i, 1] + ROC_mat[i - 1, 1]) * (ROC_mat[i, 0] - ROC_mat[i - 1, 0]) * 0.5
    if plot:
        plt.plot(ROC_df['fpr'], ROC_df['tpr'])
        plt.plot([0, 1], [0, 1])
        plt.title("AUC={}%".format(int(auc * 100)))
    return auc
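As a sanity check (not part of the original notebook), the hand-rolled AUC should agree with sklearn's roc_auc_score on synthetic data:

# hypothetical cross-check of the manual AUC against sklearn
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 2, 200)            # fake binary labels
y_score = y_true * 0.5 + rng.rand(200)     # noisy scores correlated with the labels
check = pd.DataFrame({'real': y_true, 'pred': y_score})

print(ROC_AUC(check, 'pred', 'real', plot=False))   # manual trapezoidal AUC
print(roc_auc_score(y_true, y_score))               # sklearn reference value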

def GridSearch(X_train, X_test, y_train, y_test, criterion = ['mse'], tree_Flag = 'regression', n_estimators = [300, 600],
               method = 'RF', learning_rate = 0.5, validate = False, cv = 5,
               max_features = ['auto'], max_depth = [10, 20, 40], min_samples_leaf = [2,4], min_samples_split = [10,20,40], n_jobs = -1):
    '''
    Usage: use grid search to find good hyperparameters for a random forest regressor or classifier.
    Note: method and learning_rate are kept for interface compatibility but are unused;
    on sklearn >= 1.0, pass criterion = ['squared_error'] instead of ['mse'].
    Input: training and testing sets for X and y
    Output: the best estimator found
    '''

    best_clf = np.nan
    # regression branch: grid-search all parameters jointly
    if tree_Flag == 'regression':
        
        parameters = {'criterion': criterion,
                  'n_estimators': n_estimators,
                  'max_depth': max_depth,
                  'min_samples_leaf':min_samples_leaf,
                  'max_features':max_features,
                  'min_samples_split':min_samples_split
                 }
        
        clf = RandomForestRegressor(random_state=42, n_jobs = n_jobs)

        #Use gridsearch to find the best-model parameters.
        grid_obj = GridSearchCV(clf, parameters, cv = cv)
        grid_fit = grid_obj.fit(X_train, y_train)

        #obtaining best model, fit it to training set
        best_clf = grid_fit.best_estimator_
        best_clf.fit(X_train, y_train)

        # Make predictions using the new model.
        best_train_predictions = best_clf.predict(X_train)
        print('The training MSE Score is', mean_squared_error(y_train, best_train_predictions))
        print('The training R2 Score is', r2_score(y_train, best_train_predictions))

        if validate:
            best_test_predictions = best_clf.predict(X_test)
            print('The testing MSE Score is', mean_squared_error(y_test, best_test_predictions))
            print('The testing R2 Score is', r2_score(y_test, best_test_predictions))
            
    # classification branch: tune n_estimators first, then tree shape, then max_features
    elif tree_Flag == 'classifier':
       
        param_test1 = {'n_estimators':n_estimators}
    
        gsearch1 = GridSearchCV(estimator = RandomForestClassifier(),param_grid = param_test1, scoring='roc_auc',cv=5)
        gsearch1.fit(X_train, y_train)
        best_n_estimators = gsearch1.best_params_['n_estimators'] 
        
        param_test2 = {'max_depth':max_depth, 'min_samples_split':min_samples_split, 'min_samples_leaf':min_samples_leaf}
        
        gsearch2 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= best_n_estimators),param_grid = param_test2, scoring='roc_auc',cv=5)
        
        gsearch2.fit(X_train, y_train)
        
        best_max_depth, best_min_samples_split, best_min_samples_leaf = gsearch2.best_params_['max_depth'],gsearch2.best_params_['min_samples_leaf'],gsearch2.best_params_['min_samples_split']

        param_test3 ={'max_features':['sqrt','log2']}
        gsearch3 = GridSearchCV(estimator = RandomForestClassifier(n_estimators= best_n_estimators,
                                                                   max_depth = best_max_depth,
                                                                   min_samples_split = best_min_samples_split,
                                                                   min_samples_leaf = best_min_samples_leaf),
                                param_grid = param_test3, scoring='roc_auc',cv=5)
        gsearch3.fit(X_train,y_train)
        best_max_features = gsearch3.best_params_['max_features']

        best_clf = RandomForestClassifier(oob_score=True, n_estimators= best_n_estimators,
                                    max_depth = best_max_depth,min_samples_split = best_min_samples_split,
                                    min_samples_leaf = best_min_samples_leaf,max_features = best_max_features)
        
        best_clf.fit(X_train,y_train)
#         print(best_clf.oob_score_)
        y_predprob = best_clf.predict_proba(X_train)[:,1]
        result = pd.DataFrame({'real':y_train,'pred':y_predprob})
        #print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
        auc = ROC_AUC(result, 'pred', 'real',False)
        
        print('The training Auc is', auc )
        
        if validate:
            y_predprob = best_clf.predict_proba(X_test)[:,1]
            result = pd.DataFrame({'real':y_test,'pred':y_predprob})
            #print("AUC Score (Train): %f" % metrics.roc_auc_score(y, y_predprob))
            auc = ROC_AUC(result, 'pred', 'real',False)
            print('The testing Auc is', auc )
        
    return best_clf

# feature importance for low-price houses (review score target)
df_low_price_X = df_low_price[obs_cols].drop(['review_scores_value'],axis = 1)
df_low_price_y = df_low_price[['review_scores_value']].iloc[:,0]
df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test = train_test_split(df_low_price_X, df_low_price_y, test_size = 0.3, random_state = 42)



# grid-search parameters for the low-price regression model
criterion = ['mse']
method = 'RF'
n_estimators = [200, 400]
max_features = [10, 15, 22]
max_depth = [10, 20, 40]
min_samples_leaf = [2, 4]
learning_rate = 0.001

best_clf = GridSearch(df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test,tree_Flag = 'regression', method = method, learning_rate = learning_rate, \
                         criterion = criterion, n_estimators = n_estimators, \
                         max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf)



def show_importances(best_clf,df):
    '''Plot the ten most important features of a fitted tree ensemble.'''
    importances = best_clf.feature_importances_
    feat_names = df.columns
    tree_result = pd.DataFrame({'feature': feat_names, 'importance': importances})
    tree_result.sort_values(by='importance',ascending=True)[-10:].plot(x='feature', y='importance', kind='barh')

show_importances(best_clf,df_low_price_X_train)


# feature importance for high-price houses (review score target)
df_high_price_X = df_high_price[obs_cols].drop(['review_scores_value'],axis = 1)
df_high_price_y = df_high_price[['review_scores_value']].iloc[:,0]
df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test = train_test_split(df_high_price_X, df_high_price_y, test_size = 0.3, random_state = 42)


criterion = ['mse']
method = 'RF' 
n_estimators =  [200,400]
max_features = [10,15,22] 
max_depth = [10, 20, 40] 
min_samples_leaf = [2,4]
learning_rate = 0.001

best_clf = GridSearch(df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test, tree_Flag = 'regression',method = method, learning_rate = learning_rate, \
                         criterion = criterion, n_estimators = n_estimators, \
                         max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf)


show_importances(best_clf,df_high_price_X_train)


Question 2: If you are a low/high-price house host, what should you do to improve the review score value?
Conclusion

  1. From the figures above, we can see that users of both high-price and low-price houses care about review_scores_cleanliness, cleaning_fee, security_deposit, maximum_nights, minimum_nights, and accommodates.
  2. If you host a low-price house, try to become a superhost first, and you should probably avoid a strict cancellation policy with a grace period.
  3. If you host a high-price house, pay more attention to beds and bedrooms, and to whether the house is an apartment.

Question 3: What features influence whether a host becomes a superhost, for high-price and low-price houses?

# feature importance for low-price houses (superhost target)
df_low_price_X = df_low_price[obs_cols].drop(['host_is_superhost_t'],axis = 1)
df_low_price_y = df_low_price[['host_is_superhost_t']].iloc[:,0]
df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test = train_test_split(df_low_price_X, df_low_price_y, test_size = 0.3, random_state = 42)


criterion = ['gini']
method = 'RF' 
n_estimators =  [200,400]
max_features = [10,15,22] 
max_depth = [10, 20, 40] 
min_samples_split = [10,20,40]
min_samples_leaf = [2,4]
learning_rate = 0.001

best_clf = GridSearch(df_low_price_X_train, df_low_price_X_test, df_low_price_y_train, df_low_price_y_test,tree_Flag = 'classifier', method = method, learning_rate = learning_rate, \
                         criterion = criterion, n_estimators = n_estimators, \
                         max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf,min_samples_split = min_samples_split)


show_importances(best_clf,df_low_price_X_train)


# feature importance for high-price houses (superhost target)
df_high_price_X = df_high_price[obs_cols].drop(['host_is_superhost_t'],axis = 1)
df_high_price_y = df_high_price[['host_is_superhost_t']].iloc[:,0]
df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test = train_test_split(df_high_price_X, df_high_price_y, test_size = 0.3, random_state = 42)

criterion = ['gini']
method = 'RF' 
n_estimators =  [200,400]
max_features = [10,15,22] 
max_depth = [10, 20, 40] 
min_samples_split = [10,20,40]
min_samples_leaf = [2,4]
learning_rate = 0.001

best_clf = GridSearch(df_high_price_X_train, df_high_price_X_test, df_high_price_y_train, df_high_price_y_test,tree_Flag = 'classifier', method = method, learning_rate = learning_rate, \
                         criterion = criterion, n_estimators = n_estimators, \
                         max_features = max_features, max_depth = max_depth, min_samples_leaf = min_samples_leaf,min_samples_split = min_samples_split)


show_importances(best_clf,df_high_price_X_train)


Question 3: If we want to become a superhost, what should we do as a high-price or low-price house host?
Conclusion

From the figures above, we can see that hosts of both low- and high-price houses are influenced by cleaning_fee, maximum_nights, review_scores_value, security_deposit, review_scores_cleanliness, host_response_rate_flag, and host_response_time_flag.
So if we want to become a superhost, there is not much difference between low-price houses and high-price houses.
