Airbnb User Operations Data: Analysis and Prediction

Latest recommended article published on 2023-03-26 16:03:01

守望者psh

Category: Data Analysis. Tags: data analysis, python

Article link: https://blog.csdn.net/weixin_44891782/article/details/105983843

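Project Background

This project is based on a Kaggle competition whose goal is to help Airbnb use four years of historical data to predict the travel destination that a new user will book.
Competition link: link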

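Analysis Goals

Profile Airbnb's target users

Evaluate the effectiveness of Airbnb's existing acquisition channels

Apply random forest and GBDT algorithms to predict the destination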

Tools

IDE: Jupyter Notebook
Main libraries:
numpy and pandas for data cleaning
matplotlib and seaborn for visualization
sklearn for modeling

Data Description

train_users_2.csv - the training set of users
test_users.csv - the test set of users
    id: user id
    date_account_created: the date of account creation
    timestamp_first_active: timestamp of the first activity; note that it can be earlier than date_account_created or date_first_booking because a user can search before signing up
    date_first_booking: date of first booking
    gender
    age
    signup_method: how the user signed up
    signup_flow: the page a user came to sign up from
    language: international language preference
    affiliate_channel: what kind of paid marketing
    affiliate_provider: where the marketing is, e.g. google, craigslist, other
    first_affiliate_tracked: the first marketing the user interacted with before signing up
    signup_app: the app used to sign up
    first_device_type: device type
    first_browser: browser type
    country_destination: the country booked; this is the target variable you are to predict
sessions.csv - web sessions log for users
    user_id: to be joined with the column 'id' in the users table
    action: user action
    action_type: type of the action
    action_detail: detail of the action
    device_type: device type
    secs_elapsed: time spent, in seconds
sample_submission.csv - correct format for submitting your predictions
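
The note that user_id in sessions.csv joins with id in the users table can be illustrated with a small sketch. The frames and values below are made up for illustration, and the feature names n_events and total_secs are assumptions rather than anything from the original analysis.

```python
import pandas as pd

# Toy stand-ins for the users table and sessions.csv (values are made up)
users = pd.DataFrame({
    'id': ['u1', 'u2', 'u3'],
    'signup_method': ['basic', 'facebook', 'basic'],
})
sessions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u3'],
    'action': ['search', 'show', 'search'],
    'secs_elapsed': [120.0, 30.0, 45.0],
})

# Aggregate the log to one row per user before joining,
# otherwise each user row is repeated once per session event
per_user = (sessions.groupby('user_id')
            .agg(n_events=('action', 'size'),
                 total_secs=('secs_elapsed', 'sum'))
            .reset_index())

# A left join keeps users without any session data (their features become NaN)
merged = users.merge(per_user, how='left', left_on='id', right_on='user_id')
merged = merged.drop(columns='user_id')
print(merged)
```

A left join is used here so that users with no logged sessions are kept; whether to fill their NaN features with 0 is a modeling choice.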

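Analysis Process

Importing Libraries

    import numpy as np  # numerical arrays
    import pandas as pd  # data frames
    from matplotlib import pyplot as plt  # plotting
    import seaborn as sns
    from sklearn.preprocessing import LabelEncoder  # label encoding
    from sklearn.ensemble import RandomForestClassifier  # random forest model
    from sklearn.naive_bayes import MultinomialNB  # multinomial naive Bayes
    from sklearn.naive_bayes import ComplementNB  # complement naive Bayes
    from xgboost.sklearn import XGBClassifier  # XGBoost model

    # render figures inline in the notebook
    %matplotlib inline

    # the 'seaborn' style was renamed to 'seaborn-v0_8' in matplotlib 3.6
    plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')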

Data Preparation

Data link: link


# load the datasets
train_data=pd.read_csv('train_users_2.csv')  # training set
test_data=pd.read_csv('test_users.csv')  # test set

print('Training set overview:')
print('-' * 30)
train_data.info()
print('-' * 30)
train_data.head()  


print('Test set overview:')
print('-' * 30)
test_data.info()
print('-' * 30)
test_data.head()

# columns that appear in the training set but not in the test set
print('The different columns between train/test data are:')
print(np.setdiff1d(train_data.columns, test_data.columns, assume_unique=True))

# drop the target column from the training set, then concatenate train and test
# (ignore_index avoids duplicate row labels in the combined frame)
all_data=pd.concat([train_data.drop('country_destination',axis=1),test_data],ignore_index=True)

print(train_data.country_destination.value_counts())

# share of each destination in the training set, in percent
sns.set_style()
des_country=train_data.country_destination.value_counts(dropna=False)/train_data.shape[0]*100
des_country.plot(kind='bar',rot=0)

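[Figure: share of each destination country in the training-set target variable]

The train file contains 213,451 rows and 16 columns; info() lists the data type and non-null count of each column.
date_first_booking has many missing values, so dropping it can be considered during feature extraction.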

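I. Data Cleaning

1. In the test set, date_first_booking is empty for every row, so the column can be dropped.
2. gender and age are valuable fields from which key information can be extracted.
3. first_affiliate_tracked and language have a small number of missing values, which can be imputed.
4. first_browser: still to be analyzed.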
# missing values: replace every '-unknown-' placeholder with NaN
all_data.replace(to_replace='-unknown-',value=np.nan,inplace=True)

# outliers: ages below 16 or above 110 are outside a plausible range,
# so they are replaced with NaN
all_data.loc[all_data.age > 110 ,'age']=np.nan
all_data.loc[all_data.age < 16 ,'age']=np.nan

# share of missing values per column, separately for the train and test portions
print('The number of train_data rows :{}'. format(len(train_data)))
print('\nMissing Values as a percentage in training dataset')
print(all_data.iloc[:len(train_data)].isnull().sum().where(lambda x : x>0).dropna()/len(train_data))
print('\nMissing Values as a percentage in testing dataset')
print(all_data.iloc[len(train_data):].isnull().sum().where(lambda x : x>0).dropna()/len(test_data))
    The number of train_data rows :213451

Missing Values as a percentage in training dataset
date_first_booking         0.583473
gender                     0.448290
age                        0.416283
first_affiliate_tracked    0.028414
first_browser              0.127739
dtype: float64

Missing Values as a percentage in testing dataset
date_first_booking         1.000000
gender                     0.544190
age                        0.465859
language                   0.000016
first_affiliate_tracked    0.000322
first_browser              0.275831
dtype: float64
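
The same cleaning and missing-share calculation can be tried on a toy frame. This is a minimal sketch with made-up values; it uses isnull().mean(), which gives the share of missing values directly, as an alternative to dividing the sum by the row count.

```python
import numpy as np
import pandas as pd

# Toy data with the two kinds of problems described above (values are made up)
df = pd.DataFrame({
    'gender': ['MALE', '-unknown-', 'FEMALE', '-unknown-'],
    'age': [25, 2014, 5, 40],
})

# Treat the '-unknown-' placeholder as missing
df = df.replace('-unknown-', np.nan)

# Keep ages in [16, 110]; everything else becomes NaN
df['age'] = df['age'].where(df['age'].between(16, 110))

# Share of missing values per column, only for columns that have any
missing_ratio = df.isnull().mean()
missing_ratio = missing_ratio[missing_ratio > 0]
print(missing_ratio)
```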

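# fill missing values with the mode, then drop the unusable 'date_first_booking' column
# (assigning the result back avoids relying on inplace fillna through attribute access)
all_data['language'] = all_data['language'].fillna(all_data['language'].mode()[0])
all_data['first_affiliate_tracked'] = all_data['first_affiliate_tracked'].fillna(all_data['first_affiliate_tracked'].mode()[0])
all_data.drop('date_first_booking',axis=1,inplace=True)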

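# convert the date columns to datetime
all_data['date_account_created'] = pd.to_datetime(all_data.date_account_created)
# timestamp_first_active is stored as a number like 20090319043255, so cast it to str first
all_data['timestamp_first_active'] = pd.to_datetime(all_data.timestamp_first_active.astype(str), format='%Y%m%d%H%M%S')

# overview of the cleaned data
all_data.info()
all_data.head()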
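
To make the timestamp format concrete, here is a small sketch of the conversion on made-up values. The lag_days column is a hypothetical derived feature, not something the original analysis computes.

```python
import pandas as pd

# Made-up sample values in the YYYYMMDDHHMMSS layout of timestamp_first_active
raw = pd.Series([20090319043255, 20140101000936])
first_active = pd.to_datetime(raw.astype(str), format='%Y%m%d%H%M%S')

# Account creation dates are plain ISO date strings
created = pd.to_datetime(pd.Series(['2010-06-28', '2014-01-01']))

# Whole days between first activity and account creation (hypothetical feature)
lag_days = (created - first_active).dt.days

print(first_active)
print(lag_days)
```

Because the creation date carries no time of day, a user who was first active and signed up on the same day gets a negative lag of less than one day, which `.dt.days` floors to -1.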

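II. User Profile Analysis

Questions about new-user acquisition
1. How many new users does Airbnb gain in a year?

2. Which month sees the most growth?

3. Which channel brings in the most users?

# Airbnb user growth analysis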
fig = plt.figure(figsize=(15,15))
plt.subplot(221)
grouped_df = all_data.groupby([all_data.date_account_created.dt.year, all_data.date_account_created.dt.month]).count().id
grouped_df.plot(kind='line', xticks=range(1, len(grouped_df), 6), rot=45, color='g')
plt.xlabel('Creation Date (Y, M)')
plt.yticks([])
plt.title('Development of user base over years')

ax = fig.add_subplot(222)
all_data.groupby([all_data.timestamp_first_active.dt.year, 
                  all_data.timestamp_first_active.dt.month]).count().id.unstack().plot(
                    kind='line', ax=ax, 
                    xticks=range(all_data.timestamp_first_active.dt.year.min(), 
                                 all_data.timestamp_first_active.dt.year.max()+1),
                                 colormap='Paired')
plt.title('Monthly new users development over years (first activity)')
ax.legend(title='Month')


plt.subplot(223)
all_data.groupby([all_data.date_account_created.dt.month]).count().id.plot(kind='bar', color='g')
plt.title('New accounts by Month\nJul, Aug & Sep are on top')
plt.xlabel('Creation Month')

plt.subplot(224)
all_data.groupby([all_data.date_account_created.dt.year]).count().id.plot(kind='pie', shadow=False, autopct='%.1f%%', pctdistance=0.85)
plt.title('Creation year as percentage of the total user base')
plt.ylabel('User Base')

plt.subplots_adjust(hspace=0.5)
plt.show()

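[Figure: user base over the years, monthly new users by year (first activity), new accounts by month, and creation year as a share of the user base]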

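Findings on user acquisition:

1. Accounts created in 2014 make up 50.3% of the user base and accounts created in 2013 make up 30.1%, so these are the two years with the most new users.
2. Within a year, July, August and September bring the most new sign-ups, while October, November and December bring the fewest.
3. Accounts created in 2012, 2013 and 2014 together account for more than 90% of all users.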
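
The yearly shares quoted above come from the pie chart; the same kind of numbers can be computed directly. A minimal sketch on made-up dates (the real column is date_account_created):

```python
import pandas as pd

# Made-up account creation dates
created = pd.to_datetime(pd.Series([
    '2012-07-01', '2013-08-15', '2013-09-02', '2014-01-20',
    '2014-05-05', '2014-06-30', '2014-03-11', '2013-12-25',
]))

# Share of accounts per creation year, in percent
yearly_share = created.dt.year.value_counts(normalize=True).sort_index() * 100

# Number of new accounts per calendar month, across all years
monthly_counts = created.dt.month.value_counts().sort_index()

print(yearly_share)
print(monthly_counts)
```

One caveat when reading the monthly totals: if the last year in the data is incomplete, months near the end of the year are counted for fewer years than the others.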

User language analysis:

# user language analysis
plt.figure(figsize=(8,6))
all_data.language.value_counts().plot(kind='bar',color='g',rot=0)
plt.title('English is the most popular language ')
plt.xlabel('language used')
plt.ylabel('number')
plt.show()

[Figure: number of users by language]

Conclusion:

The vast majority of users have English as their language.

User browser analysis

# browsers used, broken down by signup app and by device type
plt.figure(figsize=(15,5))
plt.subplot(121)
top_browsers = all_data.groupby('first_browser').id.count().nlargest(8).index
df = all_data.dropna(subset=['first_browser']).copy() # Drop Nulls to do the analysis
df.first_browser = df.first_browser.apply(lambda browser: 'Other' if browser not in top_browsers else browser)
sns.countplot(x='first_browser', data=df, hue='signup_app' ,order=df.first_browser.value_counts().index)
plt.xticks(rotation=30)
plt.subplot(122)
sns.countplot(x='first_browser',  data=df, hue='first_device_type' ,order=df.first_browser.value_counts().index)
plt.xticks(rotation=30)
plt.show()

[Figure: first browser counts by signup app (left) and by first device type (right)]

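The charts above lead to these conclusions:

1. The vast majority of users sign up on the web, mainly with four browsers: Chrome, Safari, Firefox and IE.
2. Mobile Safari is the most used browser on mobile, on iPhone and iPad devices.
3. Most Mac Desktop users use Safari.
4. Most Android users use Chrome.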
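
Claims of the form "most users of device X use browser Y" can also be checked numerically with a row-normalized cross-tabulation. The sketch below uses made-up rows, so its numbers say nothing about the real data.

```python
import pandas as pd

# Made-up device/browser pairs
df = pd.DataFrame({
    'first_device_type': ['Mac Desktop', 'Mac Desktop', 'Mac Desktop',
                          'iPhone', 'iPhone', 'Windows Desktop'],
    'first_browser': ['Safari', 'Chrome', 'Safari',
                      'Mobile Safari', 'Mobile Safari', 'Chrome'],
})

# normalize='index' turns each row into shares that sum to 1,
# i.e. the browser mix within each device type
share = pd.crosstab(df['first_device_type'], df['first_browser'], normalize='index')
print(share.round(2))
```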

User age and gender analysis

# age and gender analysis
age_gender_countries = pd.read_csv('age_gender_bkts.csv')
plt.figure(figsize=(15,12))
plt.subplot(221)
sorted_df = age_gender_countries.groupby(["country_destination"]).population_in_thousands.sum().reset_index().sort_values('population_in_thousands', ascending=False)
sns.barplot(x="country_destination", y="population_in_thousands", hue="gender", order=sorted_df.country_destination, data=age_gender_countries, ci=None)
plt.title('Countries Visited By Gender')
plt.xlabel('Country')
plt.ylabel('Population in Thousands')


plt.subplot(222)
age_gender_countries.groupby(["age_bucket"]).population_in_thousands.sum().loc[age_gender_countries.age_bucket.iloc[:21].values[::-1]].plot(kind='bar', rot=45, color=plt.cm.tab20c(np.arange(len(age_gender_countries.age_bucket.unique()))))
plt.title('Age Buckets vs Number Of Users Who Made At Least a Booking in 2015')
plt.xlabel('Age Bucket')


plt.subplot(212)
plt.title('Age Buckets per Country vs Count')
buckets_count = len(age_gender_countries.age_bucket.unique())
ax = sns.barplot(x='country_destination', y="population_in_thousands", 
                 hue='age_bucket', hue_order=age_gender_countries.age_bucket.iloc[:buckets_count].values[::-1], 
                 data=age_gender_countries, ci=None, palette=sns.color_palette("tab20c", buckets_count))
ax.legend(title='Age Bucket', bbox_to_anchor=(1, 0.5), loc=6)
plt.xlabel('Country of Destination')
plt.ylabel('Population in Thousands')
plt.subplots_adjust(hspace=0.35)
plt.show()

[Figure: countries visited by gender, age buckets, and age buckets per destination country]

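Findings on gender and age:

1. In each destination country, female users slightly outnumber male users.
2. In 2015, users aged 40-59 form the largest group; the number of users rises from age 10 to 54 and falls after 54.
3. In every age group, far more users have the US as their destination than any other country.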

Analysis of user browsing behavior:

# load the session log (read once and reuse)
sessions = pd.read_csv('sessions.csv')
sessions.head()
# mean time spent per user and action
sessions_grouped = sessions.groupby(['user_id', 'action']).secs_elapsed.mean()

# the 5 actions performed by the largest number of users
crafted_features = pd.Series(sessions_grouped.index.get_level_values(1)).value_counts().nlargest(5).index
crafted_features

print('You have session data for {} of users'.format(round(len(sessions.user_id.unique())/len(all_data), 2)))

sessions_df = sessions_grouped.unstack()[crafted_features]
print(sessions_df.head())

plt.figure(figsize=(11,8))
sessions[sessions.action.isin(crafted_features)].groupby('action').secs_elapsed.mean().plot(kind='bar', rot=0)
plt.ylabel('Average Elapsed Time In Seconds')
plt.title('Most common 5 actions')
plt.show()

[Figure: average elapsed time in seconds for the 5 most common actions]
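
The groupby followed by unstack in the code above turns the long session log into a wide table with one row per user and one column per action. A toy sketch of that reshaping, on made-up rows:

```python
import pandas as pd

# Made-up session log in long format
sessions = pd.DataFrame({
    'user_id': ['u1', 'u1', 'u1', 'u2'],
    'action': ['search', 'search', 'show', 'search'],
    'secs_elapsed': [100.0, 300.0, 50.0, 80.0],
})

# Mean time per (user, action) pair, indexed by a two-level MultiIndex
mean_secs = sessions.groupby(['user_id', 'action']).secs_elapsed.mean()

# unstack moves the 'action' level into columns; missing pairs become NaN
wide = mean_secs.unstack()
print(wide)
```

Users who never performed an action get NaN in that column, so the wide table usually needs a fill strategy before it is used as model input.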

III. Channel Analysis

New-user sign-up channels

plt.figure(figsize=(18,15))
plt.subplot(221)
ax = sns.countplot(x='signup_method', data=all_data, hue='first_device_type' ,order=all_data.signup_method.value_counts().index)
ax.legend(loc=1)
plt.title('Signup method counts with different devices')
plt.ylabel('')


plt.subplot(222)
sns.countplot(x=all_data.signup_flow)
plt.ylabel('')
plt.title('Signup flow counts')

plt.subplot(223)
sns.countplot(x=all_data.signup_app, order=all_data.signup_app.value_counts().index)
plt.title('Signup application')

plt.subplot(224)
top_x = 7
# 'Other' must start at position top_x; slicing from top_x+1 would silently drop one browser
sizes = np.append(all_data.first_browser.value_counts().iloc[:top_x].values, all_data.first_browser.value_counts().iloc[top_x:].values.sum())
labels = np.append(all_data.first_browser.value_counts().iloc[:top_x].index, 'Other')
plt.pie(sizes, labels=labels, autopct='%.1f%%',
        shadow=False, pctdistance=0.85, labeldistance=1.05, startangle=10, explode=[0.15 if (i == 0 or i == len(sizes)-1) else 0 for i in range(len(sizes))])
plt.title('First browser used')
plt.subplots_adjust(hspace=0.35)

[Figure: signup method counts by device, signup flow counts, signup application, and first browser used]

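Findings on sign-up channels:

1. There are two main sign-up methods, basic sign-up and Facebook sign-up; together they cover the vast majority of new users, and the most common device is Mac Desktop.
2. 80% of users sign up directly on the sign-up page, and 20% sign up through third-party ad links.
3. The web is the most used client for new sign-ups, followed by iOS, Android and mobile web.
4. Chrome is the most used browser, followed by Mobile Safari, Safari, Firefox, IE and other browsers.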

Marketing channel analysis

plt.figure(figsize=(15,7))
plt.subplot(121)
top_x = 6
# 'Other' must start at position top_x; slicing from top_x+1 would silently drop one provider
sizes = np.append(all_data.affiliate_provider.value_counts().iloc[:top_x].values, all_data.affiliate_provider.value_counts().iloc[top_x:].values.sum())
labels = np.append(all_data.affiliate_provider.value_counts().iloc[:top_x].index, 'Other')
plt.bar(x=range(top_x+1), height=sizes, tick_label=labels)
plt.xticks(rotation=45)
plt.title('Affiliate Providers')

plt.subplot(122)
grouped_df = all_data.groupby('affiliate_channel').count().id.nlargest(len(all_data.affiliate_channel.unique()))
explode_thr = 4
plt.pie(grouped_df, labels=grouped_df.index, autopct='%.1f%%', shadow=True, pctdistance=0.88, labeldistance=1.05, startangle=30, 
        explode = [0 if i < explode_thr else (i/len(grouped_df))-(explode_thr/len(grouped_df)) for i in range(len(grouped_df))])
plt.title('Affiliate Channels')
plt.show()

[Figure: affiliate providers (bar chart) and affiliate channels (pie chart)]

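Findings on marketing channels:

Direct marketing makes up the largest part, Google provides the next largest share, and the remaining 10% or so comes from other channels.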
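
The "top N plus Other" grouping used for the browser and provider charts can be wrapped in a small helper. The function name top_n_with_other is hypothetical and the sample values are made up; the point of the sketch is that the two slices share the same boundary, so no category is lost.

```python
import pandas as pd

def top_n_with_other(series, n):
    """Return counts of the n most frequent values plus one 'Other' bucket."""
    counts = series.value_counts()
    top = counts.iloc[:n]
    # Everything from position n onward goes into 'Other'
    rest = counts.iloc[n:].sum()
    return pd.concat([top, pd.Series({'Other': rest})])

# Made-up provider values
providers = pd.Series(['direct'] * 5 + ['google'] * 3 + ['bing'] * 2 + ['craigslist'])
result = top_n_with_other(providers, 2)
print(result)
```

A quick sanity check is that the grouped counts still add up to the number of rows.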

IV. Feature Engineering

# 'date_first_booking' is mostly empty, so drop it from both the training and test sets
train_data=train_data.drop(['date_first_booking'],axis=1)
test_data=test_data.drop(['date_first_booking'],axis=1)

# split 'date_account_created' into year, month and day
date_acc_created_train = np.vstack(train_data.date_account_created.astype(str).apply(
    lambda x : list(map(int, x.split('-')))).values)

train_data['create_year'] = date_acc_created_train[:, 0]
train_data['create_month'] = date_acc_created_train[:, 1]
train_data['create_day'] = date_acc_created_train[:, 2]
train_data = train_data.drop(['date_account_created'], axis = 1)

date_acc_created_test = np.vstack(test_data.date_account_created.astype(str).apply(
    lambda x : list(map(int, x.split('-')))).values)

test_data['create_year'] = date_acc_created_test[:, 0]
test_data['create_month'] = date_acc_created_test[:, 1]
test_data['create_day'] = date_acc_created_test[:, 2]
test_data = test_data.drop(['date_account_created'], axis = 1)

# encode the gender labels as integers (-1 for unknown or missing)
train_data.loc[train_data.gender == '-unknown-', 'gender'] = -1
train_data.loc[train_data.gender.isnull(), 'gender'] = -1
test_data.loc[test_data.gender == '-unknown-', 'gender'] = -1
test_data.loc[test_data.gender.isnull(), 'gender'] = -1

gender_enc = {'FEMALE' : 0,
             'MALE' : 1,
             'OTHER' : 2,
             -1 : -1}
for data in [train_data, test_data]:
    data.gender = data.gender.apply(lambda x : gender_enc[x])


# ages outside [16, 90] are set to NaN and then replaced with the median
train_data.loc[train_data.age > 90, 'age'] = np.nan
train_data.loc[train_data.age < 16, 'age'] = np.nan
test_data.loc[test_data.age > 90, 'age'] = np.nan
test_data.loc[test_data.age < 16, 'age'] = np.nan

train_data.loc[train_data.age.isnull(), 'age'] = train_data.age.median()
test_data.loc[test_data.age.isnull(), 'age'] = test_data.age.median()

# sign-up method
signup_enc = {'facebook' : 0,
             'google' : 1,
             'basic' : 2,
             'weibo' : 3}
for data in [train_data, test_data]:
    data.signup_method = data.signup_method.apply(lambda x : signup_enc[x])

# language: replace '-unknown-' in the test set with the mode
test_data.loc[test_data.language == '-unknown-', 'language'] = test_data.language.mode()[0]

# missing first_affiliate_tracked values are labelled 'untracked'
for data in [train_data, test_data]:
    data.loc[data.first_affiliate_tracked.isnull(), 'first_affiliate_tracked'] = 'untracked'

from sklearn.preprocessing import LabelEncoder

# label-encode the remaining categorical columns; each encoder is fitted on the
# train and test values together so that a category gets the same code in both sets
# (fitting separately on each set can assign different codes to the same category)
cat_cols = ['language', 'affiliate_channel', 'affiliate_provider',
            'first_affiliate_tracked', 'signup_app',
            'first_device_type', 'first_browser']
for col in cat_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train_data[col], test_data[col]]))
    train_data[col] = le.transform(train_data[col])
    test_data[col] = le.transform(test_data[col])

# drop columns that are not used as features
train_data = train_data.drop(['id', 'timestamp_first_active'], axis = 1)
train_data.head()
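
Label encoding is only comparable between the training and test sets when both use the same mapping from category to integer. The toy sketch below, with made-up browser values, shows how the codes differ when two encoders are fitted separately and how fitting once on the combined values keeps them aligned.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up categorical column in a train and a test frame
train = pd.DataFrame({'first_browser': ['Chrome', 'Safari', 'Firefox']})
test = pd.DataFrame({'first_browser': ['Safari', 'IE']})

# Separate encoders: codes follow each set's own sorted categories,
# so 'Safari' gets a different integer in each set
sep_train = LabelEncoder().fit_transform(train['first_browser'])
sep_test = LabelEncoder().fit_transform(test['first_browser'])

# One encoder fitted on the union of both sets gives a shared mapping
le = LabelEncoder().fit(pd.concat([train['first_browser'], test['first_browser']]))
train_codes = le.transform(train['first_browser'])
test_codes = le.transform(test['first_browser'])

print(dict(zip(le.classes_, range(len(le.classes_)))))
```

For tree models the integer codes are arbitrary but must be consistent; one-hot encoding is a common alternative when an ordering between codes should not be implied.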

V. Modeling and Prediction

from sklearn.model_selection import train_test_split  # needed for the split below
from sklearn.preprocessing import LabelEncoder 
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import ComplementNB

# hold out 20% of the labelled training data for evaluation
X_train, X_test, y_train, y_test = train_test_split(train_data.drop([ 'country_destination'], axis = 1),train_data.country_destination, test_size = 0.2, random_state = 800)

# random forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=25, random_state=555, max_depth=7)
rf.fit(X_train,y_train )
print('Accuracy score for RF:')
print(rf.score(X_test,y_test))

Accuracy score for RF:
0.6212316413295542

# gradient boosted decision trees (GBDT)
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(max_depth = 4, n_estimators = 100, random_state = 817)
gb.fit(X_train,y_train)
y_predict = gb.predict(X_test)
print('Accuracy score for GBDT:')
print(gb.score(X_test,y_test))

Accuracy score for GBDT:
0.6301562390199339
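
Accuracy is easier to interpret next to a trivial baseline. date_first_booking is missing for about 58% of training rows, which suggests that a similar share of users never booked; if so, always predicting the most frequent class would already score close to that. The sketch below shows how such a baseline can be computed with scikit-learn's DummyClassifier. The labels are made up and only mimic an imbalanced target.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up, imbalanced target: 6 of 10 users fall in the majority class
y = np.array(['NDF'] * 6 + ['US'] * 3 + ['other'])
X = np.zeros((len(y), 1))  # features are ignored by this baseline

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
acc = baseline.score(X, y)
print('Baseline accuracy:', acc)
```

In the real workflow the baseline would be fitted on X_train, y_train and scored on X_test, y_test, so that it is directly comparable with the two models above.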