这周参加了资金流入流入预测比赛。这个比赛主要是根据余额宝用户的申购赎回数据来预测未来每日的资金流入流出情况,是一个典型的时间序列类赛题。
下载数据后可以发现有五个文件。用户信息表,用户申购赎回数表,收益率表,上海银行间拆放利率表。赛题的任务就是利用2013年7月~2014年8月的数据,预测2014年9月每一天申购和赎回的总量。其中申购总量和赎回总量均为连续型变量,我们常采用MSE、MAE、MAPE等为评价指标。
赛题评估指标,某些天的异常表现容易严重影响传统的评价指标,尤其是MSE。所以采用积分制,每天10分,若某天预测误差较小,则该天获得满分,若某天预测误差较大,则该天获得0分。那么最后的总积分为0.45的申购预测得分和0.55的赎回预测得分。
资金流入流出预测需要先统计每天的申购总额和赎回总额。然后重点关注临近数据,中位数等等。根据时间序列规则与模型愁绪特征、利用机器学习方法建模。
数据探索流程
统计总量->观察时序图(以月、周为周期)->以星期为周期查看数据(小提琴图、分布图、柱状图、箱型图)->构造变量分析申购和赎回之间相关性(线性相关性弱,热力图)->分析独立性(存在依赖关联,mvtest())->按月查看,分析月份之间差异(密度曲线图)->逐月按天绘制直方图分析申购赎回(每周开始倾向购买,每周中期倾向于赎回;周末不倾向交易;赎回波动性比购买大)->逐周按天绘制热力图查看(有必要继续分析节假日及特殊日期,包括6.18等购物节,节前交易量下降,节后上升)->分析大额交易(某笔异常大额交易占某天交易总额比重过大)->大额交易(大于100万)与交易总额、与小额交易之间趋势分析->银行拆解利率与交易量关联->余额宝收益与交易量关联->分析用户信息(城市、性别、星座)
import pandas as pd
import numpy as np
import warnings
import seaborn as sns
import datetime
import matplotlib.pyplot as plt
from scipy import stats
warnings.filterwarnings('ignore')
# load the balance data
data_balance = pd.read_csv("E:/比赛/时间序列/Data/user_balance_table.csv")
# add tiemstamp to dataset
data_balance['date'] = pd.to_datetime(data_balance['report_date'], format= "%Y%m%d")
data_balance['day'] = data_balance['date'].dt.day
data_balance['month'] = data_balance['date'].dt.month
data_balance['year'] = data_balance['date'].dt.year
data_balance['week'] = data_balance['date'].dt.week
data_balance['weekday'] = data_balance['date'].dt.weekday
# total amount
total_balance = data_balance.groupby(['date'])['total_purchase_amt','total_redeem_amt'].sum()
total_balance.reset_index(inplace=True)
# Add time stamp
total_balance['day'] = total_balance['date'].dt.day
total_balance['month'] = total_balance['date'].dt.month
total_balance['year'] = total_balance['date'].dt.year
total_balance['week'] = total_balance['date'].dt.week
total_balance['weekday'] = total_balance['date'].dt.weekday
# Load user's information
users = pd.read_csv('E:/比赛/时间序列/Data/user_profile_table.csv')
# 获得大额用户的集合
temp = data_balance[(data_balance['total_purchase_amt'] >= 1000000) | (data_balance['total_redeem_amt'] >= 1000000)]
big_users_set = set(temp[temp['date'] >= datetime.datetime(2014,4,1)]['user_id'])
len(big_users_set)
3903
max(data_balance['user_id'])
28041
# 标记大额用户
data_balance['big_user'] = 0
data_balance.loc[data_balance['user_id'].isin(big_users_set), 'big_user'] = 1
# 统计大额用户与小额用户的日总交易额的区别
total_balance_bigNsmall = data_balance.groupby(['date','big_user'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
# 画出大额用户与小额用户交易的日总交易量图
fig = plt.figure(figsize=(20,8))
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['total_purchase_amt'],label='big_purchase')
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['total_redeem_amt'],label='big_redeem')
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['total_purchase_amt'],label='small_purchase')
plt.plot(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['date'], total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['total_redeem_amt'],label='small_redeem')
plt.legend(loc='best')
plt.title("The time series of big and small user of Purchase and Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-RZffeG3H-1597932366225)(output_8_0.png)]
# 统计大额小额用户购买量占比
np.sum(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['total_purchase_amt']) / np.sum(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['total_purchase_amt'])
5.1456178397775805
# 统计大额小额用户赎回量占比
np.sum(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 1]['total_redeem_amt']) / np.sum(total_balance_bigNsmall[total_balance_bigNsmall['big_user'] == 0]['total_redeem_amt'])
5.422141272341089
# 画出非0交易的分布图
frequency = data_balance[(data_balance['direct_purchase_amt'] != 0) | (data_balance['total_redeem_amt'] != 0)][['user_id','tBalance']].groupby('user_id', as_index=False).count()
sns.distplot(frequency['tBalance'])
<matplotlib.axes._subplots.AxesSubplot at 0x282b46441c8>
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-o7Xh5dnA-1597932366226)(output_11_1.png)]
# 获取频繁交易用户集合
hot_users_set = set(frequency[frequency['tBalance'] > 30]['user_id'])
# 获取频繁用户的交易纪录
data_balance['is_hot_users'] = 0
data_balance.loc[data_balance['user_id'].isin(hot_users_set) , 'is_hot_users'] = 1
# 统计频繁用户与非频繁用户的日总交易额的区别
total_balance_hotNcold = data_balance.groupby(['date','is_hot_users'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
# 绘制频繁用户与非频繁用户总购买赎回量的时序图
fig = plt.figure(figsize=(20,8))
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['total_purchase_amt'],label='hot_purchase')
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 1]['total_redeem_amt'],label='hot_redeem')
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['total_purchase_amt'],label='cold_purchase')
plt.plot(total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['date'], total_balance_hotNcold[total_balance_hotNcold['is_hot_users'] == 0]['total_redeem_amt'],label='cold_redeem')
plt.legend(loc='best')
plt.title("The time series of big and small user of Purchase and Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-916ffNDV-1597932366227)(output_15_0.png)]
# 画出冷启动用户与老用户交易的箱型图
temp = data_balance[['year','month','user_id','total_purchase_amt','total_redeem_amt']].groupby(['year','month','user_id'], as_index=False).sum()
user_old_set = set()
plt.figure(figsize=(10,30))
for i in range(1, 9):
newset = set(temp[(temp['year'] == 2014) & (temp['month'] == i)]['user_id'])
this_month = data_balance[(data_balance['year'] == 2014) & (data_balance['month'] == i)]
this_month['cold'] = 0
this_month.loc[this_month['user_id'].isin(newset - user_old_set), 'cold'] = 1
plt.subplot(4,2,i)
plt.title('This month : ' + str(i))
sns.boxplot(x="cold", y="total_purchase_amt" , data=this_month[(this_month['direct_purchase_amt'] != 0) | (this_month['total_redeem_amt'] != 0)])
user_old_set = user_old_set | newset
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-c1Bsi0O2-1597932366228)(output_16_0.png)]
# 用户的其他属性
users.head()
user_id | sex | city | constellation | |
---|---|---|---|---|
0 | 2 | 1 | 6411949 | 狮子座 |
1 | 12 | 1 | 6412149 | 摩羯座 |
2 | 22 | 1 | 6411949 | 双子座 |
3 | 23 | 1 | 6411949 | 双鱼座 |
4 | 25 | 1 | 6481949 | 双鱼座 |
# 添加城市、星座、性别
data_balance = pd.merge(data_balance, users, on='user_id')
# 统计每个城市用户的日总交易额的区别并绘制分布估计图
fig = plt.figure(figsize=(10,5))
for i in np.unique(data_balance['city']):
temp = data_balance.groupby(['date','city'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
ax = sns.kdeplot( temp[temp['city'] == i]['total_purchase_amt'],label=i)
plt.legend(loc='best')
plt.title("The time series of different city of Purchase and Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
fig = plt.figure(figsize=(10,5))
for i in np.unique(data_balance['city']):
temp = data_balance.groupby(['date','city'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
ax = sns.kdeplot( temp[temp['city'] == i]['total_redeem_amt'],label=i)
plt.legend(loc='best')
plt.title("The time series of different city of Purchase and Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-LjS4b3lh-1597932366229)(output_19_0.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-AZK4QOGx-1597932366230)(output_19_1.png)]
# 统计每个性别用户的日总交易额的区别,并绘制时序图
temp = data_balance.groupby(['date','sex'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
fig = plt.figure(figsize=(20,6))
plt.plot(temp[temp['sex'] == 1]['date'], temp[temp['sex'] == 1]['total_purchase_amt'],label='Male')
plt.plot(temp[temp['sex'] == 0]['date'], temp[temp['sex'] == 0]['total_purchase_amt'],label='Female')
plt.legend(loc='best')
plt.title("The time series of two gender user of Purchase")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
fig = plt.figure(figsize=(20,6))
plt.plot(temp[temp['sex'] == 1]['date'], temp[temp['sex'] == 1]['total_purchase_amt'],label='Male')
plt.plot(temp[temp['sex'] == 0]['date'], temp[temp['sex'] == 0]['total_redeem_amt'],label='Female')
plt.legend(loc='best')
plt.title("The time series of two gender user of Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-csaW0qY9-1597932366231)(output_20_0.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-L4JGONKM-1597932366231)(output_20_1.png)]
# 统计每个星座用户的日总交易额的区别 并绘制分布估计图
fig = plt.figure(figsize=(10,5))
for i in np.unique(data_balance['constellation']):
temp = data_balance.groupby(['date','constellation'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
ax = sns.kdeplot( temp[temp['constellation'] == i]['total_purchase_amt'],label=i)
plt.legend(loc='best')
plt.title("The time series of small deal of Purchase and Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
fig = plt.figure(figsize=(10,5))
for i in np.unique(data_balance['constellation']):
temp = data_balance.groupby(['date','constellation'], as_index=False)['total_purchase_amt','total_redeem_amt'].sum()
ax = sns.kdeplot( temp[temp['constellation'] == i]['total_redeem_amt'],label=i)
plt.legend(loc='best')
plt.title("The time series of small deal of Purchase and Redeem")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-PfupAvot-1597932366232)(output_21_0.png)]
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-W8WJuF3i-1597932366233)(output_21_1.png)]
# 统计每个性别用户的日总交易额的区别
temp = data_balance.groupby(['date'], as_index=False)['direct_purchase_amt','share_amt'].sum()
# 画出每日利息的增长/直接购买量的时序图
fig = plt.figure(figsize=(20,6))
plt.plot(temp['date'], temp['share_amt'] / temp['direct_purchase_amt'] ,label='Rate')
plt.legend(loc='best')
plt.title("The rate of Share / Purchase")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-gfBkRCBQ-1597932366234)(output_23_0.png)]
# 加载支付宝利率数据
share = pd.read_csv('E:/比赛/时间序列/Data/mfd_day_share_interest.csv')
share = share.rename(columns = {'mfd_date': 'date'})
share_features = [x for x in share.columns if x not in ['date']]
share['date'] = pd.to_datetime(share['date'], format= "%Y%m%d")
share['day'] = share['date'].dt.day
share['month'] = share['date'].dt.month
share['year'] = share['date'].dt.year
share['week'] = share['date'].dt.week
share['weekday'] = share['date'].dt.weekday
# 绘制支付宝利率与交易额的时序图
fig,ax1 = plt.subplots(figsize=(15,5))
plt.plot(temp['date'], temp['share_amt'],'b',label="Share_amt")
plt.legend()
ax2=ax1.twinx()
plt.plot(share['date'], share['mfd_daily_yield'],'g',label="Share rate")
plt.legend(loc=2)
plt.title("The correlation between share rate and share amount")
plt.xlabel("Time")
plt.ylabel("Rate & Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-ZiQqt04k-1597932366235)(output_25_0.png)]
# 支付宝利率与每日利息的增长/直接购买量的时序图
fig,ax1 = plt.subplots(figsize=(15,5))
plt.plot(temp['date'], temp['share_amt'] / temp['direct_purchase_amt'],'b',label="Share_amt / Direct_amt")
plt.legend()
ax2=ax1.twinx()
plt.plot(share['date'], share['mfd_daily_yield'],'g',label="Share rate")
plt.legend(loc=2)
plt.title("The correlation between share rate and Share/Purchase")
plt.xlabel("Time")
plt.ylabel("Rate & Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-9hQLnjlJ-1597932366236)(output_26_0.png)]
# 聚合两种不同购买方式
temp = data_balance.groupby(['date'], as_index=False)['purchase_bal_amt','purchase_bank_amt'].sum()
# 画出不同购买方式日购买量的时序图
fig = plt.figure(figsize=(20,6))
plt.plot(temp['date'], temp['purchase_bal_amt'],label='Bal')
plt.plot(temp['date'], temp['purchase_bank_amt'],label='Bank')
plt.legend(loc='best')
plt.title("The purchase of bal and bank")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-x64IC9bt-1597932366237)(output_28_0.png)]
# 画出不同赎回方式日赎回量的时序图
temp = data_balance.groupby(['date'], as_index=False)['tftobal_amt','tftocard_amt'].sum()
fig = plt.figure(figsize=(20,6))
plt.plot(temp['date'], temp['tftobal_amt'],label='Bal')
plt.plot(temp['date'], temp['tftocard_amt'],label='Bank')
plt.legend(loc='best')
plt.title("The redeem of bal and bank")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-v6fkbeOE-1597932366238)(output_29_0.png)]