本次学习的内容来自阿里天池大赛赛题。
赛题简介
蚂蚁金服拥有上亿会员并且业务场景中每天都涉及大量的资金流入和流出,面对如此庞大的用户群,资金管理压力会非常大。在既保证资金流动性风险最小,又满足日常业务运转的情况下,精准地预测资金的流入流出情况变得尤为重要。此届大赛以《资金流入流出预测》为题,期望参赛者能够通过对例如余额宝用户的申购赎回数据的把握,精准预测未来每日的资金流入流出情况。对货币基金而言,资金流入意味着申购行为,资金流出为赎回行为 。
赛题与数据链接见:https://tianchi.aliyun.com/competition/entrance/231573/information
本次学习共计八天,五个阶段,分别是:
(一)数据探索与分析
(二)时间序列规则
(三)时间序列模型
(四)特征工程
(五)建模预测
(六)学习总结
本篇是(一)数据探索与分析 学习笔记
对数据进行探索与分析,了解数据,是有效建模的基础。
数据有五个csv文件,包含的信息如下:
文件名 | |
comp_predict_table.csv | 输出结果格式示例,包括日期,赎回总额,申购总额 |
mfd_bank_shibor.csv | 14个月之间银行的拆借利率 |
mfd_day_share_interest.csv | 余额宝在某天的收益率 |
user_balance_table.csv | 用户申购赎回数据表(用户在某天的支付宝资金流动数据) |
user_profile_table.csv | 用户id,性别,城市,星座 |
竞赛中使用的数据主要包含四个部分,分别为用户基本信息数据、用户申购赎回数据、收益率表和银行间拆借利率表。前两者是用户相关的数据,后两者与用户无关,是与银行政策相关的数据。
具体数据信息见:https://tianchi.aliyun.com/competition/entrance/231573/information
对数据进行探索性分析:
一、时间序列分析
这部分主要涉及的是 时间序列图
========载入数据
读取用户申购赎回数据表,数据信息如表:
今日总购买量和今日总赎回量是本次的预测变量。因此主要探索这两个变量和其它变量的关系。
分别绘制所有用户在年,月,日不同时间段的申购赎回总量图。
#导入模块
import pandas as pd
import numpy as np
import warnings
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
# 设置数据集路径
dataset_path = 'Dataset/'
# 读取数据
data_balance = pd.read_csv(dataset_path+'user_balance_table.csv')
# Load the balance data
data_balance = pd.read_csv('Dataset/user_balance_table.csv')
# add tiemstamp to dataset
data_balance['date'] = pd.to_datetime(data_balance['report_date'], format= "%Y%m%d")
data_balance['day'] = data_balance['date'].dt.day
data_balance['month'] = data_balance['date'].dt.month
data_balance['year'] = data_balance['date'].dt.year
data_balance['week'] = data_balance['date'].dt.week
data_balance['weekday'] = data_balance['date'].dt.weekday
# total amount
total_balance = data_balance.groupby(['date'])['total_purchase_amt','total_redeem_amt'].sum()
total_balance.reset_index(inplace=True)
# Add time stamp
total_balance['day'] = total_balance['date'].dt.day
total_balance['month'] = total_balance['date'].dt.month
total_balance['year'] = total_balance['date'].dt.year
total_balance['week'] = total_balance['date'].dt.week
total_balance['weekday'] = total_balance['date'].dt.weekday
# Load user's information
users = pd.read_csv('Dataset/user_profile_table.csv')
# 聚合时间数据
total_balance = data_balance.groupby(['date'])['total_purchase_amt','total_redeem_amt'].sum().reset_index()
# 生成测试集区段数据 测试(预测)未来一个月的数据
start = datetime.datetime(2014,9,1)
testdata = []
while start != datetime.datetime(2014,10,1):
temp = [start, np.nan, np.nan]
testdata.append(temp)
start += datetime.timedelta(days = 1)
testdata = pd.DataFrame(testdata)
testdata.columns = total_balance.columns
首先探索数据的整体情况:
# 画出每日总购买与赎回量的时间序列图
fig = plt.figure(figsize=(20,6))
plt.plot(total_balance['date'], total_balance['total_purchase_amt'],label='purchase')
plt.plot(total_balance['date'], total_balance['total_redeem_amt'],label='redeem')
plt.legend(loc='best')
plt.title("The lineplot of total amount of Purchase and Redeem from July.13 to Sep.14")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
申购量和赎回量数据有一定的同步性,申购量数据波动更大,数据大体可以分为2013.7-2013.11、2013.11-2014.4、2014.4-2014.9几个阶段。
# 画出4月份以后的时间序列图
total_balance_1 = total_balance[total_balance['date'] >= datetime.datetime(2014,4,1)]
fig = plt.figure(figsize=(20,6))
plt.plot(total_balance_1['date'], total_balance_1['total_purchase_amt'])
plt.plot(total_balance_1['date'], total_balance_1['total_redeem_amt'])
plt.legend()
plt.title("The lineplot of total amount of Purchase and Redeem from April.14 to Sep.14")
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
在这个阶段,数据具有周期平稳性。
下面探索数据和日期之间的关系。
# 分别画出每个月中每天购买赎回量的时间序列图
fig = plt.figure(figsize=(15,15))
plt.subplot(4,1,1)
plt.title("The time series of total amount of Purchase and Redeem for August, July, June, May respectively")
total_balance_2 = total_balance[total_balance['date'] >= datetime.datetime(2014,8,1)]
plt.plot(total_balance_2['date'], total_balance_2['total_purchase_amt'])
plt.plot(total_balance_2['date'], total_balance_2['total_redeem_amt'])
plt.legend()
total_balance_3 = total_balance[(total_balance['date'] >= datetime.datetime(2014,7,1)) & (total_balance['date'] < datetime.datetime(2014,8,1))]
plt.subplot(4,1,2)
plt.plot(total_balance_3['date'], total_balance_3['total_purchase_amt'])
plt.plot(total_balance_3['date'], total_balance_3['total_redeem_amt'])
plt.legend()
total_balance_4 = total_balance[(total_balance['date'] >= datetime.datetime(2014,6,1)) & (total_balance['date'] < datetime.datetime(2014,7,1))]
plt.subplot(4,1,3)
plt.plot(total_balance_4['date'], total_balance_4['total_purchase_amt'])
plt.plot(total_balance_4['date'], total_balance_4['total_redeem_amt'])
plt.legend()
total_balance_5 = total_balance[(total_balance['date'] >= datetime.datetime(2014,5,1)) & (total_balance['date'] < datetime.datetime(2014,6,1))]
plt.subplot(4,1,4)
plt.plot(total_balance_5['date'], total_balance_5['total_purchase_amt'])
plt.plot(total_balance_5['date'], total_balance_5['total_redeem_amt'])
plt.legend()
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
分别绘制2014年5-8月间,每天的申购赎回数据的变化趋势。
数据呈现周期性波动,但申购赎回量与日期变量似乎不存在明显的关系。
绘制2013年8-9月间,每天的申购赎回数据的变化趋势。
在此阶段数据中,申购赎回量与日期变量存在一定的关系。
# 分别画出13年8月与9月每日购买赎回量的时序图
fig = plt.figure(figsize=(15,9))
total_balance_last8 = total_balance[(total_balance['date'] >= datetime.datetime(2013,8,1)) & (total_balance['date'] < datetime.datetime(2013,9,1))]
plt.subplot(2,1,1)
plt.plot(total_balance_last8['date'], total_balance_last8['total_purchase_amt'])
plt.plot(total_balance_last8['date'], total_balance_last8['total_redeem_amt'])
plt.legend()
total_balance_last9 = total_balance[(total_balance['date'] >= datetime.datetime(2013,9,1)) & (total_balance['date'] < datetime.datetime(2013,10,1))]
plt.subplot(2,1,2)
plt.plot(total_balance_last9['date'], total_balance_last9['total_purchase_amt'])
plt.plot(total_balance_last9['date'], total_balance_last9['total_redeem_amt'])
plt.legend()
plt.xlabel("Time")
plt.ylabel("Amount")
plt.show()
二、翌日特征分析
这个部分对数据和其在一个星期中哪一天的关系进行分析。我们使用到了 数据分布可视化 ,主要使用sns函数绘制分布图。
下面是几种常见的分布图:
1. 直方图
2. 密度曲线图(直方图的变种)
3. 箱型图
4. 小提琴图
# 画出每个翌日的数据分布于整体数据的分布图
a = plt.figure(figsize=(10,10))
scatter_para = {'marker':'.', 's':3, 'alpha':0.3}
line_kws = {'color':'k'}
plt.subplot(2,2,1)
plt.title('The distrubution of total purchase')
sns.violinplot(x='weekday', y='total_purchase_amt', data = total_balance_1, scatter_kws=scatter_para, line_kws=line_kws)
plt.subplot(2,2,2)
plt.title('The distrubution of total purchase')
sns.distplot(total_balance_1['total_purchase_amt'].dropna())
plt.subplot(2,2,3)
plt.title('The distrubution of total redeem')
sns.violinplot(x='weekday', y='total_redeem_amt', data = total_balance_1, scatter_kws=scatter_para, line_kws=line_kws)
plt.subplot(2,2,4)
plt.title('The distrubution of total redeem')
sns.distplot(total_balance_1['total_redeem_amt'].dropna())
小提琴图中白点表示中位数,宽度表示频率,申购赎回量与星期日有明显关系。
# 按翌日对数据聚合后取均值
week_sta = total_balance_1[['total_purchase_amt', 'total_redeem_amt', 'weekday']].groupby('weekday', as_index=False).mean()
# 分析翌日的中位数特征
plt.figure(figsize=(12, 5))
ax = plt.subplot(1,2,1)
plt.title('The barplot of average total purchase with each weekday')
ax = sns.barplot(x="weekday", y="total_purchase_amt", data=week_sta, label='Purchase')
ax.legend()
ax = plt.subplot(1,2,2)
plt.title('The barplot of average total redeem with each weekday')
ax = sns.barplot(x="weekday", y="total_redeem_amt", data=week_sta, label='Redeem')
ax.legend()
# 画出翌日的箱型图
plt.figure(figsize=(12, 5))
ax = plt.subplot(1,2,1)
plt.title('The boxplot of total purchase with each weekday')
ax = sns.boxplot(x="weekday", y="total_purchase_amt", data=total_balance_1)
ax = plt.subplot(1,2,2)
plt.title('The boxplot of total redeem with each weekday')
ax = sns.boxplot(x="weekday", y="total_redeem_amt", data=total_balance_1)
绘制皮尔曼相关性:
# 使用OneHot方法将翌日特征划分,获取划分后特征
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
total_balance = total_balance.reset_index()
week_feature = encoder.fit_transform(np.array(total_balance['weekday']).reshape(-1, 1)).toarray()
week_feature = pd.DataFrame(week_feature,columns=['weekday_onehot']*len(week_feature[0]))
feature = pd.concat([total_balance, week_feature], axis = 1)[['total_purchase_amt', 'total_redeem_amt','weekday_onehot','date']]
feature.columns = list(feature.columns[0:2]) + [x+str(i) for i,x in enumerate(feature.columns[2:-1])] + ['date']
# 画出划分后翌日特征与标签的斯皮尔曼相关性
f, ax = plt.subplots(figsize = (15, 8))
plt.subplot(1,2,1)
plt.title('The spearman coleration between total purchase and each weekday')
sns.heatmap(feature[[x for x in feature.columns if x not in ['total_redeem_amt', 'date'] ]].corr('spearman'),linewidths = 0.1, vmax = 0.2, vmin=-0.2)
plt.subplot(1,2,2)
plt.title('The spearman coleration between total redeem and each weekday')
sns.heatmap(feature[[x for x in feature.columns if x