Time series analysis
Add timestamp fields to the dataset
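import pandas as pd

# Parse report_date (e.g. 20140401) into a datetime column
data_balance['date'] = pd.to_datetime(data_balance['report_date'], format="%Y%m%d")
data_balance['day'] = data_balance['date'].dt.day
data_balance['month'] = data_balance['date'].dt.month
data_balance['year'] = data_balance['date'].dt.year
# Series.dt.week was removed in pandas 2.0; isocalendar().week gives the ISO week number
# (astype(int) assumes there are no missing dates)
data_balance['week'] = data_balance['date'].dt.isocalendar().week.astype(int)
data_balance['weekday'] = data_balance['date'].dt.weekday  # 0 = Monday ... 6 = Sunday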
Filter data by timestamp
import datetime
# Keep only the rows dated 2014-04-01 or later
total_balance[total_balance['date'] >= datetime.datetime(2014, 4, 1)]
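The same boolean-mask pattern extends to a date window. The sketch below uses a hypothetical toy frame (`toy`) in place of `total_balance`; the column names follow the notes, but the values are made up for illustration.

```python
import datetime
import pandas as pd

# Hypothetical toy frame standing in for total_balance
toy = pd.DataFrame({
    "date": pd.to_datetime(["2014-03-30", "2014-03-31", "2014-04-01", "2014-04-02"]),
    "total_purchase_amt": [10, 20, 30, 40],
})

start = datetime.datetime(2014, 4, 1)
end = datetime.datetime(2014, 4, 2)

# Rows on or after the start date
from_april = toy[toy["date"] >= start]

# Half-open window [start, end): combine two masks with & and parentheses
window = toy[(toy["date"] >= start) & (toy["date"] < end)]
```

Each comparison needs its own parentheses because `&` binds more tightly than `>=` and `<` in Python.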
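One-hot encoding
import numpy as np
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
# fit_transform expects a 2-D array and returns a sparse matrix; toarray() makes it dense
week_feature = encoder.fit_transform(np.array(total_balance['weekday']).reshape(-1, 1)).toarray()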
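To see what the encoded matrix looks like, here is a minimal sketch that reproduces a one-hot layout with plain NumPy rather than `OneHotEncoder`. The `weekday` array is hypothetical sample data. The result matches the encoder's column layout only if all seven weekday codes occur in the data, because the encoder creates one column per category it has actually seen.

```python
import numpy as np

# Hypothetical weekday codes (0 = Monday ... 6 = Sunday),
# standing in for total_balance['weekday']
weekday = np.array([0, 1, 6, 3, 0])

# Row k of the 7x7 identity matrix is the one-hot vector for weekday k,
# so indexing the identity matrix with the codes yields one row per sample
one_hot = np.eye(7, dtype=int)[weekday]
```

Every row contains exactly one 1, in the column of that sample's weekday.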
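Kernel density plot
# i is assumed to be a month number from an enclosing loop (1 to 11, so that i + 1 is still a valid month)
sns.kdeplot(total_balance[(total_balance['date'] >= datetime.datetime(2013, i, 1)) & (total_balance['date'] < datetime.datetime(2013, i + 1, 1))]['total_purchase_amt'], label='13Y,' + str(i) + 'M')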
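kdeplot accepts any array of values and draws the probability density curve of the variable, estimated from the distribution of those values.
Kernel density estimation (KDE) is a non-parametric method for estimating a probability density function. Let $x_1, x_2, \dots, x_n$ be n independent, identically distributed samples from a distribution F with probability density function f. The kernel density estimate is:
$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$
where K is the kernel (a non-negative function that integrates to 1) and h > 0 is the bandwidth, a smoothing parameter.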
References: https://blog.csdn.net/yuanxing14/article/details/41948485
https://www.bilibili.com/video/BV11t411j7Me?from=search&seid=9359500736608177144
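The kernel density estimate can be computed by hand in a few lines. The sketch below assumes a Gaussian kernel and a fixed bandwidth `h = 0.5`; both are illustrative choices, and the sample values are made up rather than taken from the dataset. seaborn's kdeplot selects its bandwidth automatically, so its curve will generally differ from this one.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, samples, h):
    """Evaluate the kernel density estimate at x with bandwidth h > 0."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

# Hypothetical sample values, not taken from the dataset
samples = [1.0, 2.0, 2.5, 4.0]
h = 0.5

# Evaluate the estimate on a grid from -5 to 10
step = 0.01
grid = [-5.0 + step * k for k in range(1501)]
density = [kde(x, samples, h) for x in grid]

# Riemann-sum check: a density should integrate to roughly 1
area = sum(density) * step
```

A smaller `h` produces a spikier curve that follows individual samples; a larger `h` produces a smoother, flatter one.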
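Plot a heatmap of the label by weekday and week number
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# One row per week, one column per weekday; cells without data stay 0
first_week = total_balance_1['week'].min()
test = np.zeros((total_balance_1['week'].max() - first_week + 1, 7))
rows = (total_balance_1['week'] - first_week).to_numpy(dtype=int)
cols = total_balance_1['weekday'].to_numpy(dtype=int)
test[rows, cols] = total_balance_1['total_purchase_amt'].to_numpy()
f, ax = plt.subplots(figsize=(10, 4))
sns.heatmap(test, linewidths=0.1, ax=ax)
ax.set_title("Purchase")
ax.set_xlabel('weekday')
ax.set_ylabel('week')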
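The week-by-weekday grid can also be built with a pandas pivot table instead of index arithmetic. This is a sketch on a hypothetical toy frame (`toy`) standing in for `total_balance_1`, not the method used in these notes. Two behaviours differ from the array approach: duplicate (week, weekday) pairs are summed rather than overwritten, and a week with no rows at all gets no row in the grid.

```python
import pandas as pd

# Hypothetical toy data standing in for total_balance_1
toy = pd.DataFrame({
    "week": [14, 14, 15, 15, 15],
    "weekday": [5, 6, 0, 1, 2],
    "total_purchase_amt": [100, 200, 300, 400, 500],
})

grid = (
    toy.pivot_table(index="week", columns="weekday",
                    values="total_purchase_amt", aggfunc="sum")
       .reindex(columns=range(7))  # make sure all seven weekdays appear
       .fillna(0)                  # cells with no data become 0
)
```

The resulting `grid` can be passed to `sns.heatmap` directly, and its index and column labels then appear as the tick labels.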
Filter the dataset
# Binary flag: 1 if total_purchase_amt exceeds 1,000,000, otherwise 0
data_balance['big_purchase'] = 0
data_balance.loc[data_balance['total_purchase_amt'] > 1000000, 'big_purchase'] = 1
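Multiple axes (two y-axes sharing one x-axis)
import matplotlib.pyplot as plt
fig, ax1 = plt.subplots(figsize=(15, 5))
line1 = ax1.plot(bank['date'], bank['Interest_3_M'], 'b', label="Interest_3_M")
ax2 = ax1.twinx()  # second y-axis on the right, same x-axis
line2 = ax2.plot(total_balance['date'], total_balance['total_purchase_amt'], 'g', label="Total purchase")
# A plain plt.legend() only lists the lines on the current axes, so pass the handles from both
ax1.legend(line1 + line2, ["Interest_3_M", "Total purchase"], loc=2)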