Data-Driven Practice, Part 5
Predicting a Customer's Next Purchase Day
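Most of the actions and analyses in this series rest on the same idea: treat customers the way they deserve before they expect it (for example, LTV), and act before something bad happens (for example, churn).
Predictive analytics helps a great deal here, and one of the most useful predictions is a customer's next purchase day. If you knew in advance that a customer was likely to buy again within the next week, what proactive steps would you take?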
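In this article we use the online retail dataset and follow these steps:
- Data wrangling
- Feature engineering
- Selecting a machine learning model
- Multi-classification model
- Hyperparameter tuning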
Data Wrangling
#import libraries
from __future__ import division
from datetime import datetime, timedelta,date
import pandas as pd
%matplotlib inline
from sklearn.metrics import classification_report,confusion_matrix
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
#do not show warnings
import warnings
warnings.filterwarnings("ignore")
#import plotly for visualization
import chart_studio.plotly as py
import plotly.offline as pyoff
import plotly.graph_objs as go
#import machine learning related libraries
from sklearn.svm import SVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
#initiate plotly
pyoff.init_notebook_mode()
tx_data = pd.read_excel('Online Retail.xlsx')
#convert date field from string to datetime
tx_data['InvoiceDate'] = pd.to_datetime(tx_data['InvoiceDate'])
#create dataframe with uk data only
tx_uk = tx_data.query("Country=='United Kingdom'").reset_index(drop=True)
#print first 10 rows
tx_data.head(10)
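We will use six months of customer behavior data (2011-03-01 up to, but not including, 2011-09-01) to predict customers' purchases in the following three months (2011-09-01 up to, but not including, 2011-12-01).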
tx_6m = tx_uk[(tx_uk.InvoiceDate < datetime(2011,9,1)) & (tx_uk.InvoiceDate >= datetime(2011,3,1))].reset_index(drop=True)
tx_next = tx_uk[(tx_uk.InvoiceDate >= datetime(2011,9,1)) & (tx_uk.InvoiceDate < datetime(2011,12,1))].reset_index(drop=True)
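The split above relies on half-open date windows, so a transaction stamped exactly at the cutoff lands in only one of the two frames. The following is a minimal sketch of that behavior on toy data. The frame `toy`, the helper `split_windows`, and all of its rows are made up for illustration and are not part of the dataset.

```python
from datetime import datetime

import pandas as pd

# Toy transactions; CustomerID values and timestamps are invented.
toy = pd.DataFrame({
    "CustomerID": [1, 1, 2, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-02-28 10:00",  # before the feature window: dropped
        "2011-08-31 23:59",  # last minute of the feature window
        "2011-09-01 00:00",  # first instant of the target window
        "2011-12-01 00:00",  # exactly at the target window's end: dropped
    ]),
})

feature_start = datetime(2011, 3, 1)
cutoff = datetime(2011, 9, 1)
target_end = datetime(2011, 12, 1)


def split_windows(df):
    """Split df into [feature_start, cutoff) and [cutoff, target_end)."""
    in_features = (df.InvoiceDate >= feature_start) & (df.InvoiceDate < cutoff)
    in_target = (df.InvoiceDate >= cutoff) & (df.InvoiceDate < target_end)
    return (df[in_features].reset_index(drop=True),
            df[in_target].reset_index(drop=True))


toy_6m, toy_next = split_windows(toy)
print(toy_6m)
print(toy_next)
```

Because the feature window ends strictly before the cutoff and the target window starts at it, no transaction can appear in both frames, which keeps information from the prediction period out of the features.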
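We also create a new DataFrame, tx_user, to hold the customer-level feature set. For the prediction target we need, for each customer, the time between their last purchase in the six-month window and their first purchase in the following three months.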
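To make the target concrete, here is a small sketch of the day-difference calculation on two invented customers. The left join, the column name `NextPurchaseDay`, and the toy dates are assumptions made for illustration; the article's own code may join or name things differently.

```python
import pandas as pd

# Invented dates, for illustration only.
last_purchase = pd.DataFrame({
    "CustomerID": [1, 2],
    "MaxPurchaseDate": pd.to_datetime(["2011-08-20", "2011-07-01"]),
})
next_first_purchase = pd.DataFrame({
    "CustomerID": [1],  # customer 2 did not buy in the later window
    "MinPurchaseDate": pd.to_datetime(["2011-09-04"]),
})

# Assumption: a left join, so customers without a later purchase are kept
# and get a missing (NaT) MinPurchaseDate.
dates = pd.merge(last_purchase, next_first_purchase,
                 on="CustomerID", how="left")

# Whole days between the two purchases; NaT propagates to NaN.
dates["NextPurchaseDay"] = (
    dates["MinPurchaseDate"] - dates["MaxPurchaseDate"]
).dt.days

print(dates)
```

Customer 1 gets 15 days (20 August to 4 September), while customer 2 has a missing value. How to treat such customers, who did not return in the prediction window, is a modeling decision that has to be made before training.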
tx_user = pd.DataFrame(tx_6m['CustomerID'].unique())
tx_user.columns = ['CustomerID']
#create a dataframe with customer id and first purchase date in tx_next
tx_next_first_purchase = tx_next.groupby('CustomerID').InvoiceDate.min().reset_index()
tx_next_first_purchase.columns = ['CustomerID','MinPurchaseDate']
#create a dataframe with customer id and last purchase date in tx_6m
tx_last_purchase = tx_6m.groupby('CustomerID').InvoiceDate.max().reset_index()
tx_last_purchase.columns = ['CustomerID','MaxPurchaseDate']
#merge two dataframes
tx_purchase_dates = pd.merge(tx_last_purchase,tx_next_first_purchase