User Purchase Behavior Prediction: A Machine Learning Implementation
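Exploratory Data Analysis (EDA)
Whether at work or in a data science competition, the first step with any new dataset is exploratory data analysis (EDA). The dataset for this user purchase prediction task is typical time series data, the kind produced by dated business transaction records. Similar scenarios include forecasting future product sales and predicting whether a piece of equipment will fail.
For time series data, EDA can generally start from the following aspects:
- The overall size of the dataset
- Whether it contains dirty data such as missing values or outliers
- Trend characteristics of the data
- The distribution of the data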
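The four checks above can be sketched on a toy table. This is only an illustration under stated assumptions, not the tutorial's own code: the function name `quick_eda`, the column names `pay_time` and `amount`, and the 1.5 × IQR outlier rule are all hypothetical choices made for the example.

```python
import numpy as np
import pandas as pd


def quick_eda(df: pd.DataFrame, time_col: str, value_col: str) -> dict:
    """Collect size, missing values, outliers, trend and skewness for one table."""
    ts = pd.to_datetime(df[time_col])
    values = df[value_col]

    # Trend: total value per calendar month (NaN values are skipped by sum()).
    monthly_total = values.groupby(ts.dt.to_period("M")).sum()

    # Outliers: values beyond 1.5 * IQR from the quartiles (a common rule of thumb).
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_outlier = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

    return {
        "shape": df.shape,                                  # overall size
        "missing_per_column": df.isna().sum().to_dict(),    # dirty data: missing values
        "n_outliers": int(is_outlier.sum()),                # dirty data: outliers
        "monthly_total": monthly_total,                     # trend
        "skewness": float(values.skew()),                   # distribution
    }


# Hypothetical toy data; the real dataset's column names may differ.
demo = pd.DataFrame({
    "pay_time": ["2013-01-05", "2013-01-20", "2013-02-03", "2013-02-14", "2013-03-01"],
    "amount": [10.0, 12.0, np.nan, 11.0, 500.0],
})
report = quick_eda(demo, "pay_time", "amount")
print(report["shape"], report["missing_per_column"], report["n_outliers"])
print(report["monthly_total"])
```

On a real dataset the same report gives a quick first impression before any plotting is done.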
Importing packages
import numpy as np # linear algebra
import pandas as pd
from datetime import datetime, date, timedelta
from scipy.stats import skew # for some statistics
from scipy.special import boxcox1p
from scipy.stats import boxcox_normmax
from sklearn.linear_model import ElasticNetCV, LassoCV, RidgeCV, Ridge, Lasso, ElasticNet
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, RandomForestClassifier
from sklearn.feature_selection import mutual_info_regression
from sklearn.svm import SVR, LinearSVC
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import RobustScaler, LabelEncoder
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import os
import re
import seaborn as sns
import matplotlib.pyplot as plt
import time
from itertools import product
import datetime as dt
import calendar
import gc
RANDOM_SEED = 42
# Load a font that can render Chinese characters in plots
from matplotlib.font_manager import FontProperties
myfont = FontProperties(fname="/home/aistudio/NotoSansCJKsc-Light.otf", size=12)
Loading and analyzing the dataset
Loading the dataset
PATH = './data/data19383/'
train = pd.read_csv(PATH + 'train.csv')
test = pd.read_csv(PATH + 'submission.csv').set_index('customer_id')
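For very large files, it is worth checking memory usage first:
# Memory used by the training and test sets
mem_train = train.memory_usage(index=True).sum()
mem_test = test.memory_usage(index=True).sum()
print("Training set memory usage: " + str(mem_train / 1024**2) + " MB")
print("Test set memory usage: " + str(mem_test / 1024**2) + " MB")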
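By default, `memory_usage` counts only the pointer size for object (string) columns, not the strings themselves; passing `deep=True` measures their actual storage. A minimal sketch of a reusable helper follows. The name `memory_mb` and the toy DataFrame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def memory_mb(df: pd.DataFrame, deep: bool = True) -> float:
    """Return the DataFrame's memory footprint in MB, index included."""
    return df.memory_usage(index=True, deep=deep).sum() / 1024**2


# Hypothetical toy table with one integer column and one string column.
demo = pd.DataFrame({
    "customer_id": np.arange(1000, dtype=np.int64),
    "note": ["paid"] * 1000,
})
shallow = memory_mb(demo, deep=False)
deep = memory_mb(demo, deep=True)
print(f"shallow: {shallow:.4f} MB, deep: {deep:.4f} MB")
```

The deep figure is slower to compute on large tables, but it is the more realistic one when the data contains many string columns.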
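Next, a basic overview of the dataset:
train.info()
# summary_stats_table is a helper whose definition is not shown in this section
summary_stats_table(train)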
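The definition of `summary_stats_table` does not appear in this section, so the call above fails unless the helper is defined beforehand. One possible stand-in is sketched below. The columns it reports are an assumption and may differ from what the original helper produced.

```python
import numpy as np
import pandas as pd


def summary_stats_table(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: one row per column with dtype, missing and unique counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": (df.isna().mean() * 100).round(2),
        "n_unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Toy data for illustration only.
demo = pd.DataFrame({"a": [1, 2, 3, 3], "b": [0.5, np.nan, 0.5, 1.5]})
table = summary_stats_table(demo)
print(table)
```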
Next, we convert some columns to smaller data types to reduce memory usage:
# Cast the ID columns
train['order_detail_id'] = train['order_detail_id'].astype(np.uint32)
train['order_id'] = train['order_id'].astype(np.uint32)
train['customer_id'] = train['customer_id'].astype(np.uint32)
train['goods_id'] = train['goods_id'].astype(np.uint32)
train['goods_class_id'] = train['goods_class_id'].astype(np.uint32)
train['member_id'] = train['member_id'].astype(np.uint32)
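A direct `astype(np.uint32)` assumes every ID is present, non-negative, and below 2**32. The sketch below checks those assumptions before casting; the helper name `to_uint32_checked` and the sample values are hypothetical, and the tutorial itself casts directly.

```python
import numpy as np
import pandas as pd


def to_uint32_checked(series: pd.Series) -> pd.Series:
    """Cast to uint32 only if all values are present, whole numbers, and in range."""
    if series.isna().any():
        raise ValueError(f"{series.name}: contains missing values")
    limit = np.iinfo(np.uint32).max
    if series.min() < 0 or series.max() > limit:
        raise ValueError(f"{series.name}: values fall outside [0, {limit}]")
    if (series % 1 != 0).any():
        raise ValueError(f"{series.name}: contains non-integer values")
    return series.astype(np.uint32)


# Toy ID column; pandas stores Python ints as int64 by default.
ids = pd.Series([1001, 1002, 1003], name="customer_id")
compact = to_uint32_checked(ids)
print(ids.memory_usage(index=False), "->", compact.memory_usage(index=False), "bytes")
```

Going from 64-bit to 32-bit integers halves the storage of each ID column, which is where most of the saving in this step comes from.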
# Process the status column and handle null values at the same time, setting them to 0
train['order_status'] = train['order_status'