https://www.kaggle.com/c/ga-customer-revenue-prediction
A competition I entered back in 2018; looking back now at the ups and downs.
At the time, one line of code was enough to reach the top ten of the leaderboard, though by now it has of course fallen past 100th place.
import pandas as pd; import numpy as np; pd.read_csv('test.csv', usecols=[4, 8], dtype={
'fullVisitorId': 'str'}, converters={
'totals': lambda t: float(dict(eval(t))['transactionRevenue']) if 'transactionRevenue' in t else 0}).groupby(['fullVisitorId'])['totals'].sum().to_frame(name='PredictedLogRevenue').apply(np.log1p).to_csv('sub.csv')
Overview
The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. Marketing teams are therefore challenged to make appropriate investments in promotional strategies.
RStudio, the developer of free and open tools for R and enterprise-ready products for teams to scale and share work, has partnered with Google Cloud and Kaggle to demonstrate the business impact that thorough data analysis can have.
In this competition, you are challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer. The hope is that the outcome will be more actionable operational changes and better use of marketing budgets for companies that choose to apply data analysis on top of Google Analytics data.
Evaluation
Root Mean Squared Error (RMSE)
Submissions are scored on root mean squared error. RMSE is defined as:
\textrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
where \hat{y}_i is the natural log of the predicted revenue for a customer, and y_i is the natural log of the customer's actual summed revenue value plus one.
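Concretely, the metric can be computed on log-transformed revenue like this (a minimal sketch with made-up numbers; the function name is mine, not the official scoring code):

```python
import numpy as np

def rmse_log_revenue(actual_revenue, predicted_log_revenue):
    """RMSE between predicted log revenue and log1p of actual revenue."""
    y = np.log1p(np.asarray(actual_revenue, dtype=float))   # log(actual + 1)
    y_hat = np.asarray(predicted_log_revenue, dtype=float)  # already in log space
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Example: predicting zero for customers who spent nothing gives a perfect score.
print(rmse_log_revenue([0.0, 0.0], [0.0, 0.0]))  # 0.0
```

Note that most customers never buy anything, so the all-zeros baseline above is surprisingly hard to beat — which is exactly why the one-liner at the top of this post scored well.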
Submission File
For each fullVisitorId in the test set, you must predict the natural log of their total revenue in the PredictedLogRevenue column. The submission file should contain a header and have the following format:
fullVisitorId,PredictedLogRevenue
0000000259678714014,0
0000049363351866189,0
0000053049821714864,0
etc.
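A submission in this shape can be assembled with pandas (a sketch using dummy predictions; in practice the values come from your model):

```python
import pandas as pd

# Hypothetical per-user predictions, matching the sample rows above.
sub = pd.DataFrame({
    "fullVisitorId": ["0000000259678714014", "0000049363351866189"],
    "PredictedLogRevenue": [0.0, 0.0],
})
# fullVisitorId must stay a string: reading it as an integer would
# silently drop the leading zeros and corrupt the IDs.
sub.to_csv("sub.csv", index=False)
```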
Prizes
That covers the competition's introduction to the data; you can visit the URL at the top of this post for the full details.
Code
Below is a piece of code showing how to analyze the data. It was developed in a Jupyter notebook and is provided for reference only.
Objective of the notebook:
In this notebook, let us explore the given dataset and make some inferences along the way. Also finally we will build a baseline light gbm model to get started.
Objective of the competition:
In this competition, we are challenged to analyze a Google Merchandise Store (also known as GStore, where Google swag is sold) customer dataset to predict revenue per customer.
import os
import json
import numpy as np
import pandas as pd
from pandas import json_normalize  # pandas.io.json.json_normalize is deprecated in recent pandas
import matplotlib.pyplot as plt
import seaborn as sns
color = sns.color_palette()
%matplotlib inline
from plotly import tools
import plotly.offline as py
py.init_notebook_mode(connected=True)
import plotly.graph_objs as go
from sklearn import model_selection, preprocessing, metrics
import lightgbm as lgb
pd.options.mode.chained_assignment = None
pd.options.display.max_columns = 999
About the dataset:
Similar to most other Kaggle competitions, we are given two datasets:
train.csv
test.csv
Each row in the dataset is one visit to the store. We are predicting the natural log of the sum of all transactions per user.
The data fields in the given files are
fullVisitorId - A unique identifier for each user of the Google Merchandise Store.
channelGrouping - The channel via which the user came to the Store.
date - The date on which the user visited the Store.
device - The specifications for the device used to access the Store.
geoNetwork - This section contains information about the geography of the user.
sessionId - A unique identifier for this visit to the store.
socialEngagementType - Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
totals - This section contains aggregate values across the session.
trafficSource - This section contains information about the Traffic Source from which the session originated.
visitId - An identifier for this session. This is part of the value usually stored as the _utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
visitNumber - The session number for this user. If this is the first session, then this is set to 1.
visitStartTime - The timestamp (expressed as POSIX time).
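Since visitId alone is only unique per user, a fully unique session key can be built by concatenating the two fields, as the description suggests (a small illustration with toy IDs, not part of the original notebook):

```python
import pandas as pd

# Toy frame: two different users can share the same visitId.
sessions = pd.DataFrame({
    "fullVisitorId": ["111", "222"],
    "visitId": ["1472830385", "1472830385"],
})
# Combine the two columns to get an ID that is unique across all users.
sessions["uniqueSessionId"] = sessions["fullVisitorId"] + "_" + sessions["visitId"]
print(sessions["uniqueSessionId"].tolist())  # ['111_1472830385', '222_1472830385']
```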
Also it is important to note that some of the fields are in json format.
Thanks to this wonderful kernel by Julian, we can convert all the json fields in the file to the flattened csv format that is generally used in other competitions.
def load_df(csv_path='train.csv', nrows=None):
    JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']
    df = pd.read_csv(csv_path,
                     converters={column: json.loads for column in JSON_COLUMNS},
                     dtype={'fullVisitorId': 'str'},  # Important!!
                     nrows=nrows)
    for column in JSON_COLUMNS:
        column_as_df = json_normalize(df[column])
        column_as_df.columns = [f"{column}.{subcolumn}" for subcolumn in column_as_df.columns]
        df = df.drop(column, axis=1).merge(column_as_df, right_index=True, left_index=True)
    print(f"Loaded {os.path.basename(csv_path)}. Shape: {df.shape}")
    return df
%%time
train_df = load_df()
test_df = load_df("test.csv")
train_df
Target Variable Exploration:
Since we are predicting the natural log of sum of all transactions of the user, let us sum up the transaction revenue at user level and take a log and then do a scatter plot.
train_df["totals.transactionRevenue"] = train_df["totals.transactionRevenue"].astype('float')
gdf = train_df.groupby("fullVisitorId")["totals.transactionRevenue"].sum().reset_index()
plt.figure(figsize=(8,6))
plt.scatter(range(gdf.shape[0]), np.sort(np.log1p(gdf["totals.transactionRevenue"].values)))
plt.xlabel('index', fontsize=12)
plt.ylabel('TransactionRevenue', fontsize=12)
plt.show()
Wow, this confirms the first two lines of the competition overview:
- The 80/20 rule has proven true for many businesses–only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.
In fact, in this case the ratio is even more skewed.
nzi = pd.notnull(