Alibaba Written Test: Data Analysis and Modeling

睡熊猛醒

Article link: https://blog.csdn.net/weixin_41089007/article/details/90452477

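Summary (generated automatically by CSDN): This article describes how the author handled class imbalance in the data analysis and modeling task of an Alibaba written test. The author tried several approaches, including model selection, feature engineering, threshold adjustment, model ensembling, and feature selection, and ultimately obtained good results with a LightGBM model.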

Please read the following text and answer the question.

Field Descriptions:
isbuyer - Past purchaser of product
buy_freq - How many times purchased in the past
visit_freq - How many times visited website in the past
buy_interval - Average time between purchases
sv_interval - Average time between website visits
expected_time_buy - ?
expected_time_visit - ?
last_buy - Days since last purchase.
last_visit - Days since last website visit.
multiple_buy - ?
multiple_visit - ?
uniq_url - Number of unique urls we observed web browser on.
num_checkins - Number of times we observed web browser.
y_buy - Outcome variable of interest, Did they purchase in period of interest.

Question:

Each observation in the provided training/test dataset is a web browser (or cookie) in our observed Universe. The goal is to model the behavior of a future purchase and classify cookies into those that will purchase in the future and those that will not. y_buy is the outcome variable that represents if a cookie made a purchase in the period of interest. All of the rest of the columns in the data set were recorded prior to this purchase and may be used to predict purchase. Please use ‘ads_train.csv’ as training data to create at least two different classes of models (e.g. logistic regression, random forest, etc.) to classify these cookies into future buyers or not. Explain your choice of model, how you did model selection, how you validated the quality of the model, and which variables are most informative of purchase. Also, comment on any general trends or anomalies in the data you can identify as well as propose a meaning for those fields not defined. The deliverable is a document with text and figures illustrating your thought process, how you began to explore the data, and a comparison of the models that you created. When evaluating your models, consider metrics such as AUC of Precision-Recall Curve, precision, recall. This should take about 6 hours and can be done using any programming language or statistical package (R or Python are preferred). Finally, perform prediction on test dataset ‘ads_test.csv’ using your chosen model(s) and report predicted probabilities of future purchase and predicted labels of future purchase.

Please also do include codes with your document (Python Jupyter/R knitr is recommended)

Interpreting the fields

isbuyer - Whether the cookie has purchased the product in the past.

buy_freq - Number of past purchases.

visit_freq - Number of past visits to the website.

buy_interval - Average time between purchases.

sv_interval - Average time between website visits.

expected_time_buy - Undefined in the task. Proposed meaning: expected time of purchase.

expected_time_visit - Undefined in the task. Proposed meaning: expected time of visit.

last_buy - Days since the last purchase.

last_visit - Days since the last website visit.

multiple_buy - Undefined in the task. Proposed meaning: whether the cookie has made multiple purchases before.

multiple_visit - Undefined in the task. Proposed meaning: whether the cookie has visited the website multiple times before.

uniq_url - Number of unique URLs on which the web browser was observed.

num_checkins - Number of times the web browser was observed.

y_buy - Outcome variable of interest: whether the cookie made a purchase during the period of interest.

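1. On the first day I took a quick look at the data and found that the positive and negative classes are extremely imbalanced. I therefore broadened the evaluation metrics, looking at precision (P), recall (R), and F1 separately for the positive and the negative class. I simply filled missing values with 0 and tried several models; the best result came from logistic regression:

As can be seen, not a single positive sample was predicted, so the model learned essentially nothing.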
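The reason for reading P, R, and F1 per class can be illustrated with a minimal sketch. The labels below are made up for illustration (5 buyers among 1000 cookies, echoing the 0.005 positive share mentioned in step 3), and `per_class_report` is a hypothetical helper, not the author's code. A model that predicts "no purchase" for everyone scores very high accuracy while its positive-class recall is zero.

```python
from typing import Dict, Sequence


def per_class_report(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[int, Dict[str, float]]:
    """Precision, recall and F1 computed separately for class 0 and class 1."""
    report = {}
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        # Guard against empty denominators (e.g. a class that is never predicted).
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return report


# Toy labels (assumption, not the real data): 5 buyers among 1000 cookies,
# and a degenerate model that predicts "no purchase" for every cookie.
y_true = [1] * 5 + [0] * 995
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
report = per_class_report(y_true, y_pred)

print(f"accuracy = {accuracy:.3f}")
for cls, metrics in report.items():
    print(f"class {cls}: {metrics}")
```

Overall accuracy hides the failure here; only the per-class view shows that the positive class is never recovered.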

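2. I tried SMOTE to address the class imbalance, with a coefficient of 10 for the positive class and 1 for the negative class. The best result:

As can be seen, although the AUC dropped, the prediction accuracy on positive samples rose slightly, so the method does help.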
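The core idea of SMOTE in step 2 can be sketched as follows: each synthetic positive sample is placed at a random point on the segment between an existing positive sample and one of its nearest positive neighbours. This is a simplified standard-library illustration, not the author's code or a library implementation. The function name `smote_oversample`, the toy points, and the reading of "coefficient 10" as growing the positive class to ten times its size are all assumptions.

```python
import math
import random
from typing import List, Sequence

Point = Sequence[float]


def smote_oversample(minority: List[Point], n_new: int, k: int = 5, seed: int = 0) -> List[List[float]]:
    """Create n_new synthetic minority points by interpolating between neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbours of the base point (the base itself excluded).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy positive samples with two features each (made up for illustration).
minority = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]

# Assumed reading of "coefficient 10": 9 synthetic points per original point,
# so the positive class ends up 10 times its original size.
new_points = smote_oversample(minority, n_new=9 * len(minority), k=3)
print(len(minority) + len(new_points), "positive samples after oversampling")
```

Oversampling of this kind would normally be applied to the training split only, so that the evaluation data keeps the original class ratio.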

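3. I used classification threshold moving to further handle the class imbalance. Taking into account both the formula p = m/(m+n) = 0.005 and the model's own predictive ability, I set the threshold to 0.1. The results are as follows: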
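As an aside on the mechanics of step 3 (this is an illustration, not the author's results), threshold moving can be sketched as below. It assumes that m and n in the formula are the numbers of positive and negative training samples, which the text does not state explicitly. The counts, the scores, and the helper names `prior_threshold` and `apply_threshold` are hypothetical.

```python
from typing import List, Sequence


def prior_threshold(n_pos: int, n_neg: int) -> float:
    """Threshold m / (m + n): the share of positives in the training set (assumed reading)."""
    return n_pos / (n_pos + n_neg)


def apply_threshold(probabilities: Sequence[float], threshold: float) -> List[int]:
    """Label a sample positive when its predicted probability reaches the threshold."""
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical counts chosen only so that the ratio matches the 0.005 in the text.
p = prior_threshold(n_pos=5, n_neg=995)

# Made-up predicted purchase probabilities for five cookies.
scores = [0.002, 0.03, 0.12, 0.45, 0.08]

default_labels = apply_threshold(scores, 0.5)  # usual cut-off
prior_labels = apply_threshold(scores, p)      # cut-off at the class prior
chosen_labels = apply_threshold(scores, 0.1)   # compromise value used in step 3

print("threshold 0.5  :", default_labels)
print(f"threshold {p:.3f}:", prior_labels)
print("threshold 0.1  :", chosen_labels)
```

On these toy scores the default 0.5 cut-off flags nobody, the prior-based cut-off flags almost everyone, and 0.1 sits in between, which is the kind of trade-off the text describes when it weighs the formula against the model's predictive ability.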
