Capital One TPS Notes

Latest recommended article published on 2024-09-25 11:59:15

weixin_30578677

Tags: big data, python

Original link: http://www.cnblogs.com/ffeng0312/p/10275071.html

Credit Card Fraud Detection (asked 7 times from 2015 to 2017)

What machine learning model would you use to classify fraudulent transactions on credit cards?

feature selection

How would you apply a classification method, and which one is a good choice? A follow-up question asks which method is the least useful.

bias variance trade off - What does regularization do?

Missing target - what do you do if the target value itself is missing?

false positive/false negative - Are false positives or false negatives more important? What is the effect of FP and FN?
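One way to make the FP/FN question concrete is to attach a cost to each error type and compare decision thresholds. The sketch below is only an illustration: the toy labels and scores and the cost values (`cost_fp`, `cost_fn`) are hypothetical and not part of the original question.

```python
def confusion_counts(y_true, scores, threshold):
    """Count (TP, FP, TN, FN) when scores >= threshold are flagged as fraud."""
    tp = fp = tn = fn = 0
    for y, s in zip(y_true, scores):
        flagged = s >= threshold
        if flagged and y == 1:
            tp += 1
        elif flagged and y == 0:
            fp += 1
        elif not flagged and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Cost of errors: an FP blocks a legitimate purchase, an FN misses a fraud."""
    _, fp, _, fn = confusion_counts(y_true, scores, threshold)
    return fp * cost_fp + fn * cost_fn


# Hypothetical toy data: 1 = fraud, 0 = legitimate.
y_true = [0, 0, 0, 0, 1, 0, 1, 0]
scores = [0.1, 0.2, 0.4, 0.6, 0.7, 0.3, 0.35, 0.05]

# Assumed costs: a missed fraud is far more expensive than a false alarm.
for threshold in (0.3, 0.5):
    print(threshold, total_cost(y_true, scores, threshold, cost_fp=5, cost_fn=100))
```

With these assumed costs the lower threshold is cheaper even though it raises more false alarms. If false alarms were the expensive error, the comparison could go the other way.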

What is VIF (in regression output)?
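For reference, the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. A minimal sketch with NumPy follows. It uses the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix, and the data is synthetic.

```python
import numpy as np


def vif(X):
    """Variance inflation factor for each column of X (rows = observations).

    VIF_j is the j-th diagonal entry of the inverse correlation matrix, which
    equals 1 / (1 - R_j^2) from regressing column j on the other columns.
    """
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


# Synthetic example: x3 is almost a linear combination of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
x3 = x1 + x2 + rng.normal(scale=0.1, size=500)
X = np.column_stack([x1, x2, x3])

print(vif(X))  # large values flag the collinear columns
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic multicollinearity, although the cutoff is a convention rather than a hard rule.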

potential issues

exploratory analysis and data cleaning

How would you handle missing or garbage data?
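One possible answer for numeric columns is to impute with the median, which is less sensitive to extreme values than the mean, and to keep a flag recording which values were missing. The sketch below assumes missing values are represented as `None`; the `income` column is a made-up example.

```python
from statistics import median


def impute_with_indicator(values):
    """Replace None with the median of the observed values.

    Also returns a 0/1 list marking which entries were originally missing,
    so a model can still learn from the missingness itself.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return filled, was_missing


# Hypothetical column with two gaps and one extreme value.
income = [40, 52, None, 61, 1000, None, 48]
filled, was_missing = impute_with_indicator(income)
print(filled)
print(was_missing)
```

Here the mean of the observed values would be pulled up to about 240 by the single extreme entry, which is one case where mean imputation is a poor choice.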

How would you use existing features to add new features?

Logistic regression, random forests

Difference between random forest and gradient boosted tree.

Anomaly detection/novelty detection techniques may also be helpful because of the huge class imbalance that normally exists in such scenarios.
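As a very small illustration of the anomaly detection idea, the sketch below flags values with a large modified z-score based on the median absolute deviation (MAD). This is a univariate heuristic swapped in for illustration, not a full anomaly detector. The cutoff of 3.5 is a common rule of thumb and the transaction amounts are hypothetical.

```python
from statistics import median


def mad_outliers(values, cutoff=3.5):
    """Return the indices whose modified z-score exceeds the cutoff.

    Modified z-score = 0.6745 * (x - median) / MAD, where MAD is the median
    absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > cutoff]


# Hypothetical transaction amounts for one card.
amounts = [12, 15, 14, 13, 16, 15, 950, 14]
print(mad_outliers(amounts))
```

Because the median and MAD are barely moved by the extreme value, the unusual amount stands out clearly, whereas a mean and standard deviation would be distorted by it.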

The interviewer asked about many possible problems with the model and how you would deal with them when time is limited.

A couple of things to keep in mind regarding fraud:
1) You're dealing with an imbalanced data set (your fraud cases may be 3-5% of all your data). So, consider either oversampling or giving higher weight to your fraud cases.
2) Your data may not have all the true fraud cases. In other words, there may be actual fraud cases not captured in your data. So, some form of anomaly detection may be needed.
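The two options in point 1 can be sketched in plain Python. The weight formula below, n_samples / (n_classes * class_count), is the one scikit-learn uses for `class_weight="balanced"`. The helper names and the 4% fraud rate are assumptions chosen for illustration.

```python
import random
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def random_oversample(rows, labels, minority=1, seed=0):
    """Duplicate random minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    majority_count = sum(1 for y in labels if y != minority)
    extra = [rng.choice(minority_rows)
             for _ in range(majority_count - len(minority_rows))]
    return rows + extra, labels + [minority] * len(extra)


# Hypothetical data set: 4% fraud (label 1), 96% legitimate (label 0).
labels = [0] * 96 + [1] * 4
rows = [[i] for i in range(100)]

weights = balanced_class_weights(labels)
new_rows, new_labels = random_oversample(rows, labels)
print(weights)
print(Counter(new_labels))
```

If oversampling is used, it should be applied to the training split only, so that duplicated fraud rows do not leak into the validation data.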

Predicting whether a customer will close their credit card (asked 3 times in 2018)

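Suppose you are given a set of datasets, such as one year of credit card transaction records and customers' personal information. The bank wants to predict whether a customer will close their account within one month; if so, it plans to offer those customers some cashback rewards to retain them. You are asked to build a model that predicts account closure. These were the interviewer's questions:

1. Which features would you choose? (It seemed that anything relevant was acceptable. Follow-up: if the data is a pile of transaction dates and the like, how would you rebuild features from it?)
2. How would you do data cleaning?
a.       How do you detect outliers?
b.       How do you fill in missing data? (I said you could fill with a constant such as the mean; he then asked when filling with the mean is inappropriate and what would work better.)
c.       What if the target value is also missing?
3. Which model would you choose? (I said decision tree; he then asked whether there were other models, what their pros and cons are, and what the target is. The target should be a binary value, whether the customer will close the account in one month; if a regression gives a value between 0 and 1, it represents how likely that is.)
4. How do you evaluate the model's performance, and with which package?
5. If the data is very large, say 1 TB, how do you sample it, and with which package?
6. If the model is inaccurate, what losses would that cause the bank?
7. If the model predicts a set of target values, how should rewards be handed out based on them? (I said to plot the distribution and give rewards to the top few percent of customers most likely to close. He asked what other approaches there are; I was not sure whether this was testing modeling or business sense.)
8. The last one was an open question identical to one I had seen on the forum: two people both have a 5000 limit, but one uses 100% of it and the other only 2%. Could both of them close their accounts within a month? The interviewer probably wants to see whether your first reaction is to think about the model or about other aspects.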
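For the follow-up in question 1 about rebuilding features from raw transaction dates, one common approach is to summarise each customer's history into recency and frequency style features as of a snapshot date. The sketch below is an assumption about what such features could look like. The function name, feature names, and the 30-day window are all hypothetical.

```python
from datetime import date


def transaction_features(txn_dates, amounts, as_of, window_days=30):
    """Summarise one customer's transactions as of a snapshot date.

    Transactions dated after `as_of` are ignored so that the features never
    use information from the prediction window (target leakage).
    """
    past = [(d, a) for d, a in zip(txn_dates, amounts) if d <= as_of]
    recent = [a for d, a in past if (as_of - d).days < window_days]
    last_date = max((d for d, _ in past), default=None)
    return {
        "days_since_last_txn": (as_of - last_date).days if last_date else None,
        "txn_count_recent": len(recent),
        "spend_recent": sum(recent),
        "txn_count_total": len(past),
    }


# Hypothetical transactions for one customer.
txn_dates = [date(2018, 1, 5), date(2018, 3, 10), date(2018, 3, 28), date(2018, 4, 15)]
amounts = [20.0, 35.5, 12.0, 99.0]

features = transaction_features(txn_dates, amounts, as_of=date(2018, 4, 1))
print(features)
```

The last transaction falls after the snapshot date, so it is excluded. In a churn setting it belongs to the period the model is trying to predict.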

Every step from feature engineering through to final model tuning and validation.

How you built the model, which parameters you used, what the results were, and why you chose this model.

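credit card churn model
   1. Feature engineering, e.g. computing tenure from the start date, and so on
   2. Missing values
   3. Which model would you use, and why?
   4. The data volume has now grown; what do you do? Spark. If you had to choose, would you use RSpark or PySpark, and why?
   5. The model has now produced its output: a customer with 0% credit limit utilization and a customer with 95% utilization are both high risk and both likely to close their cards soon. How would you handle this? I answered that the churn model is the starting point; typically the marketing department designs a retention program based on the churn model's results. These two types of at-risk customers need different incentive plans.
         1) Customers with 0% utilization are basically hard to win back.
         2) Customers with 95% utilization can most likely be retained by lowering the interest rate, increasing cashback, and so on.
         3) Based on test results, you can additionally build an uplift model to see which high-churn users can be won back, and focus the treatment on them.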
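The uplift idea in the last point can be illustrated with the simplest possible estimate: within each segment, compare the retention rate of customers who received the offer against a control group who did not. This is a rough sketch rather than a full uplift model. The segment names and the test records are invented to mirror the answer above and are not real results.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """records: iterable of (segment, treated, retained) with 0/1 flags.

    Returns {segment: retention_rate(treated) - retention_rate(control)}.
    """
    # stats[segment][treated] = [retained_count, total_count]
    stats = defaultdict(lambda: {0: [0, 0], 1: [0, 0]})
    for segment, treated, retained in records:
        stats[segment][treated][0] += retained
        stats[segment][treated][1] += 1
    uplift = {}
    for segment, groups in stats.items():
        if groups[0][1] == 0 or groups[1][1] == 0:
            continue  # both arms are needed for a comparison
        uplift[segment] = (groups[1][0] / groups[1][1]
                           - groups[0][0] / groups[0][1])
    return uplift


# Hypothetical retention test: treated customers received a cashback offer.
records = (
    [("util_0pct", 1, r) for r in (1, 0, 0, 0)]
    + [("util_0pct", 0, r) for r in (1, 0, 0, 0)]
    + [("util_95pct", 1, r) for r in (1, 1, 1, 0)]
    + [("util_95pct", 0, r) for r in (1, 0, 0, 0)]
)
print(uplift_by_segment(records))
```

In this made-up data both segments have the same churn risk without treatment, but only one responds to the offer, which is the distinction a churn model alone cannot make.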

Tell me some useful packages you use in R/Python.
How do you detect multicollinearity?
How do you join two data sets?
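For the join question, the underlying idea can be shown without any package: index one data set by the key, then look up each row of the other. The sketch below is a hand-rolled left join over lists of dicts with hypothetical column names. In pandas the same operation is usually written with `DataFrame.merge(..., how="left")`.

```python
def left_join(left, right, key):
    """Left join two lists of dicts on `key`.

    Every left row is kept. Unmatched rows get None for the right-hand
    columns, and a left row with several matches is repeated once per match.
    """
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    right_cols = {col for row in right for col in row if col != key}
    empty = dict.fromkeys(right_cols)  # all right-hand columns set to None
    joined = []
    for row in left:
        for match in index.get(row[key], [empty]):
            joined.append({**match, **row})  # left values win on conflicts
    return joined


# Hypothetical tables keyed by customer_id.
customers = [
    {"customer_id": 1, "tenure_months": 24},
    {"customer_id": 2, "tenure_months": 3},
]
transactions = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 35.5},
]
for row in left_join(customers, transactions, "customer_id"):
    print(row)
```

A point worth raising in an interview is that one-to-many keys multiply rows, as customer 1 does here, so row counts should be checked after the join.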

Other questions:

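Our server running cost is xxx, other fixed costs are xxx, and the server can handle xxx TB of traffic. We have about xxx customers, and each customer pays us a server usage fee of xxx/month. We allocate xxx GB to each user, but on average each user only uses xx% of it, so we can use the remaining space to take on more customers. Question: what is the annual profit? There is also another option, server B, whose cost is xxx and capacity is xxx... Weigh the two and decide whether we should replace the existing server with server B.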
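The arithmetic behind this case can be sketched as two small functions. The original figures are not given, so every number below is a hypothetical placeholder. The sketch also assumes that costs are annual, that capacity is the only constraint, and that there is enough demand to fill it.

```python
def max_customers(capacity_gb, allocated_gb, avg_usage_fraction):
    """Customers that fit when only the average used share of each
    allocation actually consumes capacity (oversubscription)."""
    return int(capacity_gb // (allocated_gb * avg_usage_fraction))


def annual_profit(customers, monthly_fee, annual_server_cost, annual_fixed_cost):
    """Yearly revenue from fees minus yearly server and fixed costs."""
    return customers * monthly_fee * 12 - annual_server_cost - annual_fixed_cost


# Hypothetical inputs, NOT from the original question.
ALLOCATED_GB, USAGE, FEE, FIXED = 10, 0.25, 5, 50_000

current = annual_profit(max_customers(10_000, ALLOCATED_GB, USAGE),
                        FEE, annual_server_cost=100_000, annual_fixed_cost=FIXED)
server_b = annual_profit(max_customers(20_000, ALLOCATED_GB, USAGE),
                         FEE, annual_server_cost=180_000, annual_fixed_cost=FIXED)
print(current, server_b)
```

With these placeholder numbers server B looks better, but only if the extra capacity can actually be sold. If the customer base stays the same, the higher server cost simply lowers profit, which is the trade-off the question is probing.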

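The question: a sporting goods retailer comes to you to optimize their online ad bidding system and improve the response rate. Assume the data you have is visit data for 3,000,000 users, each row has more than 150 columns, and the overall response rate is known to be 1/1000. The questions asked were:
1. What would you choose as the target?
Response or not
2. Which metrics would you choose?
AUC-ROC
3. How do you handle NA?
It depends. If the NA is meaningful, leave it as is. If the NA comes from a data extraction problem, fill it using a simple if-else rule, the mean or median, or a regression.
4. How do you do feature engineering?
Encode categorical variables, and use 'groupby' with 'mean/median/std' to generate some features
5. What if the data volume is especially large?
MapReduce, but I had not used it, so I gave an example of local parallel optimization: how to distribute the data across threads, and then how to collect and merge the results.
6. Which model would you use?
GBDT, lightGBM/XGB
7. How do you evaluate model performance?
k-fold CV
8. What do you do about overfitting/underfitting?
I discussed each case separately: try to obtain more data and tune the hyper-parameters.
9. If the model's predictions go wrong, what is the impact?
I discussed case by case how things would change overall and what the effect on an individual user would be.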
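On the choice of AUC-ROC: with a 1/1000 response rate, a model that always predicts "no response" is 99.9% accurate, so accuracy says little. AUC instead measures how well the scores rank responders above non-responders. A minimal sketch of that definition follows, with toy labels and scores that are hypothetical.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half. This pairwise form is O(n_pos * n_neg), which is fine
    for a small illustration but not for millions of rows.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = responded, 0 = did not respond.
y_true = [0, 0, 1, 0, 1, 0]
scores = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]
print(roc_auc(y_true, scores))
```

In practice a library routine would be used instead, and with so few positives it may be worth reporting a precision-recall based metric alongside AUC.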