Data Analysis Notes

Notes on common data analysis routines.

1. Data cleaning

  • One-hot encoding: for discrete variables
    
  • Chi-square binning + WOE encoding: for continuous features https://zhuanlan.zhihu.com/p/146476834
    

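Chi-square binning + WOE encoding
"It can turn a non-linear feature into a linear one." For example, in a risk-control setting we might use a customer's age as a feature. Risk does not simply rise with age, nor simply fall with age; rather, some age band carries more risk than the others. Binning age and replacing each bin with its WOE value lets a linear model pick up that pattern.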
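A minimal sketch of WOE encoding on the age example above. The data are synthetic, the bin edges are hand-picked rather than found by chi-square merging, and the sign convention WOE = ln(bad share / good share) plus the 0.5 smoothing constant are assumptions (some references use the inverse ratio). The helper name `woe_table` is hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (assumption): 1 = default ("bad"), 0 = no default ("good").
df = pd.DataFrame({
    "age":     [20, 22, 24, 28, 30, 33, 40, 45, 48, 55, 60, 65],
    "default": [1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

# Hand-picked bin edges; a chi-square binning step would normally supply these.
edges = [18, 25, 35, 50, 70]
labels = ["18-25", "26-35", "36-50", "51-70"]
df["age_bin"] = pd.cut(df["age"], bins=edges, labels=labels).astype(str)


def woe_table(bins: pd.Series, target: pd.Series, eps: float = 0.5) -> pd.DataFrame:
    """Per-bin WOE = ln(share of bads in bin / share of goods in bin).

    eps is added to every count so that a bin with zero goods or bads
    does not produce an infinite WOE.
    """
    counts = pd.crosstab(bins, target).reindex(columns=[0, 1], fill_value=0)
    good = counts[0] + eps
    bad = counts[1] + eps
    woe = np.log((bad / bad.sum()) / (good / good.sum()))
    return pd.DataFrame({"good": counts[0], "bad": counts[1], "woe": woe})


table = woe_table(df["age_bin"], df["default"])

# Replace each age by the WOE of its bin; this column is what a linear model would see.
df["age_woe"] = df["age_bin"].map(table["woe"])
print(table)
```

In this toy data the youngest and oldest bins get a positive WOE and the 26-35 bin a negative one, so the non-monotonic relation between age and risk becomes a single numeric feature.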

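  • z-normalize (z-score): for continuous variables
    
  • min-max normalization: for continuous variables
    
  • Binarization: for continuous variables (group the continuous values and convert them into a 0/1 categorical variable)
    
  • z-normalize within age groups: for continuous variables
    Use this when most variables are related to age, or when a variable's histogram splits into separate, disconnected clumps.
    e.g. if x1 belongs to Agegroup6: x1 -> (x1 - mean(xi, xi in Agegroup6)) / std(xi, xi in Agegroup6)
    After z-normalizing the continuous variables by age group, remember to min-max normalize the age variable itself and to binarize all categorical variables.
    
  • Binning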
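
The normalization items in the list above can be sketched with pandas as follows. The column names, the toy values, and the binarization threshold are assumptions made for illustration; note that pandas `std()` uses the sample standard deviation (ddof=1), which may differ from other tools.

```python
import pandas as pd

# Synthetic toy data (assumption): two age groups with very different x1 scales.
df = pd.DataFrame({
    "age":       [23, 27, 25, 61, 65, 69],
    "age_group": ["20s", "20s", "20s", "60s", "60s", "60s"],
    "x1":        [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "sex":       ["F", "M", "F", "M", "M", "F"],
})

# Global z-normalize: subtract the overall mean, divide by the overall std.
df["x1_z"] = (df["x1"] - df["x1"].mean()) / df["x1"].std()

# z-normalize within age groups: each value uses the mean/std of its own group.
grouped = df.groupby("age_group")["x1"]
df["x1_z_by_age"] = (df["x1"] - grouped.transform("mean")) / grouped.transform("std")

# The age variable itself is min-max normalized to [0, 1].
age_min, age_max = df["age"].min(), df["age"].max()
df["age_minmax"] = (df["age"] - age_min) / (age_max - age_min)

# Categorical variables are binarized to 0/1.
df["sex_M"] = (df["sex"] == "M").astype(int)

# Binarizing a continuous variable with a hypothetical threshold of 5.0.
df["x1_high"] = (df["x1"] > 5.0).astype(int)

print(df[["x1_z", "x1_z_by_age", "age_minmax", "sex_M", "x1_high"]])
```

With the group-wise version, both age groups map to the same values (-1, 0, 1) even though their raw scales differ by a factor of ten, which is the point of normalizing inside each group.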
    

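Binning:
Equal-width, equal-frequency, chi-square, and minimum-entropy binning: https://cloud.tencent.com/developer/article/1388206
KS binning: https://blog.csdn.net/hxcaifly/article/details/84593770
Other (see the comments): missing values and binning: https://blog.csdn.net/happy5205205/article/details/95062467; code: https://zhuanlan.zhihu.com/p/355796708

Tree-based binning: https://blog.csdn.net/fulk6667g78o8/article/details/120318104

Chi-square binning and tree-based binning are supervised. Binning the training set yields a sequence of adjacent intervals for the feature, each with a lower and an upper bound, so the bins do not overlap. These bounds can be applied directly to bin the same feature in the test set.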
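
A minimal sketch of the "fit the edges on the training set, reuse them on the test set" step. It uses the unsupervised equal-width and equal-frequency methods from pandas only because they need no extra code; edges produced by chi-square or tree-based binning could be passed to the same helper. The helper name `apply_edges` and the choice to open the outer bins to infinity (so unseen test values still land in a bin) are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (assumption): one outlier in train, out-of-range values in test.
train = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], name="x")
test = pd.Series([0, 4.5, 50, 1000], name="x")

# Equal-width binning: 3 intervals of equal length over the training range.
_, width_edges = pd.cut(train, bins=3, retbins=True)

# Equal-frequency binning: 3 intervals holding roughly the same number of rows.
_, freq_edges = pd.qcut(train, q=3, retbins=True)


def apply_edges(values: pd.Series, edges) -> pd.Series:
    """Bin values with edges learned elsewhere; returns integer bin indices."""
    edges = np.asarray(edges, dtype=float).copy()
    # Open the first and last bin so values outside the training range are kept.
    edges[0], edges[-1] = -np.inf, np.inf
    return pd.cut(values, bins=edges, labels=False)


train_bins = apply_edges(train, freq_edges)
test_bins = apply_edges(test, freq_edges)

print("equal-width edges:    ", width_edges)
print("equal-frequency edges:", freq_edges)
print("test bin indices:     ", test_bins.tolist())
```

Because of the outlier, the equal-width edges put nine of the ten training rows in the first bin, while the equal-frequency edges spread them more evenly; this is one reason the choice of binning method matters.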

2. Feature selection

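  • Filter variables by missingness (balance variable relevance against sample size: when a key variable has too many missing values, you can drop samples instead so that the variable is kept)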

  • Filter variables by statistical analysis

    Quoted from: Early Recognition of Burn- and Trauma-Related Acute Kidney Injury: A Pilot Comparison of Machine Learning Techniques

The Shapiro-Wilk test and histogram analysis were used to determine normality.

Continuous normally distributed variables were compared using means (standard deviation [SD]) using the 2-sample t-test, while discrete variables were compared using the non-parametric Chi-square test. Non-parametric continuous data compared using medians (interquartile range [IQR]), when appropriate, were analyzed using the Mann-Whitney U test. Categorical variables were represented by frequency (%).

Multivariate logistic regression was used to determine predictors of AKI with age and burn size serving as covariates. Repeated measures analysis of variance was used for time series data.
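Definitions of the F-test, the per-coefficient t-test, and R^2 in linear regression (note that logistic regression makes no normality assumption about the independent or dependent variables, whereas linear regression carries more distributional assumptions, e.g. normally distributed errors with constant variance):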
https://zhuanlan.zhihu.com/p/48541799?ivk_sa=1024320u
https://www.cnblogs.com/wqbin/p/11109650.html
https://zhuanlan.zhihu.com/p/176688072
https://blog.csdn.net/Noob_daniel/article/details/76087829
A p-value < 0.05 was considered statistically significant, with receiver operator characteristic (ROC) analysis also performed to compare AKI biomarker performance.
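
The test-selection logic in the quoted passage can be sketched with `scipy.stats` as below. This is an illustration on synthetic data, not the paper's code: the group means, sample sizes, contingency counts, and the helper name `compare_continuous` are all assumptions.

```python
import numpy as np
from scipy import stats

# Synthetic data (assumption): one continuous marker measured in two outcome groups.
rng = np.random.default_rng(0)
group_pos = rng.normal(loc=1.8, scale=0.4, size=40)
group_neg = rng.normal(loc=1.2, scale=0.4, size=40)


def compare_continuous(a, b, alpha=0.05):
    """Pick the 2-sample t-test if both groups pass Shapiro-Wilk, else Mann-Whitney U."""
    both_normal = stats.shapiro(a).pvalue > alpha and stats.shapiro(b).pvalue > alpha
    if both_normal:
        return "t-test", stats.ttest_ind(a, b).pvalue
    return "Mann-Whitney U", stats.mannwhitneyu(a, b, alternative="two-sided").pvalue


test_name, p_cont = compare_continuous(group_pos, group_neg)

# Discrete variable: 2x2 contingency table (rows: outcome, columns: category).
contingency = np.array([[30, 10],
                        [15, 25]])
chi2, p_cat, dof, expected = stats.chi2_contingency(contingency)

print(f"continuous: {test_name}, p = {p_cont:.3g}")
print(f"categorical: chi-square = {chi2:.3f}, dof = {dof}, p = {p_cat:.3g}")
# Variables with p < 0.05 would be kept as candidate predictors.
```

For 2x2 tables `chi2_contingency` applies Yates' continuity correction by default, so its statistic can differ slightly from an uncorrected chi-square computed by hand.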

  • Filter variables using machine learning methods

    Quoted from: Using data mining techniques for multi-diseases prediction modeling of hypertension and hyperlipidemia by common risk factors

Stage I first selects the risk factors of hypertension and hyperlipidemia using six data mining approaches: logistic regression analysis, C5.0 decision tree, Classification and Regression Tree (CART), Chi-squared Automatic Interaction Detector (CHAID), exhaustive CHAID, and discriminant analysis

  • Filter variables by correlation coefficient
  • Filter variables by the IV (information value, computed from WOE) metric
    https://blog.csdn.net/shenxiaoming77/article/details/78771698
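
A minimal sketch of ranking already-binned features by IV, assuming the common definition IV = sum over bins of (bad share - good share) * ln(bad share / good share). The toy data, the 0.5 smoothing constant, and the function name `information_value` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic toy data (assumption): y is the binary target, both features are already binned.
df = pd.DataFrame({
    "y":      [1, 1, 1, 1, 0, 0, 0, 0],
    "strong": ["a", "a", "a", "b", "b", "b", "b", "a"],
    "noise":  ["p", "q", "p", "q", "p", "q", "p", "q"],
})


def information_value(feature_bins: pd.Series, target: pd.Series, eps: float = 0.5) -> float:
    """IV of one binned feature against a 0/1 target (1 = bad, 0 = good)."""
    counts = pd.crosstab(feature_bins, target).reindex(columns=[0, 1], fill_value=0)
    # eps smooths the counts so empty cells do not give log(0).
    good = (counts[0] + eps) / (counts[0] + eps).sum()
    bad = (counts[1] + eps) / (counts[1] + eps).sum()
    woe = np.log(bad / good)
    return float(((bad - good) * woe).sum())


iv = {col: information_value(df[col], df["y"]) for col in ["strong", "noise"]}
ranking = sorted(iv, key=iv.get, reverse=True)

print(iv)
print("features ranked by IV:", ranking)
```

Here `noise` has the same good/bad mix in every bin, so its IV is zero, while `strong` separates the classes and gets a clearly positive IV; features would then be kept or dropped by comparing IV against a threshold of your choosing.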